Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback

Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman (2026)
Source: arXiv:2605.05739 (preprint, published 7 May 2026)
Tag: moi:2026-05-16 #llm #reinforcement-learning #sac #behavioral-eval #preprint

Core idea

Agentic stock-prediction systems typically make a chain of interdependent decisions (regime detection, pathway routing between modules, triggering RL control rules), but the quality of each step is hidden behind aggregate metrics such as MAPE or directional accuracy. The paper proposes a behavioral evaluation framework: log the full decision trace, group it into 5-day episodes, and score each episode on 6 dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) with an ensemble of 3 LLM judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Across 420 episodes, a perturbation test shows that when the authors deliberately degrade one dimension, that dimension's score drops by 1.6-2.4 points while the other dimensions drop by only 0.32 on average, so the judges do discriminate between dimensions. Inter-judge agreement reaches Krippendorff's α = 0.85 (high; values above 0.80 are conventionally treated as reliable).
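To make the agreement figure concrete, here is a minimal sketch of Krippendorff's α for a complete episodes × judges score matrix. It assumes the interval (squared-difference) metric and no missing ratings; the note does not say which variant the paper uses, and the function name `krippendorff_alpha_interval` and the synthetic scores are illustrative only.

```python
import numpy as np


def krippendorff_alpha_interval(ratings: np.ndarray) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    ratings has shape (n_episodes, n_judges) and must have no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n_units, n_raters = x.shape
    if n_raters < 2:
        raise ValueError("need at least two judges per episode")
    n_values = x.size

    # Observed disagreement: squared differences between judges within an episode.
    within = (x[:, :, None] - x[:, None, :]) ** 2
    d_observed = within.sum() / (n_raters - 1) / n_values

    # Expected disagreement: squared differences between all pairs of values.
    flat = x.ravel()
    overall = (flat[:, None] - flat[None, :]) ** 2
    d_expected = overall.sum() / (n_values * (n_values - 1))
    if d_expected == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return float(1.0 - d_observed / d_expected)


# Synthetic data only: 420 episodes, 3 judges, shared quality plus judge noise.
rng = np.random.default_rng(42)
true_quality = rng.uniform(3.0, 9.0, size=(420, 1))
noisy_scores = true_quality + rng.normal(0.0, 0.7, size=(420, 3))
alpha_noisy = krippendorff_alpha_interval(noisy_scores)
alpha_perfect = krippendorff_alpha_interval(np.repeat(true_quality, 3, axis=1))
print(f"alpha (noisy judges)   = {alpha_noisy:.3f}")
print(f"alpha (perfect judges) = {alpha_perfect:.3f}")
```

Identical judges give α = 1, while judges that score independently of each other give α near 0. The pairwise arrays are quadratic in the number of ratings, which is fine at this scale (1,260 values) but would need a closed-form sum for much larger logs.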

The composite behavioral score correlates with realized 20-day Sharpe at ρ = 0.72: not perfect, but strong enough to serve as a proxy training signal. The most interesting part is that the authors close the loop by converting each dimension's score deficit into a credit-assigned penalty term that is added to the Soft Actor-Critic (SAC) reward. After 3 short fine-tuning cycles (on the validation period only), results on the 2017-2025 hold-out test set are: 1-day MAPE falls from 0.61% to 0.54% (an 11.5% relative reduction, p < 0.001, Cohen's d = 0.31), directional accuracy rises from 71% to 74%, and Sharpe rises by 18% (95% bootstrap CI [8.2%, 27.4%]). Most of the improvement is concentrated in high-volatility episodes, where the original system behaved worst.
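The two headline numbers are easy to reproduce on your own logs. The sketch below assumes ρ is a rank (Spearman) correlation, which the note does not specify, and computes it as the Pearson correlation of ranks so that only pandas is needed. The helper names and the synthetic score/Sharpe data are hypothetical; only the 0.61% and 0.54% MAPE figures come from the text.

```python
import numpy as np
import pandas as pd


def rank_correlation(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return float(x.rank().corr(y.rank()))


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. -0.10 means a 10% reduction."""
    return (after - before) / before


# Synthetic stand-in for 420 episodes: Sharpe loosely follows the composite score.
rng = np.random.default_rng(7)
composite = pd.Series(rng.uniform(4.0, 9.0, size=420))
noise = pd.Series(rng.normal(0.0, 0.6, size=420))
realized_sharpe = 0.5 * (composite - 6.5) + noise

rho = rank_correlation(composite, realized_sharpe)
mape_change = relative_change(0.61, 0.54)  # MAPE figures quoted above
print(f"rank correlation on synthetic data: {rho:.2f}")
print(f"relative MAPE change: {mape_change:.1%}")
```

The relative change works out to (0.54 - 0.61) / 0.61 ≈ -11.5%, matching the reduction quoted in the paragraph above.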

This is a sample-efficient way to align an RL agent with the "common sense" of LLM judges, and it is especially suitable when the market reward is sparse and noisy. The authors are careful to stress that the results come from backtests and do not cover live-deployment effects (slippage, market impact, regime shifts after deployment).

Main trading applications

The framework can be applied at 2 levels:

Level 1, diagnostic (no fine-tuning needed): run the 3 LLM judges on the trace of an existing bot/agent and use the 6 dimensions as a weekly scorecard. This can reveal which dimension is weakening weeks before the problem shows up as a Sharpe drawdown.
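As a rough illustration of the weekly scorecard, the sketch below flags dimensions whose recent average has fallen well below the bot's own baseline. The window lengths, the z-score threshold of -2, and the function name `flag_weak_dimensions` are assumptions chosen for the example, not values from the paper, and the scores are synthetic.

```python
import numpy as np
import pandas as pd

DIMENSIONS = [
    "regime_detection",
    "routing",
    "adaptation",
    "risk_calibration",
    "strategy_coherence",
    "error_recovery",
]


def flag_weak_dimensions(
    scorecard: pd.DataFrame,
    baseline_weeks: int = 12,
    recent_weeks: int = 3,
    z_threshold: float = -2.0,
) -> pd.Series:
    """Return z-scores of dimensions whose recent mean fell below the baseline.

    scorecard: one row per week, one column per dimension (ensemble mean scores).
    """
    baseline = scorecard.iloc[:baseline_weeks]
    recent = scorecard.iloc[-recent_weeks:]
    z = (recent.mean() - baseline.mean()) / baseline.std(ddof=1)
    return z[z < z_threshold].sort_values()


# Synthetic scorecard: 16 weeks hovering around 7.5, then error_recovery degrades.
weeks = pd.date_range("2026-01-02", periods=16, freq="W-FRI")
phase = np.arange(16)[:, None] + np.arange(len(DIMENSIONS))[None, :]
scorecard = pd.DataFrame(7.5 + 0.2 * np.sin(phase), index=weeks, columns=DIMENSIONS)
scorecard.loc[weeks[-3:], "error_recovery"] -= 1.5

flagged = flag_weak_dimensions(scorecard)
print(flagged)
```

Comparing against the bot's own history rather than an absolute cut-off keeps the check usable even if a judge ensemble scores systematically high or low.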

Level 2, closed-loop fine-tuning: if the agent is RL-based (SAC, PPO, DDPG), add a penalty to the reward:

reward_t_modified = reward_market_t - λ × Σ_d max(0, threshold_d - score_d_t)

Here score_d_t is the score for dimension d in the episode containing step t (credit is assigned back to every step in that episode), threshold_d is the minimum acceptable level (the paper uses 7/10), and λ ≈ 0.3-0.5 sets the trade-off between market reward and behavior.

The six key scorecard dimensions:

Regime detection: does the agent recognize regime changes early (e.g. from trending to choppy)?
Routing: does it pick the right module/sub-strategy for the regime?
Adaptation: does it change parameters (vol target, stop, leverage) when volatility changes?
Risk calibration: is position size consistent with Kelly / vol-targeting?
Strategy coherence: does it avoid contradicting itself (e.g. going long on momentum while placing the stop below a counter-trend swing low)?
Error recovery: after a losing streak, does it "tilt" (overtrading, doubling down)?

Multi-market application

VN30F (Vietnam index futures)

Intraday VN30F bots often suffer from poor regime detection: VN30F has distinct intraday regimes in the morning (loose), at midday (empty), and in the afternoon (gradually building). The LLM judge can be prompted with:

Context: session time + spread + volume + 14-day ATR.
Trace: actions and their rationale (entry, exit, hold, position resize).
Evaluation: did the agent recognize that the current session is "lunch noise" and reduce leverage?
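The context part of such a prompt could be assembled along these lines. This is only a sketch: the session boundaries (10:30 and 13:30), the `MarketSnapshot` fields, and the function names are hypothetical and should be tuned against real VN30F session data. Timestamps are assumed to be naive and already in exchange-local time.

```python
from dataclasses import dataclass
from datetime import time

import pandas as pd

# Hypothetical boundaries in exchange-local time; tune them to your own data.
MIDDAY_START = time(10, 30)
MIDDAY_END = time(13, 30)


def label_session(ts: pd.Timestamp) -> str:
    """Map a local timestamp to a coarse intraday regime label."""
    t = ts.time()
    if t < MIDDAY_START:
        return "morning"
    if t < MIDDAY_END:
        return "midday"
    return "afternoon"


@dataclass
class MarketSnapshot:
    timestamp: pd.Timestamp  # naive, exchange-local time (assumption)
    spread_ticks: float
    volume: int
    atr_14d: float


def build_context_line(snap: MarketSnapshot) -> str:
    """Render one context line for the judge prompt."""
    stamp = snap.timestamp.strftime("%Y-%m-%d %H:%M")
    return (
        f"[{stamp}] session={label_session(snap.timestamp)} "
        f"spread={snap.spread_ticks:.1f} ticks volume={snap.volume} "
        f"ATR14={snap.atr_14d:.1f}"
    )


snapshot = MarketSnapshot(pd.Timestamp("2026-05-05 11:10"), 2.0, 140, 18.5)
context_line = build_context_line(snapshot)
print(context_line)
```

Giving the judge an explicit session label, rather than a raw clock time, makes the "lunch noise" question above answerable from the trace alone.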

Advantages on VN30F:

Volume is thin early in the session, so a 5-day episode gives enough trace for the judges to discriminate.
LLMs (especially Claude/GPT) understand Vietnamese context and HOSE tickers fairly well; the paper uses English, so we only need to translate the prompt.

Caveats:

Latency and cost: the judge ensemble costs ~$0.5-1 per episode (3 LLMs × 4k-token prompt) and cannot run in real time. Use it for weekly review + monthly fine-tuning.
Long VN30F backtest data (5+ years) only exists from 2018, so the hold-out test is limited to 2024-2026. Bootstrap the CI carefully and do not rely on MAPE alone.
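For a short hold-out window, a percentile bootstrap gives a quick sense of how wide the uncertainty on Sharpe really is. The sketch below resamples daily returns independently, which ignores autocorrelation (a block bootstrap would be the more careful choice). The 252-day annualization, the resample count, and the synthetic returns are assumptions for illustration.

```python
import numpy as np


def annualized_sharpe(daily_returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a daily return series (zero risk-free rate)."""
    r = np.asarray(daily_returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))


def bootstrap_sharpe_ci(
    daily_returns: np.ndarray,
    n_boot: int = 2000,
    level: float = 0.95,
    periods_per_year: int = 252,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the annualized Sharpe (i.i.d. resampling)."""
    r = np.asarray(daily_returns, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resampled history of the same length as the input.
    idx = rng.integers(0, len(r), size=(n_boot, len(r)))
    samples = r[idx]
    sharpes = (
        samples.mean(axis=1) / samples.std(axis=1, ddof=1) * np.sqrt(periods_per_year)
    )
    tail = (1.0 - level) / 2.0 * 100.0
    lo, hi = np.percentile(sharpes, [tail, 100.0 - tail])
    return float(lo), float(hi)


# Synthetic two-year hold-out (about 500 trading days).
rng = np.random.default_rng(1)
holdout_returns = rng.normal(0.0005, 0.01, size=500)
point = annualized_sharpe(holdout_returns)
lo, hi = bootstrap_sharpe_ci(holdout_returns)
print(f"Sharpe = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

With only about 500 daily observations the interval typically spans more than a full Sharpe point, which is why a single point estimate on a 2024-2026 hold-out should not drive a decision by itself.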

US equity futures (ES, NQ, RTY, YM, MNQ)

This is the ideal environment for the framework: abundant data, many regimes (FOMC, NFP, earnings season), and plenty of examples of the relevant behavior in the news/research the judges were trained on. Concrete applications:

Algorithmic trading on ES/NQ with a SAC controller: add the LLM penalty to the reward.
Natural episode boundary: 1 trading week (5 days × 6.5h of regular trading hours = 32.5h of data).
Evaluating the routing module: e.g. the Mesfin paper (in this same repo) shows that intraday OHLCV alone has no edge, so the agent should route to regime conditioning (VIX, MOVE, news) instead of repeating the OHLCV signal.
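Splitting a logged trace into one episode per trading week can be sketched with pandas periods. The function name and the synthetic trace below are hypothetical, and timestamps are assumed to be naive and in exchange-local time.

```python
import numpy as np
import pandas as pd


def split_into_weekly_episodes(trace: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Group a timestamp-indexed trace into one episode per trading week."""
    # "W-FRI" periods run from Saturday 00:00 through Friday 23:59.
    week_of_step = trace.index.to_period("W-FRI")
    return {str(week): steps for week, steps in trace.groupby(week_of_step)}


# Synthetic trace: 10 business days with 6 decisions per day (10:00 to 15:00).
days = pd.bdate_range("2026-05-04", periods=10)
stamps = pd.DatetimeIndex(
    [day + pd.Timedelta(hours=h) for day in days for h in range(10, 16)]
)
trace = pd.DataFrame({"market_reward": np.zeros(len(stamps))}, index=stamps)

episodes = split_into_weekly_episodes(trace)
for week, steps in episodes.items():
    print(week, len(steps))
```

Because each period starts on Saturday, a Sunday-evening futures session lands in the same episode as the Monday-to-Friday trading that follows it.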

Crypto spot (BTC, ETH, altcoins)

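Crypto trades 24/7 with no natural "session", but a 5-day episode still fits (~120 hours of data, i.e. ~120 funding cycles on venues that settle funding hourly, or 15 on the more common 8-hour schedule). Additional behavior dimensions: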

Funding awareness: does the agent reduce size when funding is extreme?
Liquidity / time-of-day awareness: is trading during the low-vol Asian session a handicap?

The LLM judge can assess whether the agent overreacts to crypto-Twitter/news, which is especially valuable when the prompt supplies context such as "BTC just broke its ATH, +5% in 1h; why is the agent still long at full size?".

Crypto perpetual futures

Similar to crypto spot, with an added liquidation calibration dimension:

Does the agent track liquidation clusters to avoid cascades?
After a liquidation event, does the agent "tilt" and oversize?
Funding flip detection: when funding goes from +0.05% to -0.02%, does the agent reduce size or reverse direction?
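The funding-flip question can also be pre-computed as a simple rule-based feature and handed to the judge alongside the trace. The sketch below is one possible check, under assumptions: funding and position are aligned series on the same index, a "reaction" means a smaller absolute size or a reversed sign within `lookahead` steps, and all names and sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def funding_flip_reactions(
    funding: pd.Series, position: pd.Series, lookahead: int = 1
) -> pd.DataFrame:
    """List funding sign flips and whether the agent cut or reversed its size."""
    sign = np.sign(funding.to_numpy())
    # A flip is a strict sign change between consecutive funding prints.
    flip_idx = np.flatnonzero(sign[1:] * sign[:-1] < 0) + 1
    rows = []
    for i in flip_idx:
        j = min(i + lookahead, len(funding) - 1)
        before = float(position.iloc[i - 1])
        after = float(position.iloc[j])
        reduced = abs(after) < abs(before)
        reversed_side = before * after < 0
        rows.append(
            {
                "flip_time": funding.index[i],
                "size_before": before,
                "size_after": after,
                "reacted": bool(reduced or reversed_side),
            }
        )
    return pd.DataFrame(
        rows, columns=["flip_time", "size_before", "size_after", "reacted"]
    )


# Synthetic example: funding in percent per interval, position in contracts.
start = pd.Timestamp("2026-05-05")
index = pd.DatetimeIndex([start + pd.Timedelta(hours=8 * k) for k in range(6)])
funding = pd.Series([0.05, 0.05, -0.02, -0.02, 0.01, 0.01], index=index)
position = pd.Series([1.0, 1.0, 1.0, 0.5, 0.5, 0.5], index=index)

reactions = funding_flip_reactions(funding, position)
print(reactions)
```

In this example the agent halves its size after the first flip but does nothing after the second, so only the first row counts as a reaction.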

The LLM judge is particularly strong at commenting on strategy coherence: e.g. the agent enters long citing "negative funding, mean-reversion" but still sets a tight trend-following stop. This is a logical inconsistency that rule-based checks struggle to detect.

General cross-market considerations

The composite score is meaningful in relative terms, not absolute: compare against your own bot's baseline over time, and do not compare absolute scores between 2 different bots.
An ensemble of 3 LLMs reduces per-model bias but triples the cost. On a limited budget, use 1 model (Claude Opus) + 2 prompt variants (zero-shot, few-shot) as a proxy ensemble.
Credit assignment back to steps within an episode: the paper uses exponential decay (steps near the end of the episode receive more credit). When applying this to other RL algorithms (PPO, A3C), remember to adjust the discount factor γ accordingly.
ρ = 0.72 with Sharpe is good enough for a fine-tuning signal but not enough to serve as the sole metric. Traditional Sharpe/Calmar/PSR are still needed for the final go/no-go.
Risk: LLM judges can "drift" between model versions (GPT-5.4 → GPT-5.5). Pin the version within a training run and log judge metadata.
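One lightweight way to pin versions and log judge metadata is to store a small record per judge call and scan a run for mixed versions. This is a sketch under assumptions: the `JudgeCall` fields are a suggested minimum, and the model identifier strings are placeholders rather than real API model names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeCall:
    judge: str          # logical judge slot, e.g. "gpt"
    model_version: str  # exact pinned identifier (placeholder strings below)
    prompt_sha256: str  # fingerprint of the full prompt text
    temperature: float
    episode_id: str
    score: float


def prompt_fingerprint(prompt: str) -> str:
    """Stable hash of the prompt so prompt edits are visible in the log."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def to_jsonl(calls: list[JudgeCall]) -> str:
    """Serialize calls as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in calls)


def find_version_drift(calls: list[JudgeCall]) -> dict[str, set[str]]:
    """Return judges that were called with more than one model version."""
    versions: dict[str, set[str]] = {}
    for call in calls:
        versions.setdefault(call.judge, set()).add(call.model_version)
    return {judge: seen for judge, seen in versions.items() if len(seen) > 1}


fp = prompt_fingerprint("Score the routing dimension on a 0-10 scale ...")
calls = [
    JudgeCall("gpt", "gpt-5.4", fp, 0.0, "ep-001", 7.5),
    JudgeCall("gpt", "gpt-5.5", fp, 0.0, "ep-002", 6.9),
    JudgeCall("claude", "claude-4.6-opus", fp, 0.0, "ep-001", 7.2),
]
log_text = to_jsonl(calls)
drift = find_version_drift(calls)
print(drift)
```

Running the drift check before each fine-tuning cycle makes it obvious when scores from two judge versions have been mixed into the same reward signal.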

Python illustration

The code below illustrates a simple pipeline: run 3 mock LLM judges (replace with real API calls) on an agent trace, compute the composite score, then add the penalty to the per-step rewards.

python

# LLM-judge behavioral evaluation for a trading agent
# After Al Ridhawi, Haj Ali, Al Osman (2026), arXiv:2605.05739
# Requires: numpy, pandas, anthropic/openai (when using real judges)

import json
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import Callable


DIMENSIONS = [
    "regime_detection",
    "routing",
    "adaptation",
    "risk_calibration",
    "strategy_coherence",
    "error_recovery",
]


@dataclass
class TraceStep:
    """One step in an episode: state observation, action, market reward."""

    timestamp: pd.Timestamp
    state_summary: str
    action: str
    reasoning: str
    market_reward: float


def build_judge_prompt(episode: list[TraceStep], dim: str) -> str:
    """Build the prompt asking an LLM judge to score one dimension."""
    trace_text = "\n".join(
        f"[{s.timestamp}] state={s.state_summary} action={s.action} why={s.reasoning} r={s.market_reward:+.4f}"
        for s in episode
    )
    return f"""You are an expert evaluator of trading agents. Score the "{dim}" dimension on a 0-10 scale
based on the trace below (one 5-day episode). Return JSON: {{"score": float, "reason": "..."}}.

{trace_text}
"""


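def mock_llm_call(prompt: str, judge_name: str) -> dict:
    """
    PLACEHOLDER: replace with a real client call such as
    anthropic.Anthropic().messages.create or openai.OpenAI().chat.completions.create.
    The mock returns a pseudo-random score in [5, 9) so the demo runs end to end.
    """
    # Seed from the UTF-8 bytes rather than hash(): str hashes are salted per
    # process, so hash() would produce different scores on every run.
    seed = list((prompt + judge_name).encode("utf-8"))
    rng = np.random.default_rng(seed)
    return {"score": float(rng.uniform(5.0, 9.0)), "reason": f"mock-{judge_name}"}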


def evaluate_episode(
    episode: list[TraceStep], judges: list[str]
) -> pd.DataFrame:
    """Return a [dimension × judge] DataFrame of per-dimension scores from each judge."""
    scores = {dim: {} for dim in DIMENSIONS}
    for dim in DIMENSIONS:
        prompt = build_judge_prompt(episode, dim)
        for j in judges:
            resp = mock_llm_call(prompt, j)
            scores[dim][j] = resp["score"]
    return pd.DataFrame(scores).T  # rows=dim, cols=judge


def composite_score(scores_df: pd.DataFrame, weights: dict | None = None) -> float:
    """Composite = weighted mean over dimensions of the judge-ensemble mean."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}
    per_dim = scores_df.mean(axis=1)  # ensemble across judges
    w = np.array([weights[d] for d in per_dim.index])
    return float((per_dim.values * w).sum() / w.sum())


def behavior_penalty(
    scores_df: pd.DataFrame,
    threshold: float = 7.0,
    lambda_pen: float = 0.4,
) -> float:
    """
    Negative penalty to add to the rewards of the steps in the episode:
        penalty = - λ × Σ_d max(0, threshold - score_d)
    """
    per_dim = scores_df.mean(axis=1)
    deficits = np.maximum(0.0, threshold - per_dim.values)
    return -float(lambda_pen * deficits.sum())


def apply_credit_assignment(
    rewards: pd.Series, penalty: float, decay: float = 0.95
) -> pd.Series:
    """
    Spread the penalty back over the steps of the episode with exponential decay
    (steps near the end of the episode are penalized more, since that is
    usually where the most important decisions sit).
    """
    n = len(rewards)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])
    weights /= weights.sum()
    return rewards + penalty * weights


def episode_to_modified_rewards(
    episode: list[TraceStep],
    judges: list[str] | None = None,
    threshold: float = 7.0,
    lambda_pen: float = 0.4,
) -> pd.Series:
    """End-to-end: score the episode, compute the penalty, apply credit assignment."""
    if judges is None:
        judges = ["GPT-5.4", "Claude-4.6", "Gemini-3.1"]
    scores = evaluate_episode(episode, judges)
    pen = behavior_penalty(scores, threshold=threshold, lambda_pen=lambda_pen)
    raw = pd.Series(
        [s.market_reward for s in episode],
        index=[s.timestamp for s in episode],
    )
    return apply_credit_assignment(raw, penalty=pen)


if __name__ == "__main__":
    # Generate 1 synthetic episode: 5 days × 6 decisions/day = 30 steps
    rng = np.random.default_rng(0)
    base = pd.Timestamp("2026-05-05", tz="UTC")
    episode = [
        TraceStep(
            timestamp=base + pd.Timedelta(hours=4 * i),
            state_summary=f"vol_5d={rng.uniform(0.5,3.0):.2f}% regime={'trend' if i%3 else 'chop'}",
            # Use the seeded generator (not the global np.random) so the demo is reproducible.
            action=str(rng.choice(["LONG", "SHORT", "FLAT", "ADD", "REDUCE"])),
            reasoning="momentum>2σ" if i % 2 else "mean_revert<-1σ",
            market_reward=float(rng.normal(0.0001, 0.005)),
        )
        for i in range(30)
    ]
    scores = evaluate_episode(episode, judges=["GPT-5.4", "Claude-4.6", "Gemini-3.1"])
    print("Per-dimension ensemble scores:")
    print(scores.mean(axis=1))
    print(f"\nComposite score: {composite_score(scores):.2f}")
    modified = episode_to_modified_rewards(episode)
    print(f"\nTotal raw reward: {sum(s.market_reward for s in episode):+.4f}")
    print(f"Total modified reward (after behavior penalty): {modified.sum():+.4f}")
