RL Cryptocurrency Portfolio Management with SAC & DDPG

Source: arXiv:2511.20678 (preprint, published 16 November 2025) Tag: moi:2026-05-16 #reinforcement-learning #crypto #portfolio #sac #ddpg #preprint

Core idea

The paper applies deep reinforcement learning (DRL) to dynamic portfolio management over four major cryptocurrencies: BTC, ETH, LTC, and DOGE. Two RL algorithms are benchmarked:

DDPG (Deep Deterministic Policy Gradient): a deterministic actor-critic method. It suits continuous action spaces (portfolio weights are continuous) but tends to be unstable in noisy environments such as crypto.
SAC (Soft Actor-Critic): a stochastic, entropy-regularized actor-critic method, which encourages exploration and improves training stability. Since its introduction in 2018 it has been a widely used default for continuous-control RL tasks.
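
For reference, the entropy-regularized objective that SAC maximizes can be written in its standard textbook form. The notation below is the usual one from the general SAC literature and is not taken from this paper:

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$

Here $\alpha$ is the entropy temperature: a larger $\alpha$ rewards more random policies, while $\alpha \to 0$ recovers the ordinary expected-return objective.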

Baseline: Markowitz mean-variance optimization (the classical 1952 method), representing the non-learning approach.

Results:

SAC comes out ahead on most metrics: Sharpe ratio, Sortino ratio, maximum drawdown, VaR, CVaR, and cumulative portfolio value.
DDPG performs acceptably but is less stable than SAC, especially during high-volatility periods.
The Markowitz baseline trails both by a wide margin, further evidence that mean-variance optimization is a poor fit for non-stationary, fat-tailed assets such as crypto.
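
The metrics listed above can be computed from a return series roughly as follows. This is a minimal sketch, not the paper's evaluation code: the function name `risk_metrics`, the zero risk-free rate, the 365-day annualization, and the use of historical (empirical-quantile) VaR/CVaR are all assumptions made for illustration.

```python
import numpy as np

def risk_metrics(returns, periods_per_year=365, alpha=0.05):
    """Risk metrics from a 1-D sequence of per-period simple returns.

    Assumptions (not from the paper): zero risk-free rate, historical
    VaR/CVaR, and 365 trading periods per year.
    """
    r = np.asarray(returns, dtype=float)
    ann = np.sqrt(periods_per_year)

    # Equity curve starting from 1.0
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])

    # Sharpe: mean return over total volatility, annualized
    sharpe = r.mean() / (r.std(ddof=1) + 1e-12) * ann

    # Sortino: mean return over downside deviation (target return 0)
    downside = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2))
    sortino = r.mean() / (downside + 1e-12) * ann

    # Maximum drawdown: worst peak-to-trough decline (a negative number)
    peak = np.maximum.accumulate(equity)
    max_drawdown = ((equity - peak) / peak).min()

    # Historical VaR and CVaR at level alpha, reported as positive losses
    q = np.quantile(r, alpha)
    var = -q
    cvar = -r[r <= q].mean()

    return {
        "sharpe": float(sharpe),
        "sortino": float(sortino),
        "max_drawdown": float(max_drawdown),
        "var": float(var),
        "cvar": float(cvar),
        "final_value": float(equity[-1]),
    }
```

CVaR averages the losses at or beyond the VaR quantile, so it is never smaller than VaR at the same level.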

Agent structure:

State: historical price returns, technical indicators (RSI, MACD, Bollinger Bands), volatility estimates, and recent portfolio weights.
Action: a continuous vector on the simplex (four weights summing to 1, no shorting).
Reward: a Sharpe-like reward with a drawdown penalty.
Training: rolling out-of-sample validation on multi-cycle data from 2017-2024.
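
As a rough illustration of the state inputs listed above, the per-coin features could be built with pandas as below. The helper `build_features`, the window lengths (14, 20, 12/26), and the simple-moving-average RSI variant are conventional choices assumed here; the paper's exact feature definitions may differ.

```python
import numpy as np
import pandas as pd

def build_features(close: pd.Series, rsi_window: int = 14,
                   bb_window: int = 20) -> pd.DataFrame:
    """Per-coin state features from a close-price series (hypothetical helper)."""
    ret = close.pct_change()

    # RSI, simple-moving-average variant (Wilder smoothing is the other common one)
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    rsi = 100 - 100 / (1 + gain / (loss + 1e-12))

    # MACD line: fast EMA minus slow EMA
    macd = (close.ewm(span=12, adjust=False).mean()
            - close.ewm(span=26, adjust=False).mean())

    # Bollinger %B: position of the price inside the +/- 2 sigma band
    mid = close.rolling(bb_window).mean()
    sd = close.rolling(bb_window).std()
    bb_pct_b = (close - (mid - 2 * sd)) / (4 * sd + 1e-12)

    # Rolling volatility estimate of returns
    vol = ret.rolling(bb_window).std()

    feats = pd.DataFrame({"ret": ret, "rsi": rsi, "macd": macd,
                          "bb_pct_b": bb_pct_b, "vol": vol})
    # Drop the warm-up rows where rolling windows are not yet full
    return feats.dropna()
```

Each row uses data up to and including that bar's close, so when the features feed a decision whose payoff is the same bar's return, they should be lagged by one bar to avoid look-ahead.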

The main contribution is not "SAC > DDPG" (already known from the general RL literature) but the confirmation of that result in the crypto portfolio context, together with a concrete benchmark that can be replicated.

Main trading applications

In practice, for retail/prop traders:

Use SAC as the default baseline for RL portfolio tasks: there is little reason to start from vanilla DDPG when SAC has proven better on the same task.
Reward function design matters more than the choice of algorithm: the paper uses a Sharpe-like reward with a drawdown penalty, a widely used risk-adjusted design. Avoid raw P&L as the reward, because the agent then tends to learn high-volatility, high-leverage strategies.
Be rigorous about out-of-sample testing: the paper trains on 2017-2022 and tests on 2023-2024. When replicating, make sure the held-out set covers the post-FTX period (2022-Q4 onward), which is the harshest test.
Universe size matters: four coins may not diversify enough. Production setups typically use the top 10-30 coins. A larger action space makes training substantially harder and slower, however, so techniques such as action masking or hierarchical RL become necessary.
Live deployment caveat: the paper's benchmark is a backtest. Live trading adds latency, partial fills, and exchange downtime, so build in a margin of error.
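
The out-of-sample discipline described above can be sketched as a walk-forward splitter in which every test window lies strictly after its training window. The function name and parameters are hypothetical, and the window sizes are left to the user (for example, to keep 2022-Q4 onward out of every training window).

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_range, test_range) index pairs for walk-forward validation.

    The test window always starts right after the training window ends,
    so no future data leaks into training. `step` defaults to `test_size`,
    which makes consecutive test windows non-overlapping.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Usage: 10 bars, train on 4, test on the next 2, then roll forward by 2
for train, test in walk_forward_splits(10, train_size=4, test_size=2):
    print(list(train), "->", list(test))
```

Retraining the agent on each training window and concatenating the test-window results gives a single out-of-sample equity curve.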

Multi-market application

Crypto spot (BTC, ETH, top altcoins)

This is the paper's main setting. Implementation recommendations:
- Universe: top 10-20 coins by market cap, refreshed monthly.
- Rebalance frequency: daily. Rebalancing more often than the paper does becomes overhead-heavy once on-chain gas and fees are counted.
- Realistic costs: 1-5 bps per rebalance for top coins on Binance, possibly 10-30 bps for altcoins.
- SAC's entropy temperature alpha needs careful tuning: too high and the policy is mostly noise, too low and it exploits a suboptimal local optimum.
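
To make the cost figures above concrete, a per-asset cost model can be applied to the weight traded at each rebalance. This is a simplified sketch: `rebalance_cost` is a hypothetical helper, and it ignores slippage and the weight drift between rebalances.

```python
import numpy as np

def rebalance_cost(old_weights, new_weights, cost_bps):
    """Rebalance cost as a fraction of portfolio value.

    cost_bps: per-asset cost in basis points, charged on the absolute
    weight traded in that asset (1 bp = 0.0001).
    """
    old_w = np.asarray(old_weights, dtype=float)
    new_w = np.asarray(new_weights, dtype=float)
    bps = np.asarray(cost_bps, dtype=float)
    traded = np.abs(new_w - old_w)
    return float(np.sum(traded * bps) / 1e4)


# Usage: move from two majors to an equal-weight mix with two altcoins
cost = rebalance_cost([0.5, 0.5, 0.0, 0.0],
                      [0.25, 0.25, 0.25, 0.25],
                      cost_bps=[5, 5, 30, 30])
print(f"{cost:.5f}")  # fraction of portfolio value paid on this rebalance
```

With daily rebalancing, even a small per-rebalance cost compounds over a year, which is why the cost term belongs inside the training environment rather than only in the final backtest.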

Crypto perpetual futures

A natural extension: universe = BTC perp, ETH perp, SOL perp, and so on. The action space expands to allow both long and short positions.
The funding rate is a strong feature that the paper does not include. When building your own agent, always add the funding rate to the state.
Liquidation risk: this needs a custom reward, namely a terminal penalty for liquidation (e.g. a reward of -100).
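
The three points above (long/short actions, funding, and a terminal liquidation penalty) can be combined in one environment step. This is a deliberately simplified sketch and not something the paper implements: `perp_step`, the tanh squashing, the leverage cap, and the cross-margin liquidation rule are all assumptions for illustration, and real exchange margin rules are more involved.

```python
import numpy as np

def perp_step(action, price_ret, funding_rate, equity,
              max_leverage=2.0, maint_margin=0.05, liq_penalty=-100.0):
    """One step of a long/short perp portfolio (hypothetical, simplified)."""
    # Squash raw actions to signed weights in (-1, 1): >0 long, <0 short
    w = np.tanh(np.asarray(action, dtype=float))

    # Cap gross exposure (sum of absolute weights) at max_leverage
    gross = np.abs(w).sum()
    if gross > max_leverage:
        w = w * (max_leverage / gross)
        gross = max_leverage

    # Price P&L minus funding: with positive funding, longs pay and shorts receive
    pnl = float(np.sum(w * np.asarray(price_ret, dtype=float))
                - np.sum(w * np.asarray(funding_rate, dtype=float)))
    new_equity = equity * (1.0 + pnl)

    # Liquidate when equity falls to the maintenance margin on the notional held
    liquidated = bool(new_equity <= maint_margin * gross * equity)

    reward = liq_penalty if liquidated else pnl
    return {"reward": reward, "equity": new_equity, "done": liquidated}
```

Ending the episode on liquidation with a large negative reward teaches the agent to keep distance from the margin threshold instead of treating a wipe-out as just another bad step.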

US equity futures (ES, NQ, RTY)

This can be adapted, with a universe of 4-6 equity index futures. But:
- Crypto and equity volatility differ sharply, so the reward function needs re-calibration.
- Cross-asset correlation among US index futures is high (60-80%), so the diversification benefit is smaller than in the 4-coin crypto universe (~50-70%).
Recommendation: extend the universe to bond futures (ZN, ZB) and FX futures (6E, 6J) to increase dispersion.
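
One quick way to check the correlation argument above on your own data is the average pairwise correlation of the candidate universe. The helper below is a hypothetical sketch; the input is assumed to be a (T, N) array of per-period returns.

```python
import numpy as np

def mean_pairwise_corr(returns):
    """Average off-diagonal correlation of a (T, N) return matrix."""
    corr = np.corrcoef(np.asarray(returns, dtype=float), rowvar=False)
    n = corr.shape[0]
    # Exclude the diagonal, which is always 1
    off_diag = corr[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())
```

Comparing this number before and after adding bond and FX futures shows whether the wider universe actually lowers the average correlation.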

VN30F (index futures contract)

VN30F is a single asset, so the portfolio task does not apply directly.
Indirect use: have the RL agent dynamically size positions across VN30F and its constituent stocks (HPG, VHM, VPB, ...). That is a 31-asset portfolio (1 futures contract + 30 stocks).
Data limitation: only 8 years of history, so transfer learning from a US or crypto agent is recommended.

General cross-market considerations

The larger the action space, the harder the training: the paper's four coins are a minimal case. With 20+ coins, more advanced techniques are needed (PCA-based action compression, hierarchical RL).
Reward shaping, not algorithm choice, is where the real edge comes from.
Distribution shift between the training period and live trading: monitor portfolio behavior daily against expectations.
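
One possible reading of "PCA-based action compression" is sketched below: the agent outputs a low-dimensional latent action, which is expanded to per-asset logits through the top principal directions of historical returns and then projected onto the simplex. This is an interpretation for illustration only. The paper does not describe this technique, and `fit_action_basis` and `decode_action` are hypothetical names.

```python
import numpy as np

def fit_action_basis(returns, k):
    """Top-k principal directions of a (T, N) return matrix, shape (k, N)."""
    x = np.asarray(returns, dtype=float)
    x = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:k]

def decode_action(latent, basis):
    """Map a k-dim latent action to N long-only weights on the simplex."""
    logits = np.asarray(latent, dtype=float) @ basis
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()


# Usage: a 3-dim action controls a 20-coin portfolio
rng = np.random.default_rng(0)
basis = fit_action_basis(rng.normal(0, 0.03, size=(200, 20)), k=3)
weights = decode_action(rng.normal(size=3), basis)
```

The policy network then only has to learn a k-dimensional output, at the price of restricting the portfolios it can express to those spanned by the chosen directions.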

Python illustration

python

import numpy as np
import pandas as pd

class CryptoPortfolioEnv:
    """
    Continuous-action portfolio env over N coins.
    Inspired by the setup in paper Hoque-arXiv 2511.20678.
    """

    def __init__(self, returns_df: pd.DataFrame,
                 features_df: pd.DataFrame,
                 transaction_cost: float = 0.001,
                 dd_penalty: float = 5.0,
                 lookback: int = 30):
        """
        returns_df: shape (T, N), daily returns of N coins.
        features_df: shape (T, F), features (technical indicators, regime).
        """
        self.rets = returns_df.values
        self.features = features_df.values
        self.tc = transaction_cost
        self.dd_pen = dd_penalty
        self.lookback = lookback
        self.N = returns_df.shape[1]
        self.reset()

    def reset(self):
        self.t = self.lookback
        self.weights = np.ones(self.N) / self.N  # equal weight start
        self.equity = 1.0
        self.peak_equity = 1.0
        self.returns_history = []
        return self._get_state()

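    def _get_state(self) -> np.ndarray:
        # Use only information available before bar t: returns up to t-1 and
        # features from row t-1. This assumes features_df row t is computed
        # from data through bar t, so row t would leak the return earned in step().
        recent_rets_flat = self.rets[self.t - self.lookback:self.t].flatten()
        current_features = self.features[self.t - 1]
        return np.concatenate([recent_rets_flat, current_features, self.weights])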

    def step(self, action: np.ndarray):
        # Project action onto simplex (softmax); subtract the max so that
        # large action values do not overflow np.exp
        z = np.exp(action - np.max(action))
        target_weights = z / z.sum()

        # Transaction cost on rebalance
        cost = self.tc * np.abs(target_weights - self.weights).sum()

        # Portfolio return next bar
        port_ret = (target_weights * self.rets[self.t]).sum() - cost
        self.equity *= (1 + port_ret)
        self.peak_equity = max(self.peak_equity, self.equity)
        self.returns_history.append(port_ret)
        self.weights = target_weights

        # Reward: Sharpe + DD penalty
        if len(self.returns_history) >= 20:
            recent = np.array(self.returns_history[-20:])
            sharpe = recent.mean() / (recent.std() + 1e-8)
        else:
            sharpe = port_ret
        dd = (self.equity - self.peak_equity) / self.peak_equity
        reward = sharpe + self.dd_pen * dd

        self.t += 1
        done = (self.t >= len(self.rets) - 1)
        return self._get_state(), reward, done, {'equity': self.equity}


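# Train with SAC (stable-baselines3).
# Note: SB3 expects a gymnasium.Env with observation_space and action_space
# defined, so CryptoPortfolioEnv must be subclassed or wrapped accordingly first.
# from stable_baselines3 import SAC
# env = CryptoPortfolioEnv(returns_df, features_df)
# model = SAC('MlpPolicy', env, learning_rate=3e-4,
#             ent_coef='auto', verbose=1)
# model.learn(total_timesteps=500_000)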
