Exploring Cutting-Edge DRL Algorithms for Quantitative Finance

cover
8 Jun 2024

Authors:

(1) Xiao-Yang Liu, Hongyang Yang, Columbia University (xl2427,hy2500@columbia.edu);

(2) Jiechao Gao, University of Virginia (jg5ycn@virginia.edu);

(3) Christina Dan Wang (Corresponding Author), New York University Shanghai (christina.wang@nyu.edu).

Abstract and 1 Introduction

2 Related Works and 2.1 Deep Reinforcement Learning Algorithms

2.2 Deep Reinforcement Learning Libraries and 2.3 Deep Reinforcement Learning in Finance

3 The Proposed FinRL Framework and 3.1 Overview of FinRL Framework

3.2 Application Layer

3.3 Agent Layer

3.4 Environment Layer

3.5 Training-Testing-Trading Pipeline

4 Hands-on Tutorials and Benchmark Performance and 4.1 Backtesting Module

4.2 Baseline Strategies and Trading Metrics

4.3 Hands-on Tutorials

4.4 Use Case I: Stock Trading

4.5 Use Case II: Portfolio Allocation and 4.6 Use Case III: Cryptocurrencies Trading

5 Ecosystem of FinRL and Conclusions, and References

We review the state-of-the-art DRL algorithms, relevant opensource libraries, and applications of DRL in quantitative finance.

2.1 Deep Reinforcement Learning Algorithms

Many DRL algorithms have been developed. They fall into three categories: value based, policy based, and actor-critic based.

A value based algorithm estimates a state-action value function that guides the optimal policy. Q-learning [49] approximates a Qvalue (expected return) by iteratively updating a Q-table, which works for problems with small discrete state spaces and action spaces. Researchers proposed to utilize deep neural networks for approximating Q-value functions, e.g., deep Q-network (DQN) and its variants double DQN and dueling DQN [1].

A policy based algorithm directly updates the parameters of a policy through policy gradient [45]. Instead of value estimation, policy gradient uses a neural network to model the policy directly, whose input is a state and output is a probability distribution according to which the agent takes an action at the input state.

An actor-critic based algorithm combines the advantages of value based and policy based algorithms. It updates two neural networks, namely, an actor network updates the policy (probability distribution) while a critic network estimates the state-action value function. During the training process, the actor network takes actions and the critic network evaluates those actions. The state-of-art actor-critic based algorithms are deep deterministic policy gradient (DDPG), proximal policy optimization (PPO), asynchronous advantage actor critic (A3C), advantage actor critic (A2C), soft actor-critic (SAC), multi-agent DDPG, and twin-delayed DDPG (TD3) [1].

This paper is available on arxiv under CC BY 4.0 DEED license.