This section is supplementary content to the Mushroom Book (Easy RL): it collects, summarizes, and interprets classic papers in reinforcement learning. The papers mainly cover DQN variants, policy gradient methods, imitation learning, distributed reinforcement learning, multi-task reinforcement learning, exploration strategies, hierarchical reinforcement learning, and other techniques. Video walkthroughs (in collaboration with WhalePaper) will follow and will be released gradually on Datawhale's Bilibili and WeChat official accounts.
About five papers are added each week; stay tuned.
If you have trouble reading the Markdown files online (e.g., formulas fail to render or images load slowly), please download them and read locally, or read the file of the same name in the PDF folder.
When sharing, please include a link and credit the Easy RL project.
| Category | Paper Title | Paper Link | Video Explanation |
| --- | --- | --- | --- |
| Value-based | Playing Atari with Deep Reinforcement Learning (DQN) [Markdown] [PDF] | https://arxiv.org/abs/1312.5602 | |
| | Deep Recurrent Q-Learning for Partially Observable MDPs (DRQN) [Markdown] [PDF] | https://arxiv.org/abs/1507.06527 | |
| | Dueling Network Architectures for Deep Reinforcement Learning (Dueling DQN) [Markdown] [PDF] | https://arxiv.org/abs/1511.06581 | |
| | Deep Reinforcement Learning with Double Q-learning (Double DQN) [Markdown] [PDF] | https://arxiv.org/abs/1509.06461 | |
| | NoisyDQN | https://arxiv.org/pdf/1706.10295.pdf | |
| | QRDQN | https://arxiv.org/pdf/1710.10044.pdf | |
| | CQL | https://arxiv.org/pdf/2006.04779.pdf | |
| | Prioritized Experience Replay (PER) [Markdown] [PDF] | https://arxiv.org/abs/1511.05952 | |
| | Rainbow: Combining Improvements in Deep Reinforcement Learning (Rainbow) [Markdown] [PDF] | https://arxiv.org/abs/1710.02298 | |
| | A Distributional Perspective on Reinforcement Learning (C51) [Markdown] [PDF] | https://arxiv.org/abs/1707.06887 | |
| Policy-based | Asynchronous Methods for Deep Reinforcement Learning (A3C) [Markdown] [PDF] | https://arxiv.org/abs/1602.01783 | |
| | Trust Region Policy Optimization (TRPO) [Markdown] [PDF] | https://arxiv.org/abs/1502.05477 | |
| | High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE) [Markdown] [PDF] | https://arxiv.org/abs/1506.02438 | |
| | Proximal Policy Optimization Algorithms (PPO) [Markdown] [PDF] | https://arxiv.org/abs/1707.06347 | |
| | Emergence of Locomotion Behaviours in Rich Environments (PPO-Penalty) [Markdown] [PDF] | https://arxiv.org/abs/1707.02286 | |
| | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) [Markdown] [PDF] | https://arxiv.org/abs/1708.05144 | |
| | Sample Efficient Actor-Critic with Experience Replay (ACER) | https://arxiv.org/abs/1611.01224 | |
| | Deterministic Policy Gradient Algorithms (DPG) [Markdown] [PDF] | http://proceedings.mlr.press/v32/silver14.pdf | |
| | Continuous Control With Deep Reinforcement Learning (DDPG) | https://arxiv.org/abs/1509.02971 | |
| | Addressing Function Approximation Error in Actor-Critic Methods (TD3) [Markdown] [PDF] | https://arxiv.org/abs/1802.09477 | |
| | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (Q-Prop) | https://arxiv.org/abs/1611.02247 | |
| | Action-dependent Control Variates for Policy Optimization via Stein's Identity (Stein Control Variates) [Markdown] [PDF] | https://arxiv.org/abs/1710.11198 | |
| | The Mirage of Action-Dependent Baselines in Reinforcement Learning [Markdown] [PDF] | https://arxiv.org/abs/1802.10031 | |
| | Bridging the Gap Between Value and Policy Based Reinforcement Learning (PCL) [Markdown] [PDF] | https://arxiv.org/abs/1702.08892 | |
| MaxEntropy RL | Soft Q-Learning | https://arxiv.org/abs/1702.08165 | |
| | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (SAC) [Markdown] [PDF] | https://arxiv.org/abs/1801.01290 | |
| Multi-Agent | IQL | https://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf | |
| | VDN | https://arxiv.org/abs/1706.05296 | |
| | QTRAN | http://proceedings.mlr.press/v97/son19a/son19a.pdf | |
| | QMIX | https://arxiv.org/abs/1803.11485 | |
| | Weighted QMIX | https://arxiv.org/abs/2006.10800 | |
| | COMA | https://ojs.aaai.org/index.php/AAAI/article/download/11794/11653 | |
| | MAPPO | https://arxiv.org/abs/2103.01955 | |
| | MADDPG | https://arxiv.org/abs/1706.02275 | |
| Sparse reward | Hierarchical DQN | https://arxiv.org/abs/1604.06057 | |
| | ICM | https://arxiv.org/pdf/1705.05363.pdf | |
| | HER | https://arxiv.org/pdf/1707.01495.pdf | |
| Imitation Learning | GAIL | https://arxiv.org/abs/1606.03476 | |
| | TD3+BC | https://arxiv.org/pdf/2106.06860.pdf | |
| Model-based | Dyna-Q | https://arxiv.org/abs/1801.06176 | |