# OpenAI-Spinning-Up-Papers-Notes

**Repository Path**: qiao54007/open-ai-spinning-up-papers-notes

## Basic Information

- **Project Name**: OpenAI-Spinning-Up-Papers-Notes
- **Description**: A repository for close reading of reinforcement learning papers. Based on the OpenAI Spinning Up key-papers list, it provides paper analyses, notes, and code implementations to help build a deep understanding of core RL algorithms and their practical applications.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2025-08-04
- **Last Updated**: 2025-08-27

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# A Reading List for Key Papers in RL

This repository organizes the key reinforcement learning papers recommended by OpenAI Spinning Up, giving RL learners and researchers a structured reading roadmap.

With this repository you can:

- **Track reading progress**: mark how far you have gotten with each paper in the `Status` column.
- **Record and accumulate your thinking**: create a standalone notes file for each paper and link it from the main list.
- **Get a quick overview**: see at a glance the core papers and contributions in each subfield.

## Table of Contents

- [Model-Free RL](#model-free-rl)
- [Exploration](#exploration)
- [Transfer and Multitask RL](#transfer-and-multitask-rl)
- [Hierarchy](#hierarchy)
- [Memory](#memory)
- [Model-Based RL](#model-based-rl)
- [Meta-RL](#meta-rl)
- [Scaling RL](#scaling-rl)
- [RL in the Real World](#rl-in-the-real-world)
- [Safety](#safety)
- [Imitation Learning and Inverse Reinforcement Learning](#imitation-learning-and-inverse-reinforcement-learning)
- [Reproducibility, Analysis, and Critique](#reproducibility-analysis-and-critique)
- [Bonus: Classic Papers in RL Theory or Review](#bonus-classic-papers-in-rl-theory-or-review)

## How to Use

1. **Fork this repository** to your own Gitee account.
2. **Clone the repository** to your local machine.
3. When you start reading a paper, change its **Status** in `README.md` from `⬜️ Unread` to `⏳ Reading`.
4. In the `notes/` folder, copy `template.md`, rename it to `[number]-[keyword].md` (e.g. `001-dqn.md`), and start taking notes. A small helper script is sketched below.
5. Once you have finished the paper and your notes, change the **Status** to `✅ Read`.
6. `commit` and `push` your changes.
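The snippet below is a minimal helper sketch for step 4; it is not part of the repository, and the file name `new_note.py` and the helper functions are only illustrative. It assumes the template lives at `notes/template.md` and that the status columns use the emoji markers shown in the tables: it copies the template into a numbered notes file and tallies the current reading progress from `README.md`.

```python
#!/usr/bin/env python3
"""Create a numbered note from the template and report reading progress.

A minimal sketch of step 4 above, not part of the repository. It assumes
the template lives at notes/template.md and that the status columns use
the emoji markers shown in the tables below.
"""
import shutil
import sys
from pathlib import Path

NOTES_DIR = Path("notes")
TEMPLATE = NOTES_DIR / "template.md"  # assumed template location
README = Path("README.md")

STATUS_MARKS = (("✅", "read"), ("⏳", "reading"), ("⬜️", "unread"))


def new_note(number: int, keyword: str) -> Path:
    """Copy the template to notes/[number]-[keyword].md, e.g. notes/001-dqn.md."""
    target = NOTES_DIR / f"{number:03d}-{keyword}.md"
    if target.exists():
        sys.exit(f"{target} already exists")
    shutil.copyfile(TEMPLATE, target)
    return target


def progress() -> str:
    """Tally status markers, looking only at table rows (lines starting with '|')."""
    rows = "\n".join(line for line in README.read_text(encoding="utf-8").splitlines()
                     if line.lstrip().startswith("|"))
    return ", ".join(f"{rows.count(mark)} {label}" for mark, label in STATUS_MARKS)


if __name__ == "__main__":
    # Usage: python new_note.py 3 dueling-dqn
    if len(sys.argv) == 3:
        print(f"Created {new_note(int(sys.argv[1]), sys.argv[2])}")
    print(f"Progress: {progress()}")
```

For example, `python new_note.py 3 dueling-dqn` would create `notes/003-dueling-dqn.md` from the template and print the current tally.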
---

## Model-Free RL

### a. Deep Q-Learning

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 1 | [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) | Mnih et al, 2013 | DQN | ✅ Read | [Notes](./notes/001-dqn.md) |
| 2 | [Deep Recurrent Q-Learning for Partially Observable MDPs](https://arxiv.org/abs/1507.06527) | Hausknecht & Stone, 2015 | DRQN | ✅ Read | [Notes](./notes/002-drqn.md) |
| 3 | [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581) | Wang et al, 2015 | Dueling DQN | ⬜️ Unread | [Notes](./notes/003-dueling-dqn.md) |
| 4 | [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461) | Hasselt et al, 2015 | Double DQN | ⬜️ Unread | [Notes](./notes/004-double-dqn.md) |
| 5 | [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952) | Schaul et al, 2015 | PER | ⬜️ Unread | [Notes](./notes/005-per.md) |
| 6 | [Rainbow: Combining Improvements in Deep Reinforcement Learning](https://arxiv.org/abs/1710.02298) | Hessel et al, 2017 | Rainbow DQN | ⬜️ Unread | [Notes](./notes/006-rainbow.md) |

### b. Policy Gradients

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 7 | [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783) | Mnih et al, 2016 | A3C | ⬜️ Unread | [Notes](./notes/007-a3c.md) |
| 8 | [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477) | Schulman et al, 2015 | TRPO | ⬜️ Unread | [Notes](./notes/008-trpo.md) |
| 9 | [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438) | Schulman et al, 2015 | GAE | ⬜️ Unread | [Notes](./notes/009-gae.md) |
| 10 | [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347) | Schulman et al, 2017 | PPO | ⬜️ Unread | [Notes](./notes/010-ppo.md) |
| 11 | [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/abs/1707.02286) | Heess et al, 2017 | PPO-Penalty | ⬜️ Unread | [Notes](./notes/011-ppo-locomotion.md) |
| 12 | [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation](https://arxiv.org/abs/1708.05144) | Wu et al, 2017 | ACKTR | ⬜️ Unread | [Notes](./notes/012-acktr.md) |
| 13 | [Sample Efficient Actor-Critic with Experience Replay](https://arxiv.org/abs/1611.01224) | Wang et al, 2016 | ACER | ⬜️ Unread | [Notes](./notes/013-acer.md) |
| 14 | [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290) | Haarnoja et al, 2018 | SAC | ⬜️ Unread | [Notes](./notes/014-sac.md) |

### c. Deterministic Policy Gradients

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 15 | [Deterministic Policy Gradient Algorithms](http://proceedings.mlr.press/v32/silver14.pdf) | Silver et al, 2014 | DPG | ⬜️ Unread | [Notes](./notes/015-dpg.md) |
| 16 | [Continuous Control With Deep Reinforcement Learning](https://arxiv.org/abs/1509.02971) | Lillicrap et al, 2015 | DDPG | ⬜️ Unread | [Notes](./notes/016-ddpg.md) |
| 17 | [Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/abs/1802.09477) | Fujimoto et al, 2018 | TD3 | ⬜️ Unread | [Notes](./notes/017-td3.md) |

### d. Distributional RL

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 18 | [A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887) | Bellemare et al, 2017 | C51 | ⬜️ Unread | [Notes](./notes/018-c51.md) |
| 19 | [Distributional Reinforcement Learning with Quantile Regression](https://arxiv.org/abs/1710.10044) | Dabney et al, 2017 | QR-DQN | ⬜️ Unread | [Notes](./notes/019-qr-dqn.md) |
| 20 | [Implicit Quantile Networks for Distributional Reinforcement Learning](https://arxiv.org/abs/1806.06923) | Dabney et al, 2018 | IQN | ⬜️ Unread | [Notes](./notes/020-iqn.md) |
| 21 | [Dopamine: A Research Framework for Deep Reinforcement Learning](https://openreview.net/forum?id=ByG_3s09KX) | Anonymous, 2018 | Dopamine Framework | ⬜️ Unread | [Notes](./notes/021-dopamine.md) |
### e. Policy Gradients with Action-Dependent Baselines

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 22 | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic | Gu et al, 2016 | Q-Prop | ⬜️ Unread | [Notes](./notes/022-q-prop.md) |
| 23 | Action-dependent Control Variates for Policy Optimization via Stein's Identity | Liu et al, 2017 | Stein Control Variates | ⬜️ Unread | [Notes](./notes/023-stein-control-variates.md) |
| 24 | The Mirage of Action-Dependent Baselines in Reinforcement Learning | Tucker et al, 2018 | Critique of Baselines | ⬜️ Unread | [Notes](./notes/024-mirage-baselines.md) |

### f. Path-Consistency Learning

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 25 | Bridging the Gap Between Value and Policy Based Reinforcement Learning | Nachum et al, 2017 | PCL | ⬜️ Unread | [Notes](./notes/025-pcl.md) |
| 26 | Trust-PCL: An Off-Policy Trust Region Method for Continuous Control | Nachum et al, 2017 | Trust-PCL | ⬜️ Unread | [Notes](./notes/026-trust-pcl.md) |

### g. Other Directions for Combining Policy-Learning and Q-Learning

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 27 | Combining Policy Gradient and Q-learning | O'Donoghue et al, 2016 | PGQL | ⬜️ Unread | [Notes](./notes/027-pgql.md) |
| 28 | The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning | Gruslys et al, 2017 | Reactor | ⬜️ Unread | [Notes](./notes/028-reactor.md) |
| 29 | Interpolated Policy Gradient | Gu et al, 2017 | IPG | ⬜️ Unread | [Notes](./notes/029-ipg.md) |
| 30 | Equivalence Between Policy Gradients and Soft Q-Learning | Schulman et al, 2017 | Theoretical Link | ⬜️ Unread | [Notes](./notes/030-pg-soft-q-equivalence.md) |

### h. Evolutionary Algorithms

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 31 | [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https://arxiv.org/abs/1703.03864) | Salimans et al, 2017 | ES | ⬜️ Unread | [Notes](./notes/031-es.md) |
## Exploration

### a. Intrinsic Motivation

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 32 | [VIME: Variational Information Maximizing Exploration](https://arxiv.org/abs/1605.09674) | Houthooft et al, 2016 | VIME | ⬜️ Unread | [Notes](./notes/032-vime.md) |
| 33 | [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868) | Bellemare et al, 2016 | CTS-based Pseudocounts | ⬜️ Unread | [Notes](./notes/033-cts-pseudocounts.md) |
| 34 | [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310) | Ostrovski et al, 2017 | PixelCNN-based Pseudocounts | ⬜️ Unread | [Notes](./notes/034-pixelcnn-pseudocounts.md) |
| 35 | [#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning](https://arxiv.org/abs/1611.04717) | Tang et al, 2016 | Hash-based Counts | ⬜️ Unread | [Notes](./notes/035-hash-counts.md) |
| 36 | [EX2: Exploration with Exemplar Models for Deep Reinforcement Learning](https://arxiv.org/abs/1703.01260) | Fu et al, 2017 | EX2 | ⬜️ Unread | [Notes](./notes/036-ex2.md) |
| 37 | [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/abs/1705.05363) | Pathak et al, 2017 | ICM | ⬜️ Unread | [Notes](./notes/037-icm.md) |
| 38 | [Large-Scale Study of Curiosity-Driven Learning](https://arxiv.org/abs/1808.04355) | Burda et al, 2018 | Analysis of Curiosity | ⬜️ Unread | [Notes](./notes/038-curiosity-study.md) |
| 39 | [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894) | Burda et al, 2018 | RND | ⬜️ Unread | [Notes](./notes/039-rnd.md) |

### b. Unsupervised RL

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 40 | [Variational Intrinsic Control](https://arxiv.org/abs/1611.07507) | Gregor et al, 2016 | VIC | ⬜️ Unread | [Notes](./notes/040-vic.md) |
| 41 | [Diversity is All You Need: Learning Skills without a Reward Function](https://arxiv.org/abs/1802.06070) | Eysenbach et al, 2018 | DIAYN | ⬜️ Unread | [Notes](./notes/041-diayn.md) |
| 42 | [Variational Option Discovery Algorithms](https://arxiv.org/abs/1807.10299) | Achiam et al, 2018 | VALOR | ⬜️ Unread | [Notes](./notes/042-valor.md) |

## Transfer and Multitask RL

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 43 | [Progressive Neural Networks](https://arxiv.org/abs/1606.04671) | Rusu et al, 2016 | Progressive Networks | ⬜️ Unread | [Notes](./notes/043-progressive-nets.md) |
| 44 | [Universal Value Function Approximators](https://arxiv.org/abs/1506.07265) | Schaul et al, 2015 | UVFA | ⬜️ Unread | [Notes](./notes/044-uvfa.md) |
| 45 | [Reinforcement Learning with Unsupervised Auxiliary Tasks](https://arxiv.org/abs/1611.05397) | Jaderberg et al, 2016 | UNREAL | ⬜️ Unread | [Notes](./notes/045-unreal.md) |
| 46 | [The Intentional Unintentional Agent](https://arxiv.org/abs/1706.05223) | Cabi et al, 2017 | IU Agent | ⬜️ Unread | [Notes](./notes/046-iu-agent.md) |
| 47 | [PathNet: Evolution Channels Gradient Descent in Super Neural Networks](https://arxiv.org/abs/1701.08734) | Fernando et al, 2017 | PathNet | ⬜️ Unread | [Notes](./notes/047-pathnet.md) |
| 48 | [Mutual Alignment Transfer Learning](https://arxiv.org/abs/1707.07907) | Wulfmeier et al, 2017 | MATL | ⬜️ Unread | [Notes](./notes/048-matl.md) |
| 49 | [Learning an Embedding Space for Transferable Robot Skills](https://arxiv.org/abs/1803.04883) | Hausman et al, 2018 | Transferable Skills | ⬜️ Unread | [Notes](./notes/049-transferable-skills.md) |
| 50 | [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495) | Andrychowicz et al, 2017 | HER | ⬜️ Unread | [Notes](./notes/050-her.md) |
## Hierarchy

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 51 | [Strategic Attentive Writer for Learning Macro-Actions](https://arxiv.org/abs/1606.04695) | Vezhnevets et al, 2016 | STRAW | ⬜️ Unread | [Notes](./notes/051-straw.md) |
| 52 | [FeUdal Networks for Hierarchical Reinforcement Learning](https://arxiv.org/abs/1703.01161) | Vezhnevets et al, 2017 | Feudal Networks | ⬜️ Unread | [Notes](./notes/052-feudal-nets.md) |
| 53 | [Data-Efficient Hierarchical Reinforcement Learning](https://arxiv.org/abs/1805.08296) | Nachum et al, 2018 | HIRO | ⬜️ Unread | [Notes](./notes/053-hiro.md) |

## Memory

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 54 | [Model-Free Episodic Control](https://arxiv.org/abs/1606.04460) | Blundell et al, 2016 | MFEC | ⬜️ Unread | [Notes](./notes/054-mfec.md) |
| 55 | [Neural Episodic Control](https://arxiv.org/abs/1703.01988) | Pritzel et al, 2017 | NEC | ⬜️ Unread | [Notes](./notes/055-nec.md) |
| 56 | [Neural Map: Structured Memory for Deep Reinforcement Learning](https://arxiv.org/abs/1702.08360) | Parisotto & Salakhutdinov, 2017 | Neural Map | ⬜️ Unread | [Notes](./notes/056-neural-map.md) |
| 57 | [Unsupervised Predictive Memory in a Goal-Directed Agent](https://arxiv.org/abs/1803.10760) | Wayne et al, 2018 | MERLIN | ⬜️ Unread | [Notes](./notes/057-merlin.md) |
| 58 | [Relational Recurrent Neural Networks](https://arxiv.org/abs/1806.01822) | Santoro et al, 2018 | RMC | ⬜️ Unread | [Notes](./notes/058-rmc.md) |

## Model-Based RL

### a. Model is Learned

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 59 | [Imagination-Augmented Agents for Deep Reinforcement Learning](https://arxiv.org/abs/1707.06203) | Weber et al, 2017 | I2A | ⬜️ Unread | [Notes](./notes/059-i2a.md) |
| 60 | [Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning](https://arxiv.org/abs/1708.02596) | Nagabandi et al, 2017 | MBMF | ⬜️ Unread | [Notes](./notes/060-mbmf.md) |
| 61 | [Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning](https://arxiv.org/abs/1803.00101) | Feinberg et al, 2018 | MVE | ⬜️ Unread | [Notes](./notes/061-mve.md) |
| 62 | [Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion](https://arxiv.org/abs/1807.01675) | Buckman et al, 2018 | STEVE | ⬜️ Unread | [Notes](./notes/062-steve.md) |
| 63 | [Model-Ensemble Trust-Region Policy Optimization](https://arxiv.org/abs/1802.10592) | Kurutach et al, 2018 | ME-TRPO | ⬜️ Unread | [Notes](./notes/063-me-trpo.md) |
| 64 | [Model-Based Reinforcement Learning via Meta-Policy Optimization](https://arxiv.org/abs/1809.05214) | Clavera et al, 2018 | MB-MPO | ⬜️ Unread | [Notes](./notes/064-mb-mpo.md) |
| 65 | [Recurrent World Models Facilitate Policy Evolution](https://arxiv.org/abs/1809.01999) | Ha & Schmidhuber, 2018 | World Models | ⬜️ Unread | [Notes](./notes/065-world-models.md) |
### b. Model is Given

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 66 | [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815) | Silver et al, 2017 | AlphaZero | ⬜️ Unread | [Notes](./notes/066-alphazero.md) |
| 67 | [Thinking Fast and Slow with Deep Learning and Tree Search](https://arxiv.org/abs/1705.08439) | Anthony et al, 2017 | ExIt | ⬜️ Unread | [Notes](./notes/067-exit.md) |

## Meta-RL

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 68 | [RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning](https://arxiv.org/abs/1611.02779) | Duan et al, 2016 | RL^2 | ⬜️ Unread | [Notes](./notes/068-rl2.md) |
| 69 | [Learning to Reinforcement Learn](https://arxiv.org/abs/1611.05763) | Wang et al, 2016 | Learning to RL | ⬜️ Unread | [Notes](./notes/069-learning-to-rl.md) |
| 70 | [Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/abs/1703.03400) | Finn et al, 2017 | MAML | ⬜️ Unread | [Notes](./notes/070-maml.md) |
| 71 | [A Simple Neural Attentive Meta-Learner](https://arxiv.org/abs/1707.03141) | Mishra et al, 2018 | SNAIL | ⬜️ Unread | [Notes](./notes/071-snail.md) |

## Scaling RL

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 72 | [Accelerated Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1803.02811) | Stooke & Abbeel, 2018 | Parallelization Analysis | ⬜️ Unread | [Notes](./notes/072-accelerated-methods.md) |
| 73 | [IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures](https://arxiv.org/abs/1802.01561) | Espeholt et al, 2018 | IMPALA | ⬜️ Unread | [Notes](./notes/073-impala.md) |
| 74 | [Distributed Prioritized Experience Replay](https://arxiv.org/abs/1803.00933) | Horgan et al, 2018 | Ape-X | ⬜️ Unread | [Notes](./notes/074-apex.md) |
| 75 | [Recurrent Experience Replay in Distributed Reinforcement Learning](https://openreview.net/forum?id=r1lyTjAqYX) | Anonymous, 2018 | R2D2 | ⬜️ Unread | [Notes](./notes/075-r2d2.md) |
| 76 | [RLlib: Abstractions for Distributed Reinforcement Learning](https://arxiv.org/abs/1712.09381) | Liang et al, 2017 | RLlib Library | ⬜️ Unread | [Notes](./notes/076-rllib.md) |

## RL in the Real World

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 77 | [Benchmarking Reinforcement Learning Algorithms on Real-World Robots](https://arxiv.org/abs/1809.10093) | Mahmood et al, 2018 | Robot Benchmark | ⬜️ Unread | [Notes](./notes/077-robot-benchmark.md) |
| 78 | [Learning Dexterous In-Hand Manipulation](https://arxiv.org/abs/1808.00177) | OpenAI, 2018 | Dexterous Manipulation | ⬜️ Unread | [Notes](./notes/078-dexterous-manipulation.md) |
| 79 | [QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation](https://arxiv.org/abs/1806.10293) | Kalashnikov et al, 2018 | QT-Opt | ⬜️ Unread | [Notes](./notes/079-qt-opt.md) |
| 80 | [Horizon: Facebook's Open Source Applied Reinforcement Learning Platform](https://arxiv.org/abs/1811.00260) | Gauci et al, 2018 | Horizon Platform | ⬜️ Unread | [Notes](./notes/080-horizon.md) |

## Safety

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 81 | [Concrete Problems in AI Safety](https://arxiv.org/abs/1606.06565) | Amodei et al, 2016 | Safety Problems Taxonomy | ⬜️ Unread | [Notes](./notes/081-concrete-problems-safety.md) |
| 82 | [Deep Reinforcement Learning From Human Preferences](https://arxiv.org/abs/1706.03741) | Christiano et al, 2017 | LFP | ⬜️ Unread | [Notes](./notes/082-lfp.md) |
| 83 | [Constrained Policy Optimization](https://arxiv.org/abs/1705.10528) | Achiam et al, 2017 | CPO | ⬜️ Unread | [Notes](./notes/083-cpo.md) |
| 84 | [Safe Exploration in Continuous Action Spaces](https://arxiv.org/abs/1801.08757) | Dalal et al, 2018 | DDPG+Safety Layer | ⬜️ Unread | [Notes](./notes/084-ddpg-safety.md) |
| 85 | [Trial without Error: Towards Safe Reinforcement Learning via Human Intervention](https://arxiv.org/abs/1707.05173) | Saunders et al, 2017 | HIRL | ⬜️ Unread | [Notes](./notes/085-hirl.md) |
| 86 | [Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning](https://arxiv.org/abs/1711.06782) | Eysenbach et al, 2017 | Leave No Trace | ⬜️ Unread | [Notes](./notes/086-leave-no-trace.md) |

## Imitation Learning and Inverse Reinforcement Learning

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 87 | [Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy](https://www.cs.cmu.edu/~bziebart/publications/ziebart-thesis.pdf) | Ziebart, 2010 | Maximum Entropy IRL | ⬜️ Unread | [Notes](./notes/087-maxent-irl.md) |
| 88 | [Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization](https://arxiv.org/abs/1603.00448) | Finn et al, 2016 | GCL | ⬜️ Unread | [Notes](./notes/088-gcl.md) |
| 89 | [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476) | Ho & Ermon, 2016 | GAIL | ⬜️ Unread | [Notes](./notes/089-gail.md) |
| 90 | [DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills](https://arxiv.org/abs/1804.02717) | Peng et al, 2018 | DeepMimic | ⬜️ Unread | [Notes](./notes/090-deepmimic.md) |
| 91 | [Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs...](https://arxiv.org/abs/1810.00821) | Peng et al, 2018 | VAIL | ⬜️ Unread | [Notes](./notes/091-vail.md) |
| 92 | [One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL](https://arxiv.org/abs/1810.10543) | Le Paine et al, 2018 | MetaMimic | ⬜️ Unread | [Notes](./notes/092-metamimic.md) |

## Reproducibility, Analysis, and Critique

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 93 | [Benchmarking Deep Reinforcement Learning for Continuous Control](https://arxiv.org/abs/1604.06778) | Duan et al, 2016 | rllab | ⬜️ Unread | [Notes](./notes/093-rllab.md) |
| 94 | [Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control](https://arxiv.org/abs/1708.04133) | Islam et al, 2017 | Reproducibility Analysis | ⬜️ Unread | [Notes](./notes/094-reproducibility-continuous.md) |
| 95 | [Deep Reinforcement Learning that Matters](https://arxiv.org/abs/1709.06560) | Henderson et al, 2017 | Reproducibility Analysis | ⬜️ Unread | [Notes](./notes/095-rl-that-matters.md) |
| 96 | [Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in PG](https://arxiv.org/abs/1810.02276) | Henderson et al, 2018 | Analysis of PG | ⬜️ Unread | [Notes](./notes/096-optimum-go.md) |
| 97 | [Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?](https://arxiv.org/abs/1811.02553) | Ilyas et al, 2018 | Critique of PG | ⬜️ Unread | [Notes](./notes/097-are-dpg-truly-pg.md) |
| 98 | [Simple Random Search Provides a Competitive Approach to Reinforcement Learning](https://arxiv.org/abs/1803.07055) | Mania et al, 2018 | Random Search | ⬜️ Unread | [Notes](./notes/098-random-search.md) |

## Bonus: Classic Papers in RL Theory or Review

| # | Title | Authors & Year | Key Contribution / Algorithm | Status | My Notes |
|:-:|:---|:---|:---|:---:|:---:|
| 99 | [Policy Gradient Methods for Reinforcement Learning with Function Approximation](http://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) | Sutton et al, 2000 | Policy Gradient Theorem | ⬜️ Unread | [Notes](./notes/099-policy-gradient-theorem.md) |
| 100 | [An Analysis of Temporal-Difference Learning with Function Approximation](http://www.mit.edu/~jnt/Papers/J006-96-td-fa.pdf) | Tsitsiklis & Van Roy, 1997 | TD Convergence Analysis | ⬜️ Unread | [Notes](./notes/100-td-analysis.md) |
| 101 | [Reinforcement Learning of Motor Skills with Policy Gradients](https://www.ias.informatik.tu-darmstadt.de/uploads/Team/JanPeters/peters-2008-icml-tutorial.pdf) | Peters & Schaal, 2008 | Review of Policy Gradients | ⬜️ Unread | [Notes](./notes/101-pg-review.md) |
| 102 | [Approximately Optimal Approximate Reinforcement Learning](http://www.machinelearning.org/archive/icml2002/papers/054.pdf) | Kakade & Langford, 2002 | Monotonic Improvement Theory | ⬜️ Unread | [Notes](./notes/102-aoarl.md) |
| 103 | [A Natural Policy Gradient](https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf) | Kakade, 2002 | Natural Policy Gradient | ⬜️ Unread | [Notes](./notes/103-npg.md) |
| 104 | [Algorithms for Reinforcement Learning](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf) | Szepesvari, 2009 | Foundational RL Algorithms | ⬜️ Unread | [Notes](./notes/104-rl-algorithms.md) |