Actor-critic algorithm
The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based and value-based methods. It consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[1] Some AC algorithms are on-policy and others are off-policy; some apply to discrete action spaces, some to continuous ones, and some to both.
AC algorithms are one of the main families of algorithms used in modern RL.[2]
Overview
The actor-critic method belongs to the family of policy gradient methods but addresses their high variance by incorporating a value function approximator (the critic). The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value Q-function $Q(s, a)$, the advantage function $A(s, a)$, or any combination thereof.
The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$.
If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int \pi_\theta(a \mid s)\, \mathrm{d}a = 1$.
The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right],$$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time-horizon (which can be infinite).
The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy parameters $\theta$.
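In actor-critic methods, the gradient ascent direction for the actor is commonly estimated as $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ weighted by a critic-derived signal, such as the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The following is a minimal sketch of that loop for an assumed toy two-state MDP with a tabular softmax actor and a tabular critic; the dynamics, rewards, and hyperparameters are invented solely for illustration and are not part of any reference implementation.

```python
import numpy as np

# Minimal tabular actor-critic sketch (illustrative assumptions:
# a hand-made 2-state, 2-action MDP; softmax actor; one-step TD critic).
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.2

# Assumed toy dynamics: P[s, a] = next state, R[s, a] = reward.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

theta = np.zeros((n_states, n_actions))  # actor parameters (policy logits)
V = np.zeros(n_states)                   # critic: state-value estimates

def policy(s):
    """Softmax policy pi_theta(. | s)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

s = 0
for _ in range(5000):
    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next, r = P[s, a], R[s, a]

    # Critic: one-step TD error, used to update V and as the actor's learning signal.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta

    # Actor: policy-gradient ascent step; for a softmax policy,
    # grad_theta log pi(a|s) = one_hot(a) - probs (for the row theta[s]).
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log_pi

    s = s_next

print("policy per state:", [policy(i).round(2) for i in range(n_states)])
print("value estimates:", V.round(2))
```

In deep RL variants, the tables are replaced by neural networks, and the same TD error (or an advantage estimate) weights the policy-gradient update.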
Variants
- Advantage Actor-Critic (A2C): Uses the advantage function instead of the TD error.[3]
- Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
- Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[4]
- Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[5]
- Generalized Advantage Estimation (GAE): Introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$, high variance, no bias) and 1-step TD learning ($\lambda = 0$, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with $\lambda$ being the decay strength (see the sketch after this list).[6]
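As a concrete reading of the GAE entry above: with one-step TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the GAE advantage is $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^{l} \delta_{t+l}$, which can be computed in a single backward pass over an episode. The snippet below is a sketch under assumed inputs (reward and value arrays for one finished episode); the array names and sample numbers are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (bootstrap value last).
    Returns A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed backwards.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative call with made-up rewards and value estimates.
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])  # includes V(s_T) for bootstrapping
print(gae_advantages(rewards, values))
```

Setting lam=1 recovers the Monte Carlo advantage and lam=0 recovers the one-step TD error, matching the interpolation described above.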
References
- Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
- Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
- Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (2016-06-16). Asynchronous Methods for Deep Reinforcement Learning. doi:10.48550/arXiv.1602.01783.
- Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (2019-01-29). Soft Actor-Critic Algorithms and Applications. doi:10.48550/arXiv.1812.05905.
- Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2019-07-05). Continuous control with deep reinforcement learning. doi:10.48550/arXiv.1509.02971.
- Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018-10-20). High-Dimensional Continuous Control Using Generalized Advantage Estimation. doi:10.48550/arXiv.1506.02438.
- Konda, Vijay R.; Tsitsiklis, John N. (2003). "On Actor-Critic Algorithms". SIAM Journal on Control and Optimization. 42 (4): 1143–1166. doi:10.1137/S0363012901385691. ISSN 0363-0129.
- Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
- Bertsekas, Dimitri P. (2019). Reinforcement Learning and Optimal Control (2nd ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
- Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1st ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.