Actor-critic algorithm
The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based and value-based methods. It consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[1] Some AC algorithms are on-policy and others are off-policy; some apply to discrete action spaces, some to continuous ones, and some to both.
AC algorithms are one of the main families of algorithms used in modern RL.[2]
Overview
The actor-critic method belongs to the family of policy gradient methods but addresses their high variance by incorporating a value function approximator (the critic). The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value Q-function $Q(s, a)$, the advantage function $A(s, a)$, or any combination thereof.
The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$.
If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int \pi_\theta(a \mid s)\, \mathrm{d}a = 1$.
The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right],$$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time-horizon (which can be infinite).
The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy parameters $\theta$.
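In actor-critic methods, the gradient ascent direction for the actor is commonly estimated as $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ weighted by a critic-derived signal, such as the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The following is a minimal sketch of that loop for an assumed toy two-state MDP with a tabular softmax actor and a tabular critic; the dynamics, rewards, and hyperparameters are invented solely for illustration and are not part of any reference implementation.

```python
import numpy as np

# Minimal tabular actor-critic sketch (illustrative assumptions:
# a hand-made 2-state, 2-action MDP; softmax actor; one-step TD critic).
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.2

# Assumed toy dynamics: P[s, a] = next state, R[s, a] = reward.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

theta = np.zeros((n_states, n_actions))  # actor parameters (policy logits)
V = np.zeros(n_states)                   # critic: state-value estimates

def policy(s):
    """Softmax policy pi_theta(. | s)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

s = 0
for _ in range(5000):
    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next, r = P[s, a], R[s, a]

    # Critic: one-step TD error, used to update V and as the actor's learning signal.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta

    # Actor: policy-gradient ascent step; for a softmax policy,
    # grad_theta log pi(a|s) = one_hot(a) - probs (for the row theta[s]).
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log_pi

    s = s_next

print("policy per state:", [policy(i).round(2) for i in range(n_states)])
print("value estimates:", V.round(2))
```

In deep RL variants, the tables are replaced by neural networks, and the same TD error (or an advantage estimate) weights the policy-gradient update.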
Variants
- Advantage Actor-Critic (A2C): Uses the advantage function instead of the TD error.[3]
- Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
- Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[4]
- Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[5]
- Generalized Advantage Estimation (GAE): Introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$, high variance, no bias) and 1-step TD learning ($\lambda = 0$, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with $\lambda$ being the decay strength (see the sketch after this list).[6]
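As a concrete reading of the GAE entry above: with one-step TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the GAE advantage is $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^{l} \delta_{t+l}$, which can be computed in a single backward pass over an episode. The snippet below is a sketch under assumed inputs (reward and value arrays for one finished episode); the array names and sample numbers are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (bootstrap value last).
    Returns A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed backwards.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative call with made-up rewards and value estimates.
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])  # includes V(s_T) for bootstrapping
print(gae_advantages(rewards, values))
```

Setting lam=1 recovers the Monte Carlo advantage and lam=0 recovers the one-step TD error, matching the interpolation described above.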
References
- Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
- Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
- Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (2016-06-16). Asynchronous Methods for Deep Reinforcement Learning. doi:10.48550/arXiv.1602.01783.
- Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (2019-01-29). Soft Actor-Critic Algorithms and Applications. doi:10.48550/arXiv.1812.05905.
- Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2019-07-05). Continuous control with deep reinforcement learning. doi:10.48550/arXiv.1509.02971.
- Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018-10-20). High-Dimensional Continuous Control Using Generalized Advantage Estimation. doi:10.48550/arXiv.1506.02438.
- Konda, Vijay R.; Tsitsiklis, John N. (2003). "On Actor-Critic Algorithms". SIAM Journal on Control and Optimization. 42 (4): 1143–1166. doi:10.1137/S0363012901385691. ISSN 0363-0129.
- Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
- Bertsekas, Dimitri P. (2019). Reinforcement Learning and Optimal Control (2nd ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
- Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1st ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.