Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based and value-based methods. It consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[1] AC algorithms may be on-policy or off-policy, and different variants handle discrete action spaces, continuous action spaces, or both.

The AC algorithms are one of the main algorithm families used in modern RL.[2]

Overview

The actor-critic method belongs to the family of policy gradient methods but addresses their high variance issue by incorporating a value function approximator (the critic). The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value Q-function $Q(s, a)$, the advantage function $A(s, a)$, or any combination thereof.
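
As a minimal illustration (not code from the cited sources), a critic that approximates the state-value function $V(s)$ can be a small neural network mapping a state to a scalar; the sketch below uses PyTorch with placeholder dimensions.

    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        """Value-function approximator: maps a state to a scalar estimate of V(s)."""
        def __init__(self, state_dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.Tanh(),
                nn.Linear(hidden, 1),  # single output: the value estimate
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state).squeeze(-1)

A critic for the Q-function would additionally condition on the action and output $Q(s, a)$; an advantage estimate can then be formed as $A(s, a) = Q(s, a) - V(s)$ or from TD errors.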

The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$ over actions.

If the action space $\mathcal{A}$ is discrete, then $\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_{\mathcal{A}} \pi_\theta(a \mid s)\, \mathrm{d}a = 1$.
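
As another illustrative sketch (assumed architecture and placeholder dimensions, not from the cited sources), the following PyTorch actor produces a normalized distribution over a discrete action space:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class Actor(nn.Module):
        """Parameterized policy: maps a state to a probability distribution over actions."""
        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.Tanh(),
                nn.Linear(hidden, n_actions),  # unnormalized action preferences (logits)
            )

        def forward(self, state: torch.Tensor) -> Categorical:
            # Categorical normalizes the logits, so the action probabilities sum to 1
            return Categorical(logits=self.net(state))

    # Usage: sample an action and keep its log-probability for the policy-gradient update.
    actor = Actor(state_dim=4, n_actions=2)   # placeholder dimensions
    dist = actor(torch.randn(4))              # placeholder state
    action = dist.sample()
    log_prob = dist.log_prob(action)

For a continuous action space, the final layer would instead parameterize a density, for example the mean and standard deviation of a Gaussian (torch.distributions.Normal), which integrates to 1 over the action space.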

The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right],$$

where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time-horizon (which can be infinite).

The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on $\theta$.
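
A standard form of this gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, \Psi_t\right],$$

where the weighting $\Psi_t$ may be the return, the action-value $Q^{\pi}(s_t, a_t)$, the advantage $A^{\pi}(s_t, a_t)$, or a TD error; the critic supplies the lower-variance choices. The sketch below (a minimal illustration under assumed interfaces, using the Actor and Critic sketched above, not code from the cited papers) performs one such update, using the critic's TD error as the advantage estimate; gradient ascent is implemented by minimizing the negative objective.

    import torch

    def actor_critic_step(actor, critic, actor_opt, critic_opt,
                          state, action, reward, next_state, done, gamma=0.99):
        """One-step actor-critic update. `actor_opt` and `critic_opt` are torch.optim
        optimizers; `done` is 1.0 if `next_state` is terminal and 0.0 otherwise."""
        value = critic(state)
        with torch.no_grad():
            target = reward + gamma * critic(next_state) * (1.0 - done)
        td_error = target - value                    # delta = r + gamma*V(s') - V(s)

        # Critic: regress V(s) toward the bootstrapped target.
        critic_loss = td_error.pow(2).mean()
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: ascend the gradient of log pi(a|s) weighted by the (detached) TD error.
        log_prob = actor(state).log_prob(action)
        actor_loss = -(log_prob * td_error.detach()).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()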

Variants

  • Advantage Actor-Critic (A2C): Uses the advantage function $A(s, a)$ instead of the TD error.[3]
  • Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
  • Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[4]
  • Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[5]
  • Generalized Advantage Estimation (GAE): introduces a hyperparameter $\lambda \in [0, 1]$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$, high variance, no bias) and 1-step TD learning ($\lambda = 0$, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with $\lambda$ being the decay strength (a sketch of this computation follows the list).[6]
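
The GAE computation can be sketched as follows (an illustration, not code from the paper): it accumulates discounted TD errors in reverse over one trajectory, with gamma the discount factor and lam the decay strength $\lambda$.

    import numpy as np

    def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation for a single trajectory.

        rewards: r_t for t = 0..T-1
        values: critic estimates V(s_t) for t = 0..T-1
        last_value: V(s_T), the estimate for the state after the final step
        """
        T = len(rewards)
        advantages = np.zeros(T)
        next_value = last_value
        gae = 0.0
        for t in reversed(range(T)):
            # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * next_value - values[t]
            # Exponentially decaying sum of future TD errors
            gae = delta + gamma * lam * gae
            advantages[t] = gae
            next_value = values[t]
        return advantages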

See also

References

  1. ^ Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
  2. ^ Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
  3. ^ a b Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (2016-06-16), Asynchronous Methods for Deep Reinforcement Learning, doi:10.48550/arXiv.1602.01783
  4. ^ Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (2019-01-29), Soft Actor-Critic Algorithms and Applications, doi:10.48550/arXiv.1812.05905
  5. ^ Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2019-07-05), Continuous control with deep reinforcement learning, doi:10.48550/arXiv.1509.02971
  6. ^ Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018-10-20), High-Dimensional Continuous Control Using Generalized Advantage Estimation, doi:10.48550/arXiv.1506.02438
  • Konda, Vijay R.; Tsitsiklis, John N. (January 2003). "On Actor-Critic Algorithms". SIAM Journal on Control and Optimization. 42 (4): 1143–1166. doi:10.1137/S0363012901385691. ISSN 0363-0129.
  • Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (2 ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.
  • Bertsekas, Dimitri P. (2019). Reinforcement learning and optimal control (2 ed.). Belmont, Massachusetts: Athena Scientific. ISBN 978-1-886529-39-7.
  • Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (1 ed.). Cham: Springer International Publishing. ISBN 978-3-031-00423-0.