End-to-end reinforcement learning
In end-to-end reinforcement learning, the end-to-end process, that is, the entire pathway from sensors to motors in a robot or agent, is implemented as a single layered or recurrent neural network without modularization, and this network is trained as a whole by reinforcement learning[1]. The approach has been proposed for a long time[2][3], and interest in it was sparked by Google DeepMind's successful results in playing Atari video games (2013–15)[4][5][6][7] and AlphaGo (2016)[8]. As in deep learning, the use of a neural network makes it possible to learn massively parallel processing that humans can hardly design by hand, and thereby to surpass hand-designed systems. Unlike supervised learning, reinforcement learning enables autonomous learning, so human design can be kept to a minimum and flexible, purposive learning can be realized over a huge number of degrees of freedom. For these reasons, the approach is expected to help solve the frame problem and the symbol grounding problem, and to open a way toward artificial general intelligence (AGI) or strong AI.
In conventional reinforcement learning research, the state space and action space are usually designed in advance, and only the mapping from states to actions is learned[9]. Learning is therefore limited to action selection: before learning, human designers must specify how the state space is constructed from sensor signals and how motor commands are generated for each action. Neural networks have often been used in reinforcement learning, but mainly as non-linear function approximators to avoid the curse of dimensionality that arises with a table-lookup approach[9]. Recurrent neural networks have also been used at times, but mainly to cope with perceptual aliasing in partially observable Markov decision processes (POMDPs)[10][11][12][13][14].
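As a concrete illustration (a standard textbook formulation[9], not specific to any one of the systems discussed in this article), the tabular Q-learning update over a pre-designed state space and action space is

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],

where \alpha is the learning rate and \gamma the discount factor. When a neural network with weights \theta is used as a function approximator, Q(s, a; \theta) replaces the table and \theta is adjusted by gradient descent on the same temporal-difference error.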
End-to-end reinforcement learning, in contrast, extends the learned process from action selection alone to the entire process from sensors to motors. As a result, not only actions but also various other functions, including recognition and memory, are expected to emerge. Higher functions in particular connect directly with neither sensors nor motors, so it is difficult even to decide what their inputs and outputs should be; this difficulty has hindered their understanding and development, and progress on them is expected from this approach.
History
The origin of this approach can be seen in TD-Gammon by G. Tesauro (1992)[15]. In the board game backgammon, the evaluation of the game situation during self-play was learned through TD(λ) using a layered neural network. Four units encoded the number of men of a given color at a given location on the board, giving 198 input signals in total. With zero built-in knowledge, the network learned from scratch to play the entire game at a fairly strong intermediate level of performance, and its internal representation after learning was examined.
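The learning rule behind TD-Gammon is the standard TD(λ) update with eligibility traces, written here in a generic form rather than exactly as in the original paper: for a value network V(s_t; w),

e_t = \lambda\, e_{t-1} + \nabla_w V(s_t; w), \qquad w \leftarrow w + \alpha \left[ V(s_{t+1}; w) - V(s_t; w) \right] e_t,

where \alpha is a step size and \lambda controls how far a prediction error is propagated back to earlier positions; at the end of a game, the final outcome takes the place of V(s_{t+1}; w).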
K. Shibata's group has pursued this framework in many works since around 1997[16][3]. Besides Q-learning, they have employed actor-critic methods for continuous motion tasks[17] and recurrent neural networks for tasks that require memory[18], and they have applied the framework to several real robot tasks[17][19]. They have also shown that various functions emerge in this framework, as described in the next section.
Since around 2013, as mentioned above, Google DeepMind has shown impressive learning results in playing Atari video games[4][5] and the game of Go (AlphaGo)[8]. They used a deep convolutional neural network, an architecture that had shown superior results in image recognition, and fed it 4 frames of almost raw pixels (84x84) as inputs; the network was trained by reinforcement learning, with the reward representing the sign of the change in the game score. All 49 games could be learned with the same network architecture and Q-learning with minimal prior knowledge, and the system outperformed competing methods in almost all the games and performed at a level broadly comparable with or superior to a professional human game tester in the majority of them[5]. This system is often called DQN (Deep Q-Network), and its results have been taken as a step toward AGI and strong AI. In AlphaGo, deep neural networks were trained not only by reinforcement learning but also by supervised learning, and were combined with Monte Carlo tree search[8].
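A minimal sketch of such a network is given below, written in PyTorch (an assumption of this illustration; DeepMind's own implementation is not reproduced here). Layer sizes follow the architecture reported for DQN[5]: four stacked 84x84 frames in, one Q-value per discrete action out, with the highest-valued action chosen greedily.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        # Convolutional feature extractor over 4 stacked 84x84 frames.
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 7 x 7
        )
        # Fully connected head producing one Q-value per action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: tensor of shape (batch, 4, 84, 84) with pixel values scaled to [0, 1]
        return self.head(self.features(frames))

# Greedy action selection from the predicted Q-values.
net = DQN(n_actions=4)
q_values = net(torch.zeros(1, 4, 84, 84))
action = q_values.argmax(dim=1).item()
```

During training, the predicted Q-value is moved toward the reward plus the discounted maximum Q-value of the next state, as in the Q-learning update shown earlier.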
Function Emergence
K. Shibata's group has shown that various functions emerge in this framework, including (1) image recognition, (2) color constancy (optical illusion), (3) sensor motion (active recognition), (4) hand-eye coordination and hand reaching movement, (5) explanation of brain activities, (6) knowledge transfer, (7) memory, (8) selective attention, (9) prediction, and (10) exploration[3]. Communication has also been established in this framework: (1) dynamic communication (negotiation), (2) binarization of signals, and (3) grounded communication using a real robot and camera emerged in their works[20].
References
1. Hassabis, Demis (March 11, 2016). Artificial Intelligence and the Future (Speech).
2. Shibata, Katsunari (January 14, 2011). "Chapter 6: Emergence of Intelligence through Reinforcement Learning with a Neural Network". In Mellouk, Abdelhamid (ed.). Advances in Reinforcement Learning. Intech. pp. 99–120. ISBN 978-953-307-369-9.
3. Shibata, Katsunari (March 7, 2017). "Functions that Emerge through End-to-End Reinforcement Learning". arXiv:1703.02239.
4. Mnih, Volodymyr; et al. (December 2013). Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013.
5. Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. doi:10.1038/nature14236.
6. Mnih, V.; et al. (26 February 2015). Performance of DQN in the Game Space Invaders (video).
7. Mnih, V.; et al. (26 February 2015). Demonstration of Learning Progress in the Game Breakout (video).
8. Silver, David; et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature. 529 (7587): 484–489. doi:10.1038/nature16961.
9. Sutton, Richard S.; Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0262193986.
10. Lin, Long-Ji; Mitchell, Tom M. (1993). Reinforcement Learning with Hidden States. From Animals to Animats. Vol. 2. pp. 271–280.
11. Onat, Ahmet; Kita, Hajime; et al. (1998). Q-learning with Recurrent Neural Networks as a Controller for the Inverted Pendulum Problem. The 5th International Conference on Neural Information Processing (ICONIP). pp. 837–840.
12. Onat, Ahmet; Kita, Hajime; et al. (1998). Recurrent Neural Networks for Reinforcement Learning: Architecture, Learning Algorithms and Internal Representation. International Joint Conference on Neural Networks (IJCNN). pp. 2010–2015.
13. Bakker, Bram; Linaker, Fredrik; et al. (2002). Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction (PDF). 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 938–943.
14. Bakker, Bram; Zhumatiy, Viktor; et al. (2003). A Robot that Reinforcement-Learns to Identify and Memorize Important Previous Observation (PDF). 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 430–435.
15. Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3). doi:10.1145/203330.203343.
16. Shibata, Katsunari; Okabe, Yoichi (1997). Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs (PDF). International Conference on Neural Networks (ICNN) 1997.
17. Shibata, Katsunari; Iida, Masaru (2003). Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning (PDF). SICE Annual Conference 2003.
18. Utsunomiya, Hiroki; Shibata, Katsunari (2008). Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task (PDF). International Conference on Neural Information Processing (ICONIP) '08.
19. Shibata, Katsunari; Kawano, Tomohiko (2008). Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network (PDF). International Conference on Neural Information Processing (ICONIP) '08.
20. Shibata, Katsunari (March 9, 2017). "Communications that Emerge through Reinforcement Learning Using a (Recurrent) Neural Network". arXiv:1703.03543.