End-to-end reinforcement learning
In end-to-end reinforcement learning, the end-to-end process, that is, the entire process from sensors to motors in a robot or agent, consists of a single layered or recurrent neural network without modularization, and the network is trained as a whole by reinforcement learning[1]. The approach was proposed by K. Shibata[2][3] and gained wide attention through the successful results of Google DeepMind in ATARI TV games (2013–15)[4][5] and AlphaGo (2016)[6]. As with deep learning, the use of a neural network makes it possible to learn massively parallel processing that humans can hardly design by hand, and even to surpass what humans can design. Unlike supervised learning, reinforcement learning allows autonomous learning, so the intervention of human design can be kept to a minimum and flexible learning over a huge number of degrees of freedom becomes possible. For these reasons, the approach is expected to open a way toward artificial general intelligence (AGI) or strong AI.
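The following is a minimal sketch of the idea, not taken from the cited work: a single layered network maps raw sensor values directly to motor commands and is trained end-to-end with a simple policy-gradient (REINFORCE-style) update. The environment interaction, layer sizes, and learning rate are illustrative assumptions.

```python
# Minimal end-to-end sketch (illustrative assumptions throughout):
# one layered network from raw sensors to motor commands, trained by a
# REINFORCE-style policy-gradient update with a fixed-variance Gaussian policy.
import numpy as np

rng = np.random.default_rng(0)
N_SENSORS, N_HIDDEN, N_MOTORS = 64, 32, 2        # assumed sizes
W1 = rng.normal(0, 0.1, (N_HIDDEN, N_SENSORS))
W2 = rng.normal(0, 0.1, (N_MOTORS, N_HIDDEN))
LR = 1e-3

def policy(sensors):
    """Forward pass: raw sensor vector -> mean motor command."""
    h = np.tanh(W1 @ sensors)
    return np.tanh(W2 @ h), h

def update(sensors, h, action, mean, reward):
    """REINFORCE with unit-variance Gaussian exploration:
    grad log pi = (action - mean) * d(mean)/d(weights), scaled by reward."""
    global W1, W2
    d_out = (action - mean) * (1 - mean**2)      # backprop through output tanh
    gW2 = np.outer(d_out, h)
    d_h = (W2.T @ d_out) * (1 - h**2)
    gW1 = np.outer(d_h, sensors)
    W2 += LR * reward * gW2
    W1 += LR * reward * gW1

# one illustrative interaction step with dummy data
sensors = rng.random(N_SENSORS)                  # e.g. raw camera pixel values
mean, h = policy(sensors)
action = mean + rng.normal(0, 1.0, N_MOTORS)     # exploration noise
reward = 1.0                                     # supplied by the task
update(sensors, h, action, mean, reward)
```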
In reinforcement learning research, it has been common to design the state space and the action space in advance and to learn only the mapping from states to actions[7]. Reinforcement learning has therefore been limited to learning actions: before learning, human designers have had to decide how the state space is constructed from sensor signals and how motor commands are generated for each action. Neural networks have often been used in reinforcement learning, but mainly as non-linear function approximators that avoid the curse of dimensionality arising from table-lookup approaches[7]. Recurrent neural networks have also been used occasionally, mainly to cope with perceptual aliasing or POMDPs (partially observable Markov decision processes)[8]. End-to-end reinforcement learning, by contrast, extends the learned process to the entire pathway from sensors to motors, and thus extends reinforcement learning from learning of actions to learning of the whole processing. As a result, not only actions but also various functions, including recognition and memory, are expected to emerge. Higher functions are especially difficult to understand or develop because they connect directly to neither sensors nor motors, so even deciding their inputs or outputs is very difficult; this approach is therefore expected to make progress on them possible.
History
The origin of this approach can be seen in TD-Gammon by G. Tesauro (1992)[9]. In the popular game of backgammon, the evaluation of the game situation during self-play was learned through TD(λ) using a layered neural network. Four inputs were used to encode the number of checkers of a given color at a given location on the board, giving 198 input signals in total. With zero built-in knowledge, the network learned from scratch to play the entire game at a fairly strong intermediate level of performance, and the internal representation after learning was examined.
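The following sketch illustrates the TD(λ) value update that TD-Gammon is based on, under simplifying assumptions: a linear value function is used in place of TD-Gammon's multilayer network, and only a single transition is shown. The feature size of 198 follows the input encoding described above; all other numbers are illustrative.

```python
# TD(lambda) value learning with eligibility traces (simplified sketch;
# TD-Gammon itself used a multilayer network over these 198 inputs).
import numpy as np

N_FEATURES = 198            # as in TD-Gammon's board encoding
ALPHA, LAM, GAMMA = 0.1, 0.7, 1.0

w = np.zeros(N_FEATURES)    # value-function weights
e = np.zeros(N_FEATURES)    # eligibility trace

def value(x):
    return float(w @ x)

def td_lambda_step(x, x_next, reward, terminal):
    """One TD(lambda) update for a transition x -> x_next."""
    global w, e
    target = reward if terminal else reward + GAMMA * value(x_next)
    delta = target - value(x)          # TD error
    e = GAMMA * LAM * e + x            # accumulate trace
    w += ALPHA * delta * e

# illustrative call with dummy board features
x = np.zeros(N_FEATURES); x[0] = 1.0
x_next = np.zeros(N_FEATURES); x_next[1] = 1.0
td_lambda_step(x, x_next, reward=0.0, terminal=False)
```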
K. Shibata's group has pursued this framework since around 1995[10][11]. They did not use the term end-to-end reinforcement learning, but called it direct-vision-based reinforcement learning. They applied the framework to several real-robot tasks, such as a box-pushing task[12] and a kissing-AIBO task[13], and showed that various functions emerge in this framework, as described in the next section. They used actor-critic for learning continuous motions and Q-learning for learning discrete actions, and also actor-Q for tasks combining discrete decisions and continuous motions[14]. They used three-layer or deeper neural networks (e.g. five layers[13]), recurrent neural networks for memory-required tasks[15], and a layered neural network with local receptive fields that is similar to a convolutional neural network[16].
More recently, as mentioned above, Google DeepMind achieved very impressive results in TV games[4][5] and in the game of Go (AlphaGo)[6]. They used a deep convolutional neural network, an architecture that has shown superior results in image recognition. The network received 4 stacked, almost raw preprocessed frames of 84×84 pixels derived from the screen image as input, and was trained by reinforcement learning with a reward given by the sign of the change in the game score. All 49 games could be learned with the same network architecture and Q-learning with minimal prior knowledge; the agent outperformed competing methods in almost all the games and performed at a level broadly comparable with or superior to a professional human games tester in the majority of games[5]. This system is sometimes called DQN (Deep Q-Network), and its success is one reason for the heightened expectations toward AGI and strong AI. In AlphaGo, however, deep neural networks were trained not only by reinforcement learning but also by supervised learning, and they were combined with Monte Carlo tree search[6].
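The following is a simplified sketch of the DQN-style learning rule, not DeepMind's implementation: Q-values for all actions are produced from a stack of four preprocessed 84×84 frames, and the network is trained toward the target r + γ·max over a' of Q_target(s', a') using a periodically synchronized target network. A single fully connected hidden layer stands in here for the original convolutional network, and all sizes and hyperparameters are illustrative assumptions.

```python
# Simplified DQN-style update (illustrative stand-in for the convolutional
# network of the original work; sizes and hyperparameters are assumptions).
import numpy as np

rng = np.random.default_rng(0)
N_IN = 84 * 84 * 4           # flattened stack of 4 preprocessed frames
N_HIDDEN, N_ACTIONS = 256, 18
GAMMA, LR = 0.99, 1e-4

W1 = rng.normal(0, 0.01, (N_HIDDEN, N_IN))
W2 = rng.normal(0, 0.01, (N_ACTIONS, N_HIDDEN))
W1_t, W2_t = W1.copy(), W2.copy()        # periodically re-synced target network

def q_values(x, w1, w2):
    h = np.maximum(0.0, w1 @ x)          # ReLU hidden layer
    return w2 @ h, h

def dqn_update(s, a, r, s_next, done):
    """One gradient step on the squared TD error of the taken action.
    r would typically be the sign of the score change, as in the text."""
    global W1, W2
    q_next, _ = q_values(s_next, W1_t, W2_t)
    target = r if done else r + GAMMA * q_next.max()
    q, h = q_values(s, W1, W2)
    delta = q[a] - target                # TD error for the chosen action
    gW2 = np.zeros_like(W2)
    gW2[a] = delta * h                   # error flows only through action a
    d_h = delta * W2[a] * (h > 0)
    gW1 = np.outer(d_h, s)
    W2 -= LR * gW2
    W1 -= LR * gW1
```

In the actual system, transitions are drawn from an experience-replay buffer and the target network is copied from the online network only every fixed number of steps; both details are omitted here for brevity.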
Function Emergence
K. Shibata's group has shown that various functions emerge in this framework: recognition, sensor motion, selective attention, prediction, memory, exploration, communication, explanation of optical illusions, explanation of brain activities in tool use, and so on. It is difficult to describe all of them here; see [3] for details. In memory-required tasks, a recurrent neural network was used.
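The following is a minimal sketch, under assumed sizes, of the kind of Elman-style recurrent network used in such memory-required tasks: the hidden state is fed back at the next time step, so information from past sensor inputs can influence the current motor output.

```python
# Minimal Elman-style recurrent network (sizes are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HIDDEN, N_OUT = 16, 20, 2
W_in  = rng.normal(0, 0.1, (N_HIDDEN, N_IN))
W_rec = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN))
W_out = rng.normal(0, 0.1, (N_OUT, N_HIDDEN))

def step(sensors, h_prev):
    """One time step: the new hidden state depends on the input AND the
    previous hidden state, which acts as internal memory."""
    h = np.tanh(W_in @ sensors + W_rec @ h_prev)
    motors = np.tanh(W_out @ h)
    return motors, h

h = np.zeros(N_HIDDEN)                       # memory carried across time
for t in range(5):
    motors, h = step(rng.random(N_IN), h)    # dummy sensor stream
```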
References
- ^ Hassabis, Demis (March 11, 2016). Artificial Intelligence and the Future (Speech).
- ^ Shibata, Katsunari (January 14, 2011). "Chapter 6: Emergence of Intelligence through Reinforcement Learning with a Neural Network". In Mellouk, Abdelhamid (ed.). Advances in Reinforcement Learning. Intech. pp. 99–120. ISBN 978-953-307-369-9.
- ^ a b Shibata, Katsunari (March 7, 2017). "Functions that Emerge through End-to-End Reinforcement Learning". arXiv:1703.02239.
- ^ a b Mnih, Volodymyr; et al. (December 2013). Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013.
- ^ a b c Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. doi:10.1038/nature14236.
- ^ a b c Silver, David; et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature. 529 (7587): 484–489. doi:10.1038/nature16961.
- ^ a b Sutton, Richard S.; Barto, Andrew G. (1998). Reinforcement Learning: An Introduction. MIT Press. ISBN 978-0262193986.
- ^ Lin, Long-Ji; Mitchell, Tom M. (1993). Reinforcement Learning with Hidden States. From Animals to Animats. Vol. 2. pp. 271–280.
- ^ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3). doi:10.1145/203330.203343.
- ^ Shibata, Katsunari; Okabe, Yoichi (1997). Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs (PDF). International Conference on Neural Networks (ICNN) 1997.
- ^ Shibata, Katsunari; et al. (1995). Active Perception Based on Reinforcement Learning (PDF). World Congress on Neural Networks (WCNN) 1995.
- ^ Shibata, Katsunari; Iida, Masaru (2003). Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning (PDF). SICE Annual Conference 2003.
- ^ a b Shibata, Katsunari; Kawano, Tomohiko (2008). Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network (PDF). International Conference on Neural Information Processing (ICONIP) '08.
- ^ Shibata, Katsunari; et al. (2001). Actor-Q Based Active Perception Learning System (PDF). International Conference on Robotics and Automation (ICRA) 2001.
- ^ Shibata, Katsunari (2005). Discretization of Series of Communication Signals in Noisy Environment by Reinforcement Learning (PDF). The 7th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA).
- ^ Shibata, Katsunari; Goto, Kenta (2013). Emergence of Flexible Prediction-Based Discrete Decision Making and Continuous Motion Generation through Actor-Q-Learning (PDF). The 3rd Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-Epirob).