Deep learning
Deep learning refers to a sub-field of machine learning that is based on learning several levels of representations, corresponding to a hierarchy of features or factors or concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.[1]
Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to learn them.
Fundamental concepts
Deep learning algorithms are based on distributed representations, a notion introduced with connectionism in the 1980s. The underlying assumption behind distributed representations is that the observed data were generated by the interactions of many factors (not all known to the observer), and that what is learned about a particular factor from some configurations of the other factors can often generalize to other, unseen configurations of the factors. Deep learning adds the assumption (seen as a prior about the unknown, data-generating process) that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition: higher-level representations are obtained by transforming or generating lower-level representations. The relationships between these factors can be viewed as similar to the relationships between entries in a dictionary or in Wikipedia, although these factors can be numerical (e.g., the position of the face in the image) or categorical (e.g., is it a human face?), whereas entries in a dictionary are purely symbolic. The appropriate number of levels and the structure that relates these factors are things that a deep learning algorithm is also expected to discover from examples.
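The idea of stacked levels of representation can be made concrete with a few lines of code. The following Python sketch is purely illustrative: the layer widths, the sigmoid nonlinearity, and the random weights are assumptions chosen for the example, not part of any particular published algorithm. It shows how each level of a hierarchy is computed by transforming the level below it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative layer widths: raw pixels -> mid-level -> high-level factors.
sizes = [784, 256, 64]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def representations(x, weights):
    """Return the representation at every level of the hierarchy.

    Each level is a transformation of the one below it, mirroring the
    assumption that higher-level factors are obtained by transforming
    lower-level ones."""
    levels = [x]
    for W in weights:
        levels.append(sigmoid(levels[-1] @ W))
    return levels

x = rng.random(784)  # a stand-in "image" represented as a vector of pixels
print([level.shape for level in representations(x, weights)])
# [(784,), (256,), (64,)]
```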
Deep learning algorithms often involve other important ideas that correspond to broad a priori beliefs about these unknown underlying factors. An important prior regarding a supervised learning task of interest (e.g., given an input image, predicting the presence of a face and the identity of the person) is that only some of the factors that explain the variation observed in the inputs (e.g., images) are relevant to that prediction task. This is a special case of the semi-supervised learning setup, which allows a learner to exploit large quantities of unlabeled data (e.g., images for which the presence of a face and the identity of the person, if any, are not known).
Many deep learning algorithms are actually framed as unsupervised learning, e.g., using many examples of natural images to discover good representations of them. Because these algorithms can be applied to unlabeled data, they can leverage large amounts of data, even when no labels are available for the tasks of immediate interest.
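As a deliberately simplified illustration of this semi-supervised setup, the Python sketch below learns a representation from a large unlabeled pool and then fits a predictor on a small labeled set in that representation space. The principal-component projection here merely stands in for whatever unsupervised representation learner is actually used; the data, sizes, and the learner itself are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled data is plentiful; labeled data is scarce.
X_unlabeled = rng.normal(size=(10_000, 50))
X_labeled = rng.normal(size=(100, 50))
y_labeled = rng.integers(0, 2, size=100).astype(float)

# Step 1: learn a representation from unlabeled data alone
# (here: a 10-dimensional principal-component projection).
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
def project(X):
    return (X - mean) @ Vt[:10].T

# Step 2: fit a simple supervised predictor on the few labeled
# examples, but in the learned representation space.
Z = np.c_[project(X_labeled), np.ones(len(X_labeled))]
w, *_ = np.linalg.lstsq(Z, y_labeled, rcond=None)

def predict(X):
    return (np.c_[project(X), np.ones(len(X))] @ w) > 0.5
```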
Deep Learning in Artificial Neural Networks
Some of the most successful deep learning methods involve artificial neural networks. Deep learning neural networks date back at least to the 1980 Neocognitron of Kunihiko Fukushima[2]. It was inspired by the 1959 biological model proposed by Nobel laureates David H. Hubel and Torsten Wiesel, who found two types of cells in the primary visual cortex: simple cells and complex cells. Many artificial neural networks can be viewed as cascading models[3] of cell types inspired by these biological observations.
With the advent of the back-propagation algorithm, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter's diploma thesis of 1991[4][5] formally identified the reason for this failure as the "vanishing gradient problem," which affects not only many-layered feedforward networks but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers.
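The effect is easy to reproduce numerically. The sketch below is a minimal Python illustration in which the depth, width, sigmoid nonlinearity, and weight scale are all assumptions chosen to make the effect visible; it multiplies the layer-by-layer Jacobians that back-propagation would traverse and shows their norm collapsing with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 50, 64
h = rng.random(width)
J = np.eye(width)  # Jacobian of the current layer w.r.t. the input
norms = []

for _ in range(depth):
    W = rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
    h = sigmoid(h @ W)
    # Back-propagation multiplies by diag(sigmoid'(a)) @ W.T per layer;
    # sigmoid' is at most 0.25, so the product keeps shrinking.
    J = ((h * (1 - h))[:, None] * W.T) @ J
    norms.append(np.linalg.norm(J))

print(norms[::10])  # roughly exponential decay with depth
```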
To overcome this problem, several methods were proposed. One is the long short-term memory (LSTM) network of 1997 by Hochreiter and Schmidhuber[6]. In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned[7][8].
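The core of an LSTM is a memory cell whose state is updated additively, so errors carried back through time are not repeatedly squashed. The Python sketch below shows one step of a gated cell in the now-common formulation with a forget gate (a later refinement; the original 1997 design differed in details), with illustrative sizes and random parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a gated LSTM cell (common modern formulation)."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)                    # forget gate
    i = sigmoid(Wi @ z + bi)                    # input gate
    o = sigmoid(Wo @ z + bo)                    # output gate
    c = f * c_prev + i * np.tanh(Wc @ z + bc)   # additive cell update
    h = o * np.tanh(c)
    # The additive update of c is what lets error flow across many
    # time steps without shrinking exponentially.
    return h, c

# Illustrative sizes (assumptions): 8-dim input, 16-dim hidden state.
rng = np.random.default_rng(0)
nx, nh = 8, 16
params = [rng.normal(0.0, 0.1, (nh, nx + nh)) for _ in range(4)] + \
         [np.zeros(nh) for _ in range(4)]
h = c = np.zeros(nh)
for t in range(5):
    h, c = lstm_step(rng.random(nx), h, c, params)
```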
Other methods use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors; the network is then trained further by back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary latent variables, although real-valued variables may also be used. The approach of Hinton et al. uses a restricted Boltzmann machine (Smolensky, 1986[9]) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.[10] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.[11]
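A minimal sketch of the layer-wise building block, one step of the contrastive-divergence (CD-1) approximation for training a binary restricted Boltzmann machine, is given below; the sizes, learning rate, and training loop are illustrative assumptions. To build a deep model, the hidden activations produced by one trained RBM become the "data" for the next:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 64, 32, 0.1   # illustrative sizes
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step for a binary RBM."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # Approximate gradient ascent on the log-likelihood.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)

for _ in range(100):
    cd1_update((rng.random(n_visible) < 0.5).astype(float))
```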
A Google team led by Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, solely from watching unlabeled images taken from YouTube videos.[12][13]
Other methods rely on the sheer processing power of modern computers, in particular GPUs. In 2010, Dan Ciresan and colleagues[14] at the Swiss AI Lab IDSIA showed that, despite the above-mentioned "vanishing gradient problem," the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers. The method outperformed all other machine learning techniques on the well-known MNIST handwritten digits benchmark of Yann LeCun and colleagues at NYU.
As of 2012, the state of the art in deep learning feedforward networks alternates convolutional layers and max-pooling layers, topped by several pure classification layers. Since 2011, GPU-based implementations of this approach have won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition[15], the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks challenge[16], and others.
Such supervised deep learning methods were also the first artificial pattern recognizers to achieve human-competitive performance on certain tasks[17].
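To make the alternation of convolution and max-pooling concrete, the Python sketch below implements a naive convolution and pooling pair and stacks them twice. The kernel sizes, depth, and rectifying nonlinearity are illustrative assumptions, and a real GPU implementation is far more optimized:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution (strictly, cross-correlation, as
    is conventional in neural networks)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    H, W = x.shape
    H, W = H - H % size, W - W % size
    return x[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# Alternate convolution and max-pooling, as in the architectures above.
x = rng.random((32, 32))
for _ in range(2):
    x = np.maximum(conv2d(x, rng.normal(0.0, 0.1, (5, 5))), 0.0)
    x = max_pool(x)
print(x.shape)  # spatial resolution shrinks at each stage: (5, 5)
```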
References
- ^ Bengio, Y. (2009). Learning Deep Architectures for AI (PDF). Now Publishers.
- ^ K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193–202, 1980.
- ^ M. Riesenhuber, T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
- ^ S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber
- ^ S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
- ^ Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997
- ^ Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552
- ^ A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
- ^ Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory". In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. pp. 194–281.
- ^ Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
- ^ http://www.scholarpedia.org/article/Deep_belief_networks
- ^ John Markoff (2012). "How Many Computers to Identify a Cat? 16,000". New York Times.
- ^ Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G.; Dean, J.; Ng, A. Y. (2012). "Building High-level Features Using Large Scale Unsupervised Learning". Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
- ^ D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207–3220, 2010.
- ^ D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.
- ^ D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.
- ^ D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.
External links
- Deep learning
- Video on Recent Developments in Deep Learning, by Geoff Hinton