User:Iloverobotics/sandbox
Unsupervised learning is the field of machine learning that aims to represent the structure underlying unlabeled data. Input data are not provided with labels or correct outputs (unlike in supervised learning), and the actions of the learning algorithm are not associated with rewards or punishments (unlike in reinforcement learning). The only input that unsupervised learning algorithms have to work with are data points , which in the simplest case are independently and identically drawn from some distribution . Unsupervised learning studies how algorithms can model and learn to represent the statistical structure of the inputs[1].
Goals
[edit]Unsupervised learning algorithms can be used for applications such as outlier detection, classification, and dimensionality reduction [1]. For example, with high dimensional data sets, it might make sense to check if the input points are clustered in groups within the space of all possible inputs. If this is true the input data can be characterized using fewer dimensions, leading to a simplification of the problem and possibly to insights about its underlying structure[2].
Unsupervised learning techniques are widely used to reduce the dimensionality of high dimensional genomic data sets that may involve hundreds of thousands of variables. For example, weighted correlation network analysis is often used for identifying clusters (referred to as modules), modeling the relationship between clusters, calculating fuzzy measures of cluster (module) membership, identifying intramodular hubs, and for studying cluster preservation in other data sets[3].
History
[edit]Research on unsupervised learning dates back to the late ‘40s[2]. Early contributors include Donald Hebb, who linked statistical methods to neurophysiological experiments on plasticity (the Hebb rule) [4], and Donald MacKay, who adopted a cybernetic-theoretic approach [5]. In 1970, David Marr made an early unsupervised learning postulate about the goal of learning in his model of the neocortex [6]. More recently, Geoffrey Hinton and Terrence Sejnowski invented a learning model called the Boltzmann machine[7], which imported many of the statistics concepts that are now used in density estimation methods[8]. Horace Barlow investigated the ways in which the brain makes use of the continuous flow of sensory information without any associated rewards or punishments[9].
Unsupervised learning is closely related to the problem of density estimation in statistics [10]. This approach to unsupervised learning has a rich history [11]. Perhaps the earliest nonparametric estimator of a univariate density was the histogram. Further breakthroughs included the kernel, orthogonal series, and nearest-neighbor methods. Later, methods such as penalized likelihood, polynomial spline, variable kernel, sieves and projection pursuit were introduced [11].
Clustering methods represent an important area of unsupervised learning. One of the oldest clustering algorithms, k-means, was first proposed by Stuart Lloyd in 1957 [12]. Later on, many researchers investigated theoretical and algorithmic aspects and modifications of the method [13].
Unsupervised learning is currently a very active research area within the field of machine learning, and many advanced techniques have been developed to deal with a wide range of unlabeled datasets[10].
Data Types and Models
[edit]Different types of unlabeled data can be used as inputs to an unsupervised learning algorithm. According to the kind of input, different learning models can be chosen in order to extract the underlying structure of the data[1].
Models for Unstructured Input Data
[edit]When the input data are unstructured, the order in which the input observations arrive is irrelevant, and the algorithm will process the whole dataset at the same time (batch processing) [1].
If the input data are high dimensional, it is usually a good choice to perform dimensionality reduction. This can be achieved by using models that are defined in terms of latent or hidden variables. These models include, in order of increasing complexity:
- Factor Analysis (FA), where the input data are modeled as the combination of the elements (the hidden factors) of a lower dimensional multivariate Gaussian vector, plus Gaussian noise [1].
- Principal Component Analysis (PCA), a limiting case of Factor Analysis where the noise is isotropic [1].
- Independent Component Analysis (ICA), which extends Factor Analysis to the case where the factors are non-Gaussian [1].
Another widespread approach for the analysis of unstructured data is represented by cluster analysis. Clustering models are different from latent variable models, because they aim to divide the input data into groups (clusters). Clusters are defined as collections of data points that are “similar” (in some way) between them and are “dissimilar” to the objects belonging to other clusters [14]. The main clustering algorithms are:
- Basic Sequential Algorithmic Scheme (BSAS), a simple clustering algorithm based on a distance function between a point and a cluster.
- K-means, which assigns a point to the cluster with the nearest mean.
- Fuzzy C-Means, a method allowing one data point to belong to two or more clusters.
- Hierarchical Clustering Analysis (HCA) seeks to build a hierarchy of clusters, either combining small clusters (agglomerative) or splitting big clusters (divisive).
- Spectral Clustering, using the eigenvalues of the similarity matrix to perform dimensionality reduction before clustering in fewer dimensions.
- Mixture of Gaussians, where each cluster is modeled as a Gaussian distribution with particular mean and covariance.
- Density-based clustering, defining clusters as areas of higher density with respect to the rest of the data set.
- Non-negative Matrix Factorization (NMF), a form of matrix factorization that has been applied to text mining and document clustering.
Models for Sequential Input Data (Time Series)
[edit]When the input data are a time series, the data points are presented in a sequential order [1]. In this case, the previous assumption of iid observations does not hold anymore. Since new data arrive in a sequence, new observations are correlated to the old ones. The most complete way to model this situation would be to find the probability of the latest data point given the history of all the previous ones. This would mean, though, that the complexity of the model grows with the number of considered data points. In order to limit this complexity, we usually limit the number of past observations that are used in the model. The problem can be further simplified by applying dimensionality reduction techniques based on latent variables.
The models most frequently used to model sequential input data include, in order of increasing complexity:
- Markov Models, which model the direct relationship between the output and a limited window of past observations.
- Hidden Markov Models (HMMs), making use of discrete hidden variables to simplify and improve the predictions.
- State-Space Models (SSMs), where the hidden variables (called state variables) are continuous.
Models for Input Data with an Unknown Structure
[edit]The relation between input variables might be an unknown itself [1]. Graphical models are usually employed as tools to represent the dependencies between random variables in a probabilistic model. Different graph structures have different ways of representing conditional independence between variables.
The three main kinds of probabilistic graph models are:
- Undirected Graphs, allowing to easily represent conditional independence relationships between random variables.
- Factor Graphs, used to implement message passing algorithms for inference [1].
- Directed graphs, which emphasize statistical dependence relationships between random variables.
Advanced Models
[edit]Research in unsupervised learning has lead to advanced models which can overcome the limitations of the methods presented so far. These advanced models include:
- Mixture of factor analysers, which combine factor analysis and Gaussian mixtures
- Nonlinear state-space models, which overcome the linearity and Gaussian noise assumptions of SSMs
- Factorial HMMs, which allow a vector of discrete state variables
- Neural network models
Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed by Carpenter and Grossberg (1988).
Structure of the Unsupervised Learning Problem: a Bayesian Approach
[edit]Introduction
[edit]As shown in the previous sections, a number of algorithms have been proposed to solve unsupervised learning problems. These algorithms make use of a wide range of approaches, from clustering to neural networks to latent variables models.
Models that make use of latent variables can be analyzed adopting a Bayesian approach. In this section, we will focus on this particular approach as an example of framework for unsupervised learning. It must be pointed out, though, that many unsupervised learning algorithms do not have straightforward Bayesian interpretations. For details about specific algorithms, refer to the links in the list above or to the Related Articles and Further Reading sections below.
Bayesian approach to unsupervised learning
[edit]The Bayes rule can be used to motivate a coherent statistical framework for machine learning[1]. Assume a set of models . The algorithm starts with some prior beliefs over models , such that . A model is defined as a probability distribution over data points, i.e. . For simplicity, we also assume that the data are independently and identically distributed (iid). Given an input data set , the belief over models is given by
This means that the posterior over models is the prior multiplied by the likelihood (and normalized). The predictive distribution over new data is
Often models are defined by writing down a parametric probability distribution. For example, we assume that model is defined by the unknown vector of parameters . The prior over these model parameters is
Integrating over the parameters of a particular model, it is also possible to infer the posterior over the parameters of the model and to compute the predictive distribution as
Approximate forms of Bayesian Learning
[edit]Since it may be complicated to represent the entire posterior distribution over parameters, we choose to find a point-estimate of the parameters [1]. A natural choice is to pick the most probable parameter value given the data, which is known as the maximum a posterior (MAP) parameter estimate:
Another natural choice is the maximum likelihood (ML) parameter estimate
Using ML estimation, however, can lead to overfitting. This is due to the fact that models with more parameters will in general lead to higher values for the maximum likelihood. Overfitting can be avoided by maximizing a regularized likelihood. The ML parameters of a model with latent variables (see next section) can be estimated using expectation-maximization (EM) algorithms.
Related Articles
[edit]- Factor Analysis
- Principal Component Analysis
- Independent Component Analysis
- Mixture of Gaussians
- K-means
- EM Algorithm
- Markov model
- Hidden Markov Model
- State-space representation
- Graph (mathematics)
- Cluster Analysis
- Data Mining
- Dimensionality reduction
- Supervised learning
- Reinforcement Learning
- Self-organizing map
- Adaptive resonance theory
Further Reading
[edit]- Hastie, Trevor, et al. The elements of statistical learning. Vol. 2. No. 1. New York: Springer, 2009.
- Alpaydin, Ethem. Introduction to machine learning. MIT press, 2004.
External Links
[edit]- 9.520 class at MIT: Statistical Learning Theory and Applications[15]
- class on Coursera: Machine Learning[16]
Notes and References
[edit]- ^ a b c d e f g h i j k l Ghahramani, Zoubin. "Unsupervised Learning" (PDF). Retrieved 2014-12-09.
- ^ a b Dayan, Peter. "Unsupervised Learning" (PDF).
- ^ Scherf, Matthias (2000). "Highly Specific Localization of Promoter Regions in Large Genomic Sequences by PromoterInspector: A Novel Context Analysis Approach" (PDF). Journal of molecular biology. 297 (3): 599-606.
- ^ Hebb, Donald (1949). The Organization of Behavior. Wiley.
- ^ MacKay, Donald (1956). Automata Studies. Princeton, NJ: Princeton University Press.
- ^ Marr, David (1970). "A theory for cerebral neocortex". Proceedings of the Royal Society of London. 176: 161-234.
- ^ Hinton, Geoffrey (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.
- ^ Grenander, U (1976). Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and Regular Structures. Berlin: Springer-Verlag.
- ^ Barlow, Horace (1989). "Unsupervised Learning". Neural Computation. 1: 295-311.
- ^ a b Bishop, Christopher (2006). Pattern Recognition and Machine Learning. New York: Springer.
- ^ a b Izenman, Alan Julian (1991). "Recent Developments in Nonparametric Density Estimation". Journal of the American Statistical Association. 86 (413): 205-224.
- ^ Lloyd, Stuart (1957). "Least square quantization in PCM". Bell Telephone Laboratories Paper.
- ^ Bock, Hans-Hermann (2007). Clustering Methods: A History of k-Means Algorithms. Springer.
- ^ Matteucci, Matteo. "A Tutorial on Clustering Algorithms". Retrieved 20 December 2014.
- ^ "9.520: Statistical Learning Theory and Applications".
- ^ "Coursera: Machine Learning".