User:Datakeeper/valuabledatasets

PAGE TITLE: List of datasets for machine learning research.

This is a list of noteworthy, high-quality datasets for machine learning research. The list is not exhaustive.

Datasets are an integral part of the field of machine learning. Major advances in the field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Datasets for unsupervised learning do not need to be labeled, but high-quality ones can also be difficult and costly to produce.^[1]^[2]^[3]^[4]^[5]

Image datasets

Facial recognition

Name	Brief Description	Instances	Download size (GB)	Format	Default Task	Preprocessing	Created (updated)	Source
SCFace	Color images of faces at various angles	4160	8.5	.jpg	classification, facial recognition	Locations of facial features extracted; coordinates of features given	2011	University of Zagreb

Object detection

Aerial images

Other images

Text datasets

Reviews

Name	Brief Description	Language	Instances	Download size (GB)	Default Task	Preprocessing	Created (updated)	Source
Amazon commerce reviews	Reviews from Amazon.com commerce	English	1500	0.0021	classification	Full text not given; features include words used, punctuation, length, etc.	2011	UCI Machine Learning

News articles

Messages

Other text

Sound datasets

Speech

Name	Brief Description	Language	Instances	Download size (GB)	Format	Default Task	Preprocessing	Created (updated)	Source
Spoken Arabic Digits	Spoken Arabic digits from 44 male and 44 female speakers	Arabic	8800	0.036	.txt	classification	Time series of Mel-frequency cepstrum coefficients	2010	UCI Machine Learning

Mechanical

Animal

Other sounds

Signal datasets

Medical

Name	Brief Description	Instances	Download size (GB)	Default Task	Preprocessing	Missing values?	Created (updated)	Source
EEG Database Data Set	Study to examine EEG correlates of genetic predisposition to alcoholism	8800	0.7	classification	Measurements from 64 electrodes placed on the scalp, sampled at 256 Hz (3.9 ms epoch) for 1 second	Yes	1999	UCI Machine Learning
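
The sampling figures in the EEG row can be checked with a little arithmetic. The sketch below assumes that the quoted 3.9 ms "epoch" is the interval between consecutive samples at 256 Hz; the variable names are illustrative and do not come from the dataset documentation.

```python
# Sampling parameters as quoted in the EEG Database Data Set row.
SAMPLING_RATE_HZ = 256
N_ELECTRODES = 64
DURATION_S = 1

# Assumption: the "epoch" is the time between consecutive samples.
sample_interval_ms = 1000 / SAMPLING_RATE_HZ

# Samples recorded per electrode, and values across all electrodes,
# in one 1-second recording.
samples_per_electrode = SAMPLING_RATE_HZ * DURATION_S
values_per_recording = samples_per_electrode * N_ELECTRODES

print(f"{sample_interval_ms:.2f} ms between samples")
print(f"{samples_per_electrode} samples per electrode")
print(f"{values_per_recording} values per recording")
```

Under that assumption, the interval comes to about 3.9 ms, which matches the figure given in the table.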

Electrical

Other signals

Other datasets

References

[1] Wissner-Gross, A. "Datasets Over Algorithms". Edge.com. Retrieved 8 January 2016.
[2] Weiss, Gary M., and Foster Provost. "Learning when training data are costly: the effect of class distribution on tree induction." Journal of Artificial Intelligence Research (2003): 315-354.
[3] Turney, Peter. "Types of cost in inductive concept learning." (2000).
[4] Abney, Steven. Semisupervised learning for computational linguistics. CRC Press, 2007.
[5] Žliobaitė, Indrė, et al. "Active learning with evolving streaming data." Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2011. 597-612.
