User:Datakeeper/valuabledatasets
PAGE TITLE: List of datasets for machine learning research.
This is a list of noteworthy datasets for machine learning research. It is not exhaustive and is limited to high-quality datasets.
Datasets are an integral part of the field of machine learning. Major advances in the field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Datasets for unsupervised learning do not need to be labeled, but high-quality ones can also be difficult and costly to produce.[1][2][3][4][5]
Image datasets
Facial recognition
Name | Brief Description | Instances | Download size (GB) | Format | Default Task | Preprocessing | Created (updated) | Source |
---|---|---|---|---|---|---|---|---|
SCFace | Color images of faces at various angles | 4160 | 8.5 | .jpg | classification, facial recognition | Location of facial features extracted. Coordinates of features given | 2011 | University of Zagreb |
Object detection
Aerial Images
Other Images
Text datasets
Reviews
Name | Brief Description | Language | Instances | Download size (GB) | Default Task | Preprocessing | Created (updated) | Source |
---|---|---|---|---|---|---|---|---|
Amazon commerce reviews | Reviews from Amazon.com commerce | English | 1500 | 0.0021 | classification | Full text not given, features include words used, punctuation, length, etc. | 2011 | UCI Machine Learning |
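The preprocessing entry for the Amazon commerce reviews describes features such as words used, punctuation, and length rather than full text. The following is a minimal sketch of what features of that kind could look like. The function name `review_features`, the sample sentence, and the specific features chosen are illustrative assumptions, not the dataset's actual extraction pipeline.

```python
import string
from collections import Counter


def review_features(text: str) -> dict:
    """Hypothetical extractor for word, punctuation, and length features."""
    # Lowercase, split on whitespace, and trim punctuation from each token.
    tokens = [w.strip(string.punctuation) for w in text.lower().split()]
    tokens = [w for w in tokens if w]
    return {
        "n_chars": len(text),
        "n_words": len(tokens),
        "n_punctuation": sum(ch in string.punctuation for ch in text),
        "top_words": Counter(tokens).most_common(3),
    }


# Sample sentence invented for illustration only.
features = review_features("Great phone. Great battery, great price!")
print(features)
```

A classifier trained on such a table sees only these derived columns, which is why the original review text cannot be recovered from the dataset.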
News articles
Messages
Other text
Sound datasets
Speech
Name | Brief Description | Language | Instances | Download size (GB) | Format | Default Task | Preprocessing | Created (updated) | Source |
---|---|---|---|---|---|---|---|---|---|
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers | Arabic | 8800 | 0.036 | .txt | classification | Time series of Mel-frequency cepstrum coefficients | 2010 | UCI Machine Learning |
Mechanical
Animal
Other sounds
Signal datasets
Medical
Name | Brief Description | Instances | Download size (GB) | Default Task | Preprocessing | Missing values? | Created (updated) | Source |
---|---|---|---|---|---|---|---|---|
EEG Database Data Set | Study to examine EEG correlates of genetic predisposition to alcoholism | 8800 | 0.7 | classification | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second | Yes | 1999 | UCI Machine Learning |
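The sampling figures in the EEG entry can be checked with a short calculation. This sketch uses only the numbers stated in the table (64 electrodes, 256 Hz, 1 second). The variable names are illustrative, and the layout of the actual files is not assumed.

```python
# Figures taken from the table entry; names are illustrative.
SAMPLING_RATE_HZ = 256
N_ELECTRODES = 64
TRIAL_SECONDS = 1

# Time between consecutive samples, in milliseconds (the "epoch" in the table).
sample_interval_ms = 1000 / SAMPLING_RATE_HZ

# Number of readings per electrode and per full trial.
samples_per_electrode = SAMPLING_RATE_HZ * TRIAL_SECONDS
values_per_trial = N_ELECTRODES * samples_per_electrode

print(f"{sample_interval_ms:.2f} ms between samples")
print(f"{samples_per_electrode} samples per electrode, {values_per_trial} values per trial")
```

The interval of about 3.9 ms matches the epoch quoted in the table, and each one-second trial yields 64 × 256 readings.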
Electrical
Other signals
Other datasets
References
1. Wissner-Gross, A. "Datasets Over Algorithms". Edge.com. Retrieved 8 January 2016.
2. Weiss, Gary M., and Foster Provost. "Learning when training data are costly: the effect of class distribution on tree induction." Journal of Artificial Intelligence Research (2003): 315-354.
3. Turney, Peter. "Types of cost in inductive concept learning." (2000).
4. Abney, Steven. Semisupervised learning for computational linguistics. CRC Press, 2007.
5. Žliobaitė, Indrė, et al. "Active learning with evolving streaming data." Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2011. 597-612.