From Wikipedia, the free encyclopedia
The following tables compare some of the datasets that can be used in machine learning for training and testing.
Image datasets by name
General image datasets
Dataset | Creator | Free | License[a] | Description | Number of examples (training + test) | Number of categories | Number of annotations | Size (MB) | Web page
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Caltech 101 | Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology | Yes | ? | Pictures of objects | 9,146 | 101 | — | 131 | [1]
Caltech 256 | ? | Yes | ? | Pictures of objects | 30,607 | 256 | — | 1,128 | [2]
ImageNet | Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei, Stanford University | Yes | Varied (image URLs) | Images for WordNet nouns | 14,197,122 | 1,000 | 25 | ? | [3]
LabelMe | MIT Computer Science and Artificial Intelligence Laboratory | Yes | ? | Pictures of scenes | 187,240 | — | 658,992 | ? | [4]
MNIST database | ? | Yes | ? | Handwritten digits | 60,000 | 10 | — | 11 | [5]
MSCOCO (Common Objects in Context) | Tsung-Yi Lin et al., Microsoft Research | Yes | Creative Commons for image annotations | Images with multiple objects | 325,000 | 2,500,000 | 5 captions per image (325k × 5 = 1.63M) | ? | [6]
Overhead Imagery Research Data Set | ? | Yes | ? | Overhead images | 908 | — | 1,800 (approx.) | 161 | [7]
Facial image datasets
Sound datasets by name
Dataset | Creator | Free | License[a] | Description | Number of examples (training + test) | Size (MB) | Web page
--- | --- | --- | --- | --- | --- | --- | ---
TIMIT | John Garofolo, Lori Lamel, William Fisher, Jonathan Fiscus, David Pallett, Nancy Dahlgren, Victor Zue | No | LDC User Agreement for Non-Members | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences | 6,300 | ? | [8]
[a] Licenses listed here are summaries and should not be taken as complete statements of the licenses.
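Two of the counts in the tables are derived by simple multiplication: the MSCOCO caption total (five captions for each of 325,000 images) and the TIMIT recording total (ten sentences for each of 630 speakers). The following is a minimal sketch that checks that arithmetic. The constant and function names are illustrative assumptions and are not part of any dataset's tooling.

```python
# Figures copied from the tables; all names here are hypothetical.
COCO_IMAGES = 325_000
CAPTIONS_PER_IMAGE = 5
TIMIT_SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10


def total_items(groups: int, items_per_group: int) -> int:
    """Return the total when every group contributes the same number of items."""
    return groups * items_per_group


coco_captions = total_items(COCO_IMAGES, CAPTIONS_PER_IMAGE)
timit_recordings = total_items(TIMIT_SPEAKERS, SENTENCES_PER_SPEAKER)

# The ":," format specifier inserts thousands separators.
print(f"MSCOCO captions:  {coco_captions:,}")
print(f"TIMIT recordings: {timit_recordings:,}")
```

The exact MSCOCO product is 1,625,000, which the table reports as roughly 1.63M; the TIMIT product matches the 6,300 examples listed.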