Datasets for Deep Learning

Mo Data stashed this in Data Sources

http://deeplearning.net/datasets/

Datasets

These datasets can be used for benchmarking deep learning algorithms:

Symbolic Music Datasets

Piano-midi.de: classical piano pieces (http://www.piano-midi.de/)
Nottingham: over 1000 folk tunes (http://abc.sourceforge.net/NMD/)
MuseData: electronic library of classical music scores (http://musedata.stanford.edu/)
JSB Chorales: set of four-part harmonized chorales (http://www.jsbchorales.net/index.shtml)

Natural Images

MNIST: handwritten digits (http://yann.lecun.com/exdb/mnist/)
NIST: similar to MNIST, but larger
Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories (http://www.cs.utoronto.ca/~kriz/cifar.html)
Caltech 101: pictures of objects belonging to 101 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
Caltech 256: pictures of objects belonging to 256 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech256/)
Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications (http://www.stanford.edu/~acoates//stl10/)
The Street View House Numbers (SVHN) Dataset - http://ufldl.stanford.edu/housenumbers/
NORB: binocular images of toy figurines under various illumination and pose (http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/)
ImageNet: image database organized according to the WordNet hierarchy (http://www.image-net.org/)
Pascal VOC: various object recognition challenges (http://pascallin.ecs.soton.ac.uk/challenges/VOC/)
LabelMe: a large dataset of annotated images (http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php)
COIL 20: different objects imaged at every angle in a 360-degree rotation (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)
COIL 100: different objects imaged at every angle in a 360-degree rotation (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)
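
The image sizes quoted above determine the input dimensionality a model sees when each image is unrolled into a single vector. The sketch below works through that arithmetic; the helper name `flattened_size` is hypothetical, and the 3 color channels used for CIFAR are an assumption that is not stated in the list itself.

```python
def flattened_size(height, width, channels=1):
    """Number of input values when an image is unrolled into one vector."""
    return height * width * channels


# 28x28 binary images with a single channel (Caltech Silhouettes).
silhouette_inputs = flattened_size(28, 28)

# 32x32 images (CIFAR10 / CIFAR100); 3 color channels is an assumption.
cifar_inputs = flattened_size(32, 32, channels=3)

print("Caltech Silhouettes inputs per image:", silhouette_inputs)
print("CIFAR inputs per image:", cifar_inputs)
```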

Artificial Datasets

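Arcade Universe (https://github.com/caglar/Arcade-Universe) - An artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. This generator is based on O. Breuleux’s bugland dataset generator (https://github.com/breuleux/bugland).
A collection of datasets inspired by the ideas from BabyAISchool (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAISchool):
- BabyAIShapesDatasets: distinguishing between 3 simple shapes (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIShapesDatasets)
- BabyAIImageAndQuestionDatasets: a question-image-answer dataset (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIImageAndQuestionDatasets)
Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007, http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007):
- MnistVariations: introducing controlled variations in MNIST (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations)
- RectanglesData: discriminating between wide and tall rectangles (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData)
- ConvexNonConvex: discriminating between convex and nonconvex shapes (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/ConvexNonConvex)
- BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BackgroundCorrelation)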

Faces

Labeled Faces in the Wild: 13,000 images of faces collected from the web, labeled with the name of the person pictured (http://vis-www.cs.umass.edu/lfw/)
Toronto Face Dataset
Olivetti: a few images of several different people (http://www.cs.nyu.edu/~roweis/data.html)
Multi-Pie: The CMU Multi-PIE Face Database (http://www.multipie.org/)
Face-in-Action (http://www.flintbox.com/public/project/5486/)
JACFEE: Japanese and Caucasian Facial Expressions of Emotion (http://www.humintell.com/jacfee/)
FERET: The Facial Recognition Technology Database (http://www.itl.nist.gov/iad/humanid/feret/feret_master.html)
mmifacedb: MMI Facial Expression Database (http://www.mmifacedb.com/)
Indian Face Database (http://vis-www.cs.umass.edu/~vidit/IndianFaceDatabase/)
The Yale Face Database (http://vision.ucsd.edu/content/yale-face-database) and The Yale Face Database B (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html)

Text

20 newsgroups: classification task, mapping word occurrences to newsgroup ID (http://qwone.com/~jason/20Newsgroups/)
Reuters (RCV*) Corpora: text/topic prediction (http://about.reuters.com/researchandstandards/corpus/)
Penn Treebank: used for next word prediction or next character prediction (http://www.cis.upenn.edu/~treebank/)
Broadcast News: large text dataset, classically used for next word prediction (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S44)
Wikipedia Dataset
Multidomain sentiment analysis dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
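
Two of the task descriptions above can be made concrete with a small sketch: "mapping word occurrences" to a label usually starts from a bag-of-words count vector, and "next word prediction" starts from (context, target) pairs cut out of a token stream. The function names, the whitespace tokenizer, and the toy vocabulary below are all hypothetical choices for illustration, not part of any of the listed corpora or their official tooling.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text.

    Tokenization is a plain lowercase whitespace split, an assumption
    made for brevity.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def next_word_pairs(tokens, context_size=2):
    """Build (context, target) pairs for next-word prediction."""
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]


vocabulary = ["the", "cat", "bird"]  # toy vocabulary, hypothetical
features = bag_of_words("The cat saw the dog", vocabulary)
pairs = next_word_pairs(["a", "b", "c", "d"], context_size=2)
print(features)
print(pairs)
```

A classifier for the newsgroup task would take vectors like `features` as input, while a language model would be trained to predict each target from its context in `pairs`.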

Speech

TIMIT Speech Corpus: phoneme classification (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1)
Aurora: TIMIT with noise and additional information

Recommendation Systems

MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users (http://www.ieor.berkeley.edu/~goldberg/jester-data/).
Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies (http://www.netflixprize.com/).
Book-Crossing dataset: This dataset is from the Book-Crossing community and contains 1,149,780 ratings of 271,379 books by 278,858 users (http://www.informatik.uni-freiburg.de/~cziegler/BX/).
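
The counts quoted above make it easy to compare how sparse each rating matrix is. The sketch below computes the fraction of user-item pairs that actually carry a rating; the figures are copied from the entries above (the second MovieLens set is rounded to 1 million, as in the text), and the helper name `density` is hypothetical.

```python
# (ratings, users, items), copied from the list entries above.
DATASETS = {
    "MovieLens (first set)": (100_000, 943, 1682),
    "MovieLens (second set)": (1_000_000, 6040, 3900),  # "about 1 million"
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}


def density(ratings, users, items):
    """Fraction of the users x items matrix that holds a rating."""
    return ratings / (users * items)


for name, (ratings, users, items) in DATASETS.items():
    print(f"{name}: {density(ratings, users, items):.4%} of user-item pairs rated")
```

Under these figures Jester is by far the densest (more than half of all pairs are rated), while Book-Crossing is the sparsest by several orders of magnitude, which matters when choosing between dense and sparse matrix representations.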

Misc

“Musk” dataset
CMU Motion Capture Database (http://mocap.cs.cmu.edu/)
Brodatz dataset: texture modeling (http://www.ux.uis.no/~tranden/brodatz.html)
Million Song dataset: http://labrosa.ee.columbia.edu/millionsong/
Merck Molecular Activity Challenge - http://www.kaggle.com/c/MerckActivity/data

Last modified on June 14, 2014, at 3:36 am by Caglar Gulcehre

Mo Data
8:06 AM Jun 07 2015

Stashed in: Caltech

That is a lot of data sets!

Adam Rifkin
8:20 AM Jun 07 2015
