Awesome public datasets
This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in awesome-awesomeness and another awesome list.
U.S. Department of Agriculture's PLANTS Database

Biology
1000 Genomes
Collaborative Research in Computational Neuroscience (CRCNS)
Gene Expression Omnibus (GEO)
Human Microbiome Project (HMP)
ICOS PSP Benchmark
MIT Cancer Genomics Data
NIH Microarray data (FTP)
Protein Data Bank
PubChem Project
PubGene (now Coremine Medical)
Stanford Microarray Data
The Personal Genome Project or PGP
UCSC Public Data
UniGene
Australian Weather
Canadian Meteorological Centre
Climate Data from UEA (updated at roughly monthly intervals)
Global Climate Data Since 1929
NOAA Bering Sea Climate
NOAA Climate Datasets
NOAA Realtime Weather Models
WU Historical Weather Worldwide
CrossRef DOI URLs
DBLP Citation dataset
NBER Patent Citations
NIST complex networks data collection
Protein-protein interaction network
PyPI and Maven Dependency Network
Scopus Citation Database
Stanford GraphBase (Steven Skiena)
Stanford Large Network Dataset Collection
The Koblenz Network Collection
The Laboratory for Web Algorithmics (UNIMI)
UCI Network Data Repository
UFL sparse matrix collection
WSU Graph Database
3.5B Web Pages - Web graph extracted from CommonCrawl 2012 web corpus.
53.5B Web clicks - Anonymized HTTP records from 100K users in Indiana Univ.
CAIDA Internet Datasets - Network traces and topologies at geographically diverse locations.
ClueWeb09 - About 1B web pages in ten languages that were collected in Jan. and Feb. 2009.
ClueWeb12 - About 733M web pages collected between Feb. and May 2012.
CommonCrawl Web Data - Petabytes of data collected over 7 years of web crawling.
CRAWDAD Wireless datasets (Dartmouth) - A wireless network data resource for research communities.
OpenMobileData (MobiPerf) - Mobile performance measurement data collected with active tests.
UCSD Network Telescope - A passive traffic monitoring system covering IPv4 /8 net.

Data
Challenges in Machine Learning
DrivenData Competitions for Social Good
ICWSM Data Challenge (since 2009)
Kaggle Competition Data
KDD Cup by Tencent 2012
Localytics Data Visualization Challenge
Netflix Prize
Yelp Dataset Challenge
BODC - Marine data of nearly 22,000 oceanographic vars.
EOSDIS - A data collection of NASA's earth observing system data and information system.
Factual Global Location Data - 65M POIs with extended attributes in 50 countries.
Global Administrative Areas Database (GADM) - For countries and low-level subdivisions.
Geo Spatial Data from ASU - Several small spatial or GIS datasets.
GeoNames - Over eight million placenames (countries, city stat etc.) of the world.
Natural Earth - Vectors and rasters of the world in multiple scales.
OpenStreetMap - A free map worldwide maintained by the communities.
TIGER/Line - Official United States boundaries and roads.
TwoFishes - Foursquare's coarse geocoder.
TZ Timezones - A shapefile of the TZ timezones of the world.
Australia (abs.gov.au)
Australia (data.gov.au)
Canada
Chicago
EuroStat
FedStats
Germany
Glasgow, Scotland, UK
Guardian world governments
London Datastore, U.K.
Netherlands
New Zealand
NYC betanyc
NYC Open Data
OECD
Open Government Data (OGD) Platform India
San Francisco Data sets
South Africa
The World Bank
U.K. Government Data
U.S. American Community Survey
U.S. CDC Public Health datasets
U.S. Census Bureau
U.S. Department of Housing and Urban Development (HUD)
U.S. Federal Government Agencies
U.S. Federal Government Data Catalog
U.S. Food and Drug Administration (FDA)
U.S. Open Government
UK 2011 Census Open Atlas Project
United Nations
EHDP Large Health Data Sets - A collection of health datasets across domains and countries.
Gapminder World - A collection of multi-domain, demographic databases for our world.
Medicare Coverage Database (MCD) - Containing national and local Coverage Determinations.
Medicare Data Engine - Download, explore, and visualize Medicare.gov Data.
Medicare Data File
2GB of Photos of Cats - 10K cat images with basic annotations.
Face Recognition Benchmark - A collection of face datasets for benchmarking algorithms.
ImageNet - An image database organized according to the WordNet hierarchy.
Delve Datasets (Univ. of Toronto) - Evaluating datasets for classification and regression.
eBay Online Auctions (2012) - Seller-auction-bidder data with closing prices.
IMDb Database - An online database of films, TV programs, and video games.
Keel Repository - Multiple datasets for classification, regression, time series.
Lending Club Loan Data - Loan status (Current, Late, Fully Paid, etc.) and latest payment info.
Machine Learning Data Set Repository - A data search engine for machine learning tasks.
Million Song Dataset - Audio features and metadata for a million popular music tracks.
More Song Datasets - Complementary data of cover songs, lyrics, user listening data.
MovieLens Data Sets - Online movie recommendation including movie tags, user ratings.
RDataMining - "R and Data Mining" ebook data
Registered Meteorites on Earth - 34,513 meteorites updated to 2012.
Restaurants Health Score Data - Health status of restaurants in San Francisco.
UCI Machine Learning Repository - One of the most famous ML data repositories.
Yahoo Ratings and Classification Data - About music, movies, user clicks, images etc.

Museums
Cooper-Hewitt's Collection Database
Minneapolis Institute of Arts metadata
Tate Collection metadata
The Getty vocabularies
ClueWeb09 FACC - Annotated English-language Web pages from the ClueWeb09 corpora.
ClueWeb12 FACC - Annotated English-language Web pages from the ClueWeb12 corpora.
DBpedia - Multi-domain ontology describing 4.58M “things” with 583M “facts”.
Flickr Personal Taxonomies - Personalized tagging pictures with descriptive labels.
Google Books Ngrams (2.2TB) - N-gram corpuses extracted from Google Books.
Google Web 5gram (1TB, 2006) - 5-gram corpuses extracted from Web pages.
Gutenberg eBooks List - Basic information about each eBook from Project Gutenberg.
Hansards - 1.3M aligned text chunks from official records of Canadian Parliament.
Machine Translation - The recurring translation task focusing on European languages.
SMS Spam Collection - 5,574 real English messages, labeled as being ham or spam.
USENET corpus - A collection of public USENET postings between Oct 2005 and Jan 2011.
Wikidata - Wikipedia databases available in JSON and XML formats.
Wikipedia Links data - 40 Million Entities in Context.
WordNet - Databases, associated packages and tools.

Physics
CERN Open Data Portal - Experimental data of CMS experiment, ALICE, ATLAS and LHCb
NSSDC (NASA) - More than 230 TB of data from about 550 space science spacecraft
Amazon
Archive.org Datasets
CMU JASA data archive
CMU StatLab collections
Data360
Datamob.org
Google
Infochimps
KDNuggets Data Collections
Numbray
Reddit Datasets
RevolutionAnalytics Collection
Sample R data sets
Stats4Stem R data sets
StatSci.org
The Washington Post List
UCLA SOCR data collection
UFO Reports
Wikileaks 911 pager intercepts
Yahoo Webscope
Academic Torrents (UMB) - Sharing enormous datasets, for researchers, by researchers.
Archive-it - Web archiving service built at the Internet Archive
Datahub.io - The easy way to get, use and share data
DataMarket (Qlik)
Freebase.com - A community-curated database of well-known people, places, and things
Harvard Dataverse Network - Scientific data for reproducible research
ICPSR (UMICH) - Find and analyze data
Statista.com - Statistics and Studies from more than 18,000 Sources
Ancestry.com Forum Dataset - Forum users and messages over ten years
CMU Enron Email - 150 users, mostly senior management of Enron
Facebook Data Scrape (2005) - 100 American colleges and univ.
Facebook Social Networks from LAW (since 2007)
Foursquare (2010, 2011) - Social networks, check-in locations and categories
Foursquare from UMN/Sarwat (2013) - Users, venues, check-ins, ratings etc.
General Social Survey (GSS, since 1972) - Demographic and attitudinal questions, topics etc.
GetGlue - Users rating TV shows
GitHub Archive - Programmers collaboration, projects progress etc.
Mobile Social Networks (UMASS) - Timestamped mote-to-mote (up to 27 subjects) connections
PewResearch Internet Project - A wide range of surveys about library usage, online dating etc.
SourceForge.net Research Data - Historic and status statistics of projects and users' activities
Stack Exchange Data Explorer - User-contributed content on the Stack Exchange network
Titanic Survival Data Set - Demographic information of Titanic passengers
Twitter Graph - Crawled entire Twitter site including tweets, user profiles, relations
UCB's Archive of Social Science Data (D-Lab) - Holdings of political, social and health areas
UCLA Social Sciences Data Archive - A collection of social science data on the Web
UNIMI/LAW Social Network Datasets - Social networks like Amazon, LiveJournal, dblp and more
Universities Worldwide - Links to 9307 Universities in 205 countries
UPJOHN for Employment Research - Labor surveys, unemployment spells and more
Yahoo Graph and Social Data - Web page graph, user-group membership, IM friends etc.
YouTube Video Graph (2007,2008) - Video relations, uploaders, views, ratings and more
Betfair Event Results - Fully time-stamped historical Betfair exchange data
Cricsheet (cricket) - Thousands of Cricket matches
Ergast Formula 1, from 1950 up to date (API available)
Football/Soccer resources (data and APIs)
Lahman's Baseball Database - Batting and pitching statistics, team stats etc.
Retrosheet (baseball) - Play-by-Play files, game logs and schedules
Airlines OD Data 1987-2008, used by ASA Challenge 2009
Bike Share Data Systems - Trip histories, site maps etc.
Edge data for US domestic flights 1990 to 2009
Half a million Hubway rides in MA
Marine Traffic - Ship tracks, port calls and more
NYC Taxi Trip Data 2013 - FOIA/FOILed by Chris Whong
OpenFlights - Airport, airline and route data
RITA Airline On-Time Performance data of major air carriers in US
RITA/BTS transport data collection (TranStat)
Transport for London (TFL) - Trip histories and networking statistics
Travel Tracker Survey (TTS), Chicago, 1990, 2007-2008
U.S. Bureau of Transportation Statistics (BTS)
U.S. Freight Analysis Framework - Freight movement among states since 2007
DataWrangling: Some Datasets Available on the Web
Inside-r: Finding Data on the Internet
Quora: Where can I find large datasets open to the public?
RS.io: 100+ Interesting Data Sets for Statistics
StaTrek: Leveraging open data to understand urban lives