This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; some, however, are not. Other amazingly awesome lists can be found in awesome-awesomeness and sindresorhus's awesome list.
- 1000 Genomes
- American Gut (Microbiome Project)
- Collaborative Research in Computational Neuroscience (CRCNS)
- Gene Expression Omnibus (GEO)
- Sequence Read Archive (SRA)
- EBI ArrayExpress
- ENCODE project
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- MIT Cancer Genomics Data
- NIH Microarray data (FTP)
- OpenSNP genotypes data
- Pathguide: Protein-Protein Interactions Catalog
- Protein Data Bank
- PubChem Project
- PubGene (now Coremine Medical)
- Stanford Microarray Data
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
- The Catalogue of Life
- Australian Weather
- Brazilian Weather - Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WU Historical Weather Worldwide
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- NIST complex networks data collection
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- Open Mobile Data by MobiPerf
- UCSD Network Telescope, IPv4 /8 net
- Challenges in Machine Learning
- D4D Challenge of Orange
- CrowdANALYTIX dataX
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- Yelp Dataset Challenge
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
- BODC - marine data of ~22K vars
- Cambridge, MA, US, GIS data on GitHub
- EOSDIS - NASA's earth observing system data
- Factual Global Location Data
- Geo Spatial Data from ASU
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Landsat 8 on AWS
- Natural Earth - vectors and rasters of the world
- OpenStreetMap (OSM)
- TIGER/Line - U.S. boundaries and roads
- TwoFishes - Foursquare's coarse geocoder
- TZ Timezones shapefiles
- World countries in multiple formats
- List of all countries in all languages
- OpenAddresses
- Antwerp, Belgium
- Austin, TX, US
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Austria (data.gv.at)
- Belgium
- Brazil
- Cambridge, MA, US
- Canada
- Chicago
- Dallas Open Data
- Denver Open Data
- Durham, NC Open Data
- England LGInform
- EuroStat
- FedStats
- Finland
- France
- Germany
- Ghent, Belgium
- Glasgow, Scotland, UK
- Guardian world governments
- Houston Open Data
- Indian Government Data
- Indonesian Data Portal
- London Datastore, UK
- Los Angeles Open Data
- MassGIS, Massachusetts, U.S.
- Mexico
- Netherlands
- New Zealand
- NYC betanyc
- NYC Open Data
- OECD
- Oklahoma
- Open Government Data (OGD) Platform India
- Rio de Janeiro, Brazil
- Romania
- San Francisco Data sets
- Seattle
- Singapore Government Data
- South Africa
- Switzerland
- The World Bank
- Texas Open Data
- Puerto Rico Government
- U.K. Government Data
- Uruguay
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. National Center for Education Statistics (NCES)
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- United Nations
- Vancouver, BC Open Data Catalog
- EHDP Large Health Data Sets
- Gapminder World, demographic databases
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- 10k US Adult Faces Database
- 2GB of Photos of Cats (original down as of 20 Aug 2015) or archive version
- Stanford Dogs Dataset
- The Oxford-IIIT Pet Dataset
- Animals with attributes
- Affective Image Classification
- Face Recognition Benchmark
- ImageNet (in WordNet hierarchy)
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- SUN database, MIT
- YouTube Faces Database
- Indoor Scene Recognition
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Keel Repository for classification, regression and time series
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
- Blogger Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Google Books Ngrams (2.2TB)
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Translation of European languages
- SMS Spam Collection in English
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- USENET postings corpus of 2005-2011
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
- CERN Open Data Portal
- NSSDC (NASA) data of 550 spacecraft
- NASA Exoplanet Archive
- Sloan Digital Sky Survey (SDSS) - Mapping the Universe
- Amazon
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLab collections
- Data360
- Datamob.org
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Numbrary
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Academic Torrents of data sharing from UMB
- Archive-it from Internet Archive
- Datahub.io
- DataMarket (Qlik)
- Freebase.com of people, places, and things
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Open Data Certificates (beta)
- Statista.com - statistics and studies
- 72 hours #gamergate scrape
- Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape
- May 2011 Calufa Twitter Scrape
- Network Twitter Data
- Social Twitter Data
- Twitter Data for Sentiment Analysis
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- EDRM Enron Email of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- FBI Hate Crime 2013 - aggregated data
- Foursquare Social Network in 2010, 2011
- Foursquare from UMN/Sarwat (2013)
- General Social Survey (GSS) since 1972
- GetGlue - users rating TV shows
- GitHub Collaboration Archive
- MIT Reality Mining Dataset
- Mobile Social Networks from UMASS
- PewResearch Internet Survey Project
- Reddit Comments
- SourceForge.net Research Data
- StackExchange Data Explorer
- Titanic Survival Data Set
- Texas Inmates Executed Since 1984
- Twitter Graph of entire Twitter site
- UCB's Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UNIMI/LAW Social Network Datasets
- Universities Worldwide
- UPJOHN for Labor Employment Research
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
- Google Scholar citation relations
- Political Polarity Data
- GDELT Global Events Database
- Skytrax' Air Travel Reviews Dataset
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database
- Retrosheet Baseball Statistics
- Time Series Data Library (TSDL) from MU
- UC Riverside Time Series Dataset
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Airlines OD Data 1987-2008
- Bike Share Systems (BSS) collection
- Bay Area Bike Share Data
- GeoLife GPS Trajectory from Microsoft Research
- Hubway Million Rides in MA
- Marine Traffic - ship tracks, port calls and more
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Taxi Trip Data 2009-
- OpenFlights - airport, airline and route data
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- NYC Uber trip data April 2014 to September 2014
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- OpenDataMonitor: An overview of available open data resources in Europe
- OpenDataNetwork: A search engine of all Socrata powered data portals ranging from small cities to federal agencies and non-profits
- Zenodo: An open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.