Open data sources
Health
MIMIC II: Electronic health records from an intensive care clinical database (see https://archive.physionet.org/physiobank/database/mimic2-iaccd/data_dictionary.txt for the list of variables). The dataset was used to “investigate the effectiveness of indwelling arterial catheters in hemodynamically stable patients with respiratory failure”. https://archive.physionet.org/physiobank/database/mimic2-iaccd/
MIMIC: A more complicated dataset (https://alpha.physionet.org/content/mimicdb/1.0.0/)
MIMIC III: Requires permission to access. https://alpha.physionet.org/content/mimiciii/1.4/
Balance: The Human Balance Evaluation Database contains force platform recordings from subjects undergoing stabilography tests. The subjects performed standing tasks under four different conditions: with their eyes opened or closed, while standing on a rigid or unstable surface. Each condition was tested three times, with the order of the conditions being randomized among subjects. A total of 1930 trials performed by 163 different subjects are given in this database. https://alpha.physionet.org/content/hbedb/1.0.0/ (data available at https://alpha.physionet.org/content/hbedb/1.0.0/BDSinfo.txt)
Parkinson’s: The Tappy Keystroke dataset contains keystroke logs collected from over 200 subjects, with and without Parkinson's Disease (PD), as they typed normally on their own computer (without any supervision) over a period of weeks or months (having initially installed a custom keystroke recording app, Tappy). https://alpha.physionet.org/content/tappy/1.0.0/
Stroke: Clinical data for 120 elderly patients, 60 of whom suffered a stroke. https://physionet.org/content/cves/1.0.0/ (data available at https://physionet.org/content/cves/1.0.0/subjects.csv)
Pregnancy: Number of prior pregnancies, BMI, pregnancy term, fetus sex, and mother’s age for 91 pregnant women. https://physionet.org/content/sufhsdb/1.0.0/FetalPCGSpreadsheet.xls
More pregnancy: Data on 111 pregnant women in Iceland. https://physionet.org/content/ehgdb/1.0.0/ (data available at https://physionet.org/content/ehgdb/1.0.0/info.txt)
Voices: This database includes 208 voice samples, from 150 pathological and 58 healthy voices. The healthy voices or the presence of each vocal fold disorder were clinically verified by the medical experts involved in the project. All diagnoses were made according to indications of the SIFEL protocol, a clinical protocol compiled by the Italian Society of Phoniatrics and Logopaedics. The database includes information such as gender, age, pathology, lifestyle habits (e.g. smoking, alcohol and coffee consumption), occupational status, and the results of two specific medical questionnaires: the Voice Handicap Index (VHI) and Reflux Symptom Index (RSI). https://physionet.org/content/voiced/1.0.0/
Longitudinal Dehydration: Quantitative estimation of dehydration (total body water loss) in 10 subjects using bioimpedance measurements, temperature measurements, salivary samples, and sweat samples.
Amazon Web Services open data (https://registry.opendata.aws/)
Broad Institute Cancer Program data (http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi)
Center for Open Science OSF (https://osf.io/dashboard)
European Union Open Data Portal (http://data.europa.eu/euodp/en/data/)
Federal government (explore.data.gov, www.bls.gov, www.census.gov, https://data.cms.gov)
fivethirtyeight package (https://github.com/rudeboybert/fivethirtyeight)
Gapminder (www.gapminder.org)
Google (https://toolbox.google.com/datasetsearch)
Kaggle (www.kaggle.com)
NOAA Climate data (https://www.ncdc.noaa.gov/data-access)
StatLib at Carnegie Mellon (http://lib.stat.cmu.edu)
UNICEF (https://data.unicef.org/)
World Health Organization (http://apps.who.int/gho/data/node.home)
Awesome public datasets (https://github.com/caesar0301/awesome-public-datasets)
Climate
https://en.tutiempo.net/climate/download/
http://actuariesclimateindex.org/data/
https://weather.gc.ca/grib/index_e.html (https://weather.gc.ca/grib/grib2_RDPA_ps10km_e.html)
Criminal Justice
https://data.police.uk/data/
https://www.policedatainitiative.org/datasets/
https://catalog.data.gov/dataset?tags=crime
Social network data
https://archive.org/details/oxford-2005-facebook-matrix
http://law.di.unimi.it/datasets.php
https://archive.org/details/201309_foursquare_dataset_umn
http://files.pushshift.io/reddit/comments/
http://snap.stanford.edu/data/egonets-Twitter.html
http://netsg.cs.sfu.ca/youtubedata/
Sports
https://data.world/ninja/anw-obstacle-history
https://historicdata.betfair.com/#/home
https://cricsheet.org/
http://ergast.com/mrd/db/
https://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/
http://www.seanlahman.com/baseball-archive/statistics/
Transportation
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
https://github.com/fivethirtyeight/uber-tlc-foil-response
https://openflights.org/data.html
http://www.planecrashinfo.com/database.htm
https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#84045f23-7465-0892-8889-7b6f91049b29
https://www.bts.gov/browse-statistical-products-and-data
Healthcare and medical
https://openpaymentsdata.cms.gov/
https://www.ehdp.com/vitalnet/datasets.htm
https://data.nal.usda.gov/dataset/composition-foods-raw-processed-prepared-usda-national-nutrient-database-standard-reference-release-27
https://data.humdata.org/dataset/ebola-cases-2014
Government
Open990: https://www.open990.com/catalog/
Governance data: A ~100-variable, snack-sized dataset useful for research on nonprofit governance and management. Includes variables from Form 990s e-filed from 2011 to present, across more than 300,000 organizations. Free, with attribution, for non-commercial use. Download the dataset (csv), license, and documentation.
Contractor compensation: A ~100-variable, snack-sized dataset useful for research on nonprofit compensation. Features the five highest compensated independent contractors for each organization. Data are from Form 990 returns e-filed for tax year (TY) 2016. Free, with attribution, for non-commercial use. Download the dataset (csv), license, and documentation.
https://appliednonprofitresearch.com/documentation/irs-990-spreadsheets/
ACS PUMS: American Community Survey Public Use Microdata at person level. https://www.census.gov/programs-surveys/acs/data/pums.html (linked open data: https://docs.data.world/uscensus/#american-community-survey-linked-open-data)
2017 American Housing Survey: The AHS is sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau. The survey is the most comprehensive national housing survey in the United States. https://www.census.gov/programs-surveys/ahs/data/2017/ahs-2017-public-use-file--puf-/ahs-2017-national-public-use-file--puf-.html
2012 Economic Census: https://www.census.gov/programs-surveys/economic-census/data/datasets.html
2010 Census: https://www.census.gov/programs-surveys/decennial-census/data/datasets.2010.html
Stateside Public Use Microdata: Public Use Microdata Sample (PUMS) files contain records representing 10-percent samples of the occupied and vacant housing units in the United States and the people in the occupied units. People in group quarters are also included. The file contains individual weights for each person and housing unit, which, when applied to the individual records, expand the sample to the relevant total. https://www.census.gov/data/datasets/2010/dec/stateside-pums.html
Consumer Expenditure Survey PUMD: https://www.bls.gov/cex/pumd_data.htm#csv
American Time Use Survey: The American Time Use Survey (ATUS) measures the amount of time people spend doing various activities, such as paid work, childcare, volunteering, and socializing. https://www.bls.gov/tus/#data
CMS: Data from the Centers for Medicare & Medicaid Services. https://data.cms.gov
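The Stateside Public Use Microdata entry above notes that each person and housing-unit record carries a weight that expands the sample to the relevant total. A minimal Python sketch of that idea follows. The field names `weight` and `tenure` and the sample records are hypothetical, not the actual PUMS column names, so check the file documentation before adapting it.

```python
from collections import defaultdict


def weighted_totals(records, group_key, weight_key="weight"):
    """Sum the weights of sample records within each group.

    Each sampled record stands in for `weight` units of the full
    population, so the sum of weights in a group estimates that
    group's population count.
    """
    totals = defaultdict(float)
    for rec in records:
        totals[rec[group_key]] += float(rec[weight_key])
    return dict(totals)


# Hypothetical housing-unit records; real PUMS files use their own column names.
sample = [
    {"tenure": "owner", "weight": 10},
    {"tenure": "renter", "weight": 12},
    {"tenure": "owner", "weight": 9},
]

totals = weighted_totals(sample, "tenure")
print(totals)  # estimated number of housing units per tenure group
```

Counting rows instead of summing weights would describe only the sample, not the population it represents.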
Crime
UK Crime, July 2019: These CSV files provide street-level crime, outcome, and stop and search information, broken down by police force and 2011 lower layer super output area (LSOA). https://data.police.uk/data/ (https://data.police.uk/data/fetch/c1b1536b-5638-452d-9ab0-e267c7bc7c17/)
Sports
Cricket: Cricsheet is Retrosheet for cricket. It provides ball-by-ball data for Men’s and Women’s Test Matches, One-day internationals, Twenty20 Internationals, some other international T20s, and all Indian Premier League seasons. https://cricsheet.org/
Baseball: Game logs https://www.retrosheet.org/gamelogs/index.html
Motor racing: http://ergast.com/mrd/db/
Transportation
Taxi data: Large monthly datasets of pickup/dropoff times, trip distance, fare amount, and more for yellow or green taxis in New York City. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (data dictionary: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)
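Because the monthly taxi files are large, it can help to stream them row by row rather than load a whole month into memory. The sketch below is one possible approach in Python, assuming a CSV export with columns named `tpep_pickup_datetime` and `fare_amount` and timestamps formatted like `2019-01-01 00:46:40`. Verify both against the data dictionary for the file you download. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def mean_fare_by_hour(rows):
    """Stream trip rows and return {pickup hour: mean fare}.

    Only running sums and counts are kept, so memory use does not
    grow with the number of trips.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        pickup = datetime.strptime(row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        sums[pickup.hour] += float(row["fare_amount"])
        counts[pickup.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


# Tiny made-up sample standing in for a monthly file.
sample_csv = """tpep_pickup_datetime,trip_distance,fare_amount
2019-01-01 00:46:40,1.5,7.0
2019-01-01 00:59:47,2.6,14.0
2019-01-01 13:21:28,0.8,5.5
"""

result = mean_fare_by_hour(csv.DictReader(io.StringIO(sample_csv)))
print(result)
```

For a downloaded file, the same function can be fed `csv.DictReader(open(path, newline=""))`, where `path` is wherever the monthly file was saved.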
Find your own data
Dataset Collections (data mostly ready to download)
CMU Data library http://lib.stat.cmu.edu/datasets/
Awesome public datasets https://github.com/awesomedata/awesome-public-datasets
Fivethirtyeight: Info about datasets available at https://docs.google.com/spreadsheets/d/1IMWAHNPIDzplafWW6AGnGyHmB1BMjohEw_V5HmT70Gs/edit#gid=840984416
US Crime datasets https://catalog.data.gov/dataset?tags=crime
Topic search (narrow down to data of interest and then export)
WHO Global Health https://www.who.int/gho/en/
US Police data https://www.policedatainitiative.org/datasets/
European Union Open Data Portal (http://data.europa.eu/euodp/en/data/group)
Federal government (explore.data.gov)
Bureau of Labor Statistics (bls.gov)
Air quality: OpenAQ. pm25, pm10, so2, no2, o3, co, and bc data for specific countries, cities, and locations from the last 90 days can be obtained through the ropenaq package: https://github.com/ropensci/ropenaq
Climate: SILO. Australian annual climate data (daily/monthly rainfall, min/max temp, evaporation, mean sea level pressure, solar radiation, and vapour pressure) from 1889 (for most variables) until the current year. https://www.longpaddock.qld.gov.au/silo/gridded-data/
Climate: https://en.tutiempo.net/climate/download/, http://actuariesclimateindex.org/data/
Environment: AfSIS. Sub-Saharan Africa soil chemistry data from 2009-2013, analyzed both wet and dry. https://github.com/qedsoftware/afsis-soil-chem-tutorial
Sports: https://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/
Dataset searches
Center for Open Science OSF (https://osf.io/dashboard)
Gapminder: Collection of global data https://www.gapminder.org/data/
Google dataset search https://toolbox.google.com/datasetsearch
Kaggle datasets https://www.kaggle.com/datasets