Contributions are very welcome - please follow the guidelines and the Code of Conduct.
- BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
- ChemTables: 788 chemical patent tables labeled by content type. Built for semantic classification of table type. Licensed under CC BY-NC 3.0.
- Europe PMC - Bulk download of full text and supplementary information (SI) of more than 5 million articles.
- IUPAC Gold Book
- LibreText: Open-access chemistry textbook.
- MedRxiv XML - Text and data mining is possible via dedicated Amazon S3 resource.
- NLM literature archive: NLM LitArch (NLM Literature Archive) is a digital archive for books, documents, and articles in the fields of life science, medicine, and healthcare at the National Institutes of Health. Also accessible via NCBI bookshelf.
- OpenStax: Free textbooks, including Chemistry 2e, which is released under CC BY 4.0.
- PubChemSTM: 281K chemical structure and text pairs
- PubMed Central: free full-text archive
- PubMed: abstracts and outlinks
- S2ORC: The Semantic Scholar Open Research Corpus. 81.1M English-language academic papers spanning many academic disciplines (the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.
- Elsevier Corpus: A corpus of 40k (40,001) open access (OA) CC-BY articles from across Elsevier’s journals, representing the first cross-discipline research data at this scale to support NLP and ML research.
- Crystallography Open Database: open-access collection of crystal structures of organic, inorganic, and metal-organic compounds and minerals, excluding biopolymers. SMILES have also been derived for some compounds.
- Enamine HTS collection: 1 930 980 diverse screening compounds (37 billion molecules in 2D and 4.5 billion in 3D)
- nCov-Group Data Repository: SMILES, fingerprints, descriptors, and images of millions of compounds.
- zinc20: ZINC20 library prepared for Deep Docking-accelerated virtual screening
- zinc22: commercially-available compounds for virtual screening
- COCONUT: an open-source project for natural products (NPs) storage, search, and analysis.
- nmrshiftdb2: a database of organic structures and their nuclear magnetic resonance (NMR) spectra.
- ACNet: a benchmark for Activity Cliff Prediction, 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs from ChEMBL (version 28).
- Aquasoldb: Curation of nine open source datasets on aqueous solubility. The authors also assigned reliability groups.
- BindingDB: molecular recognition database containing 2.6M data points for 1.1M compounds and 8.10K targets (Feb 2023)
- ChEBI-20: 33,010 molecule-description pairs (for molecule captioning task)
- ESOL: Water solubility data (log solubility in mol per litre) for common organic small molecules.
- Flashpoint: Sun et al. collected a dataset of the flashpoints of 10575 molecules from academic papers, the Gelest chemical catalogue, the DIPPR database, Lange's Handbook of Chemistry, the Hazardous Chemicals Handbook, and the PubChem database.
- FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies
- Harvard OPV: "experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets"
- Hydrogen Storage Materials Database: data on hydrides for hydrogen storage (information such as chemical formula and hydrogen capacity)
- ILThermo: thermodynamic and transport properties of pure ionic liquids and mixtures of them.
- Leffingwell Odor Dataset: 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database
- Limiting activity coefficients: for different solvent/solute pairs, used to train a SMILES-based transformer.
- Lipophilicity: Experimental results of the octanol/water distribution coefficient (logD at pH 7.4).
- MoleculeNet - Benchmark suite that contains multiple datasets listed here
- OCHEM: As of Feb 17, 2023, OCHEM contained 3774118 records for 689 properties (with at least 50 records) collected from 20609 sources. Users are granted a Creative Commons CC-BY (version 4.0) license to submitted data.
- Papyrus: A large scale curated dataset aimed at bioactivity predictions. Contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with smaller datasets.
- Photoswitch Dataset: Curated dataset of 405 photoswitch molecules.
- QM Datasets: QM7, QM7b, QM8, QM9, MD Trajectories
- SolProp: Database of 1 million solvent/solute COSMO-RS calculations and 10145 experimental solvation free energies (originally published as part of this paper).
- SOMAS: Experimental and calculated solubilities for small molecules. Originally proposed for the design of redox-flow batteries.
- Therapeutic Data Commons: ML tasks that cover small molecules and biologics, including antibodies, peptides, miRNAs, and gene editing therapies. Original data can be found here.
- ThermoML Archive: experimental thermophysical and thermochemical property data (in ThermoML XML format)
- Open Targets: a large-scale resource that uses human genetics and genomics data for systematic drug target identification and prioritization.
- Probes & Drugs Portal: an interactive, open data resource for chemical biology. Provides an overview of libraries of bioactive compounds (e.g., ChEMBL, Guide to PHARMACOLOGY), including commercial screening libraries.
- Guide to PHARMACOLOGY: an expert-curated resource of ligand-activity-target relationships. It includes activity records even where the bioactivity value is unknown. Licensed under CC BY-SA 4.0.
- Drug Indications Database (DID): a dataset of structured drug-indication relations, intended to facilitate the building of practical, comprehensive, integrated drug ontologies.
- The Metabolism and Transport Database: a cheminformatics and bioinformatics resource that contains curated data related to human small molecule metabolism and transport.
- The Human Metabolome Database (HMDB): a freely available electronic database containing detailed information about small molecule metabolites found in the human body.
- KEGG PATHWAY Database (KEGG): a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
- MetXBioDB Metabolite Biotransformations: a comprehensive collection of biotransformation reactions and metabolite information from the BioTransformer database. It includes the transformation and metabolism of metabolites.
- QSAR datasets - Meta-QSAR (phase I & II): Data (extracted from ChEMBL) used in Olier et al. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery.
- EPA CompTox: a widely used resource for chemistry, toxicity, and exposure information for hundreds of thousands of chemicals, including, but not limited to, chemical properties, environmental fate and transport, hazard, in vitro to in vivo extrapolation (IVIVE), exposure, and bioactivity (each dataset has its own license).
- PAMPA Permeability and NCATS dataset: a dataset from a commonly employed assay used to evaluate drug permeability across the cellular membrane, intended to help in ADME prediction.
- Cell Effective Permeability (Caco-2) dataset: a dataset by Wang et al. on drug absorption through intestinal tissue, which is simulated using a human colon epithelial cancer cell line (Caco-2).
- uspto: Reactions extracted by text-mining from United States patents published between 1976 and September 2016.
- Dreher-Doyle: yields and conditions for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings
- Perera: yields and conditions for 5760 Pd-catalysed Suzuki-Miyaura C-C cross-couplings
- porous materials AI gym: open data sets for machine learning pertaining to porous materials.
- awesome materials informatics: overview of software, data, and initiatives in the field of materials informatics