lm-datasets is a collection of datasets for language model training including scripts for downloading, preprocesssing, and sampling.
The documentation is available here.
Install the lm-datasets
package with pip:
pip install lm-datasets
In order to keep the package minimal by default, lm-datasets
comes with optional dependencies useful for some use cases.
For example, if you want to have the text extraction for all available datasets, run:
pip install lm-datasets[datasets]
To download and extract the plain-text of one or more datasets, run the following command:
lm_datasets extract_text $DATASET_ID $OUTPUT_DIR
By default, output is saved as JSONL files. To change the output format, you can use the --output_format
argument as below:
lm_datasets extract_text $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
A list or table with all available datasets can be print with the follow command:
lm_datasets print_stats --print_output md
Language | Tokens |
---|---|
bg | 53 B |
ca | 5 B |
code | 250 B |
cs | 128 B |
da | 34 B |
de | 795 B |
el | 108 B |
en | 6 T |
es | 674 B |
et | 15 B |
eu | 696 M |
fi | 55 B |
fr | 655 B |
ga | 767 M |
gl | 70 M |
hr | 8 B |
hu | 179 B |
it | 386 B |
lt | 24 B |
lv | 14 B |
mt | 4 B |
nl | 238 B |
nn | 307 M |
no | 9 B |
pl | 223 B |
pt | 187 B |
ro | 77 B |
sh | 2 M |
sk | 47 B |
sl | 11 B |
sr | 10 B |
sv | 89 B |
uk | 47 B |
Source | Tokens |
---|---|
academic_slovene_kas | 1 B |
bgnc_admin_eur | 79 M |
bgnc_news_corpus | 18 M |
brwac | 3 B |
bulgarian_news | 283 M |
bulnc | 567 M |
cabernet | 712 M |
cc_gigafida | 127 M |
colossal_oscar | 208 B |
croatian_news_engri | 695 M |
curlicat | 410 M |
danewsroom | 472 M |
danish_gigaword | 1 B |
dewac | 2 B |
dialogstudio | 0 |
dk_clarin | 441 M |
enc2021 | 0 |
estonian_reference_corpus | 175 M |
eurlex | 121 B |
euscrawl | 423 M |
ga_bilingual_legistation | 4 M |
ga_universal_dependencies | 3 M |
greek_legal_code | 45 M |
greek_web_corpus | 3 B |
hrwac | 1 B |
itwac | 2 B |
korpus_malti | 366 M |
legal_mc4 | 29 B |
macocu | 23 B |
marcell_legislative_subcorpus_v2 | 31 M |
norwegian_cc | 5 B |
openlegaldata | 10 B |
oscar | 9 T |
oscar_opengptx | 245 B |
parlamento_pt | 819 M |
pes2o | 42 B |
pl_nkjp | 1 M |
pl_parliamentary_corpus | 671 M |
proof_pile | 8 B |
redpajama | 46 B |
seimas_lt_en | 48 k |
sk_court_decisions | 11 B |
sk_laws | 45 M |
slwac_web | 1 B |
sonar | 500 M |
sonar_new_media | 36 M |
spanish_legal | 3 B |
srpkor | 0 |
starcoder | 250 B |
state_related_latvian_web | 1 M |
styria_news | 409 M |
sv_gigaword | 1 B |
syn_v9 | 5 B |
uk_laws | 579 M |
wiki | 12 B |
wikibooks | 353 M |
wikihow | 2 M |
wikinews | 79 M |
wikiquote | 268 M |
wikisource | 2 B |
wikivoyage | 132 M |
ylenews | 0 |
We provide a Web-based application through streamlit to browse all datasets and their contained text content. To start the app, first clone this repository, install dependencies, and run the following command:
# clone is needed since streamlit does not support apps from modules yet
git clone https://github.com/malteos/lm-datasets.git
streamlit run src/lm_datasets/viewer/app.py -- \
--raw_datasets_dir=$RAW_DATASETS_DIR \
--output_dir=$PROCESSED_DATASET_DIR
To setup, your local development environment we recommend conda and cloning the repository. The repository also includes settings and launch scripts for VSCode.
git clone [email protected]:malteos/lm-datasets.git
cd lm-datasets
conda create -n lm-datasets python=3.10
conda activate lm-datasets
pip install -r requirements.txt
Alternatively, you can install the Python package directly from the dev branch:
pip install git+https://github.com/malteos/lm-datasets.git@dev
This repository uses git hooks to validate code quality and formatting.
pre-commit install
git config --bool flake8.strict true # Makes the commit fail if flake8 reports an error
To run the hooks:
pre-commit run --all-files
The tests can be executed with:
pytest --doctest-modules --cov-report term --cov=lm_datasets
The work on the lm-datasets software is partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project OpenGPT-X (project no. 68GX21007D).
Apache 2.0
(Please note that the actual datasets are released with different licenses)