Create a scientific dataset like this.
This project contains several Python scripts that process arXiv metadata and create a dataset.
- Create a new virtual environment from your preferred Python distribution.
- Install the dependencies with `pip install -r requirements.txt` if using pip, or `conda env create -f environment.yml` if using Anaconda or Miniforge.
- Set up the `scientific_dataset_arxiv/config.py` file for your intended usage. You can configure the start and end year, and the maximum number of PDFs you want to download per month.
- You can also customise the search term after which the data is returned from the txt files. The default `search_term = 'introduction'` is a good choice; this is how the reference dataset was created too. A sketch of the config is shown after this list.
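For reference, a `config.py` along these lines would cover the settings mentioned above. The variable names are taken from this README, but the exact contents of the file in the repository may differ, so treat this as an assumption-laden sketch.

```python
# scientific_dataset_arxiv/config.py -- hypothetical sketch; check the actual
# file in the repo for the exact variable names it expects.

start_year = 2007             # first year of arXiv papers to download
end_year = 2023               # last year of arXiv papers to download
max_pdfs_per_month = 100      # cap on PDFs downloaded per month
search_term = 'introduction'  # text is kept only from this term onwards
```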
- `download_convert.py`: downloads PDFs from the arXiv GCP bucket and converts them into text files.
- `merge_metadata_articles.py`: merges the metadata, which contains the ID, title, and abstract, with the extracted articles.
- `merge_parquet.py`: merges all the files together into one dataset.
- Check out a sample of the end result in the `test_merged_parquet.ipynb` notebook.
To use these scripts, run them in the order listed above (a minimal wrapper is sketched below). Make sure to set `start_year`, `end_year`, and `max_pdfs_per_month` to the years you want the dataset for, in all four scripts.
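As a rough illustration, the pipeline can be driven from a small Python wrapper like the one below. The script names come from the list above, but how they read their configuration (from `config.py` rather than command-line arguments) is an assumption, so adapt as needed.

```python
# run_pipeline.py -- hypothetical wrapper; assumes each script reads its
# settings from scientific_dataset_arxiv/config.py and takes no CLI arguments.
import subprocess
import sys

SCRIPTS = [
    "download_convert.py",         # download PDFs from the arXiv GCP bucket, convert to text
    "merge_metadata_articles.py",  # merge metadata (ID, title, abstract) with the articles
    "merge_parquet.py",            # merge everything into one dataset
]

for script in SCRIPTS:
    print(f"Running {script} ...")
    # Stop the pipeline if any stage fails.
    subprocess.run([sys.executable, script], check=True)
```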
In the end, you should end up with a dataset that looks a lot like scientific_papers, but updated with the latest articles for more up-to-date training!
You can find preprocessed data here.
It covers all papers from 2007 to 2023, although many files are discarded either because of losses in the conversion to text or because the `search_term` is missing.
Check out the `data` folder in the Hugging Face repo to find year-wise parquet files.
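For a quick look at one of these files, something like the snippet below works. The file name `2023.parquet` and the column names are assumptions based on how the dataset is described above, so check the actual files in the `data` folder.

```python
# Inspect one year-wise parquet file; the file name and column names are
# assumptions -- check the Hugging Face data folder for the real ones.
import pandas as pd

df = pd.read_parquet("data/2023.parquet")
print(df.shape)
print(df.columns.tolist())  # expected: something like id, title, abstract, article
print(df.head(3))
```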
You can find the raw data, which combines the metadata from arXiv and the full text extracted from the PDFs into a single parquet file, here for further customised processing.
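As one example of customised processing, the snippet below trims each article so that only the text from the `search_term` onwards is kept, mirroring how the processed dataset is described above. The file name `raw_data.parquet` and the column name `text` are assumptions.

```python
# Example of customised processing on the raw parquet file: keep only the
# text from the search term onwards. File and column names are assumptions.
import pandas as pd

search_term = "introduction"
raw = pd.read_parquet("raw_data.parquet")

def from_search_term(text):
    """Return the text starting at the search term, or None if it is missing."""
    idx = str(text).lower().find(search_term)
    return str(text)[idx:] if idx != -1 else None

raw["text"] = raw["text"].map(from_search_term)
processed = raw.dropna(subset=["text"])  # discard articles without the search term
```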