Create a scientific dataset like this.
This project contains several Python scripts that process arXiv metadata and create a dataset.
- Create a new virtual environment from your preferred Python distribution.
- Install the dependencies with `pip install -r requirements.txt` if using pip, or `conda env create -f environment.yml` if using Anaconda or Miniforge.
- Set up the `scientific_dataset_arxiv/config.py` file for your intended usage. You can configure the start and end year, and the maximum number of PDFs you want to download per month.
- You can also customise the search term after which the data is returned from the txt files. The default `search_term = 'introduction'` is a good choice; this is how the reference dataset was created too. A sketch of the config is shown after this list.
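For reference, a `config.py` along these lines would cover the settings mentioned above. The variable names are taken from this README, but the exact contents of the file in the repository may differ, so treat this as an assumption-laden sketch.

```python
# scientific_dataset_arxiv/config.py -- hypothetical sketch; check the actual
# file in the repo for the exact variable names it expects.

start_year = 2007             # first year of arXiv papers to download
end_year = 2023               # last year of arXiv papers to download
max_pdfs_per_month = 100      # cap on PDFs downloaded per month
search_term = 'introduction'  # text is kept only from this term onwards
```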
- `download_convert.py`: downloads PDFs from the arXiv GCP bucket and converts them into text files.
- `merge_metadata_articles.py`: merges the metadata, which contains the ID, title, and abstract, with the extracted articles.
- `merge_parquet.py`: merges all the files together into one dataset.
- Check out a sample of the end result in the `test_merged_parquet.ipynb` notebook.
To use these scripts, run them in the order listed above (a minimal wrapper is sketched below). Make sure to set `start_year`, `end_year`, and `max_pdfs_per_month` to the years you want the dataset for, in all four scripts.
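As a rough illustration, the pipeline can be driven from a small Python wrapper like the one below. The script names come from the list above, but how they read their configuration (from `config.py` rather than command-line arguments) is an assumption, so adapt as needed.

```python
# run_pipeline.py -- hypothetical wrapper; assumes each script reads its
# settings from scientific_dataset_arxiv/config.py and takes no CLI arguments.
import subprocess
import sys

SCRIPTS = [
    "download_convert.py",         # download PDFs from the arXiv GCP bucket, convert to text
    "merge_metadata_articles.py",  # merge metadata (ID, title, abstract) with the articles
    "merge_parquet.py",            # merge everything into one dataset
]

for script in SCRIPTS:
    print(f"Running {script} ...")
    # Stop the pipeline if any stage fails.
    subprocess.run([sys.executable, script], check=True)
```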
In the end, you should end up with a dataset that looks a lot like scientific_papers, but updated with the latest articles for more up-to-date training!
You can find preprocessed data here.
It covers all papers from 2007 to 2023, although many files are discarded either because of losses in the conversion to text or because the `search_term` is missing.
Check out the `data` folder in the Hugging Face repo to find year-wise parquet files.
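For a quick look at one of these files, something like the snippet below works. The file name `2023.parquet` and the column names are assumptions based on how the dataset is described above, so check the actual files in the `data` folder.

```python
# Inspect one year-wise parquet file; the file name and column names are
# assumptions -- check the Hugging Face data folder for the real ones.
import pandas as pd

df = pd.read_parquet("data/2023.parquet")
print(df.shape)
print(df.columns.tolist())  # expected: something like id, title, abstract, article
print(df.head(3))
```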
You can find the raw data, which combines the metadata from arXiv and the full text extracted from the PDFs into a single parquet file, here for further customised processing.
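As one example of customised processing, the snippet below trims each article so that only the text from the `search_term` onwards is kept, mirroring how the processed dataset is described above. The file name `raw_data.parquet` and the column name `text` are assumptions.

```python
# Example of customised processing on the raw parquet file: keep only the
# text from the search term onwards. File and column names are assumptions.
import pandas as pd

search_term = "introduction"
raw = pd.read_parquet("raw_data.parquet")

def from_search_term(text):
    """Return the text starting at the search term, or None if it is missing."""
    idx = str(text).lower().find(search_term)
    return str(text)[idx:] if idx != -1 else None

raw["text"] = raw["text"].map(from_search_term)
processed = raw.dropna(subset=["text"])  # discard articles without the search term
```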