Alpaca-style Dataset Generator

This project generates a high-quality Alpaca-style dataset from input text files, PDFs, and Word documents. It features multi-threaded data loading, batched generation with optional GPU acceleration, and customizable output.

Features

  • Multi-threaded data loading from various file formats (txt, pdf, docx); a loading sketch follows this list
  • Batch processing for efficient dataset generation
  • GPU acceleration (if available)
  • Separate raw and validated output files
  • Progress tracking for all major steps
  • Customizable configuration
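
As a rough illustration of the multi-threaded loading feature, the sketch below reads .txt, .pdf, and .docx files concurrently with a thread pool. It is not the project's actual data_loader.py: the helper names (load_file, load_folder) are hypothetical, and it assumes the pypdf and python-docx packages, which may differ from what requirements.txt pins.

# Illustrative sketch only; helper names and dependencies are assumptions.
import os
from concurrent.futures import ThreadPoolExecutor

from pypdf import PdfReader   # assumed PDF dependency
from docx import Document     # assumed python-docx dependency

def load_file(path):
    """Return the plain text of a single .txt, .pdf, or .docx file."""
    if path.endswith(".txt"):
        with open(path, encoding="utf-8") as f:
            return f.read()
    if path.endswith(".pdf"):
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if path.endswith(".docx"):
        return "\n".join(p.text for p in Document(path).paragraphs)
    return ""

def load_folder(folder, max_workers=4):
    """Load every supported file in `folder` using a thread pool."""
    paths = [os.path.join(folder, name) for name in os.listdir(folder)
             if name.endswith((".txt", ".pdf", ".docx"))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_file, paths))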

Project Structure

alpaca-dataset-generator/
│
├── src/
│   ├── main.py
│   ├── config.py
│   ├── data_loader.py
│   ├── model_setup.py
│   ├── dataset_generator.py
│   ├── validation.py
│   └── utils.py
│
├── data/
│   └── input/
│       ├── file1.txt
│       ├── file2.pdf
│       └── file3.docx
│
├── output/
│   ├── raw_dataset.jsonl
│   └── validated_dataset.jsonl
│
├── requirements.txt
└── README.md

Setup

  1. Clone the repository:
git clone https://github.com/ekatraone/alpaca-dataset-generator.git
cd alpaca-dataset-generator
  2. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:
pip install -r requirements.txt
  4. Download NLTK data:
python -m nltk.downloader punkt stopwords

Configuration

Open src/config.py and adjust the settings as needed (an illustrative set of values follows the list):

  • input_folder: Path to your input data folder (default: 'data/input')
  • output_file: Path for the raw output file (default: 'output/raw_dataset.jsonl')
  • validated_output_file: Path for the validated output file (default: 'output/validated_dataset.jsonl')
  • num_examples: Number of examples to generate
  • batch_size: Batch size for processing
  • max_workers: Number of worker threads for data loading
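
For orientation, a config file with these settings might look like the sketch below. The values shown for batch_size and max_workers are illustrative placeholders, not the repository's defaults; only the paths and the num_examples default of 1000 come from this README.

# Hypothetical example of src/config.py values; adjust to your hardware.
input_folder = "data/input"
output_file = "output/raw_dataset.jsonl"
validated_output_file = "output/validated_dataset.jsonl"
num_examples = 1000   # can also be set via --num_examples
batch_size = 8        # illustrative; lower it if you hit out-of-memory errors
max_workers = 4       # illustrative; worker threads for data loading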

Usage

  1. Place your input files (.txt, .pdf, .docx) in the data/input/ directory.

  2. Run the script:

python src/main.py --num_examples 1000
  • --num_examples: Number of examples to generate (default: 1000)
  3. The script will generate two files in the output/ directory (a quick way to inspect them is sketched below):
    • raw_dataset.jsonl: Contains all generated examples
    • validated_dataset.jsonl: Contains only the examples that passed validation
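
Each line of these files is one JSON object. The field names below follow the usual Alpaca convention (instruction, input, output); the generator's exact keys may differ, so treat this as a sketch for inspecting the output rather than a guaranteed schema.

# Print the first validated example (field names assumed, not verified).
import json

with open("output/validated_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        print(example.get("instruction"), "->", example.get("output"))
        break  # only show the first record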

Customization

  • To modify the types of examples generated, edit the instructions list in src/dataset_generator.py.
  • To adjust validation criteria, modify the is_valid_output function in src/utils.py; a minimal example of such a check is sketched below.
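
As a point of reference, a validation predicate of this kind could be as simple as the sketch below. The actual is_valid_output in src/utils.py may apply different or stricter checks; the word-count thresholds here are made up for illustration.

# Illustrative validation predicate; thresholds are hypothetical.
def is_valid_output(text, min_words=5, max_words=300):
    """Accept outputs that are non-empty and fall within a word-count range."""
    words = text.strip().split()
    return min_words <= len(words) <= max_words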

Troubleshooting

  • If you encounter CUDA out-of-memory errors, try reducing the batch_size in src/config.py.
  • If the process is too slow, you can try increasing max_workers or batch_size, but be cautious of memory usage.

Releases

For information about the latest releases and changes, please refer to the CHANGELOG.md file.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
