This project generates a high-quality, Alpaca-style instruction dataset from plain-text files, PDFs, and Word documents. It offers optimized performance, optional GPU acceleration, and customizable output.
- Multi-threaded data loading from various file formats (txt, pdf, docx); see the sketch after this list
- Batch processing for efficient dataset generation
- GPU acceleration (if available)
- Separate raw and validated output files
- Progress tracking for all major steps
- Customizable configuration
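To make the multi-threaded loading feature above concrete, a thread pool can dispatch one file per worker, with a per-format text extractor. The sketch below is illustrative only: it assumes the `pypdf` and `python-docx` libraries and a hypothetical `load_file` helper; the project's actual `data_loader.py` may be organized differently.

```python
# Illustrative sketch of multi-threaded document loading (not the project's
# actual code). Assumes `pypdf` and `python-docx` are installed.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from docx import Document   # python-docx
from pypdf import PdfReader

def load_file(path: Path) -> str:
    """Extract plain text from a .txt, .pdf, or .docx file."""
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="ignore")
    if path.suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if path.suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {path.suffix}")

def load_folder(folder: str, max_workers: int = 4) -> list[str]:
    """Load all supported files in parallel using a thread pool."""
    paths = [p for p in Path(folder).iterdir()
             if p.suffix in {".txt", ".pdf", ".docx"}]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_file, paths))
```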
Project structure:

alpaca-dataset-generator/
│
├── src/
│ ├── main.py
│ ├── config.py
│ ├── data_loader.py
│ ├── model_setup.py
│ ├── dataset_generator.py
│ ├── validation.py
│ └── utils.py
│
├── data/
│ └── input/
│ ├── file1.txt
│ ├── file2.pdf
│ └── file3.docx
│
├── output/
│ ├── raw_dataset.jsonl
│ └── validated_dataset.jsonl
│
├── requirements.txt
└── README.md
Installation:

- Clone the repository:
  git clone https://github.com/ekatraone/alpaca-dataset-generator.git
  cd alpaca-dataset-generator
- Create a virtual environment and activate it:
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required packages:
  pip install -r requirements.txt
- Download the required NLTK data (or do it from Python, as shown below):
  python -m nltk.downloader punkt stopwords
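If you prefer to handle setup from Python (for example, inside a bootstrap script), the same NLTK resources can be fetched with `nltk.download`; this is simply an equivalent of the CLI command above:

```python
# Equivalent of `python -m nltk.downloader punkt stopwords`, from Python.
import nltk

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists
```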
Configuration:

Open `src/config.py` and adjust the settings as needed (a sketch of the file follows this list):

- `input_folder`: Path to your input data folder (default: 'data/input')
- `output_file`: Path for the raw output file (default: 'output/raw_dataset.jsonl')
- `validated_output_file`: Path for the validated output file (default: 'output/validated_dataset.jsonl')
- `num_examples`: Number of examples to generate
- `batch_size`: Batch size for processing
- `max_workers`: Number of worker threads for data loading
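For orientation, a `config.py` consistent with the settings above might look like this. This is a minimal sketch: the values for `batch_size` and `max_workers` are illustrative assumptions, not the file's actual defaults.

```python
# Sketch of src/config.py (illustrative values, not the shipped defaults).
input_folder = "data/input"                               # where source files live
output_file = "output/raw_dataset.jsonl"                  # all generated examples
validated_output_file = "output/validated_dataset.jsonl"  # validated subset
num_examples = 1000   # number of examples to generate
batch_size = 32       # examples per processing batch (assumed value)
max_workers = 4       # threads used for data loading (assumed value)
```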
Usage:

- Place your input files (.txt, .pdf, .docx) in the `data/input/` directory.
- Run the script:
  python src/main.py --num_examples 1000
- `--num_examples`: Number of examples to generate (default: 1000); a sketch of the corresponding CLI parsing follows this list.
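The actual argument handling lives in `src/main.py`; a minimal argparse setup consistent with the command above (a sketch, not the project's guaranteed code) would be:

```python
# Sketch of CLI parsing matching `python src/main.py --num_examples 1000`.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Generate an Alpaca-style dataset from local documents."
    )
    parser.add_argument(
        "--num_examples", type=int, default=1000,
        help="Number of examples to generate (default: 1000)",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Generating {args.num_examples} examples...")
```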
Output:

The script will generate two files in the `output/` directory (see the snippet below for inspecting them):
- `raw_dataset.jsonl`: Contains all generated examples
- `validated_dataset.jsonl`: Contains only the examples that passed validation
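Both files are in JSON Lines format (one JSON object per line), so they can be inspected with a few lines of Python:

```python
# Inspect the validated dataset (JSON Lines: one record per line).
import json

with open("output/validated_dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(f"{len(examples)} validated examples")
print(examples[0])  # typically an instruction/input/output record (Alpaca style)
```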
Customization:

- To modify the types of examples generated, edit the `instructions` list in `src/dataset_generator.py`.
- To adjust validation criteria, modify the `is_valid_output` function in `src/utils.py` (hypothetical sketches of both hooks follow below).
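For concreteness, the two hooks might have shapes like the following. These are hypothetical sketches; the real definitions in the project may differ.

```python
# Hypothetical shapes of the two customization hooks.

# In src/dataset_generator.py: prompt templates drawn from during generation.
instructions = [
    "Summarize the following passage in two sentences.",
    "Write three questions that can be answered from the text below.",
]

# In src/utils.py: keep only outputs that pass basic quality checks.
def is_valid_output(output: str, min_words: int = 5) -> bool:
    """Reject empty or trivially short generations."""
    return len(output.split()) >= min_words
```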
Troubleshooting:

- If you encounter CUDA out-of-memory errors, try reducing `batch_size` in `src/config.py` (a quick GPU check follows this list).
- If the process is too slow, try increasing `max_workers` or `batch_size`, but be cautious of memory usage.
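If you are unsure whether a GPU is being used at all, a quick check with PyTorch (assuming the project runs on PyTorch, which the CUDA errors above suggest) is:

```python
# Check GPU availability and free memory (assumes PyTorch is installed).
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
else:
    print("No CUDA device detected; computation will run on CPU.")
```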
For information about the latest releases and changes, please refer to the CHANGELOG.md file.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.