This project generates a high-quality, Alpaca-style instruction dataset from plain-text files, PDFs, and Word documents. It offers optimized performance, optional GPU acceleration, and customizable output.
- Multi-threaded data loading from various file formats (txt, pdf, docx); see the sketch after this list
- Batch processing for efficient dataset generation
- GPU acceleration (if available)
- Separate raw and validated output files
- Progress tracking for all major steps
- Customizable configuration
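To make the multi-threaded loading feature above concrete, a thread pool can dispatch one file per worker, with a per-format text extractor. The sketch below is illustrative only: it assumes the `pypdf` and `python-docx` libraries and a hypothetical `load_file` helper; the project's actual `data_loader.py` may be organized differently.

```python
# Illustrative sketch of multi-threaded document loading (not the project's
# actual code). Assumes `pypdf` and `python-docx` are installed.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from docx import Document   # python-docx
from pypdf import PdfReader

def load_file(path: Path) -> str:
    """Extract plain text from a .txt, .pdf, or .docx file."""
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="ignore")
    if path.suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if path.suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {path.suffix}")

def load_folder(folder: str, max_workers: int = 4) -> list[str]:
    """Load all supported files in parallel using a thread pool."""
    paths = [p for p in Path(folder).iterdir()
             if p.suffix in {".txt", ".pdf", ".docx"}]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_file, paths))
```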
Project structure:

alpaca-dataset-generator/
│
├── src/
│ ├── main.py
│ ├── config.py
│ ├── data_loader.py
│ ├── model_setup.py
│ ├── dataset_generator.py
│ ├── validation.py
│ └── utils.py
│
├── data/
│ └── input/
│ ├── file1.txt
│ ├── file2.pdf
│ └── file3.docx
│
├── output/
│ ├── raw_dataset.jsonl
│ └── validated_dataset.jsonl
│
├── requirements.txt
└── README.md
Installation:

- Clone the repository:
  git clone https://github.com/ekatraone/alpaca-dataset-generator.git
  cd alpaca-dataset-generator
- Create a virtual environment and activate it:
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required packages:
  pip install -r requirements.txt
- Download the required NLTK data (or do it from Python, as shown below):
  python -m nltk.downloader punkt stopwords
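If you prefer to handle setup from Python (for example, inside a bootstrap script), the same NLTK resources can be fetched with `nltk.download`; this is simply an equivalent of the CLI command above:

```python
# Equivalent of `python -m nltk.downloader punkt stopwords`, from Python.
import nltk

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists
```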
Configuration:

Open `src/config.py` and adjust the settings as needed (a sketch of the file follows this list):

- `input_folder`: Path to your input data folder (default: 'data/input')
- `output_file`: Path for the raw output file (default: 'output/raw_dataset.jsonl')
- `validated_output_file`: Path for the validated output file (default: 'output/validated_dataset.jsonl')
- `num_examples`: Number of examples to generate
- `batch_size`: Batch size for processing
- `max_workers`: Number of worker threads for data loading
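For orientation, a `config.py` consistent with the settings above might look like this. This is a minimal sketch: the values for `batch_size` and `max_workers` are illustrative assumptions, not the file's actual defaults.

```python
# Sketch of src/config.py (illustrative values, not the shipped defaults).
input_folder = "data/input"                               # where source files live
output_file = "output/raw_dataset.jsonl"                  # all generated examples
validated_output_file = "output/validated_dataset.jsonl"  # validated subset
num_examples = 1000   # number of examples to generate
batch_size = 32       # examples per processing batch (assumed value)
max_workers = 4       # threads used for data loading (assumed value)
```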
Usage:

- Place your input files (.txt, .pdf, .docx) in the `data/input/` directory.
- Run the script:
  python src/main.py --num_examples 1000
- `--num_examples`: Number of examples to generate (default: 1000); a sketch of the corresponding CLI parsing follows this list.
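The actual argument handling lives in `src/main.py`; a minimal argparse setup consistent with the command above (a sketch, not the project's guaranteed code) would be:

```python
# Sketch of CLI parsing matching `python src/main.py --num_examples 1000`.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Generate an Alpaca-style dataset from local documents."
    )
    parser.add_argument(
        "--num_examples", type=int, default=1000,
        help="Number of examples to generate (default: 1000)",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Generating {args.num_examples} examples...")
```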
Output:

The script will generate two files in the `output/` directory (see the snippet below for inspecting them):
- `raw_dataset.jsonl`: Contains all generated examples
- `validated_dataset.jsonl`: Contains only the examples that passed validation
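Both files are in JSON Lines format (one JSON object per line), so they can be inspected with a few lines of Python:

```python
# Inspect the validated dataset (JSON Lines: one record per line).
import json

with open("output/validated_dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(f"{len(examples)} validated examples")
print(examples[0])  # typically an instruction/input/output record (Alpaca style)
```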
Customization:

- To modify the types of examples generated, edit the `instructions` list in `src/dataset_generator.py`.
- To adjust validation criteria, modify the `is_valid_output` function in `src/utils.py` (hypothetical sketches of both hooks follow below).
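For concreteness, the two hooks might have shapes like the following. These are hypothetical sketches; the real definitions in the project may differ.

```python
# Hypothetical shapes of the two customization hooks.

# In src/dataset_generator.py: prompt templates drawn from during generation.
instructions = [
    "Summarize the following passage in two sentences.",
    "Write three questions that can be answered from the text below.",
]

# In src/utils.py: keep only outputs that pass basic quality checks.
def is_valid_output(output: str, min_words: int = 5) -> bool:
    """Reject empty or trivially short generations."""
    return len(output.split()) >= min_words
```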
Troubleshooting:

- If you encounter CUDA out-of-memory errors, try reducing `batch_size` in `src/config.py` (a quick GPU check follows this list).
- If the process is too slow, try increasing `max_workers` or `batch_size`, but be cautious of memory usage.
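If you are unsure whether a GPU is being used at all, a quick check with PyTorch (assuming the project runs on PyTorch, which the CUDA errors above suggest) is:

```python
# Check GPU availability and free memory (assumes PyTorch is installed).
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
else:
    print("No CUDA device detected; computation will run on CPU.")
```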
For information about the latest releases and changes, please refer to the CHANGELOG.md file.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.