A small research project: how much would it cost to create an Alpaca-like dataset using a slightly different approach? All data byproducts are CC0-licensed.
Remember that developing a model based on data you generated via a model API might violate the terms of service of the API provider.
- Clone the repo:
git clone https://github.com/mobarski/alpaca-libre && cd alpaca-libre
- Install the required Python modules:
pip install -r requirements.txt
- View / edit generate.py
- Set the API key:
export OPENAI_KEY=...
- Run the script:
python3 generate.py
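Before running the script, it can help to confirm that the key is visible to Python. The following is a minimal sketch, assuming the script reads the `OPENAI_KEY` environment variable exported above; the helper `get_api_key` is hypothetical and not part of the repo.

```python
import os


def get_api_key(env=os.environ, name="OPENAI_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Assumption: the key lives in the OPENAI_KEY variable, as in the
    export step above. This helper is illustrative only.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; run: export {name}=...")
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```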
- data/seed_tasks.jsonl - is from the Self-Instruct paper
- data/alpaca_libre_prompt_v1.txt - is from the Alpaca paper (with slight modification)
The output file is in the JSONL format: it contains one task (a JSON object) per line. Each task object has the following items:
- status - anything other than 'ok' is bad
- instruction
- input
- output
- other
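The format described above can be read with a few lines of Python. This is a minimal sketch, not code from the repo: the helper names `iter_tasks` and `keep_ok`, the sample records, and the `"error"` status value are illustrative assumptions. Only the field names and the rule that `status` must be `'ok'` come from the description above.

```python
import json


def iter_tasks(lines):
    """Yield one task dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_ok(tasks):
    """Keep only tasks whose status is exactly 'ok'."""
    return [t for t in tasks if t.get("status") == "ok"]


# Made-up sample records that follow the documented item names.
sample = [
    '{"status": "ok", "instruction": "Add the numbers.", "input": "2, 3", "output": "5", "other": {}}',
    '{"status": "error", "instruction": "", "input": "", "output": "", "other": {}}',
]

if __name__ == "__main__":
    good = keep_ok(iter_tasks(sample))
    print(len(good), "usable task(s)")
```

To process a real output file, pass an open file object instead of `sample`, since iterating over a file yields its lines.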
GitHub repos:
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/yizhongw/self-instruct
- https://github.com/orhonovich/unnatural-instructions
Papers:
- https://crfm.stanford.edu/2023/03/13/alpaca.html
- https://arxiv.org/abs/2212.10560
- https://arxiv.org/abs/2212.09689
- 0.3
- parallel main loop
- better cli output
- output format change (everything not essential is placed in the "other" object)
- basic output quality check
- fix: multiline input/output handling
- fix: no initial space / empty section handling
- fix: