A small research project: how much would it cost to create an Alpaca-like dataset using a slightly different approach? All data byproducts are CC0-licensed.
Remember that developing a model based on data you generated via a model API might violate the terms of service of the API provider.
- Clone the repo:
git clone https://github.com/mobarski/alpaca-libre && cd alpaca-libre
- Install the required Python modules:
pip install -r requirements.txt
- View / edit generate.py
- Set the API key (the OPENAI_KEY environment variable):
export OPENAI_KEY=...
- Run the script:
python3 generate.py
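Since the script depends on the key exported above, a small pre-flight check can fail early with a readable message. This is only a sketch: the helper name `get_api_key` is hypothetical, and it assumes the key is read from the `OPENAI_KEY` environment variable as in the export step. How generate.py actually reads the key is not shown here.

```python
import os


def get_api_key(env=os.environ, name="OPENAI_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper; `name` matches the variable exported in the steps above.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; run `export {name}=...` first")
    return key


# Usage (raises RuntimeError if the variable is missing or empty):
# api_key = get_api_key()
```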
- data/seed_tasks.jsonl is from the Self-Instruct paper
- data/alpaca_libre_prompt_v1.txt is from the Alpaca paper (with slight modification)
The output file (data/alpaca_libre_tasks_v1.jsonl) is in the jsonl format.
It contains one task (json object) per line.
Each task object has the following items:
- status - anything other than 'ok' is bad
- instruction
- input
- output
- other
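As an illustration of the format described above, here is a minimal sketch that reads the output file line by line and keeps only tasks whose status is 'ok'. The function name `load_ok_tasks` is hypothetical and not part of the repo. The only assumptions about the data are the ones stated above: one JSON object per line, with a status item.

```python
import json
from pathlib import Path


def load_ok_tasks(path):
    """Return the task objects whose status is 'ok', skipping blank lines."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. a trailing newline
            task = json.loads(line)
            if task.get("status") == "ok":
                tasks.append(task)
    return tasks


# Only runs if the output file has already been generated.
path = Path("data/alpaca_libre_tasks_v1.jsonl")
if path.exists():
    tasks = load_ok_tasks(path)
    print(f"{len(tasks)} usable tasks")
```

Filtering on status first keeps failed or malformed generations out of any downstream use of the instruction, input, and output items.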
GitHub repos:
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/yizhongw/self-instruct
- https://github.com/orhonovich/unnatural-instructions
Papers: