Reimplementation of the task generation part from the Alpaca paper

mobarski/alpaca-libre

Alpaca Libre

Small research project: how much would it cost to create an Alpaca-like dataset using a slightly different approach? All data byproducts are CC0-licensed.

Remember that developing a model based on data you generated via a model API might violate the terms of service of the API provider.

Image: alpaca on the Altiplano grasslands with the Statue of Liberty in the background

Usage

  1. Clone the repo: git clone https://github.com/mobarski/alpaca-libre && cd alpaca-libre
  2. Install required python modules: pip install -r requirements.txt
  3. View / edit generate.py
  4. Set the API key: export OPENAI_KEY=...
  5. Run the script: python3 generate.py

Attribution

  • data/seed_tasks.jsonl - from the Self-Instruct paper
  • data/alpaca_libre_prompt_v1.txt - from the Alpaca paper (with slight modification)

Output

The output file is in the JSONL format: one task (JSON object) per line. Each task object has the following items:

  • status - any value other than 'ok' marks a bad task
  • instruction
  • input
  • output
  • other - an object holding everything that is not essential
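
The format above can be consumed with a few lines of Python. This is only a minimal sketch, not code from the repo: the function name load_ok_tasks is hypothetical, and the only things it assumes about the file are the one-object-per-line layout and the status item described above.

```python
import json
from pathlib import Path


def load_ok_tasks(path):
    """Return the task objects whose status is 'ok' from a JSONL file.

    Hypothetical helper for illustration; it is not part of generate.py.
    """
    tasks = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            task = json.loads(line)
            # Anything other than 'ok' is treated as a bad task and skipped.
            if task.get("status") == "ok":
                tasks.append(task)
    return tasks
```

Calling load_ok_tasks with the path of a generated file returns a list of dicts, so task["instruction"], task["input"] and task["output"] can be read directly from each entry.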

References

GitHub repos:

Papers:

Changelog

  • 0.3
    • parallel main loop
    • better cli output
    • output format change (everything not essential is placed in the "other" object)
    • basic output quality check
    • fix: multiline input/output handling
    • fix: no initial space / empty section handling
    • fix:
