Reimplementation of the task generation part from the Alpaca paper

mobarski/alpaca-libre

Alpaca Libre

🦙🗽 Small research project - how much would it cost to create an Alpaca-like dataset using a slightly different approach? All data byproducts are CC0-licensed.

👉 Follow me on Twitter for news and updates.

🚫 Remember that releasing a model based on data you generated via a model API might violate the Terms of Service of the API provider.

alpaca on the Altiplano grasslands with the Statue of Liberty in the background

Usage

  1. Clone the repo: git clone https://github.com/mobarski/alpaca-libre && cd alpaca-libre
  2. Install the required Python modules: pip install -r requirements.txt
  3. View / edit generate.py
  4. Set the API key: export OPENAI_KEY=...
  5. Run the script: python3 generate.py

Attribution

  • data/seed_tasks.jsonl is from the Self-Instruct paper
  • data/alpaca_libre_prompt_v1.txt is from the Alpaca paper (with slight modifications)

Output

Files in the data/output directory are in the same format as the original Alpaca dataset.

Files in the data/output/work directory are in the .jsonl format and:

  • contain one task (JSON object) per line,

  • also contain tasks that failed quality checks (status!='ok')

    • these tasks might be marked as 'ok' after manual inspection
  • each task object has the following items:

    • status - anything other than 'ok' is bad

    • instruction - instruction part of the prompt

    • input - input part of the prompt

    • output - expected output

    • other - dictionary for other information (similarity, etc.)
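
The work-file layout described above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names `load_tasks` and `to_alpaca_records` and the file name `example.jsonl` are hypothetical, and it assumes that the Alpaca-style format means a list of objects with only the `instruction`, `input` and `output` keys.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read a .jsonl work file: one JSON task object per non-empty line."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


def to_alpaca_records(tasks):
    """Keep only tasks with status == 'ok' and drop the bookkeeping fields.

    Assumption: an Alpaca-style record holds just instruction/input/output.
    """
    return [
        {
            "instruction": t["instruction"],
            "input": t.get("input", ""),  # assumed empty when a task has no input
            "output": t["output"],
        }
        for t in tasks
        if t.get("status") == "ok"
    ]


if __name__ == "__main__":
    # Hypothetical file name; substitute a real file from data/output/work.
    work_file = Path("data/output/work/example.jsonl")
    if work_file.exists():
        records = to_alpaca_records(load_tasks(work_file))
        print(f"{len(records)} tasks passed the quality checks")
```

To promote a task after manual inspection, one option is to change its `status` to 'ok' in the work file and rerun a filter like this one.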

Changelog

  • 0.4.1
    • v4 dataset converted into the same format as original Alpaca
    • jsonl dataset moved into work dir
  • 0.4
    • grouping turns into rounds
    • basic input quality check
    • better <noinput> handling
    • <nooutput> handling
    • retry with backoff on API error
    • progressbars
    • fixed: typos in Alpaca prompt
    • fixed: whitespace handling after task number
  • 0.3
    • parallel main loop
    • better cli output
    • output format change (everything not essential is placed in the "other" object)
    • basic output quality check
    • fixed: multiline input/output handling
    • fixed: no initial space / empty section handling
    • fixed: <noinput>
