- Welcome to the InstructLab Taxonomy
- Learning
- Getting Started with Skill Contributions
- Getting Started with Knowledge Contributions
- Taxonomy tree layout
- Contribute knowledge and skills to the taxonomy!
InstructLab 🥼 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "lab" in InstructLab 🥼 stands for Large-scale Alignment for Chat Bots.
The LAB method is driven by taxonomies, which are largely created manually and with care.
This repository contains a taxonomy tree that allows you to create models tuned with your data (enhanced via synthetic data generation) using LAB 🐶 method.
Learn about the concepts of "skills" and "knowledge" in our InstructLab Community Learning Guide.
Skills require a much smaller volume of content than knowledge contributions. An entire skill
contribution to the taxonomy tree can be just a few lines of YAML in the
qna.yaml
file ("qna" is short for "questions and answers"). Each qna.yaml
file requires a minimum of five question and answer pairs.
Tip
The skill taxonomy structure is used in several ways:
- To select the right subset of the taxonomy to use for data generation.
- To determine the interpretability by human contributors and maintainers.
- As part of the prompt to the GPT model used to generate synthetic samples.
Note
To get the best use of this structure, make sure the names of directories match the intent of the taxonomy files. You can also verify that a person's contribution is in the most logical location in the taxonomy structure before signing off.
Taxonomy skill files must be a valid YAML file named
qna.yaml
. Each qna.yaml
files contains a set of key/value entries with the following keys:
-
task_description
: A description of the skill. This key is required. -
created_by
: The GitHub username of the contributor. This key is required. -
seed_examples
: A collection of key/value entries. New submissions should have at least five entries, although older files may have fewer. This key is required.-
question
: A question for the model. This key is required. -
answer
: The desired response from the model. This key is required. -
context
: Grounded skills require the user to provide context containing information that the model is expected to take into account during processing. This is different from knowledge, where the model is expected to gain facts and background knowledge from the tuning process. The context key is optional for freeform skills. -
attribution
: A collection of key/value entries that contain the following keys. All sources of information must be specified. This key is required.-
source
: The source of the provided information. If information in the context, question, or answer comes from a third party, such as Wikipedia, then the value must specify a URL to the source material. If the contributor self-authored all of the information, then the value must beself-authored
. This key is required. -
license
: The value must specify the SPDX License Identifier of the source information. See CONTRIBUTING.MD for guidance on acceptable licenses for source information. If the information is self-authored, then the value must beApache-2.0
. This key is required.
-
-
Other keys at any level are currently ignored.
To make the qna.yaml
files easier and faster for humans to read, it is recommended to specify task_description
first, followed by created_by
, and finally seed_examples
.
In seed_examples
, it is recommended to specify context
first (if applicable), followed by question
, answer
, and finally attribution
.
task_description: <string>
created_by: <string>
seed_examples:
- question: <string>
answer: |
<multi-line string>
attribution:
- source: <string>
license: <SPDX license identifier>
- context: |
<multi-line string>
question: <string>
answer: |
<multi-line string>
attribution:
- source: <string>
license: <SPDX license identifier>
...
If you have not written YAML before, don't be intimidated - it's just text.
Tip
- Spaces and indentation matter in YAML. Two spaces to indent.
- Don't use tabs!
- Be careful to not have trailing spaces at the end of a line.
- Each example in
seed_examples
begins with a "-". Place this "-" in front of the first field (question
orcontext
). The remaining keys in the example should not have this "-". - Some special characters such as " and ' need to be "escaped." This is why some of the lines for keys in the example YAML we provided have the '|' character. This character escapes all of the special characters in the value for the key. You might also want to use the '|' character for multi-line strings.
It is recommended that you lint, or verify your YAML using a tool.
One linter option is yamllint.com. You can copy/paste your YAML into the box and click Go to have it analyze your YAML and make recommendations.
Online tools like prettified and
yaml-validator can automatically
reformat your YAML to adhere to our yamllint
PR checks, such as breaking lines
longer than 120 characters.
This example snippet assumes the GitHub username mairin
and shows
some of the question/answer pairs present in the actual
file:
task_description: |
The pun task enables the telling of funny pun-based jokes.
created_by: mairin # Use your GitHub username; only one creator supported
seed_examples:
- question: Tell me a pun about birds.
answer: | # The | is needed to escape characters like ` or '
Why do birds eat wood?
Because they're peckish!
attribution:
- source: self-authored
license: Apache-2.0
- question: Tell me a pun about x-rays.
answer: |
What do dentists call their x-rays?
Tooth pics!
attribution:
- source: self-authored
license: Apache-2.0
- question: Tell me a pun about gas.
answer: |
Why did the car have a belly ache?
Because it had too much gas!
attribution:
- source: self-authored
license: Apache-2.0
Seriously, that's it.
Here is the location of this YAML in the taxonomy tree. Note that the YAML file itself, plus any added directories that contain the file, is the entirety of the skill in terms of a taxonomy contribution:
[...]
└── writing
└── freeform
| └── jokes/puns <=== here it is :)
| | └── qna.yaml
│ ├── debate
│ │ └── qna.yaml
│ ├── legal
│ │ ├── agreement
│ │ │ └── qna.yaml
[...]
Remember that grounded compositional skills require additional context and include a context
field.
This example snippet assumes the GitHub username mairin
and shows some of the question/answer pairs present in the actual file:
task_description: |
This skill provides the ability to read a markdown-formatted table.
created_by: mairin # Use your GitHub username; only one creator supported
seed_examples:
- context: |
| **Breed** | **Size** | **Barking** | **Energy** |
|----------------|--------------|-------------|------------|
| Afghan Hound | 25-27 in | 3/5 | 4/5 |
| Labrador | 22.5-24.5 in | 3/5 | 5/5 |
| Cocker Spaniel | 14.5-15.5 in | 3/5 | 4/5 |
| Poodle (Toy) | <= 10 in | 4/5 | 4/5 |
question: |
Which breed has the most energy?
answer: |
The breed with the most energy is the Labrador.
attribution:
- source: self-authored
license: Apache-2.0
- context: |
| **Name** | **Date** | **Color** | **Letter** | **Number** |
|----------|----------|-----------|------------|------------|
| George | Mar 5 | Green | A | 1 |
| Gráinne | Dec 31 | Red | B | 2 |
| Abigail | Jan 17 | Yellow | C | 3 |
| Bhavna | Apr 29 | Purple | D | 4 |
| Rémy | Sep 9 | Blue | E | 5 |
question: |
What is Gráinne's letter and what is her color?
answer: |
Gráinne's letter is B and her color is red.
attribution:
- source: self-authored
license: Apache-2.0
- context: |
| Banana | Apple | Blueberry | Strawberry |
|--------|------------|-----------|------------|
| Yellow | Red, Green | Blue | Red |
| Large | Medium | Small | Small |
| Peel | Peel | No peel | No peel |
question: |
Which fruit is blue, small, and has no peel?
answer: |
The blueberry is blue, small, and has no peel.
attribution:
- source: self-authored
license: Apache-2.0
[...]
└── extraction
└── inference
| └── qualitative
| | ├── sentiment
| | | └── qna.yaml
| | └── tone_and_style
| | └── qna.yaml
│ ├── quantitative
│ │ ├── table_analysis <=== here it is :)
│ | | └── qna.yaml
│ │ ├── word_frequency
│ │ │ └── qna.yaml
[...]
Note
We are not currently accepting knowledge contributions, but we will open this up in the future!
While skills are foundational or performative, knowledge is based more on answering questions that involve facts, data, or references.
Knowledge in the taxonomy tree also consists of a few more elements than skills.
Each knowledge node in the tree has a qna.yaml
similar to the format of the
qna.yaml
for skills, but it has an extra folder for knowledge documents called
knowledge_documents
. The knowledge document formats currently supported are
markdown (.md).
Each qna.yaml
file requires a minimum of five question-answer
pairs. The qna.yaml
format must include the following fields:
created_by
(your GitHub username)domain
(category the knowledge falls under)seed_examples
(five or more examples sourced from the provided knowledge documents)created_by
(your GitHub username)task_description
(an optional description of the knowledge).attribution
source
license
(cite your sources)
task_description: |
Knowledge about Taylor Swift's music.
created_by: mairin
domain: pop culture
seed_examples:
- question: |
Is Taytay coming to Boston in 2024?
answer: |
Not that is known yet. Taylor Swift last performed in the Boston area at
the Gilette Stadium in Foxboro, MA for 3 nights from Friday May 19, 2023
to Sunday May 21, 2023. In 2024, she is making international tour stops
for her Eras tour outside of the United States.
attribution:
- source: self-authored
license: Apache-2.0
- question: |
Which album was released more recently, Reputation or Midnights?
answer: |
The Taylor Swift Album Reputation was released on November 10, 2017.
Midnights was released October 21, 2022. Midnights was released more
recently, but there are rumors that there will be a re-release of
Reputation called Reputation (Taylor's version) in the later half of 2024
which would make that the most recently-released album of the set at that
time.
attribution:
- source: self-authored
license: Apache-2.0
- question: |
Which album has the song "You Need to Calm Down?"
answer: |
The song "You Need to Calm Down" appears on Taylor Swift's 2019 album
Lover as track 14.
attribution:
- source: self-authored
license: Apache-2.0
This knowledge references two markdown files:
ts-world-tour-2024-schedule.md
and ts-discography-2024.md
. You must submit these
files in their entirety along with the knowledge's
qna.yaml
file in a knowledge_documents
folder. Using multiple files means that knowledge
consists of a much higher volume of content than a skill.
Due to the higher volume, it will naturally take longer to receive acceptance for a knowledge contribution pull request than for a skill pull request. Smaller pull requests are simpler and require less time and effort to review.
What might these markdown files look like? They can be freeform. Here's what a
snippet of ts-discography-2024.md
might look like:
# Albums
## Studio Albums
### Taylor Swift
- Released: October 24, 2006
- Label: Big Machine
- Track Listing:
1. "Tim McGraw"
2. "Picture to Burn"
3. "Teardrops on My Guitar"
4. "A Place in This World"
5. "Cold as You"
6. "The Outside"
7. "Tied Together with a Smile"
8. "Stay Beautiful"
9. "Should've Said No"
10. "Mary's Song (Oh My My My)"
11. "Our Song"
### Fearless
- Released: November 11, 2008
- Label: Big Machine
- Track Listing:
1. "Fearless"
2. "Fifteen"
3. "Love Story"
4. "Hey Stephen"
[..]
In contrast to the layout of skills in the taxonomy, here's what the previously referenced knowledge might look like in the tree:
[...]
└── knowledge
└── textbooks
├── culture
│ └── music
│ └── pop
│ ├── taylor swift <=== here it is :)
│ │ ├── knowledge_documents
│ │ │ ├── ts-discography-2024.md
│ │ │ └── ts-world-tour-2024-schedule.md
│ │ └── qna.yaml
│ └── the rolling stones
│ ├── knowledge_documents
│ │ ├── rs-discography-2024.md
│ │ ├── rs-guitar-tabs.md
│ │ ├── rs-lyrics-catalog-2024.md
│ │ └── rs-tour-history.md
│ └── qna.yaml
[...]
The taxonomy tree is organized in a cascading directory structure. At the end of each branch, there is a YAML file (qna.yaml) that contains the examples for that domain. Maintainers can decide to change the names of the existing branches or to add new branches.
Important
Folder names do not have spaces.
Below is an illustrative directory structure to show this layout:
.
└── writing
├── freeform
│ ├── brainstorming
│ │ ├── idea_generation
│ │ │ └── qna.yaml
│ │ ├── refute_claim
│ │ │ └── qna.yaml
│ ├── poetry
│ │ ├── ballad
│ │ │ └── qna.yaml
│ │ ├── epic_poetry
│ │ │ └── qna.yaml
│ ├── prose
│ │ ├── articles
│ │ │ └── qna.yaml
│ │ ├── emails
│ │ │ ├── formal
│ │ │ │ └── qna.yaml
│ │ │ └── informal
│ │ │ └── qna.yaml
└── grounded
├── editing
│ ├── grammar
│ │ └── qna.yaml
│ └── spelling
│ └── qna.yaml
├── meeting_insights
│ ├── action_items
│ │ └── qna.yaml
│ ├── corporate_email
│ │ └── qna.yaml
└── summarization
└── wiki_insights
├── concise
│ └── qna.yaml
├── detailed
For an extensive example of this layout see, taxonomy_tree_layout in the documentation folder.
The ability to contribute to a Large Language Model (LLM) has been difficult in no small part because it is difficult to get access to the necessary compute infrastructure.
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch following InstructLab's progressive training on a regular basis. This enables fast iteration of the model(s), for the benefit of the open source community.
By contributing your skills and knowledge to this repository, you will see your changes built into an LLM within days of your contribution rather than months or years! If you are working with a model and notice its knowledge or ability lacking, you can correct it by contributing knowledge or skills and check if it's improved after your changes are built.
While public contributions are welcome to help drive community progress, you can also fork this repository under the Apache License, Version 2.0, add your own internal skills, and train your own models internally. However, you might need your own access to significant compute infrastructure to perform sufficient retraining.
You can contribute to the taxonomy in the following two ways:
-
Adding new examples to existing leaf nodes:
- Go to the corresponding leaf node / end of the branch and modify the YAML
- Add a new example to the
qna.yaml
files as a new entry to the list
-
Adding new branches/skills corresponding to the existing domain:
- You can add new folders under the corresponding category (replace any spaces
_
) - Create a new
qna.yaml
file containing examples for the new skill
- You can add new folders under the corresponding category (replace any spaces
- You have a GitHub account
- You have access to this repo
-
Click Fork to fork your own copy of the repo.
-
On the Create a new fork page, enter the information into the following fields:
- Repository name:
taxonomy
is fine - Description: Enter the description of your fork, not of the skills you will create. You can write something that makes sense to you or leave it blank.
- Copy the main branch only: The box is selected by default. You can choose to leave the box selected or clear it.
- Repository name:
-
Click Create Fork.
You will get a copy of the taxonomy repo in your github account. This is your own copy, so don't worry about making mistakes. If you do end up making a mistake and want to start over: you can delete the fork and create a new fork.
The following image shows the compositional skills directory and its contents. Skills are contributed to this directory.
The other top-level directory you can contribute to is the knowledge directory, which is used for knowlege contributions. You can read more about the difference between skills and knowledge in the community documentation.
Based on the directories that exist in the tree, make a best guess at where in the tree structure to add the skill that you want to contribute. If you get to a point where you've gone deep enough into the tree and you can't find any directories that match, feel free to create a new directory (and subdirectories, if needed) to best represent your skill.
For example, I want to contribute a skill for creating puns. Puns are a specific type of joke. I started in the writing directory of the tree, and saw two main directories there: freeform and grounded.
Under the freeform directory, I saw subdirectories such as brainstorming, debate, legal, poetry, prose, etc.
Under the grounded directory, I saw subdirectories such as editing, meeting_insights, summarization/wiki_insights.
Puns seemed to fit best under the freeform directory, but I didn't think they fit under any of the pre-existing directories under freeform, so I created a jokes directory. Under jokes, I created a puns subdirectory. By making jokes a directory, I can continue to add subdirectories for different types of jokes. For example, I can add a new subdirectory for the knock-knock joke skill that I want to create. 🙂
It can be a little tricky mechanically to create directories in GitHub's web UI, but you can complete the process using the following steps:
-
In the GitHub repo, click the folder that you want to create the new directory inside of.
-
Click Add File and select Create new file from the menu.
-
Type the name of the first directory that you want to create. In the example animation, we use "jokes/" as the first directory.
Note
When you type the "/" character, the directory name will "lock in" and you'll be able to type the name of the subdirectory that you want to add under it. In the example, we typed "knock-knock/" as the subdirectory name.
Note
Make sure to replace any spaces (
) in the folder name with underscores (_
)
- After you have entered the name of all of the directories that you want to add, type the file name. The file name should always be
qna.yaml
(qna stands for "Question aNd Answer.")
Here's an animated graphic to show how it works:
TO BE CONTINUED
This information will be added once knowledge contributions are opened up.
For additional information about how to make a contribution, see the documentation on contributing.
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch as often as possible (at least weekly). Fast iteration of the model(s) benefits the open source community and enables model developers who do not have access to the necessary compute infrastructure.