About Taxonomy
Welcome to the InstructLab Taxonomy¶
InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "lab" in InstructLab 🐶 stands for Large-Scale Alignment for ChatBots1.
The LAB method is driven by taxonomies, which are largely created manually and with care.
The instructlab/taxonomy repository contains a taxonomy tree that allows you to create models tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 method.
Choosing domains for the taxonomy¶
In general, we use the Dewey Decimal Classification (DDC) System to determine our domains (and subdomains) in the taxonomy. This DDC SUMMARIES document is a great resource for determining where a topic might be classified.
If you are unsure where to put your knowledge or compositional skill, create a folder in the miscellaneous_unknown
folder under the knowledge
or compositional_skills
folders.
Learning¶
Learn about the concepts of "skills" and "knowledge" in our InstructLab Community Learning Guide.
Taxonomy tree layout¶
The taxonomy tree is organized in a cascading directory structure. At the end of each branch, there is a YAML file (qna.yaml
) that contains the examples for that domain along with any attribution files (attribution.txt
). Maintainers can decide to change the names of the existing branches or to add new branches.
Important
Folder names do not have spaces. Use underscores between words.
Taxonomy diagram¶
Note
These diagrams show subsets of the taxonomy. They are not a complete representation.
flowchart TD;
na[not accepting contributions\n at this time]:::na
taxonomy --> foundational_skill & compositional_skills & knowledge
foundational_skill:::na --> reasoning:::na
reasoning:::na --> common_sense_reasoning:::na
reasoning:::na --> mathematical_reasoning:::na
reasoning:::na --> theory_of_mind:::na
compositional_skills --> engineering
compositional_skills --> grounded
compositional_skills --> lingustics
grounded --> grounded/arts
grounded --> grounded/geography
grounded --> grounded/history
grounded --> grounded/science
knowledge --> knowledge/arts
knowledge --> knowledge/miscellaneous_unknown
knowledge --> knowledge/science
knowledge --> knowledge/technology
knowledge/science --> animals --> birds --> black_capped_chickadee --> black_capped_chickadee-a & black_capped_chickadee-q
knowledge/science --> astronomy --> constellations --> phoenix --> phoenix-a & phoenix-q
black_capped_chickadee-a{attribution.txt}
black_capped_chickadee-q{qna.yaml}
phoenix-a{attribution.txt}
phoenix-q{qna.yaml}
classDef na fill:#EEE
Below is an illustrative directory structure to show this layout:
.
└── linguistics
├── writing
│ ├── brainstorming
│ │ ├── idea_generation
| │ └── qna.yaml
│ │ attribution.txt
│ │ ├── refute_claim
| │ └── qna.yaml
│ │ attribution.txt
│ ├── prose
│ │ ├── articles
│ │ └── qna.yaml
│ │ attribution.txt
└── grammar
└── qna.yaml
│ attribution.txt
└── spelling
└── qna.yaml
attribution.txt
Contribute knowledge and skills to the taxonomy¶
The ability to contribute to a Large Language Model (LLM) has been difficult in no small part because it is difficult to get access to the necessary compute infrastructure.
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch following InstructLab's progressive training on a regular basis. This enables fast iteration of the model(s), for the benefit of the open source community.
By contributing your skills and knowledge to this repository, you will see your changes built into an LLM within days of your contribution rather than months or years! If you are working with a model and notice its knowledge or ability lacking, you can correct it by contributing knowledge or skills and check if it's improved after your changes are built.
While public contributions are welcome to help drive community progress, you can also fork this repository under the Apache License, Version 2.0, add your own internal skills, and train your own models internally. However, you might need your own access to significant compute infrastructure to perform sufficient retraining.
Ways to contribute¶
You can contribute to the taxonomy in the following two ways:
- Adding new examples to existing leaf nodes:
- Adding new branches/skills corresponding to the existing domain:
For more information, see the Ways of contributing to the taxonomy repository documentation.
How to contribute skills and knowledge¶
To contribute to this repo, you'll use the Fork and Pull model common in many open source repositories. You can add your skills and knowledge to the taxonomy in multiple ways; for additional information on how to make a contribution, see the Documentation on contributing. You can also use the following guides to help with contributing:
- Contributing using the GitHub webpage UI.
- Contributing knowledge to the taxonomy in the Knowledge contribution guidelines.
Why should I contribute?¶
This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main branch as often as possible (at least weekly). Fast iteration of the model(s) benefits the open source community and enables model developers who do not have access to the necessary compute infrastructure.
-
Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D. Cox, Akash Srivastava. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions) ↩