
Taxonomy Contribution Guidelines

We welcome contributions to the upstream InstructLab taxonomy.(1)

  1. upstream here refers to the open source concept of an "upstream" project: the original project that sits ahead of your product in the development workstream.

A contribution to the upstream taxonomy involves:

  • identifying whether your submission is a knowledge submission or a skills submission,
  • defining the domain for the submission, or the branches and leaf node that will contain your knowledge,
  • converting any source data to Markdown or PDF and storing it in a git-based repository (required for knowledge submissions),
  • writing a YAML file called qna.yaml that contains the seed content that will inform synthetic data generation and the repo URL where your converted source data is stored,
  • creating an attribution file called attribution.txt that provides licensing information for source data, and
  • submitting those two files in the proper spot in the taxonomy tree to the InstructLab taxonomy.

You can submit knowledge or compositional skills. More information on the two types of contribution can be found in our knowledge contribution details guide or our compositional skills contribution guide.
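For orientation, here is a minimal sketch of a compositional skill qna.yaml. The layout follows the general pattern used upstream, but treat the version number and all values as illustrative; check the compositional skills contribution guide for the current schema:

version: 2
task_description: Teach the model to rewrite sentences in the passive voice.
created_by: your_github_username   # hypothetical contributor name
seed_examples:
  - question: |
      Rewrite this sentence in the passive voice: "The cat chased the mouse."
    answer: |
      The mouse was chased by the cat.
  - question: |
      Rewrite this sentence in the passive voice: "The committee approved the budget."
    answer: |
      The budget was approved by the committee.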

attribution.txt

An important part of contributing to the InstructLab project is citing your sources of information. This takes the form of the attribution.txt file that you add to your pull request. Almost all instances of attribution can be covered by the parameters required for Creative Commons Attribution licenses. Some parameters are as follows:

  • Title of work
  • Link to work
    • Include link to a specific revision where possible
  • License of the work
    • Include an SPDX identifier where possible
  • Creator names
  • Copyright information
  • Modification information
    • Indicate if work was itself derived from another openly licensed work

You can also see this citation style in the Data sources documentation.
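As an illustration, an attribution.txt covering these parameters might look like the following. The field labels here are an assumption drawn from the list above, and every value is a placeholder; see the knowledge contribution details guide for the expected format:

Title of work: <title of the source document>
Link to work: <URL, ideally pointing to a specific revision>
License of the work: <SPDX identifier, for example CC-BY-SA-4.0>
Creator names: <author or organization>
Copyright information: <copyright holder and year, if stated>
Modification information: <note whether the work derives from another openly licensed work>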

Upstream taxonomy tree layout

The upstream taxonomy tree is organized in a cascading directory structure. At the end of each branch, there is a YAML file (qna.yaml) that contains the examples for that domain. Maintainers can decide to change the names of the existing branches or to add new branches.

Important

Folder names must not contain spaces. Use underscores between words (for example, black_capped_chickadee).

Below is an illustrative directory structure to show this layout:

.
└── linguistics
    ├── writing
    │   ├── brainstorming
    │   │   ├── idea_generation
    │   │   │   ├── qna.yaml
    │   │   │   └── attribution.txt
    │   │   └── refute_claim
    │   │       ├── qna.yaml
    │   │       └── attribution.txt
    │   └── prose
    │       └── articles
    │           ├── qna.yaml
    │           └── attribution.txt
    └── grammar
        ├── qna.yaml
        ├── attribution.txt
        └── spelling
            ├── qna.yaml
            └── attribution.txt

Upstream taxonomy diagram

Note

This diagram shows a subset of the upstream taxonomy. It is not a complete representation.

 flowchart TD;
   na[not accepting contributions\n at this time]:::na
   taxonomy --> foundational_skill & compositional_skills & knowledge

   foundational_skill:::na --> reasoning:::na
   reasoning:::na --> common_sense_reasoning:::na
   reasoning:::na --> mathematical_reasoning:::na
   reasoning:::na --> theory_of_mind:::na

   compositional_skills --> engineering
   compositional_skills --> grounded
   compositional_skills --> linguistics

   grounded --> grounded/arts
   grounded --> grounded/geography
   grounded --> grounded/history
   grounded --> grounded/science

   knowledge --> knowledge/arts

   knowledge --> knowledge/miscellaneous_unknown
   knowledge --> knowledge/science
   knowledge --> knowledge/technology
   knowledge/science --> animals --> birds --> black_capped_chickadee --> black_capped_chickadee-a & black_capped_chickadee-q
   knowledge/science --> astronomy --> constellations --> phoenix --> phoenix-a & phoenix-q

   black_capped_chickadee-a{attribution.txt}
   black_capped_chickadee-q{qna.yaml}
   phoenix-a{attribution.txt}
   phoenix-q{qna.yaml}
   classDef na fill:#EEE

Considerations

When working with the upstream taxonomy, there are some things to keep in mind, like topics to avoid, what LLMs are great at doing, and what LLMs will fail at. Explore this information before deciding what to contribute, especially the note about hallucinations at the end of this page.

Avoid These Topics

While the tuning process may eventually help models work with complex social topics, at this time this is an area of active research that we do not want to take lightly. Therefore, please keep your submissions clear of sensitive and complex social topics.

We are also not accepting submissions of the following content:

  • Jokes
    • We received many joke submissions at the beginning of the project. Because humor is "in the eye of the beholder" and puns require a nuance that favors native English speakers, we realized we were possibly introducing unconscious bias into the model. Working with jokes has its own challenges, and our attempts to reach consensus on generalized humor were unsuccessful. For now, we're not accepting additional submissions of jokes.
  • Poems
    • Similarly, we received many poem submissions at the beginning of the project. Poems are also "in the eye of the beholder," and we were unable to reach consensus on generalized examples without risking the same unconscious bias. For now, we're not accepting additional submissions of poems.
  • Code
    • Anything code-related, from sed and bash one-liners to YAML manifests for OpenShift or Kubernetes, Python snippets, and Java suggestions. There are models that specialize in code, and that space is not a focus for this model at this time.
  • "Guardrails" for AI
    • We expect our upstream engineering team to create these types of skills and safeguards. We appreciate the community wanting to help with this, but these safeguards involve underlying engineering decisions, and community-contributed guardrails may conflict with them.

Building Your LLM Intuition

LLMs have inherent limitations that make certain tasks extremely difficult, like doing math problems. They're great at other tasks, like creative writing. And they could be better at things like logical reasoning.

In regard to skills, consider these limitations and advantages when you're generating skills. Skills in the first and second categories are welcomed. Skills in the third category are usually borderline and may be rejected.

In regard to knowledge, providing an LLM training pipeline with knowledge helps create a basis of information that the model can learn from. With InstructLab, you can teach it to use this knowledge via the qna.yaml files. For example, you can give an LLM the entire periodic table, then in a qna.yaml add something like:

question: What is the symbol and atomic number for Chlorine?
answer: |
  The symbol for chlorine is "Cl", and the atomic number is 17.

With a few of these question-and-answer pairs, the model will learn the periodic table because it has the knowledge data.
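In a complete knowledge qna.yaml, pairs like this sit in a seed_examples list alongside metadata and a pointer to the git repository that holds your converted source documents. Below is a minimal sketch; the repository URL, commit placeholder, and schema details are illustrative, so check the knowledge contribution details guide for the current format:

version: 3
domain: chemistry
created_by: your_github_username   # hypothetical contributor name
seed_examples:
  - context: |
      Chlorine is a chemical element with the symbol "Cl" and atomic number 17.
    questions_and_answers:
      - question: |
          What is the symbol and atomic number for Chlorine?
        answer: |
          The symbol for chlorine is "Cl", and the atomic number is 17.
document_outline: |
  Overview of the chemical elements and the periodic table.
document:
  repo: https://github.com/your_org/your_knowledge_repo   # hypothetical repo of converted Markdown/PDF sources
  commit: <commit SHA of the converted source data>
  patterns:
    - periodic_table.md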

LLMs are great at

  • Brainstorming
  • Creativity
  • Connecting information
  • Cross-lingual behavior

For these, however, it's common for LLMs to already have excellent performance. Try 3-5 examples in ilab model chat to confirm a deficit in the model before you build your submission, and share the examples in your Pull Request (PR).

Skills in this category are welcomed, as refining these abilities helps us get better at the kinds of tasks where LLMs can excel.

LLMs need help with

  • Chains of reasoning
  • Analysis
  • Story plots
  • Reassembling information
  • Effective and succinct summaries

Behavior on these sorts of tasks is very difficult for the model to get right. Skills in this category are particularly welcome. Try several examples to understand the nuances of the model's ability at these tasks, and then consider using corrections to the results you get as part of your tuning process.

LLMs are not so great at

  • Math
  • Computation
  • "Turing-complete" type tasks
  • Generating only true real-world information (they're prone to hallucinations)

LLMs struggle with solving math and computation problems, and they may always struggle here: predicting tokens by probability over natural language queries is probably not the best way to solve them. That said, improving some of these foundational skills may be something this work tackles in the future, but not at this time.

Most skill submissions in these categories are likely to be rejected.

On Hallucinations

For hallucinations in particular, trying to solve this with a skill is unlikely to work. Consider contributing to the knowledge taxonomy instead to improve the model's understanding of facts.