Knowledge Guidelines

Info

The following information is if you are comfortable contributing using GitHub, a version control system primarily used for code. If you are not comfortable with this platform, you can use the InstructLab UI to submit knowledge. To learn more, head to the UI overview page.

You can create a Git repository to host your knowledge contributions anywhere (GitLab, Gerrit, etc.), but it might be favorable to create one on GitHub. At the current time, we require a GitHub username to contribute, and all work is done in GitHub.

The following instructions show you how to create a knowledge repository in GitHub and contribute to the taxonomy.

Prerequisites¶

If you are submitting to the repository directly: - You have a GitHub account - You have a forked copy of the taxonomy repository - You have verified that the model does not already know the knowledge you want to submit

If you are using the UI to submit: - You have a GitHub account - You have verified that the model does not already know the knowledge you want to submit

Note

Due to the higher volume, it will naturally take longer to receive acceptance for a knowledge contribution pull request than for a skill pull request. Smaller pull requests are simpler and require less time and effort to review.

Preparing your knowledge documents¶

You need to set up your source documents as Markdown or PDF files in a git repository. You can organize the knowledge files in your repository however you want. You just need to ensure the YAML is pointing to the correct file.

Accepted Sources of Knowledge¶

Warning

We are currently only accepting sources from this list at this time due to legal requirements to keep InstructLab open source. We prefer that you keep your submission to articles from Wikipedia at this time. Our taxonomy triage team will reject any contributions that do not match this pattern. Thanks for helping us keep InstructLab 100% open source!

These are the main knowledge domains that we are currently accepting knowledge contributions for: arts, engineering, geography, history, linguistics, mathematics, philosophy, religion, science, and technology.

Due to the open source nature of InstructLab, all content has to meet specific licensing requirements. This list has currently approved sources for knowledge. If you wish to use a different source, we need to approve it, and that means your submission will be on hold until we get legal review and approval. Please be patient!

Domain Name	Status	Notes
Wikipedia	approved	-
Project Gutenberg	approved	Pre-1927 works; public domain under US copyright law
Wikisource (library)	approved	"free library that anyone can improve"
OpenStax textbooks family of publications	approved	-
The Open Organization publications	approved	-
The Scrum Guide	approved	-
US Congress site	reviewed - manually verify	US government sources may have different licensing; a legal review will need to verify each source
US White House site	reviewed - manually verify	US government sources may have different licensing; a legal review will need to verify each source
US Senate site	reviewed - manually verify	US government sources may have different licensing; a legal review will need to verify each source
US IRS site	reviewed - manually verify	US government sources may have different licensing; a legal review will need to verify each source
NASA	reviewed - manually verify	See guidelines
Smithsonian Libraries	reviewed - manually verify	For any material marked \"No Copyright - United States" or "CC0" as described here
European Union (EU) site	reviewed - manually verify	Specifically documents submitted under "public registrars" as described here
Internet Archive	reviewed - manually verify	Pre-1927 works; public domain under US copyright law
PLOS family of open access journals	reviewed - manually verify	-
Open Practice Library	reviewed - manually verify	-
Cynefin.io wiki	reviewed - manually verify	-
The Open Education Project	reviewed - manually verify	-

Creating your own knowledge repository¶

To create a new GitHub repository, follow the GitHub documentation in Creating a new repository.

The specific steps are listed as follows:

In your GitHub profile page, navigate to the repositories tab. You will see a search bar where you can search your repositories or create a new one.
This takes you to a page titled “Create a new repository”. Create a custom name for your repository and add a README.md file. For example, “knowledge_contributions” could be a good name for your repository.
Click “Create” when you are all set.

Convert your knowledge documentation to Markdown or PDF¶

There are many online tools that can help you convert your documents to Markdown. If you are using a wiki page for your contributions, you can use pandocs to convert the documents. For Wikipedia sources on pandoc, use from: mediawiki and convert to: markdown_strict to access the proper Markdown format.

Add the Markdown or PDF file to your repository¶

To add a file to your GitHub repository, follow the GitHub documentation in Adding a file to a repository.

The specific steps are listed as follows:

Navigate to “Add files”. Click “Create new file” if you want to manually add your Markdown content. Click “Upload files” if you have a file locally to add.
Add a description and commit your changes.

Since this is your own repository, you can commit directly to the main branch.
You can then see your new content in your repository.

Important

Make a note of your commit SHA; you'll need it for your qna.yaml.

Creating your knowledge submission in GitHub¶

For knowledge submissions, we need a qna.yaml file and an attribution.txt file.

The `qna.yaml` file¶

For the current version of the taxonomy, version 3, here are the available fields:

Note

Tokens in the case of context, questions, and answers can fit to "words," but it's specifically tokens, and not words, that are the limitations.

Key	Type	Required	Constraints	Value	Notes
`version`	Y	integer	-	`3`	The taxonomy schema version used in the `qna.yaml` file. Defined in instructlab/schema
`created_by`	Y	string	-	Your GitHub username	-
`domain`	Y	string	-	Knowledge sub-category	The knowledge domain which is used in prompts to the teacher model during synthetic data generation. The domain should be brief such as the title to a textbook chapter or section.
`seed_examples`	Y	array	at least 5 sets	null	This is a collection of questions and answers with context from the knowledge document that InstructLab uses to generate data synthetically.
`context`	Y	string	< 500 tokens	A chunk of the document showing off the different unique content to help guide the teacher model. If you have only text, that's one thing, but if you have tables or other content, be sure to add that, too.	This should be a copy-paste from the Markdown version of your document
`questions_and_answers`	Y	array	at least 3 pairs per context	null	This is a collection of questions and answers.
`question`	Y	string	> 250 tokens	A question related to the grounded in the relevant context	Questions are things you'd expect someone to ask the model based on the context given. This will be used for synthetic data generation.
`answer`	Y	string	> 250 tokens	An answer for the question, longer then a one-word answer.	Answers are what you'd like the model to give as an answer. It will not be an exact answer the model always gives.
`document_outline`	Y	string	-	This provides the context specific for each document chunk; this should be as specific as you possibly can get.
`document`	Y	object	-	null	The collection of data for the knowledge document.
`repo`	Y	string	a git URL	The URL (with a `.git` suffix) that identifies your git repo where you've stored your knowledge documents	-
`commit`	Y	string	full commit hash	A SHA1 full commit hash that corresponds to the document in the repo	This hash must be exactly where the system can find the document.
`patterns`	Y	array	`.md`, `.pdf`	A list of glob patterns specifying the files in the repo.	Any glob pattern that starts with `*` must be quoted due to YAML rules. Currently, the system accepts `.md` and `.pdf` files.

Important

There must be at least 5 sets of 3 questions and 3 answers with context in every qna.yaml file. Also the "context blocks" should be as diverse and unique as possible. The goal is to get as much different information in so as the teacher LLM reads through the document it gets "inspired" by the different content.

An example file¶

To build a strong taxonomy,

Create a pull request in the taxonomy repository¶

Navigate to your forked taxonomy repository and ensure it is up-to-date.

There are a few ways you can create a pull request:

For details on the local process, check out The GitHub Workflow Guide in the Kubernetes documentation and the GitHub flow in the GitHub documentation.
For details on contributing using the GitHub webpage UI, see Contributing using the GH UI or Creating a pull request in the GitHub documentation.

Example of a directory tree¶

In the taxonomy repository, here's what the previously referenced knowledge might look like in the tree:

[...]

└── knowledge
    └── science
        ├── astronomy
        │ └── constellations
        │     └── Phoenix <=== here it is :)
        │     |    └── qna.yaml
        |     |        attribution.txt
        │     └── Orion
        │          └── qna.yaml
        |              attribution.txt
[...]

Verification¶

Here are a few things to check before seeking reviews for your contribution:

Your qna.yaml follows the proper formatting. See examples in Knowledge: YAML examples
Ensure all parameters are set. Especially the document, repo, commit and pattern keys; these parameters are specific to knowledge contributions and require more analysis.
Include an attribution.txt file for citing your sources. see For your attribution.txt file for more information.

PR Upstream Workflow¶

The following table outlines the expected timing for the PRs you have submitted. The PRs go through a few steps, and checks, but you should be able to map your label to the place that it is in.

Label	Actor	Action	Duration
-	Contributor	Submit PR	-
-	Contributor	Fix failed PR checks	-
triage-needed	Triager	Review PR, ask for changes	Days
triage-dco-requested	Contributor	Fix DCO	-
triage-requested-changes	Contributor	Make requested changes	Days
precheck-generate-ready	Triager	Run prechecks and generate	Days
community-build-ready	Backend	Model gets retrained	Weeks
	Triager	Check the numbers and PR merged or closed	-

Submissions¶

To make the qna.yaml files easier and faster for humans to read, it is recommended to specify version first, followed by task_description, then created_by, and finally seed_examples. In seed_examples, it is recommended to specify context first (if applicable), followed by question and answer.

Example qna.yaml

version: 2
task_description: <string>
created_by: <string>
seed_examples:
  - question: <string>
    answer: |
      <multi-line string>
  - context: |
      <multi-line string>
    question: <string>
    answer: |
      <multi-line string>
  # ...

Then, you create an attribution.txt file that includes the sources of your information, if any. These sources can also be self-authored sources for skills.

Fields in attribution.txt

[Link to source]
[Link to work]
[License of the work]
[Creator name]

Example of a self-authored source attribution.txt

Title of work: Customizing an order for tea
Link to work: -
License of the work: CC BY-SA-4.0
Creator names: Jean-Luc Picard

You may copy this example and replace the title of the work (your skill) and the creator name to submit a skill. The license is Creative Commons Attribution-ShareAlike 4.0 International, which is shortened to CC BY-SA-4.0.

For more information on what to include in your attribution.txt file, reference the general contribution guidelines.