The knowledge files
Taxonomy trees in InstructLab have leaf-node directories. These leaf nodes contain at least one file, and usually two:
- A qna.yaml file that asks and answers questions about the information in the git repository where you have stored a knowledge document.
- An attribution.txt file that includes the sources for the information used in the qna.yaml file. This file is only required when submitting knowledge to the InstructLab taxonomy repository.
The qna.yaml file
Note
The limits on context, questions, and answers are counted in tokens, not words. A token often corresponds roughly to a word, but the two are not the same, and it is the token count that applies.
In general, here are the fields in the YAML file:
Key | Type | Required | Constraints | Value | Notes |
---|---|---|---|---|---|
version | integer | Y | - | 3 | The taxonomy schema version used in the qna.yaml file. Defined in instructlab/schema. |
created_by | string | Y | no spaces | Your GitHub username (for the upstream taxonomy) or your name with no spaces (for general InstructLab use) | - |
domain | string | Y | - | Knowledge sub-category | The knowledge domain, which is used in prompts to the teacher model during synthetic data generation. The domain should be brief, such as the title of a textbook chapter or section. |
seed_examples | array | Y | at least 5 sets | null | A collection of questions and answers with context from the knowledge document that InstructLab uses to generate data synthetically. |
context | string | Y | < 500 tokens | A chunk of the knowledge document showing off the different unique content to help guide the teacher model. If the knowledge documents have only text, all context would be text. If the knowledge documents have tables or other content formats, ensure samples of those formats are all used. | This should be a copy-paste from the Markdown version of your document. |
questions_and_answers | array | Y | at least 3 pairs per context | null | A collection of questions and answers. |
question | string | Y | < 250 tokens | A question related to and grounded in the relevant context | Questions are things you'd expect someone to ask the model based on the context given. This will be used for synthetic data generation. |
answer | string | Y | < 250 tokens | An answer for the question, longer than a one-word or one-number answer | Answers are what you'd like the model to give as an answer. The model will not always give this exact answer. |
document_outline | string | Y | - | A brief title-like summary of the document. This provides the context specific to each document chunk. It should be as specific as you can make it. For example, "Acme Company 2023 Financial Report" rather than "Financial report." | - |
document | object | Y | - | null | The collection of data for the knowledge document. |
repo | string | Y | a git URL | The URL (with a .git suffix) that identifies your git repo where you've stored your knowledge documents | - |
commit | string | Y | full commit hash | A SHA1 full commit hash that corresponds to the document in the repo | This hash must be exactly where the system can find the document. |
patterns | array | Y | *.md, *.pdf | A list of glob patterns specifying the files in the repo. | Any glob pattern that starts with * must be quoted due to YAML rules. Currently, the system accepts .md and .pdf files. |
version
The version field is the version of the schema that is in use. Currently, the value here should be 3.
domain
The domain field helps the synthetic data generation (SDG) process by identifying which specialized area the provided knowledge covers.
created_by
The created_by field defines the user who submitted the knowledge. If you're working upstream, it would be your GitHub username. If you're working on your own taxonomy, it would be some kind of identifier with no spaces.
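As a minimal sketch, the three top-level scalar fields described so far could open a qna.yaml file like this. The values are taken from the full example later on this page and are only samples; substitute your own domain and identifier.

```yaml
# Schema version of this qna.yaml file
version: 3
# Brief knowledge domain, used in prompts to the teacher model
domain: astronomy
# GitHub username (upstream) or another identifier with no spaces
created_by: juliadenham
```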
seed_examples
The seed_examples key has no value on the same line because it is an array. An array in YAML is a collection of other values, and each item is written on the subsequent lines, indented and introduced by a dash.
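The shape of the array can be sketched as follows. The passages, questions, and answers here are hypothetical placeholders, and the sketch is trimmed to two items with one pair each; the table above calls for at least five seed examples with at least three pairs per context.

```yaml
seed_examples:
  # First item: one context plus its question-and-answer pairs
  - context: |
      First passage copied from the knowledge document.
    questions_and_answers:
      - question: A question about the first passage?
        answer: An answer grounded in the first passage.
  # Second item, written the same way
  - context: |
      Second passage copied from the knowledge document.
    questions_and_answers:
      - question: A question about the second passage?
        answer: An answer grounded in the second passage.
```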
context
The context field is a snippet from the knowledge document that grounds the questions and answers that follow it. It should be directly copied and pasted from the knowledge document as found in your knowledge repository.
A qna.yaml file contains more than one context field, one for each seed example.
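For illustration, here is a shortened excerpt of the first context from the full example later on this page. The pipe character after the key starts a YAML literal block scalar, which keeps the line breaks and Markdown markup of the pasted text as they are.

```yaml
seed_examples:
  - context: |
      **Phoenix** is a minor constellation in the southern sky. Named after the mythical
      phoenix, it was first depicted on a celestial atlas by Johann Bayer in his 1603
      *Uranometria*.
```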
questions_and_answers
The questions_and_answers field starts another array, nested inside each seed example. Each item in it is one question and answer pair.
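A sketch of the nesting, using the three pairs from the first seed example in the full example on this page. The context is cut down to one sentence here only to keep the fragment short.

```yaml
seed_examples:
  - context: |
      **Phoenix** is a minor constellation in the southern sky.
    # One array item per question-and-answer pair
    questions_and_answers:
      - question: |
          What is the Phoenix constellation?
        answer: |
          Phoenix is a minor constellation in the southern sky.
      - question: |
          Who charted the Phoenix constellation?
        answer: |
          The Phoenix constellation was charted by French explorer and
          astronomer Nicolas Louis de Lacaille.
      - question: |
          How far does the Phoenix constellation stretch?
        answer: |
          The Phoenix constellation stretches from roughly −39° to −57°
          declination, and from 23.5h to 2.5h of right ascension.
```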
question
The question field is a sample question that the teacher model can use to train the student model during the synthetic data generation process.
There will be more than one question field in the document.
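The full example on this page writes questions in two ways, and both are ordinary YAML strings. The fragment below shows them side by side; it is a sketch of the notation, not a statement about which form the tooling prefers.

```yaml
questions_and_answers:
  # Block style: the text sits on the indented lines after the pipe,
  # and the line break inside it is preserved
  - question: |
      What is the brightest star in the Phoenix constellation
      called?
    answer: |
      Alpha Phoenicis or Ankaa is the brightest star in the Phoenix
      Constellation.
  # Inline style: a short question on the same line as the key
  - question: Where did the Phoenix constellation first appear?
    answer: |
      The Phoenix constellation first appeared on a 35-cm diameter
      celestial globe published in 1597 (or 1598) in Amsterdam by
      Plancius with Jodocus Hondius.
```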
answer
The answer field is a sample expected answer for a question that the teacher model can use to train the student model during the synthetic data generation process.
There will be more than one answer field in the document.
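The table above asks for answers longer than a single word or number. One pair from the full example on this page illustrates the idea: the question could be answered with a bare "Yes", but the sample answer restates the supporting facts from the context.

```yaml
questions_and_answers:
  - question: |
      Are WASP-4 and WASP-5 solar-type yellow stars?
    # A full sentence grounded in the context, rather than just "Yes"
    answer: |
      Yes, WASP-4 and WASP-5 are solar-type yellow stars around 1000 light
      years distant and of 13th magnitude, each with a single planet
      larger than Jupiter.
```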
document_outline
The document_outline field is a short description of the knowledge document's topic.
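As an illustration, the full example on this page uses the following outline. It is a top-level key, written at the same indentation as seed_examples.

```yaml
# Title-like summary of the knowledge document; be as specific as you can
document_outline: |
  Information about the Phoenix Constellation including the
  history, characteristics, and features of the stars in the constellation.
```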
document
The document field starts a nested object (a YAML mapping) rather than an array. It holds the repo, commit, and patterns fields that describe where the knowledge document is stored.
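A sketch of the nested object, with the values copied from the full example on this page. The repository URL and commit hash belong to that example; replace them with those of your own repository.

```yaml
document:
  # Git repository that stores the knowledge documents
  repo: https://github.com/juliadenham/Summit_knowledge
  # Full commit hash that pins the document version
  commit: 0a1f2672b9b90582e6115333e3ed62fd628f1c0f
  # List of file names or glob patterns to read from the repository
  patterns:
    - phoenix_constellation.md
```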
repo
As InstructLab currently requires a git repository, you would provide an address to a git repository here in the repo field. That repository may be a local one or one hosted on version control providers like GitHub or GitLab.
commit
The commit field holds the full git commit hash at which the exact version of your knowledge document can be found. This field allows you to pin a document version.
patterns
The patterns field indicates what kind of files the SDG process should use. You can give an exact name, like phoenix_constellation.md, or a glob pattern, like *.md. This is a list of patterns, so there may be more than one.
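The quoting rule from the table above can be sketched as follows. In YAML, an unquoted value that begins with * is read as an alias reference, so a glob that starts with * needs quotes, while an exact file name does not. The combination of entries here is only illustrative, and the fragment omits the repo and commit fields.

```yaml
document:
  patterns:
    # Exact file name: no quotes needed
    - phoenix_constellation.md
    # Glob starting with *: quoted so YAML treats it as a plain string
    - "*.pdf"
```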
Example of a knowledge submission
You can review a full example of a knowledge submission for the upstream taxonomy in this commit in the taxonomy repository.
```yaml
version: 3
domain: astronomy
created_by: juliadenham
seed_examples:
  - context: |
      **Phoenix** is a minor constellation in the southern sky. Named after the mythical
      phoenix, it was first depicted on a celestial atlas by Johann Bayer in his 1603
      *Uranometria*. The French explorer and astronomer Nicolas Louis de
      Lacaille charted the brighter stars and gave their Bayer designations
      in 1756. The constellation stretches from roughly −39 degrees to −57 degrees
      declination, and from 23.5h to 2.5h of right ascension. The constellations Phoenix,
      Grus, Pavo, are known as the Southern Birds.
    questions_and_answers:
      - question: |
          What is the Phoenix constellation?
        answer: |
          Phoenix is a minor constellation in the southern sky.
      - question: |
          Who charted the Phoenix constellation?
        answer: |
          The Phoenix constellation was charted by French explorer and
          astronomer Nicolas Louis de Lacaille.
      - question: |
          How far does the Phoenix constellation stretch?
        answer: |
          The Phoenix constellation stretches from roughly −39° to −57°
          declination, and from 23.5h to 2.5h of right ascension.
  - context: |
      Phoenix was the largest of the 12 constellations established by Petrus
      Plancius from the observations of Pieter Dirkszoon Keyser and Frederick de
      Houtman. It first appeared on a 35cm diameter celestial globe published
      in 1597 (or 1598) in Amsterdam by Plancius with Jodocus Hondius. The first
      depiction of this constellation in a celestial atlas was in Johann Bayer's
      *Uranometria* of 1603. De Houtman included it in his southern star catalog
      the same year under the Dutch name *Den voghel Fenicx*, "The Bird Phoenix",
      symbolising the phoenix of classical mythology. One name of the brightest star Alpha
      Phoenicis—Ankaa—is derived from the Arabic: العنقاء, romanized: al-‘anqā’,
      lit. 'the phoenix', and was coined sometime after 1800 in relation to the constellation.
    questions_and_answers:
      - question: |
          What is the brightest star in the Phoenix constellation
          called?
        answer: |
          Alpha Phoenicis or Ankaa is the brightest star in the Phoenix
          Constellation.
      - question: Where did the Phoenix constellation first appear?
        answer: |
          The Phoenix constellation first appeared on a 35-cm diameter
          celestial globe published in 1597 (or 1598) in Amsterdam by
          Plancius with Jodocus Hondius.
      - question: |
          What does "The Bird Phoenix" symbolize?
        answer: |
          "The Bird Phoenix" symbolizes the phoenix of classical mythology.
  - context: |
      Phoenix is a small constellation bordered by Fornax and Sculptor to the north,
      Grus to the west, Tucana to the south, touching on the corner of Hydrus to the
      south, and Eridanus to the east and southeast. The bright star Achernar is
      nearby. The three-letter abbreviation for the constellation, as adopted by the
      International Astronomical Union in 1922, is "Phe". The official constellation
      boundaries, as set by Belgian astronomer Eugène Delporte in 1930,
      are defined by a polygon of 10 segments. In the equatorial coordinate system, the right
      ascension coordinates of these borders lie between 23h 26.5m and 02h 25.0m,
      while the declination coordinates are between −39.31° and −57.84°. This means it remains
      below the horizon to anyone living north of the 40th parallel in the Northern
      Hemisphere, and remains low in the sky for anyone living north of the equator.
      It is most visible from locations such as Australia and South Africa during
      late Southern Hemisphere spring. Most of the constellation lies within, and
      can be located by, forming a triangle of the bright stars Achernar, Fomalhaut
      and Beta Ceti—Ankaa lies roughly in the centre of this.
    questions_and_answers:
      - question: What are the characteristics of the Phoenix constellation?
        answer: |
          Phoenix is a small constellation bordered by Fornax and Sculptor to
          the north, Grus to the west, Tucana to the south, touching on the
          corner of Hydrus to the south, and Eridanus to the east and southeast.
          The bright star Achernar is nearby.
      - question: |
          When is the Phoenix constellation most visible?
        answer: |
          Phoenix is most visible from locations such as Australia and
          South Africa during late Southern Hemisphere spring.
      - question: |
          What are the Phoenix Constellation boundaries?
        answer: |
          The official constellation boundaries for Phoenix, as set by Belgian
          astronomer Eugène Delporte in 1930, are defined by a polygon of 10
          segments.
  - context: |
      Ten stars have been found to have planets to date, and four planetary
      systems have been discovered with the SuperWASP project. HD 142 is a yellow
      giant that has an apparent magnitude of 5.7, and has a planet HD 142b 1.36
      times the mass of Jupiter which orbits every 328 days. HD 2039 is a yellow
      subgiant with an apparent magnitude of 9.0 around 330 light years away which
      has a planet HD 2039 b six times the mass of Jupiter. WASP-18 is a star of
      magnitude 9.29 which was discovered to have a hot Jupiter-like planet taking
      less than a day to orbit the star. The planet is suspected to be causing WASP-18 to
      appear older than it really is. WASP-4 and WASP-5 are solar-type yellow stars around 1000
      light years distant and of 13th magnitude, each with a single planet larger
      than Jupiter. WASP-29 is an orange dwarf of spectral type K4V and visual magnitude
      11.3, which has a planetary companion of similar size and mass to Saturn. The planet
      completes an orbit every 3.9 days.
    questions_and_answers:
      - question: In the Phoenix constellation, how many stars have planets?
        answer: |
          In the Phoenix constellation, ten stars have been found to have
          planets to date, and four planetary systems have been discovered
          with the SuperWASP project.
      - question: |
          What is HD 142?
        answer: |
          HD 142 is a yellow giant that has an apparent magnitude of 5.7, and
          has a planet (HD 142 b) 1.36 times the mass of Jupiter which
          orbits every 328 days.
      - question: |
          Are WASP-4 and WASP-5 solar-type yellow stars?
        answer: |
          Yes, WASP-4 and WASP-5 are solar-type yellow stars around 1000 light
          years distant and of 13th magnitude, each with a single planet
          larger than Jupiter.
  - context: |
      The constellation does not lie on the galactic plane of the Milky Way, and there
      are no prominent star clusters. NGC 625 is a dwarf irregular galaxy of apparent magnitude 11.0
      and lying some 12.7 million light years distant. Only 24000 light years in
      diameter, it is an outlying member of the Sculptor Group. NGC 625 is
      thought to have been involved in a collision and is experiencing a burst
      of active star formation. NGC 37 is a lenticular galaxy of apparent magnitude
      14.66. It is approximately 42 kiloparsecs in diameter and about 12.9 billion years old.
      Robert's Quartet (composed of the irregular galaxy NGC 87, and three spiral
      galaxies NGC 88, NGC 89 and NGC 92) is a group of
      four galaxies located around 160 million light-years away which are in the process of colliding
      and merging. They are within a circle of radius of 1.6 arcmin, corresponding to about
      75,000 light-years. Located in the galaxy ESO 243-49 is HLX-1, an intermediate-mass
      black hole—the first one of its kind identified.
      It is thought to be a remnant of a dwarf galaxy that was absorbed in a collision
      with ESO 243-49. Before its discovery, this class of black hole was only hypothesized.
    questions_and_answers:
      - question: |
          Is the Phoenix Constellation part of the Milky Way?
        answer: |
          The Phoenix constellation does not lie on the galactic plane of
          the Milky Way, and there are no prominent star clusters.
      - question: |
          How many light years away is NGC 625?
        answer: |
          NGC 625 lies some 12.7 million light years away and is an outlying
          member of the Sculptor Group.
      - question: |
          What is Robert's Quartet composed of?
        answer: |
          Robert's Quartet is composed of the irregular galaxy NGC 87,
          and three spiral galaxies NGC 88, NGC 89 and NGC 92.
document_outline: |
  Information about the Phoenix Constellation including the
  history, characteristics, and features of the stars in the constellation.
document:
  repo: https://github.com/juliadenham/Summit_knowledge
  commit: 0a1f2672b9b90582e6115333e3ed62fd628f1c0f
  patterns:
    - phoenix_constellation.md
```
Example of an attribution.txt file
```text
Title of work: Phoenix (constellation)
Link to work: https://en.wikipedia.org/wiki/Phoenix_(constellation)
Revision: https://en.wikipedia.org/w/index.php?title=Phoenix_(constellation)&oldid=1237187773
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors
```
For more information on what to include in your attribution.txt file for an upstream submission, reference the upstream contribution document on knowledge.
Example of a knowledge document file
The previous knowledge example references one Markdown file: phoenix_constellation.md. You can also add multiple Markdown files or PDFs for knowledge contributions.
Here's what a snippet of phoenix_constellation.md might look like in your Git repository.
Example of a .md file
```markdown
# Phoenix (constellation)

**Phoenix** is a minor constellation in the southern sky. Named after the mythical
phoenix, it was first depicted on a celestial atlas by Johann Bayer in his 1603
*Uranometria*. The French explorer and astronomer Nicolas Louis de
Lacaille charted the brighter stars and gave their Bayer designations
in 1756. The constellation stretches from roughly −39 degrees to −57 degrees
declination, and from 23.5h to 2.5h of right ascension. The constellations Phoenix,
Grus, Pavo, are known as the Southern Birds.

The brightest star, Alpha Phoenicis, is named Ankaa, an Arabic word meaning 'the Phoenix'.
It is an orange giant of apparent magnitude 2.4. Next is Beta Phoenicis, actually a
binary system composed of two yellow giants with a combined apparent magnitude of 3.3. Nu
Phoenicis has a dust disk, while the constellation has ten star systems with known planets and the recently
discovered galaxy clusters El Gordo and the Phoenix
Cluster—located 7.2 and 5.7 billion light years away respectively, two of the largest objects in the visible
universe. Phoenix is the radiant of two annual meteor showers: the Phoenicids in December, and the July
Phoenicids.
```
You can organize the knowledge files in your repository however you want, as long as the document section of the qna.yaml file points to the correct file.