InstructLab FAQ¶
Last updated: October 2024
[!TIP] AI is a rapidly developing field with a lot of specialized terminology. You may wish to read through the glossary before getting started with the documentation.
Table of Contents¶
- Document summary
- General FAQ
- What is InstructLab?
- What is LAB?
- How does InstructLab work?
- What are the goals of the InstructLab project?
- How can I contribute?
- I'm having problems with the `ilab` CLI tool. What should I do?
- Why should I contribute?
- What large language models (LLMs) am I contributing to through the InstructLab project?
- What is Merlinite-7b?
- What is Granite-7b-lab?
- What is a “skill”?
- What is “knowledge”?
- Is the project looking for certain types of skill contributions?
- What are the acceptance criteria for a skills submission?
- What are the acceptance criteria for a knowledge submission?
- How can I submit a skill or knowledge?
- What happens after you submit a pull request?
- How are submissions reviewed?
- How long will it take for my pull request to be reviewed?
- If my pull request is accepted, how long will it take for my changes to appear in the next model update?
- What is the software license for InstructLab?
- Am I required to license code submissions to InstructLab under the Apache 2.0 license?
- My contribution requires submitting data along with code. What data is permissible to include?
- Where can I download updated models of InstructLab?
- I have a question about the project. Where should I go?
- What are the software and hardware requirements for using InstructLab?
- Glossary
- Additional Resources
Document summary¶
This page serves as a comprehensive FAQ for the InstructLab project, detailing how it works, how to begin contributing, and the goals behind the project. Key information includes:
- InstructLab Overview: This open source project allows users to interact with and train the Granite-7b community AI Large Language Model (LLM) by contributing skills and knowledge.
- LAB Method: A synthetic data-based tuning method for LLMs consisting of a taxonomy-driven data curation process, a synthetic data generator, and two-phased training with replay buffers. Learn more in the Large-Scale Alignment for ChatBots paper outlining the methodology.
- Contribution Process: Contributors can add skills or knowledge to the LLM by creating YAML files and testing changes locally before submitting a pull request to InstructLab’s GitHub taxonomy repository. Contributors may also contribute to the InstructLab tooling and library codebases.
- Project Goal: To democratize contributions to AI and LLMs.
Documentation disclaimer¶
There are currently three repositories that contain documentation crucial to getting users started with the project:
- Community. This repository shares InstructLab's activity and collaboration details across the community and includes the most current information about the project, communication channels, and people processes.
- `ilab` command-line interface (CLI) tool. This repository is responsible for the `ilab` CLI tool. It provides information about how to download the `ilab` CLI and how to contribute to the `ilab` CLI tool, among other topics.
- Taxonomy Tree. This repository is responsible for the taxonomy tree that allows you to create models tuned with your data. It provides information about what skills and knowledge are, how to create a pull request to contribute to the AI model, and expectations for pull request review.
As this project grows, documentation and its organization will change. Members of this project will be made aware of significant changes and updates made to documentation.
Unless otherwise noted, all documentation for the InstructLab project is licensed under the CC-BY-4.0 license.
General FAQ¶
What is InstructLab?¶
InstructLab (Large-scale Alignment for chatBots) is an open source initiative that provides a platform for easy engagement with AI Large Language Models (LLMs) by using the `ilab` command-line interface (CLI) tool. You can use the CLI to work with Granite-7b to test new skills and knowledge, for example, asking it to write a meeting notes summary or answer a question about a particular subject. Users can then augment the LLM’s capabilities by submitting the skills and knowledge they have tested to the project’s taxonomy repository on GitHub by creating a pull request. This approach encourages community-driven enhancements without the need for complex model forking or fine-tuning of the model, promoting rapid development through collaborative contributions.
[!IMPORTANT] Building models locally on consumer-grade hardware using quantized models with the `ilab` CLI is not meant for production-grade model creation. The `ilab` desktop configuration is meant for testing single knowledge or skill contributions on top of an already trained and quantized model. It is not for building a complete, production-grade model. For the full InstructLab production-grade model build process, multi-GPU hardware configurations are required, and the student model must be an untrained, unquantized base model.
What is LAB?¶
LAB (Large-scale Alignment for chatBots) is a novel synthetic data-based alignment tuning method for LLMs from IBM Research. It consists of three components:
- A taxonomy-driven data curation process
- A large-scale synthetic data generator
- Multi-phased training with replay buffers
The LAB approach allows incrementally adding new knowledge and skills to an already pre-trained model without catastrophic forgetting.
More information about the LAB method can be found on the Hugging Face project page.
How does InstructLab work?¶
InstructLab is driven by taxonomies and works by empowering users to add new skills and knowledge to a pre-trained LLM.
What are the goals of the InstructLab project?¶
The goal of the InstructLab project is to democratize contributions to AI and LLMs. There are two approaches to achieving this goal in our community:
- Enabling collaborative contribution to a large language model (LLM) through the project's taxonomy repository. When users contribute to this repository, the project resynthesizes its open source training data. Our community Granite-based model is then retrained, ensuring that community contributions are integrated while enriching the model’s capabilities over time.
- Providing open source tooling to enable the InstructLab methodology and enabling community contributions to this toolset in accordance with open source project principles. This tooling includes the InstructLab core engine and CLI as well as libraries such as the sdg, training, and evaluation libraries.
How can I contribute?¶
You can begin your contribution journey by reading over the Contributing guide and joining the Community Discord Server or the Community Slack Channel.
When you're ready to start contributing, you can follow the Getting Started guide. This guide shows you how to:
- Install the `ilab` CLI.
- Deploy the LLM locally.
- Add skills or knowledge and train the local LLM with your data.
- Create a pull request and add your information to the InstructLab taxonomy.
- Get reviews on your pull requests.
I'm having problems with the `ilab` CLI tool. What should I do?¶
A list of common problems associated with downloading the `ilab` CLI tool can be found in the CLI repository's discussion board.
Why should I contribute?¶
InstructLab is designed to enable collaboration around the InstructLab Granite models, open source licensed LLMs that contributors can access through Hugging Face. Participating is an opportunity to contribute to open source AI regardless of technical background.
When contributors write an addition to the existing taxonomy, make a pull request, and get it reviewed and merged, their changes are rolled out in the next build. This update strategy speeds up improvements to the model’s capabilities and allows contributors to see the impact that they have made on the model much sooner than with other LLMs.
What large language models (LLMs) am I contributing to through the InstructLab project?¶
Contributions to the InstructLab project include fine-tuning Granite-7b, an open-source licensed LLM. Contributors have direct access to the model they are improving through Hugging Face.
What is Merlinite-7b?¶
Merlinite-7b is a Mistral-7b derivative model fine-tuned with the LAB (Large-scale Alignment for chatBots) method using Mixtral-8x7b-Instruct as a teacher model.
More information about the Merlinite-7b can be found on the Hugging Face project page.
What is Granite-7b-lab?¶
Granite-7b-lab is a model that was built from scratch by IBM and fine-tuned with the LAB (Large-scale Alignment for chatBots) method.
More information about Granite-7b-lab can be found on the Hugging Face project page.
What is a “skill”?¶
In the context of InstructLab, a skill is a capability domain submitted by a contributor intending to train the AI model on the submitted information. In other words, when you submit a skill, you teach the AI model how to do something.
InstructLab skills are broken down into two main categories, compositional and foundational:
- Compositional skills. Compositional, or performative, skills allow AI models to perform specific tasks or functions. With InstructLab, there are two types of compositional skills:
  - Freeform compositional skills are performative skills that do not require additional context. For example, to train an AI model to write a poem, you would provide examples of poems.
  - Grounded compositional skills are performative skills that require additional context. One example is how an AI model reads the value of a cell in a table layout. To create the grounded skill to read a table formatted in Markdown, the additional context might be an example table layout.
- Foundational skills. Foundational skills are skills like math, reasoning, and coding. Note: Foundational skills are not currently being accepted.
Skills are written in a YAML file and submitted to the InstructLab upstream project for review. See the Skills: YAML examples for different types of examples.
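As a rough illustration of the freeform and grounded distinction described above, the following Python sketch models a single skill example as a question and answer with an optional context. The class, field, and function names are hypothetical and chosen only for illustration, and the table content is made up; see the Skills: YAML examples for the actual file layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeedExample:
    """One illustrative skill example (hypothetical structure, not the real schema)."""
    question: str
    answer: str
    context: Optional[str] = None  # only grounded examples carry extra context


def skill_kind(example: SeedExample) -> str:
    """Classify an example as grounded if it carries context, otherwise freeform."""
    return "grounded" if example.context else "freeform"


# Freeform: the example stands on its own, such as a request for a poem.
poem = SeedExample(
    question="Write a two-line poem about autumn.",
    answer="Leaves drift down in amber light,\nthe year exhales into the night.",
)

# Grounded: the question only makes sense alongside the Markdown table.
table = SeedExample(
    context="| Item | Price |\n|---|---|\n| Apple | 2 |\n| Pear | 3 |",
    question="What is the price of a pear in the table?",
    answer="3",
)

print(skill_kind(poem))
print(skill_kind(table))
```

The only structural difference in this sketch is the presence of `context`, which mirrors the idea that a grounded skill needs supporting material while a freeform skill does not.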
What is “knowledge”?¶
Knowledge consists of data and facts. When creating knowledge for an AI model, you are providing it with additional data and information to answer questions more accurately. Whereas skills are the information that trains an AI model on how to do something, knowledge is based on the AI model’s ability to answer questions that involve facts, data, or references.
Like skills, knowledge submissions are submitted in YAML format to the InstructLab upstream project for review. See the Knowledge: YAML examples for different types of examples.
Is the project looking for certain types of skill contributions?¶
Currently, InstructLab only accepts compositional (freeform and grounded) skills and knowledge. However, any type of freeform or grounded skill can be submitted. Some skills might not be added to the taxonomy repository for reasons such as duplication, submitting a skill that the model already does well, or submitting a controversial skill.
Foundational skills are not currently being accepted.
For a list of accepted skills, see Accepted Skills.
What are the acceptance criteria for a skills submission?¶
Skills should seek to add capabilities or a knowledge domain to the AI model; in other words, a skills submission should teach the AI model how to do something instead of providing information about something. A good skills submission might address something that the AI model does poorly and seek to enhance its ability to execute that capability better. For a list of commonly accepted skills, see Accepted Skills.
Skills submissions that are unlikely to be accepted include submitting a knowledge request instead of a skills request, submitting a skill that the model already does well, submitting a controversial skill, or submitting foundational skills such as pure math or coding, which are not currently being accepted. For a list of skills to avoid submitting, see Skills to Avoid.
What are the acceptance criteria for a knowledge submission?¶
Requirements for knowledge submissions can be found in the Getting Started with Knowledge Contributions guide.
How can I submit a skill or knowledge?¶
For information about submitting a skill after you have identified a gap, see the Ways to contribute guide.
What happens after you submit a pull request?¶
After a pull request is submitted, both the Taxonomy Triage team and the Taxonomy Approvers team review it to ensure that the submission is relevant, actionable, and has all of the required information needed to be a valuable addition to the AI model. Triagers might use labels to manage the state of the submitted pull request and provide feedback and comments to help improve the submission. After the pull request is approved, a Taxonomy Approver merges it.
More information regarding basic review questions, subjective review questions, labels, and the reasons for approval, further review requirements, or rejection can be found on the Triaging contributions page of the GitHub repository.
How are submissions reviewed?¶
For code review, the project maintainers use LGTM (Looks Good to Me) in comments on the code review to indicate acceptance. A change requires LGTMs from two of the maintainers.
For skills and knowledge PRs, your PR will be checked to ensure it is relevant, actionable, and has all the information necessary for the approval team to review and merge the PR. The Triage team will use labels to manage the state and action of PRs as well as provide feedback to contributors based upon the following review guidelines:
- Does the PR have the pull request template information filled out?
- Did all the PR checks pass?
- Does the skill have three or more examples?
- Are the YAML fields correct?
- Is the content free of personally identifiable information (PII)?
- Does this content include anything documented in the project's Avoid these Topics guidelines?
- Does it adhere to the Code of Conduct guidelines?
- Was a response clearly generated by the LLM?
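Some of the mechanical items in this checklist can be approximated locally before opening a pull request. The sketch below is a hypothetical pre-check, not the project's actual PR checks: the `question` and `answer` field names are assumptions, and the email pattern is a deliberately crude stand-in for real PII screening.

```python
import re
from typing import Dict, List

MIN_EXAMPLES = 3  # from the guideline "three or more examples"

# Deliberately simple email pattern; real PII screening needs far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def precheck(examples: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(
            f"only {len(examples)} example(s); need {MIN_EXAMPLES} or more"
        )
    for i, ex in enumerate(examples, start=1):
        # Hypothetical required fields for each example.
        for field in ("question", "answer"):
            if not ex.get(field, "").strip():
                problems.append(f"example {i}: missing or empty '{field}'")
        if any(EMAIL_RE.search(value) for value in ex.values()):
            problems.append(f"example {i}: possible email address found")
    return problems


draft = [
    {"question": "Summarize these notes.", "answer": "The team agreed on a plan."},
    {"question": "Who do I contact?", "answer": "Write to jane@example.com."},
]
for problem in precheck(draft):
    print(problem)
```

A check like this only catches obvious omissions. The subjective questions, such as whether the content touches an avoided topic, still require human review.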
How long will it take for my pull request to be reviewed?¶
Due to the large number of contributions currently being received, it is difficult to provide an exact timeline for reviewing your pull request.
If my pull request is accepted, how long will it take for my changes to appear in the next model update?¶
After a pull request is accepted, the changes are regularly incorporated into InstructLab.
What is the software license for InstructLab?¶
The InstructLab project and the Granite-7b models are distributed under the Apache License, Version 2.0.
What is the content license for InstructLab documentation?¶
Unless otherwise specified, all documentation for InstructLab is licensed under the CC-BY-4.0 license from Creative Commons.
Am I required to license code submissions to InstructLab under the Apache 2.0 license?¶
Yes. Code contributions to the InstructLab project are subject to the terms and conditions under the Apache 2.0 license.
My contribution requires submitting data along with code. What data is permissible to include?¶
It is recommended that third-party content be licensed with an open data license that does not restrict commercial use or the creation of derivative works, including the following licenses:
- CC0
- CDLA-Permissive
- CC-BY-4.0
- CC-BY-SA-4.0
- Apache 2.0
- MIT
Do submissions to the project require a contributor license agreement of some kind?¶
The InstructLab project follows the same approach (the Developer's Certificate of Origin 1.1 (DCO)) that the Linux Kernel community uses to manage code contributions. Unless the file says otherwise for this project, the relevant open source license is the Apache License, Version 2.0. When submitting a patch for review, you must include a sign-off statement in the commit message. See the "Legal" section of the Contributing document.
You can find more information about useful tools for managing DCO sign-off in our Community Contributions Guide.
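A sign-off is a `Signed-off-by:` line in the commit message, which `git commit -s` adds automatically. As a minimal sketch of what a DCO check looks for, the hypothetical helper below tests whether a commit message contains such a line. It is not the project's actual tooling, and the name and address in the example are placeholders.

```python
import re

# Matches a line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGN_OFF_RE = re.compile(
    r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$",
    re.MULTILINE,
)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the message contains a well-formed sign-off line."""
    return bool(SIGN_OFF_RE.search(commit_message))


signed = "Fix typo in FAQ\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
unsigned = "Fix typo in FAQ\n"

print(has_sign_off(signed))
print(has_sign_off(unsigned))
```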
Where can I download updated models of InstructLab?¶
The latest version of InstructLab can be downloaded using the `ilab download` CLI command, as well as from InstructLab on Hugging Face.
I have a question about the project. Where should I go?¶
Currently, the best method for communicating with peers and project maintainers is in the Community Discord/Slack servers. Visit our InstructLab Slack Workspace Guide for information on how to join.
See our community collaboration page for information on our mailing list, meetings, and other ways of interacting with the community.
What are the software and hardware requirements for using InstructLab?¶
Local training is the most hardware-intensive part of this process. Your hardware determines how long training the model locally will take.
To run and train InstructLab locally, you must meet the following requirements:
- A supported operating system:
  - A Linux-based operating system
  - An Apple Silicon M1, M2, or M3 system
  - A Windows system with WSL (Windows Subsystem for Linux)
- Python 3.9 or later, including the development headers
- Approximately 10 GB of free disk space to get through the `ilab generate` step
- Approximately 60 GB of free disk space to run the entire process locally on Apple hardware
- About 32 GB of RAM
[!IMPORTANT] Some of our community members have reported challenges in working with Windows and WSL for InstructLab support. If possible, you may want to work with Linux or Mac for the smoothest experience. We are continuing to work on improvements across our supported operating systems for the local desktop InstructLab tooling experience.
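The Python version and disk space figures listed above can be checked with a short script before installing anything. This is a minimal sketch using only the Python standard library: the function names are hypothetical, the thresholds are the approximate figures from the list, and operating system and RAM checks are left out.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)
GENERATE_GB = 10  # approximate space needed for the `ilab generate` step
FULL_RUN_GB = 60  # approximate space for the entire process on Apple hardware


def python_ok(version=None) -> bool:
    """Return True if the given (or running) Python is 3.9 or later."""
    v = sys.version_info if version is None else version
    return tuple(v[:2]) >= MIN_PYTHON


def free_gb(path: str = ".") -> float:
    """Free disk space at `path` in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def disk_verdict(free: float) -> str:
    """Map free space in GB to the approximate thresholds listed above."""
    if free >= FULL_RUN_GB:
        return "enough for the entire local process"
    if free >= GENERATE_GB:
        return "enough for the generate step only"
    return "below the generate step requirement"


print("Python OK:", python_ok())
print("Disk:", disk_verdict(free_gb()))
```

Because the disk figures are approximate, treat the verdict as a hint rather than a guarantee that a local run will succeed.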
Glossary¶
Term | Explanation | Additional Reference |
---|---|---|
Checkpoints | Snapshots during training. They are scored individually and the best is selected. | N/A |
CUDA | “Compute Unified Device Architecture” - A parallel computing platform and API for general computing on GPUs by NVIDIA. | Ref |
DeepSpeed | Deep learning optimization library for PyTorch | Ref |
Granite | Open source licensed LLM released by IBM | Ref |
FSDP | “Fully Sharded Data Parallel” - A wrapper for sharding module parameters across data parallel workers, used within PyTorch | Ref |
LAB | “Large-Scale Alignment for ChatBots” | Ref |
Labradorite | LAB-enhanced Llama2 model | N/A |
Llama | LLM released by Meta | N/A |
Llama CPP | A C++ library for inference of Llama models, similar to vLLM | Ref |
LoRA | “Low-Rank Adaptation” - Fine-tuning algorithm used within PyTorch | Ref |
Merlinite | LAB-enhanced Mistral model developed by IBM | N/A |
Mistral | LLM released by Mistral AI | N/A |
Mixtral | LLM using Mixture of Experts by Mistral AI | N/A |
MMLU | “Massive Multitask Language Understanding” - An evaluation scheme used for knowledge benchmarking | Ref |
MLX | An array framework for machine learning research on Apple Silicon chips | Ref |
MPS | “Metal Performance Shaders” - A macOS hardware accelerator, similar to CUDA kernels | N/A |
MT-Bench | “Multi-turn benchmark” - An evaluation scheme used for skills benchmarking | Ref |
PEFT | “Parameter Efficient Fine-Tuning” | N/A |
PR-bench | Evaluation scheme used for skills PR benchmarking | N/A |
PR-mmlu | Evaluation scheme used for knowledge PR benchmarking | N/A |
PyTorch | Library supporting tensors and dynamic neural networks in Python with strong GPU acceleration | Ref |
QLoRA | “Quantized Low-Rank Adaptation” - Fine-tuning algorithm used within PyTorch | Ref |
Quantization | Process of reducing resource needs for a model by decreasing the range of the data type | Ref |
SDG | “Synthetic Data Generation” - The process where a model artificially generates data based on provided examples. | N/A |
vLLM | A library for LLM inference and serving, similar to Llama CPP. Provides an OpenAI-compatible API. | Ref |
Additional Resources¶
Additional resources, including the Code of Conduct, Code of Conduct Committee members, how to contribute, how to join the Discord or Slack server, and more, can be found in the following repositories:
- InstructLab Taxonomy Repository
- InstructLab Community Repository
- Discord and communication
- Slack and communication