The Llama Hitchhiking Guide To Local LLMs - Hackerllama
Here are some terms that are useful to know when joining the Local LLM community.
1. LocalLlama: A Reddit community of practitioners, researchers, and hackers doing all kinds of crazy
things with ML models.
2. LLM: A Large Language Model. Usually a transformer-based model with a lot of parameters…billions
or even trillions.
3. Transformer: A type of neural network architecture that is very good at language tasks. It is the basis
for most LLMs.
4. GPT: A type of transformer that is trained to predict the next token in a sentence. GPT-3 is an example
of a GPT model…who could tell??
4.1 Auto-regressive: A type of model that generates text one token at a time. It is auto-regressive
because it uses its own predictions to generate the next token. For example, the model might receive
as input “Today’s weather” and generate the next token, “is”. It will then use “Today’s weather is” as
input and generate the next token, “sunny”. It will then use “Today’s weather is sunny” as input and
generate the next token, “and”. And so on.
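Here is a minimal sketch of that loop using the transformers library ("gpt2" is just a small example model, and real code would normally call model.generate() instead of looping by hand):

```python
# Auto-regressive generation, one token at a time (greedy decoding).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Today's weather", return_tensors="pt").input_ids
for _ in range(5):
    logits = model(input_ids).logits
    next_token = logits[0, -1].argmax()  # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```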
5. Token: Models don’t understand words. They understand numbers. When we receive a sequence of
words, we convert them to numbers. Sometimes we split words into pieces, such as “tokenization” into
“token” and “ization”. This is needed because the model has a limited vocabulary. A token is the
smallest unit of language that a model can understand.
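You can see this in action with a tokenizer from the transformers library (the exact split depends on the tokenizer; "gpt2" here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', 'ization']
print(tokenizer.encode("tokenization"))    # the token ids the model actually sees
```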
6. Context length: The number of tokens that the model can use at a time. The higher the context length,
the more memory the model needs to train and the slower it is to run. E.g. Llama 2 can manage up to
4096 tokens.
6.1 LLaMA: A pre-trained model trained by Meta, shared with some research groups under private access, and then leaked. It led to an explosion of cool projects. 🦙
6.2 Llama 2: An open-access pre-trained model released by Meta. It led to another explosion of very
cool projects, and this one was not leaked! The license is not technically open-source but it’s still quite
open and permissive, even for commercial use cases. 🦙🦙
6.3 RoPE: Rotary position embeddings, a way of encoding token positions whose properties make it possible to significantly extend a model’s context length.
6.4 SuperHOT: A technique that expands the context length of RoPE-based models even further with some minimal additional training.
7. Pre-training: Training a model on a very large dataset (trillions of tokens) to learn the structure of
language. Imagine you have millions of dollars, as a good GPU-Rich. You usually scrape big datasets
from the internet and train your model on them. This is called pre-training. The idea is to end with a
model that has a strong understanding of language. This does not require labeled data! This is done
before fine-tuning. Examples of pre-trained models are GPT-3, Llama 2, and Mistral.
7.1 Mistral 7B: A pre-trained model trained by Mistral. Released via torrent.
7.2 Phi 2: A pre-trained model by Microsoft. It only has 2.7B parameters but it’s quite good for its size!
It was trained with very little data (textbooks) which shows the power of high-quality data.
7.3 transformers: a Python library to access models shared by the community. It allows you to
download pre-trained models and fine-tune them for your own needs.
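For example, a few lines are enough to download a pre-trained model and generate text with it ("gpt2" is a small example model; any causal LM from the Hub works):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Today's weather is", max_new_tokens=20)[0]["generated_text"])
```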
8. Fine-tuning: Training a model on a small (labeled) dataset to learn a specific task. This is done after pre-training. Imagine you have a few dollars, as a good fellow GPU-Poor. Rather than training a model from scratch, you pick a pre-trained (base) model and fine-tune it. You usually pick a small dataset of a few hundred to a few thousand samples and train the model on it. This is called fine-tuning. The idea is to end with a model that has a strong understanding of a specific task. For example, you can fine-tune a model with your tweets to make it generate tweets like you! (but please don’t). You can fine-tune many models on your gaming laptop! Examples of fine-tuned models are ChatGPT, Vicuna, and Mistral Instruct.
8.2 Vicuna: A cute animal that is also a fine-tuned model. It starts from LLaMA-13B and is fine-tuned on user-shared conversations with ChatGPT.
8.3 Number of parameters: Notice the -13B in point 8.2. That’s the number of parameters in a model.
Each parameter is a number (with certain precision), and is part of the model. The parameters are
learned during pre-training and fine-tuning to minimize the error.
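The parameter count gives a quick back-of-the-envelope estimate of how much memory you need just to store the weights (a rough rule of thumb; it ignores activations and other overhead):

```python
params = 13e9            # a 13B-parameter model
bytes_per_param = 2      # 16-bit (fp16/bf16) precision
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 26 GB just for the weights
```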
9. Prompt: A few words that you give to the model to start generating text. For example, if you want to
generate a poem, you can give the model the first line of the poem as a prompt. The model will then
generate the rest of the poem!
10. Zero-shot: A type of prompt that asks the model to perform a task without giving it any examples of that task and without fine-tuning. The model relies only on what it learned from its large pre-training dataset. For example, you can give the model the first line of a poem and ask it to generate the rest of the poem, even though it was never explicitly trained to complete poems. When you use ChatGPT, you often do zero-shot generation!
For example:
Input: Classification Task: Classify the following text - "Rainy days make me want to stay in bed."
Output: Label: Sentence about weather.
12. Instruct-tuning: A type of fine-tuning that uses instruction data, resulting in more controlled behavior when generating responses or performing tasks.
12.1 Alpaca: A dataset of 52,000 instructions generated with OpenAI APIs. It kicked off a big wave of people using OpenAI to generate synthetic data for instruct-tuning. It cost about $500 to generate.
12.2 LIMA: A model that achieves strong performance despite being fine-tuned on very few examples. It demonstrates that adding more data does not always correlate with better quality.
13. RLHF (Reinforcement Learning with Human Feedback): A type of fine-tuning that uses reinforcement learning (RL) and human-generated feedback. Thanks to the introduction of human feedback, the end model ends up being very good for things such as conversations! It kicks off with a base model that generates a bunch of conversations. Humans then rate the answers (preferences). The preferences are used to train a Reward Model that generates a score for a given text. Using Reinforcement Learning, the initial LM is trained to maximize the score generated by the Reward Model.
13.1 RL: Reinforcement learning is a type of machine learning that uses rewards to train a model. For
example, you can train a model to play a game by giving it a reward when it wins and a punishment
when it loses. The model will learn to win the game!
13.2. Reward Model: A model trained to score outputs. In RLHF, it takes a prompt and a candidate response and outputs a score reflecting how much a human would prefer that response.
15. DPO: A type of training which removes the need for a separate reward model. It significantly simplifies the RLHF pipeline.
15.1 Zephyr: A 7B Mistral-based model trained with DPO. It has similar capabilities to the 70B-parameter Llama 2 Chat model. It came out with a nice handbook of recipes.
15.2 Notus: A variation of Zephyr trained with better-filtered and fixed data. It does better!
15.3 Overfitting: occurs in ML when a model learns the training data too well, capturing noise and
specific patterns that do not generalize to new, unseen data, leading to poor performance on real-
world tasks.
15.4 DPO Overfits: Although DPO shows overfitting behaviors after one epoch, this does not harm downstream performance on chat evaluations. Did our ML teachers lie to us when they said overfitting was bad?
15.5 IPO: A change in the DPO objective which is simpler and less prone to overfitting.
15.6. KTO: While PPO, DPO, and IPO require pairs of accepted vs rejected generations, KTO just needs a binary label (accepted or rejected), hence allowing it to scale to much more data.
15.7 trl: A library that lets you train models with DPO, IPO, KTO, and more!
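A rough sketch of what DPO training with trl looks like (the exact signature varies between trl versions, and the dataset name below is a placeholder; DPO expects rows with a prompt, a chosen answer, and a rejected answer):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("my-org/preference-pairs", split="train")  # hypothetical dataset

trainer = DPOTrainer(
    model=model,
    args=TrainingArguments(output_dir="dpo-model"),
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,  # how strongly to penalize drifting from the reference model
)
trainer.train()
```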
16. Open LLM Leaderboard: A leaderboard where you can find benchmark results for many open-access
LLMs.
16.1 Benchmark: A benchmark is a test that you run to compare different models. For example, you can run a benchmark to compare the performance of different models on a specific task.
16.2 TruthfulQA: A not-great benchmark to measure a model’s ability to generate truthful answers.
16.3 Conversational models: The LLM Leaderboard should mostly be used to compare base models, not conversational models. It still provides some useful signal about conversational models, but it should not be the final way to evaluate them.
17. Chatbot Arena: A popular crowd-sourced open benchmark of human preferences. It’s good for comparing conversational models.
18. MT-Bench: A multi-turn benchmark of 160 questions across eight domains. Each response is
evaluated by GPT-4. (This presents limitations…what happens if the model is better than GPT-4?)
19. Mixture-of-Experts (MoE): A model architecture in which some of the (dense) layers are replaced with a set of experts. Each expert is a small neural network. There is a small network, the router, that decides which expert to use for each token (see the toy sketch after 19.2 below).
19.2 Mixtral: A MoE model released by Mistral. It has 47B parameters but only 12B parameters are
used at a time, making it very efficient.
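The promised toy sketch of the routing idea: a tiny router picks the top-2 experts for each token and mixes their outputs (real MoE layers are more involved, with load balancing and so on):

```python
import torch
import torch.nn as nn

hidden, n_experts, top_k = 64, 8, 2
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))
router = nn.Linear(hidden, n_experts)  # the small network deciding which experts to use

x = torch.randn(10, hidden)                              # 10 token embeddings
weights, picked = router(x).softmax(-1).topk(top_k, -1)  # top-2 experts per token
out = torch.zeros_like(x)
for k in range(top_k):
    for i in range(x.shape[0]):
        out[i] += weights[i, k] * experts[int(picked[i, k])](x[i])  # weighted mix
```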
20. Model Merging: A technique that allows us to combine multiple models of the same architecture into
a single model.
20.2 Averaging: The most basic merging technique. Pick two models, average their weights. Somehow
it kinda works!
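In code, averaging really is as basic as it sounds (both models must share the exact same architecture; the model names here are placeholders):

```python
from transformers import AutoModelForCausalLM

a = AutoModelForCausalLM.from_pretrained("org/model-a")  # hypothetical model names
b = AutoModelForCausalLM.from_pretrained("org/model-b")  # must share a's architecture

merged = a.state_dict()
for name, weight in b.state_dict().items():
    if weight.is_floating_point():                   # skip integer/bool buffers
        merged[name] = (merged[name] + weight) / 2   # element-wise average
a.load_state_dict(merged)
a.save_pretrained("merged-model")
```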
20.3 Frankenmerge: It lets you concatenate layers from different LLMs, allowing you to do crazy things.
20.4 Goliath-120B: A frankenmerge that combines two Llama 70B models to achieve a 120B model.
20.5 MoE Merging: (Not 100% sure about this one.) An experimental branch in mergekit that allows building a MoE-like model combining different models. You specify which models and which types of prompts you want each expert to handle, hence ending with expert task-specialization.
21.2 Cognitive Computations: A community (led by Eric Hartford) that fine-tunes a bunch of models.
21.3 Uncensored models: Many models have some strong alignment that prevents them from doing things such as asking Llama to kill a Linux process. Training uncensored models aims to remove specific biases ingrained in the decision-making process of fine-tuning a model.
21.5 GGUF: A format introduced by llama.cpp to store models. It replaces the old file format, GGML.
21.6 ggml: A tensor library for ML that powers projects such as llama.cpp and whisper.cpp (not the same as GGML, the file format).
21.10 MLX: A new framework for Apple devices that allows easy inference and fine-tuning of models.
22. Local LLM tools: If you don’t know how to code, there are a couple of tools that can be useful:
22.1 Oobabooga: A simple web app that allows you to use models without coding. It’s very easy to
use!
22.2 LM Studio: A nice advanced app that runs models on your laptop, entirely offline.
22.3 ollama: An open-source tool to run LLMs locally. There are multiple web/desktop apps and
terminal integrations on top of it.
23. Quantization: A technique that allows us to reduce the size of a model. It is done by reducing the
precision of the model’s weights. For example, we can reduce the precision from 32 bits to 8 bits. This
reduces the size of the model by 4 times! The model will (sometimes) be less accurate but it will be
much smaller. This allows us to run the model on smaller devices such as phones.
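For example, transformers with the bitsandbytes integration can quantize weights to 8-bit while loading (this needs a GPU and the bitsandbytes package; the model name is just an example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model: ~14 GB in fp16, ~7 GB in 8-bit
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```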
23.1 TheBloke: A bloke that quantizes models. As soon as a model is out, he quantizes it! See their HF
Profile.
23.2 Hugging Face: A platform to find and share open-access models, datasets, and demos. It’s also a company that has built different open-source libraries (and where I work!)
23.3. Facehugger: A monster from the Alien movie. It should also be an open source tool. It’s not yet.
23.6 EXL2: A different quantization format used by a library called exllamav2 (among many others).
23.7 LASER: A technique that reduces the size of the model and increases its performance by reducing the rank of specific matrices. It requires no additional training.
24. PEFT: Parameter-Efficient Fine-Tuning - It’s a family of methods that allow fine-tuning models without modifying all the parameters. Usually, you freeze the model, add a small set of new parameters, and train only those. This reduces the amount of compute required, and you can still achieve very good results!
24.1 peft: A popular open-source library to do PEFT! It’s used in other projects such as trl.
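A minimal sketch with LoRA, one of the most popular PEFT methods (the target modules depend on the architecture; "c_attn" is the attention projection in GPT-2):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(model, config)  # base weights frozen, small adapters added
model.print_trainable_parameters()     # only a tiny fraction is trainable
```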
25.1. Tim Dettmers: A researcher who has done a lot of work on PEFT and created QLoRA.
26. axolotl: A cute animal that is also a high-level tool to streamline fine-tuning, including support for
things such as QLoRA.
27. Nous Research: An open-source Discord community turned company that releases a bunch of cool models.
28. Multimodal: A single model that can handle multiple modalities. For example, a model that can
generate text and images at the same time. Or a model that can generate text and audio at the same
time. Or a model that can generate text, images, and audio at the same time. Or a model that can
generate text, images, audio, video, smells, tastes, feelings, thoughts, dreams, memories,
consciousness, souls, universes, gods, multiverses, and omniverses at the same time. (thanks ChatGPT
for your hallucination)
28.1 Hallucination: When a model generates responses that may be coherent but are not actually accurate, leading to the creation of misinformation or imaginary scenarios…such as the one above!
28.2 LLaVA: A multimodal model that can receive images and text as input and generate text responses.
29. Bagel: A process which mixes a bunch of supervised fine-tuning and preference data. It uses different
prompt formats, making the model more versatile to all kinds of prompts.
30. Code Models: LLMs that are specifically pre-trained for code.
30.1. Big Code Models Leaderboard: A leaderboard to compare code models in the HumanEval
dataset.
30.2. HumanEval: A very small dataset of 164 Python programming problems. It is translated to 18
programming languages in MultiPL-E.
30.3 BigCode: An open scientific collaboration working in code-related models and datasets.
30.4 The Stack: A dataset of 6.4TB of permissively-licensed code data covering 358 programming languages.
30.5 Code Llama: The best base code model. It’s based on Llama 2.
30.6 WizardLM: A research team from Microsoft…but also a Discord community.
30.7 WizardCoder: A code model released by WizardLM. Its architecture is based on Llama.
31. Flash Attention: A fast, memory-efficient exact attention algorithm which provides a huge speedup.
31.1 Flash Attention 2: An upgrade to the flash attention algorithm that provides even more speedup.
31.2. Tri Dao: The author of both techniques and a legend in the ecosystem.
I hope you enjoyed this read! Feel free to suggest new terms or corrections in the comments below. I’ll
keep updating this post as new terms come up.