Week 5 - LLM - RAG
- LLM Hallucinations vs. Semantic Search vs. RAG
CHUNKING
In the context of building LLM-related applications,
chunking is the process of breaking down large pieces
of text into smaller segments.
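As a minimal sketch of what this can look like, the snippet below splits raw text into fixed-size, slightly overlapping character windows. The chunk_text name, the 500-character chunk size, and the 50-character overlap are illustrative assumptions; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
# Minimal chunking sketch: fixed-size character windows with a small overlap,
# so text cut at a boundary still appears (partially) in two adjacent chunks.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```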
LLMs have limits on how much text we can pass to them — we call this limit
the context window.
Some LLMs have huge context windows; Anthropic's Claude, for example, accepts
100K tokens.
With that, we could fit many tens of pages of text. So could we return many
documents (if not quite all) and "stuff" the context window to improve recall?
No. Context stuffing reduces the LLM's recall: an LLM's ability to find
information placed within its context window degrades as we fill that window
with more tokens.
The solution to this issue is to maximize retrieval
recall by retrieving plenty of documents and then
maximize LLM recall by minimizing the number of
documents that make it to the LLM.
We use two stages because retrieving a small set of documents from a large
dataset is much faster than reranking a large set of documents. We'll discuss
why this is the case soon, but the TL;DR is that retrievers are fast and
rerankers are slow.
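As an illustration of the fast first stage, here is a rough sketch assuming the sentence-transformers library: embed the query and the chunks with a bi-encoder and keep the top-k most similar chunks. The all-MiniLM-L6-v2 model and k=25 are illustrative choices, not something prescribed by these notes.

```python
# First-stage retrieval sketch: embed query and chunks with a bi-encoder,
# then keep the top-k most similar chunks (fast, but relatively coarse).
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def retrieve(query: str, chunks: list[str], k: int = 25) -> list[str]:
    query_emb = retriever.encode(query, convert_to_tensor=True)
    chunk_embs = retriever.encode(chunks, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_embs, top_k=k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]
```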
In the second stage, the reranker model takes a pair of inputs, such as the
query and a candidate chunk, and produces an output value between 0 and 1
indicating the similarity (relevance) between the two items.
These scores are then used to re-rank the candidates, ensuring that
the most relevant and informative responses are selected for the
generation step.
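A possible sketch of that second stage, again assuming sentence-transformers: a cross-encoder scores every (query, candidate) pair and only the highest-scoring candidates are kept. The cross-encoder/ms-marco-MiniLM-L-6-v2 model and top_n=3 are assumptions made for illustration.

```python
# Second-stage reranking sketch: score each (query, candidate) pair with a
# cross-encoder, then keep only the highest-scoring chunks for the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Only the few chunks returned by this step would then be placed in the prompt for the generation step.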
Re-ranking is an open problem,
and a lot of work is still needed to improve its accuracy.
Is RAG a foolproof solution
for augmenting LLMs with new information?