RAG Fundamentals
RAG (Retrieval-Augmented Generation) retrieves relevant information from a knowledge base, which we then pass to an LLM to generate a response. It's like giving the LLM a cheat sheet of just the right reference material before asking it to answer.
Why RAG matters
Long texts can exceed context limits, but more importantly, adding noise makes models lose sight of detail and nuance—despite what needle-in-the-haystack benchmarks claim. RAG lets you supply only what's relevant.
RAG for Category Mapping
Say I want to map a product description against a set of predefined categories from a standardized nomenclature—hundreds of items long. RAG helps by pre-filtering those categories down to just the relevant ones before asking the LLM to pick the best match. This matters because A) it's cheaper, and B) the LLM won't miss the right category when it's not buried in noise. RAG acts as the smart filter that gives the LLM a focused shortlist instead of the whole haystack.
docs = [
"Electronics > Computers > Laptops",
"Electronics > Computers > Desktop Computers",
"Electronics > Mobile Devices > Smartphones",
"Electronics > Mobile Devices > Tablets",
"Electronics > Audio > Headphones > Wireless Headphones",
"Electronics > Audio > Headphones > Wired Headphones",
"Electronics > Audio > Speakers > Bluetooth Speakers",
"Home & Kitchen > Furniture > Office Furniture > Desks",
"Home & Kitchen > Furniture > Office Furniture > Chairs",
"Home & Kitchen > Appliances > Small Appliances > Coffee Makers",
"Clothing > Men's Clothing > Shirts",
"Clothing > Women's Clothing > Dresses",
"Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
"Sports & Outdoors > Exercise & Fitness > Cardio > Treadmills",
"Books > Fiction > Science Fiction",
"Books > Non-Fiction > Business & Money",
]
query = "Noise-cancelling over-ear bluetooth headphones with 30-hour battery life and premium sound quality"
Using Answer.AI's rerankers library
# %%bash
#
# pip install -U "rerankers[transformers]"==0.10.0
# pip install -U sentence-transformers
It is important to note that what we are trying to do is surface a handful of relevant results that can then be passed to an LLM to do something further (i.e. apply a single category to the product description).
We can do this because embedding models have been trained (via contrastive learning) to place related queries and documents close together, so the proximity between a query vector and a document vector is a usable measure of relevance.
Retrieval Architectures Overview
Once we've decided to retrieve relevant documents, the question becomes: how do we actually compare a query to documents? All approaches boil down to encoding text into vectors and measuring similarity — but when and how we do that encoding matters a lot for both accuracy and speed.
Here's a quick preview of the three approaches we'll cover:
- Bi-encoders — encode separately, compare vectors (fast, scalable)
- Cross-encoders — encode together, get direct relevance score (accurate, slow)
- Late interaction — encode separately, compare token-by-token (middle ground)
Bi-Encoders
Bi-encoders encode the query and each document separately, producing a single vector for each. We then compare these vectors using cosine similarity (or dot product) to find the most relevant documents.
Query: [CLS] "noise" "cancelling" "headphones" [SEP] → e_cls_query
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP] → e_cls_doc
A single vector representation (an embedding) is produced for the query and for each individual document. There are multiple ways to generate this embedding per query/document, which we cover under pooling strategies below.
Why this matters for scale: Because documents are encoded independently, we can pre-compute all document embeddings once and store them. At query time, we only encode the query and compare against the stored vectors — this makes bi-encoders very fast.
Pooling Strategies
When we fine-tune a model for similarity, it outputs embeddings for the entire input sequence. But we need a single vector to represent the text. Where does it come from?
CLS Pooling
Use the [CLS] token's embedding as the representation. Fine-tuned models are trained to update the [CLS] vector such that the loss is minimized — despite the model outputting embeddings for the entire sequence, we only use e_cls:
Input: [CLS] "noise" "cancelling" "headphones" [SEP]
Output: ↓ ↓ ↓ ↓ ↓
e_cls e_1 e_2 e_3 e_sep
→ use e_cls (discard the rest)
The [CLS] embedding is a "legacy" implementation from BERT models, where it was originally designed for classification tasks (e.g., sentiment analysis).
Mean Pooling
Average all token embeddings (excluding special tokens):
→ use mean(e_1, e_2, e_3)
Why mean pooling often works better: The [CLS] token must learn to summarize everything — a lot of pressure on one vector. Mean pooling distributes the representation across all tokens, which can be more robust, especially for longer sequences.
Notice how the [CLS] and [SEP] embeddings are both dropped.
Summary:
Input: [CLS] "noise" "cancelling" "headphones" [SEP]
Output: ↓ ↓ ↓ ↓ ↓
e_cls e_1 e_2 e_3 e_sep
CLS pooling: use e_cls
Mean pooling: use mean(e_1, e_2, e_3)
Mean pooling has become the de facto standard for bi-encoder models (e.g., all-MiniLM-L6-v2, E5, GTE, BGE).
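To make this concrete, here is a minimal sketch of both pooling strategies on raw transformer outputs. The model name follows the examples in this post, and dropping [CLS]/[SEP] from the mean mirrors the description above (an assumption worth flagging: many libraries simply average over the attention mask, which keeps the special tokens, and the difference is rarely significant):
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name)

inputs = tok("noise cancelling headphones", return_tensors="pt",
             return_special_tokens_mask=True)
special = inputs.pop("special_tokens_mask")          # marks [CLS] and [SEP]

with torch.no_grad():
    token_embs = enc(**inputs).last_hidden_state     # (1, seq_len, hidden)

# CLS pooling: keep only the first token's embedding
cls_emb = token_embs[:, 0]

# Mean pooling: average the content tokens, dropping [CLS]/[SEP] and padding
mask = (inputs["attention_mask"] * (1 - special)).unsqueeze(-1)
mean_emb = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)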
In Practice: Sentence-Transformers
The sentence-transformers library is the most common way to use bi-encoder models. It wraps the encoding and pooling logic:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import numpy as np

# Load a bi-encoder model (mean pooling by default)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode query and documents separately
query_embedding = model.encode(query)
doc_embeddings = model.encode(docs)

# Calculate cosine similarity between the query and every category
similarities = cos_sim(query_embedding, doc_embeddings)[0].numpy()

# Get top-k most relevant categories
top_k = 3
top_indices = np.argsort(similarities)[-top_k:][::-1]

print(f"Query: {query}\n")
for idx in top_indices:
    print(f"Score: {similarities[idx]:.4f} | {docs[idx]}")
Pre-computing doc_embeddings once, then comparing against new queries, is what makes bi-encoders practical for large-scale retrieval.
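For example, the category embeddings can be computed once, saved, and reused for every incoming product description (the file name and the new query below are just illustrative):
# Compute once and persist (illustrative file name)
doc_embeddings = model.encode(docs)
np.save("category_embeddings.npy", doc_embeddings)

# Later, at query time: load the stored vectors and encode only the query
doc_embeddings = np.load("category_embeddings.npy")
new_query = "wireless earbuds with long battery life"   # hypothetical new product description
scores = cos_sim(model.encode(new_query), doc_embeddings)[0].numpy()
best_idx = int(np.argmax(scores))
print(f"{scores[best_idx]:.4f} | {docs[best_idx]}")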
? Would love to see the contrastive learning approach used to train the [CLS] or mean-pooled embeddings. It makes sense at a high level, but I would struggle to go into any detail.
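As a rough sketch of the idea (the general recipe, not the exact training setup of any particular model): the common approach is an in-batch contrastive loss. Each (query, positive document) pair in a batch treats every other document in the batch as a negative, and a cross-entropy (InfoNCE-style) objective pulls matching pairs together while pushing the rest apart; this is essentially what MultipleNegativesRankingLoss in sentence-transformers does. The embeddings fed into this loss are exactly the pooled ([CLS] or mean) vectors, which is how the pooled representation ends up encoding similarity:
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs, pos_doc_embs, temperature=0.05):
    # query_embs, pos_doc_embs: (batch, dim); row i of pos_doc_embs is the
    # positive document for query i, all other rows act as in-batch negatives
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(pos_doc_embs, dim=-1)
    logits = q @ d.T / temperature           # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))         # the "right" doc for query i is doc i
    return F.cross_entropy(logits, labels)   # maximize sim(q_i, d_i) against all d_j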
Cross-Encoders
Cross-encoders take a fundamentally different approach: instead of encoding query and document separately, they encode them together as a single input and output a relevance score directly.
Input: [CLS] "noise" "cancelling" "headphones" [SEP] "Electronics" ">" "Audio" ">" ... [SEP]
↓
Relevance Score: 0.87
There is no need to discuss pooling with cross-encoders, since a single relevance score is output by design.
The model sees both texts at once, allowing full attention between query and document tokens. This means the model can capture fine-grained interactions — like recognizing that "noise-cancelling" relates strongly to "Headphones" but not to "Speakers."
Why this is more accurate: Bi-encoders compress each text into a single vector before comparison. Cross-encoders delay that compression, letting the model reason about the relationship between query and document directly.
The trade-off: You can't pre-compute document embeddings. Every query requires running the model on every (query, document) pair. For 1,000 documents, that's 1,000 forward passes per query.
Popular cross-encoder models include ms-marco-MiniLM-L-6-v2 (fast, general-purpose), ms-marco-electra-base (stronger but slower), BGE-reranker (multilingual), and Cohere Rerank (commercial API).
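For instance, one of these models can be called directly through sentence-transformers; it scores each (query, document) pair in a single joint forward pass (a minimal sketch using the ms-marco model listed above):
from sentence_transformers import CrossEncoder

# Score every (query, document) pair jointly; no pre-computed doc embeddings
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = ce.predict([(query, doc) for doc in docs])

for doc, score in sorted(zip(docs, pair_scores), key=lambda x: -x[1])[:3]:
    print(f"Score: {score:.4f} | {doc}")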
Alternatively, the rerankers library by Answer.AI wraps these models behind a single interface:
from rerankers import Reranker

# Load the library's default cross-encoder model
ranker = Reranker('cross-encoder', model_type='cross-encoder')
# Rerank to get top categories
results = ranker.rank(query=query, docs=docs, doc_ids=list(range(len(docs))))
# Display top 5 results
print(f"Query: {query}\n")
print("Top 5 matching categories:")
for result in results.top_k(5):
    print(f"Score: {result.score:.4f} - {result.text}")
Late Interaction (ColBERT)
Late interaction models like ColBERT offer a middle ground: encode query and documents separately (like bi-encoders), but compare at the token level instead of collapsing to a single vector.
Query: [CLS] "noise" "cancelling" "headphones" [SEP] → [e_q1, e_q2, e_q3]
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP] → [e_d1, e_d2, e_d3, ...]
Each query token finds its best match among the document tokens (MaxSim), and these scores are summed to produce the final relevance score:
Query embeddings: q₁ q₂ q₃
"noise" "cancelling" "headphones"
Document embeddings: d₁ d₂ d₃ d₄ d₅
"Elec" "Audio" ">" "Headphones" "Wireless"
MaxSim: each query token finds its best match in the document
d₁ d₂ d₃ d₄ d₅
───── ───── ───── ───── ─────
q₁ "noise" 0.2 0.3 0.1 0.4 0.2 → max = 0.4
q₂ "cancel" 0.1 0.2 0.1 0.3 0.1 → max = 0.3
q₃ "headph" 0.2 0.3 0.1 0.9 0.2 → max = 0.9
─────────
Score = Σ max = 1.6
This preserves token-level detail that bi-encoders lose when they collapse to a single vector.
Why this helps: A bi-encoder might struggle to distinguish "wireless headphones" from "wired headphones" because both compress to similar vectors. ColBERT keeps the individual token embeddings, so "wireless" can match (or not match) specific document tokens.
The trade-off: You can still pre-compute document embeddings (good for speed), but you now need to store all token embeddings per document, not just one vector. This increases storage significantly.
ColBERT = Contextualized Late Interaction over BERT
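A minimal sketch of the MaxSim scoring step itself, assuming you already have per-token embeddings as rows of a matrix (names here are illustrative, not ColBERT's actual implementation):
import numpy as np

def maxsim_score(query_tok_embs: np.ndarray, doc_tok_embs: np.ndarray) -> float:
    # query_tok_embs: (n_query_tokens, dim), doc_tok_embs: (n_doc_tokens, dim)
    q = query_tok_embs / np.linalg.norm(query_tok_embs, axis=1, keepdims=True)
    d = doc_tok_embs / np.linalg.norm(doc_tok_embs, axis=1, keepdims=True)
    sims = q @ d.T                         # cosine sim of every query token vs. every doc token
    return float(sims.max(axis=1).sum())   # best doc match per query token, summed
With rerankers, the same calling pattern as before works for a ColBERT model: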
# Load the default ColBERT-style late-interaction model
ranker = Reranker("colbert")

# Rerank to get top categories
results = ranker.rank(query=query, docs=docs)

# Display top 5 results
print(f"Query: {query}\n")
print("Top 5 matching categories:")
for result in results.top_k(5):
    print(f"Score: {result.score:.4f} - {result.text}")
Score Interpretation in Retrieval Models
Below is a summary of the results from our simple test:
| Approach | Rank | Result | Score |
|---|---|---|---|
| Bi-encoder (MiniLM) | #1 | Wireless Headphones | 0.4245 |
| Bi-encoder (MiniLM) | #2 | Wired Headphones | 0.3702 |
| Cross-encoder (mxbai-rerank) | #1 | Wireless Headphones | -0.7980 |
| Cross-encoder (mxbai-rerank) | #2 | Bluetooth Speakers | -0.8287 |
| ColBERT | #1 | Wireless Headphones | 0.4903 |
| ColBERT | #2 | Bluetooth Speakers | 0.4481 |
Interpreting these results
- Rankings: Reliable across all model types
- Absolute scores/thresholds: Only meaningful after empirical calibration on your specific data + model
This is why production RAG uses top-k (e.g., top 5) rather than score thresholds—thresholds require calibration and break when you change models or domains.
Only trust rankings, not score values
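To see why, note that the bi-encoder scores above are cosine similarities bounded in [-1, 1], while the cross-encoder outputs what appear to be raw, unbounded logits (hence the negative values). A fixed threshold like 0.5 means completely different things in each case, but sorting and taking the top-k works identically:
# Scores copied from the table above: same #1 result, very different scales
bi_scores = {"Wireless Headphones": 0.4245, "Wired Headphones": 0.3702}
ce_scores = {"Wireless Headphones": -0.7980, "Bluetooth Speakers": -0.8287}

for scores in (bi_scores, ce_scores):
    ranked = sorted(scores, key=scores.get, reverse=True)  # ranking is scale-free
    print(ranked)   # a fixed threshold (say 0.5) would behave very differently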
When to Use Which?
| Approach | Speed | Accuracy | Pre-computed Embeddings | Best For |
|---|---|---|---|---|
| Bi-encoder | ⚡⚡⚡ | Good | 1 vector/doc | First-pass retrieval over large corpora |
| Cross-encoder | ⚡ | Excellent | None (computed at query time) | Re-ranking a small candidate set |
| ColBERT | ⚡⚡ | Very Good | N vectors/doc (one per token; potentially storage intensive) | When you need better accuracy than bi-encoders but can't afford full cross-encoder passes |
The Dominant Pattern: Two-Stage Retrieval
Often the best setup is bi-encoder retrieval + cross-encoder reranking + LLM generation.
Query → Bi-encoder retrieves top-100 → Cross-encoder re-ranks to top-10 → LLM generates answer
The term "reranker" comes from the fact that you're literally re-ranking results from a cruder first pass — the bi-encoder does the initial ranking by vector similarity, then the cross-encoder refines that ranking with more accurate scoring.
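A minimal sketch of that pipeline using the pieces above (the candidate counts, the prompt, and the final LLM call are illustrative placeholders, not a production setup):
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from rerankers import Reranker

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = Reranker("cross-encoder", model_type="cross-encoder")

def shortlist_categories(query, docs, retrieve_k=20, rerank_k=5):
    # Stage 1: cheap bi-encoder retrieval over the full category list
    # (in practice the doc embeddings would be pre-computed and stored)
    sims = cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))[0]
    candidates = [docs[int(i)] for i in sims.argsort(descending=True)[:retrieve_k]]
    # Stage 2: accurate cross-encoder re-ranking of the small candidate set
    reranked = reranker.rank(query=query, docs=candidates)
    return [r.text for r in reranked.top_k(rerank_k)]

# Stage 3 (sketch): hand the shortlist to an LLM to pick the single best category
shortlist = shortlist_categories(query, docs)
prompt = f"Product: {query}\nPick the single best category from:\n" + "\n".join(shortlist)
print(prompt)  # pass this prompt to your LLM of choice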
When ColBERT Makes Sense
ColBERT combines the speed of bi-encoders with some of the contextual understanding of cross-encoders, making it suitable for tasks where both speed and precision are crucial.
This makes ColBERT suitable for large-scale applications where bi-encoders might miss nuanced matches. For example, a search for "car maintenance" could retrieve a document discussing "automobile care" because the comparison happens at the level of individual token embeddings.
The trade-off: you need to store all token embeddings per document, which significantly increases storage requirements.
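A rough back-of-envelope comparison (the numbers are illustrative assumptions: ~300 tokens per document, ColBERT-style 128-dim float16 token vectors, versus a single 384-dim float32 bi-encoder vector):
# Illustrative per-document storage estimate
tokens_per_doc = 300
colbert_bytes = tokens_per_doc * 128 * 2   # 128-dim float16 per token  ~= 77 KB
bi_encoder_bytes = 384 * 4                 # one 384-dim float32 vector ~= 1.5 KB
print(f"{colbert_bytes / bi_encoder_bytes:.0f}x more storage")  # ~50x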