RAG Fundamentals

RAG (Retrieval-Augmented Generation) retrieves relevant information from a knowledge base, which we then pass to an LLM to generate a response. It's like giving the LLM a cheat sheet of just the right reference material before asking it to answer.

Why RAG matters

Long texts can exceed context limits, but more importantly, adding noise makes models lose sight of detail and nuance—despite what needle-in-the-haystack benchmarks claim. RAG lets you supply only what's relevant.

RAG for Category Mapping

Say I want to map a product description against a set of predefined categories from a standardized nomenclature—hundreds of items long. RAG helps by pre-filtering those categories down to just the relevant ones before asking the LLM to pick the best match. This matters because A) it's cheaper, and B) the LLM won't miss the right category when it's not buried in noise. RAG acts as the smart filter that gives the LLM a focused shortlist instead of the whole haystack.

docs = [
    "Electronics > Computers > Laptops",
    "Electronics > Computers > Desktop Computers",
    "Electronics > Mobile Devices > Smartphones",
    "Electronics > Mobile Devices > Tablets",
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Headphones > Wired Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
    "Home & Kitchen > Furniture > Office Furniture > Desks",
    "Home & Kitchen > Furniture > Office Furniture > Chairs",
    "Home & Kitchen > Appliances > Small Appliances > Coffee Makers",
    "Clothing > Men's Clothing > Shirts",
    "Clothing > Women's Clothing > Dresses",
    "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
    "Sports & Outdoors > Exercise & Fitness > Cardio > Treadmills",
    "Books > Fiction > Science Fiction",
    "Books > Non-Fiction > Business & Money",
]

query = "Noise-cancelling over-ear bluetooth headphones with 30-hour battery life and premium sound quality"

Using Answer.AI's rerankers library

# %%bash
#
# pip install -U "rerankers[transformers]"==0.10.0
# pip install -U sentence-transformers

It is important to note that what we are trying to do is surface a handful of relevant results that can then be passed to an LLM for further processing (i.e., applying a single category to the product description).

We can do this because these models have been trained (via contrastive learning) to measure similarity as proximity between document and query embeddings.
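
As a minimal illustration of "proximity", here is a cosine-similarity sketch in plain numpy; the vectors are made up (real embeddings have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Proximity between two embedding vectors: 1.0 = same direction, 0.0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", for illustration only
query_vec = np.array([0.2, 0.9, 0.1])
doc_vec = np.array([0.25, 0.85, 0.05])
print(cosine_similarity(query_vec, doc_vec))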

Retrieval Architectures Overview

Once we've decided to retrieve relevant documents, the question becomes: how do we actually compare a query to documents? All approaches boil down to encoding text into vectors and measuring similarity — but when and how we do that encoding matters a lot for both accuracy and speed.

A quick preview of the three approaches covered below:

  1. Bi-encoders — encode separately, compare vectors (fast, scalable)
  2. Cross-encoders — encode together, get direct relevance score (accurate, slow)
  3. Late interaction — encode separately, compare token-by-token (middle ground)

Bi-Encoders

Bi-encoders encode the query and each document separately, producing a single vector for each. We then compare these vectors using cosine similarity (or dot product) to find the most relevant documents.

Query:    [CLS] "noise" "cancelling" "headphones" [SEP]  →  e_cls_query
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP]  →  e_cls_doc

A single vector representation (embedding) is produced for the query and for each individual document. There are multiple ways to generate this embedding per query/document.

Why this matters for scale: Because documents are encoded independently, we can pre-compute all document embeddings once and store them. At query time, we only encode the query and compare against the stored vectors — this makes bi-encoders very fast.

Pooling Strategies

When we fine-tune a model for similarity, it outputs embeddings for the entire input sequence. But we need a single vector to represent the text. Where does it come from?

CLS Pooling

Use the [CLS] token's embedding as the representation. Fine-tuned models are trained to update the [CLS] vector such that the loss is minimized — despite the model outputting embeddings for the entire sequence, we only use e_cls:

Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
Output:   ↓      ↓         ↓           ↓         ↓
        e_cls   e_1       e_2         e_3       e_sep

→ use e_cls (discard the rest)

Using the [CLS] embedding is a "legacy" convention inherited from BERT, where the token was originally designed for classification tasks (e.g., sentiment analysis).

Mean Pooling

Average all token embeddings (excluding special tokens):

→ use mean(e_1, e_2, e_3)

Why mean pooling often works better: The [CLS] token must learn to summarize everything — a lot of pressure on one vector. Mean pooling distributes the representation across all tokens, which can be more robust, especially for longer sequences.

Notice how the [CLS] and [SEP] embeddings are both dropped.

Summary:

Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
Output:   ↓      ↓         ↓           ↓         ↓
        e_cls   e_1       e_2         e_3       e_sep

CLS pooling:  use e_cls
Mean pooling: use mean(e_1, e_2, e_3)

Mean pooling has become the de facto standard for bi-encoder models (e.g., all-MiniLM-L6-v2, E5, GTE, BGE).
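
To make the two strategies concrete, here is a minimal sketch using the Hugging Face transformers library directly (the model name is just an example; libraries like sentence-transformers wrap this logic for you):

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example model; any BERT-style encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

inputs = tokenizer("noise cancelling headphones", return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# CLS pooling: keep only the first token's embedding (e_cls)
cls_embedding = token_embeddings[:, 0, :]

# Mean pooling: average the word-token embeddings, dropping [CLS] and [SEP]
# (many implementations instead mean-pool over all non-padding tokens via the attention mask)
mean_embedding = token_embeddings[:, 1:-1, :].mean(dim=1)

print(cls_embedding.shape, mean_embedding.shape)  # both: (1, hidden_dim)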

In Practice: Sentence-Transformers

The sentence-transformers library is the most common way to use bi-encoder models. It wraps the encoding and pooling logic:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model (mean pooling by default)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode query and documents separately
query_embedding = model.encode(query)
doc_embeddings = model.encode(docs)

# Calculate cosine similarity
from sentence_transformers.util import cos_sim

similarities = cos_sim(query_embedding, doc_embeddings)[0].numpy()

# Get top-k most relevant categories
top_k = 3
top_indices = np.argsort(similarities)[-top_k:][::-1]

print(f"Query: {query}\n")
for idx in top_indices:
    print(f"Score: {similarities[idx]:.4f} | {docs[idx]}")

Pre-computing doc_embeddings once, then comparing against new queries, is what makes bi-encoders practical for large-scale retrieval.
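
A sketch of that pattern, reusing model and docs from above (the file path is illustrative; in practice the embeddings often live in a vector database):

# Pre-compute once (offline or at startup) and persist
doc_embeddings = model.encode(docs, normalize_embeddings=True)
np.save("category_embeddings.npy", doc_embeddings)  # illustrative path

# At query time: encode only the query and score it against the stored matrix
doc_embeddings = np.load("category_embeddings.npy")
query_embedding = model.encode(query, normalize_embeddings=True)
scores = doc_embeddings @ query_embedding  # cosine similarity, since both sides are normalized
print(docs[int(np.argmax(scores))])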

? Would love to see the contrastive learning approach used to train the [CLS] or mean-pooled embeddings. It makes sense at a high level, but I would struggle to go into any detail.
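
For a rough sense of how that training works: the standard recipe is in-batch contrastive learning, where each (query, relevant document) pair is a positive and the other documents in the batch act as negatives. Below is a minimal sketch using sentence-transformers' classic fit API with MultipleNegativesRankingLoss; the training pairs are made up for illustration:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical (query, relevant category) pairs; other in-batch documents act as negatives
train_examples = [
    InputExample(texts=["bluetooth over-ear headphones",
                        "Electronics > Audio > Headphones > Wireless Headphones"]),
    InputExample(texts=["standing desk for home office",
                        "Home & Kitchen > Furniture > Office Furniture > Desks"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Pull each query toward its paired document, push it away from the other documents in the batch
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)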

Cross-Encoders

Cross-encoders take a fundamentally different approach: instead of encoding query and document separately, they encode them together as a single input and output a relevance score directly.

Input:  [CLS] "noise" "cancelling" "headphones" [SEP] "Electronics" ">" "Audio" ">" ... [SEP]
                        ↓
              Relevance Score: 0.87

No need to discuss pooling with cross-encoders, since a single relevance score is output by design.

The model sees both texts at once, allowing full attention between query and document tokens. This means the model can capture fine-grained interactions — like recognizing that "noise-cancelling" relates strongly to "Headphones" but not to "Speakers."

Why this is more accurate: Bi-encoders compress each text into a single vector before comparison. Cross-encoders delay that compression, letting the model reason about the relationship between query and document directly.

The trade-off: You can't pre-compute document embeddings. Every query requires running the model on every (query, document) pair. For 1,000 documents, that's 1,000 forward passes per query.

Popular cross-encoder models include ms-marco-MiniLM-L-6-v2 (fast, general-purpose), ms-marco-electra-base (stronger but slower), BGE-reranker (multilingual), and Cohere Rerank (commercial API).
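
To make the per-query cost concrete, here is a minimal sketch using the CrossEncoder class from sentence-transformers with one of the models listed above; note that every document requires its own forward pass together with the query:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# One forward pass per (query, document) pair: this is what makes cross-encoders slow at scale
pairs = [(query, doc) for doc in docs]
scores = cross_encoder.predict(pairs)

for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)[:3]:
    print(f"{score:.4f} | {doc}")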

Alternatively, we can use the rerankers library by Answer.AI to run these models:

from rerankers import Reranker

ranker = Reranker(
    'cross-encoder',
    model_type='cross-encoder'
)
# Rerank to get top categories
results = ranker.rank(query=query, docs=docs, doc_ids=list(range(len(docs))))

# Display top 5 results
print(f"Query: {query}\n")
print("Top 5 matching categories:")
for result in results.top_k(5):
    print(f"Score: {result.score:.4f} - {result.text}")

Late Interaction (ColBERT)

Late interaction models like ColBERT offer a middle ground: encode query and documents separately (like bi-encoders), but compare at the token level instead of collapsing to a single vector.

Query:    [CLS] "noise" "cancelling" "headphones" [SEP]  →  [e_q1, e_q2, e_q3]
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP]  →  [e_d1, e_d2, e_d3, ...]

Each query token finds its best match among the document tokens (MaxSim), and these scores are summed to produce the final relevance score:

Query embeddings:          q₁           q₂            q₃
                       "noise"   "cancelling"   "headphones"

Document embeddings:   d₁       d₂      d₃       d₄        d₅
                     "Elec"   "Audio"  ">"    "Headphones" "Wireless"

MaxSim: each query token finds its best match in the document

              d₁      d₂      d₃      d₄      d₅
            ─────   ─────   ─────   ─────   ─────
q₁ "noise"   0.2     0.3     0.1     0.4     0.2   →  max = 0.4
q₂ "cancel"  0.1     0.2     0.1     0.3     0.1   →  max = 0.3
q₃ "headph"  0.2     0.3     0.1     0.9     0.2   →  max = 0.9
                                                      ─────────
                                           Score = Σ max = 1.6

This preserves token-level detail that bi-encoders lose when they collapse to a single vector.
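
A minimal numpy sketch of MaxSim, assuming we already have L2-normalized token embedding matrices for a query and a document (the toy matrices below are random, just to show the shapes):

import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    # query_tokens: (n_query_tokens, dim), doc_tokens: (n_doc_tokens, dim), both L2-normalized
    sim_matrix = query_tokens @ doc_tokens.T   # pairwise token-to-token similarities
    return sim_matrix.max(axis=1).sum()        # best document match per query token, then sum

# Toy example: 3 query tokens vs. 5 document tokens, 4-dim embeddings (random, for shape only)
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
d = rng.normal(size=(5, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))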

Why this helps: A bi-encoder might struggle to distinguish "wireless headphones" from "wired headphones" because both compress to similar vectors. ColBERT keeps the individual token embeddings, so "wireless" can match (or not match) specific document tokens.

The trade-off: You can still pre-compute document embeddings (good for speed), but you now need to store all token embeddings per document, not just one vector. This increases storage significantly.

ColBERT = Contextualized Late Interaction over BERT

ranker = Reranker("colbert")
# Rerank to get top categories
results = ranker.rank(query=query, docs=docs)

# Display top 5 results
print(f"Query: {query}\n")
print("Top 5 matching categories:")
for result in results.top_k(5):
    print(f"Score: {result.score:.4f} - {result.text}")

Score Interpretation in Retrieval Models

Below is a summary of the results from our simple test:

Approach                      Rank   Result                 Score
Bi-encoder (MiniLM)           #1     Wireless Headphones    0.4245
                              #2     Wired Headphones       0.3702
Cross-encoder (mxbai-rerank)  #1     Wireless Headphones   -0.7980
                              #2     Bluetooth Speakers    -0.8287
ColBERT                       #1     Wireless Headphones    0.4903
                              #2     Bluetooth Speakers     0.4481

Interpreting these results

  • Rankings: Reliable across all model types
  • Absolute scores/thresholds: Only meaningful after empirical calibration on your specific data + model

This is why production RAG uses top-k (e.g., top 5) rather than score thresholds—thresholds require calibration and break when you change models or domains.

Only trust rankings, not score values

When to Use Which?

Approach        Speed   Accuracy    Pre-computed Embeddings                                      Best For
Bi-encoder      ⚡⚡⚡     Good        1 vector/doc                                                 First-pass retrieval over large corpora
Cross-encoder   ⚡       Excellent   None (computed at query time)                                Re-ranking a small candidate set
ColBERT         ⚡⚡      Very Good   N vectors/doc (one per token; potentially storage intensive) When you need better accuracy than bi-encoders but can't afford full cross-encoder passes

The Dominant Pattern: Two-Stage Retrieval

Often the best setup is bi-encoder retrieval + cross-encoder reranking + LLM generation.

Query → Bi-encoder retrieves top-100 → Cross-encoder re-ranks to top-10 → LLM generates answer

The term "reranker" comes from the fact that you're literally re-ranking results from a cruder first pass — the bi-encoder does the initial ranking by vector similarity, then the cross-encoder refines that ranking with more accurate scoring.

When ColBERT Makes Sense

ColBERT combines the speed of bi-encoders with some of the contextual understanding of cross-encoders, making it suitable for tasks where both speed and precision are crucial.

This makes ColBERT suitable for large-scale applications where bi-encoders might miss nuanced matches—for example, a search for "car maintenance" could retrieve a document discussing "automobile care" by matching tokens at the individual level.

The trade-off: you need to store all token embeddings per document, which significantly increases storage requirements.