03 - When Dense Search Misses Keywords: Hybrid BM25 + Vectors
Problem: Pure embedding search can miss rare tokens such as SKUs, error codes, and internal project names, so lexical matching still matters.
In this notebook: A small corpus where the answer hinges on the near-exact token ERR-5041. We compare vector-only retrieval against reciprocal rank fusion (RRF) of the vector and BM25 rankings.
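RRF itself fits in a few lines. The sketch below is not the `reciprocal_rank_fusion` helper from `rag_series_utils` (its source is not shown here); `rrf_sketch` is a hypothetical stand-in that assumes the usual formulation, in which each ranked list contributes `1 / (k + rank)` to a document's score. It returns `(id, score)` pairs, which matches how the notebook cell unpacks `fused`. The two input rankings are made up for illustration.

```python
from collections import defaultdict


def rrf_sketch(rankings, k=60, top_n=None):
    """Fuse ranked ID lists (each ordered best-first) into (id, score) pairs."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); IDs missing from a list add nothing.
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused if top_n is None else fused[:top_n]


# Made-up rankings: the dense leg puts the ERR-5041 chunk ("1") third,
# while the BM25 leg puts it first.
vector_ranking = ["0", "2", "1", "3"]
bm25_ranking = ["1", "0", "3", "2"]
rrf_result = rrf_sketch([vector_ranking, bm25_ranking], k=60, top_n=3)
print([doc_id for doc_id, _ in rrf_result])
```

Because only ranks enter the formula, the raw BM25 and cosine scores never need to share a scale. A larger `k` flattens the difference between adjacent ranks; `k=60` is a common default and is the value the cell below passes.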
In [ ]:
import sys
from pathlib import Path
# Make the repo's src/ directory importable when running from the repo root.
_REPO = Path.cwd().resolve()
if (_REPO / "src").is_dir():
    sys.path.insert(0, str(_REPO / "src"))
from rag_series_utils import tokenize, reciprocal_rank_fusion, chroma_path, get_client
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
chunks = [
"General guidance: retry failed requests with exponential backoff.",
"Incident ERR-5041: gateway timeout — increase client timeout to 60s or contact SRE.",
"Monitoring: latency spikes often correlate with deploy windows.",
"Authentication errors use codes AUTH-01 through AUTH-09 in logs.",
]
query = "What should we do for incident ERR-5041?"
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
token_corpus = [tokenize(c) for c in chunks]
bm25 = BM25Okapi(token_corpus)
q_tokens = tokenize(query)
bm25_scores = bm25.get_scores(q_tokens)
bm25_order = [chunks[i] for i in sorted(range(len(chunks)), key=lambda i: -bm25_scores[i])]
emb = model.encode(chunks, show_progress_bar=False).tolist()
qe = model.encode(query, show_progress_bar=False).tolist()
p = chroma_path("nb03_hybrid")
client = get_client(p)
# Drop any collection left over from a previous run so the cell can be re-run.
try:
    client.delete_collection("c")
except Exception:
    pass
col = client.create_collection("c", metadata={"hnsw:space": "cosine"})
col.add(ids=[str(i) for i in range(len(chunks))], documents=chunks, embeddings=emb)
vres = col.query(query_embeddings=[qe], n_results=len(chunks))
vec_order = vres["documents"][0]
# Map document text -> id for RRF
id_by_text = {chunks[i]: str(i) for i in range(len(chunks))}
vec_ids = [id_by_text[t] for t in vec_order]
bm25_ids = [id_by_text[t] for t in bm25_order]
fused = reciprocal_rank_fusion([vec_ids, bm25_ids], k=60, top_n=3)
id_to_text = {str(i): chunks[i] for i in range(len(chunks))}
print("Vector top-3:", vec_order[:3])
print("BM25 top-3:", bm25_order[:3])
print("RRF top-3:", [id_to_text[i] for i, _ in fused])
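The three printouts show each top-3 list but not why a fused hit landed where it did. One way to make that visible is to report each fused ID's rank in every leg. The helper below is a minimal sketch; `rank_by_leg` and the sample rankings are hypothetical and are not part of `rag_series_utils` or of this notebook's output.

```python
def rank_by_leg(fused_ids, legs):
    """For each fused ID, give its 1-based rank in every leg (None if absent)."""
    report = {}
    for doc_id in fused_ids:
        report[doc_id] = {
            name: (ranking.index(doc_id) + 1 if doc_id in ranking else None)
            for name, ranking in legs.items()
        }
    return report


# Illustrative rankings only; in the notebook these would be vec_ids and bm25_ids.
sample_legs = {"vector": ["0", "2", "1"], "bm25": ["1", "0", "3"]}
for doc_id, ranks in rank_by_leg(["0", "1", "2"], sample_legs).items():
    print(doc_id, ranks)
```

A hit that ranks high in one leg and is missing or low in the other is the kind of case worth logging, since it shows which leg the fused result depends on.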
Takeaways
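- Hybrid retrieval (dense + sparse) is standard in production search because the two legs fail differently: embeddings capture paraphrase, while BM25 catches rare exact tokens such as ERR-5041.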
- Choose and tune the fusion method. RRF uses only ranks, so it needs no score normalization; a weighted linear combination needs BM25 and cosine scores normalized onto a common scale first.
- Log which leg (vector or BM25) retrieved each hit; this helps debug regressions after embedding-model updates.
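For comparison with RRF, a weighted linear fusion could look like the sketch below. It assumes min-max normalization and a single mixing weight `alpha`; both are illustrative choices, not something this notebook prescribes, and the function names and scores are made up. Note that a Chroma query returns distances, so with the cosine space you would convert to similarities (1 - distance) before passing them in as dense scores.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Combine per-document scores; both lists are indexed by document position."""
    dense_n = min_max(dense_scores)
    sparse_n = min_max(sparse_scores)
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense_n, sparse_n)]


# Made-up scores for four documents; higher is better in both lists.
dense = [0.41, 0.38, 0.30, 0.22]  # cosine similarities
sparse = [0.0, 2.7, 0.0, 0.4]     # BM25 scores
combined = weighted_fusion(dense, sparse, alpha=0.5)
best = max(range(len(combined)), key=combined.__getitem__)
print(best, combined)
```

Unlike RRF, this approach is sensitive to outliers in either score distribution, which is one reason `alpha` and the normalization scheme usually need tuning on held-out queries.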