09 — RAG Is Not One Metric: Retrieval@k
Problem: A single “accuracy” number hides failures. Retrieval metrics (is the gold chunk in the top-k results?) must be tracked separately from generation metrics.
In this notebook: each question has a known relevant chunk id; we compute Recall@k against a bi-encoder index.
In [ ]:
import sys
from pathlib import Path
_REPO = Path.cwd().resolve()
if (_REPO / "src").is_dir():
    sys.path.insert(0, str(_REPO / "src"))
from rag_series_utils import chroma_path, get_client
from sentence_transformers import SentenceTransformer
chunks = {
"c1": "Error ERR-5041 means gateway timeout.",
"c2": "Error AUTH-02 means invalid API key.",
"c3": "Retries should use exponential backoff.",
}
qa = [
("What is ERR-5041?", "c1"),
("Why would AUTH-02 appear?", "c2"),
]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
p = chroma_path("nb09_eval")
client = get_client(p)
try:
    client.delete_collection("c")
except Exception:
    pass
col = client.create_collection("c", metadata={"hnsw:space": "cosine"})
ids = list(chunks.keys())
texts = [chunks[i] for i in ids]
emb = model.encode(texts, show_progress_bar=False).tolist()
col.add(ids=ids, documents=texts, embeddings=emb)
def recall_at_k(k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = 0
    for q, gold in qa:
        qe = model.encode(q, show_progress_bar=False).tolist()
        r = col.query(query_embeddings=[qe], n_results=k)
        retrieved = set(r["ids"][0])
        hits += int(gold in retrieved)
    return hits / len(qa)

for k in [1, 2, 3]:
    print(f"Recall@{k} =", recall_at_k(k))
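Recall@k only asks whether the gold chunk appears anywhere in the top k; it ignores where it ranks. Mean Reciprocal Rank rewards putting the gold chunk higher. A minimal sketch on plain ranked id lists (the `ranked_ids` / `gold_ids` names are illustrative, not from this notebook — in practice you would feed it the `r["ids"][0]` lists returned by `col.query`):

```python
def mrr(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the gold id,
    counting 0 when the gold id is absent from the ranking."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids)

# Gold "c1" at rank 2, gold "c2" at rank 1 -> (0.5 + 1.0) / 2 = 0.75
print(mrr([["c3", "c1"], ["c2", "c3"]], ["c1", "c2"]))
```

Because MRR changes when results merely swap positions inside the top k, it catches ranking regressions that Recall@k cannot see.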
Takeaways
- Track Recall@k, MRR, and nDCG for retrieval; use human or LLM rubrics for answer faithfulness separately.
- Slice metrics by language, product area, and query length.
- Store golden sets in version control and run them on every embedding or chunking change.
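nDCG generalizes this to graded relevance, but with binary relevance and a single gold chunk per question (as in this notebook's golden set) it reduces to a simple position discount: 1/log2(rank + 1) at the gold position, and the ideal DCG is 1/log2(2) = 1. A hedged sketch under that single-gold assumption (the function name is illustrative):

```python
import math

def ndcg_at_k(ranked: list[str], gold: str, k: int) -> float:
    """Binary-relevance nDCG@k with one gold id: DCG is 1/log2(rank+1)
    at the gold position; ideal DCG is 1, so nDCG equals DCG here."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0

print(ndcg_at_k(["c1", "c3"], "c1", k=2))  # gold at rank 1 -> 1.0
print(ndcg_at_k(["c3", "c1"], "c1", k=2))  # gold at rank 2 -> 1/log2(3) ≈ 0.631
```

With multiple gold chunks of varying relevance grades you would need the full DCG/IDCG ratio; a library implementation such as scikit-learn's `ndcg_score` handles that case.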