07 - Stale Index: Incremental Updates and Tombstones
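Problem: Users read answers from an index that lags the source system. A document deleted at the source must not "resurrect" as a ghost chunk in retrieval results.
In this notebook: Deleting a document by ID from a Chroma collection simulates a tombstone. The same query runs before and after the delete to show the top hit change.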
In [ ]:
import sys
from pathlib import Path

# Make the repo-local helpers in src/ importable when running from the repo root.
_REPO = Path.cwd().resolve()
if (_REPO / "src").is_dir():
    sys.path.insert(0, str(_REPO / "src"))

from rag_series_utils import chroma_path, get_client
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
p = chroma_path("nb07_stale")
client = get_client(p)

# Drop any collection left over from a previous run so reruns start from the same state.
try:
    client.delete_collection("docs")
except Exception:
    pass  # nothing to delete on the first run
col = client.create_collection("docs", metadata={"hnsw:space": "cosine"})

# Two versions of the same policy, stored under stable versioned IDs.
texts = ["Old policy: refunds within 7 days.", "New policy: refunds within 30 days."]
ids = ["doc_policy_v1", "doc_policy_v2"]
emb = model.encode(texts, show_progress_bar=False).tolist()
col.add(ids=ids, documents=texts, embeddings=emb)

q = "How long is the refund window?"
qe = model.encode(q, show_progress_bar=False).tolist()


def top1():
    """Return (document, id) of the closest match for the query."""
    r = col.query(query_embeddings=[qe], n_results=1)
    return r["documents"][0][0], r["ids"][0][0]


print("Before delete v1:", top1())
# A hard delete by ID stands in for a tombstone arriving from the source system.
col.delete(ids=["doc_policy_v1"])
print("After tombstone v1:", top1())
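The cell above removes the old chunk with a hard delete. A true tombstone is usually a soft delete: the ID is marked dead, queries filter it out, and a later compaction pass removes the data. The following is a minimal sketch of that idea, not Chroma's API. `TinyIndex`, its methods, and the keyword match standing in for vector similarity are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TinyIndex:
    """Hypothetical in-memory stand-in for a vector collection (not Chroma)."""

    docs: dict = field(default_factory=dict)        # doc_id -> text
    tombstones: set = field(default_factory=set)    # doc_ids marked as deleted

    def upsert(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        self.tombstones.discard(doc_id)  # a re-added document is live again

    def tombstone(self, doc_id: str) -> None:
        # Soft delete: mark the ID as dead but leave the stored chunk in place.
        self.tombstones.add(doc_id)

    def search(self, term: str) -> list:
        # Keyword match stands in for similarity search (an assumption of this
        # sketch); tombstoned IDs are filtered out at query time.
        return [
            doc_id
            for doc_id, text in self.docs.items()
            if term.lower() in text.lower() and doc_id not in self.tombstones
        ]

    def compact(self) -> int:
        # Physically remove tombstoned chunks, e.g. from a periodic job.
        dead = [doc_id for doc_id in self.tombstones if doc_id in self.docs]
        for doc_id in dead:
            del self.docs[doc_id]
            self.tombstones.discard(doc_id)
        return len(dead)


index = TinyIndex()
index.upsert("doc_policy_v1", "Old policy: refunds within 7 days.")
index.upsert("doc_policy_v2", "New policy: refunds within 30 days.")

before = index.search("refunds")
index.tombstone("doc_policy_v1")
after = index.search("refunds")
removed = index.compact()

print("Before tombstone:", before)
print("After tombstone:", after)
print("Chunks removed by compaction:", removed)
```

Filtering at query time means the delete takes effect immediately, even if physically removing the chunk is deferred.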
Takeaways
- Treat the index as eventually consistent; expose freshness in the UI when needed.
- Use stable source IDs so updates are upserts and deletes propagate.
- Monitor ingestion lag (time from source change to searchable).
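The last two takeaways can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not part of the notebook's utilities: `plan_sync` and `ingestion_lag` are hypothetical helpers, the content hashes are placeholder strings, and the timestamps are made-up values. The sketch compares the source and the index by stable ID to decide what to upsert and what to delete, then measures lag as the time between a source change and the moment the change becomes searchable.

```python
from datetime import datetime, timedelta, timezone


def plan_sync(source: dict, indexed: dict) -> tuple:
    """Diff two {stable_id: content_hash} maps.

    Returns (ids_to_upsert, ids_to_delete). Hypothetical helper for illustration.
    """
    # New or changed at the source: the hash is missing from or differs in the index.
    to_upsert = sorted(i for i, h in source.items() if indexed.get(i) != h)
    # Gone from the source: must be removed so it cannot come back as a ghost chunk.
    to_delete = sorted(i for i in indexed if i not in source)
    return to_upsert, to_delete


def ingestion_lag(changed_at: datetime, searchable_at: datetime) -> timedelta:
    """Time from a source change to the point it is searchable."""
    return searchable_at - changed_at


# Placeholder hashes: "policy" changed, "faq" is new, "legacy" was deleted upstream.
source = {"policy": "h2", "faq": "h9"}
indexed = {"policy": "h1", "legacy": "h0"}
to_upsert, to_delete = plan_sync(source, indexed)
print("Upsert:", to_upsert)
print("Delete:", to_delete)

# Made-up timestamps for illustration only.
changed_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
searchable_at = datetime(2024, 1, 1, 12, 4, 30, tzinfo=timezone.utc)
lag = ingestion_lag(changed_at, searchable_at)
print("Ingestion lag (s):", lag.total_seconds())
```

Because the IDs are stable, an edited document overwrites its old entry instead of adding a second copy, and anything missing from the source shows up in the delete list.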