02 - Right Chunk, Wrong Context: Structural Chunking
Problem: Fixed-size windows often split mid-thought, so retrieval can return a “relevant” fragment that lacks the answer (e.g. the policy number sits in the next chunk).
In this notebook: One synthetic policy document; compare fixed-size chunks vs paragraph chunks for the same query.
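The two chunkers used in this notebook come from `rag_series_utils`, whose source is not shown here. As a rough illustration of the two strategies being compared, here is a minimal sketch; the names `sketch_fixed_size` and `sketch_by_paragraphs` and the toy document are hypothetical, and the real helpers may behave differently (for example in how they treat the last window or oversized paragraphs).

```python
def sketch_fixed_size(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a chunk_size-character window, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def sketch_by_paragraphs(text: str, max_chars: int) -> list[str]:
    """Split on blank lines, then greedily merge paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if not current or len(candidate) <= max_chars:
            current = candidate  # an oversized single paragraph is kept whole
        else:
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks


# Toy document (hypothetical): two sections separated by a blank line.
doc = (
    "Refund Policy\n"
    "Refunds are available within 30 days. The dispute ID is POL-ENT-7721.\n\n"
    "API Rate Limits\n"
    "Standard tier allows 100 requests per minute."
)

fixed = sketch_fixed_size(doc.replace("\n", " "), chunk_size=60, overlap=10)
para = sketch_by_paragraphs(doc, max_chars=120)

for name, chunks in (("fixed", fixed), ("paragraph", para)):
    print(name, len(chunks))
    for c in chunks:
        print("  ", repr(c))
```

The fixed-size splitter ignores section boundaries entirely, while the paragraph splitter keeps each heading together with its body as long as the section fits within `max_chars`.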
In [ ]:
import sys
from pathlib import Path
# Make the repo's src/ directory importable when running from the repo root.
_REPO = Path.cwd().resolve()
if (_REPO / "src").is_dir():
    sys.path.insert(0, str(_REPO / "src"))
from rag_series_utils import chunk_fixed_size, chunk_by_paragraphs, chroma_path, get_client
from sentence_transformers import SentenceTransformer
long_doc = '''
Refund Policy
Enterprise customers may request a full refund within 30 days of the invoice date.
The policy identifier for billing disputes is POL-ENT-7721. Include this ID in support tickets.
API Rate Limits
Standard tier allows 100 requests per minute. Burst limits may apply during incidents.
'''.strip()
q = "What is the policy ID for billing disputes?"
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def index_and_query(chunks, slug):
    """Index chunks in a fresh Chroma collection and return the top hit for q."""
    p = chroma_path(f"nb02_{slug}")
    client = get_client(p)
    # Drop any collection left over from a previous run.
    try:
        client.delete_collection("chunks")
    except Exception:
        pass
    col = client.create_collection("chunks", metadata={"hnsw:space": "cosine"})
    emb = model.encode(chunks, show_progress_bar=False).tolist()
    ids = [f"c{i}" for i in range(len(chunks))]
    col.add(ids=ids, documents=chunks, embeddings=emb)
    qe = model.encode(q, show_progress_bar=False).tolist()
    res = col.query(query_embeddings=[qe], n_results=1)
    # Return the best-matching chunk text and its cosine distance.
    return res["documents"][0][0], res["distances"][0][0]
fixed = chunk_fixed_size(long_doc.replace("\n", " "), chunk_size=120, overlap=20)
para = chunk_by_paragraphs(long_doc, max_chars=400)
top_fixed, d_fix = index_and_query(fixed, "fixed")
top_para, d_par = index_and_query(para, "para")
print("Query:", q)
print("\nFixed-size top chunk:\n", top_fixed)
print("distance:", d_fix)
print("\nParagraph top chunk:\n", top_para)
print("distance:", d_par)
print("\nTop chunk contains POL-ENT-7721 -> fixed:", "POL-ENT-7721" in top_fixed, "| paragraph:", "POL-ENT-7721" in top_para)
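The `distance` values printed above come from a collection created with `"hnsw:space": "cosine"`. To my understanding, Chroma reports cosine distance as 1 minus cosine similarity, so lower means closer. The pure-Python sketch below shows that arithmetic on toy vectors; `cosine_distance` is a hypothetical helper for illustration, not a Chroma API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for embeddings.
same = cosine_distance([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

A low distance only says the chunk is close to the query in embedding space. It does not say the chunk contains the answer, which is why the final `print` checks for the ID directly.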
Takeaways
- Prefer structure-aware splitting (headings, paragraphs, tables) when documents have clear sections.
- Overlap reduces boundary cuts but increases index size; tune it against real document layouts.
- Evaluate with questions whose answers sit on chunk boundaries; that is where naive chunking fails.
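The overlap trade-off and the boundary test from the takeaways can be checked without any embedding model. The sketch below is a minimal illustration under stated assumptions: `window_chunks`, `answer_intact`, and the synthetic text (filler characters with the policy ID deliberately placed across offset 300) are all hypothetical and only serve to show the effect.

```python
def window_chunks(text, chunk_size, overlap):
    """Fixed-size character windows; skips a trailing window that adds nothing new."""
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def answer_intact(chunks, answer):
    """True if at least one chunk contains the full answer string."""
    return any(answer in c for c in chunks)


answer = "POL-ENT-7721"
# Synthetic text: the ID occupies offsets 295..306, straddling the boundary at 300.
text = "x" * 295 + answer + "y" * 293

report = {}
for overlap in (0, 20, 50):
    chunks = window_chunks(text, chunk_size=100, overlap=overlap)
    report[overlap] = (len(chunks), answer_intact(chunks, answer))
    print(f"overlap={overlap:>2}  chunks={len(chunks):>2}  answer intact={report[overlap][1]}")
```

With no overlap the ID is cut in two and no chunk can answer the question. Adding overlap recovers it at the cost of more chunks to embed and store.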