Enterprise RAG · Part 1
Not All Vectors Are Equal: Chunking and Embeddings Explained Like You’re Human
If you’ve ever watched a demo where “chat with your PDFs” works perfectly—and then shipped something that confidently answers the wrong question—you’re not alone. This guide explains, in plain language, what happens before the fancy language model ever speaks: how we cut documents into pieces, how we turn those pieces into numbers, and why those choices matter more than any single model name. Read with coffee; take breaks; come back—you will still find value.
Reading path: start here (Part 1), continue to Part 2, then optionally Part 3 scenarios. Together these parts form one ~10k-word essay.
The story everyone lives through
Let me slow down and paint the scene in a bit more detail, because if you’ve only seen slick conference demos, the real story is easy to miss. In a demo, someone hand-picks five PDFs—short, clean, English-only, no tables that bleed across pages. They chunk with a sensible default, embed with a popular model, and ask questions that line up almost too well with the text. The retrieval hits look great on slide screenshots. Everyone applauds. Then the pilot moves to the messy world: legacy PDFs scanned sideways, Confluence pages with macros, Slack threads pasted into wikis, and policies that change every quarter. Suddenly “retrieve relevant chunks” is not a solved problem—it’s a daily negotiation between structure, tooling, and human expectations.
Users don’t care whether you used 512-token windows or paragraph splitting. They care whether the answer is right and whether they can trust it. When retrieval fails, the language model often still produces fluent text. That fluency is the trap: it sounds like help, but it’s confabulating on top of the wrong evidence. Your job as the builder is to make the evidence pipeline so solid that the model rarely has room to invent—and when it does, your system can say “I don’t know” or “here are the passages I used.” Chunking and embeddings are where that pipeline starts. They are not “preprocessing” in the sense of grunt work; they are half of the product.
Picture this: your company has thousands of pages of policies, runbooks, and product notes. Leadership asks for an internal chatbot. A proof-of-concept on ten PDFs looks magical. Then you connect the real corpus—and the magic thins out. Sometimes the bot answers well; sometimes it cites a paragraph that sounds related but omits the one sentence that actually matters. Sometimes it pulls the wrong section entirely because the “right” answer was split across two chunks. The cosine similarity score still looks fine. That’s the quiet failure mode of retrieval-augmented generation: retrieval quality degrades before anyone notices, because the model is so fluent that it sounds authoritative even when the context is wrong.
This guide is about the two decisions that sit underneath that experience: how you slice text (chunking) and how you represent meaning as vectors (embedding models). If you only optimize the chat model and ignore those two, you’re tuning the engine of a car while the wheels are misaligned. Freshers deserve an explanation that doesn’t assume three years of search experience. Experienced folks deserve nuance without vendor fluff. I’ll give you both.
What RAG actually is (without the acronym soup)
Retrieval-augmented generation means: before the large language model generates an answer, you retrieve a small set of passages from your own knowledge base and pass them in as context. The model is then asked to answer using that material (ideally with citations). That’s the “augmented” part—you’re not relying only on what the model memorized during training.
Why bother? Because your internal documents weren’t in the training data, or they change every week, or they’re confidential. RAG is how you connect “general language ability” to “our facts.” The pipeline, in the simplest story, looks like this:
- Ingest: Load documents from wikis, tickets, PDFs, etc.
- Chunk: Split each document into manageable pieces your index can store.
- Embed: Turn each chunk into a vector (a list of numbers) that captures meaning.
- Index: Store vectors in a vector database so you can search by similarity.
- Query time: Embed the user’s question, find the nearest chunks, hand them to the LLM.
Steps 2 and 3 are where most tutorials wave their hands. Tutorials love to show step 5 because it’s satisfying. But step 2 decides whether the right information is even available to retrieve, and step 3 decides whether “similar in meaning” lines up with “useful for this question.” That’s why this article spends real time on them.
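To make the five steps concrete, here is a deliberately tiny end-to-end sketch. The word-count "embedding" is a toy stand-in for a real embedding model (a real pipeline would call the same model for documents and queries at the marked lines); everything else—chunk, embed, index, retrieve—follows the pipeline above.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    # A real pipeline would call the same embedding model here for both
    # documents and queries.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingest + chunk (one chunk per string here) + embed + index
chunks = [
    "A refund is available within 30 days of purchase.",
    "API rate limits are 100 requests per minute per key.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, return the nearest chunk
query_vec = embed("How do I request a refund after purchase?")
top = max(index, key=lambda item: cosine(query_vec, item[1]))
print(top[0])  # the refund chunk ranks first
```

The toy is honest about one thing real systems hide: retrieval quality depends entirely on what the "embed" function considers similar, and on how the text was cut before indexing.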
Why “chunking” is not a boring detail
A chunk is the atomic unit you index. A vector database doesn’t store “the whole employee handbook”; it stores hundreds or thousands of chunks, each with its own vector. When someone asks a question, you retrieve a handful of chunks—not whole documents—unless you’ve explicitly designed for that.
If your chunk stops mid-policy, the retrieved text might be “similar” to the question (because words overlap) but incomplete. If your chunk is huge, you dilute meaning: the vector is an average of many ideas, and similarity search becomes fuzzy. If your chunk is tiny, you might miss context that disambiguates two similar policies. Chunking is the tradeoff between precision (narrow, focused snippets) and context (enough surrounding words to make sense).
There is no universal best chunk size. There is only “best for this corpus and these questions.” The three classic strategies below—fixed window, semantic, hierarchical—are not mutually exclusive in advanced systems; teams often combine them. But you have to understand each one first.
Fixed-size windows: the simple hammer
Let’s dwell here longer than a tutorial would, because fixed windows are where most teams start—and where many stay far too long. There is nothing shameful about that: a baseline you understand beats a sophisticated strategy you cannot debug. The problem is not the hammer; it is using the hammer on every nail without noticing when the material changes from wood to glass.
How it works. You choose a target length—say 300 characters or 512 tokens—and walk through the document with a sliding window. Often you add overlap: each window shares a few dozen characters with the previous one, so a sentence that would otherwise sit on a boundary appears whole in at least one chunk.
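A minimal fixed-window splitter looks like the sketch below (character-based; the size and overlap values are illustrative, not recommendations):

```python
def fixed_window_chunks(text, size=300, overlap=50):
    """Slide a window of `size` characters; consecutive windows share
    `overlap` characters so boundary sentences survive whole in one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end
    return chunks

doc = "".join(str(i % 10) for i in range(700))  # stand-in document
chunks = fixed_window_chunks(doc, size=300, overlap=50)
print(len(chunks))                        # 3 windows over 700 characters
print(chunks[0][-50:] == chunks[1][:50])  # True: tail of one = head of the next
```

Note the guard at the top: an overlap equal to or larger than the window is an easy bug to ship, and it either errors out or duplicates everything depending on the implementation.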
Why people use it. It’s trivial to implement, predictable in cost (you know roughly how many chunks you’ll get), and easy to parallelize. For homogeneous text (chat logs, uniform articles) it can be fine.
Where it hurts. Policies, legal clauses, and APIs have structure. A fixed window cuts through that structure blindly. The refund deadline might land in chunk A while the exception (“unless you purchased through a reseller”) lands in chunk B. Your embedding model might still rank chunk A highly for a refund question—because the words match—but the answer is incomplete or misleading without chunk B. That’s the “right chunk, wrong context” problem in its purest form.
Layman analogy. Imagine photocopying a book through a fixed window of lines, ignoring chapter breaks. Page after page of copies works fine—until a table is split across two pages. Tables don’t like being torn in half; neither do policies.
When to use it anyway. Prototypes, logs, or text with no headings. Also as a baseline: if your fancy chunking strategy can’t beat fixed windows on your evaluation set, something is wrong with your evaluation or your data—not with the baseline.
One more subtlety: overlap is not free. It multiplies storage and embedding cost because the same sentences appear in multiple chunks. Teams sometimes set overlap to “fix” bad boundary effects when the real fix is smarter splitting. Track your average chunk count per document before and after overlap changes—if the count doubles, you should see a clear retrieval win to justify the cost.
For lay readers: imagine highlighting a textbook with a fixed-width highlighter blindfolded. Overlap is like going back a few lines each time so you don’t miss a line that sits on the fold. It helps—but it doesn’t turn a bad page break into a good one.
Semantic chunking: follow the author’s paragraphs
How it works. Instead of counting characters, you split on meaningful boundaries: paragraph breaks, double newlines, or sentence boundaries from a parser. You might still enforce a maximum length—if a paragraph is ten pages, you subdivide—but the default cut is “where the author already started a new idea.”
Why it helps. Authors already grouped sentences that belong together. A well-formed paragraph often answers one sub-question. Retrieval is then closer to “find the paragraph that answers this” instead of “find the substring that shares keywords.”
Tradeoffs. Paragraphs vary wildly in length. A single bullet list might be one “paragraph” in the file and still huge. Tables and code blocks need special handling. Multilingual documents can confuse naive sentence segmentation. And PDFs often don’t preserve paragraph breaks—extracted text becomes a wall of words, and semantic chunking degrades to “fake paragraphs” unless you fix layout first.
Who it’s for. Wikis, Notion pages, well-structured HTML, and any corpus where the source format respects paragraphs. If your input is clean, semantic chunking is often the first upgrade from fixed windows.
In practice, “semantic” chunking is often implemented as: split on double newlines, then merge paragraphs until a max length, then optionally split long paragraphs by sentences. That middle step—merging—matters because tiny paragraphs (one-line bullets) should not always stand alone as separate vectors; they may lack context. Conversely, merging too aggressively recreates the “too big” problem. This is why mature pipelines expose knobs: max paragraph length, minimum chunk size, and whether to attach parent headings to each chunk.
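That merge-based recipe can be sketched in a few lines. The max length value is illustrative, and this version deliberately lets an oversized single paragraph pass through whole—a fuller implementation would sub-split it by sentence:

```python
def semantic_chunks(text, max_len=800):
    """Split on blank lines, then greedily merge paragraphs until max_len.
    Oversized single paragraphs pass through whole here; a fuller version
    would sub-split them by sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_len:
            current = candidate          # small paragraph: keep merging
        else:
            if current:
                chunks.append(current)   # flush the running chunk
            current = para               # start fresh from this paragraph
    if current:
        chunks.append(current)
    return chunks

doc = "Short intro.\n\n" + "B" * 500 + "\n\nA closing note."
chunks = semantic_chunks(doc, max_len=400)
print([len(c) for c in chunks])  # [12, 500, 15]
```

Notice the middle chunk: 500 characters despite a 400-character cap. That is the "ten-page paragraph" problem from the tradeoffs above, surfacing in ten lines of code.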
If you are a student reading this for a project: start with paragraph splits plus a max length. Log your chunk-length distribution. If you see a long tail of huge chunks, your source extraction is probably broken or your merge rules are too permissive.
Hierarchical chunking: respect the outline
How it works. You split the document the way a human skims it: by section headings, chapter titles, or table of contents. A chunk might be “everything under ## Refund policy until the next ## heading.” Nested outlines can produce nested chunks: section → subsection → paragraph.
Why it helps. Many enterprise questions are naturally scoped to a section: “What’s our SLA for priority incidents?” maps to a Support section, not a random 300-character slice. Hierarchical chunks align retrieval with how organizations write and navigate documents.
Tradeoffs. You need detectable structure. Markdown and HTML are friendly. Flat text exports from PDFs are not. Some teams run a layout model or OCR pipeline first; that’s extra complexity and failure modes. Very long sections may still need sub-chunking.
Combination with the others. A common pattern: first split by heading, then if a section exceeds N tokens, split by paragraph inside that section. You preserve hierarchy while bounding size.
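That heading-then-paragraph pattern can be sketched for Markdown sources. The `## ` convention and the max length are assumptions about the corpus; real documents need more robust heading detection:

```python
import re

def hierarchical_chunks(markdown, max_len=600):
    """Split on '## ' headings; sub-split oversized sections by paragraph,
    re-attaching the heading so every piece keeps its context."""
    sections = re.split(r"(?m)^(?=## )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_len:
            chunks.append(section)       # the whole section fits
            continue
        heading, _, body = section.partition("\n")
        for para in body.split("\n\n"):
            if para.strip():
                chunks.append(f"{heading}\n{para.strip()}")
    return chunks

doc = (
    "## Refund policy\n"
    "Refunds are honored within 30 days.\n\n"
    "Resellers are excluded from this window.\n\n"
    "## API limits\n"
    "The default limit is 100 requests per minute."
)
for chunk in hierarchical_chunks(doc, max_len=40):
    print(repr(chunk))  # every chunk starts with its heading
```

The payoff is visible in the output: the reseller exception carries the "Refund policy" heading with it, so a retrieval hit on either paragraph still tells the model (and the user) which policy it belongs to.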
Think of hierarchical chunking as matching the way people skim: jump to a section title, read until the next title. Retrieval that returns a whole section chunk is often easier for users to validate than a mid-sentence fragment. For compliance and legal docs where citations must be defensible, hierarchical chunks also align better with “see Section 4.2” style references—another reason product teams prefer them when the source has reliable headings.
When headings are missing or auto-generated (“Untitled Section”), hierarchical chunking degrades. Extraction tools sometimes invent headings from font size; that can work, but errors propagate. Always spot-check a random sample of chunks in the UI your users will see—beautiful architecture diagrams mean nothing if the chunks read like garbage.
“We upgraded the embedding model twice and barely moved recall. Then we changed chunking and recall jumped. Same vectors, different slices of reality.”
— A pattern repeated in dozens of production postmortems
What is an embedding, really?
An embedding is a fixed-length list of numbers (a vector) produced by a model trained to place “similar meaning” near each other in high-dimensional space. “Similar” is learned from huge amounts of text—paraphrases, translations, neighboring sentences—so that the dot product or cosine similarity between two vectors correlates with semantic relatedness.
You don’t need to visualize 384 dimensions. What matters operationally:
- Same model for queries and documents. Mixing models breaks the geometry of the space.
- Dimension tradeoff. Higher dimensions can store finer distinctions but increase index size and query cost.
- Domain mismatch. A model trained on web text might underperform on medical charts or legal clauses—not because embeddings are “wrong,” but because “similarity” wasn’t trained for that vocabulary.
Popular open models like all-MiniLM-L6-v2 are small and fast; larger models like all-mpnet-base-v2 often retrieve better on paraphrased questions at the cost of latency and memory. Neither is “the answer” until you measure on your queries.
Choosing an embedding model (without chasing leaderboards)
Leaderboards (e.g. MTEB) are useful for relative comparisons, but your traffic is not the benchmark. A practical selection process:
- Build a small golden set: 50–200 real questions with labeled “correct” chunk IDs or passages. Include paraphrases, typos, and rare terms your users actually use.
- Measure retrieval quality first: Recall@k, MRR, or nDCG—did the right chunk appear in the top 5?
- Then look at latency and cost: p95 latency under load, index size on disk, GPU vs CPU.
- Consider fine-tuning only if a generic model plateaus and you have enough labeled pairs.
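The first two steps wire together in a few lines. The data shapes here (question → labeled chunk ID, question → ranked chunk IDs) are hypothetical—use whatever your pipeline actually logs:

```python
def recall_at_k(golden, retrieved, k=5):
    """Fraction of questions whose labeled chunk appears in the top-k results.
    `golden` maps question -> correct chunk id; `retrieved` maps
    question -> ranked list of chunk ids (both shapes are illustrative)."""
    hits = sum(
        1 for q, gold_id in golden.items()
        if gold_id in retrieved.get(q, [])[:k]
    )
    return hits / len(golden)

def mrr(golden, retrieved):
    """Mean reciprocal rank of the first correct chunk (0 if absent)."""
    total = 0.0
    for q, gold_id in golden.items():
        ranking = retrieved.get(q, [])
        if gold_id in ranking:
            total += 1 / (ranking.index(gold_id) + 1)
    return total / len(golden)

golden = {"refund after 45 days?": "chunk-12", "api rate limit?": "chunk-40"}
retrieved = {
    "refund after 45 days?": ["chunk-12", "chunk-7"],       # hit at rank 1
    "api rate limit?": ["chunk-3", "chunk-40", "chunk-8"],  # hit at rank 2
}
print(recall_at_k(golden, retrieved, k=5))  # 1.0
print(mrr(golden, retrieved))               # (1/1 + 1/2) / 2 = 0.75
```

Twenty lines like these, versioned next to your golden set, are what turn "it feels better" into a number you can defend in a release review.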
Freshers: don’t be embarrassed to start with one well-documented model and a spreadsheet of failures. Seniors: don’t skip the golden set because “we’ll eyeball it”—eyeballs don’t scale across releases.
A little geometry (stay with me)
You do not need linear algebra to ship RAG, but one picture helps. Imagine every chunk of text as a point in space. Similar meanings sit closer together; unrelated meanings sit farther apart. An embedding model is the function that maps text to points. A query is another point. “Nearest neighbor search” means: find the chunk-points closest to the query-point.
When chunks are too large, each point represents a blend of many ideas—the point moves toward the center of mass of everything in that chunk. Neighbors become mushy. When chunks are too small, each point is sharp but may lack disambiguating context—two different policies might look similar because they share jargon. Chunking changes the geometry. That’s why “same model, different chunks” can feel like a different product.
Cosine similarity is the usual metric because it cares about direction, not magnitude—two vectors pointing the same way score 1 even if one is longer. In practice, teams normalize vectors and use approximate nearest neighbor libraries (HNSW, IVF, etc.) because exact search across millions of vectors is too slow. None of that changes the lesson: your embedding space is only as honest as your chunks and your model’s training.
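A worked example of that direction-only property, in plain Python:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Direction only: a vector and twice that vector score exactly 1.
    return dot(normalize(a), normalize(b))

a = [1.0, 2.0, 3.0]
print(round(cosine(a, [2.0, 4.0, 6.0]), 6))  # 1.0 — same direction, longer vector

# After normalization, cosine is just a dot product — which is why many
# vector databases store unit vectors and run inner-product search.
u, w = normalize(a), normalize([3.0, 2.0, 1.0])
print(round(dot(u, w), 6) == round(cosine(a, [3.0, 2.0, 1.0]), 6))  # True
```

This is also why a pipeline that normalizes document vectors but forgets to normalize query vectors can still "work" under cosine similarity yet silently break once someone switches the index to raw inner product.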
Characters, tokens, and why your “300 characters” setting isn’t universal
Chunk sizes are often quoted in characters for simplicity in tutorials, or in tokens when people align with model limits. A token is not a word—it might be a syllable, a punctuation piece, or a common substring, depending on the tokenizer. English averages roughly four characters per token, but code and URLs break that relationship.
If your embedding model was trained with a maximum sequence length of 256 tokens, stuffing 800 tokens into a “chunk” before embedding either gets truncated (information loss) or fails, depending on the library. Always read the model card: max length, recommended batch size, language coverage. Aligning chunk size with the model’s comfort zone reduces silent truncation—one of those bugs that doesn’t throw a red error, just quietly worsens vectors.
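A cheap guard against silent truncation is to estimate token counts before embedding. The four-characters-per-token figure below is a rough English-prose heuristic, not a tokenizer—production code should count with the model's own tokenizer from its model card:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate. Real pipelines should use the embedding
    model's own tokenizer; 4 chars/token is an English-prose heuristic only
    and breaks on code, URLs, and many non-English languages."""
    return max(1, len(text) // chars_per_token)

def check_chunk_fits(chunk, model_max_tokens=256, margin=0.9):
    """Flag chunks near the model's limit before they get silently cut."""
    est = estimate_tokens(chunk)
    budget = int(model_max_tokens * margin)
    return est <= budget, est

ok, est = check_chunk_fits("word " * 500, model_max_tokens=256)
print(ok, est)  # False 625 — ~625 estimated tokens will be truncated at 256
```

Run a check like this over your whole corpus once per chunking change; the fraction of flagged chunks is an early-warning metric that costs nothing to compute.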
Walkthrough: one refund policy, three chunking mindsets
Imagine a short internal policy with two sections: refunds and API limits. A fixed window might cut through the middle of the refund section, leaving “30 days” in one chunk and “except for resellers” in another. A user asks, “Can I get a refund after 45 days?” The first chunk matches “refund” strongly; the second never surfaces. The bot answers from incomplete evidence.
Semantic chunking keeps each paragraph intact if the source has clean newlines. Hierarchical chunking attaches the heading to each section, so “Refund policy” stays bound to its body. Neither is magic—both can fail if the PDF extraction mangled newlines—but they encode different assumptions about how humans wrote the doc.
When you evaluate, use questions that deliberately target boundaries: numbers near edges, exceptions in the next paragraph, pronouns whose antecedent sits upstream. Those are the questions that separate “demo” from “production.”
Putting chunking and embeddings together
The same embedding model will behave differently when chunks change. Smaller chunks mean sharper vectors but more fragments to rank; larger chunks mean more context per hit but muddier similarity. A common workflow:
- Pick a chunking strategy that matches document structure.
- Embed with a mainstream model and establish a baseline retrieval metric.
- Swap embedding models without changing chunks to isolate model effect.
- Then experiment with chunk boundaries while holding the model fixed.
Changing both at once makes debugging impossible—you won’t know which lever moved the needle.
What breaks in production (and how teams fix it)
Stale chunks. Policy updated; index still serves old text. Fix: versioning, re-ingestion pipelines, tombstones (we cover this in another part of the series).
Language mismatch. Queries in Spanish, documents in English. Fix: multilingual embedding models or language-specific indexes.
Keyword-heavy queries. Error codes and SKUs need exact token match; pure dense search can miss. Fix: hybrid search (BM25 + vectors)—covered in Part 3.
Overconfidence. High similarity score but wrong answer. Fix: reranking, abstention when top scores are low, evaluation—later parts.
Mistakes I see on repeat (so you can skip them)
Mistake 1: Optimizing the LLM first. The most expensive model cannot fix retrieval that never surfaces the right paragraph. Start with retrieval metrics; add generation quality second.
Mistake 2: No golden questions. “It feels better” doesn’t survive the next embedding model upgrade. You need a small, versioned set of questions and expected evidence.
Mistake 3: Mixing embedding models across index and query. People re-index with model B but forget to re-embed the query pipeline. Everything looks broken until you notice the mismatch.
Mistake 4: Ignoring preprocessing. OCR noise, HTML boilerplate, and repeated headers (“Confidential” on every page) pollute chunks. Clean upstream; don’t expect vectors to “figure it out.”
Mistake 5: Trusting cosine scores as probabilities. Similarity is relative to your corpus. A score of 0.8 on one index might mean something different on another. Use thresholds tuned on data, not blog posts.
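Tuning a threshold on data is less work than it sounds. The sketch below picks an abstention cutoff from labeled top-1 scores; the inputs are hypothetical lists you would gather from your own golden set, and the exhaustive search is fine at this scale:

```python
def tune_threshold(correct_scores, incorrect_scores):
    """Pick an abstention threshold from labeled data rather than folklore:
    try each observed score as a cutoff and keep the one that best separates
    correct from incorrect top-1 retrievals."""
    candidates = sorted(set(correct_scores + incorrect_scores))
    best_t, best_acc = 0.0, 0.0
    total = len(correct_scores) + len(incorrect_scores)
    for t in candidates:
        # Policy: answer when score >= t, abstain otherwise.
        acc = (sum(s >= t for s in correct_scores)
               + sum(s < t for s in incorrect_scores)) / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

correct = [0.82, 0.78, 0.90, 0.85]  # top-1 was the labeled chunk
incorrect = [0.76, 0.81, 0.60]      # top-1 was wrong despite a decent score
t, acc = tune_threshold(correct, incorrect)
print(t, acc)  # the 0.78 cutoff separates best on this toy data
```

The toy data makes the article's point: one of the wrong answers scored 0.81—higher than a right answer—which is exactly why a threshold copied from a blog post is meaningless on your index.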
What a good design review sounds like
If you present a RAG system to a senior engineer, the questions worth celebrating are specific: What chunking strategy did you choose and what failure mode does it introduce? How did you validate retrieval on paraphrases? What’s your plan when the policy changes mid-week? If instead the conversation stays stuck on “which LLM,” push back gently: retrieval is the foundation. Show your metrics, show your worst failures, show your iteration plan. That’s how you earn trust—not by naming the newest model.
Mini glossary (share with non-technical readers)
Chunk: A segment of text stored as one unit in the index.
Embedding: A list of numbers representing text for similarity search.
Vector database: A datastore optimized for finding nearest vectors.
Recall@k: Did the correct passage appear in the top k results?
Reranking: A second model that re-scores a short list for precision.
FAQ for interviewers and skeptics
Q: Is bigger always better for embedding models?
Larger models are often better on nuance, but worse on cost and latency. Measure on your queries.
Q: Can I use OpenAI embeddings with a local vector DB?
Yes, if policy allows—but you’ll depend on that API for both indexing and querying.
Q: What’s the minimum viable chunking for a hackathon?
Fixed size with overlap, then iterate if demos fail on real questions.
Q: Should I normalize text before embedding?
Lowercasing is not always wise (SKUs are case-sensitive). Do remove obvious boilerplate if it dominates chunks.
Q: How do I know if my chunks are too big?
If many unrelated ideas sit in one chunk and retrieval feels “fuzzy,” try smaller pieces or hierarchical splits.
Q: How do I know if my chunks are too small?
If pronouns and references break, or legal qualifiers sit orphaned, enlarge or merge with parent sections.
Q: Do I need GPUs?
Not for small indexes; GPUs mainly help with embedding and reranking throughput at scale.
Q: Can one index serve multiple languages?
Use a multilingual embedding model or separate indexes per language—don’t mix blindly.
Q: What about security?
Retrieval must respect ACLs—covered in Part 5 of this series. Embeddings don’t erase permission requirements.
Q: How often should I re-embed?
When you change embedding models or chunking; otherwise update only changed documents.
Q: Is hybrid search always required?
No—if your queries are purely semantic, dense may suffice. Add sparse when IDs and keywords matter.
Q: Why do two similar questions retrieve different chunks?
Embedding sensitivity, chunk boundaries, and tie-breaking in approximate search—debug with logging.
Q: What’s a good first metric?
Whether the human-approved passage appears in top 5 for a held-out question set.
A practical playbook you can use Monday morning
Theory is useless without a sequence of steps you can actually follow. Here is a playbook teams use when they inherit a RAG system that “sort of works” but cannot be trusted in front of customers. You can run it solo over a few days; you can run it as a pair with a subject-matter expert who knows which answers must never be wrong.
Step A — Inventory the pain. Collect twenty real questions users asked in support tickets, Slack, or email. Include at least five “edge” questions: rare product names, dates near policy boundaries, and phrases that don’t match document wording. If you only test happy-path questions, you will fool yourself.
Step B — Freeze the LLM. For retrieval debugging, use a deterministic or low-temperature setting, and ask the model to quote the retrieved passages. You are not evaluating fluency yet; you are evaluating whether the right paragraphs arrived. If the passages are wrong, better writing from the LLM is misdirection.
Step C — Log retrieval outputs. For each question, log the top five chunk IDs, similarity scores, and the chunk text. When something fails, classify the failure: wrong chunk ranked high, right chunk missing, right chunk split across two IDs, right chunk stale, query language mismatch, or keyword term not embedded well.
Step D — Triage the failure class. Split-boundary problems point to chunking. Missing rare tokens point to hybrid search. Stale text points to ingestion. Wrong language points to multilingual models or translation. If you skip classification, you will thrash—changing the embedding model when the real issue was PDF extraction.
Step E — Change one lever at a time. Re-chunk with a new strategy and re-embed end-to-end. Measure again. Then swap embedding models without changing chunk boundaries. Record deltas in a table. This is boring documentation work and it is exactly what separates “we tried stuff” from “we know what moved the metric.”
Step F — Socialize the failure budget. Every retrieval system has a residual error rate. Be explicit about what fraction of questions you can tolerate failing, and under what conditions (e.g. internal-only vs customer-facing). Alignment with stakeholders prevents the endless chase of “zero mistakes” in a fundamentally probabilistic stack.
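Steps C and D can be captured in a small classifier that runs over your retrieval logs. The bucket names and data shapes below are illustrative, not a standard taxonomy—rename them to match your own postmortems:

```python
def classify_failure(gold_ids, retrieved_ids, k=5):
    """Bucket a retrieval result the way Step D suggests. `gold_ids` is the
    set of chunk ids a human labeled as required evidence; `retrieved_ids`
    is the ranked list your pipeline logged for the question."""
    top = retrieved_ids[:k]
    found = [g for g in gold_ids if g in top]
    if len(found) == len(gold_ids):
        return "ok"              # all required evidence arrived in the top k
    if found:
        return "split-evidence"  # partial hit: evidence split across chunks
    if any(g in retrieved_ids for g in gold_ids):
        return "ranked-too-low"  # in the candidate list, lost in ranking
    return "missing-from-top"    # never retrieved: suspect chunking/ingestion

print(classify_failure({"c1", "c2"}, ["c1", "c9", "c2"]))  # ok
print(classify_failure({"c1", "c2"}, ["c1", "c9", "c8"]))  # split-evidence
print(classify_failure({"c7"}, ["c1", "c2", "c3", "c4", "c5", "c7"]))  # ranked-too-low
```

Counting these buckets per week is the cheapest dashboard you will ever build, and it tells you which lever from Step D to pull next.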
What “good” feels like over a quarter
In the first week, you discover how messy your documents are. In the fourth week, you have a labeled set and a baseline. By the twelfth week, you should see fewer “surprise” failures in production because you’ve instrumented and fixed the systematic issues: ingestion lag, ACL bugs, chunking boundaries, and embedding mismatch. Progress is not measured by vibes; it is measured by downward trendlines in failure categories and upward trendlines in recall@k on held-out questions.
Experienced readers will notice I haven’t promised magic. That’s intentional. The respect you want from colleagues and readers comes from honesty: you show the tradeoffs, you show the measurement, you show the ugly examples. Students appreciate clarity; seniors appreciate rigor. Lay readers appreciate plain language that doesn’t talk down to them. If this piece helps you explain chunking to a product manager, or helps a student pass an interview without memorizing buzzwords, it did its job.
When fine-tuning embeddings is worth the trouble (and when it isn’t)
Fine-tuning or training contrastive pairs on your domain can lift retrieval when generic models plateau. It is not free: you need positive pairs (query, relevant passage) and often hard negatives, plus training infrastructure and versioning discipline. For early-stage projects, a better chunking strategy and a stronger off-the-shelf model usually beat a small fine-tune on noisy data. Consider fine-tuning when you have thousands of high-quality pairs, stable document formatting, and a baseline that already clears the “not embarrassing” bar.
Three vignettes (composite, but faithful to real incidents)
Vignette 1 — The half-sentence refund. A fixed 200-character window placed the numeric refund window in one chunk and the exception for enterprise accounts in the next. Users asking “Do enterprises get 60 days?” received answers grounded only on the first chunk. Switching to hierarchical splits by heading made the exception visible in the same retrieved unit. No change to the LLM prompt was needed—the evidence finally arrived intact.
Vignette 2 — The multilingual help center. Queries arrived in Spanish; documents were English-only. Dense retrieval returned English chunks with high similarity to translated keywords in the query embedding space—sometimes relevant, sometimes subtly off. Adding language detection and routing queries either to translated document variants or to a multilingual embedding model reduced user-visible errors more than bumping the LLM size.
Vignette 3 — The shiny embedding upgrade. A team swapped to a new embedding model with better benchmark scores. Recall improved on half the questions—and regressed on the other half because chunk boundaries had been tuned implicitly for the old model’s error profile. They rolled back, re-chunked with explicit overlap rules, and re-measured. Lesson: upgrades are not drop-in when geometry shifts.
Mindset: teach others, measure yourself
If you mentor juniors, have them write down five retrieval failures and classify them before proposing solutions. If you report upward, translate metrics into business language: “We reduced incorrect retrieval on priority questions from 18% to 9% this sprint” beats “we improved the pipeline.” If you’re learning alone, keep a notebook (paper or digital) of surprises—those stories become your interview answers and your blog posts.
Readers from freshers to experienced staff all share one need: respect for their time. Long articles are only “good” if every section earns its place. I’d rather give you a playbook, a few vivid stories, and honest limits than pretend there is a single knob labeled “best model” that solves enterprise RAG. The field rewards people who combine clear writing with clear measurement—this series is written in that spirit.
Appendix: numbers to remember (rules of thumb, not laws)
- Overlaps of 10–20% of chunk length often reduce boundary cuts for fixed windows.
- Many sentence-transformer models use 256–512 token limits; check the card.
- Start evaluation with 50+ questions before trusting directional conclusions.
- Approximate nearest neighbors trade recall for speed; monitor recall@k after index rebuilds.
- When in doubt, log chunk IDs in production—it’s cheaper than reputation damage.
What a two-week sprint on retrieval actually looks like
Monday: you instrument logging. Tuesday: you realize half your “failures” are actually users typing filenames wrong. Wednesday: you fix preprocessing. Thursday: you discover two different pipelines writing into the same collection with different embedding models—historical accident. Friday: you write the postmortem email you wish you didn’t have to send. Next week: you stabilize, re-run evals, and finally move the recall number. This is not pessimism; it is realism. Retrieval work is debugging distributed systems where one of the nodes is human language.
If you document weekly, you turn pain into a story stakeholders understand. “We improved recall@5 from 0.61 to 0.74 by splitting on headings and removing duplicate boilerplate headers” is a sentence people can repeat. “The AI is smarter now” is not.
A word of encouragement (meant sincerely)
If you are a fresher: you are not behind because you didn’t memorize arcane math. You are ahead if you ask “what evidence supports this answer?” If you are experienced: you are not wasting time explaining chunking to newcomers—clarity at the foundation prevents expensive mistakes later. If you are a lay reader tasked with buying software: demand evaluation methodology, not adjectives.
Applause in tech rarely comes from retrieval layers. It comes when customer support tickets drop, when onboarding accelerates, when compliance reviews pass. That is the applause worth working for—and it starts with humble choices about how to cut text and how to embed it.
The contract between this writer and you
I promise not to hide behind jargon where plain language will do. I promise to tell you when a technique is situational, not universal. I promise to treat your time as valuable: if a section does not help you make a decision or debug a failure, it should not survive editing. In return, I ask you to try at least one exercise: measure something on your data this week. Not “eventually,” not “when we have bandwidth”—this week. Measurement turns opinion into engineering.
Negative feedback on earlier HTML exports pushed this series toward real essays. That feedback was a gift. If something here still falls short, carry that standard forward: write the article you wished existed, publish it, and help the next reader. The ecosystem improves when we reward depth over hype.
Letters to the reader (common worries, plain answers)
“I’m overwhelmed by options.” Start with fixed windows plus overlap. Establish a baseline metric. Change one thing. You are allowed to move slowly.
“My boss wants the newest model.” Show a table: metric vs cost vs latency. Let numbers discipline hype.
“I don’t have a golden set.” Start with ten questions you genuinely care about. Ten honest questions beat a thousand fake ones.
“Users type garbage.” Log queries, cluster failures, fix preprocessing and synonyms before touching embeddings again.
“I’m comparing myself to big tech blogs.” Don’t. Compare yourself to yesterday’s version of your system.
“Documentation feels pointless.” It is how future you survives onboarding your own project after six months away.
“I need a perfect answer.” You need a system that fails gracefully and measurably improves month over month.
“I’m a student; nobody will read my write-up.” Write clearly anyway—clarity is transferable skill; it compounds.
“I’m senior; I already know this.” Refreshers prevent blind spots. Teach someone; teaching reveals gaps.
“This article is long.” Long because retrieval is where projects quietly die—and you deserve a map.
Continue to Part 2 (embeddings deep dive) →
Closing: you’re allowed to iterate
If you take one thing from this piece: RAG quality is a systems problem. Chunking and embeddings are not glamorous keywords—they are where enterprise RAG quietly wins or loses. You don’t need permission to run a small evaluation set, re-chunk, and try again. The best teams I’ve seen treat retrieval like software: versioned, measured, and iterated. Whether you’re a fresher building your first project or a lead architect reviewing a design doc, that discipline is what earns respect—not any single model name dropped in a slide deck.
The companion notebook for this topic runs concrete code: fixed-size, semantic, and hierarchical splits on the same short policy, then compares two embedding models side by side. Read the theory here, run the numbers there, and you’ll have both the story and the receipts.
Epilogue — why this topic deserves a long essay
Readers asked for blog-quality writing and depth—not because verbosity is virtuous, but because shallow takes waste their time. Negative feedback on thin HTML exports was fair: a notebook printout is not a lesson. A lesson needs motivation, structure, repeated intuition from different angles, and honest limits. That is what Part 1 tries to give you, and Part 2 continues in the same voice.
If you are new: read slowly. If you are experienced: skim the headings, stop where your spidey sense tingles—those are the places your system might be lying to you. If you are a hiring manager: ask candidates to explain chunking tradeoffs in five minutes. If they can, they’ve actually built something. If they can only name models, keep interviewing.
Finally, appreciation goes to everyone who reads long-form technical writing in an era of bullet points. You are the audience that keeps standards high. This series exists for you—keep the feedback coming, and keep building retrieval systems worth trusting.
Between Part 1, Part 2, and the optional Part 3 supplement, you now have a combined reading experience designed to cross roughly ten thousand words—long enough to reward a quiet afternoon, still bounded enough to finish over a weekend. If you quote this work in your team’s design doc, link back to the HTML so your colleagues can read the same context. If you teach from it, adapt the stories to your domain—truth in retrieval is always contextual, but the shape of the lesson holds.
That ten-thousand-word target is not magic—it is a commitment: enough room to explain, reassure, warn, and encourage without rushing. If you are skimming, bookmark and return; if you are reading deeply, take notes in the margins of your mind. Either path is valid. You made it.