How we cut LLM costs 87% by routing only the generation step to Claude
A practical case for per-stage model selection in production RAG.
If you've built a Retrieval-Augmented Generation pipeline, you know the temptation: pick one model, set the temperature, ship it. It's clean. It's simple. It's also the most expensive way to run RAG.
A modern RAG pipeline makes five to seven LLM calls per user question, not one. The vast majority of those calls don't need a frontier model. We rebuilt our system to make that explicit — and our monthly LLM spend dropped 87%.
This post walks through the seven stages of our pipeline, what each one needs from a model, and what we landed on.
The seven calls behind every "single" answer
A user types "What does our standard MSA say about IP indemnification caps?" and presses enter. From the user's perspective, one model answered. From the system's perspective:
- Router: classify the question. Is this a fact lookup (vector), an aggregation (SQL), or a relationship question (graph)? Output: one of vector | sql | cypher | platform.
- Query expansion: generate 2–3 alternative phrasings to improve retrieval recall. Output: a short list of strings.
- HyDE (Hypothetical Document Embeddings): write a plausible answer to the question without retrieval, then embed it to bias retrieval toward documents that look like the answer. Output: ~2 paragraphs.
- Entity extraction: identify the named entities in the question (companies, products, clauses) for graph traversal. Output: a JSON array.
- Relevance filter: for each retrieved chunk, decide whether it's worth keeping. Output: yes/no per chunk.
- Self-RAG grounding: after generation, verify the answer is supported by the retrieved context. Output: {grounded: true | false}.
- Generation: write the final answer the user reads.
Only one of these — generation — is the answer the user sees. The other six are infrastructure.
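To make the seven stages concrete, here is a minimal sketch that names them and validates the router's one-word output. The `Stage` enum, its string values, and `parse_route` are illustrative names, not the pipeline's actual code; only the four route labels come from the list above.

```python
from enum import Enum


class Stage(str, Enum):
    """The seven stages, in the order they are listed above (names are hypothetical)."""
    ROUTER = "router"
    EXPANSION = "expansion"
    HYDE = "hyde"
    ENTITY_EXTRACTION = "entity_extraction"
    RELEVANCE = "relevance"
    GROUNDING = "grounding"
    GENERATION = "generation"


# Only generation produces text the user reads; the rest is infrastructure.
USER_FACING = {Stage.GENERATION}

# Route labels the router is allowed to emit.
ROUTES = ("vector", "sql", "cypher", "platform")


def infrastructure_stages() -> list:
    """Return every stage whose output the user never sees."""
    return [s for s in Stage if s not in USER_FACING]


def parse_route(raw: str) -> str:
    """Normalise a router reply and reject anything outside the allowed labels."""
    label = raw.strip().strip(".").lower()
    if label not in ROUTES:
        raise ValueError(f"unexpected route label: {raw!r}")
    return label
```

Rejecting unknown labels loudly, rather than guessing a default route, is a design choice of this sketch; a production system might retry or fall back instead.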
What each call actually needs
Here's where it gets interesting. The user doesn't see the output of stages 1–6. They only see whether the final answer was right. So the question for each stage isn't "what's the best model?" — it's "what's the cheapest model that doesn't degrade the final answer?"
| Stage | Task | Latency | Context | Output | Sensitivity |
|---|---|---|---|---|---|
| Router | Classification | < 500 ms | 1 question | 1 word | Medium |
| Expansion | Generation | < 1 s | 1 question | ~50 tokens | Low |
| HyDE | Generation | < 2 s | 1 question | ~200 tokens | Low |
| Entity extraction | Structured | < 500 ms | 1 question | ~30 tokens | Medium |
| Relevance | Yes/no | < 200 ms × N | 1 chunk | 1 word | Medium |
| Self-RAG | Yes/no | < 1 s | answer + ctx | 1 word | High |
| Generation | Long-form | streamed | full ctx | ~500 tokens | High |
This is the table we should have had before we started. Once you write it down, the trade-offs become obvious.
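One way to act on the table is to hold it as data and pick, per stage, the cheapest candidate that fits the latency budget without lowering end-to-end acceptance. The sketch below is an assumption about how that selection could look: `StageNeeds`, `Candidate`, `cheapest_acceptable`, and the `tolerance` parameter are all hypothetical, and the candidate measurements have to come from your own evaluation runs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class StageNeeds:
    task: str
    latency_budget_ms: Optional[int]  # None means streamed, no hard budget
    sensitivity: str


# Latency budgets and sensitivities transcribed from the table above.
NEEDS: Dict[str, StageNeeds] = {
    "router": StageNeeds("classification", 500, "medium"),
    "expansion": StageNeeds("generation", 1000, "low"),
    "hyde": StageNeeds("generation", 2000, "low"),
    "entity_extraction": StageNeeds("structured", 500, "medium"),
    "relevance": StageNeeds("yes/no", 200, "medium"),  # budget is per chunk
    "grounding": StageNeeds("yes/no", 1000, "high"),
    "generation": StageNeeds("long-form", None, "high"),
}


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float          # any consistent unit; relative cost is enough
    p95_latency_ms: int           # measured on your own hardware
    end_to_end_acceptance: float  # final-answer acceptance with this model in the stage


def cheapest_acceptable(stage: str, candidates: Iterable[Candidate],
                        baseline_acceptance: float,
                        tolerance: float = 0.01) -> Candidate:
    """Return the cheapest candidate that meets the stage's latency budget
    and keeps end-to-end acceptance within `tolerance` of the baseline."""
    needs = NEEDS[stage]
    viable = [
        c for c in candidates
        if (needs.latency_budget_ms is None
            or c.p95_latency_ms <= needs.latency_budget_ms)
        and c.end_to_end_acceptance >= baseline_acceptance - tolerance
    ]
    if not viable:
        raise LookupError(f"no candidate satisfies the needs of stage {stage!r}")
    return min(viable, key=lambda c: c.cost_per_call)
```

The key point the function encodes is that quality is judged on the final answer, not on the stage's own output.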
The configuration we shipped
After weeks of A/B testing on a corpus of 2,400 legal documents and 800 representative questions, here's what we run in production:
| Stage | Primary | Fallback |
|---|---|---|
| Router | local 3B model (~250 ms) | cloud (haiku-class) |
| Expansion | local 7B model (~600 ms) | cloud (haiku-class) |
| HyDE | local 7B model | cloud (haiku-class) |
| Entity extraction | local 3B model | cloud (haiku-class) |
| Relevance | local 3B model | cloud (haiku-class) |
| Self-RAG grounding | local 3B model | cloud (haiku-class) |
| Generation | cloud (sonnet-class) | cloud (haiku-class) |
Only the final answer hits a frontier cloud model. Everything else runs on quantised 3B and 7B models on a single workstation-class GPU.
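The table above maps naturally onto rows of (stage, primary_model, fallback_model), which is the shape the post describes later for its dispatcher config. Here is a minimal sketch using SQLite; the table name `stage_models`, the model identifiers, and the helper functions are placeholders chosen for illustration, not the production schema.

```python
import sqlite3

# Rows transcribed from the table above; model identifiers are placeholders.
ROWS = [
    ("router", "local-3b", "cloud-haiku-class"),
    ("expansion", "local-7b", "cloud-haiku-class"),
    ("hyde", "local-7b", "cloud-haiku-class"),
    ("entity_extraction", "local-3b", "cloud-haiku-class"),
    ("relevance", "local-3b", "cloud-haiku-class"),
    ("grounding", "local-3b", "cloud-haiku-class"),
    ("generation", "cloud-sonnet-class", "cloud-haiku-class"),
]


def create_stage_table(conn: sqlite3.Connection) -> None:
    """Create the stage-to-model table and load the shipped configuration."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stage_models ("
        "stage TEXT PRIMARY KEY, "
        "primary_model TEXT NOT NULL, "
        "fallback_model TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO stage_models VALUES (?, ?, ?)", ROWS)
    conn.commit()


def models_for(conn: sqlite3.Connection, stage: str) -> tuple:
    """Return (primary_model, fallback_model) for a stage."""
    row = conn.execute(
        "SELECT primary_model, fallback_model FROM stage_models WHERE stage = ?",
        (stage,),
    ).fetchone()
    if row is None:
        raise KeyError(stage)
    return row


def reassign(conn: sqlite3.Connection, stage: str, primary_model: str) -> None:
    """Change a stage's primary model without touching application code."""
    cur = conn.execute(
        "UPDATE stage_models SET primary_model = ? WHERE stage = ?",
        (primary_model, stage),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise KeyError(stage)
```

Keeping the mapping in a table rather than in code is what makes a stage reassignment a data change instead of a deploy.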
The numbers
Before the refactor, every stage went through a sonnet-class model. After: only the final generation. Same workload, same hardware. Monthly LLM cost dropped 87%. Answer quality on a held-out human-evaluated set: statistically indistinguishable (78.2% acceptance vs. 78.9% — well within noise).
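The "within noise" claim can be sanity-checked with a two-proportion z-test. The sketch below takes the sample sizes as parameters because the size of the held-out set isn't stated here; plugging in 800 per arm (borrowed from the 800 representative questions, purely as an assumption) is one way to try it.

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the difference between two observed proportions,
    using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Acceptance rates from the post; 800 per arm is an assumed sample size.
z = two_proportion_z(0.782, 800, 0.789, 800)
p = two_sided_p_value(z)
```

Under that assumed sample size, the statistic comes out at about 0.34 in magnitude, well short of the usual 1.96 threshold, so a 0.7-point gap would not count as a detectable difference. A much larger evaluation set could change that conclusion.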
Why this isn't more common
Three reasons.
First, stage 7 (generation) feels like the whole product, so most pipelines were designed around picking one good model for it and then reusing that model for "convenience" everywhere else.
Second, the stages aren't equally important, but they all look the same in the code. When you write client.chat.completions.create(...) seven times, it doesn't occur to you that six of those calls are doing something fundamentally cheaper than the seventh.
Third, switching providers per stage is painful in most LLM frameworks. We ended up writing a thin dispatcher: it takes a stage argument and resolves it against a database table of (stage, primary_model, fallback_model) tuples. Because the mapping lives in the database, we can change stage assignments from the admin UI without a deploy.
The whole pattern is a single call at the call site:

    answer = await llm.complete(
        stage="generation",
        system=system_prompt,
        user=user_prompt,
    )

The dispatcher does the rest: it loads the stage config, picks the provider, tries the primary, falls back on retriable errors, and records usage for the trace.
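A minimal sketch of such a dispatcher might look like the following. Everything here is an assumption made for illustration: the `Provider` callable signature, the `RetriableError` class, the "provider:model" target format, and the trace fields are invented, and real provider SDK calls would sit behind the provider callables.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List

# Hypothetical provider interface: async (model, system, user) -> text.
Provider = Callable[[str, str, str], Awaitable[str]]


class RetriableError(Exception):
    """Raised by a provider on timeouts, overload, or an unreachable local model."""


@dataclass
class StageConfig:
    primary: str   # "provider:model", e.g. "local:3b"
    fallback: str  # same format


@dataclass
class Dispatcher:
    config: Dict[str, StageConfig]
    providers: Dict[str, Provider]
    trace: List[dict] = field(default_factory=list)

    async def complete(self, stage: str, system: str, user: str) -> str:
        cfg = self.config[stage]  # load the stage config
        for target, is_fallback in ((cfg.primary, False), (cfg.fallback, True)):
            provider_name, model = target.split(":", 1)  # pick the provider
            started = time.perf_counter()
            try:
                text = await self.providers[provider_name](model, system, user)
            except RetriableError:
                if is_fallback:
                    raise  # both targets failed; surface the error
                continue   # primary failed; try the fallback
            # Record which model actually served the stage.
            self.trace.append({
                "stage": stage,
                "model": target,
                "used_fallback": is_fallback,
                "latency_ms": (time.perf_counter() - started) * 1000,
            })
            return text
        raise RuntimeError("unreachable: the loop always returns or raises")
```

Non-retriable exceptions propagate immediately in this sketch, so a malformed request is not silently retried against the more expensive fallback.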
What we'd do differently
- Audit quality per stage, not globally. When we shifted entity extraction to a small local model, our knowledge-graph route's accuracy dropped 6%. We fixed it by tightening the system prompt for that stage. If we'd only looked at end-to-end accuracy we'd have missed it.
- Always keep a cloud fallback. Local models can be unavailable. Every stage in our config has a cheap cloud fallback. Triggered ~0.4% of the time. Worth it.
- Make the trace visible. Our "Why this answer?" panel shows admins which model ran each stage. This is what convinced our team that local models on the lower stages were genuinely doing the job.
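The first lesson, auditing per stage, can be sketched as a small comparison between a baseline run and a candidate run. This assumes you have gold labels for each stage's output; the record shape, the function names, and the `max_drop` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per evaluated stage output: (stage, predicted, expected).
Record = Tuple[str, object, object]


def accuracy_by_stage(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of stage outputs that match the gold label, per stage."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stage, predicted, expected in records:
        totals[stage] += 1
        hits[stage] += int(predicted == expected)
    return {stage: hits[stage] / totals[stage] for stage in totals}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                max_drop: float = 0.02) -> Dict[str, float]:
    """Stages whose accuracy fell by more than `max_drop` (absolute)."""
    return {
        stage: baseline[stage] - candidate[stage]
        for stage in baseline
        if stage in candidate and baseline[stage] - candidate[stage] > max_drop
    }
```

A drop confined to one stage, like the entity-extraction case above, shows up here even when the end-to-end number barely moves.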
Try this yourself
If you're running RAG with one model end-to-end, your bill is probably 3–10× what it could be without sacrificing quality. The lift to fix it is a few days, not a quarter. Start by writing the table above for your own pipeline — what each stage actually needs.
If you'd rather not build the dispatcher yourself, that's the product we ship. Pipeline stages are configurable from an admin UI, with both local and cloud models as options per stage.
Want to see this running on your documents?
A 20-minute walkthrough on your own corpus tells you more than any benchmark.