May 14, 2026 · 12 min read

How we cut LLM costs 87% by routing only the generation step to Claude

A practical case for per-stage model selection in production RAG.

If you've built a Retrieval-Augmented Generation pipeline, you know the temptation: pick one model, set the temperature, ship it. It's clean. It's simple. It's also the most expensive way to run RAG.

A modern RAG pipeline makes five to seven LLM calls per user question, not one. The vast majority of those calls don't need a frontier model. We rebuilt our system to make that explicit, and our monthly LLM spend dropped 87%.

This post walks through the seven stages of our pipeline, what each one needs from a model, and what we landed on.

The seven calls behind every "single" answer

A user types "What does our standard MSA say about IP indemnification caps?" and presses enter. From the user's perspective, one model answered. From the system's perspective:

1. Router: classify the question. Is this a fact lookup (vector), an aggregation (SQL), or a relationship question (graph)? Output: one of vector | sql | cypher | platform.
2. Query expansion: generate 2–3 alternative phrasings to improve retrieval recall. Output: a short list of strings.
3. HyDE (Hypothetical Document Embeddings): write a plausible answer to the question without retrieval, then embed it to bias retrieval toward documents that look like the answer. Output: ~2 paragraphs.
4. Entity extraction: identify the named entities in the question (companies, products, clauses) for graph traversal. Output: a JSON array.
5. Relevance filter: for each retrieved chunk, decide whether it's worth keeping. Output: yes/no per chunk.
6. Self-RAG grounding: after generation, verify the answer is supported by the retrieved context. Output: {grounded: true | false}.
7. Generation: write the final answer the user reads.

Only one of these, generation, is the answer the user sees. The other six are infrastructure.
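Because the six infrastructure stages return tiny, constrained outputs, each one can be pinned down with a strict parser. The sketch below is illustrative only: the `Stage` enum, `parse_route`, and `parse_yes_no` are hypothetical names, and defaulting to the vector route (or dropping a chunk on an unclear reply) is an assumption, not a description of our production behaviour.

```python
from enum import Enum


class Stage(str, Enum):
    """The seven pipeline stages, in the order listed above (names are illustrative)."""
    ROUTER = "router"
    EXPANSION = "expansion"
    HYDE = "hyde"
    ENTITY_EXTRACTION = "entity_extraction"
    RELEVANCE = "relevance"
    GROUNDING = "grounding"
    GENERATION = "generation"


# The four labels the router is allowed to emit.
ROUTES = ("vector", "sql", "cypher", "platform")


def parse_route(raw: str, default: str = "vector") -> str:
    """Normalise raw router output to a known label.

    Assumption: anything unrecognised falls back to `default`.
    """
    label = raw.strip().strip("\"'`.").lower()
    return label if label in ROUTES else default


def parse_yes_no(raw: str) -> bool:
    """Relevance-filter output. Assumption: only a reply starting with 'yes' keeps the chunk."""
    return raw.strip().lower().startswith("yes")
```

Strict parsing like this is part of what makes small models viable on these stages: a one-word answer is either valid or it isn't, so a malformed reply is easy to detect and handle.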

What each call actually needs

Here's where it gets interesting. The user doesn't see the output of stages 1–6. They only see whether the final answer was right. So the question for each stage isn't "what's the best model?"; it's "what's the cheapest model that doesn't degrade the final answer?"

Stage	Task type	Latency budget	Input context	Output size	Sensitivity
Router	Classification	< 500 ms	1 question	1 word	Medium
Expansion	Generation	< 1 s	1 question	~50 tokens	Low
HyDE	Generation	< 2 s	1 question	~200 tokens	Low
Entity extraction	Structured	< 500 ms	1 question	~30 tokens	Medium
Relevance	Yes/no	< 200 ms × N	1 chunk	1 word	Medium
Self-RAG	Yes/no	< 1 s	answer + ctx	1 word	High
Generation	Long-form	streamed	full ctx	~500 tokens	High

This is the table we should have had before we started. Once you write it down, the trade-offs become obvious.
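The selection rule behind the table ("the cheapest model that doesn't degrade the final answer") can be written down directly. This is a minimal sketch under stated assumptions: `Candidate`, `cheapest_acceptable`, and the `tolerance` parameter are hypothetical, and it presumes you have already measured end-to-end acceptance and latency with each candidate model swapped into the stage.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float     # any consistent unit
    p95_latency_ms: float    # measured for this stage's typical input
    acceptance: float        # end-to-end answer acceptance with this model in the stage


def cheapest_acceptable(
    candidates: Iterable[Candidate],
    baseline_acceptance: float,
    latency_budget_ms: float,
    tolerance: float = 0.01,
) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the latency budget and keeps
    end-to-end acceptance within `tolerance` of the baseline, or None."""
    viable = [
        c for c in candidates
        if c.p95_latency_ms <= latency_budget_ms
        and c.acceptance >= baseline_acceptance - tolerance
    ]
    return min(viable, key=lambda c: c.cost_per_call) if viable else None
```

The key design choice is that `acceptance` is the quality of the final answer, not of the stage's own output, which matches how the user experiences the pipeline.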

The configuration we shipped

After weeks of A/B testing on a corpus of 2,400 legal documents and 800 representative questions, here's what we run in production:

Stage	Primary	Fallback
Router	local 3B model (~250 ms)	cloud (haiku-class)
Expansion	local 7B model (~600 ms)	cloud (haiku-class)
HyDE	local 7B model	cloud (haiku-class)
Entity extraction	local 3B model	cloud (haiku-class)
Relevance	local 3B model	cloud (haiku-class)
Self-RAG grounding	local 3B model	cloud (haiku-class)
Generation	cloud (sonnet-class)	cloud (haiku-class)

Only the final answer hits a frontier cloud model. Everything else runs on quantised 3B and 7B models on a single workstation-class GPU.
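One way to keep this table as data rather than code is a small lookup table keyed by stage. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table name `stage_models` and model identifiers such as `local-3b` and `cloud-sonnet-class` are placeholders standing in for the rows above, not real model names.

```python
import sqlite3

# (stage, primary_model, fallback_model); identifiers are placeholders.
ROWS = [
    ("router", "local-3b", "cloud-haiku-class"),
    ("expansion", "local-7b", "cloud-haiku-class"),
    ("hyde", "local-7b", "cloud-haiku-class"),
    ("entity_extraction", "local-3b", "cloud-haiku-class"),
    ("relevance", "local-3b", "cloud-haiku-class"),
    ("grounding", "local-3b", "cloud-haiku-class"),
    ("generation", "cloud-sonnet-class", "cloud-haiku-class"),
]


def build_config_db() -> sqlite3.Connection:
    """Create an in-memory config table and load the stage assignments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stage_models ("
        "stage TEXT PRIMARY KEY, "
        "primary_model TEXT NOT NULL, "
        "fallback_model TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO stage_models VALUES (?, ?, ?)", ROWS)
    conn.commit()
    return conn


def resolve(conn: sqlite3.Connection, stage: str) -> tuple:
    """Look up (primary_model, fallback_model) for a stage."""
    row = conn.execute(
        "SELECT primary_model, fallback_model FROM stage_models WHERE stage = ?",
        (stage,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no model configured for stage {stage!r}")
    return row
```

Reassigning a stage is then a single `UPDATE stage_models SET primary_model = ? WHERE stage = ?`, with no code change.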

The numbers

Before the refactor, every stage went through a sonnet-class model. After it, only the final generation does. Same workload, same hardware. Monthly LLM cost dropped 87%. Answer quality on a held-out, human-evaluated set was statistically indistinguishable (78.2% vs. 78.9% acceptance, well within noise).
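If you want to run the same comparison on your own pipeline, the arithmetic is a per-stage sum. The helper names below are hypothetical and no real prices or token counts are implied; plug in your own monthly token volumes and per-1k-token prices (a stage served locally would be priced at whatever marginal cost you assign it).

```python
from typing import Mapping


def monthly_cost(
    tokens_by_stage: Mapping[str, float],
    price_per_1k_by_stage: Mapping[str, float],
) -> float:
    """Sum over stages of (tokens / 1000) * price per 1k tokens for that stage."""
    return sum(
        tokens / 1000 * price_per_1k_by_stage[stage]
        for stage, tokens in tokens_by_stage.items()
    )


def savings_fraction(cost_before: float, cost_after: float) -> float:
    """Fractional reduction in spend, e.g. 0.5 means the bill halved."""
    if cost_before <= 0:
        raise ValueError("cost_before must be positive")
    return 1 - cost_after / cost_before
```

Holding `tokens_by_stage` fixed and changing only the price map is the "same workload" comparison: the saving comes entirely from which model serves each stage.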

Why this isn't more common

Three reasons.

First, stage 7 (generation) feels like the whole product, so most pipelines were designed around picking one good model for it and then reusing that model for "convenience" everywhere else.

Second, the stages aren't equally demanding, but they all look the same in the code path. When you write client.chat.completions.create(...) seven times, it doesn't occur to you that six of those calls are doing something fundamentally cheaper than the seventh.

Third, switching providers per stage is painful in most LLM frameworks. We ended up writing a thin dispatcher: it takes a stage argument and resolves it against a database table of (stage, primary_model, fallback_model) rows. An admin UI edits that table, so we can change stage assignments without a deploy.

At the call site, the whole pattern is a single call:

answer = await llm.complete(
    stage="generation",
    system=system_prompt,
    user=user_prompt,
)

The dispatcher does the rest: it loads the stage config, picks the provider, tries the primary, falls back on retriable errors, and records usage for the trace.
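For readers who want to build something similar, here is a minimal sketch of such a dispatcher. It is not our production code: `Dispatcher`, `RetriableError`, and the two stub providers are hypothetical, the config is a plain dict rather than a database table, and real providers would wrap whichever client libraries you use.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

# A provider is any async callable: (system, user) -> completion text.
Provider = Callable[[str, str], Awaitable[str]]


class RetriableError(Exception):
    """Hypothetical marker for failures worth retrying on the fallback
    (timeouts, overload, local model unavailable)."""


@dataclass
class Dispatcher:
    config: dict[str, tuple[str, str]]   # stage -> (primary_model, fallback_model)
    providers: dict[str, Provider]       # model name -> provider callable
    trace: list[dict] = field(default_factory=list)

    async def complete(self, stage: str, system: str, user: str) -> str:
        primary, fallback = self.config[stage]
        used = primary
        try:
            text = await self.providers[primary](system, user)
        except RetriableError:
            # Only retriable errors trigger the fallback; anything else propagates.
            used = fallback
            text = await self.providers[fallback](system, user)
        self.trace.append({"stage": stage, "model": used, "fell_back": used != primary})
        return text


# Stub providers so the sketch runs without any network or GPU.
async def flaky_local(system: str, user: str) -> str:
    raise RetriableError("local model unavailable")


async def stub_cloud(system: str, user: str) -> str:
    return f"[stub answer to: {user}]"
```

With `llm = Dispatcher(config=..., providers=...)`, the call site above works unchanged, and `llm.trace` holds one record per call showing which model actually ran.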

What we'd do differently

Audit quality per stage, not globally. When we shifted entity extraction to a small local model, our knowledge-graph route's accuracy dropped 6%. We fixed it by tightening the system prompt for that stage. If we'd only looked at end-to-end accuracy we'd have missed it.
Always keep a cloud fallback. Local models can be unavailable. Every stage in our config has a cheap cloud fallback. It triggers ~0.4% of the time. Worth it.
Make the trace visible. Our "Why this answer?" panel shows admins which model ran each stage. This is what convinced our team that local models on the lower stages were genuinely doing the job.
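A per-stage audit can be as simple as replaying the same inputs through the candidate model and a reference model, then scoring agreement stage by stage. The sketch below assumes outputs have already been normalised so exact comparison is meaningful (for example, entity lists sorted and lower-cased); the function names and the 0.95 threshold are illustrative.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def per_stage_agreement(
    records: Iterable[tuple[str, Hashable, Hashable]],
) -> dict[str, float]:
    """records: (stage, candidate_output, reference_output) triples.
    Returns the fraction of matching outputs for each stage."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for stage, candidate, reference in records:
        totals[stage] += 1
        hits[stage] += int(candidate == reference)
    return {stage: hits[stage] / totals[stage] for stage in totals}


def flag_regressions(agreement: dict[str, float], threshold: float = 0.95) -> list[str]:
    """Stages whose agreement with the reference falls below the threshold."""
    return sorted(stage for stage, score in agreement.items() if score < threshold)
```

A stage can regress like this while the end-to-end number barely moves, which is why the per-stage view is worth the extra bookkeeping.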

Try this yourself

If you're running RAG with one model end-to-end, your bill is probably 3–10× what it could be without sacrificing quality. The lift to fix it is a few days, not a quarter. Start by writing the table above for your own pipeline: what each stage actually needs.

If you'd rather not build the dispatcher yourself, that's the product we ship. Pipeline stages are configurable from an admin UI, with both local and cloud models as options per stage.

Want to see this running on your documents?

A 20-minute walkthrough on your own corpus tells you more than any benchmark.
