Why we ship Self-RAG retries — and what happened when we didn't
A measurement bug that lived in our streaming pipeline for months, what users actually saw, and how we fixed it without sacrificing latency.
There's a phrase that gets thrown around in RAG circles: "Self-RAG." It comes from a paper that, distilled, says this: after your model generates an answer, ask a second model whether that answer is actually supported by the context you retrieved. If it isn't — regenerate.
We built that. Or so we thought.
For months our trace UI faithfully recorded a grounded: true | false verdict for every answer. The trace looked correct. The verdict was correct. There was just one problem: on the streaming code path, nothing actually used the verdict. Self-RAG was running. It was telling the truth. And we were ignoring it.
This is the story of how that happened, what users were actually seeing, and how we fixed it without making the chat feel sluggish.
What Self-RAG is supposed to do
A RAG pipeline retrieves chunks of text, hands them to a generation model with the user's question, and gets back an answer. That answer is supposed to be grounded in the retrieved context. Sometimes it isn't. Sometimes the model:
- Paraphrases generously enough that a specific number drops out
- Reorganises bullet points into its own structure, dropping nuance in the process
- Invents a detail that "feels right" but isn't in the source
- Hedges so much that the actual answer in the context gets buried
Self-RAG catches this. The mechanic is straightforward:
async def is_grounded(answer: str, context: str) -> bool:
response = await llm.complete(
stage="self_rag",
system='You verify whether an AI response is factually supported by the provided context. Respond ONLY with JSON: {"grounded": true} or {"grounded": false}.',
user=f"Context:\n{context}\n\nResponse:\n{answer}",
)
return json.loads(response).get("grounded", True)It costs one extra LLM call per question. A small one — usually a local 3B model. ~300 ms.
The interesting question isn't how to detect it. It's what to do when the verifier says false.
Two code paths, one decision, one mistake
Our pipeline has both a non-streaming entry point (used by internal jobs and the API) and a streaming entry point (used by the chat UI — the one users actually see).
The non-streaming path was the original implementation. It did this:
for attempt in range(2):
extra = "" if attempt == 0 else "Only use information explicitly stated in the context."
answer = await generate(query, context, history, extra)
if route != "vector" or await is_grounded(answer, context):
breakGenerate. If not grounded, regenerate with a stricter instruction. Up to two attempts. Ship whichever is grounded; ship the second attempt if neither is. Simple, correct.
Months later, when we added streaming, we did this:
collected_tokens = []
async for token in generate_stream(query, context, history):
collected_tokens.append(token)
full_response = "".join(collected_tokens)
# Self-RAG grounding check (best-effort)
try:
is_grounded = await is_grounded_check(full_response, context)
except Exception:
pass
# Re-chunk and stream tokens
for chunk in rechunk(full_response):
yield {"type": "token", "content": chunk}
yield {"type": "trace", "is_grounded": is_grounded, ...}Read what happens after is_grounded = await .... The variable gets assigned. Then we stream the response. Then we emit a trace event that includes the verdict.
There is no if not is_grounded: retry. The check runs. The result is recorded. Nothing acts on it.
The non-streaming path checked grounding and re-generated. The streaming path checked grounding and recorded it in the trace. We thought we'd ported the feature. We'd ported half of it — the measurement, not the action. The chat UI shipped un-grounded answers and quietly noted, in a trace panel most users never opened, that the answer wasn't grounded.
How we found out
A user asked, in the actual product, "What are the key hardware features for small offices?" against a video-conferencing comparison document. The answer they got was a smoothly written summary — well-structured headings, bulleted product names, plausible recommendations. The answer the document actually contained was very different: specific FoV degrees, specific resolutions, specific microphone array features, a named "AI Features" section.
The model had taken a verbatim-quotable passage and re-bullet-pointed it into its own structure. Product names survived (those are anchored words the model can't invent). Numbers dropped out. The "AI Features" section disappeared entirely because it didn't fit the new structure.
The user opened the trace panel. The trace said grounded: false. The user — sensibly — replied, "why was this answer shipped at all?"
That's how we found out. Not from a metric, not from an alert, not from a regression test. From a customer reading the trace we'd built and noticing it was lying about doing something.
The fix
The shape of the fix is exactly what the non-streaming path always did, ported to streaming:
async def stream_once(extra_instruction: str) -> str:
buf = []
async for tok in generate_stream(query, context, history, extra_instruction):
buf.append(tok)
return "".join(buf)
# Attempt 1
full_response = await stream_once(extra_instruction)
is_grounded = await is_grounded_check(full_response, context)
# Attempt 2 — only if first attempt wasn't grounded on the vector route
low_confidence = False
if route == "vector" and is_grounded is False:
strict = extra_instruction + (
"\n\nIMPORTANT: Use ONLY information explicitly stated in the context. "
"Prefer quoting the source's wording over paraphrasing."
)
full_response = await stream_once(strict)
is_grounded = await is_grounded_check(full_response, context)
if is_grounded is False:
low_confidence = True
# Re-chunk and stream the final response
for chunk in rechunk(full_response):
yield {"type": "token", "content": chunk}
# If both attempts failed grounding, flag the message so the UI can badge it
if low_confidence:
yield {"type": "low_confidence", "reason": "self_rag_ungrounded", "attempts": 2}Two changes that matter beyond the obvious:
1. Strict-mode prompt on retry, not on attempt 1. The strict instruction ("prefer quoting the source's wording over paraphrasing") hurts answer fluency on questions the model would have got right anyway. It's a useful nudge when grounding has already failed. It's not a useful default.
2. Don't refuse on second failure — flag. If attempt 2 still isn't grounded, we ship it anyway, but the chat UI shows a small amber pill:
⚠ Low confidence: the grounding check couldn't verify this answer against the retrieved sources (Self-RAG, 2 attempts). Double-check the cited passages.
Refusing to give an answer on legitimate questions is worse UX than giving an imperfect one with a clear warning. The pill exists for the cases where the model genuinely can't extract a clean answer from what was retrieved — usually a sign the right chunks aren't in the context, not that the model is hallucinating.
What it cost us in latency
The honest answer: a noticeable amount when the retry fires, nothing the rest of the time.
A normal answer: stream as before, plus one ~300 ms Self-RAG check at the end. Imperceptible.
A retry-needed answer: full first attempt (~2–4 s), Self-RAG check (~300 ms), full second attempt (~2–4 s), second Self-RAG check (~300 ms). So an extra 2.5–4 seconds of latency, ~once in every ~15 chats based on our production traffic. We considered streaming the first attempt to the client immediately and then "correcting" if it failed grounding — but reading half a wrong answer and then watching it get replaced is a worse experience than waiting an extra few seconds for the right one. So we buffer.
The fix moved Self-RAG from "measurement" to "action" — and the trace finally means what the panel said it meant.
What we'd do differently
- Treat the trace UI as a contract. If
grounded: falseis recorded, the system should have done something about it. A trace event that says "not grounded" while the answer streams unchanged is worse than no trace at all — it implies a safeguard exists when it doesn't. - Test the streaming path explicitly. We had unit tests on the non-streaming path's retry loop. We had no test asserting that the streaming path's behaviour differed when
is_groundedflipped to false. The bug shipped because nothing failed when the behaviour was wrong. - Show users the confidence signal. The low-confidence badge could have been invisible (logged only for ops review). We surfaced it because, in regulated industries, users want to know when the system isn't sure. Hiding it would have been the safer-feeling but worse choice.
Why this matters
If you're building RAG and you've added a Self-RAG-style verifier, ask yourself: when the verifier says false, does anything change in the answer the user sees? If the answer is "no, we just record it," you have the bug we had. The fix is straightforward. The hard part is realising the trace and the behaviour aren't the same thing.
Trace events are evidence. Retries are action. They are not interchangeable.
Curious how Kernel handles edge cases like this in production?
A 20-minute walkthrough on your own corpus tells you more than any benchmark.
Let's talk