AI Architecture

The Fidelity Judge: How We Built an LLM to Grade Its Own Content Against the Source


AI hallucination is not a model quality problem. It is a verification gap. Here's the fidelity judge Bloomberry runs after every URL-based generation: an LLM that checks whether the output is actually about the source.

By Sadok Hasan


The generation model does not know when it is hallucinating.

That sounds obvious when you say it. But the practical implication is more important than most people realize: you cannot fix hallucination by telling the model to be more careful. The model cannot introspect on its own accuracy. It is optimized to produce fluent, coherent, appropriate-sounding text, not to track whether each claim can be traced back to the source material.

This is why Bloomberry runs a separate verification step after every URL-based generation: a fidelity judge. A second LLM call whose only job is to score the output against the source and identify content that could not have come from the source material.

The hallucination problem is structural

When you give a model a URL and ask it to write a LinkedIn post about the article, the model faces a challenging task with incomplete scaffolding.

It has to: read the source, identify what is specific versus general, decide which specific details to highlight, structure those details into a compelling post, match your voice, and format appropriately for the platform. All in one call.

Hallucination fills that gap with inference: it is what the model supplies when the source material is ambiguous, when specific details are missing, or when the general topic pattern from training data is more available to the model than the specific details in the source.

A post about a company launching a product gets generalized to "this kind of company launching this kind of product," because that pattern is heavily represented in the model's training data. The specific details (the real company name, the real location, the exact product, the specific story behind it) get replaced by plausible-sounding generics.

The model does not know this is happening. It is generating the most probable continuation of the prompt.

What the fidelity judge does

After generation, Bloomberry runs a second LLM call, the fidelity judge, with a specific scoring task.

The judge receives the source content and the generated post. It has no context about the original prompt, the user's voice profile, or the platform. Its only task is to answer three questions:

1. How many specific claims in this post can be directly verified against the source?

The judge looks for named entities (people, places, products, organizations, specific numbers, dates, and events) that appear in both the source and the generated post. Each verified named entity is a positive fidelity signal.

2. How many claims in this post cannot be verified against the source?

Claims that are topically appropriate but not sourced to the article (industry generalizations, generic startup-growth patterns, unnamed statistics) are flagged as unverifiable. The judge does not call these hallucinations outright; some are valid context. But a post that contains zero verified named entities and many unverifiable claims is a strong hallucination signal.

3. What is the fidelity score?

A 0-100 score that represents the ratio of verified-specific to total-specific claims in the post. A score below a configurable threshold triggers a regeneration rather than showing the user the low-fidelity output.
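The scoring arithmetic behind those three questions is simple enough to sketch. The following is a minimal illustration of the ratio-plus-threshold logic described above; the field names and the 70-point threshold are assumptions for the example, not Bloomberry's actual values:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    verified_claims: list       # specific claims found in the source (question 1)
    unverifiable_claims: list   # specific claims with no source support (question 2)

    @property
    def fidelity_score(self) -> int:
        """0-100: ratio of verified-specific to total-specific claims (question 3)."""
        total = len(self.verified_claims) + len(self.unverifiable_claims)
        if total == 0:
            return 0  # a post with no specific claims at all is itself suspect
        return round(100 * len(self.verified_claims) / total)

def needs_regeneration(verdict: JudgeVerdict, threshold: int = 70) -> bool:
    """Gate the output: below-threshold posts do not reach the user directly."""
    return verdict.fidelity_score < threshold
```

Treating an all-generic post (zero specific claims of any kind) as score 0 matches the signal described above: no verified named entities plus unverifiable filler is the strongest hallucination indicator.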

The architecture: judge vs. generator

The key architectural decision is using a separate model call rather than asking the generating model to verify its own output.

This is important for two reasons.

The generator is optimized for the wrong thing. A model that just produced a piece of content has committed to it. If you ask it to verify whether that content is accurate, it will tend to find it accurate, because the output already looked right to the model at the moment it generated it. The verification and the generation are not independent.

Constraints work differently in verification mode. A judge running in verification mode can be given explicit, constrained instructions: "return only JSON, with exactly these fields, each field containing only what can be directly sourced from the provided text." A constrained extraction task is a fundamentally different mode of operation than open-ended generation. Using a separate call lets you enforce those constraints without interfering with the generation.
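As a concrete sketch of that constrained mode: the judge call can demand strict JSON and then enforce the schema on parse. Here `call_llm` is a hypothetical stand-in for whatever model client the pipeline uses, and the prompt wording is illustrative, not Bloomberry's actual prompt:

```python
import json

JUDGE_PROMPT = """You are a verification judge. Compare the POST to the SOURCE.
Return ONLY JSON with exactly these fields:
  "verified_claims": specific claims in the post directly supported by the source
  "unverifiable_claims": specific claims with no support in the source
Each field may contain only what can be directly sourced from the provided text.

SOURCE:
{source}

POST:
{post}"""

def judge(source: str, post: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(source=source, post=post))
    verdict = json.loads(raw)  # fail loudly if the judge broke the format
    # Enforce the schema: exactly the fields we asked for, nothing extra.
    if set(verdict) != {"verified_claims", "unverifiable_claims"}:
        raise ValueError(f"judge returned unexpected fields: {set(verdict)}")
    return verdict
```

Failing loudly on malformed output is part of the point: a verification layer that silently accepts a broken verdict is worse than no layer at all.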

This is the same logic behind the Haiku pre-pass architecture: extraction and generation are different tasks that benefit from different model configurations and different prompting strategies. You do not use a generation-optimized model for extraction, and you do not use a generation-optimized call for verification.

What this catches β€” and what it misses

The fidelity judge is effective at catching hallucinations that replace specific sourced facts with generic pattern-matching.

It catches: named entity substitution (real company replaced with "a leading AI company"), fabricated statistics ("the company has grown 300%"), generalized location references ("a Bay Area startup" for a company the article places in a specific city), and topically accurate but unsourced framing ("riding the wave of enterprise AI adoption").

It does not catch: accurate claims that happen to be generalizations the model correctly inferred from the source, or cases where the source itself is vague and the model extrapolated from that vagueness.

The judge is not a perfect hallucination detector. It is a practical filter. A post that scores below the fidelity threshold is regenerated β€” typically once, with the judge's failure analysis fed back as negative constraints on the second generation. The practical effect is that low-fidelity outputs do not reach users without at least one retry pass.
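Wired together, that retry pass can look like the following sketch, where `generate`, `judge`, and `score` are hypothetical stand-ins for the pipeline's actual calls; the single-retry default mirrors the "typically once" behavior:

```python
def generate_with_fidelity(prompt, source, generate, judge, score,
                           threshold=70, max_retries=1):
    """Generate, judge, and retry with the judge's findings as constraints."""
    post = generate(prompt)
    for _ in range(max_retries):
        verdict = judge(source, post)
        if score(verdict) >= threshold:
            break  # fidelity is acceptable; ship it
        # Feed the judge's failure analysis back as negative constraints.
        constraints = ("Do NOT include these unsupported claims: "
                       + "; ".join(verdict["unverifiable_claims"]))
        post = generate(prompt + "\n\n" + constraints)
    return post
```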

The broader lesson: verification requires a different call

The pattern generalizes. If your AI pipeline needs to verify something (factual accuracy, instruction compliance, style match, sentiment classification), the verification should be a separate call with a narrowly scoped instruction, not a question you add to the generation prompt.

The generation model is trying to write something good. The judge model is trying to catch where it fell short. These are adversarial tasks, and they are better served by separate calls than by asking the same call to do both.

Bloomberry runs three verification layers: the Haiku pre-pass before generation (extraction and outline), the fidelity judge after generation (accuracy scoring), and a structure check against the platform format requirements (LinkedIn post length, Twitter character limits, thread coherence). Each is a separate call with a specific scope.
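The third layer is the cheapest to illustrate. A per-platform length gate might look like this; Twitter's 280-character limit is stated above, while the LinkedIn bound here is an assumption for the example:

```python
# linkedin bound is assumed for illustration; twitter's is from the text above
PLATFORM_LIMITS = {"twitter": 280, "linkedin": 3000}

def structure_check(post: str, platform: str) -> bool:
    """Layer 3: reject output that breaks the platform's length rules."""
    return len(post) <= PLATFORM_LIMITS[platform]
```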

None of them require a frontier model. Fast, cheap, constrained models running verification tasks at scale are more reliable than expensive models running combined generation-and-verification in a single overloaded call.

What it means for users

You should not have to think about any of this when you generate a post.

The architecture exists so that the output you see has already passed through a verification layer. If the post about your company's launch mentions the actual city, the actual product name, and the actual story, that is not because the model guessed right. It is because the pipeline verified that those specific elements made it into the output before it reached you.

Try Bloomberry on any URL and the fidelity pipeline runs automatically. You will not see a score or a verification status; you will just see a post that is actually about the article you gave it.


Related: How we use Claude Haiku to stop Gemini from hallucinating before it starts · How AI learns your writing voice · The AI Dialects research
