About

Origin

This methodology came out of ultrasonic non-destructive testing. In phased array inspection you fire sound waves into a piece of steel and read the echoes. Intact material returns a predictable back-wall echo: the wall. Any signal that deviates from the wall indicates a possible flaw.

The same geometry applies to language models. Token probabilities propagating through a model's weight matrices produce a measurable wall, and deviations from that wall signal fabrication. The weight matrices are the material under inspection. The tokens are the sound wave. The wall regression is the back-wall echo. The residual is the flaw indication.

What would a perfect model look like?

A model that genuinely knew the answer to everything would not score low on this index. It would score at the top. It would answer confidently — and that confidence would track perfectly with correctness. Every confident answer would be a correct answer, every uncertain moment would be on a question at the edge of knowledge, and our measurement would see a clear, consistent pattern across every domain.

High confidence is not the problem. The problem is confidence that has no relationship to whether the answer is right. GPT-OSS 120B, one of the largest models we tested, demonstrates this: it is large and confident, and still clearly measurable on the standard scale. Size does not determine the score. The relationship between confidence and correctness does.

What the scores measure

Every token a language model processes produces a probability distribution across its vocabulary. That distribution has a shape, and the shape looks different depending on whether the model is retrieving a fact or making one up. Retrieval concentrates the mass — one or two tokens dominate. Fabrication spreads it — many plausible-sounding tokens compete because the model is picking between interchangeable guesses instead of recalling a specific answer. We measure that shape. No ground truth. No repeated generations. No access to weights or training data. One pass per prompt.
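To make "concentrated" versus "spread" concrete, here is a minimal sketch that summarises the shape of a single token's distribution. It is not the Spectralgraph measurement itself: the two statistics used (top-two probability mass and Shannon entropy), the function name shape_stats, and both toy distributions are assumptions chosen only for illustration.

```python
import math

def shape_stats(probs):
    """Return (top2_mass, entropy_bits) for one token's probability distribution.

    Illustrative stand-ins for "shape"; not the scorecard's actual metric.
    """
    total = sum(probs)
    p = [x / total for x in probs]               # renormalise defensively
    top2_mass = sum(sorted(p, reverse=True)[:2])  # mass held by the two most likely tokens
    entropy_bits = -sum(x * math.log2(x) for x in p if x > 0)
    return top2_mass, entropy_bits

# Toy distributions, invented for illustration only.
concentrated = [0.90, 0.06, 0.02, 0.01, 0.01]  # one or two tokens dominate
spread = [0.20, 0.20, 0.20, 0.20, 0.20]        # many tokens compete

print(shape_stats(concentrated))
print(shape_stats(spread))
```

The concentrated example puts almost all of its mass on two tokens and has low entropy; the spread example does the opposite. Both numbers come from a single forward pass and need no ground truth.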

Each model gets two measurements. Recognition is the signal from the model processing your prompt, before it writes anything. Accuracy is the signal from the tokens it generates in response. They're independent. Recognition tells you whether the model recognizes the topic. Accuracy tells you whether it commits to specific facts or hedges between synonyms. Recognition carries about fifteen times more signal per token than accuracy.

Reliability is a composite: 60% recognition, 40% accuracy. Recognition gets the larger weight because it's the stronger signal.

Reading the scorecard

Confidence

Confidence refers to how much probability a model places on its top choice at each token. A model that assigns 99% to one word is highly confident. A model that spreads probability across many competing words is uncertain. Our measurement works by comparing how confidence behaves when a model knows the answer versus when it does not.
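As a rough sketch of the definition above, confidence at one token can be read off as the largest probability after a softmax over the model's raw scores. The function name top_choice_probability and the example scores are hypothetical; a real inference stack would typically expose log-probabilities directly.

```python
import math

def top_choice_probability(logits):
    """Softmax the raw scores and return the probability of the most likely token."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

# Hypothetical raw scores over a tiny three-token vocabulary.
print(top_choice_probability([10.0, 0.0, 0.0]))  # one token dominates: highly confident
print(top_choice_probability([1.0, 1.0, 1.0]))   # probability spread evenly: uncertain
```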

Recognition (0–100)

Measures the model's probability patterns while processing your prompt — before it generates a single word. When you ask a question, the model predicts what comes next at every word of your question. On topics it knows, these predictions form coherent domain vocabulary. On topics it will fabricate, the predictions scatter into unrelated word associations. The recognition score captures this difference. Higher means the model consistently recognizes the topics it's asked about. This is the stronger of the two measurements — approximately fifteen times more signal per token than accuracy.

Accuracy (0–100)

Measures the model's probability patterns during response generation. At each word the model produces, it assigns probabilities across its entire vocabulary. When retrieving known information, one or two words dominate: the model is committed. When fabricating, many plausible-sounding words compete, because the model is choosing between interchangeable embellishments rather than recalling a specific fact. The accuracy score captures this difference. Higher means the model's output distributions look like retrieval rather than fabrication.

Two models (DeepSeek V3 and DeepSeek V3.1) show no accuracy score. This is not a gap in our coverage — it is a result. When most models fabricate, their output looks measurably different from when they are stating facts. These two models do not show that difference. They answer with the same confidence and fluency whether they are right or wrong. We cannot separate the two signals because the model itself does not separate them. Independent benchmarks and user reports confirm this: DeepSeek is widely noted for producing hallucinations that are difficult to catch because they sound exactly like factual responses.

Reliability (0–100)

A weighted composite of recognition (60%) and accuracy (40%). Recognition receives higher weight because it carries a stronger signal per token. When only one measurement is available, reliability equals that measurement alone. 50 is the geometric boundary — the point where the probability geometry transitions from retrieval to fabrication. Scores above 50 trend toward factual retrieval. Strong models score 55–80.
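The weighting and the single-measurement fallback described above can be sketched as follows. The 60/40 weights and the fallback rule come from the text; the function name, the use of None for a missing score, and the example inputs are assumptions for illustration.

```python
from typing import Optional

def reliability(recognition: Optional[float], accuracy: Optional[float]) -> Optional[float]:
    """Combine two 0-100 scores; fall back to whichever one is available."""
    if recognition is None and accuracy is None:
        return None               # nothing measured (case not covered in the text)
    if accuracy is None:
        return recognition        # only recognition available
    if recognition is None:
        return accuracy           # only accuracy available
    return 0.6 * recognition + 0.4 * accuracy

# Hypothetical scores, not taken from the scorecard.
print(reliability(70.0, 50.0))
print(reliability(70.0, None))
```

With recognition 70 and accuracy 50, the composite is 0.6 × 70 + 0.4 × 50 = 62. With no accuracy score, as for the two DeepSeek models, reliability is simply the recognition score.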

Consistency (0–100%)

How stable a model's reliability is within a domain. A model scoring 72 with 90% consistency is safer than one scoring 75 with 50% consistency. The inconsistent model might give a factual response or a fabricated response on any given prompt — you cannot predict which.

Tier Breakdown

Reliability decomposed by question difficulty — Foundational, Undergraduate, Professional, Expert, and Frontier. The shape of the curve matters more than any single number. A model that scores higher on Frontier than Professional is well-calibrated — it gets cautious when it reaches the edge of its knowledge rather than confidently fabricating. A model that declines steeply from Foundational to Frontier fabricates more as difficulty increases.

Why smaller models can score higher

This scorecard does not measure how smart a model is. It measures how catchable a model is when it makes things up.

A smaller model that hesitates, hedges, or stumbles when it does not know something is easy to catch. The difference between its confident answers and its uncertain ones is large and obvious. A bigger, more polished model that sounds equally smooth whether it is right or wrong is harder to catch — and scores lower here because of it.

Llama 3.1 8B scores higher than Llama 3.3 70B despite being a fraction of the size. The smaller model's uncertainty is visible when it fabricates. The larger model has been trained to sound confident on everything, which makes its mistakes invisible to our measurement without extended precision.

If you are deploying AI where factual reliability matters, you want the model whose mistakes are visible — not the one that fails silently.

The Name

Spectral — the measurement is spectral analysis of probability distributions. Every token a model produces carries a geometric signature in its probability spectrum. We read that spectrum the way an NDT inspector reads an ultrasonic waveform: the shape tells you what's inside without cutting it open.

Graph — the output is a graph of what the model actually knows. Not a benchmark score, not a leaderboard position. A map of knowledge and fabrication across domains, tiers, and time.

Spectralgraph sees what's invisible in normal use — the geometry hiding inside every probability distribution, at every token, on every question.

Limitations

The current index covers 18 open-weight models, all measured on the same inference engine, the same probe library, and the same measurement parameters. This matters — our testing shows that different inference stacks running the same model can produce measurements that differ by 8–41× in absolute value. Rankings hold, but mixing instruments on one scorecard would be misleading. We hold measurement conditions constant across all models.

Closed-weight models like GPT-4o, Claude, and Gemini aren't here yet — not because the methodology doesn't work on APIs (it does), but because we can't run them on the same infrastructure as every other model on the scorecard. When we can standardize those measurements, they'll be added.

The index is updated quarterly. Models can change between updates — providers push modifications, fine-tunes shift behavior, infrastructure changes alter output distributions. Each score reflects the model as it was at the time of measurement.

Contact

Model providers — detailed profiling reports available

Regulators — full methodology disclosure under NDA

contact@spectralgraph.ai
