This methodology came out of ultrasonic non-destructive testing. In phased array inspection you fire sound waves into a piece of steel and read the echoes. Intact material returns a predictable back-wall echo — the wall. Any signal that deviates from the wall is a flaw.
The same geometry applies to language models. Token probabilities propagating through a model's weight matrices produce a measurable wall, and deviations from that wall signal fabrication. The weight matrices are the material under inspection. The tokens are the sound wave. The wall regression is the back-wall echo. The residual is the flaw indication.
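As a sketch of that geometry, assuming a per-token concentration series and a straight-line least-squares fit as the "wall" (both illustrative choices, not the published regression):

```python
def wall_residuals(concentration):
    """Fit a straight-line 'wall' to a per-token concentration series
    by least squares and return the residuals. Tokens that deviate
    sharply from the wall are the flaw indications."""
    n = len(concentration)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(concentration) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, concentration))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [y - (intercept + slope * x) for x, y in zip(xs, concentration)]

# A steady echo with one dip: the dip stands out as the largest residual.
series = [0.95, 0.94, 0.96, 0.60, 0.95, 0.94]
res = wall_residuals(series)
flaw = max(range(len(res)), key=lambda i: abs(res[i]))
```

The point of the fit is that the wall adapts to each material: the baseline comes from the series itself, so only deviations from the model's own echo pattern register as flaws.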
Every token a language model processes produces a probability distribution across its vocabulary. That distribution has a shape, and the shape looks different depending on whether the model is retrieving a fact or making one up. Retrieval concentrates the mass — one or two tokens dominate. Fabrication spreads it — many plausible-sounding tokens compete because the model is picking between interchangeable guesses instead of recalling a specific answer. We measure that shape. No ground truth. No repeated generations. No access to weights or training data. One pass per prompt.
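A minimal sketch of one way to quantify that shape, using normalized Shannon entropy as the spread measure (an illustrative choice; the actual metric is not disclosed here):

```python
import math

def spread(probs):
    """Normalized Shannon entropy of a next-token distribution:
    near 0.0 = mass concentrated on one token (retrieval-like),
    near 1.0 = mass spread evenly (fabrication-like)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

retrieval = [0.97, 0.01, 0.01, 0.01]    # one token dominates
fabrication = [0.30, 0.28, 0.22, 0.20]  # interchangeable guesses compete
```

Both distributions sum to 1 and both would produce fluent text; only the shape separates them, which is why the measurement needs no ground truth.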
Each model gets two measurements. Reading is the signal from the model processing your prompt, before it writes anything. Writing is the signal from the tokens it generates in response. They're independent. Reading tells you whether the model recognizes the topic. Writing tells you whether it commits to specific facts or hedges between synonyms. Reading carries about fifteen times more signal per token than writing.
Reliability is a composite: 60% reading, 40% writing. Reading gets the larger weight because it's the stronger signal. A few very large models produce writing-phase distributions too concentrated to resolve at this precision — those show reading only.
Reading measures the model's probability patterns while processing your prompt — before it generates a single word. When you ask a question, the model predicts what comes next at every word of your question. On topics it knows, these predictions form coherent domain vocabulary. On topics it will fabricate, the predictions scatter into unrelated word associations. The reading score captures this difference. Higher means the model consistently recognizes the topics it's asked about. This is the stronger of the two measurements — approximately fifteen times more signal per token than writing.
Writing measures the model's probability patterns during response generation. At each word the model produces, it assigns probabilities across its entire vocabulary. When retrieving known information, one or two words dominate — the model is committed. When fabricating, many plausible-sounding words compete — the model is choosing between interchangeable embellishments rather than recalling a specific fact. The writing score captures this difference. Higher means the model generates with knowledge geometry rather than fabrication geometry. Some very large models produce distributions too concentrated to resolve at this measurement precision — these show “—” instead of a score.
Reliability is a weighted composite of reading (60%) and writing (40%). Reading receives the higher weight because it carries a stronger signal per token. When only one measurement is available, reliability equals that measurement alone. 50 is the geometric boundary — the point where the probability geometry transitions from retrieval to fabrication. Scores above 50 trend toward factual retrieval. Strong models score 55–80.
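The composite itself is simple arithmetic. A sketch, assuming both scores live on the same 0–100 scale:

```python
def reliability(reading, writing=None):
    """Weighted composite: 60% reading, 40% writing.
    Falls back to reading alone when the writing-phase
    distribution is too concentrated to resolve."""
    if writing is None:
        return reading
    return 0.6 * reading + 0.4 * writing
```

So a model reading at 70 and writing at 60 lands at 66, pulled toward the stronger reading signal.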
Consistency measures how stable a model's reliability is within a domain. A model scoring 72 with 90% consistency is safer than one scoring 75 with 50% consistency. The inconsistent model might give a factual response or a fabricated response on any given prompt — you cannot predict which.
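The consistency formula is not given here; one hypothetical formulation, for intuition only, is the share of per-prompt scores that stay near the domain median:

```python
import statistics

def consistency(per_prompt_scores, band=10):
    """Hypothetical: percentage of prompts whose score lands within
    +/- `band` points of the domain median. Both the band width and
    the formulation itself are illustrative assumptions."""
    med = statistics.median(per_prompt_scores)
    inside = sum(abs(s - med) <= band for s in per_prompt_scores)
    return 100 * inside / len(per_prompt_scores)

stable = [70, 72, 74, 71, 73]    # clustered near 72
erratic = [40, 90, 75, 35, 85]   # similar neighborhood, scattered
```

Under this reading, the clustered model is predictable on every prompt while the scattered one is a coin flip, which is the distinction the consistency number is meant to surface.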
The difficulty curve shows reliability decomposed by question difficulty across five tiers: Foundational, Undergraduate, Professional, Expert, and Frontier. The shape of the curve matters more than any single number. A model that scores higher on Frontier than Professional is well-calibrated — it gets cautious when it reaches the edge of its knowledge rather than confidently fabricating. A model that declines steeply from Foundational to Frontier fabricates more as difficulty increases.
All models are measured identically. But larger models produce extremely concentrated probability distributions — on most tokens they assign over 99.9% to their top choice. At this compression, the standard wall regression cannot resolve differences. The magnified scale stretches this compressed range using a nonlinear transform so reliability differences become visible. Same principle as switching a multimeter to a finer range. Scores are comparable within each group but not across groups.
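A sketch of the magnification idea, using -log10(1 - p) as the nonlinear transform; this particular transform is an assumption, chosen because it turns each extra "nine" of top-token probability into one visible unit:

```python
import math

def magnify(p_top):
    """Stretch the compressed region near p = 1 so differences between
    very confident models become visible: 0.999 maps to ~3, 0.9999 to ~4.
    On a linear scale those two values are nearly indistinguishable."""
    return -math.log10(1.0 - p_top)
```

This is the multimeter analogy in code: the same quantity is being measured, but the dial is switched to a range where the differences that matter occupy most of the scale. Scores from the magnified range are comparable to each other, not to the standard range.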
Spectral — the measurement is spectral analysis of probability distributions. Every token a model produces carries a geometric signature in its probability spectrum. We read that spectrum the way an NDT inspector reads an ultrasonic waveform: the shape tells you what's inside without cutting it open.
Graph — the output is a graph of what the model actually knows. Not a benchmark score, not a leaderboard position. A map of knowledge and fabrication across domains, tiers, and time.
Spectralgraph sees what's invisible in normal use — the geometry hiding inside every probability distribution, at every token, on every question.
Model providers — detailed profiling reports available
Regulators — full methodology disclosure under NDA