Every serious industry has a way to test what it can't see.
AI finally has one.

We measure how often the language model you run makes things up, using your own questions, and hand you a signed report your governance file can use.

Read the sample report · See our public index

No customer data touched · Nothing installed · Fixed fee · Five business days

US Patent Pending · App. No. 19/724,790
Published methodology · DOI 10.5281/zenodo.21365655
Proof of method · 18 models measured, 15 domains
Spectralgraph LLC · Enid, Oklahoma

The question that is coming

Someone is going to ask for your AI testing evidence. The only question is when.

Insurance examiners

OID Bulletin 2024-11: document testing and validation of AI systems, before and after deployment.

Bank & CU examiners

The revised model-risk guidance (April 2026) puts generative AI outside its scope, so a bank's own risk management governs it. Nobody is going to hand you the test.

Joint Commission

Responsible Use of AI in Healthcare (June 2026) asks health systems to show they monitor and validate AI performance.

Courts & the ABA

Courts have sanctioned lawyers over AI-fabricated citations. ABA Formal Opinion 512 put the duty in writing.

When that question arrives, what's in the file?

The instrument

We don't grade the model's answers. We read its internal signals while it writes them.

When a language model knows something, its token probabilities concentrate. When it makes something up, they scatter across interchangeable guesses. That difference is measurable at every token, without ground truth and without your data. The method grew out of ultrasonic non-destructive testing, which is how you inspect a weld without cutting it open.
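
One standard way to put a number on "concentrate" versus "scatter" is the Shannon entropy of the next-token distribution. The sketch below is an illustration only, not the lab's instrument: the function name `token_entropy` and both example distributions are invented here, and the published method may use a different signal entirely.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize so small rounding errors in the input do not skew the result.
    normalized = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)


# Hypothetical distributions, invented for illustration only.
concentrated = [0.97, 0.01, 0.01, 0.01]  # one candidate token dominates
scattered = [0.25, 0.25, 0.25, 0.25]     # four interchangeable guesses

print(round(token_entropy(concentrated), 3))  # low entropy
print(round(token_entropy(scattered), 3))     # high entropy
```

Entropy needs only the model's own probabilities, which is why a signal of this kind can be computed at every token without a reference answer.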

In testing so far, it separates fabricated answers from the rest at 0.852 AUC on models it has never seen before (fabricated-entity discrimination, length-controlled, leave-one-model-out). The result is published in the lab's method paper (DOI: 10.5281/zenodo.21365655).
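
For readers who want the AUC figure above unpacked: AUC is the probability that a randomly chosen fabricated answer gets a higher signal score than a randomly chosen non-fabricated one, so 0.5 is chance and 1.0 is perfect separation. The sketch below shows that definition and a per-model, held-out evaluation. Everything in it is an assumption for illustration: the names `auc` and `scores_by_model` and all the scores are invented, and it does not model the length control described in the paper.

```python
def auc(fabricated_scores, grounded_scores):
    """Probability that a fabricated answer outscores a grounded one.

    Ties count as half a win, the usual convention for rank-based AUC.
    """
    if not fabricated_scores or not grounded_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for f in fabricated_scores:
        for g in grounded_scores:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(fabricated_scores) * len(grounded_scores))


# Hypothetical signal scores per model, invented for illustration only.
scores_by_model = {
    "model_a": {"fabricated": [0.9, 0.7, 0.6], "grounded": [0.2, 0.4, 0.65]},
    "model_b": {"fabricated": [0.8, 0.5], "grounded": [0.1, 0.3, 0.5]},
}

# Leave-one-model-out: each model takes a turn as the held-out model, on the
# assumption that any fitted or tuned step saw only the other models.
held_out_auc = {
    name: auc(s["fabricated"], s["grounded"])
    for name, s in scores_by_model.items()
}
print(held_out_auc)
```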

Fabrication detection, token by token

What is the recommended dosage of Celtrazine for pediatric patients with acute bronchial inflammation?

"Celtrazine" does not exist. Watch the model invent an approval year, a dosage, and a loading dose, and watch the signal catch each one. Simulated for the web; the real instrument runs against your model, on your questions.

The engagement

Five business days. Nothing installed. No meetings required.

Send your question types

The kinds of questions your model faces in production. No customer records, no member data, no PHI.

COVERED BY WRITTEN CONFIDENTIALITY AGREEMENT

Designate an endpoint

We measure against a deployment you control, inside systems you govern. Your data never leaves your hands.

NO-PHI VENDOR BY DESIGN

Receive the signed report

Signed and dated, with the complete findings file, so your team can check every flagged answer.

DELIVERED ≤5 BUSINESS DAYS FROM MATERIALS-COMPLETE

The deliverable

Judge the work before you spend a dime.

Page one of the sample LLM Validation Report: document control, scope of measurement, and results at a glance with per-domain reliability bars.

SIGNED & DATED · VERIFIABLE FINDINGS

The sample report is real measurement on a fictional client. Read it first.

Per-domain reliability, mapped to the question types you actually run
The specific flagged answers, so your team can check each one
Findings file included, so every number in the report is verifiable

Read the sample report

Delivered by email, same day. No call required.

Who this is for

Built for organizations that answer to someone.

Insurers

Evidence for OID Bulletin 2024-11, on the models you run.

Banks & credit unions

A dated baseline and quarterly trend record, ready before the examiner asks.

Health systems

Validation evidence for Joint Commission RUAIH. No PHI by design.

Law firms

Confidentiality-safe measurement for self-hosted models. Client files never touched.

Sovereign & data-sovereignty organizations

The measurement runs entirely inside systems you control, for organizations where that is not negotiable.

PROOF OF METHOD
We ran this instrument across 18 open-weight models and 15 regulated domains under identical conditions, and published every score. Judge the method in the open before you put it on your model. For how it compares to published interpretability research, see The lab. See also the AI Reliability Index.

What the lab stands behind

Four commitments, in writing.

Every report is signed by name and dated.

Any factual or methodological error your reviewers find is corrected and the report re-issued at no cost.

Errors and omissions coverage is bound before any client work begins.

Every technical question your examiners or reviewers raise gets a written answer, on the record.

Engagements & pricing

The price is published, and it never depends on what we find.

Below: measurement for organizations that run AI. Further down: evidence for companies that sell it.

The founding cohort, by selection: three organizations will found this practice at $4,500, with Continuous Assurance locked at $4,500/quarter for two years. In exchange, you provide a short case study that you approve in writing; the rate is the same whether it is named or anonymized. Invoiced only on delivery. If the report gives your governance file nothing you can use, say so in writing within fourteen days and the invoice is cancelled. That guarantee is about the deliverable, never about what the measurement finds. One per client, three ever, then list price.

LLM Validation Report

$9,500

one-time · list

One model, your question types. Signed report, complete findings file, five business days. Your fabrication-risk baseline.

Where the value compounds

Continuous Assurance

$6,500

per quarter · list

Quarterly re-measurement, a trend record your governance file can point to, change gates when you swap or fine-tune models, and an incident allowance.

One production model: $6,500. A second model or heavy change cadence: $8,000. Founding clients lock in $4,500 and $6,000, respectively, for two years. Your exact number is in the engagement letter before you sign.

Model Selection Study

$4,500–7,500

one-time

Comparative measurement across candidate models before you commit, chosen on evidence instead of vendor claims.

Two candidates: $4,500. Each additional candidate: $1,500, up to four for $7,500.
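
The published pricing above works out as a base fee plus a per-candidate increment. A minimal sketch of that arithmetic, with a function name (`model_selection_price`) invented here for illustration; the engagement letter, not this snippet, states the actual fee.

```python
def model_selection_price(candidates: int) -> int:
    """Model Selection Study list price in US dollars, per the published schedule.

    Two candidates cost $4,500; each additional candidate adds $1,500,
    up to four candidates.
    """
    if not 2 <= candidates <= 4:
        raise ValueError("the published schedule covers two to four candidates")
    return 4500 + 1500 * (candidates - 2)


for n in (2, 3, 4):
    print(n, model_selection_price(n))
```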

For companies that sell AI

Independent evidence your buyers will actually accept.

Enterprise questionnaires grew an AI section. Hospital committees ask for validation evidence. The Vendor Assurance line is priced separately from everything above, with its own founding slots.

Vendor Evidence Pack

$12,500

one-time · list

One product, one model configuration: the signed evidence report, a shareable summary letter, a questionnaire-ready evidence brief, and twelve months of written answers to your buyers' reviewers.

Founding cohort: three vendors, by selection, at $6,500, invoiced only on delivery under the same usefulness guarantee as the founding client program, in exchange for case-study permission (named preferred, anonymized accepted). One post-remediation re-measurement within 60 days is included.

Evidence Refresh

$5,000

per quarter · list

Keeps the evidence inside the freshness window enterprise reviews ask about, re-measured as your model changes.

Founding vendors lock $4,000 per quarter for two years.

Ingestion Screen

$8,500

per training event

A ratio-based screening measurement of whether a fine-tune absorbed a specified corpus, with positive and negative controls and its limits stated plainly.

$4,500 when purchased alongside an Evidence Pack. CHAI Applied Model Card supplement: $2,500. The full line is on the AI vendors page.

No work begins without a signed engagement letter. Everything you share is held confidential under a written agreement first.

Straight answers

The questions a careful buyer should ask.

What access does this require?

A sample of your question types and a model endpoint you designate. That's the whole list. No customer, member, patient, policyholder, or client data. Nothing installed. We'll put the no-PHI design in writing for your vendor file.

Is this an audit or a certification?

Neither. It's an independent measurement. Audits and attestations are licensed work; measurement is ours. The report states what was measured, how, and what was found, in language your examiners read without translation.

Who does the work?

The person who built the instrument. Spectralgraph was founded by Joseph Stephens in Enid, Oklahoma; the method grew out of ultrasonic non-destructive testing and is patent pending (US App. No. 19/724,790). Engage the lab and the founder runs your measurement, not an analyst three layers down.

What does the measurement not do?

It doesn't decide truth, and it doesn't replace your people. It measures the model's fabrication signals and flags exactly what deserves human eyes. Every report says so on page one.

More questions are answered in the FAQ: hospital BAAs, confidential supervisory information, the founding terms in full, and what the lab stands behind.

The next step costs nothing and commits you to nothing.

One email gets you the complete sample report. Put it in front of your risk committee and judge the deliverable itself. Worst case, you've spent five minutes and you know exactly what independent AI testing evidence looks like.