Insights

Where governed AI changes the economics of insurance operations

Three use cases. Measurable outcomes. Every claim backed by evidence.

UC1

FNOL Web Forms

Structured extraction from first-notice packets — policy identifiers, claimant data, injury details — without an LLM call. 95%+ field accuracy on digital documents.

2,300 submissions/mo
95%+ non-LLM accuracy
$0.00015 per document
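The non-LLM extraction described above can be pictured as a keyword-anchored regex pass. The sketch below is illustrative only — the patterns and field names are assumptions for demonstration, not the production rule set:

```python
import re

# Illustrative patterns -- anchor on the printed field label,
# capture the value that follows. Not the production rule set.
FNOL_PATTERNS = {
    "policy_number": re.compile(r"Policy\s*(?:No\.?|Number)[.:\s]+([A-Z0-9-]+)"),
    "claim_number":  re.compile(r"Claim\s*(?:No\.?|Number)[.:\s]+([A-Z0-9]+)"),
    "date_of_birth": re.compile(r"Date\s+of\s+Birth[.:\s]+(\d{4}-\d{2}-\d{2})"),
}

def extract_fields(text: str) -> dict:
    """Run each deterministic pattern; a missing field stays None, never guessed."""
    out = {}
    for field, pattern in FNOL_PATTERNS.items():
        m = pattern.search(text)
        out[field] = m.group(1) if m else None
    return out
```

Because every value is captured by a named pattern, each extraction carries its own provenance: the pattern that matched and the position it matched at.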
UC2

Police Report Extraction

VIN, plate, DOB, DL number, citations — extracted deterministically from 50-state form variations. Scanned documents handled via T1+T2.5 stack with DPI-resilient table recognition.

88 reports/day
80–88% Cat 1–2 accuracy
0 external API calls
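Deterministic extraction works for identifiers with a fixed grammar. A VIN, for example, is exactly 17 characters drawn from A–H, J–N, P, R–Z, and 0–9 (I, O, and Q are excluded by the standard to avoid confusion with 1 and 0), which a single pattern can enforce — a minimal sketch, not the production stack:

```python
import re

# VINs: 17 characters, letters I, O, Q excluded by the standard.
VIN_RE = re.compile(r"\b([A-HJ-NPR-Z0-9]{17})\b")

def extract_vin(text: str):
    """Return the first VIN-shaped token, or None if no token qualifies."""
    m = VIN_RE.search(text)
    return m.group(1) if m else None
```

The same fixed-grammar property holds for plates, DL numbers, and DOBs, which is why these fields never need an LLM regardless of state-form layout.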
UC3

IA Report Intelligence

Coverage values, settlement amounts, and reserve flags extracted with full provenance. Subrogation analysis — the one genuinely inferential field — handled by a scoped local inference call, never an external API.

55 reports/day
1 field requires LLM
T4 carrier-premise only
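A scoped local inference call of the kind described — one field, one excerpt, nothing leaving the carrier premise — could look like the following sketch against a local Ollama endpoint. The model name and prompt wording are illustrative assumptions:

```python
import json
import urllib.request

# Local Ollama endpoint inside the carrier VPC -- no external API.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_subrogation_request(report_excerpt: str) -> dict:
    """Build the scoped prompt for the one inferential field.

    Only the relevant excerpt is sent, and only to a local model.
    Model name is an illustrative assumption.
    """
    return {
        "model": "llama3.1:8b",
        "prompt": (
            "Based only on the excerpt below, answer YES or NO: "
            "does this adjuster report indicate subrogation potential?\n\n"
            + report_excerpt
        ),
        "stream": False,
    }

def call_local_llm(payload: dict) -> str:
    """POST the payload to the local endpoint and return the model's text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The key design point is the scoping: the model never sees the full document, only the excerpt relevant to the single inferential field.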
Evidence Lab

Benchmarks that can be replicated

Every finding links to runnable code. Results are reproduced deterministically — not from a single run.

Case Study #001 Published

The Extraction Intelligence Benchmark

Zero-LLM vs. Grounded LLM vs. Single-Prompt on a WC FNOL

Three extraction architectures tested on a 7-page NY WC FNOL with a dual-employer layout trap. The architecture with character-interval grounding hallucinated the claimant's name and date of injury. The zero-LLM approach extracted 19 of 19 deterministic fields correctly with full document provenance.

19/19 Cat 1–2 correct, zero LLM
3 hallucinations, "grounded" approach
$0 API cost per document (Approach C)
Case Study #001 · WC FNOL Document Extraction · May 2026

The Extraction Intelligence Benchmark

A controlled comparison of three extraction architectures on a single high-complexity WC FNOL reveals that the approach marketed for its audit trail hallucinated the claimant's first name, last name, and date of injury. The zero-LLM approach extracted every deterministic field correctly and is the only architecture defensible under regulatory examination.

19/19 · Deterministic fields correct · Approach C, zero API calls
3 · Grounded hallucinations · Approach B, 101 intervals cited
86% · Fields non-LLM addressable · Only Cat 4 inference requires LLM
10× · Latency penalty, grounded LLM · 42.7s vs. 4.3s (single-prompt)
Exhibit 1 — Three architectures under test

Metric                         A: Groq llama-3.3-70b (T5)  B: LangExtract · Gemini (T5)  C: Docling + Regex (T0+T2)
External API calls             1                           7                             0
Latency                        4.3s                        42.7s                         8.2s
Fields extracted (of 36)       36                          36                            29 (Cat 1–3 only)
Cat 1–2 deterministic fields   19/19                       15/19                         19/19
Hallucinations                 0                           3 ▲                           0
Document-provenance grounded   0 values                    101 intervals †               29/29
Deterministic (zero variance)  No                          No                            Yes
Cost per document              ~$0.04–0.06                 ~$0.28–0.42                   ~$0.00015
PHI-safe, data posture B/C     No                          No                            Yes

▲ Approach B grounding intervals point to real text from wrong entities — see Finding 1.  † 29/36 fields for Approach C = Cat 1–3; the 7 missing Cat 4 inference fields were left empty, not hallucinated.

Finding 1

Grounding intervals do not guarantee correct field attribution

LangExtract returned a character-interval citation for every extracted value — the feature distinguishing its audit architecture from a standard LLM call. Three of those intervals pointed to real text at the cited position. The text belonged to the wrong entity or the wrong date context.

A grounding interval proves a string exists in the document. It does not prove the string is the correct value for the correct field. Under NYDFS Part 216 examination, this distinction is the difference between passing and failing provenance review.
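A toy example makes the distinction concrete. The interval below cites real text at the cited position — the grounding claim holds — yet the string sits in the employer section, not the claimant's. The document text is invented for illustration:

```python
import re

DOC = (
    "Employer Information\n"
    "Client: Franklin Logistics Inc., 400 Dock Rd\n"
    "Employee Information\n"
    "First Name: Terrence   Last Name: Jackson\n"
)

m = re.search(r"Franklin", DOC)
start, end = m.span()

# The grounding claim holds: the interval points at real source text.
assert DOC[start:end] == "Franklin"

# But the span ends before the Employee Information header even begins --
# it cites the shipper, not the claimant.
assert end < DOC.index("Employee Information")
```

Both assertions pass: a citation can be positionally valid and semantically wrong at the same time.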

Exhibit 2 — Three grounded hallucinations (Approach B)

Field           Extracted   Cited source text                 Actual source
first_name      "Franklin"  "Franklin Logistics Inc…"         Third-party shipper — not the claimant
last_name       "Mr."       "Dear Mr. Johnson,"               Broker salutation — not a surname
date_of_injury  "March 4"   "filed March 4 by prior counsel"  Attorney filing date — injury was March 18
Exhibit 3 — Category 1 & 2 field results: 19 deterministic fields

Field                  Ground truth             Approach B hallucination
policy_number          AP-2026-WC-9214
claim_number           WCH250721001
employer_fein          47-2381094
naics_code             561320
date_of_injury         2026-03-18               ✗ "March 4"
date_reported          2026-03-28
date_of_birth          1988-07-15
ssn_last4              4721
hourly_rate            $28.50
avg_weekly_wage        $1,140.00
reporting_delay_days   10
attorney_contact_date  2026-03-21
first_name             Terrence                 ✗ "Franklin"
last_name              Jackson                  ✗ "Mr."
employer_name          Apex Staffing Solutions
body_part_primary      Lower back / lumbar
injury_mechanism       Lifting / exertion
occupation_class       Warehouse / labor
state_of_injury        NY

Score — Category 1 & 2: A 19/19 · B 15/19 · C 19/19

✗ rows mark fields where Approach B returned a grounded hallucination. Approach C Cat 4 fields (claim type, RTW status, attorney flag, same body part, delay flag) were left empty by design — not hallucinated.

Finding 2

Section-aware extraction makes the error category structurally unreachable

The document contains three business entities — Apex Staffing, Excel Manufacturing, and Franklin Logistics — before the Employee Information section that contains the claimant's name. Approach B scanned the full document; Approach C partitioned it.

Each regex pattern runs only against its assigned section pool. The first_name pattern sees only text under the Employee Information header. "Franklin" exists only in the Employer Information pool. The two pools never intersect. The attribution error is not a probability to manage — it is a structural impossibility.

Section partitioning (Python)

import re

SECTION_MAP = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
    "injury":   re.compile(r"(?:injury|incident)\s+(?:information|details)", re.I),
}

# first_name runs against the "employee" pool only.
# "Franklin" exists in the "employer" pool only.
# The pools never intersect, so the attribution error is unreachable.
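Applied end to end, the partitioning might look like the following self-contained sketch with a toy document. The header set, helper function, and field pattern are illustrative, not the production implementation:

```python
import re

# Illustrative section headers and field pattern.
SECTIONS = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
}
FIRST_NAME = re.compile(r"First\s+Name[.:\s]+([A-Z][a-z]+)\b")

def partition(text: str) -> dict:
    """Split the document into per-section text pools by header position."""
    hits = sorted(
        (m.start(), name)
        for name, pat in SECTIONS.items()
        for m in [pat.search(text)]
        if m
    )
    pools = {}
    for i, (start, name) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        pools[name] = text[start:end]
    return pools

doc = (
    "Employer Information\nClient: Franklin Logistics Inc.\n"
    "Employee Information\nFirst Name: Terrence\n"
)
pools = partition(doc)
m = FIRST_NAME.search(pools["employee"])
```

Here `m.group(1)` is "Terrence", and "Franklin" never appears in the employee pool that the first_name pattern sees — the wrong-entity match has no text to match against.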
Finding 3 — Audit defensibility

Only one architecture passes regulatory examination

When an examiner asks "where did this value come from?", the three architectures answer differently:

A — Groq: "The language model extracted policy number AP-2026-WC-9214 from the document. Confidence: high." Result: no provenance.

B — LangExtract: "The value 'Franklin' was extracted from characters 412–419, which reads 'Franklin'." Result: misleading — the interval cites the wrong entity.

C — Docling+Regex: "First name matched by First\s+Name[.:\s]+([A-Z][a-z]+)\b in the Employee Information section. Deterministic. Reproducible on every run." Result: passes examination.
Implications

1. Field classification precedes tool selection. The category of a field — deterministic, categorical, verbatim, or inferential — determines the correct extraction tier. Applying an LLM to Cat 1 deterministic fields introduces hallucination risk on the most auditable class of data in a claim file.

2. Grounding intervals are necessary but not sufficient for audit defensibility. A character offset that cites real source text does not prove correct entity attribution on complex, multi-entity documents. Section-aware deterministic extraction provides a stronger correctness guarantee for Cat 1–2 fields.

3. The LLM-required zone is 11–14% of this document's field set. Five of 36 fields require genuine inference: claim type, RTW status, attorney flag, same-body-part comparison, and delay threshold interpretation. The optimal architecture is not LLM vs. non-LLM — it is governing which fields go to which tier.

4. PHI data posture is a tier-selection constraint, not a post-design concern. For carriers with posture B (carrier-cloud) or C (air-gapped), external APIs are not available for PHI documents regardless of accuracy results. T0+T2 for Cat 1–3 combined with T4 local inference (Ollama, carrier VPC) for Cat 4 is the only architecture that satisfies these constraints end-to-end.
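The tier-governance idea in points 1 and 4 reduces to a routing table. The sketch below uses hypothetical field names; category labels follow the case study and tier names (T0/T2/T4) mirror the architecture described above:

```python
# Illustrative category assignments -- field names are hypothetical.
# Cat 1-3: deterministic / categorical / verbatim. Cat 4: inferential.
FIELD_CATEGORY = {
    "policy_number": 1,
    "state_of_injury": 2,
    "subrogation_flag": 4,
}

def route(field: str) -> str:
    """Route a field to its extraction tier by category, never by default."""
    cat = FIELD_CATEGORY[field]
    if cat <= 3:
        return "T0+T2 deterministic extraction"  # regex over section pools
    return "T4 local inference"                  # scoped local LLM, carrier VPC
```

The governance guarantee is the routing itself: a Cat 1 field can never reach an LLM, and a Cat 4 field can never reach an external API.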

This benchmark is anchored to an active enterprise POC covering 79,908 annual documents across three use cases. Phase 1 recommendation: run Approach C on 50 real labeled documents before any GPU or LLM infrastructure investment.

Discuss the extraction architecture →