What changed: models read pages more like humans do
Older OCR-first stacks flattened layout and lost tables, checkboxes, and multi-column semantics. Modern vision-language models preserve spatial relationships and can reason about messy scans, provided that prompts, evaluation harnesses, and post-processing rules are disciplined. That shift matters most on long-tail variants, where template rules used to explode in complexity.
What did not change: money fields still need paranoia
No model removes the need for cross-field checks, payer-specific sanity bounds, and audit trails from extracted values back to pixels. Silent money errors are still the fastest way to lose trust, and the hardest to detect if you only measure happy-path accuracy.
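To make the idea concrete, here is a minimal sketch of cross-field checks and payer-specific sanity bounds. The field names, payer ceilings, and tolerance are all illustrative assumptions, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    # Hypothetical extracted fields; a real schema would carry many more.
    subtotal: float
    tax: float
    total: float
    payer: str

# Illustrative per-payer sanity ceilings (assumed values, not real limits).
PAYER_MAX_TOTAL = {"acme": 50_000.00, "globex": 10_000.00}

def validate(inv: Invoice, tol: float = 0.01) -> list[str]:
    """Return validation failures; an empty list means the invoice passes."""
    errors = []
    # Cross-field check: the line math must reconcile within a cent.
    if abs((inv.subtotal + inv.tax) - inv.total) > tol:
        errors.append(f"total mismatch: {inv.subtotal} + {inv.tax} != {inv.total}")
    # Payer-specific bound: totals above a known ceiling route to human review.
    ceiling = PAYER_MAX_TOTAL.get(inv.payer)
    if ceiling is not None and inv.total > ceiling:
        errors.append(f"total {inv.total} exceeds {inv.payer} ceiling {ceiling}")
    return errors
```

Failures from checks like these become the audit trail entries that link a rejected value back to the source page.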
What did not change: humans still own rare tails
The business question is how small you can make the review queue while keeping precision where finance requires it. IDP programs still live in spreadsheets of exception reasons, and the best teams use those spreadsheets to train rules and to decide which clusters deserve new model prompts.
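Turning that exception spreadsheet into priorities can be as simple as tallying reasons by volume. A sketch, with hypothetical log rows and reason strings:

```python
from collections import Counter

def top_exception_clusters(rows: list[tuple[str, str]], k: int = 1):
    """Rank exception reasons by volume; rows are (document_id, reason) pairs.
    The highest-volume clusters are candidates for new rules or prompts."""
    return Counter(reason for _, reason in rows).most_common(k)

# Illustrative exception log; reasons are assumptions, not real categories.
sample = [
    ("d1", "handwritten total"),
    ("d2", "multi-page table"),
    ("d3", "handwritten total"),
]
```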
What changed faster than buyers expected: throughput expectations
Leadership now asks for weekly improvement curves, not quarterly vendor roadmap slides. That expectation is healthy if it is paired with instrumentation. Without dashboards for per-field drift, teams confuse model updates with progress.
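One minimal per-field drift signal is the distance between this week's and last week's value distributions for a categorical field. A sketch using total variation distance; the function name and framing are illustrative:

```python
from collections import Counter

def field_drift(last_week: list[str], this_week: list[str]) -> float:
    """Total variation distance between two weeks of one field's values.
    0.0 means identical distributions; 1.0 means completely disjoint."""
    p, q = Counter(last_week), Counter(this_week)
    n_p, n_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)
```

A dashboard that plots this number per field, per week, is enough to tell a genuine model improvement from noise.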
Opinion: buy evaluation discipline before you buy “more AI”
We would rather ship a smaller model with a ruthless eval harness and explicit failure budgets than chase the newest weights without telemetry. In IDP, the demo is easy; production is the adversary.
What to do this quarter
Pick one document family with painful volume, build a labeled set from real traffic, and publish precision/recall by field. Run a Rapid POC if you need an external team to accelerate that harness. If you cannot measure it, do not fund a production rollout.
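Publishing precision/recall by field needs only a small harness. A sketch under one simplifying assumption: exact string match, with `None` meaning the model abstained, so a wrong value counts against precision and an abstention on a labeled field counts against recall:

```python
def per_field_pr(labels: list[dict], preds: list[dict], fields: list[str]) -> dict:
    """Exact-match precision/recall per field over paired gold/predicted records."""
    out = {}
    for f in fields:
        tp = fp = fn = 0
        for gold, pred in zip(labels, preds):
            g, p = gold.get(f), pred.get(f)
            if p is not None and p == g:
                tp += 1          # model emitted the correct value
            elif p is not None:
                fp += 1          # model emitted a wrong value
            elif g is not None:
                fn += 1          # model abstained where a label exists
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[f] = (precision, recall)
    return out
```

Exact match is deliberately strict; a production harness would normalize dates and amounts first, but the per-field breakdown is the part that makes the numbers fundable.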
Where Databotiq fits
We implement extraction plus validation plus routing plus monitoring as one system, not a model bolted onto a rules engine as an afterthought. If you are evaluating vendors, ask for their last thirty days of drift charts, not their best conference poster.