What Healthcare Prior Authorization Taught Us

A field report on AI in consequential decisions

Most writing about AI in regulated decisions is speculative. This post is not. It is a field report from building deterministic decision infrastructure for spine surgery prior authorization. The architecture has been in design for roughly 18 months; the software build began in earnest at the start of 2026 and has been moving quickly ever since, with a working product deployed against real clinical documents at a design partner practice. The post describes what we learned during that build, where we got things wrong, and what turned out to matter more than expected. For the architectural foundation, see The Bridge Pattern and Encoding Expert Rules.

The reason for writing this publicly is not to recount accomplishments. It is that the lessons from this specific domain generalize to other domains where AI is now entering consequential decisions, and they are not the lessons that teams who have not built in these domains tend to expect. AI-focused teams usually carry assumptions about how healthcare works, how payer policies are structured, and how coordinators actually use tools, and those assumptions are wrong in ways that produce systems that demo well but do not work in practice.

This post is written for engineers, product leaders, and executives building AI systems for regulated workflows. The healthcare specifics are illustrative. The patterns apply to insurance underwriting, logistics exception handling, government benefits determination, regulatory compliance, and any other domain where expert rules meet messy inputs and audit requirements are real.

The domain, in brief

Prior authorization is the process by which a healthcare provider requests insurance coverage approval for a procedure or service before performing it. For spine surgery, a typical workflow involves a practice coordinator reviewing a surgeon's plan, gathering clinical documentation, interpreting the patient's insurance payer's policy, and submitting a request packet that demonstrates the procedure meets the payer's coverage criteria.

The failure modes of this process are well documented. Denial rates on initial submission hover around 15-25% for spine procedures, with most denials attributable to documentation gaps that could have been identified before submission. Each denial triggers a peer-to-peer review, an appeal, or a resubmission, each of which consumes coordinator time, delays patient care by days or weeks, and carries a real risk that the procedure is ultimately denied or abandoned.

The administrative burden is substantial. A mid-size spine practice runs one to three full-time coordinators, each handling 20-40 prior authorization requests per week. The coordinators are experienced; they know the surgeons' practices, the major payers' quirks, and the common documentation gaps. But the rules they are working against are vast. A single major payer's spine policy may run 40 pages, with criteria that update multiple times per year and that vary by plan type within the same payer. The coordinators are working from memory, from notes, from tribal knowledge passed between them.

An AI system in this domain has a specific opportunity: read the surgeon's operative plan and the patient's clinical record, determine which codes apply, check them against the payer's current policy and the national bundling rules, and produce either a submission-ready packet or a specific list of gaps that need to be resolved before submission. This is the problem we built our product to solve. What follows is what we learned while building it.
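
To make the shape of that concrete, here is a minimal sketch of the engine's two possible outcomes as just described. The names and fields (Gap, Determination) are illustrative assumptions, not our actual schema.

```python
# Illustrative sketch only: names and fields are hypothetical, not the
# product's actual API. It shows the two possible outcomes described above:
# a submission-ready determination, or a concrete list of gaps to resolve.
from dataclasses import dataclass, field

@dataclass
class Gap:
    criterion_id: str      # the unmet policy criterion
    description: str       # what documentation is missing
    suggested_source: str  # where the coordinator is likely to find it

@dataclass
class Determination:
    procedure_codes: list[str]     # codes the engine believes apply
    satisfied_criteria: list[str]  # criteria with supporting evidence found
    gaps: list[Gap] = field(default_factory=list)

    @property
    def submission_ready(self) -> bool:
        # Ready to submit only when every applicable criterion is satisfied.
        return not self.gaps
```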

Lesson one: the rules are harder than the AI

We started the project assuming the hard part would be the AI. Reading an operative plan is messy work; the notes are often handwritten, dictated, or produced in various EMR templates with heavy abbreviation and surgeon-specific conventions. We expected to spend most of our effort on making a language model produce a reliable structured interpretation of these notes.

That expectation was wrong. The reading work is not trivial, but modern language models handle it adequately with careful prompting and a structured output schema. The coordinator can correct any errors in a few seconds at the confirmation checkpoint. The interpretation stage, once the prompt templates are tuned, is the smallest ongoing maintenance cost in the system.
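
For readers who want a sense of what "a structured output schema" means here, the sketch below shows one plausible shape for the interpretation stage's output. The fields and the validate_plan helper are illustrative assumptions, not our production schema.

```python
# One plausible shape for the interpretation stage's structured output.
# The fields are illustrative assumptions, not our production schema.
from dataclasses import dataclass

@dataclass
class InterpretedPlan:
    procedure: str                        # e.g. "lumbar fusion"
    levels: list[str]                     # e.g. ["L4-L5", "L5-S1"]
    approach: str                         # e.g. "posterior"
    is_revision: bool
    conservative_care_weeks: int | None   # documented non-surgical treatment
    evidence_spans: dict[str, str]        # field name -> quoted source text

def validate_plan(raw: dict) -> InterpretedPlan:
    """Reject model output missing required fields before it reaches the
    coordinator's confirmation checkpoint."""
    required = {"procedure", "levels", "approach", "is_revision"}
    missing = required - raw.keys()
    if missing:
        raise ValueError(f"interpretation incomplete: missing {sorted(missing)}")
    return InterpretedPlan(
        procedure=raw["procedure"],
        levels=list(raw["levels"]),
        approach=raw["approach"],
        is_revision=bool(raw["is_revision"]),
        conservative_care_weeks=raw.get("conservative_care_weeks"),
        evidence_spans=dict(raw.get("evidence_spans", {})),
    )
```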

The rules are where the work lives. Encoding a single payer's spine policy correctly requires a coder or coordinator who understands the policy language, a product engineer who understands the Criterion object schema, and iterative testing against real case examples to confirm the encoded rules produce the same decisions as the human expert. A single payer's lumbar fusion policy might have 15 to 30 distinct criteria, plus exclusion rules, plus step therapy requirements, plus modifier logic. Getting those encoded correctly takes days of focused work per payer.
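
The sketch below shows roughly what a Criterion object and a rule-pack evaluation might look like. The fields, the predicate style, and the example criterion are illustrative assumptions rather than our actual encoding.

```python
# Sketch of a Criterion object and a rule-pack evaluation loop. Fields and
# evaluation logic are illustrative assumptions, not the actual encoding.
from dataclasses import dataclass, field
from typing import Callable

@dataclass(frozen=True)
class Criterion:
    criterion_id: str
    description: str                   # the policy language, paraphrased
    applies_to: frozenset[str]         # procedure codes this criterion gates
    predicate: Callable[[dict], bool]  # deterministic check against case facts
    evidence_hint: str = ""            # what documentation satisfies it

@dataclass
class RulePack:
    payer: str
    policy_version: str
    criteria: list[Criterion] = field(default_factory=list)

    def unmet(self, case_facts: dict, codes: list[str]) -> list[Criterion]:
        """Return every applicable criterion the case does not yet satisfy."""
        requested = set(codes)
        return [
            c for c in self.criteria
            if c.applies_to & requested and not c.predicate(case_facts)
        ]

# Example encoding of one conservative-care criterion (values invented):
six_weeks_conservative_care = Criterion(
    criterion_id="LUMBAR-FUSION-01",
    description="At least 6 weeks of documented non-surgical treatment",
    applies_to=frozenset({"lumbar_fusion"}),
    predicate=lambda facts: facts.get("conservative_care_weeks", 0) >= 6,
    evidence_hint="PT notes or injection records covering the treatment window",
)
```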

Multiply this by the 120+ major payers we support, multiply by the number of procedure categories (fusion, decompression, disc replacement, instrumentation, revisions, etc.), multiply by the update cycles (typically two to four policy updates per payer per year), and the rule-maintenance workload dwarfs everything else in the system.

The implication for teams building in similar domains is that the architecture must be optimized for rule-authoring throughput, not for model training. The engineering investments that pay off are the ones that make rules easier to encode, test, and update. A fast model with a badly structured rule encoding produces a system that crawls. A modest model with a good rule encoding produces a system that covers a new payer every few weeks.

We did not realize this at the start of the software build. We realized it early on, when we had a working model but only a handful of payers encoded, and understood that continuing at that rate meant it would be a long time before we shipped a product any real practice could use. Everything about our rule-authoring workflow (the schema, the tooling, the test corpus, the review process) was rebuilt around rule-authoring velocity as the primary design constraint.

Lesson two: coordinators know things we do not

The practice coordinators at our design partner practice are some of the most informed experts in this domain, and the AI industry consistently underestimates this. A senior coordinator at a spine practice has seen thousands of cases, knows the idiosyncrasies of every surgeon they work with, knows which payers deny which codes under which conditions, and has developed intuitions that no published policy document captures.

Early in the project, we built a version of the product that tried to minimize coordinator involvement. The AI would read the case, produce a recommendation, and the coordinator would click approve. This design failed in the obvious way: the coordinators did not trust the AI's reasoning, and they could not verify it quickly enough for the tool to be faster than doing the work themselves.

We rebuilt the product around the coordinator's expertise rather than around minimizing their role. The current version presents the AI's interpretation prominently, highlights the specific evidence from the clinical documents, surfaces what is uncertain, and lets the coordinator modify any field with one click. The coordinator stays in the driver's seat. The AI is a fast research assistant, not a decision maker.

The coordinators use this version. They use it because it respects their expertise rather than attempting to replace it. They catch errors the AI makes. They add context the AI did not see. They push back on suggestions that do not match how the surgeon actually practices. The system gets better because of this, not worse. The coordinators' corrections become training data, test cases, and rule-pack improvements.

There is a broader lesson in this. In regulated domains, the operators are experts. The AI industry often talks about replacing them, which is a misunderstanding of the problem. The problem is not that the operators are insufficient. The problem is that they are overwhelmed by volume, stuck on mechanical work that does not require their expertise, and under-supported by tools that do not amplify their judgment. A system that replaces them is a threat. A system that makes them faster at the work they already do well is a tool they will adopt.

This distinction matters because it shapes adoption. Our product is used because it respects the expertise of its users. A competing product that treated the coordinators as obstacles would not be used, regardless of how sophisticated the underlying AI was.

Lesson three: determinism is a trust-building feature

When we first described the deterministic architecture to clinical users, we expected the response to be either skepticism (it is hard to explain why determinism matters until you have been burned by non-determinism) or indifference (most users do not care about implementation details). The actual response was neither. The coordinators understood determinism instantly and valued it more than we had expected.

The reason is that coordinators have been burned, repeatedly, by prior software that seemed to work inconsistently. Claims software that produced different results on the same patient. Coding suggestions that varied based on which menu path the user took. Authorization tools that approved a case one day and denied a nearly identical case the next. The coordinators have spent careers navigating software that behaves unpredictably, and they have come to distrust any system that cannot produce the same answer twice.

When we demonstrate that running the same case through our engine produces byte-identical output every time, the coordinators recognize this as a qualitative change in the relationship between themselves and the software. They can learn the system. They can build mental models of how it works. They can trust that a code that was suggested today will be suggested tomorrow under the same conditions. This trust is the foundation on which they can actually rely on the tool in their workflow, rather than treating it as a guess that requires manual verification every time.

This translates to a specific product consequence. When we hit a bug in which one of our rule evaluation paths produced non-deterministic output (due to an ordering issue in a dictionary iteration that had worked correctly for months before a library update exposed it), we treated the bug as a severity-one defect and prioritized it above every other engineering task. The coordinators who reported the bug did not describe it in technical terms. They said the tool seemed to be losing its mind. That framing is exactly right. A deterministic tool that goes non-deterministic is experienced by its users as a kind of betrayal, and repairing that trust requires both a technical fix and a communication about how the fix was verified.
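
The fix and its verification can be surprisingly small. The sketch below shows one way to guard the determinism invariant in CI, assuming a hypothetical evaluate_case entry point: canonicalize the output so dictionary ordering cannot leak through, then fail loudly if repeated runs produce different bytes.

```python
# Guarding the determinism invariant in CI: run the same case repeatedly,
# serialize the output canonically, and fail if the bytes ever differ.
# The engine interface (evaluate_case) is a hypothetical stand-in.
import hashlib
import json

def canonical_bytes(result: dict) -> bytes:
    # sort_keys makes the serialization independent of dict insertion or
    # iteration order -- the class of bug described above.
    return json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")

def assert_deterministic(evaluate_case, case: dict, runs: int = 5) -> str:
    digests = {
        hashlib.sha256(canonical_bytes(evaluate_case(case))).hexdigest()
        for _ in range(runs)
    }
    assert len(digests) == 1, f"non-deterministic output: {sorted(digests)}"
    return digests.pop()
```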

The broader lesson is that determinism is not only a technical property. It is a user experience property. It is the foundation of a particular kind of trust that regulated-domain users need in order to actually depend on a tool rather than second-guess it. Teams building in these domains should treat determinism invariants as user-facing commitments, not just engineering concerns.

Lesson four: the policy updates are the real product

A payer policy is a moving target. Major payers update their spine policies two to four times per year. Smaller regional payers update more irregularly but sometimes more aggressively. A policy that is correctly encoded today is probably already drifting by the time it ships. A policy that was correctly encoded a year ago is likely wrong in meaningful ways.

Teams building systems that depend on encoded policies often underestimate the ongoing maintenance cost. The policies are not a one-time investment. They are an ongoing operational commitment that scales with the number of payers, the rate of policy change, and the depth of the encoding. A system that claims to cover 120 payers but is running stale rule packs for 80 of them is worse than useless; it produces confident but wrong answers.

This realization shaped our product roadmap more than any other single insight. We now consider the policy-maintenance pipeline the actual product. Everything else is infrastructure that supports the pipeline. The pipeline includes: monitoring payers' policy publication channels, downloading and diffing new policies against prior versions, routing changed criteria to the encoding team, testing updated rule packs against the regression suite, versioning the rule packs with effective dates, and deploying new versions to production with an audit trail of what changed and when.
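
One small piece of that pipeline, effective-dated rule-pack versioning, looks roughly like the sketch below. The structures and names are illustrative assumptions, not our implementation.

```python
# Sketch of effective-dated rule-pack versioning, one piece of the pipeline
# described above. Structures and names are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RulePackVersion:
    payer: str
    version: str             # e.g. "2026-03"
    effective_from: date     # first date of service this version applies to
    change_summary: str      # human-readable diff against the prior version

def pack_for_date_of_service(versions: list[RulePackVersion], dos: date) -> RulePackVersion:
    """Pick the newest version whose effective date is on or before the date
    of service, so re-running an old case reproduces the old decision."""
    eligible = [v for v in versions if v.effective_from <= dos]
    if not eligible:
        raise LookupError("no rule pack effective on this date of service")
    return max(eligible, key=lambda v: v.effective_from)
```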

A team building a system like ours without this pipeline will ship v1 and then discover that v1 was the easy part. Keeping it current is where the operational cost lives, and where the competitive moat is. A competitor can copy our architecture in six months. They cannot easily replicate the accumulated policy encodings, the test corpus, and the maintenance discipline that keep the system current.

Other domains face analogous pressures. Insurance underwriting rules update constantly. Logistics SLA terms change with each contract renegotiation. Regulatory compliance criteria evolve with each new guidance document. The ongoing maintenance of the encoded rules is the real work in any of these domains, and it is work that has to be designed into the product from the start rather than bolted on later.

Lesson five: the hardest bugs are in the edges

Our engine has 381 tests at the time of this writing. Most of them were written to cover specific edge cases we found while validating against real clinical documents. Almost none of them were written from first principles during initial design.

The pattern of how these edge cases emerge is consistent. The engine works well on the 80% of cases that look like the training examples we started with. A coordinator then runs a case that is slightly unusual: a multi-level fusion with one level being a revision, a procedure bundled with hardware that is itself a revision, an interspinous device combined with a decompression at the same level. The engine produces an answer that is wrong or partially wrong. The coordinator flags it. We trace through the rule evaluation, find the gap in our logic, and add both a fix and a regression test.

Over the course of the build, we have accumulated 381 tests because we have seen 381 distinct edge cases that taught us something. The tests are the most valuable asset the engine has. They encode accumulated understanding of what real clinical cases look like, where the rule encoding can go subtly wrong, and what shapes of input require special handling.
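
Mechanically, the pattern is simple: each flagged case becomes a fixture containing the de-identified input and the output the coordinator confirmed as correct, and the suite replays all of them on every change. The sketch below assumes a hypothetical tests/cases layout and engine entry point.

```python
# Real-case regression pattern: one fixture per de-identified case, replayed
# on every change. The path and the engine import are hypothetical.
import json
from pathlib import Path

import pytest

CASE_DIR = Path("tests/cases")  # one JSON file per de-identified real case

@pytest.mark.parametrize(
    "case_file", sorted(CASE_DIR.glob("*.json")), ids=lambda p: p.stem
)
def test_real_case(case_file):
    case = json.loads(case_file.read_text())
    from engine import evaluate_case  # hypothetical engine entry point
    actual = evaluate_case(case["input"])
    # The expected block is what the coordinator confirmed as correct.
    assert actual == case["expected"], f"regression on {case_file.name}"
```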

The lesson for teams building in regulated domains is that the test corpus is the product. The engine without the test corpus is a fragile prototype. The test corpus is what makes the engine something you can update with confidence, hire new engineers to work on, and deploy against higher-stakes workflows over time. Investing in the test corpus early, even when it feels premature, pays off exponentially as the system matures.

A related lesson is that the tests have to be against real cases, not synthetic ones. Early in the project we tried to accelerate testing by generating synthetic clinical cases. This produced tests that covered the space we could imagine, which is the space our engine was already designed for. The tests that mattered turned out to be the ones derived from actual cases, because actual cases contained combinations and phrasings that we would never have thought to synthesize. The test corpus only becomes valuable when it represents the domain rather than the engineers' model of the domain.

Lesson six: conference attendance is a feature

I did not expect to write this one, but it is true. Attending industry conferences is one of the highest-leverage activities for building a product in a regulated domain, because the conferences are where domain experts discuss the current state of their field in ways that do not appear in publications.

Spine surgery has an annual coding conference where payer policy updates, NCCI bundling rule changes, and evolving documentation standards are discussed in detail by people who work in the space full time. Attending this conference (we go to the annual North American Spine Society coding meeting) is how we learn about changes that will affect our rule packs months before they appear in written form. It is also how we identify new criteria that should be added to our engine, new edge cases to test against, and new documentation patterns that the interpretation stage needs to handle.

More broadly, conferences in any regulated domain are the primary venue where the domain's working experts make their tacit knowledge legible. The presentations, the hallway conversations, the panel discussions, the questions asked during Q&A, all of this is material that cannot be extracted from written sources. A team building in a regulated domain that does not send people to the domain's conferences is missing a significant intelligence stream.

The broader lesson is that building software for a specialized field requires immersion in that field. The engineering team should include or work closely with people who attend the conferences, read the trade publications, and talk to the practitioners. The AI industry often assumes that domain expertise can be replaced by technical sophistication. In regulated decision domains, this assumption is wrong. Technical sophistication without domain immersion produces systems that are elegant, fast, and wrong.

Lesson seven: the business case is the person, not the institution

When we started the project, we framed the value proposition institutionally. The system would save the practice money by reducing denials, cutting coordinator time, and improving first-pass approval rates. These were real benefits and we had data to back them up.

Over time the framing shifted. The most compelling thing our product does is not saving the practice money. It is catching documentation gaps before a patient is denied. A patient whose surgery is delayed three weeks because the coordinator missed a criterion is a patient suffering unnecessarily. The coordinator is not to blame; the information volume is simply too large to process reliably under time pressure. A system that catches what the coordinator cannot catch under that pressure converts abstract efficiency gains into specific human outcomes.

The conversations go better when we lead with the patient outcome and treat the institutional savings as a downstream consequence. This is not spin. It is the correct order of explanation, because the patient outcome is the actual reason the work matters. The institutional savings follow from caring about the patient outcome, not the other way around.

This applies more broadly. In any regulated decision domain, the decision affects an individual: a patient, a claimant, an applicant, a person in a specific situation. Systems that improve the decision process have institutional buyers, but the value those systems create is ultimately absorbed by the individual. Building with the individual as the primary value unit makes better products and produces better conversations with the institutional buyers, because most institutional buyers are, themselves, people who care about the outcomes their institutions produce.

What generalizes to other domains

Spine surgery prior authorization is a specific domain with specific characteristics, but the lessons from building in it generalize further than I expected. For teams building AI systems in other regulated decision domains, the applicable takeaways are:

  • The rule encoding is the hard part. Architect the system around rule-authoring throughput.
  • The domain experts who currently do this work manually know things the AI does not. Build for them, not instead of them.
  • Determinism is a trust property, not just a technical one. Defend it fiercely. Communicate about it clearly.
  • The ongoing maintenance of encoded rules is the real product. Ship the pipeline, not just the engine.
  • The test corpus is the accumulated wisdom of the system. Invest in it. Test against real cases, not synthetic ones.
  • Go to the domain's conferences. Read the domain's publications. The tacit knowledge is not online.
  • Frame value in terms of the person affected by the decision, not the institution. The institution is the buyer but the person is the reason the work matters.

None of these are revolutionary insights taken individually. What is non-obvious is that they are the actual work of building reliable AI-in-consequential-decisions systems, and that AI-focused teams routinely underweight all of them in favor of model sophistication, architectural elegance, or go-to-market velocity. The teams that succeed in these domains are the ones that treat the boring, operational, domain-heavy work as the main thing and the AI as a capability that supports it.

What we got wrong

It is worth being explicit about what we misjudged during the project, since most field reports are written as success narratives and the failures are where the most useful learning lives.

We underestimated the encoding workload. We knew rule encoding would be real work. We did not appreciate that it would be the dominant workload of the product, and that it would need its own tooling, its own review process, and its own velocity targets. The early weeks of the build were spent on the engine. A substantial chunk of the subsequent build was spent rebuilding the encoding pipeline we should have built first.

We overestimated the value of rare edge cases in early design. Early in the project, we spent engineering effort on handling unusual cases that a coordinator would encounter maybe once a year. This was wasted effort. The coordinator handles such cases manually anyway, and doing so is not a meaningful burden. The high-frequency cases that the coordinator does dozens of times per week are where time savings actually come from. Focus on frequency first, edge cases later.

We treated the human operator as a review gate rather than a peer. Our early designs put the coordinator at the end of the pipeline reviewing a complete AI recommendation. This did not work. The coordinator needs to be the decision maker, not the reviewer. The AI prepares material for their decision. This is the bridge pattern applied correctly, and we rediscovered it the hard way by shipping a version that treated the coordinator as a rubber stamp and watching them refuse to use it.

We launched without enough coverage. An early version of the product handled lumbar fusion well but had thin coverage for cervical procedures. A practice that saw both kinds of cases could not use the product for their real workflow. We now build coverage horizontally (all major procedure categories) before going deep on any single category, because a product that handles 60% of a practice's cases well is more useful than one that handles 95% of a narrow slice and nothing else.

We underinvested in operator training early on. The product has a learning curve. Coordinators use it more effectively after a few weeks than they do on day one. We initially expected the UI to be self-explanatory. It is not, and trying to make it so added complexity without solving the problem. Now we budget time for direct training with each practice, which is both more effective and less expensive than trying to engineer away the learning curve.

Where this goes next

At the time of this writing, our product is shipped to a design partner practice, validated against real clinical documents, holding 100% pass rate on 381 automated tests, and covering 120+ payer rule sets. We have a second vertical for logistics exception handling built on the same architecture, which is itself validation that the patterns documented here generalize.

What I want to leave readers with is not a conclusion but an invitation. Building AI systems for regulated decisions is hard in ways that are specific, learnable, and underrepresented in the AI industry's current conversation. The teams doing this work seriously are accumulating knowledge that should become shared discipline, not proprietary expertise. Every team that publishes lessons from their own field report makes the next team's work easier.

For teams currently building in similar domains or considering doing so: the architecture is more important than the AI, the domain experts are smarter than you think, the rules are harder than they look, and the test corpus is the whole game. Invest accordingly. The systems you build will matter to the people on the receiving end of their decisions, and that is ultimately the only metric that counts.

Ryan Kamykowski is the CEO and co-founder of Avectic Corporation, which builds deterministic decision infrastructure for AI-assisted workflows. For questions or discussion, email info@avectic.com.