If you are building a deterministic decision engine, the most important design decision you will make is how you represent rules. Get this right and the engine becomes maintainable, testable, and extensible across domains. Get it wrong and you end up with rules scattered across conditional statements in Python, rules hidden inside SQL queries, rules encoded in configuration files that nobody can read, and rules that exist only in the heads of the engineers who wrote them.
This post is about how we represent rules at Avectic. We call the core structure a Criterion object, and the architecture built around it is what lets our engine encode 120+ healthcare payer policies, 67 NCCI bundling rules, and an expanding set of logistics SLA criteria in a consistent, testable, version-controlled way. The structure is not complicated. The discipline of using it correctly across a growing codebase is the hard part.
The related post on the bridge pattern explains why our architecture separates AI interpretation from deterministic decision evaluation. This post is a deeper look at the decision side. If you have not read that post, the short version is that our engine takes a structured representation of a case (we call it a CaseIntent) and evaluates it against an encoded rule pack to produce a compliance assessment with a full audit trail. Everything below is about how that rule pack is structured.
Why rules are hard to encode
Before getting to the solution, it is worth being specific about what makes rule encoding difficult. Most first attempts at building a rule engine fail not because the engineering is hard but because the modeling is underspecified. Engineers reach for an abstraction (a dictionary, a class, a JSON schema) that feels flexible but does not capture the actual structure of expert rules.
Expert rules have a specific shape that is easy to miss if you have not looked closely at a lot of them. Consider a real example from spine surgery prior authorization, paraphrased from an actual payer policy:
"For lumbar fusion to be covered, the patient must have documented evidence of at least 6 weeks of conservative therapy within the past 12 months, with at least two of the following: physical therapy, NSAID trial, activity modification, or epidural steroid injection. Imaging must demonstrate pathology consistent with the proposed surgical level, dated within 6 months of the request. BMI must be less than 40 documented within 90 days of the request. Tobacco use status must be documented, and active smokers require counseling documentation or a cessation attempt."
This is one criterion from one payer for one procedure. There are many things going on in it:
- A threshold (6 weeks minimum duration)
- A recency constraint (within the past 12 months)
- An enumerated set with a minimum count (at least two of the following four modalities)
- A cross-reference to another data element (imaging must match the surgical level)
- Multiple independent recency constraints (6 months for imaging, 90 days for BMI)
- A conditional requirement (active smokers require additional documentation)
- A source qualifier (the documentation must come from acceptable source types)
A first-pass attempt to encode this as a function produces something like the following, which is a mess that will be unmaintainable within three payers:
from datetime import datetime

def check_lumbar_fusion_criteria(patient):
    if patient.conservative_therapy_weeks < 6:
        return False
    if (datetime.now() - patient.therapy_date).days > 365:
        return False
    modalities_count = 0
    if patient.had_pt: modalities_count += 1
    if patient.had_nsaid: modalities_count += 1
    if patient.had_activity_mod: modalities_count += 1
    if patient.had_esi: modalities_count += 1
    if modalities_count < 2:
        return False
    # ... continues for 40 more lines
    # ... then copy and modify for the next payer
    # ... then copy and modify for the next procedure
    # ... then try to update all of them when the policy changes

This code works for one rule, but it cannot be tested systematically, cannot be versioned independently of the engine code, cannot be authored by non-engineers, and cannot be reasoned about across the full rule set to detect overlaps or contradictions. After 10 payers it is unmaintainable. After 100 payers it has probably been thrown out and rewritten three times.
The Criterion object
A Criterion object is a data structure that represents a single rule in a consistent schema. Every rule in our system, across every domain, uses the same structure. The engine evaluates Criterion objects uniformly. New rules are added by authoring new Criterion instances, not by writing new code.
The core structure is an identifier plus five fields:
from dataclasses import dataclass

@dataclass
class Criterion:
    criterion_id: str                     # "lumbar_fusion.conservative_therapy_duration"
    data_element_ref: str                 # "patient.prior_treatments.duration_weeks"
    comparison_operator: str              # "greater_than_or_equal"
    threshold: Threshold                  # Threshold(value=6, unit="weeks")
    recency_constraint: Recency           # Recency(max_age_days=365)
    evidence_source_qualifier: list[str]  # ["physical_therapy_note", "physician_note"]

This is the simple case. A criterion with this shape says: look at this field in the case intent, compare it to this threshold using this operator, verify the supporting documentation is within this time window, and confirm the evidence came from an acceptable source type. The engine evaluates criteria of this shape deterministically: given a case intent and a criterion, the satisfaction status is computed purely from the data.
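To make that concrete, here is a minimal sketch of what checking one criterion can look like. It is illustrative rather than our production engine: the case intent is a plain nested dict, statuses are strings, and compare is the operator dispatch sketched under the comparison operator discussion below.

from datetime import date

def resolve_path(intent: dict, path: str):
    # Walk a dotted path like "prior_treatments.total_duration_weeks".
    node = intent
    for part in path.split("."):
        node = node[part]
    return node

def evaluate_criterion(criterion, intent: dict, request_date: date) -> str:
    # Each resolved element is assumed to carry its value plus provenance.
    element = resolve_path(intent, criterion.data_element_ref)
    if element["source_type"] not in criterion.evidence_source_qualifier:
        return "unsatisfied_source"
    age_days = (request_date - element["document_date"]).days
    if age_days > criterion.recency_constraint.max_age_days:
        return "unsatisfied_recency"
    if compare(criterion.comparison_operator, element["value"], criterion.threshold):
        return "satisfied"
    return "unsatisfied_value"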
The five fields were not chosen arbitrarily. Each one represents a distinct dimension of what makes a rule testable:
Data element reference
What field in the case intent does this criterion examine? The reference is a path expression into the structured intent, and it must resolve to a field that the interpretation stage is designed to populate. If a criterion references a field the intent does not carry, the system flags it as a modeling error at rule-pack load time, not at evaluation time. This is a deliberate choice. Broken rule references should fail loudly and early, not silently and late.
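In sketch form, the load-time check is a pass over every criterion against the set of paths the intent schema actually defines. The paths below are illustrative, not the real registry:

KNOWN_INTENT_PATHS = {
    "prior_treatments.total_duration_weeks",
    "patient.bmi",
    "patient.tobacco_use_status",
}

def validate_references(criteria) -> list[str]:
    # One error per dangling reference; a non-empty list aborts the load.
    return [
        f"{c.criterion_id}: unknown data element {c.data_element_ref!r}"
        for c in criteria
        if c.data_element_ref not in KNOWN_INTENT_PATHS
    ]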
Comparison operator
How should the extracted value be compared to the threshold? We support a specific vocabulary of operators: equals, not_equals, greater_than, less_than, greater_than_or_equal, less_than_or_equal, within_range, contains, matches_ontology, within_time_window, is_present. These are the operators that actually appear in practice across the rule sets we have encoded. We have resisted adding operators beyond this set, because every new operator adds complexity to the engine and makes it harder to reason about rule-set behavior in aggregate. When a rule seems to need a new operator, 9 times out of 10 it can be restructured to use one of the existing ones.
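Dispatching on the operator string is straightforward. A table-driven sketch, assuming the Threshold fields described in the next section, might look like this:

OPERATORS = {
    "equals": lambda value, t: value == t.value,
    "greater_than_or_equal": lambda value, t: value >= t.value,
    "less_than": lambda value, t: value < t.value,
    "within_range": lambda value, t: t.minimum <= value <= t.maximum,
    "is_present": lambda value, t: value is not None,
    # ... the remaining operators follow the same pattern
}

def compare(operator: str, value, threshold) -> bool:
    try:
        return OPERATORS[operator](value, threshold)
    except KeyError:
        raise ValueError(f"unknown comparison operator: {operator!r}")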
Threshold
The value against which the data element is compared. Threshold is itself a small structured type, because a threshold is often more than a single scalar. A weight threshold is a numeric value with a unit. An ontology threshold is a set of acceptable concept codes. A range threshold has a minimum and maximum. Representing thresholds as their own structured type lets the engine handle unit conversion, ontology lookups, and range semantics consistently.
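One plausible shape for that type, covering the scalar, ontology, and range cases described above (the production definition may differ in detail):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    value: Optional[float] = None           # scalar comparisons, e.g. 6
    unit: Optional[str] = None              # e.g. "weeks", "kg/m2"
    concept_codes: frozenset = frozenset()  # acceptable ontology codes
    minimum: Optional[float] = None         # range comparisons
    maximum: Optional[float] = None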
Recency constraint
How old can the supporting documentation be and still count? Nearly every real-world rule has a recency requirement, and encoding it as a first-class field rather than as an ad-hoc check in the operator logic makes recency uniformly testable. The engine computes document age relative to the workflow request date, not the current wall-clock time, so that historical evaluations are reproducible.
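Reconstructing Recency from its usage above (Recency(max_age_days=365)), the check is small, and the request date is an explicit argument precisely so that historical evaluations are reproducible:

from dataclasses import dataclass
from datetime import date

@dataclass
class Recency:
    max_age_days: int

def is_recent(document_date: date, request_date: date, recency: Recency) -> bool:
    # Age is measured against the workflow request date, never the
    # current wall clock, so re-running an old case gives the same answer.
    return (request_date - document_date).days <= recency.max_age_days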
Evidence source qualifier
What documentation source types are acceptable for satisfying this criterion? A payer may require that BMI come from a recent office visit note, not from patient self-report. A regulatory compliance criterion may require that financial records come from an audited source, not a spreadsheet. The source qualifier is how the rule encodes that constraint, and it links each criterion evaluation to the specific document type that satisfied it, which is essential for the audit trail.
Composition: groups and logic
A single criterion is rarely the full picture. Real policies require combinations: all of these criteria must be satisfied (AND), any of these will do (OR), these are disqualifying (NOT), these apply only under certain conditions (conditional inclusion). Encoding these combinations naively inside a criterion leads to the same explosion of special cases that the atomic criterion structure was designed to avoid.
We handle composition with a separate layer on top of the criterion structure, called an Indication Criteria Group:
from typing import Optional

@dataclass
class IndicationCriteriaGroup:
    group_id: str
    operator: str                              # "ALL", "ANY", "AT_LEAST_N", "NONE"
    criteria: list["Criterion | IndicationCriteriaGroup"]
    n: Optional[int] = None                    # Used when operator is AT_LEAST_N
    applies_when: Optional[Criterion] = None   # Precondition for this group to apply

A group is a Boolean combinator. The ALL operator requires every member criterion to be satisfied. ANY requires at least one. AT_LEAST_N requires a specified count. NONE requires all members to be unsatisfied (for exclusion rules). The members of a group can themselves be groups, so arbitrary nesting is possible, but in practice most real-world rule sets are shallow, rarely more than three levels deep.
The applies_when field is how we handle conditional rules. The earlier example of "active smokers require cessation counseling" is encoded as a group with an applies_when precondition that checks smoking status. If the precondition is false, the group is skipped. If true, the group evaluates normally. This pattern is how we avoid bloating the criterion structure itself with condition logic.
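Putting the combinators and the precondition together, group evaluation is a short recursive function. This sketch builds on the evaluate_criterion sketch from earlier and treats a skipped group as vacuously satisfied:

def satisfied(criterion, intent, request_date) -> bool:
    return evaluate_criterion(criterion, intent, request_date) == "satisfied"

def evaluate_group(group, intent, request_date) -> bool:
    # A false precondition skips the group entirely, so it cannot fail.
    if group.applies_when is not None and not satisfied(group.applies_when, intent, request_date):
        return True
    results = [
        evaluate_group(member, intent, request_date)
        if isinstance(member, IndicationCriteriaGroup)
        else satisfied(member, intent, request_date)
        for member in group.criteria
    ]
    if group.operator == "ALL":
        return all(results)
    if group.operator == "ANY":
        return any(results)
    if group.operator == "AT_LEAST_N":
        return sum(results) >= group.n
    if group.operator == "NONE":
        return not any(results)
    raise ValueError(f"unknown group operator: {group.operator!r}")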
A full rule set for a policy is a tree of groups and criteria, organized by clinical indication or business scenario. Encoding the lumbar fusion example as this tree produces something readable:
lumbar_fusion_group = IndicationCriteriaGroup(
    group_id="lumbar_fusion_v3",
    operator="ALL",
    criteria=[
        # Conservative therapy requirements
        IndicationCriteriaGroup(
            group_id="conservative_therapy",
            operator="ALL",
            criteria=[
                Criterion(
                    criterion_id="conservative_duration",
                    data_element_ref="prior_treatments.total_duration_weeks",
                    comparison_operator="greater_than_or_equal",
                    threshold=Threshold(value=6, unit="weeks"),
                    recency_constraint=Recency(max_age_days=365),
                    evidence_source_qualifier=["physical_therapy_note", "physician_note"]
                ),
                IndicationCriteriaGroup(
                    group_id="modality_requirement",
                    operator="AT_LEAST_N",
                    n=2,
                    criteria=[
                        pt_criterion, nsaid_criterion,
                        activity_mod_criterion, esi_criterion
                    ]
                )
            ]
        ),
        # Imaging requirements
        imaging_requirement_group,
        # Patient factor requirements
        bmi_requirement,
        tobacco_requirement
    ]
)

This tree is serializable. It can be stored as JSON or YAML, versioned in a repository, diffed across versions, code-reviewed by a clinician who is not an engineer, and loaded by the engine at runtime. That is the payoff of the structure: rules become data, not code, and everything that comes with data (version control, review, testing, distribution) becomes possible.
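As a sketch of that payoff, the tree above serializes in one call. In practice, dates and other rich types need explicit encoders; default=str is a shortcut here:

import json
from dataclasses import asdict

serialized = json.dumps(asdict(lumbar_fusion_group), indent=2, default=str)
restored = json.loads(serialized)  # plain dicts, ready to store, diff, or review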
Why this specific schema
The schema above is the result of about 18 months of iteration. We did not design it in a whiteboard session; we arrived at it by encoding real payer policies, hitting structural limits, refactoring, and trying again. A few specific choices deserve explanation because they are not obvious.
Criterion identifiers are hierarchical strings, not UUIDs
Every criterion has a hierarchical ID like "lumbar_fusion.conservative_therapy.duration" rather than a random UUID. This seems like a small choice. It is not. Hierarchical IDs make audit traces readable, make rule-pack diffs semantically meaningful, and let new engineers reason about the rule set by reading the IDs. UUIDs force you to build a separate layer of documentation to map them to human-understandable names, and that layer always drifts out of sync with the actual rules.
Data element references are paths, not identifiers
A reference like "prior_treatments.total_duration_weeks" resolves to a specific field in the case intent. We considered using numeric identifiers for the fields instead, with a separate mapping, but rejected it for the same reason we use hierarchical criterion IDs. Debuggability matters. When a criterion fails to evaluate, the error message should say what field it was looking for, not "field ID 47382 not found."
The comparison operator is a string, not a function reference
We considered storing the comparison as a reference to a function or lambda. We chose to store it as a string from a fixed vocabulary. The engine dispatches on the string to the appropriate internal comparison logic. This means rule packs can be serialized and deserialized as plain data without any code references, which is essential for storing them, transmitting them, and loading them dynamically. If the engine needs to evolve a comparison operator's behavior, we version the operator. We have done this exactly twice in production; the mechanism has held up.
Recency is separated from threshold
An obvious first design would fold recency into the threshold object (threshold = "6 weeks, within the past 12 months"). We considered this and rejected it. Recency is about when the evidence was produced; threshold is about what the evidence says. They are orthogonal, and entangling them makes rule authoring harder and rule evaluation subtly buggy (because recency checks apply even to criteria that have no numeric threshold, like "is documented" criteria).
Evidence source qualifier is a list, not a single value
Many criteria can be satisfied by more than one document type. "Tobacco use documented" can come from a physician note, a pre-operative history form, or a nursing assessment. Representing the qualifier as a list lets the criterion match whichever acceptable source happens to be present, without having to enumerate rules for each variation.
Versioning rule packs
Real rule sets change. Payers update their policies. Regulations evolve. A bug is found in an existing rule and needs to be fixed. Without versioning, every change invalidates the audit trail of prior decisions, which is unacceptable.
We version rule packs as first-class entities. A rule pack has an identifier, a version number, an effective date range, and a content hash. Every decision is stamped with the exact rule-pack version that produced it. When the rule pack is updated, the new version has a new effective date. A historical decision made under a prior version remains reproducible by loading that version from the repository and re-running the evaluation.
This capability, historical reproducibility, is the property that most distinguishes this architecture from one where rules live inside application code. If the rules are code, reproducing an old decision requires finding and running the old code. If the rules are data stored in a versioned repository, reproducing an old decision is a configuration lookup.
@dataclass
class PolicyRuleSet:
    pack_id: str                           # "cigna_lumbar_fusion"
    version: str                           # "3.2.1"
    effective_date: date                   # when this version became active
    expiration_date: Optional[date]
    source_document: str                   # URL or reference to authoritative source
    groups: list[IndicationCriteriaGroup]
    content_hash: str                      # SHA-256 of serialized content

The content hash is computed from a canonical serialization of the rule pack. If two rule packs have the same hash, they are semantically identical. This hash is included in the audit trail of every decision, so an auditor can verify that the rule pack referenced was actually the one used (and not modified after the fact).
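A sketch of that computation: sorting keys and fixing separators makes the serialization canonical, so the same pack content always hashes to the same value.

import hashlib
import json

def compute_content_hash(pack_as_dict: dict) -> str:
    # Canonical form: deterministic key order and separators.
    canonical = json.dumps(pack_as_dict, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()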
Testing rule packs
A rule set is a program. It can be tested. This is another major benefit of the structured approach: a rule pack, being data, can be accompanied by a companion test suite that is also data.
For each rule pack we maintain a test corpus: a set of case intent examples with expected evaluation outcomes. The test corpus is part of the rule pack's versioned artifact. When the rule pack is updated, the test corpus must still pass; if it does not, either the test corpus needs to be updated (because the update was intentional) or the rule pack change has a bug.
@dataclass
class CriterionTestCase:
    test_id: str
    case_intent: CaseIntent
    expected_result: dict   # {criterion_id: status}
    description: str        # Human-readable explanation of what's being tested

A well-designed test corpus covers not just the happy path but the edge cases that the rule set is specifically designed to handle. If a rule says "patient must have tried at least 2 of these 4 modalities," the test corpus should include a case with 1 modality (should fail), a case with 2 (should pass), a case with 3 (should pass), and a case with an unusual modality combination that previously confused the system. The discipline of maintaining a test corpus this way turns rule pack authoring from intuition-based work into empirical work.
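A corpus runner is correspondingly small. This sketch assumes a hypothetical evaluate_pack(pack, case_intent) that returns a {criterion_id: status} dict; that name is illustrative, not our public API:

def run_corpus(pack, corpus) -> list[str]:
    failures = []
    for test in corpus:
        actual = evaluate_pack(pack, test.case_intent)
        for criterion_id, expected in test.expected_result.items():
            got = actual.get(criterion_id)
            if got != expected:
                failures.append(
                    f"{test.test_id}: {criterion_id} expected {expected}, got {got}"
                )
    return failures  # an empty list is a green suite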
At Avectic, our flagship rule pack (spine surgery prior authorization) currently has 381 tests across multiple test suites. When a rule changes, the suite runs. A green suite means the change did not break any encoded expectation. A red suite means the change broke something, and the change is not merged until the break is understood.
Common mistakes when encoding rules
Teams that adopt this pattern often make the same three mistakes in their first few rule packs. It is worth naming them so you can avoid them.
Mistake 1: Criteria that do multiple things
The temptation to merge multiple logical checks into a single criterion is strong. A criterion that checks "BMI less than 40 and documented within 90 days" looks efficient. It is also untestable at the level of the individual check. Split it. Two criteria in an ALL group: one for the BMI threshold, one for the recency. The audit trail becomes more informative, the tests become more granular, and the rule pack becomes diff-able at a meaningful level.
Mistake 2: Conditional logic smuggled into thresholds
The threshold field is for values, not for conditions. A threshold that varies based on patient age is not a single threshold; it is multiple criteria gated by applies_when preconditions. Smuggling condition logic into threshold objects makes the criterion untestable without running the full case, which defeats the purpose of the structure.
Mistake 3: Missing source qualifiers
Early rule packs often omit the evidence source qualifier because the authors assume any documentation is acceptable. This assumption is almost always wrong in regulated domains. Payers, regulators, and auditors care which document type provided the evidence. An operative note claiming a lab result is not equivalent to a lab report showing the same result. Encode the source qualifier from the beginning, even if the initial value is permissive. Adding it later requires re-evaluating the entire test corpus, which is expensive.
What this structure enables
The value of encoding rules this way compounds. The first rule pack is slow. By the tenth, the patterns are established. By the hundredth, new rule packs can be authored by domain experts working with engineers in a review pattern that looks more like content authoring than software development.
Specifically, the structure enables the following capabilities that code-based rule implementations cannot deliver:
Non-engineers can author rules. A clinician who knows spine surgery policy can learn this structure in a few hours and begin authoring criteria directly. They write the JSON (or YAML, or whatever serialization we use). An engineer reviews for schema correctness. The engineer is in a review role, not an authorship role, which scales dramatically better.
Rules can be reviewed diff-style. A payer updates their policy. The new version of the rule pack is diffed against the prior version. The diff is semantically meaningful: "the BMI threshold changed from 40 to 45," not "line 137 in policy_engine.py changed from if x > 40 to if x > 45." The diff can be reviewed by a clinical compliance reviewer, not just an engineer.
Rules can be tested without running the engine. A test case is just a case intent plus an expected outcome. Running the test corpus is faster than running integration tests, which means we can run the full corpus on every commit and catch regressions immediately.
Rules are portable across domains. The spine surgery rule pack and the logistics exception rule pack share no domain knowledge, but they share the same schema. The engine evaluating them is the same code. A new domain (insurance underwriting, regulatory compliance, government benefits) fits into the same structure, just with a different rule pack.
This portability is the most important property for the long-term architecture, and the one that most rewards the discipline of using a consistent structure. Every rule you encode correctly in this structure is an asset the whole platform can leverage, not a commitment to a specific vertical.
Closing
Rule encoding looks like a small architectural decision. It is not. It is the foundation the entire deterministic decision engine rests on. A rule pack encoded well is testable, auditable, versionable, and portable. A rule pack encoded as code entangled with application logic is none of these things.
The Criterion object structure we use at Avectic is one specific answer to this problem. It is not the only possible answer, and teams in other domains may find variations that work better for their specific rule shapes. But the principles the structure encodes are general: separating data element references from comparison operators, making recency first-class, treating rule packs as versioned data, and maintaining a test corpus as a peer to the rule pack. Any team building a deterministic decision engine will, in my view, converge on something close to this structure, and the sooner the convergence happens, the less painful the journey is.
If you are starting a project like this, spend a week on the schema before you write any rules. The schema is the thing you will regret most if you get it wrong, and the thing you will compound the most if you get it right.