Audited Calibration Under Regime Shift: A Computational Test of Support-Structured Broadcast
- InstANT

- Apr 1
This paper is available on arXiv: Audited Calibration Under Regime Shift as a Computational Test of Support-Structured Broadcast.
This was the first simulation paper in what has since become the support-sufficiency program. It was written as a companion to an earlier theoretical manuscript and predates the Constraint Geometry framework that now anchors the broader project. Even so, the core result holds up well and does real work in setting up what came after. This post describes what the paper tests, what it finds, and how it connects to the program as it stands now.
The Question
Here is a simple version of the problem. Suppose a system receives evidence from two channels and has to make a binary decision. One channel is reliable. The other sometimes degrades without warning. The system doesn't get to see the degradation directly. It just gets noisy observations and has to decide.
Standard approaches map the combined evidence into a confidence estimate and act on it. If the mapping assumes stable conditions, then when one channel degrades the system can remain confidently wrong. It doesn't know that its confidence is no longer tracking reality. And because it still feels confident, it doesn't seek additional information. It just commits.
The question this paper asks is whether giving the system access to a compact support summary, something as simple as a flag indicating the current reliability regime, can change that picture. Not by changing what the system believes at the content level, but by changing how it calibrates its confidence and what it does with that confidence downstream.
The Setup
The simulation uses a two-channel cue-integration task with a latent binary state. Channel A is stable. Channel B alternates between a good regime and a bad regime, where its observation noise increases substantially. The key design choice is that all model families share the same content decision rule. They compute the same log-likelihood ratio from the same evidence. They arrive at the same first-order decision. The only thing that differs is how they map that decision into a confidence estimate.
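As a rough illustration of this setup, here is a minimal sketch, not the paper's code. The Gaussian observation model, the noise levels, the block length, and the names `simulate_trials` and `shared_llr` are all assumptions made for the example. The shared rule assumes good-regime noise on both channels, which is one way to read "assumes stable conditions."

```python
import numpy as np

def simulate_trials(n, block_len=100, sigma_a=1.0, sigma_b_good=1.0,
                    sigma_b_bad=4.0, seed=0):
    """Draw n trials of a two-channel task with a latent binary state.

    All noise levels and the block length are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    state = rng.integers(0, 2, size=n)            # latent binary state (0 or 1)
    mu = 2.0 * state - 1.0                        # signal mean: -1 or +1
    bad = (np.arange(n) // block_len) % 2 == 1    # channel B alternates regimes in blocks
    sigma_b = np.where(bad, sigma_b_bad, sigma_b_good)
    x_a = mu + sigma_a * rng.standard_normal(n)   # stable channel
    x_b = mu + sigma_b * rng.standard_normal(n)   # channel that sometimes degrades
    return state, bad, x_a, x_b

def shared_llr(x_a, x_b, sigma_a=1.0, sigma_b=1.0):
    """Log-likelihood ratio for means +1 vs -1, assuming good-regime noise throughout."""
    return 2.0 * x_a / sigma_a**2 + 2.0 * x_b / sigma_b**2

state, bad, x_a, x_b = simulate_trials(20_000)
llr = shared_llr(x_a, x_b)
decision = (llr > 0).astype(int)                  # same first-order decision for every model family
correct = decision == state
print("accuracy, good regime:", correct[~bad].mean())
print("accuracy, bad regime: ", correct[bad].mean())
```

Because the decision depends only on the sign of the shared log-likelihood ratio, any model family built on top of it inherits the same accuracy. Only the confidence mapping is left to vary.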
Three model families are compared. The first is an uncalibrated baseline that maps evidence strength directly to confidence with no adjustment. The second is a globally calibrated baseline that fits a single temperature parameter across all conditions, which is essentially the standard post hoc calibration approach. The third is an auditor architecture that fits separate calibration parameters for the good and bad regimes, conditioned on a compact support summary.
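The three confidence mappings can be sketched as temperature-scaled sigmoids over the same log-likelihood ratio. This is an illustrative reconstruction under assumptions, not the paper's implementation: the sigmoid form, the grid-search fit, the toy noise levels, and the names `confidence` and `fit_temperature` are all hypothetical.

```python
import numpy as np

def confidence(llr, temperature):
    """Confidence in the chosen option: sigmoid(|LLR| / T)."""
    return 1.0 / (1.0 + np.exp(-np.abs(llr) / temperature))

def fit_temperature(llr, correct, grid=None):
    """Grid-search the temperature that minimizes negative log-likelihood of correctness."""
    if grid is None:
        grid = np.logspace(-1, 2, 200)            # candidate temperatures, 0.1 to 100
    correct = np.asarray(correct, dtype=float)
    nll = []
    for t in grid:
        p = np.clip(confidence(llr, t), 1e-9, 1 - 1e-9)
        nll.append(-np.mean(correct * np.log(p) + (1 - correct) * np.log(1 - p)))
    return float(grid[int(np.argmin(nll))])

# Toy data (assumed noise levels): channel B has sd 1 in the good regime, 4 in the bad one.
rng = np.random.default_rng(0)
n = 20_000
state = rng.integers(0, 2, size=n)
mu = 2.0 * state - 1.0
bad = rng.random(n) < 0.5                         # regime flag: the compact support summary
x_a = mu + rng.standard_normal(n)
x_b = mu + np.where(bad, 4.0, 1.0) * rng.standard_normal(n)
llr = 2.0 * x_a + 2.0 * x_b                       # shared content rule (assumes good-regime noise)
correct = (llr > 0).astype(int) == state

t_uncal = 1.0                                     # uncalibrated: no adjustment
t_global = fit_temperature(llr, correct)          # one temperature for all conditions
t_by_regime = {False: fit_temperature(llr[~bad], correct[~bad]),
               True: fit_temperature(llr[bad], correct[bad])}   # auditor: one per regime

conf_uncal = confidence(llr, t_uncal)
conf_global = confidence(llr, t_global)
conf_auditor = confidence(llr, np.where(bad, t_by_regime[True], t_by_regime[False]))
```

In this toy version the global temperature lands between the two regime-specific ones, so it is a compromise that fits neither regime well. Only the auditor mapping uses the regime flag.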
Confidence is then coupled to a simple control policy. If the system is confident enough, it acts immediately. If not, it requests one additional evidence sample before committing. This is important because it means calibration differences don't just live inside the system's head. They show up as different behavior.
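A minimal sketch of such a policy follows. It makes several assumptions of its own: a threshold of 0.8, a sigmoid confidence mapping with a temperature, and an extra sample that simply adds its log-likelihood ratio to the running total. The function name `act_or_sample` is hypothetical.

```python
import numpy as np

def act_or_sample(llr, extra_llr, temperature=1.0, threshold=0.8):
    """Act immediately if confident enough, otherwise fold in one extra sample first.

    Returns the final binary choice and a mask of trials that requested a sample.
    """
    llr = np.asarray(llr, dtype=float)
    conf = 1.0 / (1.0 + np.exp(-np.abs(llr) / temperature))
    sampled = conf < threshold                    # not confident enough: request evidence
    final_llr = np.where(sampled, llr + np.asarray(extra_llr, dtype=float), llr)
    choice = (final_llr > 0).astype(int)          # commit after at most one extra sample
    return choice, sampled

llr = np.array([3.0, 0.5, -0.5])
extra = np.array([-10.0, -2.0, 2.0])              # evidence that would arrive if requested

choice_t1, sampled_t1 = act_or_sample(llr, extra, temperature=1.0)
choice_t5, sampled_t5 = act_or_sample(llr, extra, temperature=5.0)
print(sampled_t1, choice_t1)
print(sampled_t5, choice_t5)
```

The evidence is identical in both calls. Raising the temperature lowers confidence on the first trial below the threshold, so that trial samples and ends up with a different choice. This is how a calibration difference turns into a behavioral one.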

What the Simulation Shows
The results separate cleanly. All three model families produce the same content-level accuracy, both overall and within the degraded regime, as they must given the shared decision rule. The differences show up entirely in calibration and control.
The auditor is far better calibrated under regime shift. In the bad-regime subset, it reduces expected calibration error by more than an order of magnitude relative to the best content-dominated baseline. Its reliability diagrams stay close to the diagonal precisely where the other models deviate most. When the system says it's 80% confident, it really is correct about 80% of the time, even in the degraded regime.
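For readers unfamiliar with the metric, expected calibration error is commonly computed by binning trials by confidence and averaging the gap between mean confidence and observed accuracy in each bin, weighted by bin size. The sketch below uses ten equal-width bins, which is an assumption. The paper's binning may differ.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1])          # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap              # mask.mean() is the fraction of trials in the bin
    return ece

# A system that says 80% and is right 4 times out of 5 has no calibration gap.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
# A system that says 90% and is always wrong has a large gap.
print(expected_calibration_error([0.9, 0.9], [0, 0]))
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so a low ECE corresponds to points lying near the diagonal.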
These calibration gains translate directly into different behavior. Under the act-versus-sample policy, the auditor requests additional evidence far more often in the degraded regime. It recognizes, through its support-conditioned confidence mapping, that confidence should be lower when support conditions are worse. The content-dominated models either under-sample because they're confidently wrong, or over-sample indiscriminately because their single global temperature can't distinguish regimes.
The paper also includes an online learning variant showing that the bad-regime calibration parameter can be adapted through stochastic gradient descent on outcome feedback. This matters because it demonstrates that the auditor doesn't need the regime label handed to it as a permanent fixture. It can learn the appropriate confidence mapping from its own prediction-outcome history. That's the audit trail.
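One way such an online update could look is sketched below, under assumptions. The temperature is parameterized as exp(theta) to keep it positive, the loss is the log loss on whether the decision was correct, and the learning rate and toy noise levels are arbitrary. The paper's exact update rule is not reproduced here, and `sgd_temperature` is a hypothetical name.

```python
import numpy as np

def sgd_temperature(llr, correct, t_init=1.0, lr=0.01):
    """Adapt a temperature online, one trial at a time, from outcome feedback.

    With z = |LLR| / T and p = sigmoid(z), the log-loss gradient with respect to
    theta = log T is (correct - p) * z, so overconfidence pushes T upward.
    """
    theta = np.log(t_init)
    for l, c in zip(np.abs(llr), correct):
        z = l * np.exp(-theta)
        p = 1.0 / (1.0 + np.exp(-z))              # current confidence in the chosen option
        theta -= lr * (float(c) - p) * z          # gradient step on this trial's log loss
    return float(np.exp(theta))

# Toy bad-regime trials (assumed noise levels): channel B has sd 4, rule assumes sd 1.
rng = np.random.default_rng(0)
n = 20_000
state = rng.integers(0, 2, size=n)
mu = 2.0 * state - 1.0
x_a = mu + rng.standard_normal(n)
x_b = mu + 4.0 * rng.standard_normal(n)
llr = 2.0 * x_a + 2.0 * x_b
correct = (llr > 0).astype(int) == state

t_bad = sgd_temperature(llr, correct)
print("learned bad-regime temperature:", t_bad)
```

Starting from the uncalibrated value of 1, the temperature drifts upward as confidently wrong trials accumulate. No refit over the full dataset is needed, only the running record of predictions and outcomes.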
What This Established for the Program
This paper made a specific architectural claim concrete. When you hold content inference fixed and vary only whether a compact support summary is available for calibration and control, you get measurable behavioral differences. The system doesn't just "feel" differently about its decisions. It acts differently. It seeks information differently. And those differences are concentrated exactly where they should be, in the regime where support conditions have degraded.
That result became a building block. The Constraint Geometry paper took the underlying concern and developed it into a full theoretical framework, introducing the three-regime hierarchy of arbitration, the four-type uncertainty typology, and a set of dissociation predictions. The more recent consequence-sensitivity paper then asked the next question: if support must survive compression, how much of it should survive, and when should that change?
But this earlier simulation already contained the seed of both later developments. The auditor architecture is a simple version of what the consequence-sensitivity paper calls arbitration memory. The regime-conditioned calibration mapping is a simple version of support-conditioned policy. And the finding that content-dominated systems can remain confidently wrong under shift is exactly the under-resolution failure mode that the later paper formalizes.
Limitations
The paper is deliberately minimal, and it says so. The support summary is a binary regime flag, not the richer continuous structure that real systems would need. The calibration mechanism is a single temperature parameter, not a distributed mapping. The control policy is act-or-sample, not the full verify/abstain/defer space that the later papers develop.
These simplifications were the right call for a first computational test. The point was not to build a complete system. It was to isolate one mechanism and show that it produces the predicted dissociation. That it did.
Where It Sits Now
This paper remains the most direct empirical anchor point in the program. The Constraint Geometry paper is theoretical. The consequence-sensitivity paper includes a richer simulation but tests a different set of claims about dynamic resolution control. This paper is the one that shows, in the most stripped-down setting possible, that support-aware calibration and content-level accuracy can come apart, and that the difference is behaviorally real.
For readers coming to the program through the later work, this is useful context. The intuition that support structure matters beyond content and confidence didn't start as an abstract claim. It started with a simulation showing that a system can be right about what it believes and still wrong about how much to trust that belief, and that the difference changes what it does next.
The paper is available at https://arxiv.org/abs/2602.23382.


