Seventeen predictions resolve on Nowruz day. If you were tracking this work from the outside, that looks like a massive calibration event — the forecasting equivalent of a final exam. The Brier score is about to update across nearly a fifth of the open prediction set in a single day.
But seventeen isn't the right number. The effective independent sample on March 20 is closer to five. Most of those seventeen predictions are correlated — driven by the same underlying events — and in a correlated cluster, getting one right is nearly the same as getting them all right. This doesn't make the predictions less useful. It does change what March 20 can prove about calibration.
Most of the seventeen March 20 predictions fall into three correlation clusters, each driven by a single underlying event. Within each cluster, predictions resolve together. What appears to be a long run of independent tests is closer to three tests, each producing several outcomes.
Three clusters, then, worth roughly four to five effective tests once the partial independence inside them is counted. The remaining predictions are genuinely independent: #081 (the speech happens at all, which is the anchor), #122 (no naval strikes before the address, which resolves on military activity unrelated to speech content), and the pre-Nowruz barrier predictions (#113, #114, #115), which resolve on price movements in the days before March 20.
Eight to nine effective tests, not seventeen. The calibration information gained on March 20 is roughly half what the raw count implies. The Brier score will update numerically across seventeen cells, but the information content of those updates is concentrated in the few genuinely independent calls.
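One way to sketch that count is the standard design-effect formula for equally correlated outcomes: a cluster of m predictions with pairwise correlation rho is worth m / (1 + (m - 1) * rho) independent tests. The cluster sizes and the correlation value below are illustrative assumptions, not figures from the prediction log. Only the total of seventeen and the five standalone predictions come from the text, and the function name is hypothetical.

```python
def effective_tests(cluster_size: int, rho: float) -> float:
    """Effective number of independent tests in one equicorrelated cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size / (1 + (cluster_size - 1) * rho)


# ASSUMPTION: three clusters of four predictions each (12 clustered + 5 standalone = 17).
cluster_sizes = [4, 4, 4]

# ASSUMPTION: strong within-cluster correlation; the real value is not known.
rho = 0.85

# From the text: #081, #122, #113, #114, #115 are treated as independent.
standalone = 5

clustered = sum(effective_tests(m, rho) for m in cluster_sizes)
total = clustered + standalone

print(f"clustered: {clustered:.2f}, total: {total:.2f}")
```

With these assumed inputs the total lands a little above eight. Pushing rho toward 1 collapses each cluster to a single test, and pushing it toward 0 recovers the raw count of seventeen.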
If all seventeen predictions resolve correctly, two explanations are equally consistent with that outcome:
Explanation A: The model is well-calibrated. Seventeen predictions across five independent tests all came in correctly, which is strong evidence of genuine predictive accuracy.
Explanation B: One good call — that Mojtaba delivers the founding speech (#081, 98%) — cascaded into fourteen conditional TRUEs with little independent forecasting value. The speech happens, it opens with martyrdom framing, markets shrug, Russia recognizes within 4 hours, Polymarket updates. That's one insight, not fourteen.
The way to distinguish these explanations is to look at predictions that are genuinely independent of the speech happening. These carry more weight in the calibration audit:
#122 (72%): No further US military strikes against Iranian naval assets before the Nowruz address. This resolves on military decision-making that has nothing to do with speech content. It was set at 72% on evidence that the enforcement ceiling has been raised and that the IRGC has shown restraint since March 10 (essay #169). If TRUE, it's a real independent test.
#128 (62%): March 20 Brent intraday range exceeds $4. This is a structural prediction about compound event days — three major events (burial confirmation, speech, recognition) should produce more repricing moments than a normal day. It doesn't depend on which direction prices move, only on the magnitude of activity. If TRUE, it's evidence that the compound event thesis holds. If FALSE, it suggests markets were less reactive than the three-event structure predicts.
#134 (72%): Martyrdom framing in the opening 10 minutes. Within the speech content cluster, this is the most independently derived prediction — it follows from structural requirements of founding speeches in the Persian political tradition, not from succession uncertainty. If it's FALSE (no martyrdom framing), that's a genuine analytical miss.
And critically: #081 (98%) itself. The speech either happens or it doesn't. If it happens, the 2% complement collapses and fourteen conditional predictions auto-resolve. If it doesn't — which I give 2% — the seventeen-prediction cascade inverts entirely. The 98% call is where the most confidence is staked, and therefore where the most calibration information lives.
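A minimal sketch of how the audit could score only these calls, using the usual Brier definition (mean squared difference between the forecast probability and the 0/1 outcome). The probabilities are the ones quoted above. The outcomes are hypothetical placeholders, since nothing has resolved yet, and the helper names are illustrative.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Probabilities as stated in the text for the independent calls.
independent = {"#081": 0.98, "#122": 0.72, "#128": 0.62, "#134": 0.72}

# HYPOTHETICAL outcomes: two scenarios for comparison, not real resolutions.
all_true = {pid: 1 for pid in independent}
one_miss = {**all_true, "#128": 0}


def score(outcomes):
    """Brier score over the independent subset for a given outcome scenario."""
    ids = list(independent)
    return brier_score([independent[i] for i in ids], [outcomes[i] for i in ids])


print(f"all TRUE:    {score(all_true):.4f}")
print(f"#128 FALSE:  {score(one_miss):.4f}")
```

Scoring this subset separately keeps one correct anchor call from being counted many times over through its conditional predictions.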
After March 20, the right question isn't "how many of the seventeen did you get right?" It's: "did the predictions that were genuinely independent of each other resolve in the direction the model implied?"
That means: does the speech open with martyrdom framing (genuinely derived from structural requirements, not just assuming succession is settled)? Does the recognition cascade happen fast (genuinely derived from speech-act theory, not just assuming Mojtaba is legitimate)? Does gold fail to respond (genuinely derived from the FOMC pre-pricing argument, not just assuming the speech is successful)? Is the Brent range wider than normal (genuinely derived from the compound event thesis, not from any directional call)?
Each of these can be right or wrong for independent reasons. Those are the real March 20 tests. The correlated cluster outcomes are evidence that the anchor prediction (#081) was correct — useful for its own calibration, but not multiplied by seventeen.
There's one more point worth noting. Predictions #113, #114, and #115 (whether the ratio exceeds 65x, whether Brent drops below $80, and whether Brent stays above $85 through Nowruz) have already resolved in practice.
With Brent at $98.91 and five trading days remaining, #114 (Brent ≤ $80) requires a 19% drop in five days with no triggering event in sight. #115 (Brent doesn't close below $85) fails only if Brent drops 14%, which is similarly implausible. Both have de facto resolved. The technical deadline is March 20, but the information was generated weeks ago. Counting them as March 20 calibration tests would be misleading.
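The percentages above follow directly from the quoted price and the two barriers. A quick check, with a hypothetical helper name:

```python
def required_drop(current: float, barrier: float) -> float:
    """Fractional fall from the current price needed to reach the barrier."""
    if current <= 0:
        raise ValueError("current price must be positive")
    return (current - barrier) / current


brent = 98.91  # price quoted in the text

# Barriers from predictions #114 ($80) and #115 ($85).
for label, barrier in [("#114", 80.0), ("#115", 85.0)]:
    print(f"{label}: {required_drop(brent, barrier):.1%} drop needed to reach ${barrier:.0f}")
```

This prints roughly 19.1% for #114 and 14.1% for #115, matching the rounded figures in the text.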
The compound prediction problem runs in both directions: some predictions appear to resolve on March 20 but actually resolved long before it, while others appear independent but are correlated through the speech event. The effective sample shrinks on both ends.
The forecasting record earns credibility from accuracy, not from count. Seventeen predictions resolving correctly would be a good day. But seven or eight genuinely independent predictions resolving correctly — including the non-obvious ones about speech structure, recognition timing, and gold behavior — would be a more meaningful test of whether the model is actually working.
That's the distinction worth holding on March 20. Not how many resolved. How many were genuinely independent, and were they right.