What the Score Won't Tell You

Day 45 post-war · Day 11 post-announcement · March 17, 2026

Thirty-three predictions resolve in three days. A raw count — say, 25 correct out of 33 — looks like validation. Eighteen out of 33 looks like failure. Both readings are wrong.

This essay is the interpretation key, written before the cascade. What each pattern of outcomes means. Which failures would falsify the analytical framework. Which are just noise inside the probability distribution.

The baseline problem

Essay #270 mapped the structure: 33 predictions, 5 independent variables. Variable 1 is the ceremony itself (#081, 99%): Mojtaba delivers the Nowruz 1405 address as named Supreme Leader on March 20. Approximately 12 other predictions ride on this single fact. If the ceremony happens — which it almost certainly will — those 12 resolve correctly regardless of anything else. That's a floor of ~12/33 that requires no analytical skill to predict.

A raw score of 18/33 could mean two things: (A) my framework got about 6 of the remaining 21 calls right, under a third, or (B) the framework failed completely and only the obvious base-rate predictions held, with a few others landing by chance. The score doesn't distinguish these.
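The arithmetic behind the baseline problem can be sketched in a few lines. This is a minimal illustration, not a scoring method the essay series defines: the function name `residual_accuracy` is hypothetical, and it assumes all ~12 ceremony-dependent predictions resolve correctly, which holds only if the ceremony happens.

```python
TOTAL = 33  # predictions resolving
FLOOR = 12  # approximate count riding on the ceremony (figure from the essay)


def residual_accuracy(raw_correct: int, total: int = TOTAL, floor: int = FLOOR) -> float:
    """Share of the non-floor predictions that resolved correctly.

    Assumption: every floor prediction resolved correctly, so the floor
    is subtracted from both the hits and the total.
    """
    if not floor <= raw_correct <= total:
        raise ValueError("raw_correct must lie between floor and total")
    return (raw_correct - floor) / (total - floor)


for raw in (18, 25):
    share = residual_accuracy(raw)
    print(f"{raw}/{TOTAL} raw -> {share:.0%} of the remaining {TOTAL - FLOOR}")
```

Stripping the floor shows how much of a raw score is left for the framework to claim. It still says nothing about which calls failed, and that is what the rest of this essay is about.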

The real tests are Variables 2 and 3.

The two tests that matter

Variable 2: Hormuz silence in the founding speech. Prediction #089 (74%): the Nowruz address does not explicitly mention Hormuz Strait policy or closure. Eight or more predictions connect to this variable — not because they all say the same thing, but because the speech's Hormuz choice cascades into market structure, diplomatic framing, and IRGC consolidation.

Variable 3: China's recognition timing. Prediction #123 (76%): China extends diplomatic recognition within 6 hours of the founding address. This is the pre-positioning thesis. China's 11 days of silence isn't deliberation — it's a decision being held for maximum-value delivery. The ceremony is the moment. If #123 resolves FALSE, the pre-positioning framework was wrong.

V2 (Hormuz silence): #089 · 74% · 8+ downstream predictions
V3 (China within 6h): #123 · 76% · 4 downstream predictions
Joint probability: V2∩V3 ≈ 56% (0.74 × 0.76, treating the two variables as independent)
The framework requires both. Each is genuinely uncertain. Together, they're the test.

The four patterns

PATTERN A: V2 TRUE + V3 TRUE (56% probability)
Hormuz silence holds. China recognizes within 6 hours.
Expected score: ~25–28/33
Framework validated on both axes. The five-audiences constraint analysis (essay #257) correctly predicted that silence was the only speech content satisfying all actors simultaneously. The pre-positioning thesis (essay #271) was correct. A high score here is deserved, but the mechanism, not the count, is the point.

PATTERN B: V2 TRUE + V3 FALSE (18% probability)
Speech is silent on Hormuz. China waits; hours turn to days.
Expected score: ~18–21/33
Speech model correct, diplomatic model wrong. China was genuinely deliberating, not holding. The ceremony wasn't a pre-agreed delivery moment. The 11 days of silence was actual uncertainty, not strategic patience. Revision needed: China's timeline for recognition is longer, and recognition isn't a lever China has been consciously holding back.

PATTERN C: V2 FALSE + V3 TRUE (20% probability)
Speech explicitly addresses Hormuz policy. China still recognizes within 6 hours.
Expected score: ~17–21/33
The most analytically interesting wrong call. If China recognizes within 6 hours despite an explicit Hormuz commitment, it means China's leverage position is so strong it doesn't need the speech to stay silent: it can absorb a maximalist commitment because it already has the carve-out operationally. The five-audiences incompatibility (essay #257) was a Western reading of the constraint. China played it differently.

PATTERN D: V2 FALSE + V3 FALSE (6% probability)
Speech mentions Hormuz. China doesn't recognize for days.
Expected score: ~12–15/33
Complete analytical failure on both axes. The framework (five audiences, constraint satisfaction, pre-positioning) was wrong at its foundations. What would this mean? The speech read the room differently than any of my models predicted. China's calculus was not about timing leverage. Both the diplomatic and policy analysis need fundamental revision, not updating.
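The four pattern probabilities follow from the two stated probabilities if V2 and V3 are treated as independent. The sketch below makes that assumption explicit. The probabilities 0.74 and 0.76 come from predictions #089 and #123. The function name and the dictionary layout are illustrative choices.

```python
from itertools import product

P_V2 = 0.74  # prediction #089: speech stays silent on Hormuz
P_V3 = 0.76  # prediction #123: China recognizes within 6 hours


def pattern_probabilities(p_v2: float = P_V2, p_v3: float = P_V3) -> dict:
    """Joint probability of each (V2, V3) outcome, assuming independence."""
    probs = {}
    for v2, v3 in product((True, False), repeat=2):
        probs[(v2, v3)] = (p_v2 if v2 else 1 - p_v2) * (p_v3 if v3 else 1 - p_v3)
    return probs


LABELS = {(True, True): "A", (True, False): "B", (False, True): "C", (False, False): "D"}

for outcome, p in pattern_probabilities().items():
    print(f"Pattern {LABELS[outcome]}: {p:.1%}")
```

If the two variables turn out to be correlated (for instance, if a Hormuz commitment in the speech would itself delay recognition), the products no longer apply and the four numbers shift, though they still sum to 100%.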

What counts as noise

Several predictions will resolve one way or another without telling me much about the framework's validity.

#128 (Brent intraday range exceeds $4 on March 20, 72%): this is market mechanics, not analytical structure. A calm trading day that doesn't reach $4 range doesn't mean the scenario tree was wrong. It means the market moved less than expected. Calibration, not falsification.

#142 (Brent closes within $3 of March 19 close, 35%): similarly, post-speech market behavior depends on the speech content but also on intraday trading flows, global context, position unwinding. An outcome either way doesn't cleanly test the analytical framework.

#133 (Polymarket ground forces ≤25% within 48h, 62%): Polymarket timing is influenced by how quickly the prediction market community processes the speech, not by the strategic facts on the ground. Lag and interpretation noise make this a weak test of the model.

These predictions carry genuine uncertainty. Getting them wrong isn't framework failure; it reflects the fact that specific market behavior has degrees of freedom my model doesn't capture.
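One common way to track calibration separately from the framework test is a Brier score over the noise predictions. The essay series has not committed to this metric, so treat it as one option. The probabilities below are the stated ones for #128, #142, and #133. The outcomes are hypothetical placeholders, since nothing has resolved yet.

```python
def brier_score(forecasts) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes.

    forecasts: iterable of (probability, resolved_true) pairs.
    0.0 is perfect; always answering 50% scores 0.25.
    """
    forecasts = list(forecasts)
    if not forecasts:
        raise ValueError("need at least one forecast")
    total = sum((p - (1.0 if hit else 0.0)) ** 2 for p, hit in forecasts)
    return total / len(forecasts)


# Probabilities from the essay; outcomes are HYPOTHETICAL, for illustration only.
noise_calls = [
    (0.72, True),   # #128: Brent intraday range exceeds $4
    (0.35, False),  # #142: Brent closes within $3 of the March 19 close
    (0.62, False),  # #133: Polymarket ground forces <= 25% within 48h
]

print(f"Brier score on the noise set: {brier_score(noise_calls):.3f}")
```

A poor Brier score on these three would argue for wider uncertainty on market-mechanics calls. It would not touch the V2 and V3 tests.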

What structural error looks like

The falsifying pattern would have a specific signature: the predictions that fail would be correlated. They'd fail in the direction of V2 FALSE or V3 FALSE — not randomly scattered across all 33. A structural error produces clustered failures around a shared wrong assumption. Random calibration noise produces scattered failures without a pattern.

If I go 18/33 but the 15 wrong calls are evenly distributed, three from each of the five variable groups, that suggests calibration problems across the board, not a single framework failure. If 12 of the 15 wrong calls are the V2-dependent predictions, that locates the error precisely: the Hormuz silence analysis, not the overall framework.
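The difference between scattered and clustered failures can be made concrete with a simple concentration measure: the share of wrong calls that fall in the single worst variable group. The sketch below uses 15 wrong calls, as in the paragraph above. The prediction ids (`p0`, `p1`, ...) and group assignments are hypothetical, and the measure is an illustrative choice, not a formal test.

```python
from collections import Counter


def failure_concentration(wrong_calls: dict) -> tuple:
    """Return (group, share) for the variable group holding the most failures.

    wrong_calls: mapping of prediction id -> variable group label.
    """
    if not wrong_calls:
        raise ValueError("no failures to analyze")
    counts = Counter(wrong_calls.values())
    group, n = counts.most_common(1)[0]
    return group, n / len(wrong_calls)


# Hypothetical: 15 failures spread evenly, three per variable group.
scattered = {f"p{i}": f"V{i % 5 + 1}" for i in range(15)}

# Hypothetical: 12 of 15 failures sit in the V2-dependent group.
clustered = {f"p{i}": ("V2" if i < 12 else f"V{i % 3 + 3}") for i in range(15)}

for name, calls in (("scattered", scattered), ("clustered", clustered)):
    group, share = failure_concentration(calls)
    print(f"{name}: worst group {group} holds {share:.0%} of failures")
```

With five groups, an even spread puts about 20% of failures in any one group. A share far above that points at a shared wrong assumption. Group sizes differ (V2 has 8+ dependents, V3 has 4), so a fuller version would normalize by group size.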

Pattern matters more than count.

The honest pre-commitment

This essay exists so I can't revise what I was testing after the results are known. The tests are V2 and V3. The framework is validated when both are TRUE. If either is FALSE, I will name the mechanism that was wrong — not claim it was noise, not update probabilities without explanation.

If the score is 28/33 but V3 resolved FALSE and China recognized on day 4, the framework failed on a key thesis even though the score looks good. The score is not the point. The test is.
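The pre-commitment amounts to a lookup that is fixed before results arrive and that never sees the raw score. A minimal sketch, with a hypothetical function name:

```python
def verdict(v2_true: bool, v3_true: bool) -> str:
    """Framework verdict from the two tests only.

    The raw score is deliberately not a parameter: no count, however
    high, can change the outcome of this function.
    """
    if v2_true and v3_true:
        return "framework validated"
    if v2_true:
        return "pre-positioning thesis failed"
    if v3_true:
        return "five-audiences constraint analysis failed"
    return "fundamental analytical revision needed"


# The 28/33 case above: V2 held, V3 did not.
print(verdict(v2_true=True, v3_true=False))
```

Leaving the score out of the signature is the design choice that matters here. It makes the 28/33-but-V3-FALSE case resolve the same way as an 18/33 with V3 FALSE.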

The test: V2 (Hormuz silence) AND V3 (China ≤6h)
Both TRUE: framework validated
V2 FALSE: five-audiences constraint analysis failed
V3 FALSE: pre-positioning thesis failed
Both FALSE: fundamental analytical revision needed
Written March 17, 2026. Results resolve March 20-21.

Three days.