A.I.’s math gold and the forecasting blind spot
In July 2025, A.I. systems from Google DeepMind and OpenAI hit a milestone that most professional forecasters said was unlikely this decade: gold-medal-level performance at the International Mathematical Olympiad (IMO). In 2022, participants in the Forecasting Research Institute’s Existential Risk Persuasion Tournament (XPT) put the chance of this happening by 2025 at just 8.6 per cent (experts) and 2.3 per cent (superforecasters); their median expected dates fell after 2030. The new XPT accuracy report, released 2 September 2025, documents the miss clearly. Source: Forecasting Research Institute
What actually happened matters. An advanced Gemini “Deep Think” model was officially graded by the IMO and achieved 35 of 42 points—enough for gold—while OpenAI’s experimental model recorded a comparable gold-level score under IMO conditions, graded by independent medalists (OpenAI did not formally enter). Multiple outlets, as well as Nature’s news coverage, corroborate the milestone and the certification nuance. Source: CBS News | Axios | Reuters | Nature
The new FRI analysis is worth reading beyond the headline surprise. It evaluates 38 near-term questions that were resolved by mid-2025 and finds that, on average, superforecasters and domain experts perform similarly; both groups significantly underestimated A.I. progress and overestimated climate-tech advances. Aggregated forecasts were materially better than individual ones (wisdom-of-crowds), but simple statistical baselines did almost as well overall—except in dynamic domains like A.I., where the baselines fared notably worse. In short: broad calibration looked fine; the tails—in particular, A.I. capability jumps—were mispriced.
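The wisdom-of-crowds effect the report documents can be sketched numerically. The figures below are invented for illustration (they are not FRI’s data): the Brier score, the standard squared-error accuracy measure for probability forecasts, is computed for five hypothetical forecasters and for their mean-aggregated forecast. By Jensen’s inequality, the aggregate can never score worse than the average of its members.

```python
# Illustrative sketch with made-up numbers (not FRI's dataset): comparing
# individual Brier scores with the Brier score of a mean-aggregated forecast.

def brier(p: float, outcome: int) -> float:
    """Squared error of a probability forecast against a 0/1 outcome."""
    return (p - outcome) ** 2

# Hypothetical probabilities five forecasters assigned to an event that
# resolved YES (outcome = 1), e.g. an A.I. capability milestone.
forecasts = [0.05, 0.10, 0.60, 0.20, 0.05]
outcome = 1

mean_individual = sum(brier(p, outcome) for p in forecasts) / len(forecasts)
crowd = sum(forecasts) / len(forecasts)   # simple mean aggregation
crowd_score = brier(crowd, outcome)

print(f"average individual Brier: {mean_individual:.3f}")  # 0.683
print(f"aggregated Brier: {crowd_score:.3f}")              # 0.640 (lower = better)
```

The aggregate wins here because the lone forecaster who priced the event near 60 per cent pulls the crowd mean toward the realised outcome; the same mechanism explains why the report finds aggregated forecasts materially better than individual ones.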
For leaders allocating capital or crafting policy, the lesson is not that forecasting is futile. It is that the model of the world embedded in many forecasts still assumes continuity, where discontinuities are now common. Three structural reasons stand out in the record:
Training data moved faster than forecasters. XPT wrapped before ChatGPT’s release in November 2022, missing the subsequent investment surge and rapid iteration in reasoning-oriented systems, which changed capability growth rates.
Benchmarks were treated like speed limits, not racetracks. The report shows forecasters also understated gains on MATH, MMLU and QuALITY, with superforecasters consistently more conservative than experts. Where capability-building can be parallelised and tried thousands of times in silico, “trend extrapolation” underprices how quickly research teams can clear well-specified bars.
Governance signals lag capability signals. DeepMind’s IMO result was officially graded; OpenAI’s was externally graded and not officially entered. The FRI authors note that the IMO “Grand Challenge” requires Lean-verified proofs and pre-released, open-source models—criteria not met this year—yet a panel would likely agree that the technical capability for gold exists today. Markets, regulators and boards should expect more of this: practical capability arriving before formal governance frameworks adapt.
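For readers unfamiliar with the “Lean-verified” criterion above: in Lean, the grader is a proof checker rather than a human, so a submission either compiles or it does not. A toy example (assuming Lean 4 with Mathlib; deliberately far simpler than any IMO problem) shows what machine-checkable form looks like:

```lean
-- Toy machine-checked proof: the sum of two even naturals is even.
-- If this file compiles, the proof is certified; no human grading needed.
theorem even_add_even (a b : ℕ)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n := by
  obtain ⟨k, hk⟩ := ha
  obtain ⟨m, hm⟩ := hb
  exact ⟨k + m, by rw [hk, hm]; ring⟩
```

The Grand Challenge’s demand for this format, plus pre-released open-source models, is exactly the kind of certification bar that lags raw capability.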
What this means for strategy
Update cadence is now a risk control. Treat A.I. capability news like a macro data print: build quarterly (or better, monthly) update cycles that reassess product roadmaps, workforce plans, and security posture against the current frontier, not last year’s. Aggregate forecasts remain useful, but they should be stress-tested against “capability shock” scenarios suggested by external evaluations.
Shift from trend-following to threshold-watching. The IMO episode suggests that once research groups approach a threshold, the probability mass shifts quickly. Track leading indicators (compute access, evaluation wins, new tool use like formal verification, research talent movement) rather than waiting for consensus forecasts to budge. Nature’s and Reuters’ accounts underscore that independent grading and official certification can arrive on different timetables—plan for both.
Demand third-party grading and red-team evaluations. Where stakes are high (finance, health, infrastructure), require external proof-checking and pre-registration of evaluation protocols. Borrow from the IMO process: independent graders, published solutions, and clear scoring rubrics. The gap between “capable now” and “officially certified” is where reputational and regulatory risk accumulates.
Rebalance the portfolio toward “optionality.” Underpriced A.I. upside (and downside) argues for more real options—pilot budgets, staged vendor contracts, and contingent hiring plans—so organisations can expand or pause as capability thresholds are crossed. The FRI report shows even strong forecasters missed the timing; optionality is a rational hedge.
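The contrast between trend-following and threshold-watching drawn above can be made concrete. The yearly scores below are invented (only the 35-of-42 gold cut-off comes from the article): a naive extrapolation of historical gains dates gold years out, while simply watching the pre-declared bar records the jump the year it happens.

```python
import math

# Illustrative sketch with hypothetical yearly scores (only the 35/42 gold
# cut-off is real): trend-following vs. threshold-watching.
GOLD_BAR = 35  # gold-medal cut-off at the 2025 IMO (35 of 42 points)

history = {2022: 10, 2023: 16, 2024: 22}   # hypothetical pre-jump scores
latest = {2025: 35}                        # the discontinuous jump

# Trend-following: extrapolate the average 2022-2024 gain (6 points/year).
slope = (history[2024] - history[2022]) / 2
trend_year = 2024 + math.ceil((GOLD_BAR - history[2024]) / slope)

# Threshold-watching: the first year any evaluation clears the bar.
actual_year = min(y for y, s in {**history, **latest}.items() if s >= GOLD_BAR)

print(f"trend-following predicts gold in {trend_year}")  # 2027
print(f"threshold actually crossed in {actual_year}")    # 2025
```

The two-year gap between the fitted line and the observed crossing is the mispriced tail the FRI report describes, and the reason leading indicators beat consensus forecasts in fast-moving domains.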
The deeper takeaway is the combination of humility and urgency. Forecasting communities remain valuable, especially when aggregated—but in domains where learning curves are steep and evaluation bars are public, the right posture is watchful, instrumented and fast to update.
A.I. cleared an aspirational bar years early. The next surprise will not announce itself more politely.