The hottest math benchmark at the moment is FrontierMath, created by Epoch AI. It consists of 300 math problems written by professional mathematicians.
When it was first announced in November of 2024, no publicly available AI model solved more than 2% of the problems. Since then, OpenAI has published internal evaluations claiming that their o3 model solves about 30%¹ of the problems. It would be one thing if the benchmark sat largely unsolved — it’s not like any AI system has solved a Millennium Problem — but given this burst of progress, we are drawn to ask what success means.
What makes these new problems special? They are designed to be original and hard.
“Original” is easy to explain. With machine learning systems, there’s always a concern that problems and solutions are in the training data. To avoid this, Epoch AI commissioned mathematicians to write new problems, and then kept the problems private. They’ve released only five sample problems publicly. This is a very reasonable decision, though it does create a barrier for external analysis.
“Hard” is also easy to motivate. Progress in AI over the last year or so has “saturated” most math benchmarks: the models get near-perfect scores. Epoch AI’s goal was to create a benchmark consisting of harder problems. Did they succeed? Clearly in some ways: 30% is far from saturated. Humans also think they’re hard: three Fields Medalists who reviewed the problems characterized them as “exceptionally challenging”.
But difficulty isn’t a monolithic concept. So, what I want to know is: what makes these problems hard? Will AI models find them hard for the same reasons that humans do? And, can that analysis shed any light on what we ought to make of o3’s progress?
I’ll explore these questions in the rest of this post. My tentative conclusion about what makes FrontierMath hard is that it has more to do with the extensive mathematical background required and less to do with creative insight. I’ll also speculate on a few dynamics that could be at play with OpenAI’s rapid progress, but that’s a harder nut to crack. I’ll conclude with a brief description of what I’d personally find most interesting in a math benchmark created by working mathematicians.
Born Yesterday
Before delving into FrontierMath in particular, I’d like to give a general perspective on new AI benchmarks. In my first post about the AIME, I wrote:
I find most AI benchmarks hard to situate. Results all sound like: the new generation of models does well on a test that was designed 6 months ago to stump the old generation of models. What are we supposed to make of that?
[…]
At least the AIME has a life outside of AI: it’s the sort of thing that humans are familiar with and have opinions about. We can hope to answer, at least for humans, what makes the AIME hard.
Coming up with a good benchmark from scratch is challenging. If the benchmark doesn’t arise naturally in the course of human activity, we lack independent reference points about what the benchmark measures. The technical term for this is “construct validity”. Roughly speaking, a benchmark has construct validity if it measures what it is supposed to measure. For brand-new benchmarks, we’re not always sure what they really measure: it takes a while to accumulate evidence one way or another. AI evaluation adds another layer: even for benchmarks that are well-established with humans, we aren’t sure if AI performance lets us infer the same characteristics we’d infer from human performance.
Thus, early iterations of benchmarks often have surprising deficiencies, and FrontierMath is just at the beginning of its journey. I don’t mean this as a knock against FrontierMath: from what I’ve seen, Epoch AI is very thoughtful and open to iteration. But this was all in the back of my mind as I began to investigate.
Dimensions of Difficulty
In creating the FrontierMath problems, Epoch AI broke down problem difficulty along three dimensions. Problem authors were asked to rate their problems on these dimensions. From the initial technical paper:
Background: This rating ranges from 1 to 5 and indicates the level of mathematical background required to approach the problem. A rating of 1 corresponds to high school level, 2 to early undergraduate level, 3 to late undergraduate level, 4 to graduate level, and 5 to research level.
Creativity: Estimated as the number of hours an expert would need to identify the key ideas for solving the problem. This measure has no upper limit.
Execution: Similarly estimated as the number of hours required to compute the final answer once the key ideas are found, including time writing a script if applicable. This measure also has no upper limit.
The easiest of the sample problems is rated high school / undergraduate for Background, with 2 hours each for Creativity and Execution. Epoch AI subsequently aggregated these measures into three tiers, Tier 1, Tier 2, and Tier 3, with Tier 3 being the most difficult.
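To make the schema concrete, here is a minimal sketch of how one might record a problem’s three ratings, in Python. The class and field names are mine, not Epoch AI’s, and since the rule mapping ratings to tiers isn’t public, I don’t attempt it.

```python
from dataclasses import dataclass

# Background levels as defined in the technical paper: 1 = high school,
# 2 = early undergraduate, 3 = late undergraduate, 4 = graduate, 5 = research.
BACKGROUND_LABELS = {
    1: "high school",
    2: "early undergraduate",
    3: "late undergraduate",
    4: "graduate",
    5: "research",
}

@dataclass
class DifficultyRating:
    background: int          # 1-5 scale above
    creativity_hours: float  # est. hours for an expert to find the key ideas
    execution_hours: float   # est. hours to compute the answer once the ideas are found

    def describe(self) -> str:
        return (f"background: {BACKGROUND_LABELS[self.background]}, "
                f"creativity: {self.creativity_hours:g}h, "
                f"execution: {self.execution_hours:g}h")

# Roughly the easiest sample problem's ratings (its background sits between
# high school and early undergraduate; I picked 2 here arbitrarily).
easiest_sample = DifficultyRating(background=2, creativity_hours=2, execution_hours=2)
print(easiest_sample.describe())
```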
The Background and Execution parts are really what’s new. IMO coach and FrontierMath co-author Evan Chen points out that, while the hardest olympiad problems require high creativity, they are explicitly designed not to require advanced math background or extensive execution phases. This makes sense for high schoolers working on the clock. But research mathematicians face all three elements, and, anyway, if the goal is a hard benchmark, why not give it all we’ve got?
I really like these dimensions. In fact, my own characterization of AIME problems is pretty well summed up by saying they require advanced high-school background, a moderate amount of execution (though only “moderate” given the time constraint), and relatively little creativity.
But, while I think it’s very good to pursue this multi-faceted understanding of difficulty, I think it’s probably unhelpful to aggregate the dimensions. Consider what happens if a problem maxes out just one dimension while remaining low on the others.
High Background, Low Creativity and Execution. In the limit, this is like a search problem. In practice, it may be a matter of training data: did the model have a chance to get familiar with the relevant concepts? Humans may consider the subject matter obscure, but for a model it would just be a matter of reading comprehension: go through the relevant papers and eventually dig up the answer.
High Creativity, Low Background and Execution. In the limit, this is like a very challenging riddle (and some IMO problems). As I’ve discussed in other posts, I think this is perhaps the most interesting deficit with current models. Maybe scaling will lead to emergent properties around creativity, but I don’t think we’ve seen much sign of this yet.
High Execution, Low Background and Creativity. In the limit, this is like multiplying two 100-digit numbers (see the sketch just after this list). We know models eventually struggle here, but simple scaling will address that, at least up to a point. For very high-Execution problems it’s possible that fundamentally new approaches will be required, just to help models manage all the threads. But my guess is that, at least for the next few legs up, this will be solvable with relatively straightforward scaling.
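As a toy illustration of the pure-Execution limit, here is that 100-digit multiplication case. The point is that a one-line script does it exactly, while a model answering token by token has to carry hundreds of partial products without dropping a digit.

```python
import random

random.seed(0)
a = random.randrange(10**99, 10**100)  # a random 100-digit integer
b = random.randrange(10**99, 10**100)  # another one

# Trivial for Python's arbitrary-precision integers; pure execution,
# essentially zero background or creativity required.
print(a * b)  # a ~200-digit product
```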
These are three very different types of problems! Even if the distribution of problems within FrontierMath is roughly uniform over these dimensions, we still have to wonder which 30% o3 is solving.
Can we get any sense of the actual distribution of FrontierMath problems? We have only the five sample problems to go on, but we can at least see what mathematicians say about them. I found number theorist Daniel Litt’s take particularly interesting.
Elsewhere Litt says that for two of the sample problems he knew “instantly” what to do — not due to any brilliancy, just because he was familiar with the relevant subject matter. That seems out of step with the Creativity ratings on these two problems, which say it should take 2 and 3 hours to identify the key ideas.
Litt also contrasts his impression of FrontierMath problems with IMO problems, which, consistent with Chen, he says are “not really accessible to expertise” and are “about cleverness to some extent.” This also differs from some of Epoch AI’s statements: they compare Tier 1 problems to IMO problems, and see higher tiers as being harder.²
I think what’s going on here is that FrontierMath problems and IMO problems are hard for different reasons. My suspicion is that a significant chunk of FrontierMath problems can be solved by applying advanced mathematical techniques in relatively straightforward ways. If anything, this might lead humans to overestimate their difficulty: most people don’t have the right knowledge, and without it the problems seem impossible; with the right knowledge, they aren’t so bad.
Given what else we’ve seen about o3’s abilities in general, I’m inclined to believe that solving this sort of Background-based problem accounts for much of its performance on FrontierMath. See also the “Reading Up” section below for a specific way this dynamic could explain the gap between o1 and o3’s performance.
To close this section, here’s how the Fields Medalists who reviewed FrontierMath described what it would mean if a model could solve the entire benchmark.
The mathematicians generally agreed that AI systems at FrontierMath’s level would be most valuable as supplements to human mathematicians rather than replacements, helping to verify calculations, test conjectures, and handle routine technical work while leaving broader research direction and insight generation to humans.
Although these mathematicians think the FrontierMath problems are challenging, they don’t necessarily think solving them is indicative of creative insight.
Poor Correlation
Toward the end of writing this post, I learned that OpenAI’s result of solving ≈30% of FrontierMath problems included solving ≈30% of the Tier 3 problems: success was distributed evenly across the difficulty tiers. This seemed to surprise the benchmark’s creators. Here’s their lead mathematician, Elliot Glazer:
One very concrete reason to expect this is that mathematicians can’t seem to agree on the difficulty ratings. Epoch AI had a small subset of FrontierMath problems reviewed twice. From the technical paper:
We observed inconsistent difficulty ratings between first and second reviewers; due to the subjective nature of this task, ratings rarely matched and often showed substantial differences.
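The quoted passage describes the disagreement only qualitatively; a rank correlation over the paired ratings would be one simple way to quantify it. The numbers below are invented purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical difficulty ratings (1-5) from a first and second reviewer for
# ten problems -- made-up numbers, just to show the bookkeeping.
first_review  = [2, 5, 3, 4, 1, 5, 2, 3, 4, 1]
second_review = [4, 3, 2, 5, 3, 3, 4, 1, 5, 2]

rho, p_value = spearmanr(first_review, second_review)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```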
They also acknowledge that this is just hard in general.
Problems that seem impossible may become trivial after exposure to certain techniques, and multiple solution paths of varying complexity often exist.
Moreover, while we designed our problems to require substantial mathematical work rather than allow solutions through guessing or pattern matching, the possibility of models finding unexpected shortcuts could undermine such difficulty estimates.
Where to go from here? In the thread linked above, Glazer says he’d “like to build a system that is simply aimed towards ‘how long do we think it will be before AI solves this problem?’”
Ultimately I would be more interested in, “How hard was this for a human, and why was it that hard?” That is a harder thing to measure, but I think the current difficulty-rating system is a very good start! I just think it would be better to disaggregate the dimensions and try to measure performance on them more independently.
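To be concrete about what disaggregating could look like in practice, here is a toy sketch that reports solve rates per dimension rather than a single aggregate score. The per-problem records, and the idea of tagging each problem with a single dominant dimension, are my own simplifications for illustration.

```python
from collections import defaultdict

# Hypothetical records: each problem tagged with the dimension that dominates
# its difficulty, plus whether the model solved it. Invented data.
results = [
    {"dominant": "background", "solved": True},
    {"dominant": "background", "solved": True},
    {"dominant": "creativity", "solved": False},
    {"dominant": "creativity", "solved": False},
    {"dominant": "execution",  "solved": True},
    {"dominant": "execution",  "solved": False},
]

tally = defaultdict(lambda: [0, 0])  # dimension -> [solved, total]
for r in results:
    tally[r["dominant"]][1] += 1
    tally[r["dominant"]][0] += r["solved"]

for dim, (solved, total) in sorted(tally.items()):
    print(f"{dim:<12}{solved}/{total} solved")
```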
Does OpenAI Have an Edge?
Could there be an unexpected reason why o3 does better than o1 on FrontierMath?
Update: One possibility I should have mentioned initially is that o3 doesn’t do that much better than o1. The ≈30% number for o3 is based on internal evaluations from OpenAI. As far as I can tell, they haven't published numbers for o1 in a comparable evaluation setting. Epoch AI's evaluations for o1 (really o1-preview and o1-mini) were below 2%, but it's possible that OpenAI knows how to elicit better performance from their models in general. We might learn more about this if/when either org publishes more comparable evaluation results. Thanks to Epoch AI director Jaime Sevilla for his comments on this topic. I’ve left the rest of this section unchanged as I think the ideas are interesting in general, but there’s certainly a chance there’s nothing here.
There was a minor brouhaha when it came out that OpenAI was the primary funder of FrontierMath, which wasn’t reported at first. It turned out that OpenAI has access to all of the problems, plus solution write-ups for all but 50 of the problems. An OpenAI employee, Sebastien Bubeck, stated that, “No decision on any of our models was based on anything to do with [FrontierMath].” I’m inclined to take that at face value. OpenAI has kept a lot of information private, but I haven’t seen any signs at all of scientific fraud.
Still, I think the rapid jump in performance on FrontierMath from o1 to o3 deserves scrutiny. We know these models can’t do all the math in the world. What are the odds that a benchmark was developed at just the right difficulty level, and released at just the right time, such that the best models went from “zero” to “something”?
Again, I don’t think there’s anything nefarious going on. But this reminds me of some of the lessons from the replication crisis in psychology and medicine: researchers have “degrees of freedom” in conducting experiments, which can unintentionally lead to misleading results. I wonder if something like this was at work with FrontierMath.
Freedom of Expression
For example, maybe someone at Epoch AI checked whether some version of o1 could solve drafts of the problems, and took that into account editorially. We know that these models can be sensitive to wording: any hunt for a problem the model can’t solve may stumble into a random corner of language-space where the model makes a mistake.
Such mistakes are often not stable, though, and this process could unintentionally overfit the problems to be difficult for o1 specifically. Perhaps random changes between o1 and o3 were enough to shake things up in some of these cases. o3’s apparent performance would increase, but the gain would not reflect a correspondingly real improvement in abilities.
Reading Up
Or, here’s a dynamic I expect to happen in general and which may be at play here. One of the Fields Medalists who reviewed FrontierMath described how the relevant background material might be hard to come by:
For many FrontierMath problems, the relevant training data is "almost nonexistent...you’re talking like a dozen papers with relevant things."
There’s an obvious thing to do about that: make sure the relevant papers are in the training data! I don’t know how well models can learn general facts from single papers during pre-training. But when I asked o3-mini-high, with web search disabled, if it knew anything about the one math paper I co-authored as an undergraduate, it gave me a correct summary. It seems like there’s at least some chance of this.
Now, suppose that OpenAI knows that FrontierMath is coming, and knows that it will require sophisticated mathematical background. It seems like fair game for them to make sure that every last math paper is included in pre-training for o3. Maybe this violates what Bubeck said above about “no decision” being based on FrontierMath, but I wouldn’t really hold it against OpenAI. Still, it could explain the difference in o1 and o3’s performance.
Regardless of whether that happened in this case, I think it’s worth keeping an eye on: even if problems are kept out of training data, relevant background may accumulate over time. This can cause model performance to improve even without any change in general-purpose reasoning capabilities. Comparing detailed solutions across model versions should give some insight as to whether this is happening.
Low-Proof Spirits
One last note on FrontierMath: Epoch AI created an innovative evaluation format. They wanted solutions to be automatically verifiable, and so required problem authors to creatively squeeze big conceptual ideas into simple, well-defined answer formats. This opened the door to models brute-forcing solutions, which Epoch AI did their best to address. I’m sure there’s more work to do there, but with more extensive testing I bet they can push this pretty far.
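The problems are posed so that the answer is a definite, automatically checkable object, often an integer or an exact expression. Here is a minimal sketch of the kind of exact-match check that format allows; the function and the example answer are mine, not Epoch AI’s actual verification code.

```python
import sympy

def check_answer(submitted: str, reference: str) -> bool:
    """Exact-match check for a closed-form answer.

    Both strings are parsed with sympy, so e.g. '2**61 - 1' and
    '2305843009213693951' count as the same answer. A sketch of the
    general idea only, not Epoch AI's verification code.
    """
    try:
        diff = sympy.simplify(sympy.sympify(submitted) - sympy.sympify(reference))
    except (sympy.SympifyError, TypeError):
        return False
    return diff == 0

# Made-up example: a large, hard-to-guess integer as the official answer.
print(check_answer("2**61 - 1", "2305843009213693951"))             # True
print(check_answer("2305843009213693950", "2305843009213693951"))   # False
```

The design pressure, then, is to choose answers that are essentially impossible to hit without doing the underlying math.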
Does this place an inherent limit on what the benchmark measures? The Fields Medalists commented on this a bit.
However, some mathematicians pointed out that the numerical format of the questions feels somewhat contrived. Borcherds, in particular, mentioned that the benchmark problems “aren’t quite the same as coming up with original proofs.”
Proofs are a core part of mathematical activity, and some questions probably can’t be squeezed into simple answer formats. I suspect there will remain a role for manual evaluations, but I don’t see this as a mark against FrontierMath.
What I Wish I Knew
In the end, there are two big things I wish I knew about FrontierMath and o1/o3.
How much of FrontierMath’s difficulty is due to each difficulty factor?
If you adjust for Background and Execution, what’s left? What proportion of problems require Creativity on par with harder IMO problems?
How much of each dimension was required to solve the 30% that o3 solved?
Why did o3 do better than o1?
Did the creators of FrontierMath check that o1 couldn’t solve the problems as part of the problem-development process?
How do o1 and o3’s training differ in terms of both the raw math texts and labeled math reasoning that were used? Was the set of math texts used in pre-training significantly expanded for o3?
If no to both of the above, then what are the differences between o1 and o3? Of course, everyone wants to know that one.
So There I Was…
What else could we do for a math benchmark? Here’s an outline of my personal ideal.
Interview mathematicians. Have them describe problems they solved in the course of their research. Have them present the problems not with the benefit of hindsight, but in the terms in which they originally formulated them when they began working. “So there I was, trying to prove this lemma...” Then have them describe the solution they ultimately found, including any reformulations of the problems. Turn this into labeled data.
Operationalizing this is a tall order. Making automated evaluations would be challenging. Data contamination problems would be significant, especially if the mathematicians had gone on to publish their results. But the reason I consider it such an ideal is that it would have a good shot at construct validity: it would measure a model’s ability to do a part of the work that a mathematician does.
Notes

1. OpenAI specifically claimed 25% and 32% under different settings. 25% is for “o3”, without further qualification. 32% is for “o3‑mini with high reasoning effort, when prompted to use a Python tool”. I couldn’t find any more detail, so we’ll just round this off to 30%. It won’t make a difference for our analysis.
2. I also find it notable that IMO problems are chosen through a very intensive process. They are sourced globally over the course of the year, undergo extensive test solving, are narrowed into a shortlist of about 30 problems, and then the 6 final problems are selected by an expert panel. In contrast, most FrontierMath problems were reviewed only once. Epoch AI certainly recognizes the importance of this angle. They are currently working on a Tier 4 of problems, meant to be as hard as Tier 3 but more extensively tested (“red-teamed”) to make sure there are no shortcuts.
Very interesting. o3-mini-high and especially Deep Research seem worse at my subfield than you'd predict from FrontierMath
I have two different theories:
1. Could some of the problems be not as novel and untouched as their proposers had thought? The literature nowadays is enormous, and researchers constantly reinvent things; the prior art might then get discovered 10 years later. But a well-trained AI could have enough memory to avoid this problem.
2. Could the AI be doing some educated guessing? The problems were supposed to have answers that are not easily guessed, but there are often (particularly in number theory) heuristics that give an exact number with morally-speaking 95% probability (e.g., if some polynomial is squarefree, or the GRH is true, or some series converges). An AI who gets such guesses right would be indistinguishable from an AI who solves 95% of such problems, unless the authors take care to pick unusual cases often.