Interview with Elliot Glazer, Lead Mathematician at Epoch AI
Now I Know Some More About FrontierMath
A few weeks ago, I wrote a post called What I Wish I Knew About FrontierMath, a new math benchmark created by Epoch AI. The head mathematician of the project, Elliot Glazer, reached out to chat about some of my questions, and about Epoch AI’s plans for the future of the benchmark.
This post contains highlights from our discussion, lightly edited. You can watch a recording of our full chat here. Everything below is a quote from Elliot, unless otherwise indicated. Bold just indicates bits that I found particularly interesting.
On the Origins of FrontierMath
Epoch conceived of FrontierMath before I joined. They were discussing how AI benchmarks were getting saturated very fast, and they wanted something that was more difficult and would last a while but still be an objectively valid source of information. Mathematics, which is about as objective as it gets, seemed like a natural choice.
They did a pilot version of FrontierMath that was smaller and much easier. Eventually, I think a few months later, Epoch was able to secure funding from OpenAI. Having conversations there helped them realize that FrontierMath needed to be much harder if they wanted it to last even a year. Basically, they were told that OpenAI’s models were improving so fast that anything just barely outside of AI’s range would be grasped by it shortly.
So they reduced the number of problems. I think it was originally going to be on the order of a thousand problems, but it ended up being only 300, with more of an emphasis on high difficulty. Tamay, Epoch’s associate director, and Ege, who's a senior researcher, decided to pass the project on to an actual mathematician.
The timing was great for me. I was in the middle of my job search and saw this ad for what seemed like a perfect opportunity, so I went for it.
If difficulty is the goal, why not use unsolved problems?
We did consider that, and we are still considering it. We have various projects in the works based around the idea of studying AI's progress on certain open problems where progress would be concretely verifiable. For instance, one I suggested was finding new records for the longest known arithmetic progression of primes, since such a record can be verified efficiently. So we are actually beginning the process of cataloging such problems.
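Note: as a rough illustration of what “concretely verifiable” means here, a claimed record of primes in arithmetic progression can be checked with a short script. This is my own sketch, not anything from Epoch; the choice of sympy’s primality test is arbitrary.

```python
# Sketch: checking a claimed record of primes in arithmetic progression.
# Illustrative only; sympy.isprime is just one convenient primality test.
from sympy import isprime

def is_prime_ap(start: int, step: int, length: int) -> bool:
    """Return True if start, start+step, ..., start+(length-1)*step are all prime."""
    return all(isprime(start + k * step) for k in range(length))

# Example: 199, 409, ..., 2089 is a well-known length-10 progression of primes.
assert is_prime_ap(199, 210, 10)
```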
Greg: So was automatic verifiability the big roadblock to using unsolved problems?
FrontierMath does have a rather separate goal: to get a better understanding of AI's progress towards reaching human level. This requires problems that have actually been solved by humans, for us to have some sense of what the difficulty is.
It's really hard to figure out the difficulty of an open problem. I imagine if you polled mathematicians about the famous conjectures and what order they will eventually be solved in, the result would resemble a random permutation. People have such different instincts on these things. The only thing we know is that however hard we've tried thus far is a lower bound on the difficulty.
So the advantages [to using solved problems] are that we do know the answers and we know where the problems stack up on the line from high school math olympian to professional researcher. And we can use this to benchmark how close AI is to the pinnacle of humanity.
What would you say if AI systems solved FrontierMath by the end of the year?
One possibility is that would mean AI was really that good. In which case FrontierMath didn't fail, it just pointed out that reality. That would be a success.
But as impressed as I have been by AI, I would think, if that was achieved this year, it would more likely speak poorly of the design of FrontierMath.
We're already seeing some glimmers of this, where I think some of the problems suffer from being too guessable. When I say guessable, I mean basically any way in which a model can reach the correct answer to a problem without going through most of the rigorous reasoning needed to justify it.
The ideal problem has the property that if you have found the answer, you almost surely did the reasoning. But not every [problem author] succeeded at that ideal. It's really hard. We are compiling the lessons we've learned from the way these models have “cheesed” various problems, to provide to future problem contributors. A little handbook on how not to have your problem get cheesed by AI.
On the FrontierMath Answer Format
FrontierMath problems require closed-form values, usually integers. This is largely done for convenience: answers in this format are easy to check automatically.
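Note: for concreteness, checking an answer in this format amounts to an exact-match or symbolic-equality test. The sketch below is my own illustration, not Epoch’s actual verification code; the use of SymPy here is an assumption on my part.

```python
# Sketch of automated answer checking for closed-form answers.
# Illustrative only; not the actual FrontierMath verification code.
from sympy import simplify, sympify

def check_answer(submitted: str, reference: str) -> bool:
    """Accept a submission iff it is symbolically equal to the reference answer."""
    return simplify(sympify(submitted) - sympify(reference)) == 0

print(check_answer("12", "12"))              # True: plain integer answer
print(check_answer("sqrt(8)", "2*sqrt(2)"))  # True: equivalent closed forms
print(check_answer("13", "12"))              # False
```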
Alternatively, we could demand models submit English language proofs that have to be verified by a human judge. But this would be extremely unwieldy.
Or, we could have models submit formalized proofs in a language like Lean. We are interested in this, but there are a lot of limitations. For instance, Lean's libraries are still missing a lot of undergraduate mathematics, let alone graduate research-level topics. So we would only be able to test certain subfields of math if we wanted to design a Lean benchmark today. But we are still interested in doing some sort of project along these lines.
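Note: for a sense of what a Lean-based format would look like, here is a toy machine-checked example of my own (Lean 4 with Mathlib), not a FrontierMath problem.

```lean
-- Toy illustration of machine-checkable mathematics in Lean 4 + Mathlib.
-- Not a FrontierMath problem; it just shows the shape of a formalized submission.
import Mathlib

-- A concrete fact, discharged by a decision procedure:
example : Nat.Prime 1009 := by norm_num

-- A textbook theorem, proved by citing Mathlib's existing result:
example : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime := Nat.exists_infinite_primes
```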
What does it mean if an AI system solves FrontierMath?
Greg: You said, maybe if FrontierMath is solved by an AI system, the AI system is “that good” at math. What is your target for what “that good” means? In other words, we know that the FrontierMath problems don't encapsulate all activities undertaken by mathematicians, so what do you hope FrontierMath to reflect in terms of mathematical activities?
Our difficulty tiering system gives some insight into that. After we collect difficulty ratings along various axes, we squeeze them into one of three numbers representing the three tiers of difficulty that we intended FrontierMath to test.
Tier 1 is roughly hard undergraduate: this is stuff like Putnam problems, arguably a sufficiently creative IMO problem, but also things like difficult exercises in an undergraduate math course. A difficult textbook exercise would perhaps require more background knowledge than a typical Putnam problem, but is maybe not as much of a test of creativity.
For Tier 2, our goal was more like problems that a grad student would face in a graduate course or on a qualifying exam: things that show they are really learning the details of an advanced subject, while still not necessarily demanding much creativity. This would also include, then, a sufficiently creative undergraduate problem, or something like an REU research problem.
For Tier 3, our goal was problems at roughly the early-researcher level. This phrasing can be a bit confusing, because we aren't suggesting that doing a Tier 3 problem amounts to genuinely doing a research project. It's more like: what capabilities would you expect an early researcher in a field to have? What sorts of mastery would they need to begin really tackling the interesting questions in their field? And what might be a first problem a PhD advisor would give their student to help them start exploring their field before they tackle a long-term project? That's the area I was hoping Tier 3 would capture. Of course, again, this will also include some really creative or execution-intense versions of Tier 2 problems.
These were our attempts to sort the problems into three large buckets of difficulty.
So, the question you had asked was then what do I intend it to mean when a model has saturated FrontierMath, and I would answer that tier by tier.
I would hope that a model that has saturated the Tier 1 subset is as strong as a really strong undergraduate, who has some combination of contest experience but also has a lot of advanced courses under their belt. Tier 2 would be like an amalgam of early grad students who have chosen to pursue various subfields of mathematics and the sort of combined insights they might have. A Tier 3 saturating model would be some sort of amalgam of early researchers ready to tackle big projects, again among all the diverse subfields of math.
On Comments from Professional Mathematicians
Back in October, we interviewed Terry Tao, Timothy Gowers, Richard Borcherds, and Evan Chen. We sent them the Tier 3 problems we had finished at the time. This was a subset that was pre-selected to be as hard as possible, not a representative sample. Their comments were roughly that [the above characterization of Tier 3] is what it would mean for a model to solve these problems.
Of course, no one of these mathematicians is omniscient with respect to all of the different fields of math. They would all acknowledge that they would need to do some background study before they could tackle what would be considered even an easy problem, if it is far enough from their expertise. They all said, “I don't know how to solve most of these,” but I don't think this was them endorsing the problems as genuinely research-level.
Greg: This is an interesting difference between a human and an AI system doing math. To a human, if you're unfamiliar with an area, you might have no idea what's going on in a problem statement. That can be very intimidating. But once you’re able to map that to something you do understand, or just brush up on the background, something that looked intimidating to begin with can look easy or trivial. But an AI system that's been pre-trained on enough math content doesn’t have that problem. It's all familiar. And so you have this funny element where just because an AI system can solve something that's unfamiliar to me, I may think that looks really impressive.
So that represents one of two different ways in which a problem that might be impressive to even a Fields Medalist might not really be that significant when it's solved, namely lack of experience in that subfield.
And even then, mathematicians who do fully understand the difficulty of the mathematics of the problem will sometimes fail to recognize the wide array of tricks a model might use to jump to the answer without proving all the lemmas.
On the Difficulty Dimensions
I think it was Ege who made that system, because it was already here when I got here.
It just seemed to me a very natural way to factor problem difficulty, and it seems robust even for the most advanced research-level math problems. You can look up any recent major result and go through it in terms of: how much background knowledge do you need to even begin reading the paper; at how many points is there a very surprising next step that would seem to be a creative epiphany; and, between each creative epiphany, how much manual labor do you have to do.
So I think it's a very nice robust system and it will scale well to even problems that are beyond the FrontierMath scope. I like it.
But then you have all the practical realities of how to operationalize it. If we choose an operationalization that’s based on [how difficult it would be for] an idealized person who's the most skilled in the relevant area — how well can arbitrary reviewers simulate that person, or how well do two reviewers tend to correlate in their ratings?
I don't think that we have yet come up with a system that is reliable, in the sense that, if I hired a new team of reviewers and had them all do the same process I've given the ones thus far, I'm not that convinced that the numbers would align that much with the original ones.
So we are working towards a better strategy for coming up with these values, how to make them more precise and how to extract more predictive content from these estimates. So in terms of the idealized version of this difficulty system, I still think this is really a great way to break down problem difficulty. We just need to find much better ways to assign numbers to these things.
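Note: one concrete way to quantify the reliability worry above is to compute a rank correlation between two reviewers' ratings of the same problems. The sketch below uses made-up numbers and Spearman's rho; neither is taken from Epoch's actual process.

```python
# Sketch: measuring inter-reviewer agreement on difficulty ratings.
# Ratings are made up for illustration; Spearman's rank correlation is one
# reasonable agreement metric, not necessarily what Epoch uses.
from scipy.stats import spearmanr

reviewer_a = [2, 5, 3, 8, 7, 4, 9, 6]  # hypothetical difficulty ratings per problem
reviewer_b = [3, 4, 3, 7, 8, 5, 9, 5]

rho, p_value = spearmanr(reviewer_a, reviewer_b)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```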
The FrontierMath Competition
This is one of the big things that we are doing to actually put some meaning behind these difficulty ratings. We're actually going to run an in-person competition near the end of March.
The idea is that the only way to really get a good sense of how hard these problems are is to give them to appropriate experts. There's no point in giving anyone all 300 problems. [Any given person] won't even understand the words [in many problems] because each field has its own jargon. But we will give a subset to mathematicians based around their expertise and have them compete against each other for prizes. I think the prize pool is over $30,000. We want to have a good incentive to have strong mathematicians do the best on the set of problems that are most relevant to them.
In doing so, we'll be able to collect data on how hard these things really are.
Tier 4
I want to create a data set that does not get saturated until AI has genuinely mastered the main ideas of most of the major fields of mathematics. I think people think that's what Tier 3 is. It's not, but we are trying to realize that really high bar of difficulty now with this [Tier 4] project.
Frankly, it's a good thing we didn't start with Tier 4, because we've learned from a lot of mistakes. Not only will we be able to hire the kind of top professors we need to make Tier 4 happen, but we're also going to screen heavily for the kinds of errors that resulted in problems rated Tier 3 getting solved by heuristics.
This will be easier to explain once we release our next set of five public problems. The first set [of five public problems] was a random selection, one from each quintile of difficulty.
But now that we're seeing what models can do on FrontierMath, the next five public problems will be chosen to demonstrate different things we have learned from this process, rather than based on what we had guessed the relative difficulties would be.
So it will, for example, include some problems that I would have been shocked to see solved this year. We’ll be writing a public report of what we learned, including the weaknesses that allowed AI to solve these problems — even though I still don't really think that they mastered the relevant subjects.
Note: These problems are now available here. See this thread for mathematician Daniel Litt’s discussion of the problem he contributed to FrontierMath, which, to his surprise, o3-mini-high can solve.
Analyzing OpenAI’s Progress
OpenAI shared the reasoning traces with us for their internal evaluation of o3 on FrontierMath. We’re able to share these traces with the problem authors, so they can see how the models tackled their problems. We’re having them fill out a survey for each problem. For correctly solved problems: how much of the legitimate mathematical reasoning did the model do? Did it find all the key steps? Similarly, for problems it didn't solve: how much progress did it make, how early on did it first make a fatal misstep that it never recovered from, and so on.
[The same public report mentioned above] will address things like the survey data from the authors, the lessons we are learning, and an analysis of these five newly public problems and what they show about the differences between AI difficulty and our internal difficulty ratings. It will also go over some of the things we have learned about the three axes of difficulty we've been measuring and to what extent those correlate with the model’s performance.
I'm curious to see if any of the difficulty ratings are as predictive of the AI’s performance as another metric we measured, which was how many pages long the solution is. That feels like such a naive, blunt measure of difficulty, but I wouldn’t be that surprised if it actually correlates relatively well with a problem being hard for o3-mini. Just because: the more steps, the more opportunities to hallucinate.
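Note: as a sketch of how one might test that hunch, you could correlate solution page count with a binary solved/unsolved outcome, for example with a point-biserial correlation. The numbers below are made up; this is not Epoch's analysis.

```python
# Sketch: does solution length predict whether the model fails on a problem?
# All numbers are made up for illustration; this is not Epoch's analysis.
from scipy.stats import pointbiserialr

solution_pages  = [1, 2, 2, 4, 5, 7, 8, 12]  # hypothetical solution lengths (pages)
solved_by_model = [1, 1, 1, 1, 0, 0, 1, 0]   # 1 = solved, 0 = not solved

r, p = pointbiserialr(solved_by_model, solution_pages)
print(f"point-biserial r = {r:.2f} (p = {p:.3f})")
```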
One comment about these reasoning traces [that OpenAI sent from their internal evaluations]: they are definitely a lot more impressive than what we get online [using ChatGPT or the API]. Almost surely they put a lot of resources into solving these problems when they do their internal evaluations.
Note: Epoch AI has since run their own evaluations, and made all outputs available for the 10 public FrontierMath problems. You can inspect the results here.
My biggest takeaway from looking at these traces is that they are surprisingly human. The way it talks really captures the feeling of chaos that goes on inside my own head when I'm thinking about a math problem. When I'm thinking, I constantly hallucinate too. I'll quickly make a false guess. Then a different part of my brain says: that doesn't make sense; try a different direction. A very similar sort of dialogue plays out in these reasoning traces, where it does still hallucinate all the time, but it goes back, realizes it, and course corrects, until it finds a train of thought that it doesn’t see any problems with, and then it commits to that.
Looking Ahead
We're learning as much as possible about the nature of mathematical difficulty and the ways in which we have and have not captured it. I’m happy to talk about the flaws of the project, but I still very much stand by the base FrontierMath as it has been released. I think we have captured a genuine signal here. The scores from frontier models are genuinely showing serious progress in how much these models understand mathematics.
With Tier 4, I hope to have a much higher difficulty, one that perhaps will survive a few years, if we're able to capture that difficulty and avoid the heuristic mistakes that some problems in base FrontierMath suffered from. So, we will release this report soon, and I am really excited to see how Tier 4 plays out.
Thanks to Elliot for the discussion! You can find the full recording here.
Lots of great information and insight here on FrontierMath and the o3 results!
I was slightly confused by this passage:
> One comment about these reasoning traces [that OpenAI sent from their internal evaluations]: they are definitely a lot more impressive than what we get online [using ChatGPT or the API]. Almost surely they put a lot of resources into solving these problems when they do their internal evaluations.
> Note: Epoch AI has since run their own evaluations, and made all outputs available for the 10 public FrontierMath problems. You can inspect the results here.
Following the "inspect the results" link, it looks like Epoch AI has evaluated o3-mini but not the full o3 model? Which is what I'd expect, since (AFAIK) OpenAI is not providing public access to o3 (except that, reportedly, their Deep Research tool uses o3). However, wouldn't that explain why OpenAI's internal evaluations were "more impressive than what we [Epoch] get online"?
(Also, I forget the details regarding OpenAI's internal evaluation, but I believe their most impressive numbers involved running the model many times on each problem?)