5 Comments

Very interesting. o3-mini-high and especially Deep Research seem worse at my subfield than you'd predict from FrontierMath.


I have two different theories:

1. Could some of the problems be less novel and untouched than their proposers thought? The literature nowadays is enormous, and researchers constantly reinvent things; the prior art sometimes only gets discovered 10 years later. But a well-trained AI could have enough memory of the literature to avoid this problem.

2. Could the AI be doing some educated guessing? The problems were supposed to have answers that are not easily guessed, but there are often (particularly in number theory) heuristics that give an exact number with, morally speaking, 95% probability (e.g., if some polynomial is squarefree, or the GRH is true, or some series converges). An AI that gets such guesses right would be indistinguishable from an AI that solves 95% of such problems, unless the authors take care to pick the unusual cases often.


I agree, both of these are live issues. Epoch AI discusses both, but they aren't solved.

On (1), they say this: "Our primary method for validating originality relied on expert review by our core team of mathematicians. Their extensive familiarity with competition and research-level problems enabled them to identify potential similarities with existing problems that automated systems might miss. The team conducted thorough manual checks against popular mathematics websites, online repositories, and academic publications. This expert review process was supplemented by verification using plagiarism detection software."

They also ran the problems through two plagiarism detection programs, and got no hits.

I'd call that a good effort, but as you say it's not a guarantee.

More importantly, IMO, the models might not see much of a difference between literature that contains the problem itself verbatim and a discussion of the key results that, if applied, straightforwardly solve the problem. In the extreme case, maybe the problem itself is novel but there's a Wikipedia page with the key result/formula, and if you think to apply that, you're done.

On (2), Epoch AI is definitely focused on this. They mention that, in the subset of 35 problems they put through a second review, "For two of the 35 questions, reviewers proposed strategies for guessing the solution with substantially less effort or computation than was necessary for a rigorous mathematical justification of the answer." They plan to beef this up with, e.g., much more test-solving.

Also, Litt says in another tweet that, on the hardest sample problem, his "no thinking at all" idea was a heuristic approach that correctly computed the first 3 digits of the 6-digit answer before timing out. Source: https://x.com/littmath/status/1870544620498857996 -- and read down for Glazer's response.

So I think these are really good points. More work can probably ameliorate these issues, but we'll always be a bit unsure how much they're at play.


Ah, good that they thought of these! Would be nice to know more, though.

Actually, their sample problem 2 is a great example of "easier if you don't take the high route". You're asked to ensure that the set {p(x) = p(y)} has many components; in other words, there should be many distinct reasons why two values of p can be equal. Well, if you forget for a moment that p is supposed to be a polynomial, then the easiest way to ensure this is to pick a trig function: sin(x) = sin(y) holds not just for x = y but also for x = pi - y, for x = 2pi + y, and so on. The closest things to trig functions among the polynomials are, of course, the Chebyshev polynomials, and so we're naturally led to T_19 (for degree reasons). Then the linear-coefficient requirement forces us to make an affine transformation.

I am not saying that this is the AI's thinking, but it might well be.
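
To put a formula behind the "many distinct reasons" intuition: if I'm recalling the classical factorization correctly (this is my own addition, not part of the problem statement or Epoch's write-up, so check the constant), for odd n we have

\[
T_n(x) - T_n(y) \;=\; 2^{\,n-1}\,(x - y)\,\prod_{k=1}^{(n-1)/2}\left(x^2 - 2xy\cos\frac{2\pi k}{n} + y^2 - \sin^2\frac{2\pi k}{n}\right).
\]

For n = 19 that gives 10 factors (the line x = y plus 9 conics, one for each extra relation in the sin(x) = sin(y) analogy), so the curve {T_19(x) = T_19(y)} has 10 components over C.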


Another OpenAI employee confirms that the FrontierMath reasoning traces guide "high-level" decisions about the model.
