I kiiiind of model human math problem solving as a heuristic tree search? Like at any given point you have a "board state" (things you know/have calculated) and a bunch of potential "moves" (next things to try to prove/calculate) and you have to get to a winning state in as few moves as possible. Some moves take you to losing board states (unsound proofs/calculation errors) and must be avoided. Indeed, automated theorem proving was (Wikipedia tells me) one of the original applications of MCTS long before AlphaGo and friends.
I find this a useful way to operationalize what "creativity" and "execution" could mean in terms of AI capabilities. "Execution" is a combination of being good at avoiding losing states (if you have to do 100 steps in a row correctly, you'd better screw up less than 1% of the time) and just raw ability to explore the space. "Creativity" is mostly about better move evaluation/tree-pruning heuristics: being able to guess which approaches are likely to succeed and when an approach isn't panning out.
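If it helps to make the framing concrete, here's a minimal best-first-search sketch (the names and callbacks are illustrative, not any real system's implementation): the reliability of `apply_move` stands in for "execution", and the quality of `score_move` stands in for "creativity"/pruning.

```python
import heapq

def solve(start, is_goal, legal_moves, score_move, apply_move, budget=10_000):
    """Best-first search over 'board states' (everything proved/calculated so far).

    score_move is the 'creativity' knob: a heuristic guess at how promising a
    move is, used to order/prune the tree.  apply_move is the 'execution' knob:
    it has to transform the state without slipping into an unsound step.
    """
    frontier = [(0.0, 0, start)]          # (priority, tie-break, state)
    expansions = 0
    while frontier and expansions < budget:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for move in legal_moves(state):
            expansions += 1
            child = apply_move(state, move)        # an error here = a "losing state"
            priority = -score_move(state, move)    # promising moves surface first
            heapq.heappush(frontier, (priority, expansions, child))
    return None  # budget exhausted without reaching a winning state
```

On this toy picture, better "creativity" shows up as needing far fewer expansions to reach the goal; better "execution" shows up as `apply_move` rarely corrupting the state and the budget being large enough to get there.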
This also implies that the line between "mere search" and "creativity" is pretty permeable (assuming the requisite data's in the training set) -- it's just a question of how obvious we humans think the heuristics are that lead you to the correct answer. Take a problem like: "Timmy drops a stone down a well, and hears a splash exactly three seconds later. How deep is the well?" To us this feels pretty much like mere search (can you pull up d = (1/2)at^2?), but there are a lot of other potentially-relevant things you _could_ start doing that you have to prune out to know to proceed with that calculation.
If you agree with this framing, it's a bit of a surprise to me that an LLM-based model would be better at execution than creativity given that:
- It's good at heuristically identifying relevant concepts, but
- It has a relatively high step-by-step error rate and can't do that many steps.
Indeed, fun story, when I asked ChatGPT (o3-mini-high) the above question, it got the right answer but through the wrong set of steps (it incorporated the speed of sound, came up with the correct polynomial, said "this is transcendental so we have to solve by numerical estimation", and then did so correctly).
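(For concreteness, here's a quick sketch of both readings of the well problem; g = 9.8 m/s^2 and 343 m/s for the speed of sound are my assumed constants, and I solve the fall-time equation as a quadratic rather than numerically.)

```python
import math

G = 9.8        # m/s^2, assumed gravitational acceleration
V_SOUND = 343  # m/s, assumed speed of sound
TOTAL = 3.0    # seconds from release to hearing the splash

# Naive reading: all 3 seconds are free fall, d = (1/2) g t^2.
d_naive = 0.5 * G * TOTAL**2

# Refined reading: fall time t1 plus sound-travel time d / V_SOUND sums to 3 s.
# With d = (1/2) g t1^2 this is the quadratic (g / (2v)) t1^2 + t1 - 3 = 0.
a, b, c = G / (2 * V_SOUND), 1.0, -TOTAL
t1 = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
d_refined = 0.5 * G * t1**2

print(f"naive: {d_naive:.1f} m; with speed of sound: {d_refined:.1f} m")
# naive: 44.1 m; with speed of sound: about 40.7 m
```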
So, I'm clearly wrong about something. As a more experienced math-problem-solver than me, you can tell me, is my model of the problem-solving process wrong? Or am I wrong about what o3-mini is good at and why?
I like this model of problem-solving a lot, and I think in these very terms sometimes. I don't know if it's a bug or a feature that it's clearly an AI-inflected way of looking at things, but I do find it helpful. (It's also, presumably, exactly how AlphaProof works.) When I've written this out for myself, though, I've had a third concept, which I was calling "judgement". I thought of judgement as what we use to prune a search: to decide whether we're going down a good path or not. I reserved "creativity" for the act of generating new moves to try, like, "well I could try using X" or "I could try reframing the problem as Y". Regardless, I agree this suggests that creativity, on some level, is just about having a good search strategy.
In any case, I do think there's a reasonable story about what's going on with the latest LLM-based models. If you ask the earlier generation, e.g. 4o, to identify the concepts relevant for solving a math problem, it will do a decent job. This is more of a retrieval-style search problem, after all, and it excels there. It won't come up with anything particularly novel: just theorems or techniques that are common enough to have names, or at least to be discussed a fair amount in its corpus. But this is adequate material to solve a lot of math problems.
What models like o1 and o3 seem to have added on top of that is the ability to execute the solutions. I was certainly surprised that this proved possible! But they basically found a way to drive down the step-by-step error rates by a couple orders of magnitude. I don't know the details of how they did this. Part of it was inclining the models to write much longer chains of thought. Part of it might have been RL on annotated problem-solving "transcripts" of human reasoning steps. But something like this seems not too mysterious. At least: it's a reasonable thing to try, and, hey, it worked. So that's the recipe, as I see it, for the current LLM-based model performance.
Of course, it is still limited! What I haven't seen from these models is creativity that goes beyond this sort of fact/technique retrieval. I take this to be a limit on their space-exploring ability. But some math problems require coming up with "whole big new ideas", for lack of a more precise way to put it, and I don't think the LLM-based models are good at this yet. (Even the much-more-specialized AlphaProof seems weak here to me: the two IMO questions it didn't get seemed to me to be the ones that required these bigger leaps out into the search space.) One question, then, is what sort of training data would open up the search space to include these harder "aha" moments? I could imagine needing a lot more scale-up before the models could make these creative leaps, since they're really not "retrieval" oriented.
I've got a draft post including some discussion about the abstract nature of creativity, but it's a slippery concept... I imagine it'll take a while to climb that hill.
Very interesting. o3-mini-high and especially Deep Research seem worse at my subfield than you'd predict from FrontierMath.
I have two different theories:
1. Could some of the problems be not as novel and untouched as their proposers had thought? The literature nowadays is enormous, and researchers constantly reinvent things; the prior art might then get discovered 10 years later. But a well-trained AI could have enough memory to avoid this problem.
2. Could the AI be doing some educated guessing? The problems were supposed to have answers that are not easily guessed, but there are often (particularly in number theory) heuristics that give an exact number with, morally speaking, 95% probability (e.g., if some polynomial is squarefree, or the GRH is true, or some series converges). An AI who gets such guesses right would be indistinguishable from an AI who solves 95% of such problems, unless the authors take care to pick unusual cases often.
I agree, both of these are live issues. Epoch AI discusses both, but they aren't solved.
On (1), they say this: "Our primary method for validating originality relied on expert review by our core team of mathematicians. Their extensive familiarity with competition and research-level problems enabled them to identify potential similarities with existing problems that automated systems might miss. The team conducted thorough manual checks against popular mathematics websites, online repositories, and academic publications. This expert review process was supplemented by verification using plagiarism detection software."
They also ran the problems through two plagiarism detection programs, and got no hits.
I'd call that a good effort, but as you say it's not a guarantee.
More importantly, IMO, the models might not see much of a difference between literature that contains the problem itself verbatim and literature that discusses the key results which, if applied, straightforwardly solve the problem. In the extreme case, maybe the problem itself is novel but there's a Wikipedia page with the key result/formula, and if you think to apply that, you're done.
On (2), Epoch AI is definitely focused on this. For the subset of 35 problems that they put through a second review, they report: "For two of the 35 questions, reviewers proposed strategies for guessing the solution with substantially less effort or computation than was necessary for a rigorous mathematical justification of the answer." They plan to beef this up with, e.g., much more test-solving.
Also, Litt says in another tweet that, on the hardest sample problem, his "no thinking at all" idea was a heuristic approach that correctly computed the first 3 digits of the 6-digit answer before timing out. Source: https://x.com/littmath/status/1870544620498857996 -- and read down for Glazer's response.
So I think these are really good points. More work can probably ameliorate these issues, but we'll always be a bit unsure how much they're at play.
Ah, good that they thought of these! Would be nice to know more, though.
Actually, their sample problem 2 is a great example of "easier if you don't take the high route". You're asked to ensure that the set {(x, y) : p(x) = p(y)} has many components; in other words, there should be many distinct reasons why two values of p can be equal. Well, if you forget for a moment that p is to be a polynomial, then the easiest way to ensure this is by picking a trig function: sin(x) = sin(y) holds not just for x = y but also for x = pi - y and for x = 2pi + y and so on. The closest thing to trig functions among the polynomials are, of course, the Chebyshev polynomials, and so we're naturally led to T_19 (for degree reasons). Then the linear coefficient requirement forces us to make an affine transformation.
I am not saying that this is the AI's thinking, but it might well be.
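(For what it's worth, the Chebyshev identity behind this route, T_n(cos t) = cos(nt), is easy to spot-check numerically; the recurrence implementation below is just an illustrative sketch.)

```python
import math, random

def chebyshev_T(n, x):
    """T_n(x) via the recurrence T_0 = 1, T_1 = x, T_n = 2x*T_{n-1} - T_{n-2}."""
    prev, curr = 1.0, x
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, curr = curr, 2 * x * curr - prev
    return curr

# T_19(cos t) == cos(19t), so T_19(x) = T_19(y) holds whenever
# 19*arccos(x) = +/- 19*arccos(y) + 2*pi*k -- many distinct "reasons"
# for equality, hence many components of {(x, y) : p(x) = p(y)}.
for _ in range(5):
    t = random.uniform(0.0, math.pi)
    assert abs(chebyshev_T(19, math.cos(t)) - math.cos(19 * t)) < 1e-9
print("T_19(cos t) = cos(19 t) holds on random samples")
```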
Another OpenAI employee confirms that the FM reasoning traces guide "high-level" decisions about the model.