Discussion about this post

Jacob:

I kiiiind of model human math problem solving as a heuristic tree search? Like at any given point you have a "board state" (things you know/have calculated) and a bunch of potential "moves" (next things to try to prove/calculate) and you have to get to a winning state in as few moves as possible. Some moves take you to losing board states (unsound proofs/calculation errors) and must be avoided. Indeed, automated theorem proving was (Wikipedia tells me) one of the original applications of MCTS long before AlphaGo and friends.

I find this a useful way to operationalize what "creativity" and "execution" could mean in terms of AI capabilities. "Execution" is a combination of being good at avoiding losing states (if you have to do 100 steps in a row correctly, you'd better screw up less than 1% of the time) and raw ability to explore the space. "Creativity" is mostly about better move evaluation/tree-pruning heuristics: being able to guess which approaches are likely to succeed and when an approach isn't panning out.
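As a toy sketch of this framing (all names and the toy problem below are made up, not anything from an actual prover): "execution" is the loop that expands states without error, and "creativity" is the `score` heuristic that decides which frontier state to try next.

```python
import heapq

def heuristic_search(start, goal_test, moves, score):
    """Best-first search over 'board states'.

    moves(state) yields candidate next states (the legal 'moves');
    score(state) is the pruning heuristic -- lower means more promising.
    """
    counter = 0  # tie-breaker so heapq never has to compare states directly
    frontier = [(score(start), counter, start)]
    seen = {start}
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if goal_test(state):
            return state
        for nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (score(nxt), counter, nxt))
    return None  # exhausted the space without reaching a winning state
```

A perfect heuristic walks straight to the goal; a flat one degenerates into blind ("mere") search, which is exactly the permeable line described below.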

This also implies that the line between "mere search" and "creativity" is pretty permeable (assuming the requisite data's in the training set) -- it's just a question of how obvious we humans think the heuristics are that lead you to the correct answer. Take a problem like: "Timmy drops a stone down a well, and hears a splash exactly three seconds later. How deep is the well?" To us this feels pretty much like mere search (can you pull up d = (1/2)at^2?), but there are a lot of other potentially relevant things you _could_ start doing that you have to prune out to know to proceed with that calculation.
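For concreteness, the "mere search" version of the calculation, once you've pruned down to the right move (ignoring the sound's travel time back up the well):

```python
# Free fall from rest: d = (1/2) * a * t^2, with a = g
g = 9.8   # m/s^2, gravitational acceleration
t = 3.0   # s, time until the splash is heard
depth = 0.5 * g * t**2
print(depth)  # 44.1 m
```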

If you agree with this framing, it's a bit of a surprise to me that an LLM-based model would be better at execution than creativity given that:

- It's good at heuristically identifying relevant concepts, but

- It has a relatively high step-by-step error rate and can't do that many steps.

Indeed, fun story, when I asked ChatGPT (o3-mini-high) the above question, it got the right answer but through the wrong set of steps (it incorporated the speed of sound, came up with the correct polynomial, said "this is transcendental so we have to solve by numerical estimation", and then did so correctly).
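For what it's worth, the sound-aware version it took can be reproduced in a few lines. This is my own sketch, not o3-mini's actual steps: the fall time sqrt(2d/g) plus the sound's return time d/c must total 3 s, and since total time is increasing in depth, bisection finds the unique root (343 m/s for the speed of sound is an assumption).

```python
import math

g, c, T = 9.8, 343.0, 3.0  # gravity (m/s^2), speed of sound (m/s), total time (s)

def total_time(d):
    # time for the stone to fall depth d, plus time for the splash sound to return
    return math.sqrt(2 * d / g) + d / c

# total_time is monotonically increasing in d, so bisect on [0, 100] m
lo, hi = 0.0, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if total_time(mid) < T:
        lo = mid
    else:
        hi = mid
print(round(lo, 1))  # a bit shallower than the naive 44.1 m
```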

So, I'm clearly wrong about something. As a more experienced math-problem-solver than me, you can tell me, is my model of the problem-solving process wrong? Or am I wrong about what o3-mini is good at and why?

ZFC:

Very interesting. o3-mini-high, and especially Deep Research, seem worse at my subfield than you'd predict from FrontierMath.
