I spent a fair amount of time on P5 and was so convinced that the answer was somewhere in the vicinity of 12 (a diagonal line being essentially worst case, and requiring a binary search to find the spot where the pattern deviates from a perfect diagonal). Embarrassed at how far off I was. A beautiful problem.
Thanks as always for the thorough analysis! There's so much value to be obtained from really digging deep into how the models approach individual problems.
If it's any consolation, I spent 1-1.5 hours on it and wound up in the exact same spot. Naturally, I could not prove that binary search was optimal! I remember early on having a sort of proto-thought to the effect of "should I explore the first row?" but deciding not to pursue that line of thinking, and never revisiting it. With hindsight, that might have been more fruitful.
I think we will need both LLMs and FPMs working together. The LLMs have a lot of "holes" that the FPMs can fill. But there are a lot of tasks that the FPMs can't really even start on.
I think we need a step like "looking for interesting intermediate results". In practice, you don't solve these problems by guessing the answer and then cranking away. You start asking things like: what if a = b, what can we prove? Or: oh, it looks like g has to divide both a - 1 and b - 1, that seems useful. And while you look for interesting intermediate results, you notice things that pop up.
For example, on the snail problem, there's an obvious way to look for intermediate results. Think of some strategies and see how well they do. One strategy is, just try column 1, then just try column 2, etc.
(spoiler alert)
After a little while you see: well, if I just try column n and find the monster, I can try "dodging around" just that monster. But columns n-1 and n+1 could have the monster one step away diagonally. So that's the worst case, and in fact, the worst case is finding the monster right in the middle of column n. Hmm, if only you could find the monster *not* in the middle of the column. And then this line of reasoning suggests the actual optimal strategy.
I think the FPMs can just never really do this evaluation of whether an intermediate result is interesting or not. That is a very "LLM" task.
Plus, in actual non-contest mathematics, looking for interesting intermediate results is useful all the time. You aren't going to be able to tell it "please prove P != NP now" but you could tell it, okay here are all of these problems and the best known algorithms for them, what other interesting stuff is there. New problems, new algorithms, etc.
Anyway this is a very, very interesting post and thank you for making it.
I broadly agree, though I will say that evaluating "is this an interesting intermediate result" is related to what the so-called prover model + MCTS is supposed to provide. I think it's analogous to judging whether a move in a game will take you in a winning direction, even if you can't see all the way to the end of the game. LLMs' generative abilities make them seem like better brainstorming partners, but I'm not sure less-linguistic systems like AlphaProof can't also get good statistical intuitions for which direction to take a proof. The space of possibly-good technical directions seems wide open to me!
That's true. I guess you could build something entirely different as well. Like design some subsystem that tries to generate "interesting intermediate results", treat that as a single step in a graph, and run MCTS on that model. So an entirely separate architecture for either "prove this result using tree search on the underlying prover" or "use an LLM to generate code and hope it compiles". Perhaps in some sense this is equivalent to training a reasoning LLM that can call out to the prover in substeps.
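For what it's worth, the "generate intermediate results, then run tree search over them" idea can be sketched generically. Here's a minimal toy MCTS loop in Python, where `propose_steps` is a stub standing in for the intermediate-result generator and `rollout` for the evaluator; the toy domain (reach a target number via +1 / *2 steps) and all the names are made up purely for illustration:

```python
import math
import random

# Toy stand-in for the architecture above: each node is an "intermediate
# result", and a (hypothetical) generator proposes candidate next results.
# In the real system the generator would be an LLM and the evaluator a
# prover; here the domain is just reaching a target number via +1 / *2.

TARGET = 37

def propose_steps(state):
    # Stub "interesting intermediate result" generator.
    return [state + 1, state * 2]

def rollout(state, depth=10):
    # Random playout; reward is closeness to the target.
    for _ in range(depth):
        if state == TARGET:
            break
        state = random.choice(propose_steps(state))
    return 1.0 / (1 + abs(TARGET - state))

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb1(self, c=1.4):
        # Standard UCB1: exploit average reward, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB1 until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb1)
        # Expansion: ask the generator for candidate next results.
        if node.visits > 0 and node.state != TARGET:
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        # Simulation + backpropagation.
        reward = rollout(node.state)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step from the root.
    return max(root.children, key=lambda n: n.visits).state

print(mcts(3))
```

The point of the sketch is just that the generator is an opaque, pluggable step: the search loop never needs to understand *why* a proposed intermediate result is interesting, only how rollouts from it score, which is the separation of concerns the comment above is gesturing at.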