Very interesting overview! I wrote a bunch of physics questions for HLE, and tested many more. I came away with roughly similar impressions, and some different ones.
- Difficulty with visualization and preference for calculation. I came up with some very simple questions where the answer is obvious when one carefully visualizes the process in 3D or unfolding in time. These gave all the models a lot of trouble. One problem I had (but didn't submit, since it was too guessable) involved a sequence of rotations of a person's arm, which a 5-year-old could do. It stumped o1, but o3-mini got it by working through explicit rotation matrices! (A sketch of that style of solution is at the end of this comment.) One of my submitted questions (highlighted in the HLE paper) requires visualizing an object that's rotating while accelerating and changing its center of rotation; it can be done with two lines of algebra. Again, it stumped o1, but o3-mini got it by setting up coordinates and working through a completely explicit solution, which a human in an exam wouldn't do because it takes 20x longer.
- Preference for steps well-represented in the training data. I was able to stump all the models using a simple elastic collision problem where three frictionless balls collide at the same time. (Again, not submitted to HLE on the basis of being too easy.) To do the problem, you need the usual energy and momentum conservation, but you also need to observe that the impulses point in known directions because of the geometry of the situation. However, no model ever thought of that last step; they all just went in circles with energy and momentum conservation forever, even though those clearly don't provide enough equations. (A worked toy version of this is sketched at the end of this comment.) I suspect this is because there are thousands of basic, high-school classroom collision problems in the training set, all of which are solvable using those equations alone.
- Reliance on (sometimes hallucinated) background knowledge. I had a problem involving estimating the drag force of the cosmic microwave background on the Earth (not submitted to HLE due to having a simple final answer, so too guessable). It works because you won't find it anywhere in textbooks (it's far too small to ever matter, so nobody cares), but you can estimate it in a few steps from first principles (the gist of the estimate is sketched at the end of this comment). o1 really wanted to avoid doing this (maybe because those steps aren't common in the training data?) and instead recalled a nonexistent formula that yielded an incorrect answer in one step. I can see why the models do that, though, because it does help in other cases. I had a bunch of failed question ideas that could be solved either by a careful and insightful elementary argument, or by just remembering a single obscure result (i.e., obscure enough to appear only in specialized monographs, but not in graduate textbooks). Whenever the latter was possible, the models did it.
- Difficulty in performing qualitative steps. One common phenomenon in harder physics problems is that an exact solution is almost impossible, so one has to apply approximations and scaling arguments. That's tough for models, because it has to be done very carefully; incorrect arguments sound very similar to correct ones. I have a question in HLE involving a single simple system with six separate scaling exponents. In my testing, each exponent was gotten right by some model at some point, but the models aren't careful enough to get all six right simultaneously. A human can do it by running consistency checks, or by sketching out how an exact calculation would go.
- Chain of thought sometimes discarded in the final answer. On numerous occasions, I observed models with completely wrong chains of thought miraculously guess a correct final answer that wasn't clearly connected to anything they did. In other cases, the chain of thought was roughly correct, but the final answer came out wrong anyway. There was an interesting benchmark of theoretical physics questions released recently:
https://arxiv.org/abs/2502.15815
Page 43 shows an example where models are asked to work out the form of a SUSY transformation. This is pretty straightforward and mechanical stuff, covered in many textbooks. It's always annoying, because there are a lot of opportunities for minus signs to crop up depending on your conventions, and there are multiple reasonable conventions, so one has to check it every time. For the conventions used in the problem, the answer turns out to have the opposite sign from the most common textbook result. However, every single model (including o3-mini) has a 0/5 success rate -- they _always_ return the wrong sign! Though I don't know what the chains of thought looked like, I imagine it's the same effect: the models strongly "want" to return an answer similar to what they've seen before. Either an extra sign gets tossed in randomly somewhere, or the derived result is discarded and the familiar textbook result presented instead. This seems to be an artifact of how the models are trained; it will certainly improve benchmark scores on average, but it makes the models difficult to trust for anything.
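
To make the "explicit rotation matrices" point concrete, here's the style of solution I mean, on a made-up rotation sequence (the axes and angles below are chosen purely for illustration; they're not the ones from the actual question):

```python
import numpy as np

def rot(axis, angle_deg):
    """3x3 rotation matrix about the x, y, or z axis by the given angle in degrees."""
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    if axis == 'x':
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 'y':
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Made-up sequence: an arm pointing along +x is rotated 90 degrees about the
# vertical z-axis, then 90 degrees about the x-axis (both about fixed lab axes,
# so the matrices are left-multiplied in order of application).
arm = np.array([1.0, 0.0, 0.0])
R = rot('x', 90) @ rot('z', 90)
print(np.round(R @ arm, 6))   # [0. 0. 1.] -- the arm ends up pointing straight up
```

A person answers this sort of question by picturing their own arm; the models that succeed grind through the matrix algebra instead, which is reliable but much slower.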
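
For the three-ball collision point, here's how the missing observation closes the system, worked on a hypothetical symmetric configuration (a ball striking two touching, identical, resting balls head-on; this is not necessarily the geometry of my question). In 2D there are six unknown final velocity components, while momentum (two equations) and energy (one) give only three; noting that each frictionless contact transmits its impulse along the line of centers, plus the left-right symmetry, reduces everything to one unknown impulse:

```python
import sympy as sp

# Hypothetical symmetric setup (not necessarily my question's geometry):
# ball A moves along +x at speed v and simultaneously strikes two identical
# resting balls B and C that touch each other, so the two contact normals
# make angles of +/-30 degrees with the x-axis. All masses are equal.
v, j = sp.symbols('v j', positive=True)              # j = impulse magnitude / mass
nB = sp.Matrix([sp.sqrt(3)/2,  sp.Rational(1, 2)])   # unit normal from A toward B
nC = sp.Matrix([sp.sqrt(3)/2, -sp.Rational(1, 2)])   # unit normal from A toward C

# Frictionless contacts: each impulse acts along its line of centers, and the
# two magnitudes are equal by symmetry, so one scalar j fixes every velocity.
# Total momentum is then conserved automatically (equal and opposite impulses).
vA = sp.Matrix([v, 0]) - j*(nB + nC)
vB = j*nB
vC = j*nC

# Elastic collision: kinetic energy conservation supplies the last equation.
energy = sp.Eq(v**2, vA.dot(vA) + vB.dot(vB) + vC.dot(vC))
j_val = sp.solve(energy, j)[0]
print(j_val)                  # 2*sqrt(3)*v/5
print(vA.subs(j, j_val))      # Matrix([[-v/5], [0]]): A recoils along -x at v/5
```

The step the models never take is exactly the one that turns an underdetermined system into a single equation in a single unknown.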
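
For the CMB drag estimate, the gist of the first-principles argument, with every order-one factor dropped (the exact coefficient depends on how the body absorbs, reflects, and re-emits), is that a body of cross-section A moving at speed v through isotropic radiation of energy density u feels a drag of order u A v/c:

```python
import math

# Order-of-magnitude estimate only; all O(1) prefactors are dropped.
sigma = 5.670e-8      # Stefan-Boltzmann constant, W m^-2 K^-4
c     = 2.998e8       # speed of light, m/s
T_cmb = 2.725         # CMB temperature, K
R_e   = 6.371e6       # Earth radius, m
v     = 370e3         # Earth's speed relative to the CMB rest frame, m/s

u = 4 * sigma * T_cmb**4 / c     # CMB energy density, ~4e-14 J/m^3
A = math.pi * R_e**2             # Earth's cross-sectional area, m^2
F = u * A * (v / c)              # drag force ~ u * A * (v/c)

print(f"u ~ {u:.1e} J/m^3, F ~ {F:.1e} N")   # roughly a few millinewtons
```

A few millinewtons on an entire planet, which is why no textbook bothers with it.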
That's good stuff!
- Visualization. Definitely! I saw one HLE question that's like, "Take a regular tetrahedron and octahedron with same-sized faces. Glue them together along a face. How many faces does the resulting polyhedron have?" It's not the general-case answer you might expect, because each of the tetrahedron's non-glued faces happens to be coplanar with one of the octahedron's faces (a quick coordinate check of this is sketched at the end of this comment). I'm not sure I like this question's guessability, but it's a nice example of simple spatial reasoning that (having just tried o3-mini-high) the models still get wrong.
- Training Data. I really wish I had some idea for how to get a better handle on just how much they are generalizing past their training data. Clearly they're generalizing to some degree! But I share your instinct that they can be tripped up by seeing problems that only differ subtly from a large set of problems they've seen before. Any ideas for a way to measure this?
- Background knowledge. Yeah, they love applying facts. Sometimes I think their ability can be summarized as knowing a lot of facts and being able to string them together pretty competently. Here's a good discussion on doing this for pretty advanced math, where the sophistication of the background obscures the actual difficulty of the problem: https://x.com/littmath/status/1898461323391815820
- Qualitative steps. I think this is related: I've noticed they're worse at proofs than you'd maybe expect. They don't seem to have as strong a sense for whether a step in an argument is justified. If they're really just trained using reinforcement learning with a binary objective (i.e., right/wrong final answer, or tests pass/don't pass), then maybe that explains why they don't learn a crisp sense of which intermediate steps are right or wrong.
- Physics benchmark. Thanks for the pointer! I hadn't seen that one. Very neat that model performance did correlate with their difficulty scores. And yeah, it's interesting that they report models sometimes just pulling the final answer out of thin air -- which they speculate happens because the answer is memorized. This is where DeepSeek-R1 is really useful, since it displays the whole CoT. The OpenAI models display only a summarized version, and I think they're pretty heavily censoring it. I've seen DeepSeek-R1 sometimes say something like, "It doesn't seem like I've made any progress. I should probably give up. But I've got to say something to the user, so I guess I'll just take a guess that XYZ is correct." That kind of self-awareness seems really useful! But I haven't seen it harnessed yet.
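
Re: the glued tetrahedron and octahedron above, since the whole trick is the coplanarity, here's a quick numerical check with coordinates I picked for convenience (octahedron vertices on the coordinate axes, tetrahedron apex at (1,1,1)):

```python
import numpy as np
from itertools import combinations

# Regular octahedron with vertices on the axes; every edge has length sqrt(2).
oct_verts = [np.array(p, float) for p in
             [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]]
oct_faces = [f for f in combinations(oct_verts, 3)
             if all(np.isclose(np.linalg.norm(a - b), np.sqrt(2))
                    for a, b in combinations(f, 2))]        # the 8 triangular faces

# Glue a regular tetrahedron onto the face {(1,0,0), (0,1,0), (0,0,1)}.
glue = [np.array(p, float) for p in [(1,0,0), (0,1,0), (0,0,1)]]
apex = np.array([1.0, 1.0, 1.0])        # distance sqrt(2) from each glue vertex
tet_faces = [(apex, glue[i], glue[j]) for i, j in combinations(range(3), 2)]

# Outer triangles of the glued solid: drop the shared (now internal) face.
outer = tet_faces + [f for f in oct_faces
                     if not all(any(np.allclose(p, g) for g in glue) for p in f)]

# Count distinct supporting planes; coplanar adjacent triangles merge into one face.
planes = set()
for a, b, c in outer:
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    d = float(n @ a)
    if d < 0:                  # fix an orientation (no plane passes through the origin here)
        n, d = -n, -d
    planes.add((tuple(np.round(n, 6)), round(d, 6)))

print(len(outer), len(planes))   # 10 triangles, but only 7 distinct face planes
```

So the naive count of 4 + 8 - 2 = 10 overcounts: three pairs of adjacent triangles merge, and the glued solid has 7 faces.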