Discussion about this post

knzhou:

Very interesting overview! I wrote a bunch of physics questions for HLE, and tested many more. I came away with roughly similar impressions, and some different ones.

- Difficulty with visualization and preference for calculation. I came up with some very simple questions where the answer is obvious when one carefully visualizes the process in 3D or unfolding in time. These gave all the models a lot of trouble. One problem I had (but didn't submit, since it was too guessable) involved a sequence of rotations of a person's arm, which a 5-year-old could do. It stumped o1, but o3-mini got it by working through explicit rotation matrices (a sketch of that brute-force approach is at the end of this comment)! One of my submitted questions (highlighted in the HLE paper) requires visualizing an object that's rotating while accelerating and changing its center of rotation; it can be done with two lines of algebra. Again, it stumped o1, but o3-mini got it by setting up coordinates and working through a completely explicit solution, which a human in an exam wouldn't do because it takes 20x longer.

- Preference for steps well-represented in the training data. I was able to stump all the models using a simple elastic collision problem in which three frictionless balls collide at the same time. (Again, not submitted to HLE on the basis of being too easy.) To do the problem, you need the usual energy and momentum conservation, but you also need to observe that the impulses are in known directions due to the geometry of the situation (a sketch of that structure is at the end of this comment). However, no model ever thought of that last step; they all just went in circles with energy and momentum conservation forever, even though those clearly don't provide enough equations. I suspect this is because there are thousands of basic, high-school classroom collision problems in the training set, all of which are solvable using those equations alone.

- Reliance on (sometimes hallucinated) background knowledge. I had a problem involving estimating the drag force exerted by the cosmic microwave background on the Earth (not submitted to HLE because it has a simple final answer, so it's too guessable). It works because you won't find it anywhere in textbooks (the force is too small to ever matter, so nobody cares), but you can estimate it in a few steps from first principles (a rough version of the estimate is at the end of this comment). o1 really wanted to avoid doing this (maybe because those steps aren't common in the training data?) and instead recalled a nonexistent formula that yielded an incorrect answer in one step. I can see why the models do that, though, because it does help in other cases. I had a bunch of failed question ideas that could be solved either by a careful and insightful elementary argument, or by just remembering a single obscure result (i.e., obscure enough to appear in specialized monographs, but not in graduate textbooks). Whenever the latter was possible, the models did it.

- Difficulty in performing qualitative steps. One common phenomenon in harder physics problems is that an exact solution is almost impossible, so one has to apply approximations and scaling arguments. This is tough for models, because it has to be done very carefully; incorrect arguments sound very similar to correct ones. I have a question in HLE involving a single simple system with 6 separate scaling exponents. In my testing, each of the exponents was gotten right by some model at some point, but no model is careful enough to get them all right simultaneously. A human can do it by running consistency checks, or by sketching out how an exact calculation would go.

- Chain of thought sometimes discarded in the final answer. On numerous occasions, I observed models with completely wrong chains of thought miraculously guess a correct final answer that was not clearly connected to anything they did. In other cases, the chain of thought was roughly correct, but the final answer came out wrong anyway. There was an interesting benchmark of theoretical physics questions released recently:

https://arxiv.org/abs/2502.15815

Page 43 shows an example where models are asked to work out the form of a SUSY transformation. This is pretty straightforward and mechanical stuff, covered in many textbooks. It's always annoying, because there are many opportunities for minus signs to crop up depending on your conventions, and there are multiple reasonable conventions, so one has to check it every time. For the conventions used in the problem, the answer turns out to have the opposite sign from the most common textbook result. However, every single model (including o3-mini) has a 0/5 success rate -- they _always_ return the wrong sign! Though I don't know what the chains of thought looked like, I imagine it's the same effect: the models strongly "want" to return an answer similar to what they've seen before. Either an extra sign gets tossed in somewhere along the way, or the derived result is ignored and the familiar textbook sign is presented instead. This seems to be an artifact of how the models are trained; it will certainly help benchmarks on average, but it makes the models difficult to trust for anything.
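
To make a few of the points above concrete, here are some sketches. First, the explicit-rotation-matrix route o3-mini took on the arm question. The actual rotation sequence from that unsubmitted problem isn't reproduced here, so the sequence below is just an illustrative stand-in; the point is that a model can grind through matrices where a person would simply picture their arm moving.

```python
import numpy as np

def rot(axis, angle_deg):
    """Right-handed rotation matrix about a coordinate axis."""
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    if axis == 'x':
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 'y':
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # 'z'

# Arm initially pointing forward (+y); apply a made-up sequence of rotations.
arm = np.array([0.0, 1.0, 0.0])
for axis, angle in [('z', 90), ('x', 90), ('z', -90)]:
    arm = rot(axis, angle) @ arm

print(np.round(arm, 6))  # final direction of the arm
```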
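
Second, the collision point. The specific geometry of my three-ball problem isn't spelled out above, but the structure is the same in any frictionless simultaneous collision: each impulse acts along a line of centers, so each struck ball moves off in a known direction. Below is that structure in a standard symmetric stand-in, a ball moving along +x that simultaneously strikes two identical stationary balls whose lines of centers sit at ±30° to its path.

```python
# Standard symmetric stand-in (not my actual problem). Momentum and energy give
# two equations; the third fact, that each impulse (and hence each struck ball's
# velocity) lies along its line of centers, is what closes the system.
# y-momentum is satisfied automatically by symmetry.
import sympy as sp

v = sp.symbols('v', positive=True)  # incoming speed
u = sp.symbols('u', real=True)      # incoming ball's final x-velocity (can be negative)
w = sp.symbols('w', real=True)      # speed of each struck ball along its line of centers

momentum_x = sp.Eq(v, u + 2 * w * sp.cos(sp.pi / 6))  # x-momentum conservation
energy     = sp.Eq(v**2, u**2 + 2 * w**2)             # kinetic-energy conservation

print(sp.solve([momentum_x, energy], [u, w], dict=True))
# Two branches: the trivial w = 0 (no collision) and the physical one,
# u = -v/5 (the incoming ball recoils) with w = 2*sqrt(3)*v/5 at +/-30 degrees.
```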
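
Finally, a rough version of the CMB drag estimate, to show that the whole thing is a handful of first-principles steps with no formula to look up. A body moving at v << c through isotropic blackbody radiation sees a slightly hotter sky ahead and a cooler one behind, and the net momentum flux it intercepts gives a drag of order (v/c) times the radiation energy density times its cross-section. This is a reconstruction of that kind of estimate, not necessarily the exact chain of steps the question intended, and only the scaling is meant seriously: the O(1) prefactor depends on how the body absorbs and re-emits.

```python
import math

sigma = 5.670e-8        # Stefan-Boltzmann constant, W m^-2 K^-4
c     = 2.998e8         # speed of light, m/s
a_rad = 4 * sigma / c   # radiation constant, J m^-3 K^-4

T_cmb   = 2.725                  # CMB temperature, K
u_cmb   = a_rad * T_cmb**4       # CMB energy density, about 4e-14 J/m^3
R_earth = 6.371e6                # Earth radius, m
A       = math.pi * R_earth**2   # geometric cross-section, m^2
v       = 370e3                  # Earth's speed relative to the CMB frame, m/s

F = (v / c) * u_cmb * A          # leading-order drag, up to an O(1) factor
print(f"F ~ {F:.1e} N")          # a few millinewtons: utterly negligible, hence absent from textbooks
```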
