Coaxing USAMO Proofs From o3-mini-high

Greg Burnham

Mar 28

Decent USAMO performance may be closer than headline results suggest


5 Comments

NIGHTREE

Apr 1

Foolishness Index ⭐⭐⭐ (o1 pro): tripping on flat ground

Given a thin, long water pipe as a water source, you have three unmarked water cups with capacities of 5 liters, 6 liters, and 7 liters. You can aim the water pipe at the opening of a cup and press a switch to fill it. Special Note: If you pour out the water from a cup (emptying it completely, as if pouring it on the ground, because you cannot return water to the source, which is a thin pipe), it will be considered waste. How can you obtain exactly 8 liters of water using these 3 cups while minimizing water waste?

https://chatgpt.com/share/67570ef1-166c-8010-9970-62f37aadf497

***************************************************************************************

You have a water reservoir with abundant water and three unmarked water jugs with known capacities of 5 liters, 6 liters, and 7 liters. The machine will only fill a completely empty jug when you place it inside. Special Note: You can empty a jug by pouring its contents into another jug, but if you pour water out without transferring it to another jug, as if pouring it on the ground, it will be considered "waste". How can you obtain exactly 8 liters of water using these 3 jugs while minimizing water waste?

https://chatgpt.com/share/67570e96-1d9c-8010-bfc3-afaf609d010c
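For readers who want to check answers to the two variants above, here is a minimal sketch of a brute-force search, not anything taken from the linked chats. It makes two assumptions that the puzzle text leaves open: "obtaining 8 liters" is taken to mean that the water held across all three containers totals exactly 8, and a fill always runs until the container is full (the containers are unmarked). The names `moves` and `min_waste` and the `fill_only_when_empty` flag are hypothetical, introduced only for illustration.

```python
import heapq


def moves(state, capacities, fill_only_when_empty):
    """Yield (waste, next_state) pairs for every legal move from `state`."""
    n = len(state)
    for i in range(n):
        # Fill container i to the brim. Variant 2 only allows this when it is empty.
        can_fill = state[i] < capacities[i] and (
            state[i] == 0 or not fill_only_when_empty
        )
        if can_fill:
            nxt = list(state)
            nxt[i] = capacities[i]
            yield 0, tuple(nxt)
        if state[i] > 0:
            # Pour container i out on the ground: everything in it is waste.
            nxt = list(state)
            nxt[i] = 0
            yield state[i], tuple(nxt)
            # Pour container i into container j until i is empty or j is full.
            for j in range(n):
                if j != i and state[j] < capacities[j]:
                    amount = min(state[i], capacities[j] - state[j])
                    nxt = list(state)
                    nxt[i] -= amount
                    nxt[j] += amount
                    yield 0, tuple(nxt)


def min_waste(capacities, target, fill_only_when_empty):
    """Dijkstra over container states, with wasted liters as the path cost.

    Assumption: the goal is reached when the total water held equals `target`.
    Returns the minimum waste, or None if the target is unreachable.
    """
    start = tuple(0 for _ in capacities)
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        waste, state = heapq.heappop(heap)
        if waste > best.get(state, float("inf")):
            continue  # stale heap entry
        if sum(state) == target:
            return waste
        for cost, nxt in moves(state, capacities, fill_only_when_empty):
            new_waste = waste + cost
            if new_waste < best.get(nxt, float("inf")):
                best[nxt] = new_waste
                heapq.heappush(heap, (new_waste, nxt))
    return None


pipe_variant = min_waste((5, 6, 7), 8, fill_only_when_empty=False)
machine_variant = min_waste((5, 6, 7), 8, fill_only_when_empty=True)
print(pipe_variant, machine_variant)
```

Under these assumptions the pipe variant needs no waste at all (fill the 6, pour it into the 5, move those 5 liters to the 7, then top the 7 up), while the machine variant cannot avoid waste, because every fill adds exactly 5, 6, or 7 liters and no such sum equals 8. A different reading of "obtain exactly 8 liters" would change the goal test and possibly the answer.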


NIGHTREE

Apr 1 (Edited)

Foolishness Index ⭐⭐: Can LLMs reason?

You have 11 balls. One of them is counterfeit and is either lighter or heavier than the others, which are all genuine and have the same weight. You have a balance scale, which can only compare the weights of the two sides. Your goal is to identify the counterfeit ball and determine whether it is lighter or heavier, using the fewest possible weighings.

However, there are additional constraints:

* **Initial `p` value:** `p` starts at 0.

* **`p` increment:** Each ball that has been placed on the scale *at least once before* will increment the counter `p` by 1 *each time* it is placed on the scale again.

* Example: If ball #1 and ball #2 have each been weighed once previously, placing *both* of them on the scale again will increase `p` by 2.

* **`p` Limit:** The value of `p` can increase to a maximum of 1.

* **Minimize weighings:** The number of weighings must be minimized
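As a sanity check on any claimed answer, the standard counting argument gives a floor on the number of weighings. The sketch below is only that floor: it ignores the `p` constraint entirely, so the true minimum for this puzzle may well be higher. The function name `weighing_lower_bound` is hypothetical.

```python
def weighing_lower_bound(num_balls):
    """Information-theoretic lower bound on weighings.

    Each ball may be the counterfeit and may be lighter or heavier, giving
    2 * num_balls hypotheses. One weighing has three outcomes (left heavy,
    right heavy, balanced), so k weighings distinguish at most 3**k cases.
    """
    hypotheses = 2 * num_balls
    weighings = 0
    distinguishable = 1
    while distinguishable < hypotheses:
        distinguishable *= 3
        weighings += 1
    return weighings


bound_for_11 = weighing_lower_bound(11)
print(bound_for_11)
```

With 11 balls there are 22 hypotheses, and two weighings distinguish at most 9 of them, so no strategy can finish in fewer than 3 weighings regardless of how `p` is handled.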


NIGHTREE

Apr 1

Foolishness Index ⭐: Will you spend money?

You have twelve balls that appear identical. However, an invisible magical insect is initially attached to one of the balls. This insect randomly either increases or decreases the weight of the ball it's attached to. This weight alteration *only* exists while the insect is attached; if the insect moves, the previously affected ball returns to its normal weight.

You have a balance scale. However, each time you want to see (refresh the display of) which side is heavier, you must pay $10. Each new measurement requires a new payment.

The insect has a peculiar behavior: whenever the ball it's currently attached to is removed from the scale (e.g., you pick it up or otherwise remove it), *and* the other side of the scale is *not* empty (contains at least one ball), the insect will randomly jump to one of the balls on the *opposite* side of the scale.

You have a single-use trap. What is the best strategy to identify the ball with the insect and trap it while minimizing your expenses?


NIGHTREE

Apr 1

Many people write questions that they believe are new, but the training data of the model under test often contains similar structures. Even when the match is not exact or nearly exact, the difficulty drops to a varying degree according to the [similarity of learned mathematical structures]. The scientific experimental method is to try a variety of different problems and take the lower bound of [the absolute difficulty of the problems the model cannot solve] as its true [generalization reasoning ability].

The models just memorize many similar, highly specific, closely fitting structures, as if covering "sporadic three-dimensional fragments" of the "problem-solving space". They are simply unable to handle original problems that are dozens of times simpler than AIME.


NIGHTREE

Apr 1 (Edited)

Hey Mr. Burnham, I'm glad to see your research. I've done some experiments on this before, and my conclusion is:

The [real problem-solving ability] of the most advanced models, o1 pro or o3-mini-high, is only about 1 on the AoPS scale (AoPS divides difficulty into levels 1-10, with 1 the easiest), and even that is not stable.

Art of Problem Solving (AoPS) is a forum for discussing middle school math.

This is their demonstration of each difficulty level. SOTA models are said to score close to 80% on AIME-level problems (difficulty 3-6). However, on my original (not completely original) fool questions, which are dozens of times simpler than AIME, o1 pro couldn't even get close to the correct idea.

It's like a person who shows off his 100-meter race medal, but falls every 10 meters when walking on flat ground.

It's like a person who shows off his MVP certificate from the school basketball game, but even with basic dribbling, the ball hits his feet, bounces up, and hits his nose.

It's like a self-proclaimed "racing hotshot" who constantly brags about his jaw-dropping "stunt drift" skills, yet the second you ask him to park in the garage, he's sweating bullets like a complete novice.


Lemmata
