Lots of great information and insight here on FrontierMath and the o3 results!
I was slightly confused by this passage:
> One comment about these reasoning traces [that OpenAI sent from their internal evaluations]: they are definitely a lot more impressive than what we get online [using ChatGPT or the API]. Almost surely they put a lot of resources into solving these problems when they do their internal evaluations.
> Note: Epoch AI has since run their own evaluations, and made all outputs available for the 10 public FrontierMath problems. You can inspect the results here.
Following the "inspect the results" link, it looks like Epoch AI has evaluated o3-mini but not the full o3 model? That's what I'd expect, since (AFAIK) OpenAI is not providing public access to o3 (except that, reportedly, their Deep Research tool uses o3). But wouldn't that also explain why OpenAI's internal evaluations were "more impressive than what we [Epoch] get online"?
(Also, I forget the details regarding OpenAI's internal evaluation, but I believe their most impressive numbers involved running the model many times on each problem?)
Yes, I believe it's clear that Epoch AI tested o3-mini (both "medium" and "high"). And your understanding matches mine, that full o3 is not publicly available, at least not directly.
However, as far as I'm aware, the most recent reported results from OpenAI on FrontierMath were for o3-mini, not full o3. I'm basing this off the blog post here: https://openai.com/index/openai-o3-mini/
If you search that page for FrontierMath, you'll find a note that says, "On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems."
So that is about "o3‑mini with high reasoning effort", though seemingly also in some sort of additional scaffold where it has access to a Python tool -- something the public o3-mini-high lacks. (Also see the table above that note for some stats on running the model multiple times per problem, which might be what you recall seeing.)
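To be concrete about what such a scaffold might look like: the public API doesn't ship a built-in Python tool for o3-mini, but you can approximate one yourself with function calling. This is purely a hypothetical sketch (the `run_python` tool name and the loop are my own invention, not anything OpenAI has described), assuming the OpenAI Python SDK (openai >= 1.x):

```python
# Hypothetical sketch of a Python-tool scaffold built on the public function-calling API.
# This is NOT OpenAI's internal setup; the tool name and the loop are my own invention.
import json
import subprocess

from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool name
        "description": "Run a Python script and return its stdout (or stderr on failure).",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

messages = [{"role": "user", "content": "Use the Python tool to compute the 10th prime."}]

while True:
    resp = client.chat.completions.create(model="o3-mini", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)  # keep the assistant's tool-call turn in the transcript
    for call in msg.tool_calls:
        code = json.loads(call.function.arguments)["code"]
        # A real scaffold would sandbox this; running untrusted model-written code directly is unsafe.
        result = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=60)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result.stdout or result.stderr,
        })
```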
So, my understanding is that OpenAI provided the reasoning traces from that experiment to Epoch AI.
If that's true, it suggests that OpenAI might have a way of getting o3-mini to think even longer than the publicly available o3-mini-high does. That doesn't seem like too big a leap: the publicly available o3-mini already comes with three "reasoning effort" settings (low/medium/high). I don't know exactly what changes under the hood when this setting is changed; it sounds like there's a single model with some parameter that controls how long it "thinks", and if OpenAI has explained this in any greater depth, I haven't seen it. But if the underlying setting is really a number, and the public low/medium/high are just three fixed values of it, then for their full internal FrontierMath experiment they may simply have turned that number up even higher. (Alternatively, maybe their test scaffold leads to longer thinking for other reasons.)
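For what it's worth, the public API exposes this setting directly as a string parameter, which is part of why I picture it as a knob. A minimal sketch, assuming the current OpenAI Python SDK (openai >= 1.x) and an `OPENAI_API_KEY` in the environment:

```python
# Minimal sketch: calling o3-mini with the public "reasoning effort" setting.
# Assumes the OpenAI Python SDK (openai >= 1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # only "low", "medium", and "high" are exposed publicly
    messages=[{"role": "user", "content": "What is the 100th prime?"}],
)

print(response.choices[0].message.content)
```

Whether "high" maps to a cap on reasoning tokens or something else entirely, I can't say; the point is just that the public surface is three discrete values of what is presumably a more continuous internal setting.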
Finally, I'll just mention that OpenAI has been clear that they don't plan to release o3 on its own, but rather that the upcoming GPT-5 will (more or less) merge GPT-4.5 and o3, so they just have a single best model. Source: https://x.com/sama/status/1889755723078443244
So, hopefully, this confusion will abate a bit.
Thanks – clearly I misremembered whether these results were for o3-mini or o3.