Discussion about this post

Steve Newman:

Lots of great information and insight here on FrontierMath and the o3 results!

I was slightly confused by this passage:

> One comment about these reasoning traces [that OpenAI sent from their internal evaluations]: they are definitely a lot more impressive than what we get online [using ChatGPT or the API]. Almost surely they put a lot of resources into solving these problems when they do their internal evaluations.

> Note: Epoch AI has since run their own evaluations, and made all outputs available for the 10 public FrontierMath problems. You can inspect the results here.

Following the "inspect the results" link, it looks like Epoch AI has evaluated o3-mini but not the full o3 model, which is what I'd expect, since (AFAIK) OpenAI is not providing public access to o3 (except that, reportedly, their Deep Research tool uses o3). But wouldn't that difference explain why OpenAI's internal evaluations were "more impressive than what we [Epoch] get online"?

(Also, I forget the details regarding OpenAI's internal evaluation, but I believe their most impressive numbers involved running the model many times on each problem?)
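The setup I have in mind is a best-of-n ("pass@n") protocol: sample the model n times per problem and count the problem as solved if any attempt checks out. A minimal sketch of what I mean, where `sample_model` and `grade_attempt` are hypothetical stand-ins for whatever API call and answer checker an evaluator actually uses:

```python
# Hedged sketch of a best-of-n ("pass@n") scoring loop; not OpenAI's
# actual harness. sample_model and grade_attempt are hypothetical
# stand-ins for the real API call and answer checker.
def pass_at_n(problems, sample_model, grade_attempt, n=8):
    """Fraction of problems solved by at least one of n samples."""
    solved = 0
    for problem in problems:
        attempts = [sample_model(problem) for _ in range(n)]
        if any(grade_attempt(problem, attempt) for attempt in attempts):
            solved += 1
    return solved / len(problems)
```

Under a protocol like that, the headline number can look a lot stronger than a single-sample run of the same model, which could account for part of the gap independent of how much compute each attempt got.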
