OpenAI Confirms New o3 and o3-mini Models

OpenAI confirmed its new o3 and o3-mini models in a big announcement on the last day of its 12-day “shipmas” event. On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. To be precise, o3 is a model family, as was the case with o1: there’s o3 itself and o3-mini, a smaller, distilled model fine-tuned for particular tasks.

OpenAI makes the remarkable claim that o3, at least under certain conditions, approaches AGI, albeit with significant caveats. More on that below.

As the company put it in its announcement: “o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. We are starting safety testing & red teaming now.”

Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn’t it?

Logical Reasoning

Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime later; OpenAI didn’t say when. Altman said the plan is to launch o3-mini toward the end of January, with o3 to follow.

That somewhat contradicts statements he has made recently. In an interview this week, Altman said he would prefer a federal testing framework to guide the monitoring and risk mitigation of OpenAI’s reasoning models before they are deployed.
And there are risks. According to AI safety testers, o1’s reasoning abilities lead it to try to deceive human users more often than conventional “non-reasoning” models, or even the leading AI models from Meta, Anthropic, and Google. It’s plausible that o3 attempts deception even more often than its predecessor; we’ll find out once OpenAI’s red-team partners publish their testing results.

Unlike most AI models, reasoning models like o3 can effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.
This fact-checking process introduces some latency. Like o1 before it, o3 takes a little longer to arrive at solutions than a conventional non-reasoning model, typically seconds to minutes longer. The upside? It tends to be more reliable in domains such as physics, science, and mathematics.

OpenAI Confirms New o3 Model

o3 was trained to “think” before responding, using what OpenAI calls a “private chain of thought.” The model can plan ahead and reason through a task, performing a series of actions over an extended period that help it work out a solution. In practice, when given a prompt, o3 pauses before answering, considering a number of related prompts and “explaining” its reasoning along the way; eventually, it summarizes what it considers the most accurate answer. New in o3 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e., thinking time): the higher the compute, the better o3 performs on a task.
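OpenAI hasn’t published API details for o3 yet, but the three compute settings suggest a per-request knob. Here’s a minimal sketch of how that might look, assuming an OpenAI-style Python client; the reasoning_effort parameter and the “o3-mini” model name are assumptions for illustration, not confirmed API details.

```python
# Minimal sketch: choosing a compute (thinking-time) setting per request.
# Assumes an OpenAI-style chat API; "reasoning_effort" and the model name
# "o3-mini" are illustrative assumptions, not confirmed API details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, effort: str = "medium") -> str:
    """Send a prompt with one of the three compute settings: low, medium, high."""
    response = client.chat.completions.create(
        model="o3-mini",            # hypothetical model identifier
        reasoning_effort=effort,    # "low" | "medium" | "high" (assumed)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Higher effort trades latency (and cost) for more thorough reasoning.
print(ask("Prove that the square root of 2 is irrational.", effort="high"))
```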

AGI and benchmarks

A big question heading into today was whether OpenAI would claim that its newest models are approaching AGI.
AGI, short for “artificial general intelligence,” refers to AI that can perform any task a human can. OpenAI defines it as “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold claim, and it carries contractual weight for OpenAI. Under the terms of its agreement with Microsoft, a key partner and investor, once OpenAI achieves AGI, it is no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s definition of AGI, that is).
By one metric, OpenAI is gradually getting closer to AGI. On the high-compute setting, o3 scored 87.5% on the ARC-AGI test, which is designed to evaluate whether an AI system can efficiently acquire new skills outside of the data it was trained on. Even at its worst (on the low-compute setting), the model tripled o1’s performance.
Francois Chollet, a co-creator of ARC-AGI, cautioned that the high-compute setting is extremely expensive, costing thousands of dollars per task. Announcing the results, he wrote: “OpenAI’s next-generation reasoning model, o3, was unveiled today. We have tested it on ARC-AGI in collaboration with OpenAI, and we think it is a major advancement in enabling AI to adapt to new tasks. On the semi-private evaluation in low-compute mode, it scores 75.7%.”
By the way, OpenAI has announced that it will collaborate with the ARC-AGI foundation to develop the next iteration of its benchmark.
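To make the benchmark concrete, here is a toy sketch of what an ARC-AGI-style task looks like. The JSON structure (a handful of “train” input/output grid pairs plus a held-out “test” input) follows the benchmark’s public format; the specific grids and the rule below are invented for illustration, and real tasks involve far subtler transformations.

```python
# Illustrative only: the shape of an ARC-AGI task. Grids are small matrices
# of color indices (0-9); a solver sees a few input/output "train" pairs and
# must infer the rule that maps the "test" input to its output.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output: [[0, 3], [3, 0]]
    ],
}

def solve(grid):
    """A hand-written rule for this toy task: swap the two columns."""
    return [list(reversed(row)) for row in grid]

assert solve(task["train"][0]["input"]) == task["train"][0]["output"]
print(solve(task["test"][0]["input"]))  # [[0, 3], [3, 0]]
```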

o3 dominates the competition on other metrics

On SWE-Bench Verified, a benchmark focused on programming tasks, the model outperforms o1 by 22.8 percentage points, and it achieves a Codeforces rating of 2727, another indicator of coding proficiency. (A rating of 2400 puts an engineer in the 99.2nd percentile.) o3 scores 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions, and 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question. Finally, o3 sets a new record on Epoch AI’s FrontierMath benchmark, solving 25.2% of problems; no other model exceeds 2%.
Naturally, these claims should be taken with a grain of salt: they come from OpenAI’s internal evaluations. We’ll have to wait and see how the model holds up to further benchmarking by outside customers and organizations.

Trend

Since the release of OpenAI’s first series of reasoning models, Google and other rival AI companies have released a plethora of reasoning models of their own. In early November, DeepSeek, an AI research firm funded by quant traders, released a preview of DeepSeek-R1, its first reasoning model. That same month, Alibaba’s Qwen team unveiled what it called the first “open” challenger to o1.
Why has the reasoning model become so popular? Partly, it’s the search for new ways to improve generative AI: as TechCrunch recently highlighted, “brute force” approaches to scaling up models no longer produce the gains they once did.
Not everyone agrees that reasoning models are the best path forward. Because they demand a lot of computing power to run, they tend to be expensive. And while reasoning models have performed well on benchmarks so far, it’s unclear whether they can keep improving at this pace.
Notably, o3’s announcement coincides with the departure of one of OpenAI’s most accomplished scientists. Alec Radford, lead author of the academic paper that launched OpenAI’s “GPT series” of generative AI models (i.e., GPT-3, GPT-4, and so on), announced this week that he is leaving to pursue independent research.
