
OpenAI o3-mini

962 points | by johnneville | 3 months ago

115 comments

simonw · 3 months ago
I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae050

For 18,936 input tokens and 2,905 output tokens it cost 3.3612 cents.

Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-a--m-model-option
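That lines up with the o3-mini list prices quoted elsewhere in this thread ($1.10 per 1M input tokens, $4.40 per 1M output tokens); a quick sanity check, assuming those rates:

    # Cost check, assuming $1.10 per 1M input tokens and $4.40 per 1M output tokens
    input_tokens, output_tokens = 18_936, 2_905
    cost_usd = input_tokens * 1.10 / 1e6 + output_tokens * 4.40 / 1e6
    print(f"{cost_usd * 100:.4f} cents")  # -> 3.3612 cents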
simonw · 3 months ago
I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21

Example usage:

    llm -m o3-mini 'write a poem about a pirate and a walrus' \
      -o reasoning_effort high

Output (comparing that with the default reasoning effort): https://github.com/simonw/llm/issues/728#issuecomment-2628321507

(If anyone has a better demo prompt I'd love to hear about it)
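The same option is exposed as an API parameter; a minimal sketch with the official OpenAI Python SDK, assuming the documented reasoning_effort values of "low", "medium", and "high":

    # Minimal sketch: setting reasoning effort via the OpenAI Python SDK (pip install openai)
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # "low" | "medium" | "high"
        messages=[{"role": "user", "content": "write a poem about a pirate and a walrus"}],
    )
    print(response.choices[0].message.content)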
anotherpaulg · 3 months ago
For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet.

    62%  $186  o1 high
    60%   $18  o3-mini high
    57%    $5  DeepSeek R1
    54%    $9  o3-mini medium
    52%   $14  Sonnet
    48%    $0  DeepSeek V3

[0] https://aider.chat/docs/leaderboards/
sixhobbits · 3 months ago
For years I've been asking all the models this mixed-up version of the classic riddle, and 99% of the time they get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still conclude goat.

o3-mini is the first one to get it right for me. Transcript:

    Me: I have a wolf, a goat, and a cabbage and a boat. I want to get
    them all across the river. If I leave the goat and the wolf together,
    the goat will eat the wolf. If I leave the wolf and the cabbage
    together, the wolf will eat the cabbage. I can only take one at a
    time. How do I do it?

    o3-mini:
    1. Take the wolf across to the right bank.
    2. Return alone to the left bank.
    3. Take the goat across to the right bank.
    4. Bring the wolf back to the left bank.
    5. Take the cabbage across to the right bank.
    6. Return alone to the left bank.
    7. Finally, take the wolf across to the right bank.

    Me: I have a wolf, a goat, and a cabbage and a boat. I want to get
    them all across the river. If I leave the goat and the cabbage
    together, the cabbage will eat the goat. If I leave the wolf and the
    cabbage together, the wolf will eat the cabbage. I can only take one
    at a time. How do I do it?

    o3-mini:
    1. Take the cabbage to the right bank.
    2. Go back empty to the left bank.
    3. Take the wolf to the right bank.
    4. Bring the cabbage back to the left bank.
    5. Take the goat to the right bank.
    6. Go back empty to the left bank.
    7. Finally, take the cabbage to the right bank.
sss111 · 3 months ago
So far, it seems like this is the hierarchy:

o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini

o3-mini system card: https://cdn.openai.com/o3-mini-system-card.pdf
georgewsinger · 3 months ago
Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked system card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?

Am I missing something?
waynecochran · 3 months ago
I just had it convert Swift code to Kotlin and was surprised at how the comment was translated. It "knew" the author of the paper and what it was doing!? That is wild.

Swift:

    //
    // Double Reflection Algorithm from Table I (page 7)
    // in Section 4 of https://tinyurl.com/yft2674p
    //
    for i in 1 ..< N {
        let X1 = spine[i]
        ...

Kotlin:

    // Use the Double Reflection Algorithm (from Wang et al.) to compute
    // subsequent frames.
    for (i in 1 until N) {
        val X1 = Vector3f(spine[i])
        ...
ilaksh · 3 months ago
It looks like a pretty significant increase on SWE-bench, although that makes me wonder if there was some formatting issue or gotcha that was holding the results back before.

If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.

https://platform.openai.com/docs/pricing

I am Tier 5, but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers", which means I will have to wait around and just be lucky, I guess.
pookieinc · 3 months ago
Can't wait to try this. What's amazing to me is that when this was revealed just one short month ago, the AI landscape looked very different than it does today, with more AI companies jumping into the fray with very compelling models. I wonder how that shift has affected this release internally, future releases, and their mindset moving forward... How does the efficiency change, the scope of their models, etc.?
vok · 3 months ago
Well, o3-mini-high just successfully found the root cause of a seg fault that o1 missed: mistakenly using _mm512_store_si512 for an unaligned store that should have been _mm512_storeu_si512.
Bjorkbat · 3 months ago
I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's Codeforces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which allegedly was better than ~90% of all competitors on Codeforces. So, you know, o3-mini is pretty good at Codeforces.

But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?).

My immediate thoughts: Codeforces (and competitive programming in general) is a poor proxy for performance on general software-engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM initially released by Anthropic some time ago, at least according to this benchmark.

Mind you, the GitHub issues that the problems in SWE-bench were based on have been around long enough that it's pretty much a given they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3 isn't meaningfully better than Sonnet.
cjbarber · 3 months ago
The interesting question to me is how far these reasoning models can be scaled. With another 12 months of compute scaling (for synthetic data generation and RL), how good will these models be at coding? I talked with Finbarr Timbers (ex-DeepMind) yesterday about this, and his take is that we'll hit diminishing returns – not because we can't make models more powerful, but because we're approaching diminishing returns in areas that matter to users, and AI models may be nearing a plateau where capability gains matter less than UX.
devindotcom · 3 months ago
Sure as a clock, tick follows tock. Can't imagine trying to build out cost structures, business plans, product launches, etc. on such rapidly shifting sands. Good that you get more for your money, I suppose. But I get the feeling no model or provider is worth committing to in any serious way.
0xFEE1DEAD · 3 months ago
After o3 was announced, with the numbers suggesting it was a major breakthrough, I have to say I'm absolutely not impressed with this version.

I think o1 works significantly better, and that makes me think the timing is more than just a coincidence.

Last week Nvidia lost $600 billion in market value because of DeepSeek R1, and now OpenAI comes out with a new release that feels like it has nothing to do with the promises that were being made about o3.
scarface_74 · 3 months ago
This took 1:53 in o3-mini:

https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360d3

The 4o model without using the Python tool:

https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0d3

Trying to get accurate results with the paid version of 4o with the Python interpreter:

https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b870f

The share link doesn't show the output for some reason, but it did work correctly. I don't know whether the ages are correct; I was testing whether it could handle ordering.

I have no idea what conclusion I should draw from this besides: depending on the use case, 4o may be better with "tools" if you know the domain where you are using it.

Tools are relatively easy to implement with LangChain or the native OpenAI SDK.
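A minimal sketch of what a tool definition looks like with the OpenAI Python SDK (the get_age tool and its schema are made up purely for illustration):

    # Minimal function-calling sketch (pip install openai); get_age is a made-up tool
    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "get_age",  # hypothetical helper, not a real API
            "description": "Look up a person's age by name.",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    }]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Sort Ann, Bob, and Cathy by age."}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)  # tool invocations the model requested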
dilap · 3 months ago
Haven't used OpenAI in a bit -- whyyy did they change the "system" role (now basically an industry-wide standard) to "developer"? That seems pointlessly disruptive.
jen729w · 3 months ago
> Testers preferred o3-mini's responses to o1-mini 56% of the time

I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.

They're both 2,000 words, and I asked a question because I have something to do. *I'm not reading them both*; I'm usually just selecting the one that answered first.

That prompt is pointless. Perhaps as evidenced by the essentially 50% response rate: it's a coin-flip.
kumarm · 3 months ago
I ran some quick programming tasks I had used o1 for previously:

1. A quarter of the reasoning time for most tasks.

2. Far better results.
highfrequency · 3 months ago
Anyone else confused by the inconsistency in performance numbers between this announcement and the concurrent system card? https://cdn.openai.com/o3-mini-system-card.pdf

For example:

GPQA Diamond, system card: o1-preview 0.68

GPQA Diamond, PR release: o1-preview 0.78

Also, how should we interpret the 3 different shading colors in the bar plots (white, dotted, heavy dotted on top of white)?
silverlake · 3 months ago
O3-mini solved this prompt. DeepSeek R1 had a mental breakdown. The prompt: “Bob is facing forward. To his left is Ann, to his right is Cathy. Ann and Cathy are facing backwards. Who is on Ann’s left?”
msp26 · 3 months ago
I wish they'd just reveal the CoT (like Gemini and DeepSeek do); it's very helpful to see when the model gets misled by something in your prompt. Paying for tokens you aren't even allowed to see is peak OpenAI.
mark_l_watson · 3 months ago
Oh, sweet: both o3-mini low and high support integrated web search. No integrated web search with o1.

I prefer, for philosophical reasons, open-weight and open-process/science models, but OpenAI has done a very good job of productizing ChatGPT. I also use their 4o-mini API because it is cheap and compares well to using open models on Groq Cloud. I really love running local models with Ollama, but the API vendors keep prices so low that I understand most people not wanting the hassle of running DeepSeek-R1, etc., locally.
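For reference, a minimal local-model sketch using the ollama Python package (assumes "pip install ollama", a running Ollama server, and an already-pulled model tag):

    # Minimal sketch: chatting with a locally served model via Ollama
    import ollama

    response = ollama.chat(
        model="llama3.2",  # any model tag you have pulled locally
        messages=[{"role": "user", "content": "Summarize the o3-mini release in one sentence."}],
    )
    print(response["message"]["content"])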
mvkel · 3 months ago
I've been using Cursor since it launched, sticking almost exclusively to claude-3.5-sonnet because it is incredibly consistent and rarely loses the plot.

As subsequent models have been released, most of which claim to be better at coding, I've switched Cursor over to give them a try.

o1, o1-pro, deepseek-r1, and now o3-mini: all of these models suffer from the exact same "ADHD." As an example, in a Next.js app, if I write a Composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy":

sonnet nails it almost perfectly every time, but suffers from some date-cutoff issues, like thinking that shadcn-ui@latest is the repo name.

Every single other model, doesn't matter which, does the following: it starts writing radix-ui components from scratch. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" and then begin writing its own components from scratch, again not using shadcn.

This is still a problem with o3-mini.

I can't believe it's the models. It must be the instruction set that Cursor is giving them behind the scenes, right? But no amount of .cursorrules, or other instruction, seems to get Cursor "locked in" the way sonnet just seems to be naturally.

It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.

Has anyone found a workaround?
jajko · 3 months ago
A random idea: train one of those models on *you*, keep it aside, and let it somehow work out your intricacies, moods, details, childhood memories, personality, flaws, strengths. Methods could vary: an initial dump of social networks, personal photos and videos, maybe some intense conversation to grok a rough you, then polish over time.

A first step to digital immortality. It could be a nice startup, a personalized product for the rich at first and then even regular folks. Immortality not as ourselves (we die regardless, meat bags that we are), but as a digital copy and memento that our children could use if feeling lonely and carry with them anywhere, or that later descendants could, out of curiosity, 'invite' to big events like weddings: one could invite long-lost ancestors. Maybe your great-great-grandfather would be a cool guy you could easily click with these days via verbal input. Heck, even a detailed 3D model.

An additional, 'perpetually' paid service: keeping your data model safe, taking care of it, backups, maybe even giving it a bit of computing power to receive current news in some light fashion and evolve. Different tiers for different levels of service and care.

Or am I a decade or two ahead? I can see this as universally interesting across many if not all cultures.
ryanhecht · 3 months ago
> While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.

I feel like this naming scheme is growing a little tired. o1 is for general-knowledge reasoning; o3-mini replaces o1-mini but might be more specialized than o1 for certain technical domains... the "o" in "4o" is for "omni" (referring to its multimodality), but the reasoning models start with "o"... and they can't use "o2" for trademark reasons, so they skip straight to "o3"... the word salad is getting really hard to follow!
diegocg · 3 months ago
I hope ChatGPT reconsiders the naming of their models some time. I have trouble deciding which model I should use.
airstrike · 3 months ago
Hopefully this is a big improvement over o1.

o1 has been very disappointing after spending sufficient time with Claude Sonnet 3.5. It's like it actively tries to gaslight me and thinks it knows more than I do. It's too stubborn and confidently goes off on tangents, suggesting big changes to parts of the code that aren't the issue. Claude tends to be way better at putting the pieces together in its not-quite-mental-model, so to speak.

I told o1 that a suggestion it gave me didn't work, and it said "if it's still 'doesn't work' in your setup..." with "doesn't work" in quotes, like it was doubting me... I've canceled my ChatGPT subscription and, when I really need to use it, just go with GPT-4o instead.
zone411 · 3 months ago
It scores 72.4 on NYT Connections, a significant improvement over o1-mini (42.2), surpassing DeepSeek R1 (54.4) but falling short of o1 (90.7).

(https://github.com/lechmazur/nyt-connections/)
antirez · 3 months ago
Just tested two complicated coding tasks, and surprisingly o3-mini-high nailed it while Sonnet 3.5 failed it. Will do more tests tomorrow.
xyzzy9563 · 3 months ago
What is the comparison of this versus DeepSeek in terms of good results and cost?
Oras · 3 months ago
200k context window

$1.10/M for input

$4.40/M for output

I assume thinking at medium and high effort would consume more tokens.

I feel the timing is bad for this release, especially when DeepSeek R1 is still peaking. People will compare and might get disappointed with this model.
isusmelj · 3 months ago
Does anyone know why GPT-4 has a knowledge cutoff of December 2023 while all the newer models (4o, o1, o3) seem to have a cutoff of October 2023? https://platform.openai.com/docs/models#o3-mini

I understand that keeping the same data and curating it might be beneficial. But it sounds odd to roll the knowledge cutoff back in time. AFAIK, the only event that happened around that time was the start of the Gaza conflict.
chad1n · 3 months ago
I think OpenAI should reduce prices even further to be competitive with Qwen or DeepSeek. There are a lot of vendors offering DeepSeek R1 for $2-2.5 per 1 million output tokens.
AISnakeOil · 3 months ago
The naming convention is so messed up. o1, o3-mini (no o2, no o3???)
airbreather · 3 months ago
I tried to get it to build me a slightly challenging app to break out data from a fairly obscure file format for some PLC code, after having tried with Claude.

o3-mini produced volumes of code more quickly, and more of it, but Claude still had greater insight into the problem and decoded the format to a noticeably greater degree.

Whereas o3-mini quickly got to a certain point, it wasn't long before it was obvious it wasn't really going any further. Like its big cousin, but in its own way, it was lazy and forgetful, seeming at times more interested in telling me what I might try than actually trying itself.

Interestingly, even when I gave it a copy of Claude's code, it still wasn't able to get to the same depth of understanding.
iamjackg · 3 months ago
I'm very interested in their jailbreak evaluations: they're new to me. I might have missed previous mentions.
Ninjinka · 3 months ago
50 messages a day -> 150 messages a day for Plus and Team users
og_kalu · 3 months ago
R1 seems to be the only one of these reasoning models that has had gains on the creative writing side.
estsauver · 3 months ago
I couldn't find anything in the documentation describing the relative number of tokens you get for low/medium/high. I'm curious if anyone can find that; I'd like to see how it plays out relative to DeepSeek's thinking sections.
cyounkins · 3 months ago
I switched an agent from Sonnet V2 to o3-mini (default medium mode) and got strangely poor results: only calling 1 tool at a time despite being asked to call multiple, not actually doing any work, and reporting that it did things it didn't.
kizer · 3 months ago
First AI model to pass my test on the first try (I used o3-mini-high).

Prompt: Write an interpreter for a simple but practical scripting language. Write the interpreter in JavaScript to be run on the Node.js platform. You can import any of the standard Node.js modules.

It churned out ~750 lines plus a sample source file to run the interpreter on, and it ran completely correctly on the first try.

Definitely a step up. Perhaps it's in the training data; I don't know. But no other model has ever produced an error-free and semantically correct program on the first try, and I don't think any ever managed to implement closures.
thimabi · 3 months ago
Does anyone know the current usage limits for o3-mini and o3-mini-high when used through the ChatGPT interface? I tried to find them on the OpenAI Knowledgebase, but couldn’t find anything about that.
mhb · 3 months ago
Maybe they can get some advice from the AWS instance naming group.
llamataboot · 3 months ago
I wish they'd pick different descriptors than various o-Xs with turbo/mini/preview etc., some of them shockingly different (o1-preview versus o1 regular, for example). Just give us three or four buckets of combos of cheap and fast and good :p It's to the point where I have to look up prices and abilities and decide on a model for every use case, and then realize I made a mistake in the one I was using the day before.
mohsen1 · 3 months ago
It's funny, because I asked it to fix my script that shows *DeepSeek*'s chain of thought, and it refuses to answer hahaha
modeless · 3 months ago
Initial vibes are not living up to the hype. It fails my pet prompt, and the Cursor devs say they still prefer Sonnet [1]. I'm sure it will have its uses, but it is not going to dominate.

[1] https://x.com/cursor_ai/status/1885415392677675337
ks2048 · 3 months ago
I think OpenAI should just have a single public-facing "model"; all these names and versions are confusing.

Imagine if Google, during its ascent, had had a huge array of search engines with code names and notes about what each was doing behind the scenes. No: you open the page and type in the box. If they can make it work better next month, great.

(I understand this could not apply to developers or enterprise-type API usage.)
Havoc · 3 months ago
5 hours in, 500-odd comments. Definitely feels like this has less wow factor than previous OpenAI releases.
sshh12 · 3 months ago
I built a silly political simulation game with this: https://state.sshh.io/

https://github.com/sshh12/state-sandbox
ern · 3 months ago
I haven't bothered with o3-mini, because who wants an "inferior" product? I was using 4o as a "smarter Google" until DeepSeek appeared (although its web search is being hammered now and I'm just using Google).

o1 seems to have been neutered in the last week: lots of disclaimers and butt-covering in its responses.

I also had an annoying discussion with o1 about the DC plane crash. It doesn't have web access and its cutoff is 2024, so I don't expect it to know about the crash. However, after saying such an event is extremely unlikely and being almost patronizingly reassuring, it treated pasted news articles and links (which, to be sure, it can't access) as "fictionalized", instead of acknowledging its own cutoff date and that it could have been wrong. In contrast, DeepSeek (with web search turned off) was less dismissive of the risks in DC airspace and more aware of its own knowledge cutoff.

Coupled with the limited number of o1 responses for ChatGPT Plus, I've cancelled my subscription for now.
aussieguy1234 · 3 months ago
Just gave it a go using open-webui.

One immediate difference I noticed is that o3-mini actually observes the system prompt you set. So if I say it's a Staff Engineer at Google, it'll stay in character.

That was not possible with o1-mini, which ignored system prompts completely.
energy123 · 3 months ago
How do I disable the LLM-summarized thought traces that get spammed into my chat window with o3-mini-high?

It's very annoying having to manually press the "^" to hide the verbose thought traces *every single question I ask*; it totally breaks flow.
secondcoming · 3 months ago
Anyone else stuck in a Cloudflare 'verify you're a human' doom loop?
profsummergig · 3 months ago
Can someone please share the logic behind their version naming convention?
energy123 · 3 months ago
This might be the best publicly available model for coding:

https://livebench.ai/#/?Coding=as
zora_goron · 3 months ago
Does anyone know how "reasoning effort" is implemented technically? Does it involve differences in the pre-training, RL, or prompting phases (or all of them)?
gmaster1440 · 3 months ago
I think it says, among other things, that there is a salient difference between competitive programming like Codeforces and real-world programming. You can train a model to hill-climb Elo ratings on Codeforces, but that won't necessarily translate directly to working on a production JavaScript codebase.

Anthropic figured out something about real-world coding that OpenAI is still trying to catch up to, o3-mini-high notwithstanding.
IMTDb · 3 months ago
I really don't get the point of those oX-mini models for chat apps. (The API is different: we can benchmark multiple models for a given recurring task and choose the best one, taking costs into consideration.) As part of my job, I am trying to promote usage of AI in my company (~150 FTE); we have an OpenAI ChatGPT Plus subscription for all employees.

Roughly speaking, the message is: "use GPT-4o all the time; use o1 (soon o3) if you have more complex tasks". What am I supposed to answer when people ask "when am I supposed to use o3-mini? And what the heck is o3-mini-high? How do I know when to use it?" People aren't going to ask the same question of 5 different models and burn all their rate limits, yet it feels like that's what OpenAI is hoping people will do.

Put those weird models in a sub-menu for advanced users if you really want to, but if you can use o1 there is probably no reason to have o3-mini *and* o3-mini-high as additional options.
GaggiX · 3 months ago
The API pricing is almost exactly double DeepSeek's.
resource_waste · 3 months ago
Further confirmation that LLMs are deep into diminishing returns.

It has seemed true with every 'advancement' after ChatGPT-4.

Layers are cool, but that isn't AI/LLMs.
andrewstuart · 3 months ago
I find Claude to be vastly better than any OpenAI model as a programming assistant.

In particular, the "reasoning" models just seem to be less good and slower.
feverzsj · 3 months ago
It's been a dead end for a while now, as they can't improve on o1 meaningfully anymore. The market is also losing patience quickly.
sandos · 3 months ago
How many benchmarks for LLMs are there out there?

Is there any evidence of over-fitting on benchmarks, or are there truly hidden parts to them?
sirolimus · 3 months ago
Proprietary AI models are not going to last. Not scalable, not moral, and not worth the money.
AutistiCoder · 3 months ago
The o3-mini model would be useful to me if coding were the only thing I needed to do in a chat log.

When I use ChatGPT these days, it's to help me write coding videos and then the social media posts around those videos. So that's two specialties in one chat log.
revskill · 3 months ago
Models should be better at clarifying the prompt before actually spamming with bad answers.
yohbho · 3 months ago
"We are releasing", and by releasing, we do not mean releasing.

Is this company worth anything?
xmichael909 · 3 months ago
So can I ditch the $200-a-month o1-pro account and go back to the $20 account with o3-mini?
bix6 · 3 months ago
They use the word reasoning a lot in the post. Is this reasoning or statistical prediction?
EternalFury · 3 months ago
o1-preview, o1, o1-mini, o3-mini, o3-mini (low), o3-mini (medium), o3-mini (high)...

What's next?

o4-mini (wet socks), o5-eeny-meeny-miny-moe?

I thought they had a product manager over there.

They only need 2 names, right? ChatGPT and o.

ChatGPT-5 and o4 would be next.

This multiplication of the LLM loaves and fishes is kind of silly.
mise_en_place · 3 months ago
Too little, too late, IMO. This is not impressive at all; what am I missing here?
danielovichdk · 3 months ago
I read this as a full-on marketing note targeted at software developers.
wseqyrku · 3 months ago
OpenAI o3-mini, "brought" to you by DeepSeek.
prompt_overflow · 3 months ago
Plot twist:

1. They are trying to obfuscate DeepSeek's success.

2. They are trying to confuse you; the benchmark margins are minimal (and meaningless).

3. They are trying to buy time (with investors) by releasing nothing-special models on a predictable schedule (Jan -> o3, Feb -> o3-pro-max, March -> o7-ultra, and in 2026 -> OMG! We've reached singularity! (after spending $500B)).

And at the end of the day, nothing changes for me, and neither for you. Enjoy your time away from this sick AI hype. Bruh!
catigula · 3 months ago
It's actually a bit comforting that it isn't very good.
czk · 3 months ago
I'm just glad o3-mini finally has internet access.

The o1 models were already so niche that I never used them, but not being able to search the web made them even more useless.
tempeler · 3 months ago
They made a discount; it's very impressive; they probably found a very efficient way, so it's discounted. I guess there's no need to build a very large nuclear power plant or a $9 trillion chip factory to run a single large language model. Efficiency has skyrocketed, or, thanks to competition, all of OpenAI's problems were solved.
jokoon · 3 months ago
Does that mean I can use this on my recent AMD gaming GPU?
anoncow · 3 months ago
How long before o3 is available in GitHub Copilot?
throwaway314155 · 3 months ago
Typical OpenAI release announcement where it turns out they're _actually_ doing some sort of delayed rollout, and despite what the announcement says, no, you can't use o3-mini today.
mvdtnz · 3 months ago
Wake up honey, a new lie generator just dropped.
turnsout · 3 months ago
Hmm, not seeing it in my dashboard yet (Tier 4).
cranberryturkey · 3 months ago
Wasn’t available on the $20 plan for me.
simonw · 3 months ago
Now that the dust is settling a little bit, I have published my notes so far on o3-mini here: https://simonwillison.net/2025/Jan/31/o3-mini/

To save you the click: I think the most interesting things about this model are the price (less than half that of GPT-4o while being better for many things, most notably code) and the increased length limits.

200,000 tokens input and 100,000 output (compared to 128k/16k for GPT-4o and just 8k on output for DeepSeek R1 and Claude 3.5) could open up some interesting new applications, especially at that low price.
EcommerceFlow · 3 months ago
The first thing I noticed on the API and in chat is THIS THING IS FAST. That alone makes it a huge upgrade over o1-pro (not really comparable, I know, just saying). Can't imagine how much I'll get done with this kind of speed.
Alifatisk · 3 months ago
Any comparison with other models yet?
oytis · 3 months ago
Let me guess: everyone is mind-blown.
_boffin_ · 3 months ago
Why is o1-pro not mentioned in there?
rednafi · 3 months ago
The most important detail for me was that in coding, it’s weaker than 4o and stronger than o1-mini. So I don’t have much use for it.
sourcecodeplz · 3 months ago
Even for free users, that is nice.
rasengan · 3 months ago
o1-pro got turned off? And o3-mini-high is significantly worse than o1-pro :(
RobinL · 3 months ago
Wow, this is seriously fast (o3-mini), and my initial impressions are very favourable. I asked it to lay out quite a complex HTML form from a schema, and it did a very good job.

Looking at the comments on here and the benchmark results, I was expecting it to be a bit meh, but initial impressions are quite the opposite.

I was expecting it to perhaps be a marginal improvement for complex things that need a lot of 'reasoning', but it seems it's a big improvement for simple things that you need done fast.
swyx · 3 months ago
For those interested, I updated my o3-mini price chart to compare the cost-intelligence frontier with DeepSeek: https://x.com/swyx/status/1885432031896887335
thunkingdeep · 3 months ago
I’ll take the China Deluxe instead, actually.<p>I’ve been incredibly pleased with DeepSeek this past week. Wonderful product, I love seeing its brain when it’s thinking.
hybrid_study · 3 months ago
Is this version a prank?
lenerdenator · 3 months ago
No self-host, no care.
Mr_Bees69 · 3 months ago
Can't wait till DeepSeek gets their hands on this.
sirolimus · 3 months ago
Not open source, nothing to see here.
evertedsphere · 3 months ago
> developer messages

Looks like their threat model has finally been updated to account for the possibility that the user might be too "unaligned" to be trusted with providing a system message of their own.
jiocrag · 3 months ago
This is... underwhelming.
system7rocks · 3 months ago
AI is absolute trash. It's a scam. Like Bitcoin.

And LibreOffice.
binary132 · 3 months ago
Not really impressed by the answers I just got.
OutOfHere · 3 months ago
Wake me up when the full o3 is out.
buyucu · 3 months ago
Why should anyone use this when DeepSeek is free/cheaper?

OpenAI is no longer relevant.
siliconc0w · 3 months ago
The real heated contest here amongst the top AI labs is to see who can come up with the most confusing product names.
vincentpants · 3 months ago
Wow, it got to the top of the front page so fast! Weird!
kaaskop · 3 months ago
How does this compare to Mistral Small 3?
yapyap · 3 months ago
They sure scrambled something together after DeepSeek swept the market.
dgfitz · 3 months ago
Oh look, another model. Yay.
badgersnake · 3 months ago
56% is pretty close to ‘don’t give a toss’
brcmthrowaway · 3 months ago
Gamechanger?
ofou · 3 months ago
I find it quite interesting that they're releasing three compute levels (low, medium, high); I guess now there's some way to cap the thinking tokens when using the API.

Pricing for o3-mini [1] is $1.10 / $4.40 per 1M input / output tokens.

[1]: https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini
xinayder · 3 months ago
"Oh no, DeepSeek copied our product, it's not fair!"

> proceeds to release a product based on DeepSeek

Ah, alas, the hypocrisy...
kevinsundar · 3 months ago
BTW, if you want to stay up to date with these kinds of updates from OpenAI, you can follow them here: https://www.getchangelog.com/?service=openai.com

It uses GPT-4o mini to extract updates from the website using scrapegraphai, so this is kinda meta :). Maybe I'll switch to o3-mini depending on cost. Its reasoning abilities, at a lower cost than o1, could be quite powerful for web scraping.
n0id34 · 3 months ago
Is AI fizzling out, or is it just me? I feel like they're trying to smash out new models as fast as they can, but in reality they're barely any different. It's turning into the smartphone market: a new iPhone with a slightly better camera and slightly differently bevelled edges, get it NOW! But it doesn't actually do anything better than the iPhone 6.

Claude, GPT-4 onwards, and DeepSeek all feel the same to me. Okay to a point, then kinda useless. More like a more convenient, specialised Google whose results you need to double-check.