O1 isn't a chat model (and that's the point)

165 points by gmays 4 months ago

20 comments

geor9e 4 months ago
Instead of learning the latest workarounds for the kinks and quirks of a beta AI product, I'm going to wait 3 weeks for the advice to become completely obsolete
goolulusaurs 4 months ago
The reality is that o1 is a step away from general intelligence and back towards narrow AI. It is great for solving the kinds of math, coding, and logic puzzles it was designed for, but for many kinds of tasks, including chat and creative writing, it is actually worse than 4o. It is good at the specific reasoning tasks it was built for, much as AlphaGo is great at playing Go, but that does not mean it is more generally intelligent.
samrolken 4 months ago
I have a lot of luck using 4o to build and iterate on context and then carrying that into o1. I'll ask 4o to break down concepts, make outlines, identify missing information, and think of more angles and options. Then at the end, I switch to o1, which can use all that context.
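A minimal sketch of that hand-off, assuming the official `openai` Python client; the model names and message contents are illustrative, not from the comment:

```python
# Sketch: iterate on context with gpt-4o, then hand the transcript to o1.
from openai import OpenAI

client = OpenAI()

# Phase 1: use 4o to break the problem down and accumulate context.
context = [{"role": "user", "content": "Break down the concepts involved in this task, "
                                       "outline an approach, and list missing information."}]
draft = client.chat.completions.create(model="gpt-4o", messages=context)
context.append({"role": "assistant", "content": draft.choices[0].message.content})

# Phase 2: carry the accumulated context into o1 for the final answer.
context.append({"role": "user", "content": "Using everything above, produce the full solution."})
final = client.chat.completions.create(model="o1", messages=context)
print(final.choices[0].message.content)
```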
ttul 4 months ago
FWIW: OpenAI provides advice on how to prompt o1 (https://platform.openai.com/docs/guides/reasoning/advice-on-prompting#advice-on-prompting). Their first bit of advice is to "Keep prompts simple and direct: The models excel at understanding and responding to brief, clear instructions without the need for extensive guidance."
isoprophlex 4 months ago
People are agreeing and disagreeing with the central thesis of the article, which is fine, because I enjoy the discussion...

No matter where you stand in the specific o1/o3 discussion, the concept of "question entropy" is very enlightening.

What is the question of theoretical minimum complexity that still solves your problem adequately? Or, for a specific model, are its users capable of supplying the minimum intellectual complexity the model needs?

It would be interesting to quantify these two and see if our models are close to converging on certain task domains.
martythemaniak 4 months ago
One thing I'd like to experiment with is "prompt to service". I want to take an existing microservice of about 3-5 kloc and see if I can write a prompt to get o1 to generate the entire service: proper structure, all files, all tests, compiles and passes, etc. o1 certainly has the context window to do this at 200k input and 100k output; code is ~10 tokens per line of code, so you'd need something like 100k input and 50k output tokens.

My approach would be:

- take an exemplar service and dump it in the context

- provide examples explaining specific things in the exemplar service

- write a detailed formal spec

- ask for the output in JSON to simplify writing the code: [{"filename":"./src/index.php", "contents":"<?php...."}]

The first try would inevitably fail, so I'd provide errors and feedback and ask for new code (i.e. the complete service, not diffs or explanations), plus have o1 update and rewrite the spec based on my feedback and the errors.

Curious if anyone's tried something like this.
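As an illustration of the last step, a minimal sketch of materializing files from the JSON shape proposed above; the function name and directory layout are hypothetical:

```python
# Sketch: write out files from a model response shaped like
# [{"filename": "./src/index.php", "contents": "<?php...."}].
import json
from pathlib import Path

def write_service(response_json: str, root: str = "generated_service") -> None:
    """Materialize a model-generated service from a JSON list of files."""
    files = json.loads(response_json)
    for entry in files:
        # Strip the leading "./" so every path resolves under the chosen root.
        path = Path(root) / entry["filename"].lstrip("./")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(entry["contents"])

# Example with a single file:
write_service('[{"filename": "./src/index.php", "contents": "<?php echo 1;"}]')
```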
swyx 4 months ago
Coauthor/editor here!

We recorded a followup conversation after the surprise popularity of this article, breaking down some more thoughts and behind-the-scenes material: https://youtu.be/NkHcSpOOC60?si=3KvtpyMYpdIafK3U
keizo 4 months ago
I made a tool for manually collecting context. I use it when copying and pasting multiple files is cumbersome: https://pypi.org/project/ggrab/
patrickhogan1 4 months ago
The buggy nature of o1 in ChatGPT is what prevents me from using it the most.

Waiting is one thing, but waiting to return to a prompt that never completes is frustrating. It's the same frustration you get from a long-running 'make/npm/brew/pip' command that errors out right as it's about to finish.

One pattern that's been effective is:

1. Use the Claude Developer Prompt Generator to create a prompt for what I want.

2. Run the prompt on o1 pro mode.
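A rough sketch of that two-step pattern, assuming the official `anthropic` and `openai` Python clients; the model names and the meta-prompt wording are illustrative:

```python
# Sketch: draft a detailed prompt with Claude, then run it on o1.
import anthropic
from openai import OpenAI

task = "Refactor this module to remove the global mutable state."

# Step 1: ask Claude to expand a terse task into a detailed, structured prompt.
claude = anthropic.Anthropic()
meta = claude.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=2000,
    messages=[{"role": "user",
               "content": f"Write a detailed, well-structured prompt for this task: {task}"}],
)
detailed_prompt = meta.content[0].text

# Step 2: run the generated prompt on o1.
result = OpenAI().chat.completions.create(
    model="o1",
    messages=[{"role": "user", "content": detailed_prompt}],
)
print(result.choices[0].message.content)
```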
swalsh 4 months ago
Work with chatbots like a junior dev; work with o1 like a senior dev.
inciampati 4 months ago
o1 appears not to be able to see its own reasoning traces. Or its own context is potentially being summarized to deal with the cost of giving access to all those chain-of-thought traces and the chat history. This breaks the computational expressivity of chain of thought, which supports universal (general) reasoning if you have reliable access to the things you've thought, and degrades to a threshold-circuit (TC0) bounded parallel pattern matcher when you don't.
timewizard 4 months ago
> To justify the $200/mo price tag, it just has to provide 1-2 Engineer hours a month

> Give a ton of context. Whatever you think I mean by a “ton” — 10x that.

One step forward. Two steps back.
adamgordonbell 4 months ago
I'd love to see some examples of good and bad prompting of o1.

I'll admit I'm probably not using o1 well, but I'd learn best from examples.
mediumsmart 4 months ago
I agree with the article and found the non-pro version very good at creating my local automation tool chain. It writes the scripts for every step, and then you hand them all back to it and it links them up as a single dothiscomplicatedthing.sh.
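For illustration, a minimal sketch of the kind of glue script described here, written in Python rather than shell; the step-script names are hypothetical:

```python
#!/usr/bin/env python3
# Sketch: run each generated step script in order, stopping on the first failure.
import subprocess
import sys

STEPS = ["step1_fetch.sh", "step2_convert.sh", "step3_publish.sh"]

for step in STEPS:
    print(f"running {step}...")
    result = subprocess.run(["bash", step])
    if result.returncode != 0:
        sys.exit(f"{step} failed with exit code {result.returncode}")
```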
sklargh 4 months ago
This echoes my experience. I often use ChatGPT to help with D&D module design, and I found that o1 did best when I told it exactly what I required, dumped in a large amount of info, and did not expect to use it to iterate multiple times.
irthomasthomas 4 months ago
Can you provide prompt/response pairs? I'd like to test how other models perform using the same technique.
iovrthoughtthis 4 months ago
This is hilarious.
fpgaminer 4 months ago
It does seem like individual prompting styles greatly affect the performance of these models. Which makes sense, of course, but the disparity is a lot larger than I would have expected. As an example, I'd say I see far more people in the HN comments preferring Claude over everything else. This is in stark contrast to my experience, where ChatGPT has been and continues to be my go-to for everything. And that's on a range of problems: general questions, coding tasks, visual understanding, and creative writing. I use these AIs all day, every day as part of my research, so my experience is quite extensive. Yet in all cases Claude has performed significantly worse for me. Perhaps it just comes down to the way that I prompt versus the average HN user? Very odd.

But yeah, o1 has been a _huge_ leap in my experience. One huge thing, which OpenAI's announcement mentions as well, is that o1 is more _consistently_ strong. 4o is a great model, but sometimes you have to spin the wheel a few times. I much more rarely need to spin o1's wheel, which mostly makes up for its thinking time. (Which is much less these days compared to o1-preview.) It also has much stronger knowledge. So far it has solved a number of troubleshooting tasks that there were _no_ fixes for online. One of them was an obscure bug in libjpeg.

It's also better at just general questions, like wanting to know the best/most reputable store for something. 4o is too "everything is good! everything is happy!" to give helpful advice here. It'll say Temu is a "great store for affordable options." That kind of stuff. Whereas o1 will be more honest and thus helpful. o1 is also significantly better at following instructions overall, and at inferring the meaning behind instructions. 4o will be very literal about examples that you give it, whereas o1 can more often extrapolate.

One surprising thing that o1 does that 4o has never done is that it _pushes back_. It tells me when I'm wrong (and is often right!). Again, part of that is being less happy and compliant. I have had scenarios where it's wrong and it's harder to convince it otherwise, so it's a double-edged sword, but overall it has been an improvement in the bot's usefulness.

I also find it interesting that o1 is less censored. It refuses far less than 4o, even without coaxing, despite its supposed ability to "reason" about its guidelines :P What's funny is that the "inner thoughts" that it shows say that it's refusing, but its response doesn't.

Is it worth $200? I don't think it is, in general. It's not really an "engineer" replacement yet, in that if you don't have the knowledge to ask o1 the right questions it won't really be helpful. So you have to be an engineer for it to work at the level of one. Maybe $50/mo?

I haven't found o1-pro to be useful for anything; it's never really given better responses than o1 for me.

(As an aside, Gemini 2.0 Flash Experimental is _very_ good. It's been trading blows with even o1 for some tasks. It's a bit chaotic, since its training isn't done, but I rank it at about #2 among all SOTA models. A 2.0 Pro model would likely be tied with o1 if Google's trajectory here continues.)
miltonlost 4 months ago
Oh god, using an LLM for medical advice? And maybe getting 3/5 right? Barely above a coin flip.

And that Warning section? "Do not be wrong. Give the correct names." That this is necessary to include is an idiotic product "choice", since its inclusion implies the bot is otherwise liable to be wrong and give wrong names. This is not engineering.
refulgentis 4 months ago
This is a bug, and a regression, not a feature.

It's odd to see it recast as "you need to give better instructions [because it's different]" -- you could drop the "because it's different" part, and it'd apply to failure modes in all models.

It also begs the question of _how_ it's different: and that's where the rationale gets circular. You have to prompt it differently because it's different, because you have to prompt it differently.

And where that really gets into trouble is the "and that's the point" part -- as the other comment notes, it's expressly against OpenAI's documentation and thus intent.

I'm a yuge AI fan. Models like this are a clear step forward. But it does a disservice to readers to leave the impression that the same techniques don't apply to other models, and it recasts a significant issue as design intent.