I was a Plus subscriber and upgraded to Pro just to test Codex, and at least in my experience, it's been pretty underwhelming.

First, I don't think they've got the UX quite right yet. Having to wait an undefined amount of time before getting a result is far from ideal, although the async nature of Codex alleviates this somewhat (you can run multiple tasks at once).

Another thing that bugs me is having to define an environment for the tool to be useful. This is very problematic because, AFAIK, you can't spin up containers that might be needed in tests, which severely limits its usefulness. I guess this will eventually change, but the fact that it's also completely isolated from the internet seems limiting too: one of the reasons o3 is so powerful in ChatGPT is that it can autonomously research the web for up-to-date information on whatever you need.

For comparison, I also use Claude a lot, and I've found it works really well for finding obscure bugs in a somewhat complex React application when I create a project and add the GitHub repo as a source. This gives me a very short wait time, and the difference with Codex is night and day. Gemini also lets you do this now, and it works very well because of its massive context window.

All that said, I do understand where OpenAI is going with this. They want to achieve something like a real coworker (they even say so in their promotional videos for Codex): you're supposed to hand tasks to Codex and wait until it's done, like with a real human. But again, IMHO, it's too "pull-request-focused."

I guess I'll be downgrading to Plus again and waiting a little to see where this ends up.
I work at OpenAI (not on Codex) and have used it successfully on multiple projects so far. Here's my flow (sketched in code below):

- Always run more than one rollout of the same prompt -- they will turn out different.

- Look through the parallel implementations and see which is best (even if it's not good enough), then figure out what changes to your prompt would have nudged it toward the better solution.

- In addition, add new modifications to the prompt to resolve the parts the model didn't get right.

- Repeat the loop until the code is good enough.

If you do this and also split your work into smaller parallelizable chunks, you can find yourself spending a few hours just looping between prompt tuning and code review, with massive projects implemented in a short period of time.

I've used this for "API munging" but also for pretty deep Triton kernel code, and it's been massive.
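A minimal sketch of that loop, with everything Codex-specific stubbed out (no public API is assumed here; `launch_rollouts` and `review` are hypothetical placeholders for the web UI and the human review step):

```python
# Hypothetical sketch of the fan-out / review / re-prompt loop above.
# Nothing here is a real Codex API; the structure is the point.

def launch_rollouts(prompt: str, n: int) -> list[str]:
    """Placeholder: start n Codex tasks with the same prompt, return their diffs."""
    raise NotImplementedError

def review(diffs: list[str]) -> tuple[str, str]:
    """Placeholder for the human step: pick the best diff and write down
    what the prompt should have said to nudge the model toward it."""
    raise NotImplementedError

def refine(prompt: str, n_rollouts: int = 4) -> str:
    while True:
        diffs = launch_rollouts(prompt, n_rollouts)  # rollouts will differ
        best, feedback = review(diffs)               # code review, not automated
        if not feedback:                             # good enough: done
            return best
        prompt += "\n" + feedback                    # fold fixes into the prompt
```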
> As I wrote about in Walking and talking with AI in the woods, ideally I'd like to start my morning in an office, launch a bunch of tasks, get some planning out of the way, and then step out for a long walk in nature.

Wouldn't we all like that? But it sounds like you could leave task launching and planning to an AI too, and go find another career.
> Codex will support me and others in performing our work effectively away from our desks.

This feels so hopelessly optimistic to me, because "effectively away from our desks" for most people will mean "in the unemployment line."
If you're building a React app with a popular UI framework, AI will seem like magic at how well it one-shots things.

To the author's point about one-shotting: I think this problem will make it a real challenge to push an AI coding workflow forward. In my experience, AI falls off a cliff when you ask it to write code using more obscure libraries and frameworks. It will always hallucinate *something* rather than admit it has no knowledge of how something works.
I shared my review in the pod with the team (https://latent.space/p/codex), but basically:

- It's a GREAT one-shot coding model (in the pod we find out that they specifically fine-tuned it for one-shotting OAI SWE tasks, i.e., prioritized that over being multi-turn).

- However, it's comparatively let down by poorer integrations (e.g., no built-in browser, not-great GitHub integration -- as TFA notes, "The current workflow wants to open a fresh pull request for every iteration, which means pushing follow-up commits to an existing branch is awkward at best." -- yeah, this sucks).

Fortunately, the integrations will only improve over time. I think the finding that you can run 60 concurrent Codex instances per hour is qualitatively different from Devin (5 concurrent) and Cursor (1, before the new "background agents").

BTW:

> I haven't yet noticed a marked difference in the performance of the Codex model, which OpenAI explains is a descendant of GPT-3 and is proficient in more than 12 programming languages.

Incorrect: it's an o3 fine-tune.
Being able to make quick changes across a ton of repos sounds awesome. I help maintain a ton of example apps, and doing things like updating a README to conform to a new format, or changing a link, gets pretty tedious when there are 20 different places to do it. If I could delegate all that busywork to Codex and smash the merge button later, I'd be happy.
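For concreteness, here's a hedged sketch of what that delegation could look like: the same maintenance prompt fanned out across every example repo, one task (and eventually one PR) per repo. The repo names and `launch_task` are made up for illustration:

```python
# Hypothetical fan-out of one maintenance prompt across many repos.
# Repo names and launch_task are illustrative stand-ins, not a real API.

EXAMPLE_REPOS = [f"my-org/example-app-{i}" for i in range(20)]

PROMPT = (
    "Update README.md to conform to the new format, and replace any links "
    "to the old docs site with https://docs.example.com."
)

def launch_task(repo: str, prompt: str) -> str:
    """Placeholder: start a Codex task against `repo`, return the PR URL."""
    raise NotImplementedError

pr_urls = [launch_task(repo, PROMPT) for repo in EXAMPLE_REPOS]
# ...then all that's left is reviewing and smashing the merge button on each PR.
```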
I'm actually curious about using this sort of tool to let non-devs make changes to our code.

There are so many content changes or small CSS fixes (anything you'd verify was fixed just by looking at it) where I really don't want to be involved in writing the change, but I'm happy to do a code review.

Letting a non-dev see the ticket, kick off a coding task, test whether it was fixed, and then just say "yeah, this looks good" -- with me only looking at the code -- seems like a good workflow for most of the minor bugs/enhancements in our backlog.
Needs checkpointing. A full git commit is too much... commitment. You'll often go down a bad path with agentic codegen that just falls apart, and you won't know where you wanted to return to until you're there. I'm very skeptical of the "automated PR" solutions at the moment; too much time and money gets lost to trust single-shot output yet. And if you still need a human in the loop, it's best done in real time with constant feedback, i.e., cybernetics, not automata.
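One way to get checkpoints lighter than a full commit is plain git plumbing: `git stash create` snapshots the working tree into a dangling commit without adding anything to your branch or the stash list, and pinning the snapshot to a ref protects it from gc. A minimal sketch (the `refs/checkpoints/*` naming is just my convention here):

```python
# Lightweight checkpoints via `git stash create`: snapshots the working
# tree as a dangling commit without adding a commit to your branch.
import subprocess
import time

def git(*args: str) -> str:
    out = subprocess.run(["git", *args], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

def checkpoint(label: str) -> str | None:
    sha = git("stash", "create", label)  # empty string if tree is clean
    if sha:
        git("update-ref", f"refs/checkpoints/{label}", sha)  # protect from gc
    return sha or None

def restore(sha: str) -> None:
    git("reset", "--hard")        # drop the agent's working-tree changes
    git("stash", "apply", sha)    # then reapply the checkpoint snapshot

before = checkpoint(f"pre-agent-{int(time.time())}")
# ...let the agent run; when it goes down a bad path:
# restore(before)
```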
The "phone and work away from desk" point struck me as absurd. If anything work is pushed to code review and testing which mostly require even more screen estate than coding itself.
Is there anywhere that lists which languages this supports? They aren't listed in the product announcement or in this review, and the review's examples seem to mostly be fixing typos on webpages.
I've found it really helpful for rummaging around an unfamiliar codebase and pointing me to the relevant parts of it.

The application of patches is hit and miss. If the changes span multiple files, I find it gets stuck going in circles.

But it's still been a definite net positive for productivity.
> Codex then clones your repositories into its own sandboxes so it can run commands and create branches on your behalf.

Slurping up trade secrets... but maybe I sound like the people who were afraid of using GitHub and other cloud git hosts.

Interesting crossroads.