Although the demos are impressive, they seem short and limited in scope, which makes me wonder how well this will work outside of these planned cases. Can it do software architecture at all? Is it still essentially just regurgitating solutions? How often will the solution be only 90% correct, which is 100% not good enough?<p>Even so, I realize the demos are still broad in scope and the results are incredible. Imagine seeing this even 2 years ago. It would seem like magic; you wouldn't be able to believe it. Today it feels inevitable and entirely believable. There will be even better versions of this soon.
Clearly an extremely impressive demo, and congrats on the launch. I do wonder how often the bugs Devin encounters will be solvable with the simple fixes that were demonstrated. For instance, I noticed in the first demo that Devin hits a KeyError and decides to resolve it by wrapping the code in a try/except. While this will get the code to run, I immediately imagined cases where it's not actually an ideal solution (maybe it's a KeyError because the blog post Devin read is incorrect or out of date, and Devin should actually be referencing a different key altogether, or a different API). Can Devin "back up" at this point and implement a fix further back in its "decision tree" (e.g. use a different API endpoint), or can it only come up with fixes for the specific problem it's encountering at this moment (catch the KeyError and return None)?
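To make that concrete, here's a hypothetical sketch of the two repair strategies (both key names are invented):

    # Symptom-level fix: swallow the error where it surfaces.
    def get_follower_count(profile):
        try:
            return profile["followers"]   # key name taken from an outdated blog post
        except KeyError:
            return None                   # the code runs, but the data is silently gone

    # Root-cause fix: step back in the decision tree and read the field
    # the current API actually returns.
    def get_follower_count_v2(profile):
        return profile["follower_count"]  # hypothetical up-to-date key

The first version makes the traceback go away; only the second one makes the program correct.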
As a developer but also a product person, I keep trying to use AI to code for me. I keep failing: because of context length, because of shit output from the model, because of the lack of any kind of architecture, etc. etc. I'm probably dumb as hell, because I just can't get it to do anything remotely useful beyond helping me with leetcode.<p>Just yesterday I tried to feed it a simple HTML page to extract a selector. I tried it with GPT-4 Turbo, I tried it with Claude, I tried it with Groq, I tried it with a local Llama 2 model with a 128k context window. None of them worked. This is a task that, while annoying, I do in about 10 seconds.<p>Sure, I'm open to the possibility that in the next 2-3 days, up to a couple of years, I'll no longer do manual coding. But honestly, I'm starting to grow a bit irritated with the hype.<p>Just give me a product that works as advertised and I'll throw money your way, because I have a lot more ideas than I have code throughput!
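For comparison, the deterministic version of that task is maybe fifteen lines of BeautifulSoup -- a rough sketch (matching on visible text is just one strategy):

    from bs4 import BeautifulSoup

    def css_selector_for(html, text):
        """Build a rough CSS selector path to the first element containing `text`."""
        soup = BeautifulSoup(html, "html.parser")
        node = soup.find(string=lambda s: s and text in s)
        if node is None:
            return None
        parts = []
        for parent in node.parents:
            if parent.name in (None, "[document]"):
                continue
            if parent.get("id"):                      # an id anchors the whole selector
                parts.append(f"#{parent['id']}")
                break
            classes = ".".join(parent.get("class", []))
            parts.append(f"{parent.name}.{classes}" if classes else parent.name)
        return " > ".join(reversed(parts))

None of the models managed even this level of output reliably.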
While impressive, the demo on Upwork didn’t even come close to fulfilling the job requirements. The job asked for instructions on how to set it up on an EC2 machine. It didn’t ask to run the model or to do anything else that was depicted.<p>It makes me question the truthfulness of the other claims.
>> With our advances in long-term reasoning and planning, Devin can plan and execute complex engineering tasks requiring thousands of decisions.<p>They'd better have really advanced reasoning and planning capabilities way beyond everything that anyone else knows how to do with LLMs. There's a growing body of literature that leaves no doubt that LLMs can't reason and can't plan.<p>For a quick summary of some such results see:<p><a href="https://arxiv.org/pdf/2403.04121.pdf" rel="nofollow">https://arxiv.org/pdf/2403.04121.pdf</a>
Scott Wu! I met Scott at a competitive programming event a few years back.<p>He is one of a very small group of people (going back to 1989) to get a perfect raw score at the IOI, the olympiad for competitive programming.<p><a href="https://stats.ioinformatics.org/people/2686" rel="nofollow">https://stats.ioinformatics.org/people/2686</a><p>Glad to see that he's putting his (unbelievable) talents to use. To give you a sense: at the event where I met him, he solved 6 problems equivalent to Leetcode medium-to-hard problems in under 15 minutes (total), including reading the problems, implementing input parsing, debugging, and submitting the solutions.
I must say, I'm not HUGELY impressed with a website that lets me, unauthenticated, upload files of an arbitrary size. Just posted a 500mb dmg file to their server.<p>If anyone is practicing for their B1 Dutch exam, feel free to use this link to get the practice paper.<p><a href="https://usacognition--serve-s3-files.modal.run/attachments/460be415-1283-4963-9a52-931ad509afa4/2020%20Lezen%20I%20openbaar%20examen%20tekstboekje%20(digitaal).pdf" rel="nofollow">https://usacognition--serve-s3-files.modal.run/attachments/4...</a>
Don't get it. If we have this amazing AI, why don't we make good use of it? 90% of my job as a senior software engineer is not to write code; it's to:<p>- deobfuscate complex requirements into well-divided chunks<p>- find gaps or holes in requirements so that I have to write the minimal amount of code<p>- understand codebases so that the implementation fits in nicely<p>I don't need an "AI software engineer"; I need an "AI people person who gives me well-defined tasks". Now sure, if you combine those two kinds of AIs, I could perhaps become irrelevant.
After devin "figures out" 10 issues, what does the code look like? Those are the easy ones, and if you haven't fixed them cleanly, the next 10 will be more difficult to solve, for human and for robot. Now do this for several years. Can devin create its own bug reports and issues? It had better be able to!<p>I'm curious what a large, mature codebase with complex internals and legacy code looks like after you sic devin on it. Not pretty, I suspect. In fact, I think it will become so difficult to fix that nobody -- neither human nor devin -- will be able to clean up the mess. By sheer volume, a broken ball of unfixable spaghetti.<p>I would be immensely pissed off if someone did this to an open source project of mine, or even to a closed-source codebase I'm working on. Not only would it not be useful, it would be moving <i>backwards</i>: creating an icky vomit mess that we will probably have to spend years cleaning up while bug reports and complaints from customers mount and competitors iterate faster.<p>Does that sound like something you want to deal with in your software business?
If you need AI to help you program an algorithm, then you shouldn't be using it, because you can't tell if the AI's solution is correct.<p>If you can tell whether a solution is correct or not --- well, then you don't need to have AI write it for you.<p>I think AI programming can only work when the industry begins to treat "almost working" systems backed by human customer service as acceptable.
From their Twitter:<p>> When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted.<p>While this is progress, it's far from being useful as a software engineer.
People who try to draw historical analogies to AI replacing humans say things like:<p>"cars replaced horse drawn carriages. But we managed to adapt to that, the carriage drivers got new jobs."<p>My dudes. We are the HORSES being replaced in this scenario.
So why are they hiring? <a href="https://jobs.ashbyhq.com/cognition">https://jobs.ashbyhq.com/cognition</a> Can't they just use "Devin"?
I have a few years of experience in backend development, and I have realized that LLMs are incredible productivity boosters for generating code, but only if you know the underlying libraries/frameworks/languages very well. You can then prompt one with very specific instructions and it can go do that. It helps with the typing, but that's pretty much all. I still have to know everything, and it can definitely not do everything on autopilot. I would be surprised if this product can do any real work.
Let's get realistic here - I just beat GPT-4 at tic-tac-toe, since it failed to block my 2/3-complete winning line...<p>Sure, one day we'll have AGI, and one day AGI will replace many jobs that can be done in front of a computer.<p>In the meantime, SOTA AI appears to be an airline chatbot that gets the company sued for lying to the customer. This is just basic question answering, and it can't even get that right. Would you trust it to write the autopilot code to fly the airplane? Maybe to write a tiny bit of it - just code up one function, perhaps?<p>I sure as hell wouldn't, and even when it can be trusted to write one function that meets requirements and has no bugs, it's still going to be a LONG time before it can replace the job of the developers who were given the task of "write us an autopilot".
Interesting: The last demo on the blog took 2.5h to complete:
<a href="https://www.cognition-labs.com/blog" rel="nofollow">https://www.cognition-labs.com/blog</a>
<a href="https://www.youtube.com/watch?v=UTS2Hz96HYQ" rel="nofollow">https://www.youtube.com/watch?v=UTS2Hz96HYQ</a>
"Devin's Upwork Side Hustle"<p>I wonder how much time of this was consumed by manually directing Devin into the right direction, manually fixing and undoing the mess Devin produced and watching Devin burn through $$$. As others said, being completely non-transparent about this burns a bit of trust, but I'd really like to know where we are right now. Since Devin is currently "invite only demos", a more realistic peek into the state of the art can be seen here: <a href="https://docs.sweep.dev/blogs/gpt-4-modification">https://docs.sweep.dev/blogs/gpt-4-modification</a><p>My gut feeling (and limited experience): gpt-4 and other models are not quite there yet, but whoever prepares for the next generation of models <i>now</i> will eventually win big times. Or be replaced by simpler approaches.
I've worked on very complex systems - the Disney streaming platform (before it was Disney), live video streaming systems, banking transaction systems, your run-of-the-mill CRUD software with Kafka clusters piping mind-numbing amounts of data, Netflix, and a few other large engineering-heavy companies.<p>No engineering company worth its salt is going to build a world-class technology business purely with generative AI in its current state. The risk in doing so currently is total and utter failure. I have a very hard time believing we're anywhere near that capability. Maybe your mom-and-pop startup could hire a prompt engineer to build a website and a simple tool, but we have yet to see those exercises surface into the mainstream; it's purely speculative.<p>I say, rest easy, programmers. Your careers will be enriched more than axed, with generative AI as a support tool, for many years to come.<p>Also, if anyone who works in this field has a strong opposing belief, consider that OpenAI engineers would be programming themselves out of a job, which is obviously not the case.
Looks interesting, but claiming "first" seems pretty off; there have been others, like <i>Sweep</i>, featured here before.<p><a href="https://news.ycombinator.com/item?id=36987454">https://news.ycombinator.com/item?id=36987454</a><p><i>Sweep is an open-source AI-powered junior developer</i><p><a href="https://sweep.dev/">https://sweep.dev/</a>
There is no way this is going to make it so that "engineers can focus on more interesting problems and engineering teams can strive for more ambitious goals."<p>Instead it will mean that bosses can fire 75-90% of the (very expensive) engineers, with the ones who remain left to prompt the AI and clean up any mistakes/misunderstandings.<p>I guess this is the future. We've coded ourselves out of a job. People are smiling and celebrating all this - personally, I find it kinda sad that we've basically put an end to software engineering as a career and put loads of people out of work. It is not just SWEs - it is impacting a lot of careers... I hope these researchers can sleep well at night, because they're dooming huge swathes of people to unemployment.<p>Are we about to enter a software engineering winter? People will find new careers, and no kids will learn to code since AI can do it all. We'll end up with a load of AI researchers being "the new SWEs", but relying on AI to implement everything? Maybe that will work, and we'll have a virtuous circle of AIs making AI improvements, and we'll never need engineers again? Or maybe we'll hit a wall and progress in comp sci will essentially stop?
As someone who works in this space (<a href="https://pythagora.ai">https://pythagora.ai</a>), I welcome new entrants to this niche.<p>Currently, mainstream AI usage in coding is at the level of assistants and glorified autocomplete. Which is great (I use GitHub Copilot daily), but for us working in the space it's obvious that the impact will be much larger. Besides us (Pythagora), there's also Sweep (mentioned by others in the comments) and GPT Engineer who are tackling the same problem, I believe each with a slightly different angle.<p>Our thesis is that human in the loop is key. In coding, you can think of LLMs as a very eager junior developer who can easily read StackOverflow but doesn't really think twice before jumping to implementation. With guidance (a LOT in terms of internal prompts, and some by human) it can achieve spectacular results.
Is Devin a new LLM? Perhaps one equipped with code and deploy plug-ins? The comparisons against other LLMs would suggest so.<p>The real-world eval benchmark puts Claude 2 way ahead of GPT-4, which doesn't sound right.
We’re still at the “rhyming not reasoning” phase of LLMs. The question of whether we move past rhyming and onto reasoning is a good one, and I’m not sure what I think about it. But I am pretty sure that coding is a lot more like reasoning than it is like rhyming, at least for de novo problems above a certain level of complexity (intellectual challenge) and complication (moving parts).<p>I remain open minded about what’s next and at the rate things are changing, I wouldn’t rule anything out a priori for now.
I'd really like it if Cognition Labs would put the resulting code from the demo into an open-source repository so we could examine it directly.<p>When I was using ChatGPT to help guide me through some coding tasks, I found it could create somewhat useful code, but where it fell down was that it would put things into loose variables that would be better organized into a class. It is this structuring of a complete system that is important for any real software engineering, rather than just writing code.
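A contrived sketch of the pattern I mean (all names invented):

    # What the model kept producing: state scattered across loose variables.
    base_url = "https://api.example.com"
    retries = 3
    timeout = 10.0

    # What real engineering wants: the same state and behavior in one place.
    from dataclasses import dataclass

    @dataclass
    class ApiClient:
        base_url: str = "https://api.example.com"
        retries: int = 3
        timeout: float = 10.0

        def fetch(self, path: str):
            """One obvious home for retry/timeout logic instead of ad-hoc globals."""
            ...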
I believe that hosting DEVIN will cost much more GPU time than hosting a regular LLM. Inspecting the videos on Cognition Labs' official website, I noticed that DEVIN can take more than one hour to do one step, which is more than an hour of GPU usage. When using GPT-4, we usually get output within 30 seconds, which is less than a minute of GPU usage.<p>In addition, when using GPT-4, I use it only when I have new thoughts, so the GPU occupancy rate is low. I probably use less than 5 hours of GPU time each month. DEVIN is sort of like an intern working for you, so you would probably make it work at least 40 hrs/week.<p>This difference in GPU usage would probably make DEVIN 10 times more expensive to offer profitably, that is, if they are using a subscription business model like GPT-4's.<p>I don't think there is any other viable business model for DEVIN -- it surely cannot replace or even reduce the number of human programmers, due to LLMs' unreliable nature and the necessity of code verification.
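The back-of-envelope version of that comparison, where every number is an assumption:

    # Rough GPU-hour arithmetic -- all inputs are guesses, not measurements.
    chat_gpu_hours = 300 * 1 / 60     # ~300 queries/month at ~1 GPU-minute each: ~5 h
    agent_gpu_hours = 40 * 4 * 0.5    # 40 h/week for 4 weeks at a 50% duty cycle: 80 h
    print(agent_gpu_hours / chat_gpu_hours)  # ~16x -- an order of magnitude either way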
Surprised how calm and underwhelmed the comments are.<p>Sure, it is no senior architect, but the trajectory is insane. It wasn’t that long ago that LLMs barely managed coherent poems. Now one is troubleshooting code problems on its own?<p>Sure, it’s just a GPT-4 wrapper, but that implies the same can be done with GPT-5 and GPT-6, etc.<p>Project it forward and this actually becomes non-trivial.
I really don't like these announcements with invitation lists.<p>Just let me try the goddamn product.<p>By the time you let me in, I don't care anymore, or another competitor has already caught my attention.<p>Neon, the Postgres-as-a-service, put me on such a long waitlist that by the time they invited me in, I was already on a completely different solution (and was happy with it).
Also see Devin the NAI for an old school alternative: <a href="https://docs.google.com/document/d/1byJgu1G_M58QVWmpZeEDthyAB8Bq752RgL_gF7hEJQc/edit?usp=drivesdk" rel="nofollow">https://docs.google.com/document/d/1byJgu1G_M58QVWmpZeEDthyA...</a>
We still don't have agents that can do simple things like "find a funny photo of my dog on my phone and post it as a story on Instagram" with 100% reliability. I would wait for that to happen first before thinking there can be an autonomous software engineer.
Hey! Stop taking our jobs!<p>Side note: I'm kind of offended that something called 'Devin' is going to take my job. If you're going to replace me at least let me keep my dignity by naming it something cool like 'Sora'
I recommend looking at SWE-bench to get an idea of what breakthroughs this product accomplishes: <a href="https://www.swebench.com/" rel="nofollow">https://www.swebench.com/</a>. They claim to have tested SOTA models like GPT-4 and Claude 2 (I would like to see it tested on Claude 3 Opus), and their score is 13.86%, as opposed to 4.80% for Claude 2. This benchmark is about solving real-world GitHub issues. So for those claiming that they tried models in the past and it didn't work for their use case: maybe this one will be better?
Bearish. These types of tools/agents-chaining will be irrelevant due to lackluster capability until AGI is achieved. At which point, the basis for creating these types of tools/agents will be defunct.
I've been working on something similar; here's one of their same tests, where the AI learns how to make a hidden-text image.<p><a href="https://www.youtube.com/watch?v=dHlv7Jl3SFI" rel="nofollow">https://www.youtube.com/watch?v=dHlv7Jl3SFI</a><p>The real problem is coherence (logic and consistency over time), which is what these wrappers try to address. I believe AI could probably be trained to be a lot more coherent out of the box, working with minimal wrapping; that is the AI I worry about.
This is awesome for bootstrapping some ideas. The question is: can it work with (large) existing code bases, or modify its own code? I guess a good test would be: can it reproduce Devin? ;)
When you have software failing in prod because it was built by shoddy "AI" and people who copy/paste because they don't know any better, and you need a fix, give me a ring.<p>I have tried using GPT-4 and Gemini extensively, and the amount of bullshit generated makes them unreliable if you don't already know the domain. These tools lack the critical stuff (being context-aware) and just make up libraries and APIs. Yet you can't be sure when they're bullshitting or not, making it an exercise in frustration for anything that's not trivial.<p>Save your money and buy an O'Reilly subscription.
I'm totally adding "rescue and recovery of projects botched by AI" to my list of services. One thing is certain, it's not going to be cheap.
I mean, this might just be existential cope, but my first thought when looking at the Upwork demo posted on Twitter (<a href="https://x.com/cognition_labs/status/1767548768734294113?s=20" rel="nofollow">https://x.com/cognition_labs/status/1767548768734294113?s=20</a>) was that it seemed a little suspicious.<p>Namely, the client's request was unusually specific for Upwork. It was an almost perfect example of a job to give an AI agent for testing purposes.
I have in my codebase several really long Django views files (3k lines!). They were written poorly, with many nested if statements for parsing and error handling.<p>One by one, I can use VSCode's GitHub Copilot to rewrite each function the way I want it.<p>What I want to do is iterate through all the functions in the files and handle each one automatically.<p>I know we are getting there, but does anybody know how that can be done right now?
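The closest thing I can picture is a sketch like this, using the stdlib ast module plus the OpenAI client (the model name and prompt are placeholders, and I'd review every rewrite by hand before committing):

    import ast
    from openai import OpenAI

    client = OpenAI()
    source = open("views.py").read()

    # Walk only top-level functions; views in these files are module-level.
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            func_src = ast.get_source_segment(source, node)
            resp = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {"role": "system", "content": (
                        "Rewrite this Django view: flatten the nested ifs and "
                        "extract parsing/error handling into helpers. Return only code."
                    )},
                    {"role": "user", "content": func_src},
                ],
            )
            print(resp.choices[0].message.content)  # inspect the diff, don't auto-apply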
Technological unemployment and doomerism aside, I think there's a big difference here: in the past, you needed lots of capital to invest in those labour-saving devices. A farm labourer couldn't buy a tractor; a dockworker couldn't buy a crane.<p>But a software engineer absolutely can buy access to AI services.<p>I have no idea how this will end up, but it'll be different from before.
Until you can point out an issue via video ("see? Here it flickers a bit, and here it needs centering"), or talk to the "SWE agent" and say we need this feature taken out for now, and later ask it to put the feature back in and have it remember that the code was implemented at GitHub commit id xxyyzz, you really can't call this a software engineer.
AI replacing one of the last well-paid jobs on the planet is a good thing. Large-scale societal changes are triggered when a critical number of haves turn into have-nots. I would recommend that junior engineers study Nechayev and Bakunin instead of the latest React flavor. Those will have a better ROI in the coming years.
The lack of attention this development is getting on here is astounding. Other developers I am talking to about it brush it off, then change topic.<p>So many comments about how insufficient the tool is.<p>Our heads are really in the sand, I'm afraid.
Is it built with pre-existing LLMs, or did they create one from the ground up? With $21 million of Series A funding, an LLM more powerful than GPT-4 seems impossible. What am I missing?
This is where inference speed starts to matter. An H100 might be cheaper per inference than Groq, but cutting the wait time from 1 minute to 10 seconds could be a big deal.
Wow, this is incredible news! Congratulations to the team behind Devin, the first AI engineer! This is a monumental leap forward in technology and innovation. I'm absolutely thrilled to see how Devin will revolutionize the field of engineering.<p>As someone passionate about the potential of AI in tech, I can't wait to see what amazing feats Devin will accomplish. And who knows, maybe one day, companies like Munesoft Technologies will reach similar heights with their own AI-driven advancements. Here's to a future filled with endless possibilities! #DevinAI
For something that you can download and try right now, and that actually works for daily coding tasks, you can try my desktop app 16x Prompt.<p><a href="https://prompt.16x.engineer/" rel="nofollow">https://prompt.16x.engineer/</a><p>It's not 100% automated, but it saves a lot of the time spent on writing code.<p>It works by composing prompts from task instructions, source code context, and formatting instructions, resulting in high-quality prompts that can be fed into LLMs to generate high-quality code.
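The core mechanic, reduced to a hypothetical sketch (this is not the app's actual code):

    def compose_prompt(task, context_files, format_rules):
        """Join task instructions, source context, and output rules into one prompt."""
        context = "\n\n".join(
            f"--- {path} ---\n{code}" for path, code in context_files.items()
        )
        return f"{task}\n\nRelevant source:\n{context}\n\nOutput rules:\n{format_rules}"

    prompt = compose_prompt(
        task="Add pagination to the /users endpoint.",
        context_files={"api/users.py": open("api/users.py").read()},
        format_rules="Return a unified diff only.",
    )

Paste the result into your LLM of choice; the structure does most of the work.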
There have been many tools that were sold as developer replacements over the years.
- Microsoft FrontPage
- Adobe Dreamweaver
- A litany of glorified WYSIWYG editors
- WordPress
- Wix
- Power Apps/SharePoint
...and so on.<p>Business owners have been getting sexually aroused at the prospect of taking a developer's salary and putting it in their own pockets for decades. Each iteration of this wet dream has only locked businesses into "low code" systems that require even more highly-specialized developers to operate. Right now, and probably for a while, Devin et al. are on par with the drag-and-drop automagical app-building snake oil stuff.<p>LLMs are useful for helping developers be more productive, which does translate to lay-offs, but until someone creates an AI that can translate the absolute fevered gibberish that comes out of business people's heads into a profitable piece of software, this is just MS FrontPage v100.0.<p>Just like an entire industry sprang up around fixing WordPress websites that business owners thought they could build themselves, pretty soon we'll start seeing job postings for AI-Generated Spaghetti Unravellers.<p>I'm (half seriously) imagining a future where software engineers are mostly consultants who show up and talk with business folks, then talk with the local robot, and get the project to actually work. Bill $1k per hour.
There's also the alternative: Devin, the NAI.<p><a href="https://docs.google.com/document/d/1byJgu1G_M58QVWmpZeEDthyAB8Bq752RgL_gF7hEJQc/edit" rel="nofollow">https://docs.google.com/document/d/1byJgu1G_M58QVWmpZeEDthyA...</a>
Are we not concerned that even though, yes, Devin is only solving 13% of issues, it is also an ML model? It is going to learn, potentially very quickly.
Yet again, bad time to be on the labor side of the equation, great time to be a capitalist. For us laborers, if I had to choose from a list of fields to go into, anything creative would be low on the list. 'Prompt Engineer' will be the only one left.<p>UBI is a pipe dream... it's not happening. The wealth and means of production won't be shared in any meaningful capacity. Wealth inequality can get a whole lot worse.
Humans seek work that provides satisfaction and meaning in their life.<p>With every technological advancement, artisans are the first to be made obsolete.<p>Sure, we have landfills full of unworn textiles, and the market says it's good, but overall we keep destroying what allows humans to seek meaning.<p>Our governments and society have made it clear: if you don't produce value, you don't deserve dignity.<p>We have outsourced art to computers, so people who don't understand art can have it at their fingertips.<p>Now we're outsourcing engineering so those who don't understand it can have it done for cheap.<p>We hear stories of those who don't understand therapy suggesting AI can be a therapist, of those who don't understand medicine suggesting AI can replace a doctor.<p>What will be left? Where will we be? Just stranded without dignity or purpose, left to rot when we no longer produce value.<p>I ask this question often, in multiple contexts, but to what end? Who benefits from these advancements? The CEO and shareholders, sure, but just because something can be had for cheaper doesn't mean it improves lives. Our clothes barely last a year; our shoes fall apart. Our devices come with pre-destined expiration dates.<p>Where will we be in the future? Those born into money can continue passing it around, a cargo cult for the numbers going up. But what about everyone else?
Hey, I am a newbie in the field of AI, and I want to know about the near future of AI. After Devin, I am a little terrified -- or rather, I should say deeply terrified -- about the future (5 to 8 years) of software engineering. Can someone explain?
We're not that far from a major turning point.<p>Currently these models don't provide an adequate confidence measure, and that keeps them from maximizing their potential. In the next few years we're going to reach a point where models will be able to tell if something is possible and avoid hallucinating, guaranteeing much better correctness. Something like that would be absolutely killer.<p>If you add a top-down approach using a framework, such that the model can break a system's architecture down into small individual components, then that's a recipe for a really great workflow. The models we have now really shine at writing automated unit tests and small bits of code that stay within the limits of the context size. Making the interfaces obvious enough, and being able to glue things together using obvious connections, seems very possible.<p>I really do think that in the next few years we're going to see one of these tools do really well.