This is actually a huge deal.<p>As someone building AI SaaS products, I used to hold the position that directly integrating with APIs was going to get us most of the way there in terms of complete AI automation.<p>I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.<p>My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.<p>My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.<p>I started to realize that most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs, etc.<p>Boy was I wrong.<p>I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super hairy, and hence very worthwhile to automate.
Anthropic blog post outlining the research process:
<a href="https://www.anthropic.com/news/developing-computer-use" rel="nofollow">https://www.anthropic.com/news/developing-computer-use</a><p>Computer use API documentation:
<a href="https://docs.anthropic.com/en/docs/build-with-claude/computer-use" rel="nofollow">https://docs.anthropic.com/en/docs/build-with-claude/compute...</a><p>Computer Use Demo:
<a href="https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo">https://github.com/anthropics/anthropic-quickstarts/tree/mai...</a>
I like its lazy approach<p>> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?<p>It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.<p>It can be insistent when it needs to be, though:<p>> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:<p>> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:<p>> Let me try clicking Firefox one more time to ensure it starts:<p>> Let me try one more time to take a screenshot to see if Firefox has opened:<p>> Let me try starting Firefox using the bash tool as an alternative method:<p>> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?<p>I didn't instruct him to ignore the setup wizard, so my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go off the rails.<p>And here he goes through my website, through my Hacker News account, and then finds this very comment:<p>> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:<p>"I like its lazy approach"<p>This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but in other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. And Perplexity says that Opus is the most advanced, compared to Sonnet.<p>And finally, in the table in the blog post, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it; maybe the ROI is lower on Opus or something?<p>When I manually tested it, I felt like Opus gave slightly better replies than Sonnet, but I'm not 100% sure it isn't just placebo.
The "computer use" ability is extremely impressive!<p>This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as it's access to your computer, to achieve that goal.<p>Take a look at their demo of using this for coding.<p><a href="https://www.youtube.com/watch?v=vH2f7cjXjKI" rel="nofollow">https://www.youtube.com/watch?v=vH2f7cjXjKI</a><p>This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.<p>It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.<p>Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.<p>I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and how much better than ChatGPT and GitHub Copilot it is for daily use. I've been on the pro plan for maybe 6 months. I'm not a power user leveraging their API or anything - just the chat interface, though with ever more use of Projects lately. I use it every day, whether for mundane answers or curiosities, "write me this code" requests, or general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.<p>I do still occasionally pop over to ChatGPT to test the waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
Claude 3.5 Opus is no longer mentioned at all on <a href="https://docs.anthropic.com/en/docs/about-claude/models" rel="nofollow">https://docs.anthropic.com/en/docs/about-claude/models</a><p>Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" <a href="https://web.archive.org/web/20241008222204/https://docs.anthropic.com/en/docs/about-claude/models" rel="nofollow">https://web.archive.org/web/20241008222204/https://docs.anth...</a><p>The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
One of the funnier things during training with the new API (which can control your computer) was this:<p><i>"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.<p>Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."</i><p>[0] <a href="https://x.com/AnthropicAI/status/1848742761278611504" rel="nofollow">https://x.com/AnthropicAI/status/1848742761278611504</a>
This needs more discussion:<p>Claude using Claude on a computer for coding
<a href="https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo" rel="nofollow">https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo</a> (3 mins)<p>True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.<p>If something similar happens, most jobs that could be done remotely will be automatable in a few years.
I wrote up some of my own notes on Computer Use here: <a href="https://simonwillison.net/2024/Oct/22/computer-use/" rel="nofollow">https://simonwillison.net/2024/Oct/22/computer-use/</a>
From the computer use video demo, that's a <i>lot</i> of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's a very good thing that Anthropic is upfront that it isn't perfect. And it's guaranteed that there will be a viral story where Claude accidentally deletes something important with it.<p>I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks, as claimed.
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.<p>OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.<p>This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of Werther's wipe-us-all-out AI, and that means it's going to be OK.
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT, and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.<p>I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it, and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.<p>It's going to be really wild when Claude (or another AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are <i>always</i> “using a computer” but now they can “use your computer.”
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.<p>It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
Pretty cool! I use Claude 3.5 to control a robot (ARKit/iOS based) and it does surprisingly well in the real world: <a href="https://youtu.be/-iW3Vzzr3oU?si=yzu2SawugXMGKlW9" rel="nofollow">https://youtu.be/-iW3Vzzr3oU?si=yzu2SawugXMGKlW9</a>
Great work by Anthropic!<p>After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.<p>Claude Sonnet 3.5 is already so good, especially at coding. I'm looking forward to testing the new version if it is, indeed, even better.<p>Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
If "computer use" feature is able to find it's way in Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)<p>Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I compared ChatGPT to Claude and the results (to my eye) were better: the output was better structured and the canvas works better. I'm on the verge of jumping ship.
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
Why on god's green earth is it not just called Claude 3.6 Sonnet. Or Claude 4 Sonnet.<p>I don't actually care what the answer is. There's no answer that will make it make sense to me.
Not that I'm scared of this update, but I'd probably be alright with pausing LLM development today, at least in regard to producing code.<p>I don't want an LLM to write all my code, regardless of whether it works; I like to write code. What these models are capable of at the moment is perfect for my needs, and I'd be 100% okay if they didn't improve at all going forward.<p>Edit: also, I don't see how an LLM-controlled system can ever replace a deterministic system for critical applications.
Great progress from Anthropic! They really shouldn't change models out from under users, however. A name should refer to a specific set of model weights, more or less.<p>On the other hand, as long as it's actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.<p>Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".<p><a href="https://trends.google.com/trends/explore?date=now%201-d&geo=US&q=chatgpt,claude,anthropic&hl=en" rel="nofollow">https://trends.google.com/trends/explore?date=now%201-d&geo=...</a>
I am curious why it's "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet. A minor version increment is the standard way to version an update. Am I missing something, or is it just Anthropic marketing?
Claude is <i>absurdly</i> better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
This can power one of my favorite use cases.<p>Like: find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit-down with lots of comfy seating, good vegetarian food...<p>Not only does this kind of use get around API restrictions, it is also a superior way to do search: specify arbitrary preferences upfront instead of typing into a search box, and let it trawl different modalities of content to get a better result. The possibilities for wellness use cases are endless, especially for end users who care about privacy and less screen time.
my quick notes on Computer Use:<p>- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB<p>- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod <a href="https://x.com/swyx/status/1771255525818397122" rel="nofollow">https://x.com/swyx/status/1771255525818397122</a><p>- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.<p>- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging<p>- New 3.5 Haiku is 68% cheaper than Claude Instant haha<p>references i had to dig a bit to find<p>- <a href="https://www.anthropic.com/pricing#anthropic-api" rel="nofollow">https://www.anthropic.com/pricing#anthropic-api</a><p>- <a href="https://docs.anthropic.com/en/docs/build-with-claude/vision#evaluate-image-size" rel="nofollow">https://docs.anthropic.com/en/docs/build-with-claude/vision#...</a><p>- loop code <a href="https://github.com/anthropics/anthropic-quickstarts/blob/main/computer-use-demo/computer_use_demo/loop.py">https://github.com/anthropics/anthropic-quickstarts/blob/mai...</a><p>- some other screenshots <a href="https://x.com/swyx/status/1848751964588585319" rel="nofollow">https://x.com/swyx/status/1848751964588585319</a><p>- <a href="https://x.com/alexalbert__/status/1848743106063306826" rel="nofollow">https://x.com/alexalbert__/status/1848743106063306826</a><p>- model card <a href="https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf" rel="nofollow">https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...</a>
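To make the loop concrete, here is a minimal sketch of that vision + tool-use loop using the Anthropic Python SDK. The take_screenshot/execute_action helpers and the hand-rolled tool schema are stand-ins for what the reference loop.py actually implements (the production API exposes a dedicated computer tool type behind a beta flag), so treat this as an illustration, not the shipped code:<p><pre><code># Sketch of the screenshot -> model -> tool-call loop.
# take_screenshot() and execute_action() are hypothetical stubs; see loop.py
# linked above for the real implementation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def take_screenshot() -> str:
    """Return a base64-encoded PNG of the current screen (stub)."""
    raise NotImplementedError


def execute_action(name: str, args: dict) -> str:
    """Perform the requested click/type/scroll and describe what happened (stub)."""
    raise NotImplementedError


screen_tool = {  # assumed schema for a generic screen-action tool
    "name": "computer_action",
    "description": "Click, type, or scroll on the screen.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string"},
            "coordinate": {"type": "array", "items": {"type": "integer"}},
            "text": {"type": "string"},
        },
        "required": ["action"],
    },
}

messages = [{"role": "user", "content": "Open the settings page and enable dark mode."}]

for _ in range(10):  # cap the number of iterations
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[screen_tool],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the model thinks it is done
    results = []
    for block in response.content:
        if block.type == "tool_use":
            outcome = execute_action(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [
                    {"type": "text", "text": outcome},
                    {"type": "image", "source": {"type": "base64",
                                                 "media_type": "image/png",
                                                 "data": take_screenshot()}},
                ],
            })
    messages.append({"role": "user", "content": results})
</code></pre>The "agent" is nothing exotic: screenshot in, tool call out, repeat until the model stops asking for tools or you hit the iteration cap.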
It improves to 25.9 over the previous version of Claude 3.5 Sonnet (24.4) on NYT Connections: <a href="https://github.com/lechmazur/nyt-connections/">https://github.com/lechmazur/nyt-connections/</a>.
This "Computer use" demo:<p><a href="https://www.youtube.com/watch?v=jqx18KgIzAE" rel="nofollow">https://www.youtube.com/watch?v=jqx18KgIzAE</a><p>shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an AppleScript library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will expand the pre-built possibilities even further, without requiring the local AI to reinvent steps or tools at the time of each prompt.
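For a sense of how little glue that takes, here is a toy sketch of shelling out to osascript from Python; the Safari/System Events snippets are my own examples of the kind of hooks meant here, not anything a vendor ships:<p><pre><code># Toy sketch: a local agent driving scriptable macOS apps through osascript.
# The AppleScript strings are ordinary examples; a model would generate them.
import subprocess


def run_applescript(script: str) -> str:
    """Run an AppleScript snippet and return its stdout."""
    result = subprocess.run(
        ["osascript", "-e", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Ask System Events which app is frontmost.
front_app = run_applescript(
    'tell application "System Events" to get name of first application process '
    'whose frontmost is true'
)
print("Frontmost app:", front_app)

# Open a URL in Safari without touching the GUI at all.
run_applescript('tell application "Safari" to open location "https://www.anthropic.com"')
</code></pre>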
This looks quite fantastic!<p>Nice improvements in scores across the board, e.g.<p>> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.<p>I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!<p>I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.<p>[0]: <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html" rel="nofollow">https://transformer-circuits.pub/2024/scaling-monosemanticit...</a>
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.<p>I've often wondered what the combination of grammar-based speech recognition and an LLM could do for accessibility: low-domain natural-language speech recognition augmented by grammar-based recognition of high-domain commands, for efficiency and accuracy, reducing voice strain and increasing recognition accuracy.<p><a href="https://github.com/dictation-toolbox/dragonfly">https://github.com/dictation-toolbox/dragonfly</a>
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.<p>Computer use seems it might be good for e2e tests.
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.<p>I've been using Opus as a writing companion for several months; especially when I had writer's block and asked it for alternative phrases, it was super creative. But in recent weeks I've noticed a degradation in quality. My impression is that the model has been degrading. Could this be technically possible? Might it be some kind of planned obsolescence to hype new models?
Fascinating. Though I expect people to be concerned about the privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole other level!<p>But I am very excited about this in the context of accessibility. Screen readers and screen control software are hard to develop and hard to learn to use. This sort of "computer use" with AI could open up so many possibilities for users with disabilities.
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.<p>But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
<i>Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. </i><p>Nice, but I wonder why they didn't use UI automation/accessibility libraries, which have access to the semantic structure of apps/web pages, as well as access documents directly instead of having Excel display them for you.
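For web pages at least, that semantic route already exists; here is a hedged sketch using Playwright's role/label-based selectors instead of pixels (the URL and field names are made up for illustration):<p><pre><code># Sketch of the "semantic" alternative for the web: drive the DOM/accessibility
# tree directly instead of screenshots. The page and element names are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")
    # Locate elements by accessible role and label, not by screen coordinates.
    page.get_by_label("Email").fill("user@example.com")
    page.get_by_label("Password").fill("hunter2")
    page.get_by_role("button", name="Sign in").click()
    browser.close()
</code></pre>The screenshot approach gives up exactly this structure, but in exchange it works on apps that expose no such tree at all.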
> developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.<p>So, this is how AI takes over the world.
Seems like both:<p>- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.<p>- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
How does the computer use work -- is this a desktop app they are providing that can do actions on your computer? I didn't see any mention of that in the post.
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It's arguably the most popular library for creative coding.<p>In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.<p>Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.<p>Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world Python pull requests from a certain dataset. In case you're wondering, one paper (<a href="https://cs.paperswithcode.com/paper/swe-bench-enhanced-coding-benchmark-for-llms" rel="nofollow">https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...</a>) found that 94% of the pull requests in SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data leakage.
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:<p>Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)<p>Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?<p>Really want to make Claude my daily driver, but right now it often feels too much like a research project.
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
Is there anything out there yet that will let me issue the command:<p>> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.<p>Then I can just `git status` to see the changes?
They need to work on their versioning.<p>"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.<p>Is it "New" sonnet? is it "upgraded"? Is there a difference? How do I know which one I use?<p>I can understand claude-3-5-sonnet-20241022, but that's not what users see.
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
> "... and similar speed to the previous generation of Haiku."<p>To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".<p>In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
That’s too much control for my taste. I don't want Anthropic to see my screen. I'd rather have VS Code with integrated Claude - a version that can see all my dev files in a given folder. I don't need it to run Chrome for me.
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.<p>Pretty cool for sure.
How long until it is profitable to tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?<p>I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?<p>I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
If anyone would like to try the new Sonnet in VSCode. I just updated <a href="https://double.bot">https://double.bot</a> to the new Sonnet. (disclaimer: I am the cofounder/creator)<p>---<p>Some thoughts:<p>* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.<p>* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
The "computer use" demos are interesting.<p>It's a problem we used to work on and perhaps many other people have always wanted to accomplish since 10 years ago. So it's yet to be seen how well it works outside a demo.<p>What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening or is it so it does not trigger Captchas ?
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.<p>Also seems like a privacy issue with them sending screenshots of your device back to their servers.
Anybody know how the hell they're combating / gonna combat captchas, Cloudflare blocking, etc.? I remember playing in this space on a toy project and being utterly frustrated by anti-scraping measures. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe they'll just cut sweetheart deals?
I am surprised it uses macOS for the demo, as I thought it would be harder to control than Ubuntu. But maybe macOS is also the most predictable/reliable desktop environment? I noticed that they use a virtual environment for the demo; I'm curious how they built that along with Docker. Is it leveraging the latest virtualization framework from Apple?
Interesting stuff, i look forward to future developments.<p>A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click, scroll, and type with specific parameters and execute them using the OS's APIs (usually accessibility APIs).<p>Did I miss something? Did they have to make changes to the model for this?
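Roughly, that would look like the sketch below with the OpenAI Python SDK plus pyautogui for the OS side; the tool names and schemas are my own assumptions about a reasonable setup, not something either vendor ships:<p><pre><code># Sketch: expose click/type as function-calling tools and execute the model's
# calls with pyautogui. Tool names and schemas are assumptions for illustration.
import json

import pyautogui
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "function", "function": {
        "name": "click",
        "description": "Click at screen coordinates.",
        "parameters": {"type": "object",
                       "properties": {"x": {"type": "integer"},
                                      "y": {"type": "integer"}},
                       "required": ["x", "y"]}}},
    {"type": "function", "function": {
        "name": "type_text",
        "description": "Type text at the current focus.",
        "parameters": {"type": "object",
                       "properties": {"text": {"type": "string"}},
                       "required": ["text"]}}},
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Click at (100, 200), then type 'hello'."}],
    tools=tools,
)

# Execute whatever tool calls the model asked for.
for call in resp.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    if call.function.name == "click":
        pyautogui.click(args["x"], args["y"])
    elif call.function.name == "type_text":
        pyautogui.write(args["text"])
</code></pre>The part Anthropic appears to have actually trained for is the vision half - reliably mapping a screenshot to the right pixel coordinates - which is where generic models tend to fall over.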
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
Next stop after 'Computer Use': multimodal input from a robot's sensors, and generating the signals to control its actions.<p>Looking forward to seeing this in the coming few years, and hoping such a robot could be of help to many people, including the elderly.
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
Off-topic, but YouTube doesn't allow me to view the embedded video, showing a "Sign in to confirm you’re not a bot" message. I need to open a dedicated YouTube tab to watch it.<p>The barrier to scraping YouTube has increased a lot recently; I can barely use yt-dlp anymore.
Hopefully the coding improvements are meaningful because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday) but I like Claude's demeanor more (I know this sounds crazy but it matters a bit to me)
Any comments on alignment with Anthropic's mission? Last time I checked, Anthropic is about building SOTA models because that is the only way to do safety research. Making money and commercially useful stuff is a means to an end.
I wonder when it'll actually be available in the Bedrock AU region, because as of right now we're still stuck using mid-range models from a year ago.<p>Amazon has really neglected ap-southeast-2 when it comes to LLMs.
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with Chat GPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now far excels over the others.
I suspect they are going to need some local offload capabilities for Computer Use; the repeated screen reading can definitely be done locally on modern machines. Otherwise the cost may be impractical.
This looks really similar to rabbit's Large Action Model (LAM). Cool!<p><a href="https://www.rabbit.tech/rabbit-os" rel="nofollow">https://www.rabbit.tech/rabbit-os</a>
One suggestion, use the following prompt at a LLM:<p><pre><code> The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku "* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the youtube vid's title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.</code></pre>
This "computer use" feature is obviously perfect for automating GUI tests. Will it work on screenshots of mobile devices like smartphones/tables, also?
Interestingly, the new Claude only knows content up to:<p>> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
Cursor AI already has the option to switch to claude-3-5-sonnet-20241022 in the chat box.<p>I was about to try adding it as a custom API. I'm impressed by the speed of that team.
Since they didn't rev the version, does this mean that if we were using 3.5 today it's just automatically using the new version? That doesn't seem great from a change management perspective.<p>Though I am looking forward to using the new one in cursor.ai.
wow, i almost got worried, but the cute music and the funny little monster on the desk convinced me that this is all just fun and dandy and all will be good. the future is coming and we'll all be much happier :)
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.<p>Their whole policing-the-AI-model stance is commendable but ultimately renders their tools useless.<p>It actually started arguing with me about whether it was allowed to help implement a GitHub repository's code because it might be copyrighted... it was MIT-licensed open source from Google :/
I checked the docs but could not find an answer:
Does Claude have an API like the GPT Assistants API, one where you can also give it a set of documents to work with?<p>It seems that you can only send individual messages, with no way to have it "learn" from predefined documents.
Computer use is really going to highlight how fragmented the desktop ecosystem is. It also adds more context to how Microsoft wants to use their screenshot AI.
Just a question:
For this thingy to work, do I have to give the provider access to my computer?
Good luck. :)<p>Just another reason to use ONLY local LLMs.
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
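A minimal sketch of one iteration of such a loop against a local Ollama vision model is below, assuming pyautogui for screenshots/clicks and a JSON-only reply convention that the model may or may not respect; the model name and prompt are assumptions, and you would want to run this inside a throwaway VM or container rather than on your real desktop:<p><pre><code># Sketch: one iteration of a self-hosted "computer use" loop against a local
# Ollama vision model. Model name, prompt, and reply format are assumptions.
import base64
import io
import json

import pyautogui
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint


def screenshot_b64() -> str:
    """Capture the screen and return it as a base64-encoded PNG."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


prompt = (
    "You control this computer. Reply with JSON only, e.g. "
    '{"action": "click", "x": 120, "y": 340} or {"action": "done"}. '
    "Goal: open the web browser."
)

resp = requests.post(OLLAMA_URL, json={
    "model": "llava",  # any locally pulled vision-capable model
    "stream": False,
    "messages": [{"role": "user", "content": prompt, "images": [screenshot_b64()]}],
})
reply = resp.json()["message"]["content"]

try:
    action = json.loads(reply)
    if action.get("action") == "click":
        pyautogui.click(action["x"], action["y"])
except json.JSONDecodeError:
    print("Model did not return JSON:", reply)
</code></pre>Wrap that in a loop with an iteration cap and you have the skeleton; how well small local models actually ground clicks in screenshots is the open question.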
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).<p><pre><code> 84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
</code></pre>
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!<p><pre><code> 92% Sonnet 10/22
75% o1-preview
72% Opus
64% Sonnet 06/20
49% GPT-4o 08/06
45% o1-mini
</code></pre>
<a href="https://aider.chat/docs/leaderboards/" rel="nofollow">https://aider.chat/docs/leaderboards/</a>
Why not rev the numbers? "3.5" vs. "3.5 New" feels weird -- is there a particular reason why Anthropic doesn't want to call this 3.6 (or even 3.5.1)?
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
My First Experience with Claude Computer Use - It's Mind-Blowing!<p>Just tested Claude's new Computer Use feature and had to share this simple but powerful test:<p>My Basic Prompt:
"Please:
1. Search Amazon for 3 wireless earbuds:
Find price
Rating
Brand name<p>2. Make a simple Excel file 'earbuds.xlsx':
Put the information in a basic table
Add colors to the headers
Sort by price<p>3. Show me the results"<p>What blew my mind:
- Claude actually looked at my screen
- Moved the mouse by itself
- Clicked buttons like a human
- Created reports automatically<p>It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.<p>For those interested: <a href="https://mergisi.medium.com/8f56f683e307" rel="nofollow">https://mergisi.medium.com/8f56f683e307</a>
Scary stuff.<p>'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% of people, make me a spreadsheet and send it to HR. Oh, make sure not to fire the HR department'<p>cf. IBM, 1979.
Both the new Sonnet and gpt-4o still fail at a simple:<p>"How many w's are in strawberry?"<p>gpt-4o: There are 2 "w's" in "strawberry."<p>Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.<p>(The same question with 'r' succeeds.)<p>What is artificial about the current generation of "artificial intelligence" is the way training (predict the next token) and benchmarking (overfitting) are done. Perhaps a fresh approach is needed to achieve a true next step.
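For reference, the correct answers are trivial to check outside the model:<p><pre><code># "strawberry" contains exactly one 'w' and three 'r's
>>> "strawberry".count("w")
1
>>> "strawberry".count("r")
3
</code></pre>The usual explanation is that tokenizers hand the model multi-character chunks rather than individual letters, so character counting is a weak probe of reasoning, even if the failures are funny.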