Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

1454 points by weirdcat, 7 months ago

127 comments

LASR, 7 months ago
This is actually a huge deal.<p>As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.<p>I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.<p>My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.<p>My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.<p>I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.<p>Boy was I wrong.<p>I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile in automating.
marsh_mellow, 7 months ago
Anthropic blog post outlining the research process: <a href="https:&#x2F;&#x2F;www.anthropic.com&#x2F;news&#x2F;developing-computer-use" rel="nofollow">https:&#x2F;&#x2F;www.anthropic.com&#x2F;news&#x2F;developing-computer-use</a><p>Computer use API documentation: <a href="https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;build-with-claude&#x2F;computer-use" rel="nofollow">https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;build-with-claude&#x2F;compute...</a><p>Computer Use Demo: <a href="https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;anthropic-quickstarts&#x2F;tree&#x2F;main&#x2F;computer-use-demo">https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;anthropic-quickstarts&#x2F;tree&#x2F;mai...</a>
csomar, 7 months ago
I like its lazy approach<p>&gt; I apologize, but it seems we&#x27;re having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?<p>It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.<p>It seems to be insistent when it needs to be, though<p>&gt; I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:<p>&gt; Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:<p>&gt; Let me try clicking Firefox one more time to ensure it starts:<p>&gt; Let me try one more time to take a screenshot to see if Firefox has opened:<p>&gt; Let me try starting Firefox using the bash tool as an alternative method:<p>&gt; Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it&#x27;s open?<p>I didn&#x27;t instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn&#x27;t go loose.<p>And here he goes through my website, through my Hacker News account and then finds this very comment<p>&gt; Looking at his first&#x2F;most recent comment, it&#x27;s about a discussion of Claude and computer use. Here&#x27;s what he wrote:<p>&quot;I like its lazy approach&quot;<p>This appears to be a humorous response in a thread about &quot;Computer use, a new Claude 3.5 Sonnet, and Claude...&quot; where he&#x27;s commenting on an AI&#x27;s behavior in a situation. 
The comment is very recent (shown as &quot;8 minutes ago&quot; in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
diggan, 7 months ago
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic&#x27;s website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn&#x27;t make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.<p>And finally, in the table in the blogpost, Opus isn&#x27;t even included? It seems to me like Opus is the best model they have, but they don&#x27;t want people to default to using it, maybe the ROI is lower on Opus or something?<p>When I manually tested it, I felt like Opus gives slightly better replies compared to Sonnet, but I&#x27;m not 100% sure it isn&#x27;t just placebo.
HarHarVeryFunny, 7 months ago
The &quot;computer use&quot; ability is extremely impressive!<p>This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it&#x27;s basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.<p>Take a look at their demo of using this for coding.<p><a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=vH2f7cjXjKI" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=vH2f7cjXjKI</a><p>This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It&#x27;d be mind boggling if we hadn&#x27;t had the last few years to get used to this escalation of capabilities.<p>It&#x27;s also interesting to consider this from the POV of Anthropic&#x27;s focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking?
bonoboTP, 7 months ago
I&#x27;ve been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI&#x2F;ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software&#x2F;websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.<p>Just as how expert systems didn&#x27;t take off and tagging every website for the Semantic Web didn&#x27;t happen either, we have to accept that the real world of humans is messy and unstructured.<p>I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
LVB, 7 months ago
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub Copilot for daily use. I&#x27;ve been pro for maybe 6 months. I&#x27;m not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to &quot;write me this code&quot;, to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.<p>I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I&#x27;ve not felt any need to switch back or have both. Well done, Anthropic!
simonw, 7 months ago
Claude 3.5 Opus is no longer mentioned at all on <a href="https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;about-claude&#x2F;models" rel="nofollow">https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;about-claude&#x2F;models</a><p>Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming &quot;Later this year&quot; <a href="https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20241008222204&#x2F;https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;about-claude&#x2F;models" rel="nofollow">https:&#x2F;&#x2F;web.archive.org&#x2F;web&#x2F;20241008222204&#x2F;https:&#x2F;&#x2F;docs.anth...</a><p>The fact that it&#x27;s no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
gzer0, 7 months ago
One of the funnier things during training with the new API (which can control your computer) was this:<p><i>&quot;Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.<p>Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park.&quot;</i><p>[0] <a href="https:&#x2F;&#x2F;x.com&#x2F;AnthropicAI&#x2F;status&#x2F;1848742761278611504" rel="nofollow">https:&#x2F;&#x2F;x.com&#x2F;AnthropicAI&#x2F;status&#x2F;1848742761278611504</a>
nopinsight, 7 months ago
This needs more discussion:<p>Claude using Claude on a computer for coding <a href="https:&#x2F;&#x2F;youtu.be&#x2F;vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo</a> (3 mins)<p>True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.<p>If something similar happens, most jobs that could be done remotely will be automatable in a few years.
simonw, 7 months ago
I wrote up some of my own notes on Computer Use here: <a href="https:&#x2F;&#x2F;simonwillison.net&#x2F;2024&#x2F;Oct&#x2F;22&#x2F;computer-use&#x2F;" rel="nofollow">https:&#x2F;&#x2F;simonwillison.net&#x2F;2024&#x2F;Oct&#x2F;22&#x2F;computer-use&#x2F;</a>
minimaxir, 7 months ago
From the computer use video demo, that&#x27;s a <i>lot</i> of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won&#x27;t be. It&#x27;s a very good idea that Anthropic is upfront that it isn&#x27;t perfect. And it&#x27;s guaranteed that there will be a viral story where Claude will accidentally delete something important with it.<p>I&#x27;m more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
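A back-of-the-envelope sketch of the cost concern, in Python. Every number here is an assumption rather than a published figure: the per-million-token input prices (0.25, 3.00 and 15.00 USD for Haiku, Sonnet and Opus) and the roughly 2,700 input tokens per loop iteration are illustrative placeholders, and output tokens are ignored entirely.

```python
# Hypothetical inputs: adjust to the real price sheet and measured token counts.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = {
    "haiku": 0.25,
    "sonnet": 3.00,
    "opus": 15.00,
}
ASSUMED_INPUT_TOKENS_PER_ITERATION = 2_700  # screenshot plus prompt, a guess


def loop_cost_usd(iterations, tokens_per_iteration, usd_per_million_tokens):
    """Estimate the input-token cost of an agent loop (output tokens ignored)."""
    total_tokens = iterations * tokens_per_iteration
    return total_tokens * usd_per_million_tokens / 1_000_000


def estimate_all(iterations=10):
    """Return the estimated cost in USD for each assumed model price."""
    return {
        model: loop_cost_usd(iterations, ASSUMED_INPUT_TOKENS_PER_ITERATION, price)
        for model, price in ASSUMED_USD_PER_MILLION_INPUT_TOKENS.items()
    }


if __name__ == "__main__":
    for model, usd in estimate_all().items():
        print(f"{model}: {usd * 100:.2f} cents")
```

With these assumed inputs, a 10-iteration task comes out at under a cent for Haiku, about 8 cents for Sonnet and about 40 cents for Opus. Real costs depend on screenshot resolution, on how much context is re-sent each turn, and on output tokens.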
highwaylights, 7 months ago
Completely irrelevant, and it might just be me, but I really like Anthropic&#x27;s understated branding.<p>OpenAI&#x27;s branding isn&#x27;t exactly screaming in your face either, but for something that&#x27;s generated as much public fear&#x2F;scaremongering&#x2F;outrage as LLMs have over the last couple of years, Anthropic&#x27;s presentation has a much &quot;cosier&quot; veneer to my eyes.<p>This isn&#x27;t the Skynet Terminator wipe-us-all-out AI, it&#x27;s the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it&#x27;s going to be OK.
cwkoss, 7 months ago
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.<p>I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.<p>It&#x27;s going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
TaylorAlexander, 7 months ago
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are <i>always</i> “using a computer” but now they can “use your computer.”
janalsncm, 7 months ago
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.<p>It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
trzy, 7 months ago
Pretty cool! I use Claude 3.5 to control a robot (ARKit&#x2F;iOS based) and it does surprisingly well in the real world: <a href="https:&#x2F;&#x2F;youtu.be&#x2F;-iW3Vzzr3oU?si=yzu2SawugXMGKlW9" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;-iW3Vzzr3oU?si=yzu2SawugXMGKlW9</a>
hugocbp, 7 months ago
Great work by Anthropic!<p>After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.<p>Claude Sonnet 3.5 is already so good, specially at coding. I&#x27;m looking forward to testing the new version if it is, indeed, even better.<p>Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
alentred, 7 months ago
If the &quot;computer use&quot; feature is able to find its way around Azure, AAD&#x2F;Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)<p>Can you imagine how simple the world would be if you&#x27;d just need to tell Claude: &quot;user X needs to have access to feature Y, please give them the correct permissions&quot;, with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
KingOfCoders, 7 months ago
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I compared ChatGPT to Claude, and Claude&#x27;s results (to my eye) were better: the output was better structured and the canvas works better. I&#x27;m on the edge of jumping ship.
astrange, 7 months ago
I think this is good evidence that people&#x27;s jobs are not being replaced by AI, because no AI would give the product a confusing name like &quot;new Claude 3.5 Sonnet&quot;.
015a, 7 months ago
Why on god&#x27;s green earth is it not just called Claude 3.6 Sonnet. Or Claude 4 Sonnet.<p>I don&#x27;t actually care what the answer is. There&#x27;s no answer that will make it make sense to me.
TechDebtDevin, 7 months ago
Not that I&#x27;m scared of this update but I&#x27;d probably be alright with pausing LLM development today, at least in regard to producing code.<p>I don&#x27;t want an LLM to write all my code, regardless of whether it works; I like to write code. What these models are capable of at the moment is perfect for my needs and I&#x27;d be 100% okay if they didn&#x27;t improve at all going forward.<p>Edit: also I don&#x27;t see how an LLM-controlled system can ever replace a deterministic system for critical applications.
pradn, 7 months ago
Great progress from Anthropic! They really shouldn&#x27;t change models from under the hood, however. A name should refer to a specific set of model weights, more or less.<p>On the other hand, as long as it&#x27;s actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.<p>Though, all said, Claude still seems to be somewhat of an insider secret. &quot;ChatGPT&quot; has something like 20x the Google traffic of &quot;Claude&quot; or &quot;Anthropic&quot;.<p><a href="https:&#x2F;&#x2F;trends.google.com&#x2F;trends&#x2F;explore?date=now%201-d&amp;geo=US&amp;q=chatgpt,claude,anthropic&amp;hl=en" rel="nofollow">https:&#x2F;&#x2F;trends.google.com&#x2F;trends&#x2F;explore?date=now%201-d&amp;geo=...</a>
devinprater, 7 months ago
Maybe LLMs helping blind people like me play video games that aren&#x27;t normally accessible to us is getting closer!
lr1970, 7 months ago
I am curious why &quot;upgraded Claude 3.5 Sonnet&quot; instead of simply Claude 3.6 Sonnet? A minor version increment is the standard way to version an update. Am I missing something, or is it just Anthropic marketing?
ramesh31, 7 months ago
Claude is <i>absurdly</i> better at coding tasks than OpenAI. Like it&#x27;s not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries&#x2F;APIs out of whole cloth.
itissid, 7 months ago
This can power one of my favorite use-cases.<p>Like find me a list of things to do with a family, given today&#x27;s weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...<p>Not only is this kind of use getting around API restrictions, it is also a superior way to do search: specify arbitrary preferences upfront instead of using a search box and trawling different modalities of content to get better results. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
swyx, 7 months ago
my quick notes on Computer Use:<p>- &quot;computer use&quot; is basically using Claude&#x27;s vision + tool use capability in a loop. There&#x27;s a reference impl but there&#x27;s no &quot;claude desktop&quot; app that just comes with this OOTB<p>- they&#x27;re basically advertising that they bumped up Claude 3.5&#x27;s screen vision capability. we discussed the importance of this general computer agent approach with David on our pod <a href="https:&#x2F;&#x2F;x.com&#x2F;swyx&#x2F;status&#x2F;1771255525818397122" rel="nofollow">https:&#x2F;&#x2F;x.com&#x2F;swyx&#x2F;status&#x2F;1771255525818397122</a><p>- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I&#x2F;O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs &lt;1 cent, Sonnet 8 cents, Opus 41 cents.<p>- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging<p>- New 3.5 Haiku is 68% cheaper than Claude Instant haha<p>references i had to dig a bit to find<p>- <a href="https:&#x2F;&#x2F;www.anthropic.com&#x2F;pricing#anthropic-api" rel="nofollow">https:&#x2F;&#x2F;www.anthropic.com&#x2F;pricing#anthropic-api</a><p>- <a href="https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;build-with-claude&#x2F;vision#evaluate-image-size" rel="nofollow">https:&#x2F;&#x2F;docs.anthropic.com&#x2F;en&#x2F;docs&#x2F;build-with-claude&#x2F;vision#...</a><p>- loop code <a href="https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;anthropic-quickstarts&#x2F;blob&#x2F;main&#x2F;computer-use-demo&#x2F;computer_use_demo&#x2F;loop.py">https:&#x2F;&#x2F;github.com&#x2F;anthropics&#x2F;anthropic-quickstarts&#x2F;blob&#x2F;mai...</a><p>- some other screenshots <a href="https:&#x2F;&#x2F;x.com&#x2F;swyx&#x2F;status&#x2F;1848751964588585319" 
rel="nofollow">https:&#x2F;&#x2F;x.com&#x2F;swyx&#x2F;status&#x2F;1848751964588585319</a><p>- <a href="https:&#x2F;&#x2F;x.com&#x2F;alexalbert__&#x2F;status&#x2F;1848743106063306826" rel="nofollow">https:&#x2F;&#x2F;x.com&#x2F;alexalbert__&#x2F;status&#x2F;1848743106063306826</a><p>- model card <a href="https:&#x2F;&#x2F;assets.anthropic.com&#x2F;m&#x2F;1cd9d098ac3e6467&#x2F;original&#x2F;Claude-3-Model-Card-October-Addendum.pdf" rel="nofollow">https:&#x2F;&#x2F;assets.anthropic.com&#x2F;m&#x2F;1cd9d098ac3e6467&#x2F;original&#x2F;Cla...</a>
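The "vision + tool use in a loop" description above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's reference implementation (that is the loop.py linked above): `run_agent_loop`, `ToolCall`, `scripted_model` and the fake bash tool are all hypothetical names, and the model is replaced by a scripted stub so the example runs without any API access.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A request from the (stubbed) model to run one tool with one argument."""
    name: str
    argument: str


def run_agent_loop(choose_action, tools, take_screenshot, max_iterations=10):
    """Observe the screen, ask the model for an action, run it, and repeat.

    choose_action(observation, history) returns a ToolCall, or None when done.
    """
    history = []
    observation = take_screenshot()
    for _ in range(max_iterations):
        action = choose_action(observation, history)
        if action is None:  # the model decided the task is finished
            break
        result = tools[action.name](action.argument)
        history.append((action.name, action.argument, result))
        observation = take_screenshot()  # look again after every action
    return history


def scripted_model(steps):
    """Stand-in for a real model: replays a fixed list of tool calls, then stops."""
    remaining = list(steps)

    def choose_action(observation, history):
        return remaining.pop(0) if remaining else None

    return choose_action


def demo():
    screen = {"text": "desktop"}

    def fake_bash(command):
        # Pretend to run the command by changing what the next screenshot shows.
        screen["text"] = f"ran: {command}"
        return "ok"

    policy = scripted_model([ToolCall("bash", "firefox &"), ToolCall("bash", "ls")])
    return run_agent_loop(policy, {"bash": fake_bash}, lambda: screen["text"])


if __name__ == "__main__":
    for step in demo():
        print(step)
```

The `max_iterations` cap is one simple guard against the loop running away; a real deployment would presumably also sandbox the tools, as the safety discussion elsewhere in the thread suggests.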
bhouston, 7 months ago
Is there an easy way to use Claude as a Co-Pilot in VS Code? If it is better at coding, it would be great to have it integrated.
zone411, 7 months ago
It improves to 25.9 over the previous version of Claude 3.5 Sonnet (24.4) on NYT Connections: <a href="https:&#x2F;&#x2F;github.com&#x2F;lechmazur&#x2F;nyt-connections&#x2F;">https:&#x2F;&#x2F;github.com&#x2F;lechmazur&#x2F;nyt-connections&#x2F;</a>.
vok, 7 months ago
This &quot;Computer use&quot; demo:<p><a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=jqx18KgIzAE" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=jqx18KgIzAE</a><p>shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google&#x27;s terms really permit this? Will Google permit this when it is happening at scale?
gumboshoes, 7 months ago
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an Apple Script library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will even further expand the pre-built possibilities without requiring the local AI to reinvent steps or tools at the time of each prompt.
cube2222, 7 months ago
This looks quite fantastic!<p>Nice improvements in scores across the board, e.g.<p>&gt; On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.<p>I&#x27;ve been using Sonnet 3.5 for most of my AI-assisted coding and I&#x27;m already very happy (using it with the Zed editor, I love the &quot;raw&quot; UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!<p>I&#x27;m still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it&#x27;s in any way based on their previous work[0] which they used to make golden-gate Claude.<p>[0]: <a href="https:&#x2F;&#x2F;transformer-circuits.pub&#x2F;2024&#x2F;scaling-monosemanticity&#x2F;index.html" rel="nofollow">https:&#x2F;&#x2F;transformer-circuits.pub&#x2F;2024&#x2F;scaling-monosemanticit...</a>
FloatArtifact, 7 months ago
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to the latency requirement: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.<p>I&#x27;ve often wondered what combining grammar-based speech recognition with an LLM could do for accessibility: low-domain natural-language speech recognition augmented by grammar-based speech recognition for high-domain commands, for efficiency&#x2F;accuracy, reducing voice strain and increasing recognition accuracy.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;dictation-toolbox&#x2F;dragonfly">https:&#x2F;&#x2F;github.com&#x2F;dictation-toolbox&#x2F;dragonfly</a>
cynicalpeace, 7 months ago
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam&#x27;s political machinations rather than hard-driving technical vision, at least that&#x27;s what it seems like, outside looking in.<p>Computer use seems like it might be good for e2e tests.
lossolo, 7 months ago
Livebench updated<p><a href="https:&#x2F;&#x2F;livebench.ai" rel="nofollow">https:&#x2F;&#x2F;livebench.ai</a><p><pre><code>Model                      | Global | Reasoning | Coding | Math  | Data  | Language | IF
---------------------------|--------|-----------|--------|-------|-------|----------|------
o1-preview-2024-09-12      | 66.02  | 68.00     | 50.85  | 62.92 | 63.97 | 72.66    | 77.72
claude-3-5-sonnet-20241022 | 60.33  | 58.67     | 67.13  | 51.28 | 52.78 | 58.09    | 74.05
claude-3-5-sonnet-20240620 | 59.80  | 58.67     | 60.85  | 53.32 | 56.74 | 56.94    | 72.30</code></pre>
urbandw311er, 7 months ago
&gt; we have provided three tools &gt; bash shell<p>November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?
Hizonner, 7 months ago
Can this solve CAPTCHAs for me? It&#x27;s starting to get to the point where limited biological brains can&#x27;t do them.
mercacona, 7 months ago
I&#x27;m giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I&#x27;ve tried.<p>However, I&#x27;ve been using Opus as a writing companion for several months, especially when you have writer&#x27;s block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?
freetonik, 7 months ago
Fascinating. Though I expect people to be concerned about privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole other level!<p>But I am very excited about this in the context of accessibility. Screen readers and screen control software are hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
mmooss, 7 months ago
Of course there&#x27;s great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it&#x27;s necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.<p>But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
turnsout, 7 months ago
Wow, there&#x27;s a whole industry devoted to what they&#x27;re calling &quot;Computer Use&quot; (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
torginus, 7 months ago
<i>Claude&#x27;s current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. </i><p>Nice, but I wonder why didn&#x27;t they use UI automation&#x2F;accessibility libraries, that have access to the semantic structure of apps&#x2F;web pages, as well as accessing documents directly instead of having Excel display them for you.
sedatk, 7 months ago
&gt; developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.<p>So, this is how AI takes over the world.
ford, 7 months ago
Seems like both:<p>- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.<p>- AI is fitting _some_ scaling law - the best models are getting better and the &quot;previously-state-of-the-art&quot; models are fractions of what they cost a couple years ago. Though it remains to be seen if it&#x27;s like Moore&#x27;s Law or if incremental improvements get harder and harder to make.
jatins, 7 months ago
How does the computer use work -- Is this a desktop app they are providing that can do actions on your computer? Didn&#x27;t see any such mention in the post
Bjorkbat, 7 months ago
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I&#x27;m relieved to see it still can&#x27;t do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It&#x27;s arguably the most popular library for creative coding.<p>In case you&#x27;re wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that&#x27;s because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it&#x27;s a pretty crude approach.<p>Now, granted, you&#x27;re probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.<p>Really hard to get a real sense of capabilities when you&#x27;re faced with experiences like this, all the while somehow it&#x27;s able to solve 46% of real-world Python pull-requests from a certain dataset. In case you&#x27;re wondering, one paper (<a href="https:&#x2F;&#x2F;cs.paperswithcode.com&#x2F;paper&#x2F;swe-bench-enhanced-coding-benchmark-for-llms" rel="nofollow">https:&#x2F;&#x2F;cs.paperswithcode.com&#x2F;paper&#x2F;swe-bench-enhanced-codin...</a>) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there&#x27;s almost certainly a degree of data-leakage.
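The simple per-pixel approach described above (assign each pixel to its closest cell center) can be sketched in plain Python rather than p5.js. `voronoi_labels` and its parameters are hypothetical names for illustration. This is the brute-force method, costing on the order of width × height × number of sites distance checks, and it works on a pixel grid, which would explain the blocky output; it is not an exact polygon construction such as Fortune's algorithm.

```python
def voronoi_labels(width, height, sites):
    """Label every pixel with the index of its nearest site (brute force).

    sites is a list of (x, y) cell centers. Ties go to the lowest index.
    """
    if not sites:
        raise ValueError("at least one site is required")
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Squared distance is enough for comparison, so skip the sqrt.
            nearest = min(
                range(len(sites)),
                key=lambda i: (sites[i][0] - x) ** 2 + (sites[i][1] - y) ** 2,
            )
            row.append(nearest)
        labels.append(row)
    return labels


if __name__ == "__main__":
    grid = voronoi_labels(8, 4, [(1, 1), (6, 2)])
    for row in grid:
        print("".join(str(label) for label in row))
```

In a drawing library, each label would then be mapped to a fill color per pixel; a smoother result would need actual cell polygons instead.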
Centigonal, 7 months ago
They should just adopt Apple &quot;version numbers:&quot; Claude Sonnet (Late 2024).
mtgentry, 7 months ago
What are the licensing implications of this? If I’m Google, I’d be pissed that my software is being used without a human there looking at the ads.
flockonus 7 months ago
Are these people aware that they can bump minor versions?<p>The marketing team vetoed Claude 3.6???
bbor 7 months ago
Ok I know that we&#x27;re in the post-nerd phase of computers, but version numbers are there for a reason. 3.6, please? 3.5.1??
runako 7 months ago
I really don&#x27;t get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:<p>Anthropic doesn&#x27;t offer an unlimited chatbot service, only plans that give you &quot;more&quot; usage, whatever that means. If you have an API key, you are &quot;unlimited,&quot; so they have the capability. Why doesn&#x27;t the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That&#x27;s not the question.)<p>Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?<p>Really want to make Claude my daily driver, but right now it often feels too much like a research project.
hubraumhugo 7 months ago
I&#x27;ve seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
joshuamcginnis 7 months ago
Is there anything out there yet that will let me issue the command:<p>&gt; Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.<p>Then I can just `git status` to see the changes?
attentive 7 months ago
They need to work on their versioning.<p>&quot;3.5 Sonnet (New)&quot;, WTAF? - just call it 3.6 Sonnet or something.<p>Is it &quot;New&quot; sonnet? is it &quot;upgraded&quot;? Is there a difference? How do I know which one I use?<p>I can understand claude-3-5-sonnet-20241022, but that&#x27;s not what users see.
abc-1 7 months ago
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
lutusp 7 months ago
&gt; &quot;... and similar speed to the previous generation of Haiku.&quot;<p>To me this is the most annoying grammatical error. I can&#x27;t wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won&#x27;t have to read endless repetitions of &quot;similar speed to ...&quot; when the correct form is obviously &quot;speed similar to&quot;.<p>In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
submeta 7 months ago
That’s too much control for my taste. I don’t want Anthropic to see my screen. I’d rather have VS Code with Claude integrated: a version that can see all my dev files in a given folder. I don’t need it to run Chrome for me.
bluelightning2k 7 months ago
This is what the Rabbit &quot;large action model&quot; pretended to be. Wouldn&#x27;t be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.<p>Pretty cool for sure.
RecycledEle 7 months ago
How long until it is profitable to tell a cheap AI to &quot;win this game by collecting resources and advancing in-game&quot; and then sell the account on eBay?<p>I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?<p>I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because &quot;nobody but a stupid bot would score so low.&quot;
amai 7 months ago
Finally a general tool to solve captchas for my web scrapers.
wesleyyue 7 months ago
If anyone would like to try the new Sonnet in VSCode, I just updated <a href="https:&#x2F;&#x2F;double.bot">https:&#x2F;&#x2F;double.bot</a> to the new Sonnet. (disclaimer: I am the cofounder&#x2F;creator)<p>---<p>Some thoughts:<p>* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.<p>* I wonder if they are not releasing Opus because it&#x27;s not done or because they don&#x27;t have enough inference compute to go around, and Sonnet is close enough to state of the art?
gerash 7 months ago
The &quot;computer use&quot; demos are interesting.<p>It&#x27;s a problem we used to work on, and one that many other people have wanted to solve for at least 10 years, so it remains to be seen how well it works outside a demo.<p>What was surprising was the slow&#x2F;human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what&#x27;s happening, or is it so it does not trigger CAPTCHAs?
throwaway0123_5 7 months ago
This is incredibly cool but it seems like the potential damage from a &quot;hallucination&quot; in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.<p>Also seems like a privacy issue with them sending screenshots of your device back to their servers.
maestrae 7 months ago
Anybody know how the hell they&#x27;re combating &#x2F; gonna combat CAPTCHAs, Cloudflare blocking, etc.? I remember playing in this space on a toy project and being utterly frustrated by anti-scraping. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe they&#x27;ll just cut sweetheart deals?
29decibel 7 months ago
I am surprised it uses macOS for the demo, as I thought it would be harder to control than Ubuntu. But maybe at the same time, macOS is the most predictable&#x2F;reliable desktop environment? I noticed that they use a virtual environment for the demo. Curious how they build that along with Docker: is it leveraging the latest virtualization framework from Apple?
Tepix 7 months ago
Interesting stuff, I look forward to future developments.<p>A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
msoad 7 months ago
I skimmed through the computer use code. It&#x27;s possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click, scroll, and type with specific parameters and execute them using the OS&#x27;s APIs (usually the A11y APIs).<p>Did I miss something? Did they have to make changes to the model for this?
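A rough sketch of the function-calling idea in this comment, in Python: the model returns a structured call (name plus arguments) and a thin dispatcher maps it to an input action. Everything here is hypothetical. The action names, the argument shapes, and the `FakeDesktop` class (which only records actions instead of touching real OS or accessibility APIs) are assumptions for illustration, not any provider's actual tool schema.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Stand-in for OS input APIs; records actions instead of performing them."""
    log: list = field(default_factory=list)

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click at ({x}, {y})")

    def scroll(self, dy: int) -> None:
        self.log.append(f"scroll by {dy}")

    def type_text(self, text: str) -> None:
        self.log.append(f"type {text!r}")


def execute(desktop: FakeDesktop, call: dict) -> None:
    """Dispatch one model-issued function call to the matching handler."""
    handlers = {
        "click": desktop.click,
        "scroll": desktop.scroll,
        "type": desktop.type_text,
    }
    handler = handlers.get(call["name"])
    if handler is None:
        raise ValueError(f"unknown action: {call['name']}")
    handler(**call["arguments"])


# Hypothetical function calls, shaped like what a model might return.
calls = [
    {"name": "click", "arguments": {"x": 120, "y": 48}},
    {"name": "type", "arguments": {"text": "hello"}},
    {"name": "scroll", "arguments": {"dy": -3}},
]

desktop = FakeDesktop()
for call in calls:
    execute(desktop, call)
print(desktop.log)
```

In a real loop, each executed action would be followed by a fresh screenshot or accessibility-tree dump sent back to the model. Whether a model does this well without specific training is exactly the open question raised above.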
fernly 7 months ago
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
tammer 7 months ago
This demo is impressive although my initial reaction is a sort of grief that I wasn&#x27;t born in the timeline where Alan Kay&#x27;s vision of object-oriented computing was fully realized -- then we wouldn&#x27;t have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
aprilthird2021 7 months ago
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
alok-g 7 months ago
Next stop after &#x27;Computer Use&#x27; -- multimodal input from a robot&#x27;s sensors and generating various signals to control its actions.<p>Looking forward to seeing this in the coming few years, and hoping such a robot could help many people, including the elderly.
myprotegeai 7 months ago
How long until &quot;computer use&quot; is tricked into entering PII or PHI into an attacker&#x27;s website?
wewtyflakes 7 months ago
I wonder if OpenAI will fast follow; usually they&#x27;re the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
lairv 7 months ago
Offtopic but youtube doesn&#x27;t allow me to view the embedded video, with a &quot;Sign in to confirm you’re not a bot&quot; message. I need to open a dedicated youtube tab to watch it<p>The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
abraxas 7 months ago
Hopefully the coding improvements are meaningful because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday) but I like Claude&#x27;s demeanor more (I know this sounds crazy but it matters a bit to me)
nwnwhwje 7 months ago
Any comments on alignment with Anthropic&#x27;s mission? Last time I checked, Anthropic is about building SOTA models because that is the only way to do safety research. Making money and commercially useful stuff is a means to an end.
smcleod 7 months ago
I wonder when it&#x27;ll actually be available in the Bedrock AU region, because as of right now we&#x27;re still stuck using mid-range models from a year ago.<p>Amazon has really neglected ap-southeast-2 when it comes to LLMs.
brid 7 months ago
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with ChatGPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now it far outperforms the others.
bergutman 7 months ago
They need to get the price of 3.5 Haiku down. It&#x27;s about 2x 4o-mini.
m3kw9 7 months ago
I suspect they are gonna need some local offload capabilities for Computer Use. The repeated screen reading can definitely be done locally on modern machines; otherwise the cost may be impractical.
thecolorgreen 7 months ago
This looks really similar to rabbit&#x27;s Large Action Model (LAM). Cool!<p><a href="https:&#x2F;&#x2F;www.rabbit.tech&#x2F;rabbit-os" rel="nofollow">https:&#x2F;&#x2F;www.rabbit.tech&#x2F;rabbit-os</a>
TacticalCoder 7 months ago
One suggestion: use the following prompt with an LLM:<p><pre><code> The combination of the words &quot;computer use&quot; is highly confusing. It&#x27;s also &quot;Yoda speak&quot;. For example it&#x27;s hard for humans to parse the sentences *&quot;Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku&quot;*, *&quot;Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku &quot;* (it literally relies on the comma to make any sense) and *&quot;Computer use for automated interaction&quot;* (in the youtube vid&#x27;s title: this one is just broken english). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it was a human.</code></pre>
amai 7 months ago
This &quot;computer use&quot; feature is obviously perfect for automating GUI tests. Will it also work on screenshots of mobile devices like smartphones&#x2F;tablets?
kingkongjaffa 7 months ago
Interestingly, the new Claude only knows content up to:<p>&gt; I&#x27;m limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
mclau156 7 months ago
Did they just invent a new World of Warcraft or RuneScape bot?
punnerud 7 months ago
Cursor AI already has the option to switch to using claude-3-5-sonnet-20241022 in the chat box.<p>I was about to try to add a custom API. I’m impressed by the speed of that team.
Alifatisk 7 months ago
&gt; Claude 3.5 Haiku matches the performance of Claude 3 Opus<p>Oh wow!
lostmsu 7 months ago
Can anyone share a .http file, a curl command, or anything similar showing a session with the computer use tool? Docker containers make me cry.
taytus 7 months ago
Computer use won&#x27;t allow you to log in to social media accounts, even if it is your account and credentials. Bummer.
robertkoss 7 months ago
Does anyone know how I could check whether my Claude Sonnet version that I am using in the UI has been updated already?
crazystar 7 months ago
Looks like it just takes a screenshot and can&#x27;t scroll so it might miss things.<p>Claude 3.5 Haiku will be released later this month.
myprotegeai 7 months ago
We are approaching FSD for the computer, with all of the lofty promises, and all of the horrible accidents.
iamsanteri 7 months ago
I love how they don&#x27;t seem to be calling it &quot;AgenticAI&quot; or something like that.
throwvc3 7 months ago
What I&#x27;d like to know is whether prompt caching is available to Claude on AWS Bedrock now.
vivekkairi 7 months ago
Aider benchmarks for the new Claude 3.5 are impressive: from 77.4% to 83.5%, beating o1-preview.
netcraft 7 months ago
Since they didn&#x27;t rev the version, does this mean that if we were using 3.5 today, it&#x27;s just automatically using the new version? That doesn&#x27;t seem great from a change management perspective.<p>Though I am looking forward to using the new one in cursor.ai.
2-3-7-43-1807 7 months ago
Wow, I almost got worried, but the cute music and the funny little monster on the desk convinced me that this is all just fun and dandy and all will be good. The future is coming and we&#x27;ll all be much happier :)
bilsbie 7 months ago
Does this make Cursor obsolete?<p>You can just use any IDE you want and it will work with it.
veggieWHITES 7 months ago
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.<p>Their whole policing AI models stance is commendable but ultimately renders their tools useless.<p>It actually started arguing with me about whether it was allowed to help implement a GitHub repository&#x27;s code as it might be copyrighted... it was MIT licensed open source from Google :&#x2F;
brcmthrowaway 7 months ago
This is bad news for SWEs!
esseti 7 months ago
I checked the docs but couldn&#x27;t find it. Does Claude have an API like the GPT Assistants API, with the ability to give it a set of documents to work with?<p>It seems that you can only send single messages, so you can&#x27;t rely on it &quot;learning&quot; from predefined documents.
tylerchilds 7 months ago
computer use is really going to highlight how fragmented the desktop ecosystem is, but also this definitely paints more context on how microsoft wants to use their screenshot ai
iknownthing 7 months ago
Can Claude create and run a CI&#x2F;CD pipeline now from a prompt?
jonesn11 7 months ago
How does one get access to it without using the API??
ta93754829 7 months ago
eventually, we&#x27;ll be able to eliminate the intermediate &quot;computer&quot;, and just let the ai render everything we need to interact with
efields 7 months ago
Captchas are toast.
netcraft 7 months ago
I&#x27;m unclear: is Haiku supposed to be similar to 4o-mini in use case&#x2F;cost&#x2F;performance? If not, do they have an analog?
ta8645 7 months ago
Still can&#x27;t use their services. They still require a phone number for some reason. What about those of us who don&#x27;t have one?
mathiasrw 7 months ago
Just to confirm: did they just release a model with the exact same name as the previous one?
jerrygoyal 7 months ago
Does anyone know some use cases for &quot;computer use&quot;?
nbzso 7 months ago
Just a question: For this thingy to work, I must give the provider access to my computer? Good luck. :)<p>Just another reason to use ONLY local LLMs.
geniium 7 months ago
This is amazing
g9yuayon 7 months ago
Is it just me who feels that Anthropic has been innovating faster than ChatGPT in the past year?
postalcoder 7 months ago
and i was just planning to go to sleep…
dtquad 7 months ago
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted &quot;Computer use&quot; and see how well other models, including self-hosted Ollama models, can do this.
mannycalavera42 7 months ago
new VBA version just landed
anotherpaulg 7 months ago
The new Sonnet tops aider&#x27;s code editing leaderboard at 84.2%. Using aider&#x27;s &quot;architect&quot; mode it sets the SOTA at 85.7% (with DeepSeek as the &quot;editor&quot; model).<p><pre><code>  84% Claude 3.5 Sonnet 10&#x2F;22
  80% o1-preview
  77% Claude 3.5 Sonnet 06&#x2F;20
  72% DeepSeek V2.5
  72% GPT-4o 08&#x2F;06
  71% o1-mini
  68% Claude 3 Opus
</code></pre> It also sets SOTA on aider&#x27;s more demanding refactoring benchmark with a score of 92.1%!<p><pre><code>  92% Sonnet 10&#x2F;22
  75% o1-preview
  72% Opus
  64% Sonnet 06&#x2F;20
  49% GPT-4o 08&#x2F;06
  45% o1-mini
</code></pre> <a href="https:&#x2F;&#x2F;aider.chat&#x2F;docs&#x2F;leaderboards&#x2F;" rel="nofollow">https:&#x2F;&#x2F;aider.chat&#x2F;docs&#x2F;leaderboards&#x2F;</a>
theflyestpilot 7 months ago
<i>cries in UiPath</i>
HanClinto 7 months ago
Why not rev the numbers? &quot;3.5&quot; vs. &quot;3.5 New&quot; feels weird -- is there a particular reason why Anthropic doesn&#x27;t want to call this 3.6 (or even 3.5.1)?
jampekka 7 months ago
It&#x27;s quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
mergisi 7 months ago
My First Experience with Claude Computer Use - It&#x27;s Mind-Blowing!<p>Just tested Claude&#x27;s new Computer Use feature and had to share this simple but powerful test:<p>My Basic Prompt: &quot;Please: 1. Search Amazon for 3 wireless earbuds: Find price Rating Brand name<p>2. Make a simple Excel file &#x27;earbuds.xlsx&#x27;: Put the information in a basic table Add colors to the headers Sort by price<p>3. Show me the results&quot;<p>What blew my mind: - Claude actually looked at my screen - Moved the mouse by itself - Clicked buttons like a human - Created reports automatically<p>It&#x27;s like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.<p>For those interested: <a href="https:&#x2F;&#x2F;mergisi.medium.com&#x2F;8f56f683e307" rel="nofollow">https:&#x2F;&#x2F;mergisi.medium.com&#x2F;8f56f683e307</a>
baq 7 months ago
Scary stuff.<p>&#x27;Hey Claude 3.5 New, pretend I&#x27;m a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department&#x27;<p>c.f. IBM 1979.
freediver 7 months ago
Both new Sonnet and gpt-4o still fail at a simple:<p>&quot;How many w&#x27;s are in strawberry?&quot;<p>gpt-4o: There are 2 &quot;w&#x27;s&quot; in &quot;strawberry.&quot;<p>Claude 3.5 Sonnet (new): Let me count the w&#x27;s in &quot;strawberry&quot;: 0 w&#x27;s.<p>(same question with &#x27;r&#x27; succeeds)<p>What is artificial about current gen of &quot;artificial intelligence&quot; is the way training (predict next token) and benchmarking (overfitting) is done. Perhaps a fresh approach is needed to achieve a true next step.
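For reference, the expected answers are easy to check directly. A tiny Python sketch (the variable names are arbitrary):

```python
word = "strawberry"

# Count each letter of interest with plain string counting.
counts = {letter: word.count(letter) for letter in ("w", "r")}
print(counts)  # {'w': 1, 'r': 3}
```

So there is exactly one "w" in "strawberry", which means both quoted answers (2 and 0) are off.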