Great idea Kyle! I read through the source code as an experienced desktop automation/Electron developer and felt good about trying it for some basic tasks.<p>The implementation is a thin wrapper over the Anthropic API, and the step-based approach made me confident I could kill the process before it did anything weird. Closed anything I didn't want Anthropic seeing in a screenshot. Installed smoothly on my M1 and was running in minutes.<p>The default task is "find flights from seattle to sf for next tuesday to thursday". I let it run with my Anthropic API key and it used Chrome. Takes a few seconds per action step. It correctly opened up Google Flights, but booked the wrong dates!<p>It had aimed for November 2nd, but that option was visually blocked by the Agent.exe window itself, so it chose November 20th instead. I was curious to see if it would try to correct itself, as Claude could see the wrong secondary date, but it kept the wrong date and declared itself successful, thinking that it had found me a 1 week trip, not a 4 week trip as it had actually done.<p>The exercise cost $0.38 in credits and about 20 seconds. Will continue to experiment.
How long until it can quickly, without you noticing, add a daemon running on your system? This is the equivalent of how we used to worry about Soviet spies getting access to US secrets, and now we just post them online for everyone to see.<p>There's no antivirus or firewall today that can protect your files from the ability this could have to wreak havoc on your network, let alone your computer.<p>This scene comes to mind: <a href="https://makeagif.com/i/BA7Yt3" rel="nofollow">https://makeagif.com/i/BA7Yt3</a>
Remember a few years back when there was the story about the little girl who did an "Alexa, order me a dollhouse" on the news and people watching the show had their Alexas pick up on it and order dollhouses during the broadcast? Wait until there's a widely watched Netflix show where someone says "Delete C:\Windows".
Side note: I recently tried Cursor, in "compose" mode, starting a fullstack project from scratch, and I'm stupefied by the result.<p>Do people in the software community realize how much the industry is going to totally transform in the next 5 years? I can't imagine people actually typing code by hand anymore by that time.
Super off-topic, but somewhat related. What do people use to automate non-browser GUI apps on Linux under Wayland? I need to do it occasionally, but this particular combination eludes me.<p>- CLI apps - no problem, just write Bash/Python/whatever
- browser apps, also no problem, use Selenium/Playwright
- Xorg has some libraries; even if they are clunky they will work in a pinch
- Windows has tons of RPA (Robotic Process Automation) solutions<p>But for Wayland I couldn't find anything reliable.
It seems to only work with simple tasks. I asked it to create some simple tables in both Rhino (Mac app) and OnShape (Chrome tab) and it just seems lost.<p>With Rhino it sees the app open, and it says it's doing all these actions, like creating a shape, but I don't see it being done, and it will just continue on to the next action without the previous step being done. It doesn't check if the previous task was completed.<p>With OnShape, it says it's going to create a shape, but then selects the wrong item from the menu, assumes it's using the right tool, and continues on with the actions as if the previous action was done.
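A minimal sketch, in Python, of the check described as missing above: each step carries its own verification, and the run stops instead of moving on when a step cannot be confirmed. This is not how Agent.exe works; `run_steps`, the `Step` tuple, and the retry count are hypothetical names made up for illustration. In a real agent the check would likely be a fresh screenshot plus a question to the model.

```python
from typing import Callable, List, Tuple

# Hypothetical step shape: (name, action, check).
# The check must return True only if the action visibly took effect.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_steps(steps: List[Step], max_retries: int = 2) -> List[str]:
    """Run steps in order, stopping at the first step that never verifies."""
    log: List[str] = []
    for name, act, check in steps:
        for attempt in range(1, max_retries + 2):
            act()
            if check():
                log.append(f"{name}: ok (attempt {attempt})")
                break
        else:
            # No attempt passed its check, so later steps would build on nothing.
            log.append(f"{name}: FAILED after {max_retries + 1} attempts, stopping")
            return log
    return log
```

The point of the design is only that "I did X" from the model is never taken as proof that X happened.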
Computer, shitpost memes all day that make me crypto while I raise my family and tend to my garden.<p>The future is heading in the direction of only suckers using computers. Real wealth is not touching a computer for anything.
this is such a hilariously bad idea, it's like knowingly installing malware on your computer - malware that has access to your bank account. please god, any sane person reading this do not install this, you've been warned.
I built something similar (still no GUI) but for in-browser actions only.<p>I think in-browser actions are much safer and can be more predictable, with easier-to-implement safeguards, but I would love to see how this concept pans out in the future!<p>PS: you can check it out on GitHub: <a href="https://github.com/SamDc73/WebTalk/">https://github.com/SamDc73/WebTalk/</a><p>Please let me know what you guys think!
I think there's a lot of opportunity here to make a hybrid of voice control through a more traditional approach along with an LLM.<p>It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.<p>I've often wondered what the combination of grammar-based speech recognition and an LLM could do for accessibility: low-domain natural language speech recognition, augmented by grammar-based speech recognition for high-domain commands, for efficiency/accuracy, reducing voice strain and increasing recognition accuracy.<p><a href="https://github.com/dictation-toolbox/dragonfly">https://github.com/dictation-toolbox/dragonfly</a>
Good tool to test the new capability. Thanks for sharing.<p>My limited testing has produced okay results for a trivial use case and very disappointing results for a simple use case.<p>Trivial: what is the time. |
Claude: took screenshot and read the time off the bottom right. |
Cost: $0.02<p>Simple: download a high resolution image of singapore skyline and set it as desktop wallpaper |
Claude: description of steps looks plausible but actions are wild and all over the place. opens national park service website somehow and only other action it is able to do is right click a couple of times. failed! |
Cost: $0.37<p>Long way to go before it can be used for even hobby use cases, I feel.<p>PS: is it possible that the screenshots include an image of Agent.exe itself and that is creating a poor feedback loop somehow?
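On the PS: one cheap guard, sketched here in Python under the assumption that the agent knows its own window bounds, is to refuse to trust a click target that falls inside the agent's own window. `Rect` and `is_occluded` are hypothetical names for illustration, not anything taken from Agent.exe.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    """Window bounds in screen pixels (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # Left/top edges are inside, right/bottom edges are outside.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def is_occluded(click: Tuple[int, int], overlay: Rect) -> bool:
    """True if the click point lands on the overlay window, not the target app."""
    return overlay.contains(*click)
```

If the check fires, the agent could move or hide its window and take a new screenshot before acting.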
One thing this could generally be used for safely is read-only situations. Like monitoring for when brokered CDs > 5% are released by refreshing the page, or, during the pandemic, watching for when an Amazon shopping window opened up at an arbitrary time, and then ringing an alarm. Hopefully it is not too slow and can do this.
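The read-only idea above boils down to a polling loop that only reads and alerts, never clicks. A rough Python sketch, where `watch`, `read_value`, and `is_interesting` are hypothetical names and the actual page read (screenshot plus model, or a plain HTTP fetch) is left to the caller:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def watch(
    read_value: Callable[[], T],
    is_interesting: Callable[[T], bool],
    interval_s: float = 60.0,
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Poll a read-only source; return the first value that matches, else None."""
    for i in range(max_checks):
        value = read_value()
        if is_interesting(value):
            return value  # caller rings the alarm
        if i < max_checks - 1:
            sleep(interval_s)  # wait before refreshing again
    return None
```

For the CD example, `read_value` would return the best rate currently on the page and `is_interesting` would be something like `lambda rate: rate > 5.0`. At a few seconds and a few cents per screenshot step, the interval also doubles as a cost cap.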
Apple is best positioned to run with the implications of these developments (though Microsoft will probably respond too) with both their historic operating system control hooks and their architecturally grounded respect for privacy (arguably of course). Apple seems to be paying very close attention to LLM developments, I doubt they will rush out an 80/20 response to these LLM agent control use cases, but I would be surprised if they didn't enter this product space.
> "Find flights Tuesday to Thursday next week"<p>> AI picks Thursday to Saturday this week (as of time of writing)<p>Still cheaper to hire real people then
Set a job to have it reboot the system, set it to run on boot, achieve AI-hyped useless machine!<p><a href="https://en.m.wikipedia.org/wiki/Useless_machine" rel="nofollow">https://en.m.wikipedia.org/wiki/Useless_machine</a>
Anyone else getting 400s with "This action is restricted for safety reasons at this time" when trying to use the app? I don't see any docs that mention you have to manually enable the API or anything.
I've been wondering for a while now if Selenium could be replaced by a standard browser distribution with LLM multimodal control.<p>This seems conceptually close.
No disclaimer hmm? Anthropic made it sound very scary.<p><a href="https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo#anthropic-computer-use-demo">https://github.com/anthropics/anthropic-quickstarts/tree/mai...</a>
It's fascinating/spooky how different LLMs are slowly developing their own "personalities," so to speak. And they seem to be emerging as we're giving them access to more tools and modalities which are harder to do broad RLHF on.<p>With computer use, we first learned that Claude sometimes takes breaks to browse pictures of Yosemite, and now this:<p>> Claude really likes Firefox. It will use other browsers if it absolutely has to, but will behave so much better if you just install Firefox and let it go to its happy place.
> Claude really likes Firefox. It will use other browsers if it absolutely has to, but will behave so much better if you just install Firefox and let it go to its happy place.<p>Good boy!
20 years ago: "I would never let the AI out of the box! I'm not an <i>idiot</i>!"<p>Today: "Sure, I'll give the AI full control over my computer. WCGW?"
Such garbage is only possible because there has been a strong deviation between ethics, philosophy & technology.<p>The business bros are too immoral to know that this is unethical, as their eyes are focused on making money, not on being ethical.<p>The ethical activists & philosophers like Richard Stallman & Jaron Lanier offer unrealistic solutions that normal people cannot adopt.<p>- I can't turn off JavaScript because 80% of my websites won't work,<p>- I can't ditch Apple because GNU wants me to use a 15 year old computer with completely "libre" software impractical for work<p>- I need a cellphone to communicate. I can't get around without a cellphone like RMS.<p>We need to start teaching people in technology not just "code" but also ethics/philosophy like they do in medicine & law.<p>Also we need people with better moral standards. I would really like it if someone like Snowden, RMS or Jaron built business products (not just non-profit gimmicks) that satisfied real consumer needs.<p>Otherwise we are doomed.