TaxyAI: Open-source browser automation with GPT-4

355 points by kcorbitt about 2 years ago

29 comments

kcorbitt about 2 years ago
Hey HN! My brother Arctic_fly and I spent the last two weeks since the GPT-4 launch building Taxy, an open source Chrome extension that lets you automate arbitrary tasks in your browser using GPT-4. You can see a few demos in the GitHub README, but basically it works like this:

1. You open the extension and write the task you'd like done (e.g. "schedule a meeting with David tomorrow at 2").

2. Taxy pulls the DOM of the current page, puts it through a pipeline to remove all non-semantic information, hidden elements, etc. and sends it to GPT-4 along with your text instructions.

3. GPT-4 tries to figure out what action to take. In our prompt we give it the option to either click an element or set an input's value. We use the ReAct paradigm (https://arxiv.org/abs/2210.03629) so it explains what it's trying to do before taking an action, which both makes it more accurate and helps with debugging.

4. Taxy parses GPT-4's response and performs the action requested on the page. It then goes back to step (2) and asks GPT-4 for the next action to take with the updated page DOM. It also sends the list of actions already taken as part of the current task so GPT-4 can detect if it's getting stuck in a loop and abort. :)

5. Once GPT-4 has decided the task is done or it can't make any more progress, it responds with a special action indicating it's done.

Right now there are a lot of limitations, and this is more a "research preview" than a finished product. That said, I've found it surprisingly capable for a number of tasks, and I think it's in a stable enough place we can share. Happy to answer any questions!
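The loop in steps 2 to 5 can be sketched roughly as follows. This is a minimal illustration in Python, not Taxy's actual code (the extension itself runs in the browser): the `<Action>name(args)</Action>` reply format, the action names `click`, `setValue`, `finish` and `fail`, and the `get_dom`, `ask_model` and `perform` callables are all hypothetical stand-ins for the DOM pipeline, the GPT-4 call and the page manipulation.

```python
import json
import re

# Hypothetical reply format: free-text reasoning followed by one action tag.
ACTION_RE = re.compile(r"<Action>(\w+)\((.*?)\)</Action>", re.DOTALL)


def parse_action(response):
    """Extract the action name and its JSON-encoded arguments from a reply."""
    match = ACTION_RE.search(response)
    if match is None:
        raise ValueError("no action found in model response")
    name, raw_args = match.groups()
    args = json.loads(f"[{raw_args}]") if raw_args.strip() else []
    return name, args


def run_task(task, get_dom, ask_model, perform, max_steps=10):
    """Ask the model for one action at a time until it signals completion."""
    history = []  # actions already taken, sent back so the model can spot loops
    for _ in range(max_steps):
        dom = get_dom()  # step 2: simplified DOM of the current page
        response = ask_model(task, dom, history)  # step 3: model picks an action
        name, args = parse_action(response)  # step 4: parse the reply
        history.append((name, args))
        if name in ("finish", "fail"):  # step 5: special terminating actions
            break
        perform(name, args)  # step 4: apply the action to the page
    return history
```

The `max_steps` cap is an added safety assumption so that a model that never emits a terminating action cannot loop forever.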
mlboss about 2 years ago
This project demonstrates the death of UI. UIs were created for humans to interact with software, not for bots to perform tasks. If all you need to provide is your text request, then we don't need UIs. All software can just expose REST/text interfaces and an LLM can perform the task for you.
adeelk93 about 2 years ago
I really like this, but I wonder if a better approach would be to take that simplified DOM and generate Playwright or Puppeteer code instead of manipulating the DOM directly. That way it's reproducible browser automation.
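One way to picture this suggestion: record the actions the model chose, then emit them as a script that can be replayed without the model. The sketch below generates Playwright for Python source text from a list of `(kind, selector, value)` tuples. The tuple shape and the `click`/`setValue` names are assumptions for illustration; it also assumes each action can be mapped to a CSS selector, which the thread does not confirm.

```python
def to_playwright_script(url, actions):
    """Render recorded actions as a standalone Playwright (sync API) script."""
    lines = [
        "from playwright.sync_api import sync_playwright",
        "",
        "with sync_playwright() as p:",
        "    browser = p.chromium.launch()",
        "    page = browser.new_page()",
        f"    page.goto({url!r})",
    ]
    for kind, selector, value in actions:
        if kind == "click":
            lines.append(f"    page.click({selector!r})")
        elif kind == "setValue":
            # page.fill sets the value of an input-like element
            lines.append(f"    page.fill({selector!r}, {value!r})")
        else:
            raise ValueError(f"unsupported action: {kind}")
    lines.append("    browser.close()")
    return "\n".join(lines) + "\n"


script = to_playwright_script(
    "https://example.com/calendar",
    [("click", "#create", None), ("setValue", 'input[name="title"]', "Standup")],
)
print(script)
```

The generated script is deterministic once written, which is the reproducibility the comment is after; the model is only needed again when the page changes and the selectors stop matching.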
skybrian about 2 years ago
Automation seems like a poor fit for an LLM since results are random and irreproducible? But I could see asking it to write a script step by step, and, once you've confirmed it works, keeping the script.

Also, it could help you fix the script when it breaks.
dopeboy about 2 years ago
Anxiously been waiting for something like this - very cool.

One use case I've had is that I hate spending time on my LinkedIn, Twitter, etc. newsfeeds. But there are a handful of people I care about and want to keep tabs on.

Is there a way I could use TaxyAI to set up a role to monitor my LinkedIn newsfeed and keep tabs on certain people + topics and then email me a digest of that?
frankthedog about 2 years ago
Similar to this, does anyone know of a browser extension where I can paste in (or choose from some saved snippets) a series of Playwright or Puppeteer steps and have it execute them? I could use the saved snippets in the Sources tab of dev tools, but I'd miss the auto-waiting and other niceties. This project seems a bit too slow and non-deterministic.
93po about 2 years ago
I wrote a piece on my professional blog last week about the imminent death of *most* UI-based software, and it's funny to see this releasing today to further my argument.

And as I commented elsewhere: yes, UI elements make sense sometimes. But it makes sense for an AI to dynamically make these for us when needed instead of relying on the software's own implementation that may suck or have dark patterns or force workflows I don't want to deal with.
serjester about 2 years ago
Why use GPT-4? The latency is significantly worse than 3.5 and this seems simple enough that the performance delta is marginal. If I was going for robustness, I probably wouldn’t be using AI in the first place.

Edit: I noticed they support both but I’m assuming by the speed all the demos are using 3.5?
WonderBuilder about 2 years ago
This is amazing already! Very exciting. I'll make sure I follow this project's progress. It also reminds me of Adept and their goal with ACT-1. I still haven't seen their product launch, though...
Imnimo about 2 years ago
It will be interesting to see whether this sort of approach works better than something using GPT-4's vision capabilities. Obviously websites are built to be easy to use visually rather than easy to use via the DOM. On the other hand, it's much less clear how to ground action proposals in the visual domain - how do you ask GPT where on an image of the screen it wants to click?
goleary about 2 years ago
This is very cool! I was messing with some browser automation (Playwright) via GPT recently.

One idea I had: it would be cool if I could teach the agent. For instance, give it a task, but if it struggles, just complete it myself while the extension observes my interactions.

Perhaps these could be used as few-shot examples for priming the model?

Gonna play around with this soon!
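As a rough illustration of the few-shot idea, recorded demonstrations could be prepended to the prompt before the current task. The prompt wording, the `demonstrations` structure and the step strings below are all made up for the example; nothing here reflects how Taxy actually builds its prompt.

```python
def build_prompt(task, dom, demonstrations):
    """Assemble a prompt that shows recorded demonstrations before the real task."""
    parts = ["You control a browser. Reply with one action per turn."]
    for demo in demonstrations:
        parts.append(f"Example task: {demo['task']}")
        for step in demo["steps"]:  # steps observed while the user did the task
            parts.append(f"  {step}")
    parts.append(f"Task: {task}")
    parts.append(f"Page: {dom}")
    return "\n".join(parts)


recorded = [
    {
        "task": "Create an issue titled 'Bug'",
        "steps": ["click(#new-issue)", "setValue(#title, 'Bug')", "click(#submit)"],
    }
]
print(build_prompt("Create an issue titled 'Docs'", "<simplified DOM>", recorded))
```

Each demonstration adds tokens to every request, so in practice only a few short, relevant recordings would likely fit alongside the page DOM.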
cbuq about 2 years ago
Does the demo show the AI with the prompt to "Schedule standup tomorrow at 10am. Invite david@taxy.ai" scheduling a meeting at 10am TODAY, which also was already five hours in the past?

Makes me worried about AI with internet access...
bboygravity about 2 years ago
Does this mean that form-fillers that actually work are around the corner?

Like the LastPass form filler, but instead it would actually work?

Never ever having to fill out any webform manually ever again?!

That's a killer app right there IMO.
koolba about 2 years ago
Very cool. The “sending everything of relevance on the page to OpenAI” is of course creepy. But that’s table stakes for anything like this until people can run them externally.

This would make a cool “magic box” at the top of a web page. Type in what you want to do, and it sends that to the server along with the DOM extract (same-site server). The server asks the magical LLM how to do it, and then spits the answer back to the client. So no plug-in would be needed and the data flow would pass through the source server.
paulmendoza about 2 years ago
Need this to be a console app so I can use it for QA testing.
gloosx about 2 years ago
I feel that an API is something much more stable and deterministic than a human-readable interface. Also, you can train an AI to learn which API calls to make for the task by looking at page sources. Why not translate prompts to single API calls instead of a script that clicks through DOM elements to achieve the same thing?
yosito about 2 years ago
Very cool idea! I'm excited to try it. I'm a little bit worried about the reliability of interfacing with a website via the DOM. I trust GPT-4 enough, but I could see a situation where the correct fields to fill in are ambiguous in the DOM and the plugin ends up saving or deleting the wrong data.
seydor about 2 years ago
Wow this kind of thing makes plugins obsolete. I thought it would take more than a week
Octopuz about 2 years ago
This makes me think of https://www.bardeen.ai/ but maybe more capable of finding out what to do by itself.
dpflan about 2 years ago
Curious: Can someone explain what they are excited to use this for? Can someone provide a large scale use-case/scenario?
Arctic_fly about 2 years ago
Already useful across a variety of domains, and it's in early days yet!

Just yesterday I used it to create a GitHub issue with minimal effort.
paulmendoza about 2 years ago
This looks great for integration testing
tectonic about 2 years ago
Oh man, I have been working on an extension like this over the last few days. Congrats on your release!
zavex79 about 2 years ago
Would be interesting to combine this with Talon Voice, to add a voice interface.
matheusmoreira about 2 years ago
I wonder how effective these models are at blocking advertisers and tracking.
mgoetzke about 2 years ago
How do you solve the token limitations with complex sites?
krembanan about 2 years ago
This is very cool, impressive work in 2 weeks! Each action seems to have some delay after it, is there any reason for that? Is it because you are streaming the OpenAI response and performing the actions as they come? If not, I imagine streaming the query response and executing each action as they emit would speed it up quite a bit?
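To make the streaming suggestion concrete, here is a sketch of yielding actions as soon as each one is complete in a streamed reply, rather than waiting for the full response. It assumes a hypothetical `<Action>name(args)</Action>` tag format and a model that emits several actions per reply; the thread does not say whether Taxy streams, and its described loop requests one action at a time against a fresh DOM, in which case streaming would mainly save the time spent waiting for the reasoning text to finish.

```python
import re

# Hypothetical tag format; a real prompt may use a different syntax.
ACTION_RE = re.compile(r"<Action>(\w+)\((.*?)\)</Action>", re.DOTALL)


def stream_actions(chunks):
    """Yield (name, raw_args) as soon as a complete action tag has arrived."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A single chunk may complete more than one action, so keep scanning.
        while (match := ACTION_RE.search(buffer)) is not None:
            yield match.group(1), match.group(2)
            buffer = buffer[match.end():]  # drop everything already consumed


pieces = ["Thought: open the form\n<Act", "ion>click(1", "2)</Action>"]
for name, raw_args in stream_actions(pieces):
    print(name, raw_args)
```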
snihalani about 2 years ago
TAKE. MY. MONEY. NOW.
mmh0000 about 2 years ago
Wow... Another AI thing that I can't use because there's a "waitlist". GTFO, software doesn't need waitlists and you're a jerk for advertising uselessness.