Hey HN! My brother Arctic_fly and I spent the last two weeks since the GPT-4 launch building Taxy, an open source Chrome extension that lets you automate arbitrary tasks in your browser using GPT-4. You can see a few demos in the GitHub README, but basically it works like this:

1. You open the extension and write the task you'd like done (e.g. "schedule a meeting with David tomorrow at 2").

2. Taxy pulls the DOM of the current page, puts it through a pipeline to remove all non-semantic information, hidden elements, etc., and sends it to GPT-4 along with your text instructions.

3. GPT-4 tries to figure out what action to take. In our prompt we give it the option to either click an element or set an input's value. We use the ReAct paradigm (https://arxiv.org/abs/2210.03629), so it explains what it's trying to do before taking an action, which both makes it more accurate and helps with debugging.

4. Taxy parses GPT-4's response and performs the requested action on the page. It then goes back to step (2) and asks GPT-4 for the next action to take with the updated page DOM. It also sends the list of actions already taken as part of the current task, so GPT-4 can detect if it's getting stuck in a loop and abort. :)

5. Once GPT-4 has decided the task is done or it can't make any more progress, it responds with a special action indicating it's finished.

Right now there are a lot of limitations, and this is more a "research preview" than a finished product. That said, I've found it surprisingly capable for a number of tasks, and I think it's in a stable enough place that we can share it. Happy to answer any questions!
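For the curious, here's roughly what that loop looks like as a TypeScript sketch. The helper names (simplifyDom, callGpt4, parseAction, executeAction) are hypothetical stand-ins, not the extension's actual internals:

```typescript
// Sketch of the observe-act loop described above.
// All declared helpers are hypothetical stand-ins for the real internals.
type Action =
  | { kind: "click"; elementId: number }
  | { kind: "setValue"; elementId: number; value: string }
  | { kind: "done" };

declare function simplifyDom(root: HTMLElement): string;    // strip hidden/non-semantic nodes
declare function callGpt4(prompt: string): Promise<string>; // wraps the OpenAI chat API
declare function parseAction(reply: string): Action;        // extract the action from the ReAct output
declare function executeAction(action: Action): void;       // click / set value on the live page

async function runTask(task: string): Promise<void> {
  const history: string[] = [];           // actions taken so far, resent each step
  for (let step = 0; step < 20; step++) { // hard cap as an extra loop guard
    const dom = simplifyDom(document.body);
    const reply = await callGpt4(
      `Task: ${task}\n` +
      `Actions so far:\n${history.join("\n")}\n` +
      `Page:\n${dom}\n` +
      `Explain your reasoning, then output one action: click(id), setValue(id, text), or done().`
    );
    const action = parseAction(reply);  // ReAct: thought first, then a single action
    if (action.kind === "done") return; // model signals the task is complete (step 5)
    executeAction(action);
    history.push(reply);                // lets the model notice it's stuck in a loop
  }
}
```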
This project demonstrates the death of UI. UIs were created for humans to interact with software, not for bots to perform tasks. If all you need to provide is a text request, then we don't need UIs. All software can just expose REST/text interfaces, and an LLM can perform the task for you.
I really like this, but I wonder if a better approach would be to take that simplified DOM and generate Playwright or Puppeteer code rather than manipulating the DOM directly. That way you get reproducible browser automation.
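Something like this is what I have in mind: the model emits a normal Playwright script you can re-run later without the LLM in the loop. The URL and selectors below are made up for illustration:

```typescript
import { chromium } from "playwright";

// Hypothetical script the model might generate for "schedule a meeting
// with David tomorrow at 2". Everything below is illustrative; a real run
// would emit selectors matching the live DOM.
async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://calendar.example.com");
  await page.getByRole("button", { name: "Create" }).click();
  await page.getByLabel("Add title").fill("Meeting with David");
  await page.getByLabel("Start time").fill("2:00 PM");
  await page.getByRole("button", { name: "Save" }).click();
  await browser.close();
}

main();
```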
Automation seems like a poor fit for an LLM, since results are random and irreproducible? But I could see asking it to write a script step by step and, once you've confirmed it works, keeping the script.

Also, it could help you fix the script when it breaks.
I've been anxiously waiting for something like this - very cool.

One use case I've had: I hate spending time on my LinkedIn, Twitter, etc. newsfeeds, but there are a handful of people I care about and want to keep tabs on.

Is there a way I could use TaxyAI to set up a role that monitors my LinkedIn newsfeed, keeps tabs on certain people and topics, and then emails me a digest?
Similar to this, does anyone know of a browser extension where I can paste in (or choose from saved snippets) a series of Playwright or Puppeteer steps and have it execute them? I could use saved snippets in the Sources tab of DevTools, but then I'd miss the auto-waiting and other niceties. This project seems a bit too slow and non-deterministic.
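For anyone unfamiliar, the "auto-waiting" I mean is roughly this (URL and button are illustrative):

```typescript
import { chromium } from "playwright";

// A raw DevTools snippet like document.querySelector("#submit").click()
// throws if the button hasn't rendered yet, so snippets end up littered
// with manual sleeps and polling.
async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");

  // A Playwright locator retries until the element is attached, visible,
  // enabled, and stable before clicking -- no polling code required.
  await page.getByRole("button", { name: "Submit" }).click();

  await browser.close();
}

main();
```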
I wrote a piece on my professional blog last week about the imminent death of *most* UI-based software, and it's funny to see this released today to further my argument.

And as I commented elsewhere: yes, UI elements make sense sometimes. But it makes sense for an AI to dynamically generate them for us when needed, instead of relying on the software's own implementation, which may suck, have dark patterns, or force workflows I don't want to deal with.
Why use GPT-4? The latency is significantly worse than 3.5's, and this seems simple enough that the performance delta is marginal. If I were going for robustness, I probably wouldn't be using AI in the first place.

Edit: I noticed they support both, but judging by the speed, I'm assuming all the demos are using 3.5?
This is amazing already! Very exciting. I'll make sure I follow this project's progress. It also reminds me of Adept and their goal with ACT-1. I still haven't seen their product launch, though...
It will be interesting to see whether this sort of approach works better than something using GPT-4's vision capabilities. Obviously websites are built to be easy to use visually rather than easy to use via the DOM. On the other hand, it's much less clear how to ground action proposals in the visual domain - how do you ask GPT where on an image of the screen it wants to click?
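One workaround I've seen proposed (not something the GPT-4 API gives you out of the box) is to overlay numbered badges on every clickable element before taking the screenshot, then ask the model for a badge number instead of pixel coordinates. A rough sketch of the labeling side:

```typescript
// Sketch: tag every clickable element with a visible numeric badge, then
// screenshot the page and ask the vision model "which number do you click?".
// This sidesteps asking the model for raw pixel coordinates.
function labelClickableElements(): Map<number, Element> {
  const labels = new Map<number, Element>();
  const clickables = document.querySelectorAll<HTMLElement>(
    "a, button, input, select, textarea, [role='button']"
  );
  clickables.forEach((el, i) => {
    labels.set(i, el);
    const badge = document.createElement("span");
    badge.textContent = String(i);
    badge.style.cssText =
      "position:absolute;background:red;color:white;font-size:10px;z-index:99999;";
    const rect = el.getBoundingClientRect();
    badge.style.left = `${rect.left + window.scrollX}px`;
    badge.style.top = `${rect.top + window.scrollY}px`;
    document.body.appendChild(badge);
  });
  return labels; // badge number -> element, for executing the model's choice
}
```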
This is very cool! I was messing with some browser automation (Playwright) via GPT recently.

One idea I had: it would be cool if I could teach the agent. For instance, give it a task, but if it struggles, just complete it myself while the extension observes my interactions.

Perhaps these could be used as few-shot examples for priming the model?

Gonna play around with this soon!
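Sketching what I mean (cssPath and the action format are hypothetical): record the user's demonstration, then serialize it into the same action format the prompt uses, so it can be prepended as a few-shot example for similar tasks.

```typescript
// Sketch: capture the user's clicks and inputs during a demonstration.
type RecordedStep = { selector: string; action: "click" | "setValue"; value?: string };

declare function cssPath(el: HTMLElement): string; // hypothetical: build a unique selector

const recording: RecordedStep[] = [];

document.addEventListener("click", (e) => {
  recording.push({ selector: cssPath(e.target as HTMLElement), action: "click" });
}, true);

document.addEventListener("change", (e) => {
  const el = e.target as HTMLInputElement;
  recording.push({ selector: cssPath(el), action: "setValue", value: el.value });
}, true);

// Serialize the demonstration into a few-shot example for the prompt.
function toFewShotExample(task: string): string {
  const steps = recording.map((s) =>
    s.action === "click"
      ? `click(${s.selector})`
      : `setValue(${s.selector}, ${JSON.stringify(s.value)})`
  );
  return `Task: ${task}\n${steps.join("\n")}\ndone()`;
}
```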
Does the demo show the AI, given the prompt "Schedule standup tomorrow at 10am. Invite david@taxy.ai", scheduling a meeting at 10am TODAY, at a time that was already five hours in the past?

Makes me worried about AI with internet access...
Does this mean that form-fillers that actually work are around the corner?

Like the LastPass form filler, but one that actually works?

Never having to fill out any web form manually ever again?!

That's a killer app right there, IMO.
Very cool. The "sending everything of relevance on the page to OpenAI" part is of course creepy, but that's table stakes for anything like this until people can run these models themselves.

This would make a cool "magic box" at the top of a web page: type in what you want to do, and it gets sent to the server along with the DOM extract (the same site's server). The server asks the magical LLM how to do it, then spits the answer back to the client. So no plugin needed, and the data flow would pass through the source server.
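The client half of that round trip could be as small as this (the /magic endpoint and response shape are invented for illustration):

```typescript
// Hypothetical "magic box" client: send the user's request plus a DOM
// extract to the page's own server, get back a list of actions to perform.
async function magicBox(request: string): Promise<void> {
  const res = await fetch("/magic", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      request,
      dom: document.body.innerHTML.slice(0, 50_000), // crude extract; real code would simplify it
    }),
  });
  const { actions } = (await res.json()) as {
    actions: { selector: string; type: "click" | "setValue"; value?: string }[];
  };
  for (const a of actions) {
    const el = document.querySelector<HTMLElement>(a.selector);
    if (!el) continue; // server's selector may not match; skip rather than guess
    if (a.type === "click") el.click();
    else (el as HTMLInputElement).value = a.value ?? "";
  }
}
```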
My feeling is that an API is much more stable and deterministic than a human-readable interface. Also, you could train the AI to learn which API calls to make for a task by looking at page sources. Why not translate prompts into single API calls, instead of a script that clicks through DOM elements to achieve the same thing?
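Roughly this, as a sketch (the event-creation endpoint is made up, and you'd want validation before firing the call):

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Sketch: ask the model for one structured REST call instead of a
// sequence of DOM actions, then execute it directly.
async function promptToApiCall(prompt: string): Promise<Response> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: 'Reply with JSON only: {"method": ..., "path": ..., "body": ...}',
      },
      { role: "user", content: prompt },
    ],
  });
  const call = JSON.parse(completion.choices[0].message.content ?? "{}");
  // e.g. { method: "POST", path: "/api/events", body: { title: "Standup", ... } }
  return fetch(call.path, {
    method: call.method,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(call.body),
  });
}
```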
Very cool idea! I'm excited to try it. I'm a little bit worried about the reliability of interfacing with a website via the DOM. I trust GPT-4 enough, but I could see a situation where the correct fields to fill in are ambiguous in the DOM and the plugin ends up saving or deleting the wrong data.
This makes me think of https://www.bardeen.ai/, but maybe more capable of finding out what to do by itself.
This is very cool, impressive work in 2 weeks!
Each action seems to have some delay after it; is there any reason for that? Is it because you're streaming the OpenAI response and performing the actions as they come in? If not, I imagine streaming the response and executing each action as it's emitted would speed things up quite a bit.
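For reference, a streaming version could look something like this, assuming the model emits one action per line (executeActionLine is hypothetical):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

declare function executeActionLine(line: string): void; // hypothetical action executor

// Sketch: stream the completion and execute each action line as soon as it
// fully arrives, instead of waiting for the whole response.
async function streamActions(
  messages: { role: "system" | "user" | "assistant"; content: string }[]
) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages,
    stream: true,
  });
  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk.choices[0]?.delta?.content ?? "";
    let newline: number;
    while ((newline = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line.startsWith("Action:")) executeActionLine(line); // act immediately
    }
  }
  const rest = buffer.trim();
  if (rest.startsWith("Action:")) executeActionLine(rest); // final unterminated line
}
```

One caveat: if each response contains only a single action preceded by its explanation (as the OP describes), streaming only saves whatever tokens follow the action line, so the win may be small.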
Wow... Another AI thing that I can't use because there's a "waitlist". GTFO, software doesn't need waitlists and you're a jerk for advertising uselessness.