I’ve always wanted to be able to control my computer hands-free, whether my hands are busy (say, while I’m eating) or I’m away from my desk and want to run a routine task.<p>There are existing tools that can execute predefined tasks in the browser, but they have several disadvantages:<p>* They work by matching CSS selectors, so they are brittle: a small change to a page’s markup can break a task, which increases the cost of maintenance.<p>* Every task must be predefined, so the tool won’t work on new websites without additional configuration.<p>* It is tedious to specify a task consisting of many individual steps.<p>To solve this, I built [name redacted], a Chrome extension that uses GPT-4V to interpret the screen contents and deduce the sequence of steps needed to complete a task. See a demo here: <a href="https://www.youtube.com/watch?v=pyy7cMj-zHk" rel="nofollow">https://www.youtube.com/watch?v=pyy7cMj-zHk</a>.<p>Here’s how it works:<p>1. Ask [name redacted] to perform a task, such as “book me a dinner reservation at my favorite restaurant.”<p>2. [name redacted] takes a screenshot of the current page, collects some additional metadata about its interactive elements, and selects the next action to take toward accomplishing the task.<p>3. The action is performed in the browser tab, and the process repeats until the task is done.<p>Because it relies on what is on screen rather than on site-specific selectors, [name redacted] can in principle operate on any website, attempt any task a human could do in the browser, and work mostly autonomously. Accuracy is still a work in progress, but I’m confident the approach is valid.<p>Some things I’m excited about on the roadmap:<p>* Voice in/out support<p>* Record and rerun macros<p>* Automated testing and reports with [name redacted]<p>Let me know what you think. I’d love any feedback you have!
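<p>For anyone curious about the shape of the loop in steps 1 to 3, here is a rough TypeScript sketch. It is not the extension's actual code: every name in it (`Action`, `Observation`, `Deps`, `parseAction`, `runTask`) is made up for illustration, and it assumes the model replies with a small JSON object. The screenshot capture, the model call, and the DOM interaction are passed in as functions so the loop itself stays independent of any particular API.

```typescript
// Hypothetical sketch of an observe -> decide -> act loop.
// None of these names come from the real extension.

type Action =
  | { kind: "click"; elementId: number }
  | { kind: "type"; elementId: number; text: string }
  | { kind: "done"; summary: string };

interface Observation {
  screenshot: string; // e.g. a base64-encoded image of the visible tab
  elements: { id: number; tag: string; label: string }[]; // interactive elements
}

// The three side-effecting pieces are injected (assumed interfaces).
interface Deps {
  observe(): Promise<Observation>;
  decide(task: string, obs: Observation, history: Action[]): Promise<string>;
  perform(action: Action): Promise<void>;
}

// Validate the model's raw reply before acting on it. The JSON shape is an
// assumption; a reply that names an element not on the page is rejected.
function parseAction(reply: string, obs: Observation): Action {
  const raw: unknown = JSON.parse(reply);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Reply is not a JSON object");
  }
  const { kind, elementId, text, summary } = raw as Record<string, unknown>;
  if (kind === "done" && typeof summary === "string") {
    return { kind: "done", summary };
  }
  if (typeof elementId !== "number" || !obs.elements.some((e) => e.id === elementId)) {
    throw new Error("Reply refers to an unknown element");
  }
  if (kind === "click") return { kind: "click", elementId };
  if (kind === "type" && typeof text === "string") {
    return { kind: "type", elementId, text };
  }
  throw new Error("Unrecognized action");
}

// Repeat: look at the page, ask for the next action, perform it.
// Stops when the model says it is done, or gives up after maxSteps.
async function runTask(task: string, deps: Deps, maxSteps = 20): Promise<Action[]> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const obs = await deps.observe();
    const reply = await deps.decide(task, obs, history);
    const action = parseAction(reply, obs);
    history.push(action);
    if (action.kind === "done") return history;
    await deps.perform(action);
  }
  throw new Error(`Gave up after ${maxSteps} steps`);
}
```

<p>In a real extension, `observe` could plausibly wrap something like `chrome.tabs.captureVisibleTab` plus a content script that lists clickable elements, and `decide` would send the screenshot and element list to the vision model. The step cap and the check that the chosen element actually exists are there because the model's output is not guaranteed to be correct.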