Show HN: I wrote an open-source browser alternative for Computer Use for any LLM

180 points by gregpr07 7 months ago

Hey HN,

I made Browser-Use, an open-source tool that lets LLMs (any that LangChain supports) execute tasks directly in the browser using only function calling.

It allows you to build agents that interact with web elements using natural language prompts. We created a layer that simplifies website interaction for LLMs by extracting XPaths and interactive elements like buttons and input fields (and other fancy things). This enables you to design custom web automation and scraping functions without manual inspection through DevTools.

Hasn't this been done a lot of times? Good question. As a general SaaS tool, yes, but I think a lot of people are going to try to make their own web automation agents from scratch, so the idea is to provide the groundwork/library for the hard part so that not everyone has to repeat these steps:

- parse HTML in an LLM-friendly way (clickable items + screenshots)
- provide nice function calls for everything inside the browser
- create reusable agent classes

What is this NOT? An all-knowing AI agent that can solve all your problems.

The vision: create repeatable tasks on the web just by prompting your agent, without caring about the how.

To better showcase the power of text extraction, we made a few demos such as:

- Applying for multiple software engineering jobs in San Francisco
- Opening new tabs to search for images of Albert Einstein, Oprah Winfrey, and Steve Jobs
- Finding the cheapest one-way flight from London to Kyrgyzstan for December 25th

I'd be interested in feedback on how this tool fits into your automation workflows. Try it out and let me know how it performs on your end.

We are Gregor & Magnus and we built this in 5 days.
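
To make the "parse HTML in an LLM-friendly way" step concrete, here is a minimal sketch using only the Python standard library. It is not Browser-Use's actual implementation: the class name `InteractiveElementCollector`, the helper `describe_for_llm`, and the set of tags treated as interactive are all assumptions made for illustration. It walks the HTML, assigns each interactive element an index and a positional XPath, and produces a short text listing that could be handed to a model.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not the project's list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Void elements have no closing tag, so they are never pushed onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class InteractiveElementCollector(HTMLParser):
    """Collects interactive elements together with a positional XPath."""

    def __init__(self):
        super().__init__()
        self.stack = []           # open elements as (tag, xpath_step, element_or_None)
        self.child_counts = [{}]  # per open element: tag -> children seen so far
        self.elements = []        # collected interactive elements

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        step = f"{tag}[{counts[tag]}]"
        xpath = "/" + "/".join([s for _, s, _ in self.stack] + [step])
        element = None
        if tag in INTERACTIVE_TAGS:
            element = {"index": len(self.elements), "tag": tag,
                       "xpath": xpath, "attrs": dict(attrs), "text": ""}
            self.elements.append(element)
        if tag not in VOID_TAGS:
            self.stack.append((tag, step, element))
            self.child_counts.append({})

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the nearest match.
        if tag not in [open_tag for open_tag, _, _ in self.stack]:
            return
        while self.stack:
            open_tag, _, _ = self.stack.pop()
            self.child_counts.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attach visible text to the nearest enclosing interactive element.
        for _, _, element in reversed(self.stack):
            if element is not None:
                element["text"] = (element["text"] + " " + text).strip()
                break


def describe_for_llm(html):
    """Return the collected elements and a compact, indexed text listing."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    lines = []
    for el in collector.elements:
        label = (el["text"] or el["attrs"].get("placeholder")
                 or el["attrs"].get("name") or "")
        lines.append(f'{el["index"]}: <{el["tag"]}> {label}'.rstrip())
    return collector.elements, "\n".join(lines)
```

The model would then only need to answer with an index (for example "click 2"), and the stored XPath maps that index back to a concrete element in the page. A real page would also need visibility checks and handling for iframes and shadow DOM, which this sketch leaves out.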

16 comments

firejake308 7 months ago

Is it decided then that screenshots are better input for LLMs than HTML, or is that still an active area of investigation? I see that y'all elected for a mostly screenshot-based approach here, wondering if that was based on evidence or just a working theory.

theredsix 7 months ago

Awesome project, starred! Here are some other projects for agentic browser interactions:

* Cerebellum (TypeScript): https://github.com/theredsix/cerebellum
* Skyvern: https://github.com/Skyvern-AI/skyvern

Disclaimer: I am the author of Cerebellum.

gitgud 7 months ago

It's impressive, but to me it seems like the saddest development experience...

<pre><code>agent = Agent(
    task='Go to hackernews on show hn and give me top 10 post titles, their points and hours. Calculate for each the ratio of points per hour.',
    llm=ChatOpenAI(model='gpt-4o'),
)
await agent.run()
</code></pre>

Passing prompts to an LLM agent... waiting for the black box to run and do something...

maggreenWAI 7 months ago

Let's say that in 1 year, more agents than humans interact with the web.

Do you think:

1. Websites will release more API functions for agents to interact with them, or
2. Tools like this will transform the UI into functions callable by agents, and maybe even cache all inferred functions for websites in a third-party service?

G_o_D 7 months ago

This is called screen scraping: text rendered on a screen is scraped, whether in a browser, on Windows, or on an Android screen. That's how software like AutoHotkey does automation. A Windows or Android screen can be dumped into hierarchical XML containing the x/y coordinates of its UI elements and the text they contain, which can then be used to click, scroll, and scrape text.
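
As a small illustration of the XML-dump approach, the sketch below parses a hierarchy in the style of Android's `uiautomator dump` output, where each `node` carries `text`, `clickable`, and `bounds="[left,top][right,bottom]"` attributes. The sample XML and the function name `clickable_targets` are made up for this example; a real dump contains many more attributes.

```python
import re
import xml.etree.ElementTree as ET

# Bounds look like "[left,top][right,bottom]" in uiautomator-style dumps.
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

# Hand-written sample for illustration only.
SAMPLE_DUMP = """
<hierarchy>
  <node class="android.widget.FrameLayout" text="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.Button" text="OK" clickable="true" bounds="[100,200][300,280]" />
    <node class="android.widget.TextView" text="Hello" clickable="false" bounds="[100,300][500,360]" />
  </node>
</hierarchy>
"""


def clickable_targets(xml_dump):
    """Return (text, center_x, center_y) for every clickable node in the dump."""
    targets = []
    for node in ET.fromstring(xml_dump.strip()).iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes with missing or malformed bounds
        left, top, right, bottom = map(int, match.groups())
        targets.append((node.get("text", ""), (left + right) // 2, (top + bottom) // 2))
    return targets
```

The center coordinates are what an automation tool would tap or click; the text is what it would scrape.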

bravura 7 months ago

It would be amazing if:

a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and a test suite. (Training tasks can be used for training, test tasks only for evaluation.) Possibly a combination of unit tests against known XPaths, and integration tests that are multi-step and end in a measurable result. I know the web is constantly changing, so I'm not 100% sure how this should work.

b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.
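
One possible shape for such a suite, sketched under assumptions: tasks are assigned to a train or test split by hashing a stable ID (so the split does not shift between runs), and a model is scored by exact match against a measurable end result. `EvalTask`, `split_of`, `accuracy`, and the `run_agent` callable are hypothetical names; nothing here is part of Browser-Use.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalTask:
    task_id: str   # stable identifier, used only for the split
    prompt: str    # natural-language task given to the agent
    expected: str  # the measurable end result


def split_of(task: EvalTask, test_fraction: float = 0.2) -> str:
    """Assign a task to 'train' or 'test' by hashing its id (stable across runs)."""
    digest = hashlib.sha256(task.task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return "test" if bucket < test_fraction * 2**64 else "train"


def accuracy(tasks: Sequence[EvalTask], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose agent output exactly matches the expected result."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if run_agent(t.prompt).strip() == t.expected)
    return passed / len(tasks)
```

Exact-match scoring only suits tasks with a single well-defined answer; multi-step tasks on live sites would likely need a looser check, and the changing-web problem mentioned above remains open.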

Oras 7 months ago

This looks interesting. I am really impressed with MultiOn [0], and I tried to make something similar, but it's quite challenging to do with a Chrome extension.

I also saw one doing CAPTCHA solving with Selenium [1].

I will keep an eye on your development, good luck!

[0] https://www.multion.ai/
[1] https://github.com/VRSEN/agency-swarm

soham123 7 months ago

I have built something similar at https://github.com/ComposioHQ/composio/tree/master/python/composio/tools/local/browsertool/actions

Compatible with any LLM and agentic framework.

rahimnathwani 7 months ago

In case anyone else was looking for the functions available to the LLM: https://github.com/gregpr07/browser-use/blob/68a3227c8bc97fe424f90f744c295c7d330ae5fd/src/controller/views.py#L59

coreyp_1 7 months ago

This looks really interesting. The first hurdle that prevents me from experimenting with this at my job, though, is the lack of a license.

I see that the readme claims it is MIT licensed, but I could not find an actual license file or any license information in the source files.

daft_pink 7 months ago

I was really excited about the original Claude computer use until I watched the YouTube videos and saw it was only running in a Docker container. I wish I could run something like this on a real machine.

KaoruAoiShiho 7 months ago

Maybe you could build a database of which sites/pages work best with HTML vs. screenshots, and then choose HTML where possible to save on token cost and improve latency.
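
A minimal sketch of that lookup, assuming the "database" is just a mapping from domain to preferred input mode. The table contents, the function name `choose_input_mode`, and the screenshot fallback are all hypothetical; real entries would have to come from measurements.

```python
from urllib.parse import urlparse

# Hypothetical per-domain preferences; real entries would come from measurements.
PREFERRED_MODE = {
    "news.ycombinator.com": "html",
    "example.org": "html",
    "maps.example.com": "screenshot",
}


def choose_input_mode(url, table=PREFERRED_MODE, default="screenshot"):
    """Pick 'html' or 'screenshot' for a URL, falling back to parent domains."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in table:
            return table[candidate]
    return default
```

Falling back to the more expensive mode for unknown sites is a design choice made here for safety; the opposite default would save more tokens at the risk of missing content that only shows up visually.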

fragmede 7 months ago

I'd want this to have cron, so I can ask it to check with my local parking agency every day or every 12 hours whether I have a parking ticket, and to raise a warning if I do. Or to check with the county jail and see whether someone is still there. Or to check the price of a product on Amazon every hour and warn when it has changed (aka camelcamelcamel, but local). Or to search Craigslist/Zillow/Facebook Marketplace for items until one shows up. Etc.
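
A rough sketch of that kind of recurring check, using only `asyncio`. The `watch` function and its parameters are hypothetical; in practice `check` would wrap an agent run that returns the value of interest (a price, a ticket status), while here it is just any async callable.

```python
import asyncio


async def watch(check, on_change, interval_seconds, max_runs=None):
    """Call `check` repeatedly; call `on_change(old, new)` when its result changes."""
    previous = None
    first = True
    runs = 0
    while max_runs is None or runs < max_runs:
        current = await check()
        # The first observation only sets the baseline; later ones are compared.
        if not first and current != previous:
            on_change(previous, current)
        previous, first = current, False
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval_seconds)
```

For example, with a stand-in check that yields successive prices:

```python
prices = iter([10, 10, 12])

async def fake_price_check():
    return next(prices)

asyncio.run(watch(fake_price_check,
                  lambda old, new: print(f"changed: {old} -> {new}"),
                  interval_seconds=0, max_runs=3))
```

A plain system cron entry that launches a one-shot script would work too; the in-process loop just keeps the previous result in memory without extra storage.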

DeathArrow 7 months ago

> This enables you to design custom web automation and scraping functions without manual inspection through DevTools

Can it use a headless browser?

WillAdams 7 months ago

Does it work with COM objects/Java applications?

I'd give my interest in Hell for a way to have a script plug data into a Java app.

ReD_CoDE 6 months ago

Many web developers use Playwright and Puppeteer, so why Selenium?