Hey HN,<p>I made Browser-Use, an open-source tool that lets LLMs (any model supported by LangChain) execute tasks directly in the browser using only function calling.<p>It allows you to build agents that interact with web elements using natural language prompts. We created a layer that simplifies website interaction for LLMs by extracting XPaths and interactive elements like buttons and input fields (and other fancy things). This enables you to design custom web automation and scraping functions without manual inspection through DevTools.<p>Hasn't this been done a lot of times?
Good question. As a general SaaS tool, yes, but I think a lot of people are going to try to make their own web automation agents from scratch, so the idea is to provide the groundwork (a library) for the hard part so that not everyone has to repeat these steps:<p>- parse HTML in an LLM-friendly way (clickable items + screenshots)<p>- provide nice function calls for everything inside the browser<p>- create reusable agent classes<p>What is this NOT? An all-knowing AI agent that can solve all your problems.<p>The vision: create repeatable tasks on the web just by prompting your agent, without caring about the hows.<p>To better showcase the power of text extraction, we made a few demos such as:<p>- Applying for multiple software engineering jobs in San Francisco<p>- Opening new tabs to search for images of Albert Einstein, Oprah Winfrey, and Steve Jobs<p>- Finding the cheapest one-way flight from London to Kyrgyzstan for December 25th<p>I’d be interested in feedback on how this tool fits into your automation workflows. Try it out and let me know how it performs on your end.<p>We are Gregor & Magnus and we built this in 5 days.
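For readers wondering what "parse HTML in an LLM-friendly way" could look like, here is a minimal sketch using only the Python standard library. It is not Browser-Use's actual implementation: the tag list, the `InteractiveElementParser` class, and the `describe` function are hypothetical names chosen for illustration. The idea is to reduce a page to a short numbered list of interactive elements that a model can refer to by index.

```python
from html.parser import HTMLParser

# Tags treated as interactive here; an assumption, not Browser-Use's real list.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Interactive tags that never have a closing tag or inner text.
VOID_TAGS = {"input"}


class InteractiveElementParser(HTMLParser):
    """Collects interactive elements, each with an index an LLM can cite."""

    def __init__(self):
        super().__init__()
        self.elements = []  # every interactive element seen so far
        self._open = []     # interactive elements still collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        element = {"index": len(self.elements), "tag": tag,
                   "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data.strip()


def describe(html):
    """Return one short line per interactive element, e.g. '[1] <button> Go'."""
    parser = InteractiveElementParser()
    parser.feed(html)
    parser.close()
    lines = []
    for el in parser.elements:
        label = (el["text"] or el["attrs"].get("placeholder")
                 or el["attrs"].get("name") or "")
        lines.append(f'[{el["index"]}] <{el["tag"]}> {label}'.strip())
    return lines


page = ('<form><input name="q" placeholder="Search"><button>Go</button></form>'
        '<a href="/login">Sign in</a>')
print("\n".join(describe(page)))
```

A model could then answer with something like "click element 1", and the index maps back to a concrete element. A real tool would also need visibility checks, XPaths, and screenshots, which this sketch leaves out.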
Is it decided then that screenshots are better input for LLMs than HTML, or is that still an active area of investigation? I see that y'all elected for a mostly screenshot-based approach here, wondering if that was based on evidence or just a working theory.
Awesome project, starred! Here are some other projects for agentic browser interactions:<p>* Cerebellum (Typescript): <a href="https://github.com/theredsix/cerebellum">https://github.com/theredsix/cerebellum</a><p>* Skyvern: <a href="https://github.com/Skyvern-AI/skyvern">https://github.com/Skyvern-AI/skyvern</a><p>Disclaimer: I am the author of Cerebellum
It's impressive, but to me it seems like the saddest development experience...<p><pre><code> agent = Agent(
     task='Go to hackernews on show hn and give me top 10 post titles, their points and hours. Calculate for each the ratio of points per hour.',
     llm=ChatOpenAI(model='gpt-4o'),
 )
 await agent.run()
</code></pre>
Passing prompts to a LLM agent... waiting for the black box to run and do something...
Let's say in 1 year, more agents than humans interact with the web.<p>Do you think:
1. Websites will release more API functions for agents to interact with them
or
2. Tools like this will transform the UI into functions callable by agents, and maybe even cache all inferred functions for websites in a third-party service?
This is called screen scraping: text rendered on the screen is scraped, whether in a browser, on Windows, or on an Android screen. That's how software like AutoHotkey does automation. A Windows or Android screen can be dumped into hierarchical XML, along with the x/y coordinates of its UI elements and the text they contain, which can then be used to click, scroll, and scrape text.
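To make the XML-dump idea concrete, here is a small sketch that assumes a dump in the shape Android's `uiautomator dump` produces, where each `node` carries `text`, `clickable`, and a `bounds` attribute like `[x1,y1][x2,y2]`. The `clickable_targets` function and the sample XML are made up for illustration; it only computes tap coordinates and does not perform any click.

```python
import re
import xml.etree.ElementTree as ET

# Matches a bounds attribute such as "[100,200][300,400]".
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


def clickable_targets(xml_text):
    """Return (text, center_x, center_y) for every clickable node in the dump."""
    targets = []
    for node in ET.fromstring(xml_text).iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes with missing or malformed bounds
        x1, y1, x2, y2 = map(int, match.groups())
        targets.append((node.get("text", ""), (x1 + x2) // 2, (y1 + y2) // 2))
    return targets


# Hand-written sample, trimmed to the attributes this sketch reads.
SAMPLE = """<hierarchy rotation="0">
  <node text="" clickable="false" bounds="[0,0][1080,1920]">
    <node text="OK" clickable="true" bounds="[100,200][300,400]" />
  </node>
</hierarchy>"""

print(clickable_targets(SAMPLE))
```

The center of each bounding box is a reasonable place to tap; the resulting coordinates could be passed to whatever input-injection tool you already use.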
It would be amazing if:<p>a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and a test suite. (Training tasks can be used for training, test tasks only for evaluation.) Possibly a combination of unit tests against known XPaths, and integration tests that are multi-step and end in a measurable result. I know the web is constantly changing, so I'm not 100% sure how this should work.<p>b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.
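One way the training/test division in (a) could be sketched, assuming each eval task has a stable string id: hash the id so a task always lands in the same suite, then report a plain pass rate per model. The names `split_of` and `pass_rate` and the task ids below are hypothetical, and nothing like this is known to exist in the project.

```python
import hashlib


def split_of(task_id, test_fraction=0.2):
    """Deterministically assign a task to 'train' or 'test' based on its id."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def pass_rate(results):
    """results maps task id -> bool (did the run end in the expected result?)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical task ids and outcomes, for illustration only.
tasks = ["hn-top-10", "flight-search", "job-application"]
suites = {task: split_of(task) for task in tasks}
print(suites)
print(pass_rate({"hn-top-10": True, "flight-search": False}))
```

Hashing the id instead of shuffling means new tasks can be added later without moving existing ones between suites, which matters if training tasks must never leak into evaluation.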
This looks interesting. I am really impressed with MultiOn [0], and I tried to make something similar, but it's quite challenging doing it with a Chrome extension.<p>I also saw one doing Captcha solving with Selenium [1].<p>I will keep an eye on your development, good luck!<p>[0] <a href="https://www.multion.ai/" rel="nofollow">https://www.multion.ai/</a>
[1] <a href="https://github.com/VRSEN/agency-swarm">https://github.com/VRSEN/agency-swarm</a>
I have built something similar at <a href="https://github.com/ComposioHQ/composio/tree/master/python/composio/tools/local/browsertool/actions">https://github.com/ComposioHQ/composio/tree/master/python/co...</a><p>Compatible with any LLM and agentic framework
In case anyone else was looking for the functions available to the LLM:
<a href="https://github.com/gregpr07/browser-use/blob/68a3227c8bc97fe424f90f744c295c7d330ae5fd/src/controller/views.py#L59">https://github.com/gregpr07/browser-use/blob/68a3227c8bc97fe...</a>
This looks really interesting. The first hurdle, though, that prevents me from experimenting with this on my job is the lack of a license.<p>I see in the readme that it claims that it is MIT licensed, but there is no actual license file or information in any of the source files that I could find.
I was really excited about the original Claude computer use until I watched the YouTube videos and saw it was only running in a Docker container. I wish I could run something like this on a real machine.
Maybe you could build a database of which sites / pages work best with HTML vs. screenshots, and then choose HTML to save on token cost and improve latency where possible.
I want this to have cron, so I can ask it to check with my local parking agency, every day or every 12 hours, whether I have a parking ticket, and to raise a warning if I do. Or to check with the county jail and see if someone is still there or not. Or check the price of a product on Amazon every hour and warn when it has changed (aka camelcamelcamel but local). Or search craigslist/zillow/Facebook marketplace for items until one shows up. Etc.
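A rough sketch of the "check every N hours and warn on change" idea, assuming the agent run is wrapped in a `fetch()` callable that returns a comparable value such as a price. `fetch`, `detect_change`, and `poll` are hypothetical names, and the loop below is a stand-in for a real scheduler such as cron invoking a single check.

```python
import time


def detect_change(previous, current):
    """Return a warning string if the value changed, else None."""
    if previous is not None and previous != current:
        return f"changed: {previous!r} -> {current!r}"
    return None


def poll(fetch, interval_seconds, max_checks, sleep=time.sleep):
    """Call fetch() up to max_checks times and collect one warning per change."""
    warnings = []
    previous = None
    for i in range(max_checks):
        current = fetch()
        warning = detect_change(previous, current)
        if warning:
            warnings.append(warning)
        previous = current
        if i < max_checks - 1:
            sleep(interval_seconds)  # wait before the next check
    return warnings


# Stand-in for an agent run that reads a price from a page.
fake_prices = iter([10, 10, 12])
print(poll(lambda: next(fake_prices), interval_seconds=0, max_checks=3))
```

With a real scheduler you would persist the previous value between runs (a file or small database) and call `detect_change` once per invocation instead of keeping a process alive.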
>This enables you to design custom web automation and scraping functions without manual inspection through DevTools<p>Can it use a headless browser?