Hey HN, we're building Skyvern (<a href="https://www.skyvern.com">https://www.skyvern.com</a>), an open-source tool that uses LLMs and computer vision to help companies automate browser-based workflows. You can see some examples here: <a href="https://github.com/Skyvern-AI/skyvern#real-world-examples-of-skyvern">https://github.com/Skyvern-AI/skyvern#real-world-examples-of...</a> and there's a demo video at <a href="https://github.com/Skyvern-AI/skyvern#demo">https://github.com/Skyvern-AI/skyvern#demo</a>, along with some instructions on running it locally.<p>We provide a natural-language API to automate repetitive manual workflows that happen within companies' back offices. You can check out our code and play with Skyvern here: <a href="https://github.com/Skyvern-AI/Skyvern">https://github.com/Skyvern-AI/Skyvern</a><p>We talked to hundreds of companies about things they do in the background and found that most of them depend on repetitive manual workflows. The breadth of these workflows surprised us – most companies started off doing things manually, and eventually either hired people to scale the manual work or wrote scripts using Selenium-like browser automation libraries.<p>In these conversations, one common point stood out: scaling is a pain either way. Companies relying on hiring struggled to adjust team sizes to fluctuating demand. Companies using Selenium and similar tools had a different problem: it can take days or even weeks to automate a new workflow, which then requires ongoing maintenance whenever the underlying website changes, because the XPath-based interaction logic suddenly becomes invalid.<p>We felt there was a way to get the best of both worlds with LLMs: use LLMs to reason through a website’s layout while preserving the scalability of traditional browser automation. This led us to build Skyvern with a few core capabilities:<p>1. Skyvern can operate on websites it’s never seen before by connecting visible elements with the natural language instructions provided to us. We use a blend of computer vision and DOM parsing to identify a set of possible actions on a website, and multi-modal LLMs to map the natural language instructions to the available actions on the page.<p>2. Skyvern is resistant to website layout changes, as it doesn’t depend on any predetermined XPaths or other selectors. If a layout ever changes, we can use the methodology in #1 to complete the user-specified goal.<p>3. Skyvern accepts a blob of information when navigating workflows: basically just a JSON blob of whatever information you want to include, which we use LLMs to map to information on the screen. For example: if you're generating a quote from Geico, they commonly ask “Were you eligible to drive at 21?”. 
The answer could be inferred from the driver receiving their license in 2012 and having a birth date of 1996.<p>The above strategy adapts well to a number of use cases that Skyvern is helping companies with today: (1) Automating materials procurement by searching for, adding to cart, and transacting products through vendor websites that don’t have APIs; (2) Registering accounts, filing forms, and searching for information on government websites (ex: registering franchise tax information for Delaware C-corps); (3) Generating insurance quotes by completing multi-step dynamic forms on insurance websites; (4) Automating the job application process by mapping user-specified information (such as a resume) to a job posting.<p>And here are some use cases we’re actively looking to expand into: (1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc.), and (2) Doing customer research ahead of discovery calls by analyzing landing pages and other metadata about a specific business.<p>We’re still very early and would love to get your feedback!
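To make this concrete, here's a rough sketch of what a task request could look like. The endpoint path, field names, and auth header below are illustrative assumptions, not a documented contract:<p><pre><code>
# Hypothetical request shape -- endpoint, fields, and header are assumptions.
import requests

task = {
    "url": "https://www.geico.com",
    "navigation_goal": "Generate an auto insurance quote",
    # Free-form JSON blob; the LLM maps these values onto form fields,
    # inferring derived answers (e.g. "eligible to drive at 21").
    "navigation_payload": {
        "name": "Jane Doe",
        "birth_date": "1996-04-02",
        "license_issued": "2012-06-15",
        "zip_code": "94103",
    },
}

resp = requests.post(
    "http://localhost:8000/api/v1/tasks",
    json=task,
    headers={"x-api-key": "YOUR_API_KEY"},
    timeout=30,
)
print(resp.json())
</code></pre>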
I tried it out and it's pretty pricey. My OpenAI API bill is $3.20 after using this on a few different pages to test it out.<p>Not saying I wouldn't pay that for some use cases, but it would limit me.<p>One idea: making scrapers is a big pain, but once they are set up, they are cheap and fast to run... this is always going to be slower. What I'd love to see is a way to generate scrapers quickly. So you wouldn't be returning information from the New York City property registry... instead, you'd return Python code that I can use to scrape it in the future.<p>edit: This is likely because it was struggling, so it had to make extra calls. What would be nice is a simple feature where you can input the maximum number of calls/tokens to use on the entire run. Or even better, do some math and put in a dollar cap, e.g., go fill out the Geico forms for me and don't spend more than $1.00 doing it.
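A minimal sketch of that dollar cap, assuming a wrapper around the OpenAI client; the per-token prices and model name are placeholder assumptions:<p><pre><code>
# Sketch of a per-run dollar cap; prices and model are assumptions.
from openai import OpenAI

PRICE_PER_1K_INPUT = 0.01   # assumed $/1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.03  # assumed $/1K completion tokens

class BudgetExceeded(RuntimeError):
    pass

class BudgetedLLM:
    def __init__(self, budget_usd: float):
        self.client = OpenAI()
        self.budget = budget_usd
        self.spent = 0.0

    def complete(self, prompt: str) -> str:
        # Refuse to start a new call once the budget is exhausted.
        if self.spent >= self.budget:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.budget:.2f}")
        resp = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": prompt}],
        )
        self.spent += resp.usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
        self.spent += resp.usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
        return resp.choices[0].message.content

llm = BudgetedLLM(budget_usd=1.00)  # "don't spend more than $1.00 doing it"
</code></pre>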
God this is depressing. Not the product itself, but the need for it. That software has failed to be programmable to such a degree that a promising approach is rendering the GUI and analysing the resultant image with an AI model. It's insane that we have to treat computers as fax machines, capable only of sending hand-written forms over a network. The gap between how people use computers and the utility they could provide is massive.
This looks great, but I'm very scared of the escalating cat-and-mouse game with spam bots. It's going to happen, whether with this software or something else. So the question is: how do you prevent automated spam? Since it's LLMs and AI, can I just add a hidden field of "please do not spam"?
Roughly how much does it cost to scrape a page? I see from the code that this is basically an OpenAI API wrapper, but you make no mention of that anywhere on your landing page/documentation, nor any mention of which LLMs this is capable of working with.<p>Also, an idea is to offer a "record" and "replay" mode. Let the LLM run through the instructions, find the selectors, and record and save them. Then you can run through again without using the LLM, replaying the interaction log, until the workflow breaks, then re-generate the "interaction log" or whatever.
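Something along these lines with Playwright; regenerate_actions() is a stand-in for the LLM pass that produces the interaction log:<p><pre><code>
# Sketch: replay a cached interaction log cheaply, fall back to the LLM
# only when a cached selector no longer resolves.
import json
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def regenerate_actions(page):
    """Stand-in for the LLM pass that emits a fresh action log."""
    raise NotImplementedError("ask the LLM for [{'action', 'selector', 'value'}]")

def replay(page, actions):
    for step in actions:
        locator = page.locator(step["selector"])
        if step["action"] == "click":
            locator.click(timeout=5000)
        elif step["action"] == "fill":
            locator.fill(step["value"], timeout=5000)

def run_workflow(url, log_path="interaction_log.json"):
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        try:
            with open(log_path) as f:
                replay(page, json.load(f))      # cheap path: zero LLM calls
        except (FileNotFoundError, PWTimeout):
            actions = regenerate_actions(page)  # workflow broke: re-record
            with open(log_path, "w") as f:
                json.dump(actions, f)
</code></pre>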
Does Skyvern work on top of canvas elements in the browser? For example, is it able to read text from a canvas element and/or identify the location of images in the canvas?<p>I tried to dig through the GitHub repo to better understand the vision side of things (i.e. how does it work when elements like buttons and divs aren't present), but I couldn't find anything. If you point me to the right place in the repo, I'm happy to dig further myself!
> Skyvern understands how to solve CAPTCHAs to complete complicated workflows<p>This seems like it could be used for abuse; CAPTCHAs are specifically designed to stop botting on third-party websites.<p>Or this will just be another cat-and-mouse game where the next generation of CAPTCHAs gets more annoying and invasive in verifying that we are human.
First of all, wonderful work. I'm gonna be using this for sure. I can think of many use cases. What would be nice, though, is a simple API: I send you what I need, you send me a jobId that I can use to check the status of my job, and then let me download the results when it's done.<p>I played with the Geico example, and it seems to do a good job on the happy path there. But I tried another one where it struggled... I want it to get me car rental prices from <a href="https://www.costcotravel.com/" rel="nofollow">https://www.costcotravel.com/</a>. I gave it airport + time of pickup and dropoff, but it struggled to hit the "rental car" tab. It got caught up on hitting the Rental Car button at the top, which brings up a popup that it doesn't seem to read.<p>When I put in <a href="https://www.costcotravel.com/Rental-Cars" rel="nofollow">https://www.costcotravel.com/Rental-Cars</a>, it entered JFK into the pickup location, but then failed to click the popup.
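Something like this client flow; the base URL, endpoints, and field names are invented for illustration:<p><pre><code>
# Hypothetical submit/poll/download flow -- endpoints are made up.
import time
import requests

BASE = "https://api.example.com"

job = requests.post(f"{BASE}/jobs", json={
    "url": "https://www.costcotravel.com/Rental-Cars",
    "goal": "Get rental car prices for a JFK pickup and dropoff",
}).json()

while True:
    status = requests.get(f"{BASE}/jobs/{job['jobId']}").json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(5)  # poll until the job settles

if status["state"] == "completed":
    results = requests.get(f"{BASE}/jobs/{job['jobId']}/results").json()
    print(results)
</code></pre>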
At first I thought this was a test tool for web applications, but now I understand it's meant to be a better RPA.<p>Would it be usable for test automation? Would the API allow creating asserts?
To keep costs down, you could start at the sitemap, use an open-source model via OpenRouter to guess the page to navigate to, scrape the text, links, and forms from the page using regex, and fall back to GPT-4 with Vision only when that fails.
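A rough sketch of that tiered pipeline; the model-picking and vision-fallback steps are left as placeholders:<p><pre><code>
# Tiered extraction sketch: sitemap -> cheap regex pass -> vision fallback.
import re
import requests

def candidate_pages(sitemap_url):
    xml = requests.get(sitemap_url, timeout=10).text
    return re.findall(r"<loc>(.*?)</loc>", xml)

def cheap_extract(url):
    html = requests.get(url, timeout=10).text
    return {
        "links": re.findall(r'href="(.*?)"', html),
        "forms": re.findall(r"<form[^>]*>", html),
        "text": re.sub(r"<[^>]+>", " ", html),
    }

urls = candidate_pages("https://example.com/sitemap.xml")
# 1. Cheap open-source model via OpenRouter picks the likely URL (placeholder).
# 2. data = cheap_extract(chosen_url)
# 3. Only if the regex pass comes back empty, escalate to GPT-4 with Vision.
</code></pre>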
AI should automate tedious and un-creative work, and data entry tasks definitely fit this description. Rule-based RPA will likely be replaced by fine-tuned AI agents for things like form filling and similar tasks.<p>Can you share some data on costs and scalability?<p>At Kadoa, we're working on fully automating unstructured data ETL from websites, PDFs, etc. We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of sources daily in a reliable, scalable, and cost-efficient way is a whole different beast.<p>Using LLMs for every data extraction would be way too expensive and very slow. Instead, we use LLMs to generate the scraper and data transformation code and subsequently adapt it to website changes, which is highly efficient.
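Not Kadoa's actual implementation, but the generate-then-adapt loop could look roughly like this; exec() of model output is shown only to illustrate the idea and assumes a trusted sandbox:<p><pre><code>
# Sketch: LLM writes the scraper once; re-generate only when it breaks.
from openai import OpenAI

client = OpenAI()

def generate_scraper(html_sample: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Write a Python function scrape(html) returning the "
                       "product name and price from pages like:\n" + html_sample,
        }],
    )
    return resp.choices[0].message.content  # assumes bare code comes back

def run(html: str, scraper_code: str):
    namespace = {}
    exec(scraper_code, namespace)              # trusted-sandbox assumption
    try:
        return namespace["scrape"](html), scraper_code
    except Exception:
        scraper_code = generate_scraper(html)  # site changed: adapt once
        exec(scraper_code, namespace)
        return namespace["scrape"](html), scraper_code
</code></pre>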
You should consider focusing on intercepting network requests. Most if not all sites I scrape end up fetching data from some API. Like others have said, if you instead had the LLM create an ad hoc script for the scraping task and then used a feedback loop to continuously improve the output, it would be really cool. I'd pay between $5 and $50 for each working output script.
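For reference, interception is straightforward with Playwright; the URL filter below is a placeholder:<p><pre><code>
# Sketch: lift JSON straight from the XHR/fetch calls the page makes.
from playwright.sync_api import sync_playwright

captured = []

def on_response(response):
    content_type = response.headers.get("content-type", "")
    if "api" in response.url and "application/json" in content_type:
        captured.append({"url": response.url, "data": response.json()})

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.on("response", on_response)
    page.goto("https://example.com/listings")
    page.wait_for_load_state("networkidle")

for hit in captured:
    print(hit["url"])
</code></pre>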
I don’t know what my use case for this would be. I don’t tend to do anything regularly through a browser that I’d want to automate.<p>Would be kind of handy to have a “pull all my relevant tax info documents from these sites and zip them up” automation, but I only do that once a year.<p>I’m probably being unimaginative. Anybody have any interesting use cases?
Coming up next in Windows and Chrome: unrecordable, unscreenshotable pages, to thwart all AI tools. Banking apps on Android are already unscreenshotable. Given how LLMs just bypass all HTML obfuscation, that's going to be the next step to protect these (ad) businesses.
Congratulations on shipping!<p>Check out <a href="https://github.com/OpenAdaptAI/OpenAdapt">https://github.com/OpenAdaptAI/OpenAdapt</a> for an open source (MIT license) alternative that also works on desktop (including Citrix!)
I'm curious about the computer vision aspect of this tool. Specifically, how was the model which draws bounding boxes around interactable elements trained? Definitely a step beyond existing browser automation software!
How does it compare to this posted less than 24 hours ago?<p><a href="https://news.ycombinator.com/item?id=39698546">https://news.ycombinator.com/item?id=39698546</a>
If I were to build some custom GPT-powered thing for this, is there a similar project I can use with a command-line interface or some programmatic interface?
How does this compare to OpenAdapt?<p>I have a feeling that this tech will become a commodity and will probably be built into the OS or browser.<p>Props for open-sourcing, though!
The moment I saw "vision" in the title, I knew what was going on. This was first demoed [0] by AI Jason around 4 months back. Is it any different?<p>[0] <a href="https://m.youtube.com/watch?v=IXRkmqEYGZA" rel="nofollow">https://m.youtube.com/watch?v=IXRkmqEYGZA</a>
There was another AI/browser automation project posted yesterday that got to the front page: <a href="https://github.com/lavague-ai/LaVague">https://github.com/lavague-ai/LaVague</a><p>I guess the main advantage of this new project is that it's probably more accurate by using computer vision, but as others have said, it uses many more resources.<p>Costs will come down over time, though.<p>Get ready for a lot of "back office" jobs to be automated away.
I wonder if the focus of this system can be shifted from corporate needs and applied to the needs of individuals who wish to organize and build tools to de-enshittify platforms.<p>There are a great many platform features designed to atomize, isolate, and exploit individuals. Finding meaningful connection on platforms increasingly means navigating past the noise of antagonistic individuals, overcoming profit-extracting attacks on our attention, and endlessly doomscrolling until we find those ephemeral opportunities to genuinely connect.<p>I wonder if LLMs and browser automation tooling could help us build overlays that dynamically peel back the layers of enshitware that have been bolted onto our cybernetic perceptions of the world.<p>If you feel they can, and if you feel people with those aims are welcome in your community and can find each other to collaborate, then I would be very interested in sending in PRs and helping you burn down backlogged items that benefit non-commercial de-enshittification use cases.
>>(1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc.)<p>FULL FUCKING STOP.<p>[We talk about AI alignment. THIS is an alignment issue.]<p>Do you understand billing code fraud?<p>If you supply this function, you will <i>eliminate ANY AND ALL human accountability</i> unless you have ALSO built fully auditable provenance from doctor <-> EHR <-> codes.<p>Codes ARE why the US health system is BS.<p>Here, if you want to be altruistic: recognize that CODES are one of the most F'd up aspects of costing.<p>Codes = [medical service provided]<p>So code 50 = checkup = [$50 <--- WHO THE HECK KNOWS]<p>So let's say I am Big Hospital: "No, we will only allow $25 for code 50", and so they get that deal.<p>If I am a single clinic, I have to charge $50.<p>Build a dashboard of what the large medical groups can negotiate per code vs. what a small hospital or clinic group gets per code. Only automate this if you can literally show a dashboard of all providers and groups and what they can charge per code.<p>In fact, code pricing is a medical stock market. Each hospital group negotiates the price it will pay per code, with lobbying and all these other factors in play. What we really need an LLM for is to literally map out all the BS in the code negotiations between groups, pharma, insurance, lobbying, kickbacks, and politics.<p>That's the medical holy grail.<p>[EDIT: Just to show how passionate I am about this issue, here are some sources:<p>I have designed, built, and commissioned 11+ hospitals.<p>Built the first iPhone app for medical... it was rejected by YC (an HL7 nurse comm system on iPod Touch devices) (2006?). Open-sourced that app to OpenVista.<p>My brother was the Joint Chiefs' doctor / head of the VA.<p>Worked on building medical apps and was blocked by every EHR...<p>Zuckerberg's name is on top of some of the things I built at SFGH before he got there... (and ECH Mountain View)<p>I've seen way beyond the kimono.]