Hey HN, we're building Skyvern (<a href="https://www.skyvern.com">https://www.skyvern.com</a>), an open-source tool that uses LLMs and computer vision to help companies automate browser-based workflows. You can see some examples here: <a href="https://github.com/Skyvern-AI/skyvern#real-world-examples-of-skyvern">https://github.com/Skyvern-AI/skyvern#real-world-examples-of...</a> and there's a demo video at <a href="https://github.com/Skyvern-AI/skyvern#demo">https://github.com/Skyvern-AI/skyvern#demo</a>, along with some instructions on running it locally.<p>We provide a natural-language API to automate repetitive manual workflows that happen within companies' back offices. You can check out our code and play with Skyvern here: <a href="https://github.com/Skyvern-AI/Skyvern">https://github.com/Skyvern-AI/Skyvern</a><p>We talked to hundreds of companies about things they do in the background and found that most of them depend on repetitive manual workflows. The breadth of these workflows surprised us – most companies started off doing things manually, and eventually either hired people to scale the manual work or wrote scripts using Selenium-like browser automation libraries.<p>In these conversations, one common point stood out: scaling is a pain either way. Companies relying on hiring struggled to adjust team sizes to fluctuating demand. Companies using Selenium and similar tools had a different problem: it can take days or even weeks to automate a new workflow, which then requires ongoing maintenance whenever the underlying website changes, because the XPath-based interaction logic suddenly becomes invalid.<p>We felt there was a way to get the best of both worlds with LLMs: use LLMs to reason through a website’s layout while preserving the scalability of traditional browser automation. This led us to build Skyvern with a few core capabilities:<p>1. Skyvern can operate on websites it’s never seen before by connecting visible elements with the natural language instructions provided to us. We use a blend of computer vision and DOM parsing to identify a set of possible actions on a website, and multi-modal LLMs to map the natural language instructions to the available actions on the page.<p>2. Skyvern is resistant to website layout changes, as it doesn’t depend on any predetermined XPaths or other selectors. If a layout ever changes, we can use the methodology in #1 to complete the user-specified goal.<p>3. Skyvern accepts a blob of information when navigating workflows: basically just a JSON blob of whatever information you want to include, which we use LLMs to map to information on the screen. For example: if you're generating a quote from Geico, they commonly ask “Were you eligible to drive at 21?”. 
The answer could be inferred from the driver receiving their license in 2012 and having a birth date of 1996.<p>The above strategy adapts well to a number of use cases that Skyvern is helping companies with today: (1) Automating materials procurement by searching for, adding to cart, and transacting products through vendor websites that don’t have APIs; (2) Registering accounts, filing forms, and searching for information on government websites (ex: registering franchise tax information for Delaware C-corps); (3) Generating insurance quotes by completing multi-step dynamic forms on insurance websites; (4) Automating the job application process by mapping user-specified information (such as a resume) to a job posting.<p>And here are some use cases we’re actively looking to expand into: (1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc.), and (2) Doing customer research ahead of discovery calls by analyzing landing pages and other metadata about a specific business.<p>We’re still very early and would love to get your feedback!
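To make this concrete, here's a rough sketch of what a task request could look like. The endpoint path, field names, and auth header below are illustrative assumptions, not a documented contract:<p><pre><code>
# Hypothetical request shape -- endpoint, fields, and header are assumptions.
import requests

task = {
    "url": "https://www.geico.com",
    "navigation_goal": "Generate an auto insurance quote",
    # Free-form JSON blob; the LLM maps these values onto form fields,
    # inferring derived answers (e.g. "eligible to drive at 21").
    "navigation_payload": {
        "name": "Jane Doe",
        "birth_date": "1996-04-02",
        "license_issued": "2012-06-15",
        "zip_code": "94103",
    },
}

resp = requests.post(
    "http://localhost:8000/api/v1/tasks",
    json=task,
    headers={"x-api-key": "YOUR_API_KEY"},
    timeout=30,
)
print(resp.json())
</code></pre>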
I tried it out and it's pretty pricey. My OpenAI API bill is $3.20 after using this on a few different pages to test it out.<p>Not saying I wouldn't pay that for some use cases, but it would limit me.<p>One idea: making scrapers is a big pain, but once they are set up, they are cheap and fast to run... this is always going to be slower. What I'd love to see is a way to generate scrapers quickly. So you wouldn't be returning information from the New York City property registry... instead, you'd return Python code that I can use to scrape it in the future.<p>edit: This is likely because it was struggling, so it had to make extra calls. What would be nice is a simple feature where you can input the maximum number of calls/tokens to use on the entire run. Or even better, do some math and put in a dollar cap, e.g., go fill out the Geico forms for me and don't spend more than $1.00 doing it.
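A minimal sketch of that dollar cap, assuming a wrapper around the OpenAI client; the per-token prices and model name are placeholder assumptions:<p><pre><code>
# Sketch of a per-run dollar cap; prices and model are assumptions.
from openai import OpenAI

PRICE_PER_1K_INPUT = 0.01   # assumed $/1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.03  # assumed $/1K completion tokens

class BudgetExceeded(RuntimeError):
    pass

class BudgetedLLM:
    def __init__(self, budget_usd: float):
        self.client = OpenAI()
        self.budget = budget_usd
        self.spent = 0.0

    def complete(self, prompt: str) -> str:
        # Refuse to start a new call once the budget is exhausted.
        if self.spent >= self.budget:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.budget:.2f}")
        resp = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": prompt}],
        )
        self.spent += resp.usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
        self.spent += resp.usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
        return resp.choices[0].message.content

llm = BudgetedLLM(budget_usd=1.00)  # "don't spend more than $1.00 doing it"
</code></pre>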
God this is depressing. Not the product itself, but the need for it. That software has failed to be programmable to such a degree that a promising approach is rendering the GUI and analysing the resultant image with an AI model. It's insane that we have to treat computers as fax machines, capable only of sending hand-written forms over a network. The gap between how people use computers and the utility they could provide is massive.
This looks great, but I'm very scared of the escalating cat-and-mouse game with spam bots. It's going to happen, whether with this software or something else. So the question is: how do you prevent automated spam? Since it's LLMs and AI, can I just add a hidden field of "please do not spam"?
Roughly how much does it cost to scrape a page? I see from the code that this is basically an OpenAI API wrapper, but you make no mention of that anywhere on your landing page/documentation, nor any mention of which LLMs this is capable of working with.<p>Also, an idea is to offer a "record" and "replay" mode. Let the LLM run through the instructions, find the selectors, and record and save them. Then you can run through again without using the LLM, replaying the interaction log, until the workflow breaks, then re-generate the "interaction log" or whatever.
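Something along these lines with Playwright; regenerate_actions() is a stand-in for the LLM pass that produces the interaction log:<p><pre><code>
# Sketch: replay a cached interaction log cheaply, fall back to the LLM
# only when a cached selector no longer resolves.
import json
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def regenerate_actions(page):
    """Stand-in for the LLM pass that emits a fresh action log."""
    raise NotImplementedError("ask the LLM for [{'action', 'selector', 'value'}]")

def replay(page, actions):
    for step in actions:
        locator = page.locator(step["selector"])
        if step["action"] == "click":
            locator.click(timeout=5000)
        elif step["action"] == "fill":
            locator.fill(step["value"], timeout=5000)

def run_workflow(url, log_path="interaction_log.json"):
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        try:
            with open(log_path) as f:
                replay(page, json.load(f))      # cheap path: zero LLM calls
        except (FileNotFoundError, PWTimeout):
            actions = regenerate_actions(page)  # workflow broke: re-record
            with open(log_path, "w") as f:
                json.dump(actions, f)
</code></pre>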
Does Skyvern work on top of canvas elements in the browser? For example, is it able to read text from a canvas element and/or identify the location of images in the canvas?<p>I tried to dig through the GitHub repo to better understand the vision side of things (i.e. how does it work when elements like buttons and divs aren't present), but I couldn't find anything. If you point me to the right place in the repo, I'm happy to dig further myself!
> Skyvern understands how to solve CAPTCHAs to complete complicated workflows<p>This seems like it could be used for abuse; CAPTCHAs are specifically designed to stop botting on third-party websites.<p>Or this will just be another cat-and-mouse game where the next generation of CAPTCHAs gets more annoying and invasive in verifying that we are human.
First of all, wonderful work. I'm gonna be using this for sure. I can think of many use cases. What would be nice, though, is a simple API: I send you what I need, you send me a jobId that I can use to check the status of my job, and then let me download the results when it's done.<p>I played with the Geico example, and it seems to do a good job on the happy path there. But I tried another one where it struggled... I want it to get me car rental prices from <a href="https://www.costcotravel.com/" rel="nofollow">https://www.costcotravel.com/</a>. I gave it airport + time of pickup and dropoff, but it struggled to hit the "rental car" tab. It got caught up on hitting the Rental Car button at the top, which brings up a popup that it doesn't seem to read.<p>When I put in <a href="https://www.costcotravel.com/Rental-Cars" rel="nofollow">https://www.costcotravel.com/Rental-Cars</a>, it entered JFK into the pickup location, but then failed to click the popup.
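Something like this client flow; the base URL, endpoints, and field names are invented for illustration:<p><pre><code>
# Hypothetical submit/poll/download flow -- endpoints are made up.
import time
import requests

BASE = "https://api.example.com"

job = requests.post(f"{BASE}/jobs", json={
    "url": "https://www.costcotravel.com/Rental-Cars",
    "goal": "Get rental car prices for a JFK pickup and dropoff",
}).json()

while True:
    status = requests.get(f"{BASE}/jobs/{job['jobId']}").json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(5)  # poll until the job settles

if status["state"] == "completed":
    results = requests.get(f"{BASE}/jobs/{job['jobId']}/results").json()
    print(results)
</code></pre>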
At first I thought this was a test tool for web applications, but now I understand it's meant to be a better RPA.<p>Would it be usable for test automation? Would the API allow creating asserts?
To keep costs down, you could start at the sitemap, use an open-source model via OpenRouter to guess the page to navigate to, scrape the text, links, and forms from the page using regex, and fall back to GPT-4 with Vision only when that fails.
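A rough sketch of that tiered pipeline; the model-picking and vision-fallback steps are left as placeholders:<p><pre><code>
# Tiered extraction sketch: sitemap -> cheap regex pass -> vision fallback.
import re
import requests

def candidate_pages(sitemap_url):
    xml = requests.get(sitemap_url, timeout=10).text
    return re.findall(r"<loc>(.*?)</loc>", xml)

def cheap_extract(url):
    html = requests.get(url, timeout=10).text
    return {
        "links": re.findall(r'href="(.*?)"', html),
        "forms": re.findall(r"<form[^>]*>", html),
        "text": re.sub(r"<[^>]+>", " ", html),
    }

urls = candidate_pages("https://example.com/sitemap.xml")
# 1. Cheap open-source model via OpenRouter picks the likely URL (placeholder).
# 2. data = cheap_extract(chosen_url)
# 3. Only if the regex pass comes back empty, escalate to GPT-4 with Vision.
</code></pre>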
AI should automate tedious and un-creative work, and data entry tasks definitely fit this description. Rule-based RPA will likely be replaced by fine-tuned AI agents for things like form filling and similar tasks.<p>Can you share some data on costs and scalability?<p>At Kadoa, we're working on fully automating unstructured data ETL from websites, PDFs, etc. We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of sources daily in a reliable, scalable, and cost-efficient way is a whole different beast.<p>Using LLMs for every data extraction would be way too expensive and very slow. Instead, we use LLMs to generate the scraper and data transformation code and subsequently adapt it to website changes, which is highly efficient.
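Not Kadoa's actual implementation, but the generate-then-adapt loop could look roughly like this; exec() of model output is shown only to illustrate the idea and assumes a trusted sandbox:<p><pre><code>
# Sketch: LLM writes the scraper once; re-generate only when it breaks.
from openai import OpenAI

client = OpenAI()

def generate_scraper(html_sample: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Write a Python function scrape(html) returning the "
                       "product name and price from pages like:\n" + html_sample,
        }],
    )
    return resp.choices[0].message.content  # assumes bare code comes back

def run(html: str, scraper_code: str):
    namespace = {}
    exec(scraper_code, namespace)              # trusted-sandbox assumption
    try:
        return namespace["scrape"](html), scraper_code
    except Exception:
        scraper_code = generate_scraper(html)  # site changed: adapt once
        exec(scraper_code, namespace)
        return namespace["scrape"](html), scraper_code
</code></pre>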
You should consider focusing on intercepting network requests. Most if not all sites I scrape end up fetching data from some API. Like others have said, if you instead had the LLM create an ad hoc script for the scraping task and then used a feedback loop to continuously improve the output, it would be really cool. I'd pay between $5 and $50 for each working output script.
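For reference, interception is straightforward with Playwright; the URL filter below is a placeholder:<p><pre><code>
# Sketch: lift JSON straight from the XHR/fetch calls the page makes.
from playwright.sync_api import sync_playwright

captured = []

def on_response(response):
    content_type = response.headers.get("content-type", "")
    if "api" in response.url and "application/json" in content_type:
        captured.append({"url": response.url, "data": response.json()})

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.on("response", on_response)
    page.goto("https://example.com/listings")
    page.wait_for_load_state("networkidle")

for hit in captured:
    print(hit["url"])
</code></pre>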
I don’t know what my use case for this would be. I don’t tend to do anything regularly through a browser that I’d want to automate.<p>Would be kind of handy to have a “pull all my relevant tax info documents from these sites and zip them up” automation, but I only do that once a year.<p>I’m probably being unimaginative. Anybody have any interesting use cases?
Coming up next in Windows and Chrome: unrecordable, unscreenshotable pages, to thwart all AI tools. Banking apps on Android are already unscreenshotable. Given how LLMs just bypass all HTML obfuscation, that's going to be the next step to protect these (ad) businesses.
Congratulations on shipping!<p>Check out <a href="https://github.com/OpenAdaptAI/OpenAdapt">https://github.com/OpenAdaptAI/OpenAdapt</a> for an open source (MIT license) alternative that also works on desktop (including Citrix!)
I'm curious about the computer vision aspect of this tool. Specifically, how was the model which draws bounding boxes around interactable elements trained? Definitely a step beyond existing browser automation software!
How does it compare to this posted less than 24 hours ago?<p><a href="https://news.ycombinator.com/item?id=39698546">https://news.ycombinator.com/item?id=39698546</a>
If I were to build some custom GPT-powered thing for this, is there a similar project I can use with a command-line interface or some programmatic interface?
How does this compare to OpenAdapt?<p>I have a feeling that this tech will become a commodity and will probably be built into the OS or browser.<p>Props for open-sourcing, though!
The moment I saw "vision" in the title, I knew what was going on. This was first demoed [0] by AI Jason around 4 months back. Is it any different?<p>[0] <a href="https://m.youtube.com/watch?v=IXRkmqEYGZA" rel="nofollow">https://m.youtube.com/watch?v=IXRkmqEYGZA</a>
There was another AI/browser automation project posted yesterday that got to the front page: <a href="https://github.com/lavague-ai/LaVague">https://github.com/lavague-ai/LaVague</a><p>I guess the main advantage of this new project is that it's probably more accurate by using computer vision, but as others have said, it uses many more resources.<p>Costs will come down over time, though.<p>Get ready for a lot of "back office" jobs to be automated away.
I wonder if the focus of this system can be shifted from corporate needs and applied to the needs of individuals who wish to organize and build tools to de-enshittify platforms.<p>There are a great many platform features designed to atomize, isolate, and exploit individuals. Finding meaningful connection on platforms increasingly means navigating past the noise of antagonistic individuals, overcoming profit-extracting attacks on our attention, and endlessly doomscrolling until we find those ephemeral opportunities to genuinely connect.<p>I wonder if LLMs and browser automation tooling could help us build overlays that dynamically peel back the layers of enshitware that have been bolted onto our cybernetic perceptions of the world.<p>If you feel they can, and if you feel people with those aims are welcome in your community and can find each other to collaborate, then I would be very interested in sending in PRs and helping you burn down backlogged items that benefit non-commercial de-enshittification use cases.
>>(1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc.)<p>FULL FUCKING STOP.<p>[We talk about AI alignment. THIS is an alignment issue.]<p>Do you understand billing code fraud?<p>If you supply this function, you will <i>eliminate ANY AND ALL human accountability</i> unless you have ALSO built fully auditable provenance from doctor <-> EHR <-> codes.<p>Codes ARE why the US health system is BS.<p>Here, if you want to be altruistic: recognize that CODES are one of the most F'd up aspects of costing.<p>Codes = [medical service provided]<p>So code 50 = checkup = [$50 <--- WHO THE HECK KNOWS]<p>So let's say I am Big Hospital: "No, we will only allow $25 for code 50", and so they get that deal.<p>If I am a single clinic, I have to charge $50.<p>Build a dashboard of what the large medical groups can negotiate per code vs. what a small hospital or clinic group gets per code. Only automate this if you can literally show a dashboard of all providers and groups and what they can charge per code.<p>In fact, code pricing is a medical stock market. Each hospital group negotiates the price it will pay per code, with lobbying and all these other factors in play. What we really need an LLM for is to literally map out all the BS in the code negotiations between groups, pharma, insurance, lobbying, kickbacks, and politics.<p>That's the medical holy grail.<p>[EDIT: Just to show how passionate I am about this issue, here are some sources:<p>I have designed, built, and commissioned 11+ hospitals.<p>Built the first iPhone app for medical... it was rejected by YC (an HL7 nurse comm system on iPod Touch devices) (2006?). Open-sourced that app to OpenVista.<p>My brother was the Joint Chiefs' doctor / head of the VA.<p>Worked on building medical apps and was blocked by every EHR...<p>Zuckerberg's name is on top of some of the things I built at SFGH before he got there... (and ECH Mountain View)<p>I've seen way beyond the kimono.]