
Show HN: Tarsier – Vision utilities for web interaction agents

192 points by KhoomeiK, about 1 year ago
Hey HN! I built a tool that gives LLMs the ability to understand the visual structure of a webpage even if they don't accept image input. We've found that unimodal GPT-4 + Tarsier's textual webpage representation consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%, probably because multimodal LLMs still aren't as performant as they're hyped to be.

Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.

Here's the Tarsier pipeline in a nutshell:

1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot

2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)

3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict

Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).

By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems! https://reworkd.ai/careers
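The three pipeline steps above map onto a fairly standard browser-automation loop. Below is a minimal, illustrative sketch of that loop using Playwright; the tagging script, the `data-agent-id` attribute, and the selectors are assumptions made for illustration, not Tarsier's actual API.

```python
# Illustrative sketch of the "tag -> screenshot -> ID-to-XPath" loop described
# above. Requires Playwright (pip install playwright && playwright install).
# The tagging script and attribute name are hypothetical, not Tarsier's API.
from playwright.sync_api import sync_playwright

TAG_JS = """
() => {
  const xpathOf = (el) => {
    const parts = [];
    for (; el && el.nodeType === 1; el = el.parentNode) {
      let idx = 1;
      for (let sib = el.previousElementSibling; sib; sib = sib.previousElementSibling) {
        if (sib.tagName === el.tagName) idx++;
      }
      parts.unshift(el.tagName.toLowerCase() + '[' + idx + ']');
    }
    return '/' + parts.join('/');
  };
  const mapping = {};
  document.querySelectorAll('a, button, input, select, textarea').forEach((el, i) => {
    el.setAttribute('data-agent-id', String(i));  // step 1: visible ID for the LLM
    mapping[i] = xpathOf(el);                     // remember where that ID lives
  });
  return mapping;
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://news.ycombinator.com")

    # Step 1: tag interactable elements and grab a full-page screenshot.
    id_to_xpath = {int(k): v for k, v in page.evaluate(TAG_JS).items()}
    page.screenshot(path="page.png", full_page=True)

    # Step 2 (not shown): OCR page.png into whitespace-structured text and send
    # it to a text-only LLM together with the tagged IDs.

    # Step 3: map the LLM's chosen ID back to a live element and act on it.
    chosen_id = 0  # pretend the LLM asked to click element 0
    page.locator(f"xpath={id_to_xpath[chosen_id]}").click()
    browser.close()
```

Tarsier's real implementation adds the OCR-to-whitespace step and more careful element discovery; the point of the sketch is only the ID-to-XPath round trip.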

16 comments

bckmn, about 1 year ago
Reminds me of [Language as Intermediate Representation](https://chrisvoncsefalvay.com/posts/lair/) - LLMs are optimized for language, so translate an image into language and they'll do better at modeling it.
abrichr, about 1 year ago
Congratulations on shipping!

In https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py we use FastSAM to first segment the UI elements, then have the LLM describe each segment individually. This seems to work quite well; see https://twitter.com/OpenAdaptAI/status/1789430587314336212 for a demo.

More coming soon!
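For readers unfamiliar with that segment-then-describe pattern, a rough sketch of the first half is shown below, assuming the ultralytics FastSAM wrapper; this is not OpenAdapt's actual code, and the file names are placeholders.

```python
# Rough sketch of "segment the UI, then describe each segment": FastSAM proposes
# regions in a screenshot, and each cropped region is saved so an LLM can
# describe it individually. Assumes the ultralytics FastSAM wrapper
# (pip install ultralytics) and a local screenshot.png.
from PIL import Image
from ultralytics import FastSAM

model = FastSAM("FastSAM-s.pt")              # downloads weights on first use
results = model("screenshot.png", imgsz=1024, conf=0.4, iou=0.9)

image = Image.open("screenshot.png")
for i, box in enumerate(results[0].boxes.xyxy.tolist()):
    x1, y1, x2, y2 = map(int, box)
    crop = image.crop((x1, y1, x2, y2))
    crop.save(f"segment_{i}.png")            # each crop gets its own LLM description
```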
davedx, about 1 year ago
How do you make sure the tagging of elements is robust? With regular browser automation it's quite hard to write selectors that keep working after webpages get updated; when writing E2E tests, teams often end up putting [data] attributes into the elements to aid with selection. Using a numerical identifier seems quite fragile.
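For context, the data-attribute convention the comment refers to looks roughly like this (a generic Playwright sketch with made-up markup, unrelated to Tarsier's internals):

```python
# Contrast between a deliberately-stamped data attribute and a positional handle,
# using a tiny inline page so the example is self-contained (Playwright, headless).
from playwright.sync_api import sync_playwright

HTML = '<button data-testid="checkout" data-agent-id="17">Buy now</button>'

with sync_playwright() as p:
    page = p.chromium.launch(headless=True).new_page()
    page.set_content(HTML)

    # Robust: the attribute is part of the markup, so it survives layout changes.
    page.locator('[data-testid="checkout"]').click()

    # Fragile: a number assigned at tagging time; re-tagging after the page
    # changes can hand the same number to a different element.
    page.locator('[data-agent-id="17"]').click()
```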
dbish, about 1 year ago
Very cool. We do something similar by combining OCR with accessibility data and other data (speech recognition et al.) for desktop-based screen-sharing understanding, but evaluation compared to multimodal LLMs has not been easy. How are you evaluating to come up with the number "consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%"?

FWIW, so far we've seen that Azure has the best OCR for screenshot-type data across the proprietary and open-source models, though we are far more focused on grabbing data from desktop applications than web pages, so YMMV.
pk19238, about 1 year ago
This is such a creative solution. Reminds me of how a team rendered Wolfenstein into ASCII characters and fine-tuned Mistral to successfully play it.
shodai80, about 1 year ago
How do you know, for a specific web element, what label it is associated with for a textbox or select?

For instance, I might want to tag where elements are, as you did, but I quite often still need an association with a label to determine what the actual context of the textbox or select is.
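In plain DOM terms, the label association being asked about can often be recovered from `for`/`id` pairs, a wrapping `<label>`, or ARIA attributes. A rough, illustrative sketch of those heuristics follows (not part of Tarsier; the `#email` selector in the usage line is hypothetical):

```python
# Heuristics for recovering the label text associated with an input or select.
# Runs in the page via Playwright's evaluate(); purely illustrative.
LABEL_JS = """
(el) => {
  // 1. <label for="..."> pointing at the element's id
  if (el.id) {
    const lab = document.querySelector('label[for="' + el.id + '"]');
    if (lab) return lab.innerText.trim();
  }
  // 2. a wrapping <label> element
  const wrap = el.closest('label');
  if (wrap) return wrap.innerText.trim();
  // 3. ARIA attributes
  if (el.getAttribute('aria-label')) return el.getAttribute('aria-label');
  const ref = el.getAttribute('aria-labelledby');
  if (ref) {
    const target = document.getElementById(ref.split(' ')[0]);
    if (target) return target.innerText.trim();
  }
  return null;
}
"""

# Usage with an existing Playwright page object (hypothetical selector):
# label = page.locator("#email").evaluate(LABEL_JS)
```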
reidbarber, about 1 year ago
Neat! Been building something similar to the tagging feature in TypeScript: https://github.com/reidbarber/webmarker

The Python API on this is really nice though.
wyclif, about 1 year ago
Hey! I'm actually in the Philippines now, and I've spent a lot of time on the island of Bohol, which has the world's greatest concentration of tarsiers. In fact, my wife and I visited the Tarsier Wildlife Sanctuary there, the world's main tarsier sanctuary. So I was instantly intrigued by the name of the app.

https://flickr.com/photos/wyclif/3271137617/in/album-72157613440681039/
savy91, about 1 year ago
Am I wrong in thinking this could very well be the backbone of an alternative to the Rabbit AI, where you basically end up with a possibly infinite set of tools for your LLM assistant to use to reach a goal, without having to build API integrations?
shekhar101, about 1 year ago
Tangential - I just want a decent (financial transaction) table-to-text conversion that can retain the table structure well enough (e.g. merged cells). I've tried everything under the sun short of fine-tuning my own model, including all the multimodal LLMs, and none of them work very well without a lot of case-by-case prompt engineering. Can this help? How can I set it up with a large number of PDFs that are sorted by type and extract tabular information? Any other suggestions?
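Tarsier itself targets live web pages rather than PDFs, but for the batch "PDFs sorted by type" part of the question, a non-LLM baseline like pdfplumber is a common starting point. A hedged sketch, with a hypothetical directory layout:

```python
# Batch table extraction from PDFs sorted into per-type directories.
# Uses pdfplumber (pip install pdfplumber); merged cells often come back as
# None/empty strings, so downstream cleanup is usually still needed.
import csv
from pathlib import Path

import pdfplumber

ROOT = Path("pdfs")          # hypothetical layout: pdfs/<doc_type>/<file>.pdf
OUT = Path("tables")
OUT.mkdir(exist_ok=True)

for pdf_path in ROOT.rglob("*.pdf"):
    doc_type = pdf_path.parent.name
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for t_no, table in enumerate(page.extract_tables(), start=1):
                out_file = OUT / f"{doc_type}_{pdf_path.stem}_p{page_no}_t{t_no}.csv"
                with out_file.open("w", newline="") as f:
                    csv.writer(f).writerows(table)
```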
bravura, about 1 year ago
A few questions:

Does this work in headless mode?

Are you getting a screenshot of the whole webpage, including scrolling, or just the visible part? The whole page, like singlepage.js, would be great and is much more useful in many circumstances, although I'm not sure how to handle infinite scrolling. (If not, clean, simple APIs for scrolling that don't require fiddling and experimentation would be great.)

Instead of Google OCR (the only OCR), what about Apple's native OCR? That would be amazing.
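On the headless and whole-page questions: in Playwright, which tooling like this typically drives, both are standard options, so something along these lines is possible regardless of what Tarsier exposes directly (a minimal sketch, not Tarsier's API):

```python
# Full-page screenshot from a headless browser with Playwright; scrolling is
# handled by full_page=True, though infinitely-scrolling feeds still need
# explicit scroll-and-wait logic before capturing.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    page.screenshot(path="full_page.png", full_page=True)
    browser.close()
```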
jumploops, about 1 year ago
How does the performance compare to VimGPT[0]?

I assume the screenshot-based approach is similar, whereas the text approach should be improved?

Very cool either way!

[0] https://github.com/ishan0102/vimGPT
esha_manideep, about 1 year ago
Great work, guys! How did you benchmark Tarsier as 10-20% better? Would love to see exactly how each method scored.
v3ss0n, about 1 year ago
Since it is just a wrapper around Google's hosted API, it can't be run as fully open source locally.
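For a fully local setup, the OCR step in a pipeline like this can in principle be swapped for an on-device engine such as Tesseract, which returns the word-level coordinates a whitespace-structuring step needs; whether Tarsier exposes such a hook is not confirmed here, so treat this as a standalone sketch:

```python
# Local OCR with word-level bounding boxes via Tesseract; requires the tesseract
# binary plus pytesseract and Pillow (pip install pytesseract pillow).
import pytesseract
from PIL import Image

image = Image.open("page.png")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Each detected word comes with left/top/width/height coordinates that a
# layout-to-text step could use to rebuild whitespace structure.
for text, left, top in zip(data["text"], data["left"], data["top"]):
    if text.strip():
        print(f"{text!r} at ({left}, {top})")
```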
jadbox, about 1 year ago
Anything like this for Node.js? (This is Python.)
jackienotchan, about 1 year ago
Why was the Show HN text removed? Too much self promotion? You're a YC company, so I'm surprised the mods would do that.

https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=true&query=tarsier&sort=byDate&type=story

> Hey HN! I built a tool that gives LLMs the ability to understand the visual structure of a webpage even if they don't accept image input. We've found that unimodal GPT-4 + Tarsier's textual webpage representation consistently beats multimodal GPT-4V/4o + webpage screenshot by 10-20%, probably because multimodal LLMs still aren't as performant as they're hyped to be. Over the course of experimenting with pruned HTML, accessibility trees, and other perception systems for web agents, we've iterated on Tarsier's components to maximize downstream agent/codegen performance.
>
> Here's the Tarsier pipeline in a nutshell:
>
> 1. tag interactable elements with IDs for the LLM to act upon & grab a full-sized webpage screenshot
>
> 2. for text-only LLMs, run OCR on the screenshot & convert it to whitespace-structured text (this is the coolest part imo)
>
> 3. map LLM intents back to actions on elements in the browser via an ID-to-XPath dict
>
> Humans interact with the web through visually-rendered pages, and agents should too. We run Tarsier in production for thousands of web data extraction agents a day at Reworkd (https://reworkd.ai).
>
> By the way, we're hiring backend/infra engineers with experience in compute-intensive distributed systems!
>
> reworkd.ai/careers