Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

49 points by suchintan 4 months ago

6 comments

happyopossum 4 months ago
Many of the examples given for agents such as this are things I just flat wouldn’t trust an LLM to do - buying something on Amazon for example: Will it pick new or ‘renewed’? Will it select an item that is from a janky looking vendor and may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?

This one example alone has so many branches that would require knowing what’s in my head.

On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?

Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.

That would be useful.
mkagenius 4 months ago
Steps pre-planned by the Planner will go wrong more often than not, because it tries to guess the UI layers from its memory/training data. It's better to just ask for the "next step" by giving it the current state of the UI.

I have built a similar project for mobile automation [1], and the validator phase is not separate; rather, it's inherently baked into each step, since we only ask for the next step based on the current UI and previous actions.

My Planner sometimes goes, "Oh, we are still on the home screen, let's find the Uber app icon". This sort of self-correcting behaviour was not programmed; the LLM does it on its own.

[1] https://github.com/BandarLabs/ClickClickClick - A framework to automate mobile use via any LLM (local/remote)
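The "ask only for the next step" loop described above can be sketched roughly as follows. This is a minimal illustration, not the ClickClickClick or Skyvern implementation: `Step`, `FakePhone`, `fake_llm_next_step`, and `run` are all hypothetical names, and the model call is replaced by a hard-coded function so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One decision from the model: an action name plus an optional target."""
    action: str            # "tap" or "done" in this toy example
    target: str = ""


@dataclass
class FakePhone:
    """Toy stand-in for a device (assumption: a real one would return a
    screenshot or UI tree from observe())."""
    screen: str = "home"
    ignore_taps: int = 0   # number of taps that are silently dropped

    def observe(self) -> str:
        return self.screen

    def execute(self, step: Step) -> None:
        if step.action != "tap":
            return
        if self.ignore_taps > 0:
            self.ignore_taps -= 1   # simulate a tap that did not register
            return
        if self.screen == "home" and step.target == "uber_icon":
            self.screen = "uber"


def fake_llm_next_step(goal: str, ui_state: str, history: List[Step]) -> Step:
    """Hypothetical stand-in for an LLM call that sees the goal, the current
    UI state, and the previous actions, and returns only the next step."""
    if ui_state == "home":
        return Step("tap", "uber_icon")   # "still on home screen, find the icon"
    return Step("done")


def run(goal: str, device: FakePhone,
        next_step: Callable[[str, str, List[Step]], Step],
        max_steps: int = 10) -> List[Step]:
    """Observe, ask for one step, execute, repeat. There is no up-front plan
    and no separate validator: each new observation checks the last action."""
    history: List[Step] = []
    for _ in range(max_steps):
        ui_state = device.observe()
        step = next_step(goal, ui_state, history)
        history.append(step)
        if step.action == "done":
            break
        device.execute(step)
    return history


# The first tap is dropped, so the loop sees "home" again and simply retries.
phone = FakePhone(ignore_taps=1)
history = run("open Uber", phone, fake_llm_next_step)
print([s.action for s in history], phone.screen)
```

Because the state is re-observed before every decision, a failed action shows up as an unchanged screen on the next iteration, which is where the self-correcting behaviour in this sketch comes from.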
lyime 4 months ago
This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.
wejick 4 months ago
UI is the most common interface, but it's not particularly AI-friendly. I'll wait for a more standardized interface that's both human- and AI-friendly. Hoping it will still be browser-based.
skull8888888 4 months ago
Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager seems to be outdated; there's definitely a need for a new, harder benchmark.
govindsb 4 months ago
Congrats, Suchintan! Huge achievement!