Using GPT-4 Vision with Vimium to browse the web

437 points by wvoch235, over 1 year ago

33 comments

e12e, over 1 year ago
It's insane that this is now possible:

https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710e6723ef891f70506f64e9/vision.py#L35

> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."
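For illustration, here is a minimal sketch of how a caller might validate a reply that follows the prompt quoted above. It is not the code from vimGPT: the function name `parse_action` and the exact validation rules are assumptions, based only on what the prompt asks for (JSON only, keys limited to navigate, type, click, and done, and a 1-2 letter sequence for clicks).

```python
import json

# Keys the quoted prompt allows the model to return.
ALLOWED_KEYS = {"navigate", "type", "click", "done"}


def parse_action(reply: str) -> dict:
    """Parse a model reply into an action dict, rejecting anything off-contract.

    Hypothetical helper for illustration; not taken from the vimGPT source.
    """
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on
    # non-JSON input, for example a reply wrapped in a code block.
    action = json.loads(reply)
    if not isinstance(action, dict) or not action:
        raise ValueError("expected a non-empty JSON object")
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    click = action.get("click")
    if click is not None:
        # The prompt asks for the 1-2 letter sequence shown in the yellow box.
        if not (isinstance(click, str) and 1 <= len(click) <= 2 and click.isalpha()):
            raise ValueError("click must be a 1-2 letter hint")
    return action
```

Rejecting off-contract replies up front means a malformed answer can be retried instead of being passed on to the browser.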
transistorfan, over 1 year ago
At my work there is a large contingent of people who essentially do manual data copying between legacy programs (govt), because the tech debt is so large that we can't figure out a way to plug these things together. Excited for tools like this to eventually act as a layer that can run over these sorts of problems, as bizarre a solution as it is from a compute perspective.
lachlan_gray, over 1 year ago
I think vim is unintentionally a great "embodiment" for ChatGPT. There's nothing that can't be done with a stream of text, and the internet is full of vimscript already.

I started a similar experiment if anyone else is thinking along the same lines :)

https://github.com/LachlanGray/vim-agent
ishan0102, over 1 year ago
Hey! Creator here, thanks for sharing! Let me know if anyone has questions and feel free to contribute. I've left some potential next steps in the README.
maccam912, over 1 year ago
I've been playing with a similar idea of screenshots and actions from GPT-4 Vision for browsing, but after trying and failing to overlay info in the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas, so maybe add this to the list if you think it's a good idea?
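As a rough sketch of the "accessibility tree as text" idea described above: the function below flattens a nested tree into indented lines that could be sent to a model in place of a screenshot. The node shape (dicts with `role`, `name`, and `children` keys) and the name `flatten_tree` are assumptions made for this example; the commenter's actual code is not shown in the thread.

```python
def flatten_tree(node: dict, depth: int = 0) -> list:
    """Turn a nested accessibility-style tree into indented 'role: name' lines.

    Assumes each node is a dict with optional 'role', 'name', and 'children'
    keys; this shape is an illustrative assumption.
    """
    role = node.get("role", "")
    name = node.get("name", "")
    # Indentation encodes nesting so the model can see which controls belong together.
    lines = [f"{'  ' * depth}{role}: {name}".rstrip()]
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


# Example with a hand-written tree (not real browser output).
tree = {
    "role": "WebArea",
    "name": "Search",
    "children": [
        {"role": "textbox", "name": "Query"},
        {"role": "button", "name": "Go"},
    ],
}
prompt_text = "\n".join(flatten_tree(tree))
```

A text listing like this is usually far smaller than an image, which may be part of why it worked better in the commenter's case.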
mackross, over 1 year ago
Been playing with this through the ChatGPT interface for the past few weeks. Couple of tips. Update the CSS to get rid of the gradients and rounded corners. I found red with bold white text to be most consistent. Increase the font size. If two labels overlap, push them apart and add an arrow to the element. Send both images to the API: a version with the annotations added and a version without.
karmasimida, over 1 year ago
We can create an autopilot for the browser.

It is going to be incredibly difficult moving forward to distinguish bot traffic if this is deployed at scale.

The problem I see is this isn't going to be cheap or even affordable in the short term.
reqo, over 1 year ago
How will tools like this affect web tracking or advertisements on the internet in general? Imagine you could have an agent browse the web for you and fetch exactly what you are searching for without you seeing any ads/pop-ups or being tracked along the way! Could be a great "ad blocker"! Could it perhaps also make SEO useless and thus improve the quality of the internet? But I wonder if it could also have negative effects, such as the ads being "interwoven" into the fetched content somehow!
FooBarWidget, over 1 year ago
Many Dutch companies pay salaries by

1. receiving payslips from the accountant, and then
2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then
3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.

This is completely useless manual labor. There should be no reason for this to be a manual procedure. And yet it's almost impossible to automate this. The accountant portal either has no API, or it has an API but only lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.

So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe it can just prepare the transactions and then a person has to approve the submission.
snake_doc, over 1 year ago
Ah, very similar to Adept's[1] concept? Though, their product seems not yet ready.

[1] https://www.adept.ai/
dangerwill, over 1 year ago
How is this making your browsing experience any better? You still have to know what you want to do, and it is just faster to type Rick roll into YouTube directly and click the links directly instead of having to type k, or vh, or whatever. You are just adding a useless ChatGPT middleman between you and the browser that you likely spend all day in anyway and should be adept at navigating.
bnchrch, over 1 year ago
Personally, this is what I'm really excited about ChatGPT for. Data has just become a lot more free to access.
thekid314, over 1 year ago
I'm curious to see what it does when it sees a captcha.
burcs, over 1 year ago
This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.
ternaus, over 1 year ago
Love the idea.

It also shows that GPT-4V created a new angle in web scraping.

I guess this or similar code would be leveraged in many projects like:

1. Scrape XXX websites. Sites like LinkedIn or Twitter use all types of methods in the DOM to prevent scraping, but fighting a well-working GPT-4V + OCR setup would be ultra hard.
2. Give me an analysis of what these XXX companies are doing. This could be done for competitors, to understand the landscape of some industry, or even plainly to get news.

Large-scale scraping that does not depend on the source code of the pages is a powerful infrastructural change.
DalasNoin, over 1 year ago
I tried to use it, but unfortunately it often did not add the little annotations for the different options to the screen and it got stuck in a loop. This bot works by adding a two-letter combination to each clickable option, but sometimes they don't show up. It managed to sign in to Twitter once, but I burned through the 100-image API limit really quickly.

Maybe a future version could use vision only for difficult situations in which it gets stuck, and otherwise use the text-based browser?
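One way to picture the suggestion above (text by default, vision only when stuck) is a small loop detector. This is a hedged sketch rather than anything from the project: `StuckDetector`, `choose_mode`, and the window of three repeated actions are all assumptions chosen for illustration.

```python
import json
from collections import deque


class StuckDetector:
    """Flags when the agent keeps proposing the same action (hypothetical helper)."""

    def __init__(self, window: int = 3):
        # Keep only the most recent `window` actions.
        self.recent = deque(maxlen=window)

    def record(self, action: dict) -> bool:
        """Store an action; return True if the last `window` actions are identical."""
        # Serialize with sorted keys so equal dicts compare equal as strings.
        self.recent.append(json.dumps(action, sort_keys=True))
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1


def choose_mode(stuck: bool) -> str:
    # Text mode is the cheap default; vision is reserved for when the agent is stuck.
    return "vision" if stuck else "text"
```

Under these assumptions, image requests would be spent only on the steps where the text-based path is looping, which would stretch a per-day image limit further.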
comment_ran, over 1 year ago
It's so cool. I was wondering if we can make crawler tools much easier and better. It's more similar to the "human" way to interact with a website.
ranulo, over 1 year ago
This could enable human language test automation scripts and could either improve my life as a QA engineer a lot or completely destroy it. Not sure yet.
jackconsidine, over 1 year ago
Looks extremely cool. Trying to run it though, I get stuck at "Getting actions for the given objective..." (using the example on the repo)
silentguy, over 1 year ago
Usually there are a lot of comments about how text is the best interface and how it's making a comeback with LLMs, but in this case a picture is the better medium, since parsing the webpage JS would prove too difficult. I think a screenshot of a webpage has a smaller footprint than the raw payloads (JS, assets, etc.).
snthpy, over 1 year ago
Looks cool. Unfortunately I expected this to enhance my Vimium experience but it looks like this is using Vimium to enhance GPT4, right?
silentguy, over 1 year ago
I think this can be extended to the desktop as well. There are programs that act like Vimium for your desktop (win-vind, etc.). I don't have an OpenAI API key to try it, but I wish someone would give it a try (in an isolated environment, obviously).
jonathanlb, over 1 year ago
Hmm interesting. I'm curious what this means for accessibility and screen readers.
imranq, over 1 year ago
Is the vision model directly reading the screen and therefore also reading the Vimium tags? It might be more effective to export the DOM tags and the associated elements as a JSON object that is fed into ChatGPT without using the vision component.
gvv, over 1 year ago
Nice job! The horrors GPT-4 must endure to watch ads, truly inhumane
doctorM, over 1 year ago
i think this is actively dangerous. well not yet. but getting there.

i know - ai isn't meant to be sentient. but if it looks like a duck and quacks like a duck...

how do i know that the comments here aren't done by dedicated hacker news ai bots?

the potential danger could come from lack of supervision down the road.

i didn't get much sleep last night so this is less coherent than it could be.
braindead_in, over 1 year ago
Why not build a new browser with GPT baked in?
owenpalmer, over 1 year ago
This will be fantastic for accessibility
nostrowski, over 1 year ago
This will be in a future history book under a chapter titled "the beginning of the end"
startages, over 1 year ago
There is just so much you can do with GPT-4 vision, I just hope it's more affordable.
mediumsmart, over 1 year ago
this is awesome and great news, nevermind that the AI found the wrong video in the demo

https://www.youtube.com/watch?v=jRyX1tC2OS0
bilekas, over 1 year ago
This is actually pretty interesting. I am thinking maybe it would be faster than writing up Selenium tests ourselves if we could just give a few instructions.

I'm still going through the source, but really nice idea and a great example of enriching GPT with tools like Vimium.
rpigab, over 1 year ago
It's amazing that this is possible and works, but I wonder if the electricity cost is sustainable in the long run.

For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.

I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.

I'd be glad to listen to other points of view though. Maybe everything we do on computers is already bad for the environment anyway and comparing which one pollutes more is vain, idk.