Show HN: GPT-V and OCR for Screen Control

22 points by rchaves over 1 year ago

4 comments

rchaves over 1 year ago
Hey there everyone, now that AI can "see" very well with GPT-V, I was wondering whether it can interact with a computer like we do, just by looking at it. One of the shortcomings of GPT-V is that it cannot pinpoint the x,y coordinates of something on the screen very well, but I solved that by combining it with simple OCR and annotating the screenshot so that GPT-V can tell where it wants to click.

It turns out that with very few lines of code the results are already impressive: GPT-V can control my computer really well, and I can ask it to do whatever tasks by itself. It clicks around, types text, and presses buttons to navigate.

Would love to hear your thoughts on it!
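A minimal sketch of the OCR-plus-annotation idea described in the post, not the author's actual code. It assumes an OCR step has already produced text boxes with pixel coordinates; the `OcrBox` type, the `A1`, `A2`, ... label scheme, and both helper functions are hypothetical names chosen for illustration. The model only has to name a label, and the click coordinates come from the OCR box rather than from the model.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """One piece of text found by OCR, with its bounding box in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


def label_boxes(boxes):
    """Assign a short label (A1, A2, ...) to every OCR box.

    The labels would be drawn onto the screenshot before it is sent to the
    vision model (the drawing step is omitted in this sketch).
    """
    return {f"A{i}": box for i, box in enumerate(boxes, start=1)}


def click_point(labelled, label):
    """Translate the label chosen by the model into the centre of its box."""
    box = labelled[label]
    return (box.left + box.width // 2, box.top + box.height // 2)


# Hypothetical OCR output for a screenshot.
boxes = [OcrBox("File", 10, 5, 40, 20), OcrBox("Search", 200, 100, 120, 30)]
labelled = label_boxes(boxes)

# Suppose the model answered "click A2".
print(click_point(labelled, "A2"))  # (260, 115)
```

The returned point could then be passed to whatever mouse-automation tool is in use.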
anonzzzies over 1 year ago
Nice work. I was looking for this for a while and had no time to do it myself. I would say it's probably a good idea to make it AI-assisted: many things you can do faster yourself by saying 'click h2', fill in text 'hello world', etc., instead of having the LLM figure it out. So a combination of things, basically. But a very good start!

Edit: it's also probably good, in case it is not sure, to open the browser and try there.
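The hybrid approach suggested in this comment could be sketched roughly as follows: simple commands are parsed directly, and anything else is deferred to the LLM. The command grammar, the action dictionary shape, and the function name `parse_direct_command` are assumptions made for illustration only.

```python
import re


def parse_direct_command(command):
    """Return an action dict for simple commands, or None to defer to the LLM."""
    command = command.strip()

    # e.g. "click h2" -> click the element labelled "h2"
    match = re.fullmatch(r"click\s+(\S+)", command, flags=re.IGNORECASE)
    if match:
        return {"action": "click", "target": match.group(1)}

    # e.g. "fill in text 'hello world'" -> type the quoted text
    match = re.fullmatch(r"fill in text\s+'(.*)'", command, flags=re.IGNORECASE)
    if match:
        return {"action": "type", "text": match.group(1)}

    # Not a recognised shortcut: let the model work it out from the screenshot.
    return None


print(parse_direct_command("click h2"))
print(parse_direct_command("fill in text 'hello world'"))
print(parse_direct_command("book me a flight to Lisbon"))
```

A `None` result is the signal to fall back to the slower screenshot-and-model loop.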
okish over 1 year ago
Take a look at this related work: https://arxiv.org/abs/2310.11441
xelia over 1 year ago
I wonder if this can be optimized by letting GPT provide multiple instructions per screenshot instead of just one.

For example, in the Twitter screenshot, it could use just the one image.
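One way to picture this suggestion, as a sketch under assumptions rather than how the project works: the model is asked to reply with a JSON list of actions, and all of them are executed before the next screenshot is taken. The JSON shape and the names `parse_action_batch` and `run_batch` are hypothetical.

```python
import json


def parse_action_batch(reply):
    """Parse the model's reply into a list of actions.

    A single JSON object is accepted too and wrapped in a list, so the
    one-action-per-screenshot behaviour still works.
    """
    actions = json.loads(reply)
    if isinstance(actions, dict):
        actions = [actions]
    return actions


def run_batch(actions, execute):
    """Execute every action in order and return how many were run."""
    for action in actions:
        execute(action)
    return len(actions)


# Hypothetical model reply containing two steps for one screenshot.
reply = '[{"action": "click", "target": "A2"}, {"action": "type", "text": "hello"}]'

performed = []  # stand-in for real mouse/keyboard execution
count = run_batch(parse_action_batch(reply), performed.append)
print(count, performed)
```

The trade-off is that later actions in a batch are chosen without seeing how the screen changed after the earlier ones.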