Hey there everyone, now that AI can "see" very well with GPT-V, I was wondering if it can interact with a computer the way we do, just by looking at it. One of the shortcomings of GPT-V is that it cannot really pinpoint the x,y coordinates of something on the screen, but I worked around that by combining it with simple OCR and annotating the screenshot so GPT-V can tell me where it wants to click.

Turns out that with very few lines of code the results are already impressive: GPT-V can control my computer surprisingly well, and I can ask it to do tasks by itself. It clicks around, types stuff, and presses buttons to navigate.

Would love to hear your thoughts on it!
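To give a sense of the approach, here's a minimal sketch (not the actual repo code; the prompt, model name, and library calls are simplifying assumptions): OCR the screenshot, draw a numbered box around every detected word, send the annotated image to GPT-V, and click the center of whichever box it names.

```python
# Rough sketch of the OCR + annotation idea (assumptions: pytesseract for OCR,
# pyautogui for screenshots/clicks, OpenAI's vision-capable chat model).
import base64
import io

import pyautogui
import pytesseract
from PIL import ImageDraw
from openai import OpenAI


def annotate_screenshot():
    """Take a screenshot, box every OCR'd word, and number the boxes."""
    img = pyautogui.screenshot()  # returns a PIL Image
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(img)
    labels = {}  # box number -> (center_x, center_y) in screen coordinates
    n = 0
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        draw.rectangle([x, y, x + w, y + h], outline="red")
        draw.text((x, max(y - 12, 0)), str(n), fill="red")
        labels[n] = (x + w // 2, y + h // 2)
        n += 1
    return img, labels


def ask_gpt_where_to_click(img, task):
    """Send the annotated screenshot to GPT-V and get back a box number."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}. Reply with only the number of the box to click."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return int(resp.choices[0].message.content.strip())


img, labels = annotate_screenshot()
box = ask_gpt_where_to_click(img, "open the Settings menu")
pyautogui.click(*labels[box])  # click the center of the chosen box
```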
Nice work, I was looking for something like this for a while and had no time to do it myself. I would say it's probably a good idea to make it AI-assisted; many things you can do faster yourself by saying "click h2", "fill in text 'hello world'", etc., instead of having the LLM figure it out. So a combination of things, basically. But a very good start!

Edit: it's also probably good, in case it is not sure, to open the browser and try there.
Take a look at this related work:
https://arxiv.org/abs/2310.11441
I wonder if this can be optimized by letting GPT provide multiple instructions per screenshot instead of just one. For example, in the Twitter screenshot it could have done everything from just the one image.
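If batching were added, I'd imagine asking the model for a JSON list of actions per screenshot and executing them in order, roughly like this (hypothetical sketch; the action schema and names are made up, not from the project):

```python
# Hypothetical batching sketch: one screenshot -> several actions.
import json

import pyautogui


def execute_actions(actions, labels):
    """Run a list of actions like:
    [{"action": "click", "label": 3},
     {"action": "type", "text": "hello world"},
     {"action": "press", "key": "enter"}]
    where `labels` maps box numbers to screen coordinates (as in the OCR step)."""
    for step in actions:
        if step["action"] == "click":
            pyautogui.click(*labels[step["label"]])
        elif step["action"] == "type":
            pyautogui.typewrite(step["text"], interval=0.02)
        elif step["action"] == "press":
            pyautogui.press(step["key"])


# Example: parse the model's JSON reply and run all steps from one screenshot.
labels = {3: (640, 360)}  # box number -> screen coords, from the annotation step
actions = json.loads('[{"action": "click", "label": 3},'
                     ' {"action": "type", "text": "hello world"},'
                     ' {"action": "press", "key": "enter"}]')
execute_actions(actions, labels)
```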