Not as impressive or interesting as CogAgent:

CogVLM: Visual Expert for Pretrained Language Models

CogAgent: A Visual Language Model for GUI Agents

https://arxiv.org/abs/2312.08914

https://github.com/THUDM/CogVLM

https://arxiv.org/pdf/2312.08914.pdf

CogAgent: A Visual Language Model for GUI Agents

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
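The abstract says CogAgent pairs a low-resolution image encoder with a high-resolution one to handle 1120×1120 screenshots, but this excerpt does not spell out how the two streams are combined. Below is a minimal, illustrative PyTorch sketch of one plausible fusion scheme (cross-attention from low-resolution tokens into high-resolution tokens); the class, layer sizes, and fusion choice are my assumptions, not the paper's described architecture.

    # Minimal sketch of the dual-encoder idea from the abstract: a low-resolution
    # encoder for global layout plus a high-resolution encoder for tiny text and
    # page elements. How CogAgent actually fuses the two is not described in this
    # excerpt; the cross-attention fusion and all names/dimensions here are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    class DualResolutionVisionStub(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            # stand-ins for real ViT encoders (patch-embedding conv only)
            self.low_res_encoder = nn.Conv2d(3, dim, kernel_size=14, stride=14)   # 224x224 input
            self.high_res_encoder = nn.Conv2d(3, dim, kernel_size=56, stride=56)  # 1120x1120 input
            self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, low_res_img, high_res_img):
            # (B, dim, H', W') -> (B, H'*W', dim) token sequences
            low = self.low_res_encoder(low_res_img).flatten(2).transpose(1, 2)
            high = self.high_res_encoder(high_res_img).flatten(2).transpose(1, 2)
            # low-res tokens query fine-grained detail from the high-res tokens
            fused, _ = self.fuse(query=low, key=high, value=high)
            return low + fused  # visual tokens handed to the language model

    model = DualResolutionVisionStub()
    tokens = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 1120, 1120))
    print(tokens.shape)  # torch.Size([1, 256, 1024])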
1. Introduction

Autonomous agents in the digital world are ideal assistants that many modern people dream of. Picture this scenario: You type in a task description, then relax and enjoy a cup of coffee while watching tasks like booking tickets online, conducting web searches, managing files, and creating PowerPoint presentations get completed automatically.

Recently, the emergence of agents based on large language models (LLMs) is bringing us closer to this dream.
For example, AutoGPT [33], a 150,000-star open-source project, leverages ChatGPT [29] to integrate language understanding with pre-defined actions like Google searches and local file operations. Researchers are also starting to develop agent-oriented LLMs [7, 42]. However, the potential of purely language-based agents is quite limited in real-world scenarios, as most applications interact with humans through Graphical User Interfaces (GUIs), which are characterized by the following:

• Standard APIs for interaction are often lacking.

• Important information, including icons, images, diagrams, and spatial relations, is difficult to convey directly in words.

• Even in text-rendered GUIs like web pages, elements such as canvas and iframe cannot be parsed to grasp their functionality via HTML alone (see the sketch below).
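To make that last point concrete, here is a tiny, hypothetical example (using BeautifulSoup, which is not something the paper uses) showing that the HTML of a canvas or iframe exposes essentially nothing about what is actually rendered inside it.

    # Toy illustration: parsing the markup of a canvas/iframe yields no readable
    # content, only attributes. The HTML snippet and ids are made up.
    from bs4 import BeautifulSoup

    html = """
    <div class="dashboard">
      <canvas id="sales-chart" width="800" height="400"></canvas>
      <iframe src="https://example.com/embedded-app"></iframe>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    print(repr(soup.get_text(strip=True)))  # -> '' : no readable text at all
    print(soup.find("canvas").attrs)        # -> only id/width/height, not the rendered chart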
Agents based on visual language models (VLMs) have the potential to overcome these limitations. Instead of relying exclusively on textual inputs such as HTML [28] or OCR results [31], VLM-based agents directly perceive visual GUI signals. Since GUIs are designed for human users, VLM-based agents can perform as effectively as humans, as long as the VLMs match human-level visual understanding. In addition, VLMs are capable of skills such as extremely fast reading and programming that are usually beyond the reach of most human users, further extending the potential of VLM-based agents. A few prior studies used visual features merely as auxiliaries in specific scenarios, e.g., WebShop [39], which employs visual features primarily for object recognition. With the rapid development of VLMs, can we naturally achieve universality on GUIs by relying solely on visual inputs?
In this work, we present CogAgent, a visual language foundation model specializing in GUI understanding and planning while maintaining a strong ability for general cross-modality tasks. Building upon CogVLM [38], a recent open-source VLM, CogAgent tackles the following challenges for building GUI agents: [...]