Not as impressive or interesting as CogAgent:

CogVLM: Visual Expert for Pretrained Language Models

CogAgent: A Visual Language Model for GUI Agents

https://arxiv.org/abs/2312.08914

https://github.com/THUDM/CogVLM

https://arxiv.org/pdf/2312.08914.pdf

CogAgent: A Visual Language Model for GUI Agents

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
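The abstract says CogAgent pairs a low-resolution image encoder with a high-resolution one to handle 1120×1120 screenshots, but this excerpt does not spell out how the two streams are combined. Below is a minimal, illustrative PyTorch sketch of one plausible fusion scheme (cross-attention from low-resolution tokens into high-resolution tokens); the class, layer sizes, and fusion choice are my assumptions, not the paper's described architecture.

    # Minimal sketch of the dual-encoder idea from the abstract: a low-resolution
    # encoder for global layout plus a high-resolution encoder for tiny text and
    # page elements. How CogAgent actually fuses the two is not described in this
    # excerpt; the cross-attention fusion and all names/dimensions here are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    class DualResolutionVisionStub(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            # stand-ins for real ViT encoders (patch-embedding conv only)
            self.low_res_encoder = nn.Conv2d(3, dim, kernel_size=14, stride=14)   # 224x224 input
            self.high_res_encoder = nn.Conv2d(3, dim, kernel_size=56, stride=56)  # 1120x1120 input
            self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, low_res_img, high_res_img):
            # (B, dim, H', W') -> (B, H'*W', dim) token sequences
            low = self.low_res_encoder(low_res_img).flatten(2).transpose(1, 2)
            high = self.high_res_encoder(high_res_img).flatten(2).transpose(1, 2)
            # low-res tokens query fine-grained detail from the high-res tokens
            fused, _ = self.fuse(query=low, key=high, value=high)
            return low + fused  # visual tokens handed to the language model

    model = DualResolutionVisionStub()
    tokens = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 1120, 1120))
    print(tokens.shape)  # torch.Size([1, 256, 1024])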
1. Introduction

Autonomous agents in the digital world are ideal assistants that many modern people dream of. Picture this scenario: You type in a task description, then relax and enjoy a cup of coffee while watching tasks like booking tickets online, conducting web searches, managing files, and creating PowerPoint presentations get completed automatically.

Recently, the emergence of agents based on large language models (LLMs) is bringing us closer to this dream.
For example, AutoGPT [33], a 150,000-star open-source project, leverages ChatGPT [29] to integrate language understanding with pre-defined actions like Google searches and local file operations. Researchers are also starting to develop agent-oriented LLMs [7, 42]. However, the potential of purely language-based agents is quite limited in real-world scenarios, as most applications interact with humans through Graphical User Interfaces (GUIs), which are characterized by the following:

• Standard APIs for interaction are often lacking.

• Important information, including icons, images, diagrams, and spatial relations, is difficult to convey directly in words.

• Even in text-rendered GUIs like web pages, elements such as canvas and iframe cannot be parsed to grasp their functionality via HTML alone (see the sketch below).
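To make that last point concrete, here is a tiny, hypothetical example (using BeautifulSoup, which is not something the paper uses) showing that the HTML of a canvas or iframe exposes essentially nothing about what is actually rendered inside it.

    # Toy illustration: parsing the markup of a canvas/iframe yields no readable
    # content, only attributes. The HTML snippet and ids are made up.
    from bs4 import BeautifulSoup

    html = """
    <div class="dashboard">
      <canvas id="sales-chart" width="800" height="400"></canvas>
      <iframe src="https://example.com/embedded-app"></iframe>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    print(repr(soup.get_text(strip=True)))  # -> '' : no readable text at all
    print(soup.find("canvas").attrs)        # -> only id/width/height, not the rendered chart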
Agents based on visual language models (VLMs) have the potential to overcome these limitations. Instead of relying exclusively on textual inputs such as HTML [28] or OCR results [31], VLM-based agents directly perceive visual GUI signals. Since GUIs are designed for human users, VLM-based agents can perform as effectively as humans, as long as the VLMs match human-level visual understanding. In addition, VLMs are capable of skills such as extremely fast reading and programming that are usually beyond the reach of most human users, further extending the potential of VLM-based agents. A few prior studies used visual features merely as auxiliaries in specific scenarios, e.g., WebShop [39], which employs visual features primarily for object recognition. With the rapid development of VLMs, can we naturally achieve universality on GUIs by relying solely on visual inputs?
In this work, we present CogAgent, a visual language foundation model specializing in GUI understanding and planning while maintaining a strong ability for general cross-modality tasks. Building upon CogVLM [38], a recent open-source VLM, CogAgent tackles the following challenges for building GUI agents: [...]