64 points by punnerud 3 months ago

4 comments

The OS has additional information including how different graphics layers are composited, and what accessibility metadata is attached to interface elements. It ought to be useful to exploit this to do better than screenshot parsing.

icodar 3 months ago

This is not the intended use, but it works well for parsing document layout from images.

nighthawk454 3 months ago

One ponders the connections with the Recall feature.

NewUser76312 3 months ago

Very cool work. Accurate GUI text and element parsing is exactly the kind of input that LLMs need to be effective agents.

OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent
