[🌐 Website] •
[📜 Paper] •
[🐱 GitHub]•
[ Zhihu]•
[ OpenReview]
This is the repo for the paper OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use. This paper conducts a comprehensive survey on OS Agents, which are (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks. The survey is aimed to consolidates the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. In this repository, we have listed relevant papers related to our work in four areas: Foundation Models, Agent Frameworks, Evaluation & Benchmarks, and Safety & Privacy and this collection will be continuously updated. We aim to provide you with comprehensive knowledge in the OS Agent field, hoping it can help you quickly familiarize yourself with this research direction.
This paper was rejected by arXiv surprisingly with the justification: "Our moderators determined that your submission does not contain sufficient original or substantive scholarly research and is not of interest to arXiv." This reasoning appears to be inconsistent with the content and contribution of the paper. We attempted an appeal, but unfortunately, this was unsuccessful, and no further explanation was provided. A resubmission did not resolve the issue either. As a result, the ONLY way to access the paper at the moment is through our GitHub repository or via OpenReview Archive. We are disappointed by the lack of transparency in arXiv’s moderation process.
(Some teams involved in this survey is hiring. Information will be continuously updated, please stay tuned. Detailed information is here.)
[OPPO] Personal AI Team
OPPO is seeking algorithm interns, new graduates, and experienced candidates for its Personal AI Team. Focused on multimodal LLMs, AI Agents, and personalization, the team works to develop AI Native Phones. Interested candidates, contact: [email protected].
[01.AI] Foundation Model Post-Training Team
01.AI is hiring algorithm interns, new graduates, and experienced candidates for its post-training team. Join the team behind Yi-Lightning, ranked #1 in China and #6 globally in the LMSys Chatbot Arena updated on October 14, 2024. Interested candidates, contact: [email protected].
This survey aims to advance the research and development of OS Agents by providing a detailed exploration of their fundamental capabilities, methodologies for building them using (M)LLMs, and emerging trends in the field. While OS Agents are still in the early stages of growth, the rapid evolution of technology continues to introduce innovative approaches and applications. This work seeks to highlight ongoing challenges, future opportunities, and the latest developments, encouraging further research and industrial adoption. Ultimately, we hope this study will serve as a catalyst for innovation, driving meaningful progress in both academia and industry.
Table 1: Recent foundation models for OS Agents. Arch.: Architecture, Exist.: Existing, Mod.: Modified, Concat.: Concatenated, PT: Pre-Train, SFT: Supervised Fine-Tune, RL: Reinforcement Learning.
Paper | Model | Arch. | PT | SFT | RL | Date | Link |
---|---|---|---|---|---|---|---|
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | OS-Atlas | Exist. MLLMs | √ | √ | - | 10/2024 | [paper] |
AutoGLM: Autonomous Foundation Agents for GUIs | AutoGLM | Exist. LLMs | √ | √ | √ | 10/2024 | [paper] |
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data | EDGE | Exist. MLLMs | - | √ | - | 10/2024 | [paper] |
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms | Ferret-UI 2 | Exist. MLLMs | - | √ | - | 10/2024 | [paper] |
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent | ShowUI | Exist. MLLMs | √ | √ | - | 10/2024 | [paper] |
Harnessing Webpage UIs for Text-Rich Visual Understanding | UIX | Exist. MLLMs | - | √ | - | 10/2024 | [paper] |
TinyClick: Single-Turn Agent for Empowering GUI Automation | TinyClick | Exist. MLLMs | √ | - | - | 10/2024 | [paper] |
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | UGround | Exist. MLLMs | - | √ | - | 10/2024 | [paper] |
NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator | NNetNav | Exist. LLMs | - | √ | - | 10/2024 | [paper] |
Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale | Synatra | Exist. LLMs | - | √ | - | 09/2024 | [paper] |
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding | MobileVLM | Exist. MLLMs | √ | √ | - | 09/2024 | [paper] |
UI-Hawk: Unleashing the screen stream understanding for gui agents | UI-Hawk | Mod. MLLMs | √ | √ | - | 08/2024 | [paper] |
GUI Action Narrator: Where and When Did That Action Take Place? | GUI Action Narrator | Exist. MLLMs | - | √ | - | 07/2024 | [paper] |
MobileFlow: A Multimodal LLM for Mobile GUI Agent | MobileFlow | Mod. MLLMs | √ | √ | - | 07/2024 | [paper] |
VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning | VGA | Exist. MLLMs | - | √ | - | 06/2024 | [paper] |
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices | OdysseyAgent | Exist. MLLMs | - | √ | - | 06/2024 | [paper] |
Tell Me What's Next: Textual Foresight for Generic UI Representations | Textual Foresight | Concat. MLLMs | √ | √ | - | 06/2024 | [paper] |
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning | WebAI | Concat. MLLMs | - | √ | √ | 05/2024 | [paper] |
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning | GLAINTEL | Exist. LLMs | - | - | √ | 04/2024 | [paper] |
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | Ferret-UI | Exist. MLLMs | - | √ | - | 04/2024 | [paper] |
AutoWebGLM: A Large Language Model-based Web Navigating Agent | AutoWebGLM | Exist. LLMs | - | √ | √ | 04/2024 | [paper] |
Large Language Models Can Self-Improve At Web Agent Tasks | - | Exist. LLMs | - | √ | - | 03/2024 | [paper] |
ScreenAI: A Vision-Language Model for UI and Infographics Understanding | ScreenAI | Exist. MLLMs | √ | √ | - | 02/2024 | [paper] |
Dual-View Visual Contextualization for Web Navigation | Dual-VCR | Concat MLLMs | - | √ | - | 02/2024 | [paper] |
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | SeeClick | Exist. MLLMs | √ | √ | - | 01/2024 | [paper] |
CogAgent: A Visual Language Model for GUI Agents | CogAgent | Mod. MLLMs | √ | √ | - | 12/2023 | [paper] |
ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations | ILuvUI | Mod. MLLMs | - | √ | - | 10/2023 | [paper] |
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API | RUIG | Concat. MLLMs | - | - | √ | 10/2023 | [paper] |
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | WebAgent | Concat. MLLMs | √ | √ | - | 07/2023 | [paper] |
Multimodal Web Navigation with Instruction-Finetuned Foundation Models | WebGUM | Concat. MLLMs | - | √ | - | 05/2023 | [paper] |
Table 2: Recent agent frameworks for OS Agents. TD: Textual Description, GS: GUI Screenshots, VG: Visual Grounding, SG: Semantic Grounding, DG: Dual Grounding, GL: Global, IT: Iterative, AE: Automated Exploration, EA: Experience-Augmented, MA: Management, IO: Input Operations, NO: Navigation Operations, EO: Extended Operations.
Paper | Model | Perception | Planning | Memory | Action | Date | Link |
---|---|---|---|---|---|---|---|
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization | OpenWebVoyager | GS,SG | - | - | IO,NO | 10/2024 | [paper] |
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning | OSCAR | GS,DG | IT | AE | EO | 10/2024 | [paper] |
Large Language Models Empowered Personalized Web Agents | PUMA | TD | - | - | IO,NO,EO | 10/2024 | [paper] |
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents | AgentOccam | TD | IT | MA | IO,NO | 10/2024 | [paper] |
Agent S: An Open Agentic Framework that Uses Computers Like a Human | Agent S | GS,SG | GL | EA,AE,MA | IO,NO | 10/2024 | [paper] |
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents | ClickAgent | GS | IT | AE | IO,NO | 10/2024 | [paper] |
From Commands to Prompts: LLM-based Semantic File System for AIOS | LSFS | GS,SG | - | - | EO | 09/2024 | [paper] |
NaviQAte: Functionality-Guided Web Application Navigation | NaviQate | GS,SG | - | - | IO | 09/2024 | [paper] |
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM | PeriGuru | GS,DG | IT | EA,AE | IO,NO | 09/2024 | [paper] |
OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models | OpenWebAgent | GS,DG | - | - | IO | 08/2024 | [paper] |
Towards LLMCI: Multimodal AI for LLM-Vision UI Operation | LLMCI | GS,SG | - | - | EO | 07/2024 | [paper] |
Agent-e: From autonomous web navigation to foundational design principles in agentic systems | Agent-E | TD | IT | AE,MA | IO,NO | 07/2024 | [paper] |
Cradle: Empowering Foundation Agents Towards General Computer Control | Cradle | GS | IT | EA,AE,MA | EO | 03/2024 | [paper] |
Android in the zoo: Chain-of-action-thought for gui agents | CoAT | GS | IT | - | IO,NO | 03/2024 | [paper] |
On the Multi-turn Instruction Following for Conversational Web Agents | Self-MAP | - | IT | EA | IO | 02/2024 | [paper] |
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement | OS-Copilot | TD | GL | EA,AE | IO,EO | 02/2024 | [paper] |
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception | Mobile-Agent | GS,SG | IT | AE | IO,NO | 01/2024 | [paper] |
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | WebVoyager | GS,VG | IT | MA | IO,NO | 01/2024 | [paper] |
MobileAgent: enhancing mobile control via human-machine interaction and SOP integration | AIA | GS,VG | GL | - | IO,NO | 01/2024 | [paper] |
GPT-4V(ision) is a Generalist Web Agent, if Grounded | SeeAct | GS,SG | - | AE | IO | 01/2024 | [paper] |
AppAgent: Multimodal Agents as Smartphone Users | AppAgent | GS,DG | IT | AE | IO,NO | 12/2023 | [paper] |
Assistgui: Task-oriented desktop graphical user interface automation | ACE | TD | GL | AE | IO,NO | 12/2023 | [paper] |
MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation | MobileGPT | TD | GL | MA | IO,NO | 12/2023 | [paper] |
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation | MM-Navigator | GS,VG | - | MA | IO,NO | 11/2023 | [paper] |
WebWISE: Web Interface Control and Sequential Exploration with Large Language Models | WebWise | TD | - | MA | IO,NO | 10/2023 | [paper] |
A Zero-Shot Language Agent for Computer Control with Structured Reflection | - | TD | IT | AE | IO,NO | 10/2023 | [paper] |
Laser: Llm agent with state-space exploration for web navigation | Laser | TD | IT | AE | IO,NO | 09/2023 | [paper] |
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control | Synapse | - | - | MA | IO | 06/2023 | [paper] |
Sheetcopilot: Bringing software productivity to the next level through large language models | SheetCopilot | TD | IT | AE | EO | 05/2023 | [paper] |
Language Models can Solve Computer Tasks | RCI | - | IT | AE | IO,NO | 03/2023 | [paper] |
Enabling conversational interaction with mobile ui using large language models | - | TD | - | - | IO | 09/2022 | [paper] |
Table 3: Recent benchmarks for OS Agents. We divided the Benchmarks into three sections based on the Platform and sorted them by release date. The following is an explanation of the abbreviations. BS: Benchmark Settings, M/P: Mobile, PC: Desktop, IT: Interactive, ST: Static, OET: Operation Environment Types, RW: Real-World, SM: Simulated, GG: GUI Grounding, IF: Information Processing, AT: Agentic, CG: Code Generation.
Paper | Benchmark | Platform | BS | OET | Task | Date | Link |
---|---|---|---|---|---|---|---|
On the effects of data scale on computer control agents | AndroidControl | M/P | ST | - | AT | 06/2024 | [paper] |
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | AndroidWorld | M/P | IT | RW | AT | 05/2024 | [paper] |
Latent state estimation helps ui agents to reason | Android-50 (A-50) | M/P | IT | RW | AT | 05/2024 | [paper] |
Benchmarking mobile device control agents across diverse configurations | B-MoCA | M/P | IT | RW | AT | 04/2024 | [paper] |
Llamatouch: A faithful and scalable testbed for mobile ui task automation | LlamaTouch | M/P | IT | RW | AT | 04/2024 | [paper] |
Understanding the weakness of large language model agents within a complex android environment | AndroidArena | M/P | IT | RW | AT | 02/2024 | [paper] |
Android in the wild: A large-scale dataset for android device control | AITW | M/P | ST | - | AT | 07/2023 | [paper] |
Ugif: Ui grounded instruction following | UGIF-DataSet | M/P | ST | - | AT | 11/2022 | [paper] |
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility | MoTIF | M/P | ST | - | AT | 02/2022 | [paper] |
Mapping natural language instructions to mobile UI action sequences | PIXELHELP | M/P | IT | RW | GG | 05/2020 | [paper] |
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale | WindowsAgentArena | PC | IT | RW | AT | 09/2024 | [paper] |
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | OfficeBench | PC | IT | RW | AT | 07/2024 | [paper] |
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | OSWorld | PC | IT | RW | AT | 04/2024 | [paper] |
Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web | OmniACT | PC | ST | - | GG | 02/2024 | [paper] |
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation | ASSISTGUI | PC | IT | RW | AT | 12/2023 | [paper] |
WebCanvas: Benchmarking Web Agents in Online Environments | Mind2Web-Live | Web | IT | RW | IF,AT | 06/2024 | [paper] |
MMInA: Benchmarking multihop multimodal internet agents | MMInA | Web | IT | RW | IF,AT | 04/2024 | [paper] |
AgentStudio: A Toolkit for Building General Virtual Agents | GroundUI | Web | ST | - | GG | 03/2024 | [paper] |
Tur[k]ingBench: A Challenge Benchmark for Web Agents | TurkingBench | Web | IT | RW | AT | 03/2024 | [paper] |
Workarena: How capable are web agents at solving common knowledge work tasks? | WorkArena | Web | IT | RW | IF,AT | 03/2024 | [paper] |
WebLINX: Real-World website navigation with Multi-Turn dialogue | WebLINX | Web | ST | - | IF,AT | 02/2024 | [paper] |
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks | Visualwebarena | Web | IT | RW | GG,AT | 01/2024 | [paper] |
WebVLN: Vision-and-Language Navigation on Websites | WebVLN-v1 | Web | IT | RW | IF,AT | 12/2023 | [paper] |
Webarena: A realistic web environment for building autonomous agents | WebArena | Web | IT | RW | AT | 07/2023 | [paper] |
Mind2Web: Towards a Generalist Agent for the Web | Mind2Web | Web | ST | - | IF,AT | 06/2023 | [paper] |
Webshop: Towards scalable real-world web interaction with grounded language agents | WebShop | Web | ST | - | AT | 07/2022 | [paper] |
Mapping natural language commands to web elements | PhraseNode | Web | ST | - | GG | 08/2018 | [paper] |
World of bits: An open-domain platform for web-based agents | MiniWoB | Web | ST | - | AT | 08/2017 | [paper] |
World of bits: An open-domain platform for web-based agents | FormWoB | Web | IT | SM | AT | 08/2017 | [paper] |
- [2024/12/13] Falcon-UI: Understanding GUI Before Following User Instructions. [paper]
- [2024/12/13] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials. [paper]
- [2024/12/05] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
- [2024/11/26] ShowUI: One Vision-Language-Action Model for GUI Visual Agent. [paper]
- [2024/11/22] ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data. [paper]
- [2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
- [2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
- [2024/10/30] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. [paper]
- [2024/10/28] AutoGLM: Autonomous Foundation Agents for GUIs. [paper]
- [2024/10/25] EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data. [paper]
- [2024/10/24] Ferret-UI One: Mastering Universal User Interface Understanding Across Platforms. [paper]
- [2024/10/22] ShowUI: One Vision-Language-Action Model for Generalist GUI Agent. [paper]
- [2024/10/17] Harnessing Webpage UIs for Text-Rich Visual Understanding. [paper]
- [2024/10/09] TinyClick: Single-Turn Agent for Empowering GUI Automation. [paper]
- [2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
- [2024/10/03] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator. [paper]
- [2024/09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. [paper]
- [2024/09/24] Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale. [paper]
- [2024/09/23] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding. [paper]
- [2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
- [2024/08/30] UI-Hawk: Unleashing the screen stream understanding for gui agents. [paper]
- [2024/07/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
- [2024/07/05] MobileFlow: A Multimodal LLM for Mobile GUI Agent. [paper]
- [2024/06/20] VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning. [paper]
- [2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
- [2024/06/12] Tell Me What's Next: Textual Foresight for Generic UI Representations. [paper]
- [2024/05/08] Visual Grounding for User Interfaces. [paper]
- [2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
- [2024/05/05] Android in the Zoo:Chain-of-Action-Thought for GUI Agents. [paper]
- [2024/05/01] Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning. [paper]
- [2024/04/16] Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning. [paper]
- [2024/04/12] Training a Vision Language Model as Smartphone Assistant. [paper]
- [2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
- [2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
- [2024/04/04] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
- [2024/04/02] Octopus v2: On-device language model for super agent. [paper]
- [2024/03/30] Large Language Models Can Self-Improve At Web Agent Tasks. [paper]
- [2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
- [2024/02/07] ScreenAI: A Vision-Language Model for UI and Infographics Understanding. [paper]
- [2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
- [2024/01/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
- [2024/01/17] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. [paper]
- [2024/01/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
- [2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
- [2023/12/25] UINav: A Practical Approach to Train On-Device Automation Agents. [paper]
- [2023/12/14] CogAgent: A Visual Language Model for GUI Agents. [paper]
- [2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
- [2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
- [2023/10/27] Android in the wild: A large-scale dataset for android device control. [paper]
- [2023/10/08] UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. [paper]
- [2023/10/07] ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations. [paper]
- [2023/10/07] Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API. [paper]
- [2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
- [2023/05/31] From pixels to UI actions: Learning to follow instructions via graphical user interfaces. [paper]
- [2023/05/19] Multimodal Web Navigation with Instruction-Finetuned Foundation Models. [paper]
- [2023/01/30] WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics. [paper]
- [2023/01/23] Lexi: Self-Supervised Learning of the UI Language. [paper]
- [2022/10/06] Towards Better Semantic Understanding of Mobile Interfaces. [paper]
- [2022/09/29] Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus. [paper]
- [2022/07/04] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. [paper]
- [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]
- [2022/02/16] A data-driven approach for learning to control computers. [paper]
- [2024/12/02] Ponder & Press: Advancing Visual GUI Agent towards General Computer Control. [paper]
- [2024/11/20] AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. [paper]
- [2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
- [2024/11/15] The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use. [paper]
- [2024/11/10] Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents. [paper]
- [2024/11/01] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
- [2024/10/29] Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents. [paper]
- [2024/10/25] OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization. [paper]
- [2024/10/24] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. [paper]
- [2024/10/24] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant. [paper]
- [2024/10/24] Infogent: An Agent-Based Framework for Web Information Aggregation. [paper]
- [2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
- [2024/10/21] Beyond Browsing: API-Based Web Agents. [paper]
- [2024/10/17] AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. [paper]
- [2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
- [2024/10/11] VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. [paper]
- [2024/10/10] Agent S: An Open Agentic Framework that Uses Computers Like a Human. [paper]
- [2024/10/09] ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents. [paper]
- [2024/10/01] Dynamic Planning for LLM-based Graphical User Interface Automation. [paper]
- [2024/10/01] Multimodal Auto Validation For Self-Refinement in Web Agents. [paper]
- [2024/09/25] Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents. [paper]
- [2024/09/23] From Commands to Prompts: LLM-based Semantic File System for AIOS. [paper]
- [2024/09/23] Steward: Natural Language Web Automation. [paper]
- [2024/09/16] NaviQAte: Functionality-Guided Web Application Navigation. [paper]
- [2024/09/14] PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM. [paper]
- [2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
- [2024/09/11] Agent Workflow Memory. [paper]
- [2024/08/28] Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. [paper]
- [2024/08/24] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
- [2024/08/24] Intelligent Agents with LLM-based Process Automation. [paper]
- [2024/08/01] OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models. [paper]
- [2024/08/01] Omniparser for pure vision based gui agent. [paper]
- [2024/07/21] Towards LLMCI: Multimodal AI for LLM-Vision UI Operation. [paper]
- [2024/07/17] Agent-e: From autonomous web navigation to foundational design principles in agentic systems. [paper]
- [2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
- [2024/07/01] Tree Search for Language Model Agents. [paper]
- [2024/06/27] Read anywhere pointed: Layout-aware gui screen reading with tree-of-lens grounding. [paper]
- [2024/06/11] CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only. [paper]
- [2024/06/03] Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. [paper]
- [2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
- [2024/05/17] Latent State Estimation Helps UI Agents to Reason. [paper]
- [2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
- [2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
- [2024/04/03] PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts. [paper]
- [2024/03/25] AIOS: LLM Agent Operating System. [paper]
- [2024/03/05] Android in the zoo: Chain-of-action-thought for gui agents. [paper]
- [2024/03/05] Cradle: Empowering Foundation Agents Towards General Computer Control. [paper]
- [2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
- [2024/02/19] CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation. [paper]
- [2024/02/12] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. [paper]
- [2024/02/09] ScreenAgent: A Vision Language Model-driven Computer Control Agent. [paper]
- [2024/02/08] Ufo: A ui-focused agent for windows os interaction. [paper]
- [2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
- [2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
- [2024/01/25] WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. [paper]
- [2024/01/17] Seeclick: Harnessing gui grounding for advanced visual gui agents. [paper]
- [2024/01/04] MobileAgent: enhancing mobile control via human-machine interaction and SOP integration. [paper]
- [2024/01/03] GPT-4V(ision) is a Generalist Web Agent, if Grounded. [paper]
- [2023/12/21] AppAgent: Multimodal Agents as Smartphone Users. [paper]
- [2023/12/20] Assistgui: Task-oriented desktop graphical user interface automation. [paper]
- [2023/12/04] MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation. [paper]
- [2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
- [2023/11/13] GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [paper]
- [2023/10/24] WebWISE: Web Interface Control and Sequential Exploration with Large Language Models. [paper]
- [2023/10/12] A Zero-Shot Language Agent for Computer Control with Structured Reflection. [paper]
- [2023/09/20] You only look at screens: Multimodal chain-of-action agents. [paper]
- [2023/09/15] Laser: Llm agent with state-space exploration for web navigation. [paper]
- [2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
- [2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
- [2023/06/14] Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. [paper]
- [2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
- [2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
- [2023/05/23] Hierarchical prompting assists large language model on web navigation. [paper]
- [2023/05/09] InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language. [paper]
- [2023/03/30] Language Models can Solve Computer Tasks. [paper]
- [2022/09/19] Enabling conversational interaction with mobile ui using large language models. [paper]
- [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]
- [2024/12/09] BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks. [paper]
- [2024/12/05] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
- [2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
- [2024/10/28] AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation. [paper]
- [2024/10/28] Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models. [paper]
- [2024/10/28] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. [paper]
- [2024/10/24] VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks. [paper]
- [2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
- [2024/10/19] SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation. [paper]
- [2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
- [2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
- [2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
- [2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
- [2024/09/06] WebQuest: A Benchmark for Multimodal QA on Web Page Sequences. [paper]
- [2024/08/01] Omniparser for pure vision based gui agent. [paper]
- [2024/07/26] OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation. [paper]
- [2024/07/22] AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? [paper]
- [2024/07/15] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [paper]
- [2024/07/13] RealWeb: A Benchmark for Universal Instruction Following in Realistic Web Services Navigation. [paper]
- [2024/07/11] UICrit: Enhancing Automated Design Evaluation with a UI Critique Dataset. [paper]
- [2024/07/07] WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks. [paper]
- [2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
- [2024/07/03] Amex: Android multi-annotation expo dataset for mobile gui agents. [paper]
- [2024/07/01] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. [paper]
- [2024/07/01] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents. [paper]
- [2024/06/27] Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding. [paper]
- [2024/06/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
- [2024/06/20] Identifying User Goals from UI Trajectories. [paper]
- [2024/06/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
- [2024/06/18] WebCanvas: Benchmarking Web Agents in Online Environments. [paper]
- [2024/06/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
- [2024/06/14] VideoGUI: A Benchmark f om Instructional Videos. [paper]
- [2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
- [2024/06/12] GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents. [paper]
- [2024/06/12] MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. [paper]
- [2024/06/06] On the Effects of Data Scale on UI Control Agents. [paper]
- [2024/06/05] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
- [2024/06/05] UGIF-DataSet: A New Dataset for Cross-lingual, Cross-modal Sequential actions on the UI. [paper]
- [2024/06/01] WebSuite: Systematically Evaluating Why Web Agents Fail. [paper]
- [2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
- [2024/05/23] AGILE: A Novel Reinforcement Learning Framework of LLM Agents. [paper]
- [2024/05/17] Latent state estimation helps ui agents to reason. [paper]
- [2024/05/07] Mapping natural language instructions to mobile UI action sequences. [paper]
- [2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
- [2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
- [2024/04/25] Benchmarking mobile device control agents across diverse configurations. [paper]
- [2024/04/15] MMInA: Benchmarking multihop multimodal internet agents. [paper]
- [2024/04/12] Llamatouch: A faithful and scalable testbed for mobile ui task automation. [paper]
- [2024/04/11] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. [paper]
- [2024/04/09] VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [paper]
- [2024/04/09] GUIDE: Graphical User Interface Data for Execution. [paper]
- [2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
- [2024/04/04] Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. [paper]
- [2024/03/29] Evaluating language model agents on realistic autonomous tasks. [paper]
- [2024/03/29] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. [paper]
- [2024/03/26] AgentStudio: A Toolkit for Building General Virtual Agents. [paper]
- [2024/03/18] Tur[k]ingBench: A Challenge Benchmark for Web Agents. [paper]
- [2024/03/15] Computer User Interface Understanding. A New Dataset and a Learning Framework. [paper]
- [2024/03/12] Workarena: How capable are web agents at solving common knowledge work tasks? [paper]
- [2024/03/06] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion. [paper]
- [2024/03/05] Android in the Zoo:Chain-of-Action-Thought for GUI Agents. [paper]
- [2024/02/27] Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. [paper]
- [2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
- [2024/02/09] Understanding the weakness of large language model agents within a complex android environment. [paper]
- [2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
- [2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
- [2024/01/25] Webvoyager: Building an end-to-end web agent with large multimodal models. [paper]
- [2024/01/25] GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning. [paper]
- [2024/01/24] Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. [paper]
- [2024/01/24] Agentboard: An analytical evaluation board of multiturn LLM agents. [paper]
- [2024/01/17] Seeclick: Harnessing gui grounding for advanced visual gui agents. [paper]
- [2023/12/26] AutoTask: Executing Arbitrary Voice Commands by Exploring and Learning from Mobile GUI. [paper]
- [2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
- [2023/12/20] ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. [paper]
- [2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
- [2023/11/21] GAIA: a benchmark for general AI assistants. [paper]
- [2023/11/13] GPT-4V in wonderland: Large multimodal models for Zero-Shot smartphone GUI navigation. [paper]
- [2023/11/03] Pptc benchmark: Evaluating large language models for powerpoint task completion. [paper]
- [2023/10/16] OpenAgents: An Open Platform for Language Agents in the Wild. [paper]
- [2023/09/20] You Only Look at Screens:Multimodal Chain-of-Action Agents. [paper]
- [2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
- [2023/08/07] Agentbench: Evaluating llms as agents. [paper]
- [2023/07/25] Webarena: A realistic web environment for building autonomous agents. [paper]
- [2023/07/19] Android in the wild: A large-scale dataset for android device control. [paper]
- [2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
- [2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
- [2023/05/25] On the tool manipulation capability of open-source large language models. [paper]
- [2023/05/14] Mobile-env: A universal platform for training and evaluation of mobile interaction. [paper]
- [2023/05/14] Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction. [paper]
- [2023/04/14] Droidbot-gpt: Gpt-powered ui automation for android. [paper]
- [2023/04/10] OpenAGI: When LLM Meets Domain Experts. [paper]
- [2023/03/13] Vision-Language models as success detectors. [paper]
- [2022/11/14] Ugif: Ui grounded instruction following. [paper]
- [2022/10/08] Understanding html with large language models. [paper]
- [2022/09/29] MUG: Interactive Multimodal Grounding on User Interfaces. [paper]
- [2022/09/16] ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots. [paper]
- [2022/07/04] Webshop: Towards scalable real-world web interaction with grounded language agents. [paper]
- [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]
- [2022/02/04] A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility. [paper]
- [2021/04/17] Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. [paper]
- [2021/01/23] Websrc: A dataset for web-based structural reading comprehension. [paper]
- [2018/08/28] Mapping natural language commands to web elements. [paper]
- [2017/11/06] Building natural language interfaces to web apis. [paper]
- [2017/08/06] World of bits: An open-domain platform for web-based agents. [paper]
- [2024/11/04] Attacking Vision-Language Computer Agents via Pop-ups. [paper]
- [2024/10/23] MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control. [paper]
- [2024/10/22] Advweb: Controllable black-box attacks on vlm-powered web agents. [paper]
- [2024/10/11] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents. [paper]
- [2024/10/09] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents. [paper]
- [2024/09/17] Eia: Environmental injection attack on generalist web agents for privacy leakage. [paper]
- [2024/08/05] Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions. [paper]
- [2024/07/12] Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study. [paper]
- [2024/06/18] Adversarial Attacks on Multimodal Agents. [paper]
- [2024/02/26] WIPI: A New Web Threat for LLM-Driven Web Agents. [paper]
- [2023/08/03] From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? [paper]
- [2024/12/19] Agent-SafetyBench: Evaluating the Safety of LLM Agents. [paper]
[OPPO] Personal AI Team Hiring Algorithm Interns / New Graduates / Experienced Candidates
OPPO is dedicated to developing the most advanced foundation model technologies and creating personalized user experiences. Our mission is focused on developing revolutionary AI Native Phones that will shape the future. The Personal AI Team is primarily focused on cutting-edge research in (multimodal) LLMs, AI Agents and personalization (e.g., team's new work AI PERSONA). The team leverages OPPO’s ample platform resources, including data and computing power, to continuously invest in these key areas. We are currently hiring algorithm interns, new graduates, and experienced candidates. We expect you to be proficient in algorithms or engineering of large foundation models. If interested, please contact via email: [email protected].
[01.AI] Foundation Model Post-training Team Hiring Algorithm Interns / New Graduates / Experienced Candidates
01.AI is a global leader in technology and applications of large foundation models. Its latest flagship model, Yi-Lightning, ranked #1 in China and #6 globally in the LMSys Chatbot Arena updated on October 14, 2024. The post-training team is responsible for cutting-edge research and engineering of post-training techniques for foundation models. We are currently hiring algorithm interns, new graduates, and experienced candidates. We expect you to be proficient in algorithms or engineering of large foundation models. If interested, please contact via email: [email protected].
To better foster community collaboration, we have listed some repositories related to OS Agents:
- aialt/awesome-mobile-agents: https://github.com/aialt/awesome-mobile-agents
- OSU-NLP-Group/GUI-Agents-Paper-List: https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List
- vyokky/LLM-Brained-GUI-Agents-Survey: https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey
- wendell0218/GVA-Survey: https://github.com/wendell0218/GVA-Survey
The repo is still being updated rapidly🚀. Please let us know if you notice any mistakes or would like any work related to OS Agents to be included in our list by e-mail: [email protected].
Considering that the current bib citation points to our repository, we will update it to point to the paper as soon as the preprint server is available. Please stay tuned for updates. Before this, if you find our repository helpful, we would appreciate it if you could cite:
@misc{hu2024osagents,
title = {OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use},
author = {Xueyu Hu and Tao Xiong and Biao Yi and Zishu Wei and Ruixuan Xiao and Yurun Chen and Jiasheng Ye and Meiling Tao and Xiangxin Zhou and Ziyu Zhao and Yuhuai Li and Shengze Xu and Shawn Wang and Xinchen Xu and Shuofei Qiao and Kun Kuang and Tieyong Zeng and Liang Wang and Jiwei Li and Yuchen Eleanor Jiang and Wangchunshu Zhou and Guoyin Wang and Keting Yin and Zhou Zhao and Hongxia Yang and Fan Wu and Shengyu Zhang and Fei Wu},
year = {2024},
howpublished = {\url{https://github.com/OS-Agent-Survey/OS-Agent-Survey/}},
}