OS Agents: A Survey on MLLM-based Agents
for General Computing Devices Use

[🌐 Website] [📜 Paper] [🐱 GitHub] [知乎 Zhihu]

This is the repo for the paper OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use. The paper conducts a comprehensive survey of OS Agents: (M)LLM-based agents that automate tasks on computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces, such as the Graphical User Interface (GUI), provided by operating systems (OS). The survey aims to consolidate the state of OS Agents research and to provide insights that guide both academic inquiry and industrial development. In this repository, we list papers relevant to our work in four areas: Foundation Models, Agent Frameworks, Evaluation & Benchmarks, and Safety & Privacy. This collection is continuously updated; we hope it offers comprehensive knowledge of the OS Agent field and helps you quickly familiarize yourself with this research direction.

❗Why is there no arXiv link for this paper?

Surprisingly, this paper was rejected by arXiv with the justification: "Our moderators determined that your submission does not contain sufficient original or substantive scholarly research and is not of interest to arXiv." This reasoning appears inconsistent with the content and contribution of the paper. We attempted an appeal, but it was unsuccessful and no further explanation was provided; a resubmission did not resolve the issue either. As a result, the ONLY way to access the paper at the moment is through our GitHub repository. We are disappointed by the lack of transparency in arXiv's moderation process.

🔔We are hiring!

(Some institutes involved in this survey are hiring. Information will be continuously updated, so please stay tuned. Detailed information is here.)

[OPPO] Personal AI Team

OPPO is seeking algorithm interns, new graduates, and experienced candidates for its Personal AI Team. Focused on multimodal LLMs, AI agents, and personalization, the team develops AI Native Phones. Interested candidates can contact: [email protected].

[01.AI] Foundation Model Post-Training Team

01.AI is hiring algorithm interns, new graduates, and experienced candidates for its post-training team. Join the team behind Yi-Lightning, ranked #1 in China and #6 globally in the LMSys Chatbot Arena (as of October 14, 2024). Interested candidates can contact: [email protected].

Table of Contents

Overview of OS Agent Survey

This survey aims to advance the research and development of OS Agents by providing a detailed exploration of their fundamental capabilities, methodologies for building them using (M)LLMs, and emerging trends in the field. While OS Agents are still in the early stages of growth, the rapid evolution of technology continues to introduce innovative approaches and applications. This work seeks to highlight ongoing challenges, future opportunities, and the latest developments, encouraging further research and industrial adoption. Ultimately, we hope this study will serve as a catalyst for innovation, driving meaningful progress in both academia and industry.

Tables

Table 1: Recent foundation models for OS Agents. Arch.: Architecture, Exist.: Existing, Mod.: Modified, Concat.: Concatenated, PT: Pre-Train, SFT: Supervised Fine-Tune, RL: Reinforcement Learning.

Paper Model Arch. PT SFT RL Date Link
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents OS-Atlas Exist. MLLMs - 10/2024 [paper]
AutoGLM: Autonomous Foundation Agents for GUIs AutoGLM Exist. LLMs 10/2024 [paper]
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data EDGE Exist. MLLMs - - 10/2024 [paper]
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms Ferret-UI 2 Exist. MLLMs - - 10/2024 [paper]
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent ShowUI Exist. MLLMs - 10/2024 [paper]
Harnessing Webpage UIs for Text-Rich Visual Understanding UIX Exist. MLLMs - - 10/2024 [paper]
TinyClick: Single-Turn Agent for Empowering GUI Automation TinyClick Exist. MLLMs - - 10/2024 [paper]
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents UGround Exist. MLLMs - - 10/2024 [paper]
NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator NNetNav Exist. LLMs - - 10/2024 [paper]
Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale Synatra Exist. LLMs - - 09/2024 [paper]
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding MobileVLM Exist. MLLMs - 09/2024 [paper]
UI-Hawk: Unleashing the screen stream understanding for gui agents UI-Hawk Mod. MLLMs - 08/2024 [paper]
GUI Action Narrator: Where and When Did That Action Take Place? GUI Action Narrator Exist. MLLMs - - 07/2024 [paper]
MobileFlow: A Multimodal LLM for Mobile GUI Agent MobileFlow Mod. MLLMs - 07/2024 [paper]
VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning VGA Exist. MLLMs - - 06/2024 [paper]
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices OdysseyAgent Exist. MLLMs - - 06/2024 [paper]
Tell Me What's Next: Textual Foresight for Generic UI Representations Textual Foresight Concat. MLLMs - 06/2024 [paper]
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning WebAI Concat. MLLMs - 05/2024 [paper]
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning GLAINTEL Exist. LLMs - - 04/2024 [paper]
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Ferret-UI Exist. MLLMs - - 04/2024 [paper]
AutoWebGLM: A Large Language Model-based Web Navigating Agent AutoWebGLM Exist. LLMs - 04/2024 [paper]
Large Language Models Can Self-Improve At Web Agent Tasks - Exist. LLMs - - 03/2024 [paper]
ScreenAI: A Vision-Language Model for UI and Infographics Understanding ScreenAI Exist. MLLMs - 02/2024 [paper]
Dual-View Visual Contextualization for Web Navigation Dual-VCR Concat. MLLMs - - 02/2024 [paper]
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents SeeClick Exist. MLLMs - 01/2024 [paper]
CogAgent: A Visual Language Model for GUI Agents CogAgent Mod. MLLMs - 12/2023 [paper]
ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations ILuvUI Mod. MLLMs - - 10/2023 [paper]
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API RUIG Concat. MLLMs - - 10/2023 [paper]
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis WebAgent Concat. MLLMs - 07/2023 [paper]
Multimodal Web Navigation with Instruction-Finetuned Foundation Models WebGUM Concat. MLLMs - - 05/2023 [paper]

Table 2: Recent agent frameworks for OS Agents. TD: Textual Description, GS: GUI Screenshots, VG: Visual Grounding, SG: Semantic Grounding, DG: Dual Grounding, GL: Global, IT: Iterative, AE: Automated Exploration, EA: Experience-Augmented, MA: Management, IO: Input Operations, NO: Navigation Operations, EO: Extended Operations.

Paper Model Perception Planning Memory Action Date Link
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization OpenWebVoyager GS,SG - - IO,NO 10/2024 [paper]
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning OSCAR GS,DG IT AE EO 10/2024 [paper]
Large Language Models Empowered Personalized Web Agents PUMA TD - - IO,NO,EO 10/2024 [paper]
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents AgentOccam TD IT MA IO,NO 10/2024 [paper]
Agent S: An Open Agentic Framework that Uses Computers Like a Human Agent S GS,SG GL EA,AE,MA IO,NO 10/2024 [paper]
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents ClickAgent GS IT AE IO,NO 10/2024 [paper]
From Commands to Prompts: LLM-based Semantic File System for AIOS LSFS GS,SG - - EO 09/2024 [paper]
NaviQAte: Functionality-Guided Web Application Navigation NaviQate GS,SG - - IO 09/2024 [paper]
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM PeriGuru GS,DG IT EA,AE IO,NO 09/2024 [paper]
OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models OpenWebAgent GS,DG - - IO 08/2024 [paper]
Towards LLMCI: Multimodal AI for LLM-Vision UI Operation LLMCI GS,SG - - EO 07/2024 [paper]
Agent-e: From autonomous web navigation to foundational design principles in agentic systems Agent-E TD IT AE,MA IO,NO 07/2024 [paper]
Cradle: Empowering Foundation Agents Towards General Computer Control Cradle GS IT EA,AE,MA EO 03/2024 [paper]
Android in the zoo: Chain-of-action-thought for gui agents CoAT GS IT - IO,NO 03/2024 [paper]
On the Multi-turn Instruction Following for Conversational Web Agents Self-MAP - IT EA IO 02/2024 [paper]
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement OS-Copilot TD GL EA,AE IO,EO 02/2024 [paper]
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception Mobile-Agent GS,SG IT AE IO,NO 01/2024 [paper]
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models WebVoyager GS,VG IT MA IO,NO 01/2024 [paper]
MobileAgent: enhancing mobile control via human-machine interaction and SOP integration AIA GS,VG GL - IO,NO 01/2024 [paper]
GPT-4V(ision) is a Generalist Web Agent, if Grounded SeeAct GS,SG - AE IO 01/2024 [paper]
AppAgent: Multimodal Agents as Smartphone Users AppAgent GS,DG IT AE IO,NO 12/2023 [paper]
Assistgui: Task-oriented desktop graphical user interface automation ACE TD GL AE IO,NO 12/2023 [paper]
MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation MobileGPT TD GL MA IO,NO 12/2023 [paper]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation MM-Navigator GS,VG - MA IO,NO 11/2023 [paper]
WebWISE: Web Interface Control and Sequential Exploration with Large Language Models WebWise TD - MA IO,NO 10/2023 [paper]
A Zero-Shot Language Agent for Computer Control with Structured Reflection - TD IT AE IO,NO 10/2023 [paper]
Laser: Llm agent with state-space exploration for web navigation Laser TD IT AE IO,NO 09/2023 [paper]
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control Synapse - - MA IO 06/2023 [paper]
Sheetcopilot: Bringing software productivity to the next level through large language models SheetCopilot TD IT AE EO 05/2023 [paper]
Language Models can Solve Computer Tasks RCI - IT AE IO,NO 03/2023 [paper]
Enabling conversational interaction with mobile ui using large language models - TD - - IO 09/2022 [paper]
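
The component taxonomy of Table 2 can be made concrete with a minimal sketch. This is an illustrative toy, not code from the survey or from any listed framework; all class and method names here are hypothetical. It shows the four-module loop the table's columns describe: perception produces an observation, planning turns it into a step, action executes it, and memory records the trajectory.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Experience store (cf. EA/MA in Table 2): past (observation, action) steps."""
    trajectory: list = field(default_factory=list)

    def record(self, observation, action):
        self.trajectory.append((observation, action))

class OSAgent:
    """Hypothetical agent skeleton following the Perception/Planning/Memory/Action split."""

    def __init__(self):
        self.memory = Memory()

    def perceive(self, screen):
        # Perception: e.g., a GUI screenshot (GS) or a textual description (TD).
        return {"screen": screen}

    def plan(self, observation, goal):
        # Planning: a global (GL) planner plans up front; an iterative (IT) one
        # replans after every observation, as done here.
        return f"click element for '{goal}'"

    def act(self, plan):
        # Action: input (IO), navigation (NO), or extended (EO) operations.
        return {"type": "IO", "command": plan}

    def step(self, screen, goal):
        obs = self.perceive(screen)
        action = self.act(self.plan(obs, goal))
        self.memory.record(obs, action)
        return action

agent = OSAgent()
action = agent.step(screen="login_page.png", goal="open settings")
print(action["type"])                 # IO
print(len(agent.memory.trajectory))   # 1
```

Real frameworks in the table differ mainly in how each module is instantiated (e.g., dual grounding for perception, or experience-augmented memory), not in this overall loop.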

Table 3: Recent benchmarks for OS Agents. We divided the benchmarks into three sections based on platform and sorted them by release date. Abbreviations: BS: Benchmark Settings, M/P: Mobile, PC: Desktop, IT: Interactive, ST: Static, OET: Operation Environment Types, RW: Real-World, SM: Simulated, GG: GUI Grounding, IF: Information Processing, AT: Agentic, CG: Code Generation.

Paper Benchmark Platform BS OET Task Date Link
On the effects of data scale on computer control agents AndroidControl M/P ST - AT 06/2024 [paper]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents AndroidWorld M/P IT RW AT 05/2024 [paper]
Latent state estimation helps ui agents to reason Android-50 (A-50) M/P IT RW AT 05/2024 [paper]
Benchmarking mobile device control agents across diverse configurations B-MoCA M/P IT RW AT 04/2024 [paper]
Llamatouch: A faithful and scalable testbed for mobile ui task automation LlamaTouch M/P IT RW AT 04/2024 [paper]
Understanding the weakness of large language model agents within a complex android environment AndroidArena M/P IT RW AT 02/2024 [paper]
Android in the wild: A large-scale dataset for android device control AITW M/P ST - AT 07/2023 [paper]
Ugif: Ui grounded instruction following UGIF-DataSet M/P ST - AT 11/2022 [paper]
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility MoTIF M/P ST - AT 02/2022 [paper]
Mapping natural language instructions to mobile UI action sequences PIXELHELP M/P IT RW GG 05/2020 [paper]
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale WindowsAgentArena PC IT RW AT 09/2024 [paper]
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation OfficeBench PC IT RW AT 07/2024 [paper]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments OSWorld PC IT RW AT 04/2024 [paper]
Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web OmniACT PC ST - GG 02/2024 [paper]
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation ASSISTGUI PC IT RW AT 12/2023 [paper]
WebCanvas: Benchmarking Web Agents in Online Environments Mind2Web-Live Web IT RW IF,AT 06/2024 [paper]
MMInA: Benchmarking multihop multimodal internet agents MMInA Web IT RW IF,AT 04/2024 [paper]
AgentStudio: A Toolkit for Building General Virtual Agents GroundUI Web ST - GG 03/2024 [paper]
Tur[k]ingBench: A Challenge Benchmark for Web Agents TurkingBench Web IT RW AT 03/2024 [paper]
Workarena: How capable are web agents at solving common knowledge work tasks? WorkArena Web IT RW IF,AT 03/2024 [paper]
WebLINX: Real-World website navigation with Multi-Turn dialogue WebLINX Web ST - IF,AT 02/2024 [paper]
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks VisualWebArena Web IT RW GG,AT 01/2024 [paper]
WebVLN: Vision-and-Language Navigation on Websites WebVLN-v1 Web IT RW IF,AT 12/2023 [paper]
Webarena: A realistic web environment for building autonomous agents WebArena Web IT RW AT 07/2023 [paper]
Mind2Web: Towards a Generalist Agent for the Web Mind2Web Web ST - IF,AT 06/2023 [paper]
Webshop: Towards scalable real-world web interaction with grounded language agents WebShop Web ST - AT 07/2022 [paper]
Mapping natural language commands to web elements PhraseNode Web ST - GG 08/2018 [paper]
World of bits: An open-domain platform for web-based agents MiniWoB Web ST - AT 08/2017 [paper]
World of bits: An open-domain platform for web-based agents FormWoB Web IT SM AT 08/2017 [paper]
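
To make the Table 3 column legend concrete, the sketch below encodes it as data types. This is a hypothetical illustration of the abbreviation scheme only; the names `Setting`, `Environment`, and `Benchmark` are not from the survey.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Setting(Enum):
    """BS column: whether the agent acts in a live environment or on fixed data."""
    INTERACTIVE = "IT"
    STATIC = "ST"

class Environment(Enum):
    """OET column: only meaningful for interactive benchmarks."""
    REAL_WORLD = "RW"
    SIMULATED = "SM"

@dataclass
class Benchmark:
    name: str
    platform: str                      # Mobile (M/P), Desktop (PC), or Web
    setting: Setting
    environment: Optional[Environment]  # static benchmarks have no live environment
    tasks: tuple                        # subset of {"GG", "IF", "AT", "CG"}

# Two rows from Table 3, transcribed into this scheme:
osworld = Benchmark("OSWorld", "Desktop", Setting.INTERACTIVE,
                    Environment.REAL_WORLD, ("AT",))
aitw = Benchmark("AITW", "Mobile", Setting.STATIC, None, ("AT",))

print(osworld.setting.value)   # IT
print(aitw.environment)        # None
```

This also explains why the OET column is "-" for every static (ST) benchmark in the table: a fixed dataset of recorded trajectories has no operation environment to classify.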

Full List

Foundation Models

  1. [2024/12/13] Falcon-UI: Understanding GUI Before Following User Instructions. [paper]
  2. [2024/12/13] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials. [paper]
  3. [2024/12/05] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
  4. [2024/11/26] ShowUI: One Vision-Language-Action Model for GUI Visual Agent. [paper]
  5. [2024/11/22] ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data. [paper]
  6. [2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
  7. [2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
  8. [2024/10/30] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. [paper]
  9. [2024/10/28] AutoGLM: Autonomous Foundation Agents for GUIs. [paper]
  10. [2024/10/25] EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data. [paper]
  11. [2024/10/24] Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms. [paper]
  12. [2024/10/22] ShowUI: One Vision-Language-Action Model for Generalist GUI Agent. [paper]
  13. [2024/10/17] Harnessing Webpage UIs for Text-Rich Visual Understanding. [paper]
  14. [2024/10/09] TinyClick: Single-Turn Agent for Empowering GUI Automation. [paper]
  15. [2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
  16. [2024/10/03] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator. [paper]
  17. [2024/09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. [paper]
  18. [2024/09/24] Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale. [paper]
  19. [2024/09/23] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding. [paper]
  20. [2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
  21. [2024/08/30] UI-Hawk: Unleashing the screen stream understanding for gui agents. [paper]
  22. [2024/07/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
  23. [2024/07/05] MobileFlow: A Multimodal LLM for Mobile GUI Agent. [paper]
  24. [2024/06/20] VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning. [paper]
  25. [2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
  26. [2024/06/12] Tell Me What's Next: Textual Foresight for Generic UI Representations. [paper]
  27. [2024/05/08] Visual Grounding for User Interfaces. [paper]
  28. [2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
  29. [2024/05/05] Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [paper]
  30. [2024/05/01] Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning. [paper]
  31. [2024/04/16] Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning. [paper]
  32. [2024/04/12] Training a Vision Language Model as Smartphone Assistant. [paper]
  33. [2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
  34. [2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
  35. [2024/04/04] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
  36. [2024/04/02] Octopus v2: On-device language model for super agent. [paper]
  37. [2024/03/30] Large Language Models Can Self-Improve At Web Agent Tasks. [paper]
  38. [2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
  39. [2024/02/07] ScreenAI: A Vision-Language Model for UI and Infographics Understanding. [paper]
  40. [2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
  41. [2024/01/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
  42. [2024/01/17] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. [paper]
  43. [2024/01/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
  44. [2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
  45. [2023/12/25] UINav: A Practical Approach to Train On-Device Automation Agents. [paper]
  46. [2023/12/14] CogAgent: A Visual Language Model for GUI Agents. [paper]
  47. [2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
  48. [2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
  49. [2023/10/27] Android in the wild: A large-scale dataset for android device control. [paper]
  50. [2023/10/08] UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. [paper]
  51. [2023/10/07] ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations. [paper]
  52. [2023/10/07] Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API. [paper]
  53. [2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
  54. [2023/05/31] From pixels to UI actions: Learning to follow instructions via graphical user interfaces. [paper]
  55. [2023/05/19] Multimodal Web Navigation with Instruction-Finetuned Foundation Models. [paper]
  56. [2023/01/30] WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics. [paper]
  57. [2023/01/23] Lexi: Self-Supervised Learning of the UI Language. [paper]
  58. [2022/10/06] Towards Better Semantic Understanding of Mobile Interfaces. [paper]
  59. [2022/09/29] Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus. [paper]
  60. [2022/07/04] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. [paper]
  61. [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]
  62. [2022/02/16] A data-driven approach for learning to control computers. [paper]

Agent Frameworks

  1. [2024/12/02] Ponder & Press: Advancing Visual GUI Agent towards General Computer Control. [paper]
  2. [2024/11/20] AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. [paper]
  3. [2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
  4. [2024/11/15] The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use. [paper]
  5. [2024/11/10] Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents. [paper]
  6. [2024/11/01] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
  7. [2024/10/29] Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents. [paper]
  8. [2024/10/25] OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization. [paper]
  9. [2024/10/24] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. [paper]
  10. [2024/10/24] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant. [paper]
  11. [2024/10/24] Infogent: An Agent-Based Framework for Web Information Aggregation. [paper]
  12. [2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
  13. [2024/10/21] Beyond Browsing: API-Based Web Agents. [paper]
  14. [2024/10/17] AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. [paper]
  15. [2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
  16. [2024/10/11] VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. [paper]
  17. [2024/10/10] Agent S: An Open Agentic Framework that Uses Computers Like a Human. [paper]
  18. [2024/10/09] ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents. [paper]
  19. [2024/10/01] Dynamic Planning for LLM-based Graphical User Interface Automation. [paper]
  20. [2024/10/01] Multimodal Auto Validation For Self-Refinement in Web Agents. [paper]
  21. [2024/09/25] Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents. [paper]
  22. [2024/09/23] From Commands to Prompts: LLM-based Semantic File System for AIOS. [paper]
  23. [2024/09/23] Steward: Natural Language Web Automation. [paper]
  24. [2024/09/16] NaviQAte: Functionality-Guided Web Application Navigation. [paper]
  25. [2024/09/14] PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM. [paper]
  26. [2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
  27. [2024/09/11] Agent Workflow Memory. [paper]
  28. [2024/08/28] Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. [paper]
  29. [2024/08/24] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
  30. [2024/08/24] Intelligent Agents with LLM-based Process Automation. [paper]
  31. [2024/08/01] OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models. [paper]
  32. [2024/08/01] Omniparser for pure vision based gui agent. [paper]
  33. [2024/07/21] Towards LLMCI: Multimodal AI for LLM-Vision UI Operation. [paper]
  34. [2024/07/17] Agent-e: From autonomous web navigation to foundational design principles in agentic systems. [paper]
  35. [2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
  36. [2024/07/01] Tree Search for Language Model Agents. [paper]
  37. [2024/06/27] Read anywhere pointed: Layout-aware gui screen reading with tree-of-lens grounding. [paper]
  38. [2024/06/11] CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only. [paper]
  39. [2024/06/03] Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. [paper]
  40. [2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
  41. [2024/05/17] Latent State Estimation Helps UI Agents to Reason. [paper]
  42. [2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
  43. [2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
  44. [2024/04/03] PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts. [paper]
  45. [2024/03/25] AIOS: LLM Agent Operating System. [paper]
  46. [2024/03/05] Android in the zoo: Chain-of-action-thought for gui agents. [paper]
  47. [2024/03/05] Cradle: Empowering Foundation Agents Towards General Computer Control. [paper]
  48. [2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
  49. [2024/02/19] CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation. [paper]
  50. [2024/02/12] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. [paper]
  51. [2024/02/09] ScreenAgent: A Vision Language Model-driven Computer Control Agent. [paper]
  52. [2024/02/08] Ufo: A ui-focused agent for windows os interaction. [paper]
  53. [2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
  54. [2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
  55. [2024/01/25] WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. [paper]
  56. [2024/01/17] Seeclick: Harnessing gui grounding for advanced visual gui agents. [paper]
  57. [2024/01/04] MobileAgent: enhancing mobile control via human-machine interaction and SOP integration. [paper]
  58. [2024/01/03] GPT-4V(ision) is a Generalist Web Agent, if Grounded. [paper]
  59. [2023/12/21] AppAgent: Multimodal Agents as Smartphone Users. [paper]
  60. [2023/12/20] Assistgui: Task-oriented desktop graphical user interface automation. [paper]
  61. [2023/12/04] MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation. [paper]
  62. [2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
  63. [2023/11/13] GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [paper]
  64. [2023/10/24] WebWISE: Web Interface Control and Sequential Exploration with Large Language Models. [paper]
  65. [2023/10/12] A Zero-Shot Language Agent for Computer Control with Structured Reflection. [paper]
  66. [2023/09/20] You only look at screens: Multimodal chain-of-action agents. [paper]
  67. [2023/09/15] Laser: Llm agent with state-space exploration for web navigation. [paper]
  68. [2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
  69. [2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
  70. [2023/06/14] Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. [paper]
  71. [2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
  72. [2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
  73. [2023/05/23] Hierarchical prompting assists large language model on web navigation. [paper]
  74. [2023/05/09] InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language. [paper]
  75. [2023/03/30] Language Models can Solve Computer Tasks. [paper]
  76. [2022/09/19] Enabling conversational interaction with mobile ui using large language models. [paper]
  77. [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]

Evaluation & Benchmarks

  1. [2024/12/09] BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks. [paper]
  2. [2024/12/05] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
  3. [2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
  4. [2024/10/28] AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation. [paper]
  5. [2024/10/28] Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models. [paper]
  6. [2024/10/28] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. [paper]
  7. [2024/10/24] VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks. [paper]
  8. [2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
  9. [2024/10/19] SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation. [paper]
  10. [2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
  11. [2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
  12. [2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
  13. [2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
  14. [2024/09/06] WebQuest: A Benchmark for Multimodal QA on Web Page Sequences. [paper]
  15. [2024/08/01] Omniparser for pure vision based gui agent. [paper]
  16. [2024/07/26] OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation. [paper]
  17. [2024/07/22] AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? [paper]
  18. [2024/07/15] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [paper]
  19. [2024/07/13] RealWeb: A Benchmark for Universal Instruction Following in Realistic Web Services Navigation. [paper]
  20. [2024/07/11] UICrit: Enhancing Automated Design Evaluation with a UI Critique Dataset. [paper]
  21. [2024/07/07] WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks. [paper]
  22. [2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
  23. [2024/07/03] Amex: Android multi-annotation expo dataset for mobile gui agents. [paper]
  24. [2024/07/01] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. [paper]
  25. [2024/07/01] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents. [paper]
  26. [2024/06/27] Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding. [paper]
  27. [2024/06/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
  28. [2024/06/20] Identifying User Goals from UI Trajectories. [paper]
  29. [2024/06/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
  30. [2024/06/18] WebCanvas: Benchmarking Web Agents in Online Environments. [paper]
  31. [2024/06/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
  32. [2024/06/14] VideoGUI: A Benchmark for GUI Automation from Instructional Videos. [paper]
  33. [2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
  34. [2024/06/12] GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents. [paper]
  35. [2024/06/12] MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. [paper]
  36. [2024/06/06] On the Effects of Data Scale on UI Control Agents. [paper]
  37. [2024/06/05] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
  38. [2024/06/05] UGIF-DataSet: A New Dataset for Cross-lingual, Cross-modal Sequential actions on the UI. [paper]
  39. [2024/06/01] WebSuite: Systematically Evaluating Why Web Agents Fail. [paper]
  40. [2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
  41. [2024/05/23] AGILE: A Novel Reinforcement Learning Framework of LLM Agents. [paper]
  42. [2024/05/17] Latent state estimation helps ui agents to reason. [paper]
  43. [2024/05/07] Mapping natural language instructions to mobile UI action sequences. [paper]
  44. [2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
  45. [2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
  46. [2024/04/25] Benchmarking mobile device control agents across diverse configurations. [paper]
  47. [2024/04/15] MMInA: Benchmarking multihop multimodal internet agents. [paper]
  48. [2024/04/12] Llamatouch: A faithful and scalable testbed for mobile ui task automation. [paper]
  49. [2024/04/11] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. [paper]
  50. [2024/04/09] VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [paper]
  51. [2024/04/09] GUIDE: Graphical User Interface Data for Execution. [paper]
  52. [2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
  53. [2024/04/04] Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. [paper]
  54. [2024/03/29] Evaluating language model agents on realistic autonomous tasks. [paper]
  55. [2024/03/29] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. [paper]
  56. [2024/03/26] AgentStudio: A Toolkit for Building General Virtual Agents. [paper]
  57. [2024/03/18] Tur[k]ingBench: A Challenge Benchmark for Web Agents. [paper]
  58. [2024/03/15] Computer User Interface Understanding. A New Dataset and a Learning Framework. [paper]
  59. [2024/03/12] Workarena: How capable are web agents at solving common knowledge work tasks? [paper]
  60. [2024/03/06] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion. [paper]
  61. [2024/03/05] Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [paper]
  62. [2024/02/27] Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. [paper]
  63. [2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
  64. [2024/02/09] Understanding the weakness of large language model agents within a complex android environment. [paper]
  65. [2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
  66. [2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
  67. [2024/01/25] Webvoyager: Building an end-to-end web agent with large multimodal models. [paper]
  68. [2024/01/25] GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning. [paper]
  69. [2024/01/24] Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. [paper]
  70. [2024/01/24] Agentboard: An analytical evaluation board of multiturn LLM agents. [paper]
  71. [2024/01/17] Seeclick: Harnessing gui grounding for advanced visual gui agents. [paper]
  72. [2023/12/26] AutoTask: Executing Arbitrary Voice Commands by Exploring and Learning from Mobile GUI. [paper]
  73. [2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
  74. [2023/12/20] ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. [paper]
  75. [2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
  76. [2023/11/21] GAIA: a benchmark for general AI assistants. [paper]
  77. [2023/11/13] GPT-4V in wonderland: Large multimodal models for Zero-Shot smartphone GUI navigation. [paper]
  78. [2023/11/03] Pptc benchmark: Evaluating large language models for powerpoint task completion. [paper]
  79. [2023/10/16] OpenAgents: An Open Platform for Language Agents in the Wild. [paper]
  80. [2023/09/20] You Only Look at Screens: Multimodal Chain-of-Action Agents. [paper]
  81. [2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
  82. [2023/08/07] Agentbench: Evaluating llms as agents. [paper]
  83. [2023/07/25] Webarena: A realistic web environment for building autonomous agents. [paper]
  84. [2023/07/19] Android in the wild: A large-scale dataset for android device control. [paper]
  85. [2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
  86. [2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
  87. [2023/05/25] On the tool manipulation capability of open-source large language models. [paper]
  88. [2023/05/14] Mobile-env: A universal platform for training and evaluation of mobile interaction. [paper]
  89. [2023/05/14] Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction. [paper]
  90. [2023/04/14] Droidbot-gpt: Gpt-powered ui automation for android. [paper]
  91. [2023/04/10] OpenAGI: When LLM Meets Domain Experts. [paper]
  92. [2023/03/13] Vision-Language models as success detectors. [paper]
  93. [2022/11/14] Ugif: Ui grounded instruction following. [paper]
  94. [2022/10/08] Understanding html with large language models. [paper]
  95. [2022/09/29] MUG: Interactive Multimodal Grounding on User Interfaces. [paper]
  96. [2022/09/16] ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots. [paper]
  97. [2022/07/04] Webshop: Towards scalable real-world web interaction with grounded language agents. [paper]
  98. [2022/05/23] Meta-gui: Towards multi-modal conversational agents on mobile gui. [paper]
  99. [2022/02/04] A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility. [paper]
  100. [2021/04/17] Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. [paper]
  101. [2021/01/23] Websrc: A dataset for web-based structural reading comprehension. [paper]
  102. [2018/08/28] Mapping natural language commands to web elements. [paper]
  103. [2017/11/06] Building natural language interfaces to web apis. [paper]
  104. [2017/08/06] World of bits: An open-domain platform for web-based agents. [paper]

Safety & Privacy

  1. [2024/11/04] Attacking Vision-Language Computer Agents via Pop-ups. [paper]
  2. [2024/10/23] MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control. [paper]
  3. [2024/10/22] Advweb: Controllable black-box attacks on vlm-powered web agents. [paper]
  4. [2024/10/11] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents. [paper]
  5. [2024/10/09] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents. [paper]
  6. [2024/09/17] Eia: Environmental injection attack on generalist web agents for privacy leakage. [paper]
  7. [2024/08/05] Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions. [paper]
  8. [2024/07/12] Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study. [paper]
  9. [2024/06/18] Adversarial Attacks on Multimodal Agents. [paper]
  10. [2024/02/26] WIPI: A New Web Threat for LLM-Driven Web Agents. [paper]
  11. [2023/08/03] From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? [paper]

Hiring

[OPPO] Personal AI Team Hiring Algorithm Interns / New Graduates / Experienced Candidates

OPPO is dedicated to developing the most advanced foundation model technologies and creating personalized user experiences. Our mission is to develop revolutionary AI Native Phones that will shape the future. The Personal AI Team focuses on cutting-edge research in (multimodal) LLMs, AI Agents, and personalization, leveraging OPPO's ample platform resources, including data and computing power, to continuously invest in these key areas. We are currently hiring algorithm interns, new graduates, and experienced candidates. We expect you to be proficient in algorithms or engineering of large foundation models. If interested, please contact us via email: [email protected].

[01.AI] Foundation Model Post-training Team Hiring Algorithm Interns / New Graduates / Experienced Candidates

01.AI is a global leader in technology and applications of large foundation models. Its latest flagship model, Yi-Lightning, ranked #1 in China and #6 globally on the LMSys Chatbot Arena leaderboard updated on October 14, 2024. The post-training team is responsible for cutting-edge research and engineering of post-training techniques for foundation models. We are currently hiring algorithm interns, new graduates, and experienced candidates. We expect you to be proficient in algorithms or engineering of large foundation models. If interested, please contact us via email: [email protected].

Useful Links

  1. aialt/awesome-mobile-agents: https://github.com/aialt/awesome-mobile-agents
  2. OSU-NLP-Group/GUI-Agents-Paper-List: https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List
  3. vyokky/LLM-Brained-GUI-Agents-Survey: https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey

Contact

This repo is still being updated rapidly 🚀. If you notice any mistakes, or would like any work related to OS Agents to be included in our list, please let us know by e-mail: [email protected].

Citation

The BibTeX entry below currently points to our repository; we will update it to point to the paper as soon as it is available on a preprint server. Please stay tuned for updates. Until then, if you find our repository helpful, we would appreciate it if you could cite:

@misc{hu2024osagents,  
  title        = {OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use},  
  author       = {Xueyu Hu and Tao Xiong and Biao Yi and Zishu Wei and Ruixuan Xiao and Yurun Chen and Jiasheng Ye and Meiling Tao and Xiangxin Zhou and Ziyu Zhao and Yuhuai Li and Shengze Xu and Shawn Wang and Xinchen Xu and Shuofei Qiao and Kun Kuang and Tieyong Zeng and Liang Wang and Jiwei Li and Yuchen Eleanor Jiang and Wangchunshu Zhou and Guoyin Wang and Keting Yin and Zhou Zhao and Hongxia Yang and Fan Wu and Shengyu Zhang and Fei Wu},  
  year         = {2024},  
  howpublished = {\url{https://github.com/OS-Agent-Survey/OS-Agent-Survey/}},  
}  

