OS-Agent-Survey

OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use

[🌐 Website] • [📜 Paper] • [🐱 GitHub] • [Zhihu] • [OpenReview] • [Twitter]

This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use". The paper presents a comprehensive survey of OS Agents: (M)LLM-based agents that use computers, phones and browsers by operating within the environments and interfaces provided by operating systems (OS), such as the Graphical User Interface (GUI) and Command Line Interface (CLI), to automate tasks. The survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. In this repository, we list papers related to our work in four areas: Foundation Models, Agent Frameworks, Evaluation & Benchmarks, and Safety & Privacy. The collection will be continuously updated. We hope it helps you quickly become familiar with this research direction.

❗Why is there no arXiv link for this paper?

Surprisingly, arXiv rejected this paper with the justification: "Our moderators determined that your submission does not contain sufficient original or substantive scholarly research and is not of interest to arXiv." This reasoning appears inconsistent with the content and contribution of the paper. We appealed, but the appeal was unsuccessful and no further explanation was provided. A resubmission did not resolve the issue either. As a result, the ONLY way to access the paper at the moment is through our GitHub repository or the OpenReview Archive. We are disappointed by the lack of transparency in arXiv’s moderation process.

🔔We are hiring!

(Some teams involved in this survey are hiring. Information will be continuously updated, so please stay tuned. Detailed information is here.)

[OPPO] Personal AI Team

OPPO is seeking algorithm interns, new graduates, and experienced candidates for its Personal AI Team. Focused on multimodal LLMs, AI Agents, and personalization, the team works to develop AI Native Phones. Interested candidates, contact: [email protected].

Overview of OS Agent Survey

This survey aims to advance the research and development of OS Agents by providing a detailed exploration of their fundamental capabilities, methodologies for building them using (M)LLMs, and emerging trends in the field. While OS Agents are still in the early stages of growth, the rapid evolution of technology continues to introduce innovative approaches and applications. This work seeks to highlight ongoing challenges, future opportunities, and the latest developments, encouraging further research and industrial adoption. Ultimately, we hope this study will serve as a catalyst for innovation, driving meaningful progress in both academia and industry.

Tables

Table 1: Recent foundation models for OS Agents. Arch.: Architecture, Exist.: Existing, Mod.: Modified, Concat.: Concatenated, PT: Pre-Train, SFT: Supervised Fine-Tune, RL: Reinforcement Learning.

Paper	Model	Arch.	PT	SFT	RL	Date	Link
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	OS-Atlas	Exist. MLLMs	√	√	-	10/2024	[paper]
AutoGLM: Autonomous Foundation Agents for GUIs	AutoGLM	Exist. LLMs	√	√	√	10/2024	[paper]
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data	EDGE	Exist. MLLMs	-	√	-	10/2024	[paper]
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms	Ferret-UI 2	Exist. MLLMs	-	√	-	10/2024	[paper]
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent	ShowUI	Exist. MLLMs	√	√	-	10/2024	[paper]
Harnessing Webpage UIs for Text-Rich Visual Understanding	UIX	Exist. MLLMs	-	√	-	10/2024	[paper]
TinyClick: Single-Turn Agent for Empowering GUI Automation	TinyClick	Exist. MLLMs	√	-	-	10/2024	[paper]
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	UGround	Exist. MLLMs	-	√	-	10/2024	[paper]
NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator	NNetNav	Exist. LLMs	-	√	-	10/2024	[paper]
Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale	Synatra	Exist. LLMs	-	√	-	09/2024	[paper]
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding	MobileVLM	Exist. MLLMs	√	√	-	09/2024	[paper]
UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents	UI-Hawk	Mod. MLLMs	√	√	-	08/2024	[paper]
GUI Action Narrator: Where and When Did That Action Take Place?	GUI Action Narrator	Exist. MLLMs	-	√	-	07/2024	[paper]
MobileFlow: A Multimodal LLM for Mobile GUI Agent	MobileFlow	Mod. MLLMs	√	√	-	07/2024	[paper]
VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning	VGA	Exist. MLLMs	-	√	-	06/2024	[paper]
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices	OdysseyAgent	Exist. MLLMs	-	√	-	06/2024	[paper]
Tell Me What's Next: Textual Foresight for Generic UI Representations	Textual Foresight	Concat. MLLMs	√	√	-	06/2024	[paper]
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning	WebAI	Concat. MLLMs	-	√	√	05/2024	[paper]
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning	GLAINTEL	Exist. LLMs	-	-	√	04/2024	[paper]
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs	Ferret-UI	Exist. MLLMs	-	√	-	04/2024	[paper]
AutoWebGLM: A Large Language Model-based Web Navigating Agent	AutoWebGLM	Exist. LLMs	-	√	√	04/2024	[paper]
Large Language Models Can Self-Improve At Web Agent Tasks	-	Exist. LLMs	-	√	-	03/2024	[paper]
ScreenAI: A Vision-Language Model for UI and Infographics Understanding	ScreenAI	Exist. MLLMs	√	√	-	02/2024	[paper]
Dual-View Visual Contextualization for Web Navigation	Dual-VCR	Concat. MLLMs	-	√	-	02/2024	[paper]
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents	SeeClick	Exist. MLLMs	√	√	-	01/2024	[paper]
CogAgent: A Visual Language Model for GUI Agents	CogAgent	Mod. MLLMs	√	√	-	12/2023	[paper]
ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations	ILuvUI	Mod. MLLMs	-	√	-	10/2023	[paper]
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API	RUIG	Concat. MLLMs	-	-	√	10/2023	[paper]
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	WebAgent	Concat. MLLMs	√	√	-	07/2023	[paper]
Multimodal Web Navigation with Instruction-Finetuned Foundation Models	WebGUM	Concat. MLLMs	-	√	-	05/2023	[paper]

Table 2: Recent agent frameworks for OS Agents. TD: Textual Description, GS: GUI Screenshots, VG: Visual Grounding, SG: Semantic Grounding, DG: Dual Grounding, GL: Global, IT: Iterative, AE: Automated Exploration, EA: Experience-Augmented, MA: Management, IO: Input Operations, NO: Navigation Operations, EO: Extended Operations.

Paper	Model	Perception	Planning	Memory	Action	Date	Link
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization	OpenWebVoyager	GS,SG	-	-	IO,NO	10/2024	[paper]
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning	OSCAR	GS,DG	IT	AE	EO	10/2024	[paper]
Large Language Models Empowered Personalized Web Agents	PUMA	TD	-	-	IO,NO,EO	10/2024	[paper]
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents	AgentOccam	TD	IT	MA	IO,NO	10/2024	[paper]
Agent S: An Open Agentic Framework that Uses Computers Like a Human	Agent S	GS,SG	GL	EA,AE,MA	IO,NO	10/2024	[paper]
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents	ClickAgent	GS	IT	AE	IO,NO	10/2024	[paper]
From Commands to Prompts: LLM-based Semantic File System for AIOS	LSFS	GS,SG	-	-	EO	09/2024	[paper]
NaviQAte: Functionality-Guided Web Application Navigation	NaviQate	GS,SG	-	-	IO	09/2024	[paper]
PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM	PeriGuru	GS,DG	IT	EA,AE	IO,NO	09/2024	[paper]
OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models	OpenWebAgent	GS,DG	-	-	IO	08/2024	[paper]
Towards LLMCI: Multimodal AI for LLM-Vision UI Operation	LLMCI	GS,SG	-	-	EO	07/2024	[paper]
Agent-e: From autonomous web navigation to foundational design principles in agentic systems	Agent-E	TD	IT	AE,MA	IO,NO	07/2024	[paper]
Cradle: Empowering Foundation Agents Towards General Computer Control	Cradle	GS	IT	EA,AE,MA	EO	03/2024	[paper]
Android in the zoo: Chain-of-action-thought for gui agents	CoAT	GS	IT	-	IO,NO	03/2024	[paper]
On the Multi-turn Instruction Following for Conversational Web Agents	Self-MAP	-	IT	EA	IO	02/2024	[paper]
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement	OS-Copilot	TD	GL	EA,AE	IO,EO	02/2024	[paper]
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception	Mobile-Agent	GS,SG	IT	AE	IO,NO	01/2024	[paper]
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models	WebVoyager	GS,VG	IT	MA	IO,NO	01/2024	[paper]
MobileAgent: enhancing mobile control via human-machine interaction and SOP integration	AIA	GS,VG	GL	-	IO,NO	01/2024	[paper]
GPT-4V(ision) is a Generalist Web Agent, if Grounded	SeeAct	GS,SG	-	AE	IO	01/2024	[paper]
AppAgent: Multimodal Agents as Smartphone Users	AppAgent	GS,DG	IT	AE	IO,NO	12/2023	[paper]
Assistgui: Task-oriented desktop graphical user interface automation	ACE	TD	GL	AE	IO,NO	12/2023	[paper]
MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation	MobileGPT	TD	GL	MA	IO,NO	12/2023	[paper]
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation	MM-Navigator	GS,VG	-	MA	IO,NO	11/2023	[paper]
WebWISE: Web Interface Control and Sequential Exploration with Large Language Models	WebWise	TD	-	MA	IO,NO	10/2023	[paper]
A Zero-Shot Language Agent for Computer Control with Structured Reflection	-	TD	IT	AE	IO,NO	10/2023	[paper]
Laser: Llm agent with state-space exploration for web navigation	Laser	TD	IT	AE	IO,NO	09/2023	[paper]
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control	Synapse	-	-	MA	IO	06/2023	[paper]
Sheetcopilot: Bringing software productivity to the next level through large language models	SheetCopilot	TD	IT	AE	EO	05/2023	[paper]
Language Models can Solve Computer Tasks	RCI	-	IT	AE	IO,NO	03/2023	[paper]
Enabling conversational interaction with mobile ui using large language models	-	TD	-	-	IO	09/2022	[paper]
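
The abbreviation codes in Table 2 can be expanded programmatically. Below is a minimal Python sketch, assuming each row is kept as tab-separated text in the column order shown above. The legend text is copied from the caption; its grouping by column is inferred from the table cells, and treating "-" as an empty list is an assumption. `LEGEND`, `COLUMNS` and `expand_row` are hypothetical names used for illustration, not part of the survey's tooling.

```python
# Legend copied from the Table 2 caption. The grouping by column is inferred
# from which codes appear in which column of the table (an assumption).
LEGEND = {
    "Perception": {"TD": "Textual Description", "GS": "GUI Screenshots",
                   "VG": "Visual Grounding", "SG": "Semantic Grounding",
                   "DG": "Dual Grounding"},
    "Planning": {"GL": "Global", "IT": "Iterative"},
    "Memory": {"AE": "Automated Exploration", "EA": "Experience-Augmented",
               "MA": "Management"},
    "Action": {"IO": "Input Operations", "NO": "Navigation Operations",
               "EO": "Extended Operations"},
}

COLUMNS = ["Paper", "Model", "Perception", "Planning", "Memory", "Action", "Date", "Link"]


def expand_row(line):
    """Split one tab-separated row and expand its abbreviation codes."""
    row = dict(zip(COLUMNS, line.split("\t")))
    for column, codes in LEGEND.items():
        cell = row[column]
        # A "-" cell is mapped to an empty list here (an assumption).
        row[column] = [] if cell == "-" else [codes[c] for c in cell.split(",")]
    return row


example = ("OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning"
           "\tOSCAR\tGS,DG\tIT\tAE\tEO\t10/2024\t[paper]")
oscar = expand_row(example)
print(oscar["Perception"])
```

Keying the legend by column keeps a lookup from silently succeeding when a code shows up in the wrong column; an unknown code raises a `KeyError` instead.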

Table 3: Recent benchmarks for OS Agents. Benchmarks are grouped into three sections by platform and, within each section, sorted by release date. Abbreviations: BS: Benchmark Settings, M/P: Mobile, PC: Desktop, IT: Interactive, ST: Static, OET: Operation Environment Types, RW: Real-World, SM: Simulated, GG: GUI Grounding, IF: Information Processing, AT: Agentic, CG: Code Generation.

Paper	Benchmark	Platform	BS	OET	Task	Date	Link
On the effects of data scale on computer control agents	AndroidControl	M/P	ST	-	AT	06/2024	[paper]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents	AndroidWorld	M/P	IT	RW	AT	05/2024	[paper]
Latent state estimation helps ui agents to reason	Android-50 (A-50)	M/P	IT	RW	AT	05/2024	[paper]
Benchmarking mobile device control agents across diverse configurations	B-MoCA	M/P	IT	RW	AT	04/2024	[paper]
Llamatouch: A faithful and scalable testbed for mobile ui task automation	LlamaTouch	M/P	IT	RW	AT	04/2024	[paper]
Understanding the weakness of large language model agents within a complex android environment	AndroidArena	M/P	IT	RW	AT	02/2024	[paper]
Android in the wild: A large-scale dataset for android device control	AITW	M/P	ST	-	AT	07/2023	[paper]
Ugif: Ui grounded instruction following	UGIF-DataSet	M/P	ST	-	AT	11/2022	[paper]
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility	MoTIF	M/P	ST	-	AT	02/2022	[paper]
Mapping natural language instructions to mobile UI action sequences	PIXELHELP	M/P	IT	RW	GG	05/2020	[paper]
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale	WindowsAgentArena	PC	IT	RW	AT	09/2024	[paper]
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation	OfficeBench	PC	IT	RW	AT	07/2024	[paper]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	OSWorld	PC	IT	RW	AT	04/2024	[paper]
Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web	OmniACT	PC	ST	-	GG	02/2024	[paper]
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation	ASSISTGUI	PC	IT	RW	AT	12/2023	[paper]
WebCanvas: Benchmarking Web Agents in Online Environments	Mind2Web-Live	Web	IT	RW	IF,AT	06/2024	[paper]
MMInA: Benchmarking multihop multimodal internet agents	MMInA	Web	IT	RW	IF,AT	04/2024	[paper]
AgentStudio: A Toolkit for Building General Virtual Agents	GroundUI	Web	ST	-	GG	03/2024	[paper]
Tur[k]ingBench: A Challenge Benchmark for Web Agents	TurkingBench	Web	IT	RW	AT	03/2024	[paper]
Workarena: How capable are web agents at solving common knowledge work tasks?	WorkArena	Web	IT	RW	IF,AT	03/2024	[paper]
WebLINX: Real-World website navigation with Multi-Turn dialogue	WebLINX	Web	ST	-	IF,AT	02/2024	[paper]
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks	Visualwebarena	Web	IT	RW	GG,AT	01/2024	[paper]
WebVLN: Vision-and-Language Navigation on Websites	WebVLN-v1	Web	IT	RW	IF,AT	12/2023	[paper]
Webarena: A realistic web environment for building autonomous agents	WebArena	Web	IT	RW	AT	07/2023	[paper]
Mind2Web: Towards a Generalist Agent for the Web	Mind2Web	Web	ST	-	IF,AT	06/2023	[paper]
Webshop: Towards scalable real-world web interaction with grounded language agents	WebShop	Web	ST	-	AT	07/2022	[paper]
Mapping natural language commands to web elements	PhraseNode	Web	ST	-	GG	08/2018	[paper]
World of Bits: An Open-Domain Platform for Web-Based Agents	MiniWoB	Web	ST	-	AT	08/2017	[paper]
World of Bits: An Open-Domain Platform for Web-Based Agents	FormWoB	Web	IT	SM	AT	08/2017	[paper]
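
Table 3 can also be queried by platform and benchmark setting. The following is a rough Python sketch over a handful of rows copied from the table; it is illustrative only and not the authors' tooling. `ROWS`, `PLATFORM`, `SETTING` and `interactive_benchmarks` are hypothetical names, and the `"Web"` platform label comes from the table column rather than from the caption legend.

```python
from collections import Counter

# Abbreviations from the Table 3 caption ("Web" is taken from the Platform column).
PLATFORM = {"M/P": "Mobile", "PC": "Desktop", "Web": "Web"}
SETTING = {"IT": "Interactive", "ST": "Static"}

# A few rows of Table 3 as (benchmark, platform, setting, environment, task codes).
ROWS = [
    ("AndroidControl", "M/P", "ST", "-", "AT"),
    ("AndroidWorld", "M/P", "IT", "RW", "AT"),
    ("OSWorld", "PC", "IT", "RW", "AT"),
    ("OmniACT", "PC", "ST", "-", "GG"),
    ("Mind2Web", "Web", "ST", "-", "IF,AT"),
    ("WebArena", "Web", "IT", "RW", "AT"),
]


def interactive_benchmarks(rows, platform):
    """Return benchmark names on `platform` whose setting is Interactive (IT)."""
    return [name for name, plat, setting, _env, _task in rows
            if plat == platform and setting == "IT"]


# Tally the sample rows by (platform, setting), using the expanded names.
counts = Counter((PLATFORM[plat], SETTING[setting])
                 for _name, plat, setting, _env, _task in ROWS)

print(interactive_benchmarks(ROWS, "Web"))
print(counts[("Mobile", "Static")])
```

Only six rows are included to keep the sketch short; the same tuple layout would apply to the remaining rows of the table.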

Full List

Foundation Models

[2024/12/13] Falcon-UI: Understanding GUI Before Following User Instructions. [paper]
[2024/12/13] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials. [paper]
[2024/12/05] AGUVIS: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
[2024/11/26] ShowUI: One Vision-Language-Action Model for GUI Visual Agent. [paper]
[2024/11/22] ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data. [paper]
[2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
[2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
[2024/10/30] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. [paper]
[2024/10/28] AutoGLM: Autonomous Foundation Agents for GUIs. [paper]
[2024/10/25] EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data. [paper]
[2024/10/24] Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms. [paper]
[2024/10/22] ShowUI: One Vision-Language-Action Model for Generalist GUI Agent. [paper]
[2024/10/17] Harnessing Webpage UIs for Text-Rich Visual Understanding. [paper]
[2024/10/09] TinyClick: Single-Turn Agent for Empowering GUI Automation. [paper]
[2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
[2024/10/03] NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator. [paper]
[2024/09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. [paper]
[2024/09/24] Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale. [paper]
[2024/09/23] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding. [paper]
[2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
[2024/08/30] UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents. [paper]
[2024/07/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
[2024/07/05] MobileFlow: A Multimodal LLM for Mobile GUI Agent. [paper]
[2024/06/20] VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning. [paper]
[2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
[2024/06/12] Tell Me What's Next: Textual Foresight for Generic UI Representations. [paper]
[2024/05/08] Visual Grounding for User Interfaces. [paper]
[2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
[2024/05/05] Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [paper]
[2024/05/01] Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning. [paper]
[2024/04/16] Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning. [paper]
[2024/04/12] Training a Vision Language Model as Smartphone Assistant. [paper]
[2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
[2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
[2024/04/04] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
[2024/04/02] Octopus v2: On-device language model for super agent. [paper]
[2024/03/30] Large Language Models Can Self-Improve At Web Agent Tasks. [paper]
[2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
[2024/02/07] ScreenAI: A Vision-Language Model for UI and Infographics Understanding. [paper]
[2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
[2024/01/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
[2024/01/17] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. [paper]
[2024/01/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
[2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
[2023/12/25] UINav: A Practical Approach to Train On-Device Automation Agents. [paper]
[2023/12/14] CogAgent: A Visual Language Model for GUI Agents. [paper]
[2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
[2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
[2023/10/27] Android in the wild: A large-scale dataset for android device control. [paper]
[2023/10/08] UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. [paper]
[2023/10/07] ILuvUI: Instruction-tuned Language-Vision modeling of UIs from Machine Conversations. [paper]
[2023/10/07] Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API. [paper]
[2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
[2023/05/31] From pixels to UI actions: Learning to follow instructions via graphical user interfaces. [paper]
[2023/05/19] Multimodal Web Navigation with Instruction-Finetuned Foundation Models. [paper]
[2023/01/30] WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics. [paper]
[2023/01/23] Lexi: Self-Supervised Learning of the UI Language. [paper]
[2022/10/06] Towards Better Semantic Understanding of Mobile Interfaces. [paper]
[2022/09/29] Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus. [paper]
[2022/07/04] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. [paper]
[2022/05/23] META-GUI: Towards multi-modal conversational agents on mobile gui. [paper]
[2022/02/16] A data-driven approach for learning to control computers. [paper]

Agent Frameworks

[2024/12/02] Ponder & Press: Advancing Visual GUI Agent towards General Computer Control. [paper]
[2024/11/20] AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. [paper]
[2024/11/18] Improved GUI Grounding via Iterative Narrowing. [paper]
[2024/11/15] The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use. [paper]
[2024/11/10] Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents. [paper]
[2024/11/01] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
[2024/10/29] Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents. [paper]
[2024/10/25] OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization. [paper]
[2024/10/24] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. [paper]
[2024/10/24] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant. [paper]
[2024/10/24] Infogent: An Agent-Based Framework for Web Information Aggregation. [paper]
[2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
[2024/10/21] Beyond Browsing: API-Based Web Agents. [paper]
[2024/10/17] AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. [paper]
[2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
[2024/10/11] VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. [paper]
[2024/10/10] Agent S: An Open Agentic Framework that Uses Computers Like a Human. [paper]
[2024/10/09] ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents. [paper]
[2024/10/01] Dynamic Planning for LLM-based Graphical User Interface Automation. [paper]
[2024/10/01] Multimodal Auto Validation For Self-Refinement in Web Agents. [paper]
[2024/09/25] Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents. [paper]
[2024/09/23] From Commands to Prompts: LLM-based Semantic File System for AIOS. [paper]
[2024/09/23] Steward: Natural Language Web Automation. [paper]
[2024/09/16] NaviQAte: Functionality-Guided Web Application Navigation. [paper]
[2024/09/14] PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM. [paper]
[2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
[2024/09/11] Agent Workflow Memory. [paper]
[2024/08/28] Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. [paper]
[2024/08/24] AutoWebGLM: A Large Language Model-based Web Navigating Agent. [paper]
[2024/08/24] Intelligent Agents with LLM-based Process Automation. [paper]
[2024/08/01] OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models. [paper]
[2024/08/01] OmniParser for Pure Vision Based GUI Agent. [paper]
[2024/07/21] Towards LLMCI: Multimodal AI for LLM-Vision UI Operation. [paper]
[2024/07/17] Agent-e: From autonomous web navigation to foundational design principles in agentic systems. [paper]
[2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
[2024/07/01] Tree Search for Language Model Agents. [paper]
[2024/06/27] Read anywhere pointed: Layout-aware gui screen reading with tree-of-lens grounding. [paper]
[2024/06/11] CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only. [paper]
[2024/06/03] Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration. [paper]
[2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
[2024/05/17] Latent State Estimation Helps UI Agents to Reason. [paper]
[2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
[2024/04/09] Autonomous Evaluation and Refinement of Web Agents. [paper]
[2024/04/03] PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts. [paper]
[2024/03/25] AIOS: LLM Agent Operating System. [paper]
[2024/03/05] Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [paper]
[2024/03/05] Cradle: Empowering Foundation Agents Towards General Computer Control. [paper]
[2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
[2024/02/19] CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation. [paper]
[2024/02/12] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. [paper]
[2024/02/09] ScreenAgent: A Vision Language Model-driven Computer Control Agent. [paper]
[2024/02/08] UFO: A UI-Focused Agent for Windows OS Interaction. [paper]
[2024/02/06] Dual-View Visual Contextualization for Web Navigation. [paper]
[2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
[2024/01/25] WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. [paper]
[2024/01/17] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. [paper]
[2024/01/04] MobileAgent: enhancing mobile control via human-machine interaction and SOP integration. [paper]
[2024/01/03] GPT-4V(ision) is a Generalist Web Agent, if Grounded. [paper]
[2023/12/21] AppAgent: Multimodal Agents as Smartphone Users. [paper]
[2023/12/20] ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. [paper]
[2023/12/04] MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation. [paper]
[2023/12/04] Intelligent Virtual Assistants with LLM-based Process Automation. [paper]
[2023/11/13] GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. [paper]
[2023/10/24] WebWISE: Web Interface Control and Sequential Exploration with Large Language Models. [paper]
[2023/10/12] A Zero-Shot Language Agent for Computer Control with Structured Reflection. [paper]
[2023/09/20] You only look at screens: Multimodal chain-of-action agents. [paper]
[2023/09/15] Laser: Llm agent with state-space exploration for web navigation. [paper]
[2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
[2023/07/24] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. [paper]
[2023/06/14] Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. [paper]
[2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
[2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
[2023/05/23] Hierarchical prompting assists large language model on web navigation. [paper]
[2023/05/09] InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language. [paper]
[2023/03/30] Language Models can Solve Computer Tasks. [paper]
[2022/09/19] Enabling conversational interaction with mobile ui using large language models. [paper]
[2022/05/23] META-GUI: Towards multi-modal conversational agents on mobile gui. [paper]

Evaluation & Benchmark

[2024/12/09] BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks. [paper]
[2024/12/05] AGUVIS: Unified Pure Vision Agents for Autonomous GUI Interaction. [paper]
[2024/10/31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. [paper]
[2024/10/28] AssistEditor: Multi-Agent Collaboration for GUI Workflow Automation in Video Creation. [paper]
[2024/10/28] Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models. [paper]
[2024/10/28] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. [paper]
[2024/10/24] VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks. [paper]
[2024/10/22] Large Language Models Empowered Personalized Web Agents. [paper]
[2024/10/19] SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation. [paper]
[2024/10/17] MobA: A Two-Level Agent System for Efficient Mobile Task Automation. [paper]
[2024/10/07] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. [paper]
[2024/09/22] MobileViews: A Large-Scale Mobile GUI Dataset. [paper]
[2024/09/12] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. [paper]
[2024/09/06] WebQuest: A Benchmark for Multimodal QA on Web Page Sequences. [paper]
[2024/08/01] OmniParser for Pure Vision Based GUI Agent. [paper]
[2024/07/26] OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation. [paper]
[2024/07/22] AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? [paper]
[2024/07/15] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [paper]
[2024/07/13] RealWeb: A Benchmark for Universal Instruction Following in Realistic Web Services Navigation. [paper]
[2024/07/11] UICrit: Enhancing Automated Design Evaluation with a UI Critique Dataset. [paper]
[2024/07/07] WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks. [paper]
[2024/07/04] MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices. [paper]
[2024/07/03] Amex: Android multi-annotation expo dataset for mobile gui agents. [paper]
[2024/07/01] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. [paper]
[2024/07/01] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents. [paper]
[2024/06/27] Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding. [paper]
[2024/06/20] E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion. [paper]
[2024/06/20] Identifying User Goals from UI Trajectories. [paper]
[2024/06/19] GUI Action Narrator: Where and When Did That Action Take Place? [paper]
[2024/06/18] WebCanvas: Benchmarking Web Agents in Online Environments. [paper]
[2024/06/17] GUICourse: From General Vision Language Models to Versatile GUI Agents. [paper]
[2024/06/14] VideoGUI: A Benchmark from Instructional Videos. [paper]
[2024/06/12] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. [paper]
[2024/06/12] GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents. [paper]
[2024/06/12] MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents. [paper]
[2024/06/06] On the Effects of Data Scale on UI Control Agents. [paper]
[2024/06/05] WebOlympus: An Open Platform for Web Agents on Live Websites. [paper]
[2024/06/05] UGIF-DataSet: A New Dataset for Cross-lingual, Cross-modal Sequential actions on the UI. [paper]
[2024/06/01] WebSuite: Systematically Evaluating Why Web Agents Fail. [paper]
[2024/05/23] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. [paper]
[2024/05/23] AGILE: A Novel Reinforcement Learning Framework of LLM Agents. [paper]
[2024/05/17] Latent state estimation helps ui agents to reason. [paper]
[2024/05/07] Mapping natural language instructions to mobile UI action sequences. [paper]
[2024/05/05] Visual grounding for desktop graphical user interfaces. [paper]
[2024/04/28] MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot. [paper]
[2024/04/25] Benchmarking mobile device control agents across diverse configurations. [paper]
[2024/04/15] MMInA: Benchmarking multihop multimodal internet agents. [paper]
[2024/04/12] Llamatouch: A faithful and scalable testbed for mobile ui task automation. [paper]
[2024/04/11] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. [paper]
[2024/04/09] VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [paper]
[2024/04/09] GUIDE: Graphical User Interface Data for Execution. [paper]
[2024/04/08] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. [paper]
[2024/04/04] Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. [paper]
[2024/03/29] Evaluating language model agents on realistic autonomous tasks. [paper]
[2024/03/29] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. [paper]
[2024/03/26] AgentStudio: A Toolkit for Building General Virtual Agents. [paper]
[2024/03/18] Tur[k]ingBench: A Challenge Benchmark for Web Agents. [paper]
[2024/03/15] Computer User Interface Understanding. A New Dataset and a Learning Framework. [paper]
[2024/03/12] Workarena: How capable are web agents at solving common knowledge work tasks? [paper]
[2024/03/06] PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion. [paper]
[2024/03/05] Android in the Zoo: Chain-of-Action-Thought for GUI Agents. [paper]
[2024/02/27] Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. [paper]
[2024/02/23] On the Multi-turn Instruction Following for Conversational Web Agents. [paper]
[2024/02/09] Understanding the weakness of large language model agents within a complex android environment. [paper]
[2024/02/08] WebLINX: Real-World website navigation with Multi-Turn dialogue. [paper]
[2024/01/29] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. [paper]
[2024/01/25] Webvoyager: Building an end-to-end web agent with large multimodal models. [paper]
[2024/01/25] GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning. [paper]
[2024/01/24] Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. [paper]
[2024/01/24] Agentboard: An analytical evaluation board of multiturn LLM agents. [paper]
[2024/01/17] Seeclick: Harnessing gui grounding for advanced visual gui agents. [paper]
[2023/12/26] AutoTask: Executing Arbitrary Voice Commands by Exploring and Learning from Mobile GUI. [paper]
[2023/12/25] WebVLN: Vision-and-Language Navigation on Websites. [paper]
[2023/12/20] ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. [paper]
[2023/11/30] Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web. [paper]
[2023/11/21] GAIA: a benchmark for general AI assistants. [paper]
[2023/11/13] GPT-4V in wonderland: Large multimodal models for Zero-Shot smartphone GUI navigation. [paper]
[2023/11/03] Pptc benchmark: Evaluating large language models for powerpoint task completion. [paper]
[2023/10/16] OpenAgents: An Open Platform for Language Agents in the Wild. [paper]
[2023/09/20] You Only Look at Screens: Multimodal Chain-of-Action Agents. [paper]
[2023/08/29] AutoDroid: LLM-powered Task Automation in Android. [paper]
[2023/08/07] Agentbench: Evaluating llms as agents. [paper]
[2023/07/25] Webarena: A realistic web environment for building autonomous agents. [paper]
[2023/07/19] Android in the wild: A large-scale dataset for android device control. [paper]
[2023/06/09] Mind2Web: Towards a Generalist Agent for the Web. [paper]
[2023/05/30] Sheetcopilot: Bringing software productivity to the next level through large language models. [paper]
[2023/05/25] On the tool manipulation capability of open-source large language models. [paper]
[2023/05/14] Mobile-env: A universal platform for training and evaluation of mobile interaction. [paper]
[2023/05/14] Mobile-env: Building qualified evaluation benchmarks for llm-gui interaction. [paper]
[2023/04/14] Droidbot-gpt: Gpt-powered ui automation for android. [paper]
[2023/04/10] OpenAGI: When LLM Meets Domain Experts. [paper]
[2023/03/13] Vision-Language models as success detectors. [paper]
[2022/11/14] Ugif: Ui grounded instruction following. [paper]
[2022/10/08] Understanding html with large language models. [paper]
[2022/09/29] MUG: Interactive Multimodal Grounding on User Interfaces. [paper]
[2022/09/16] ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots. [paper]
[2022/07/04] Webshop: Towards scalable real-world web interaction with grounded language agents. [paper]
[2022/05/23] META-GUI: Towards multi-modal conversational agents on mobile gui. [paper]
[2022/02/04] A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility. [paper]
[2021/04/17] Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments. [paper]
[2021/01/23] Websrc: A dataset for web-based structural reading comprehension. [paper]
[2018/08/28] Mapping natural language commands to web elements. [paper]
[2017/11/06] Building natural language interfaces to web apis. [paper]
[2017/08/06] World of Bits: An Open-Domain Platform for Web-Based Agents. [paper]

Safety & Privacy

[2024/11/04] Attacking Vision-Language Computer Agents via Pop-ups. [paper]
[2024/10/23] MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control. [paper]
[2024/10/22] Advweb: Controllable black-box attacks on vlm-powered web agents. [paper]
[2024/10/11] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents. [paper]
[2024/10/09] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents. [paper]
[2024/09/17] Eia: Environmental injection attack on generalist web agents for privacy leakage. [paper]
[2024/08/05] Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions. [paper]
[2024/07/12] Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study. [paper]
[2025/02/04] Dissecting Adversarial Robustness of Multimodal LM Agents. [paper]
[2024/02/26] WIPI: A New Web Threat for LLM-Driven Web Agents. [paper]
[2023/08/03] From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? [paper]
[2024/12/19] Agent-SafetyBench: Evaluating the Safety of LLM Agents. [paper]

Hiring

[OPPO] Personal AI Team Hiring Algorithm Interns / New Graduates / Experienced Candidates

OPPO is dedicated to developing the most advanced foundation model technologies and creating personalized user experiences. Our mission is focused on developing revolutionary AI Native Phones that will shape the future. The Personal AI Team is primarily focused on cutting-edge research in (multimodal) LLMs, AI Agents and personalization (e.g., the team's new work AI PERSONA). The team leverages OPPO’s ample platform resources, including data and computing power, to continuously invest in these key areas. We are currently hiring algorithm interns, new graduates, and experienced candidates. We expect you to be proficient in algorithms or engineering of large foundation models. If interested, please contact us via email: [email protected].

Useful Links

To better foster community collaboration, we have listed some repositories related to OS Agents:

aialt/awesome-mobile-agents: https://github.com/aialt/awesome-mobile-agents
OSU-NLP-Group/GUI-Agents-Paper-List: https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List
vyokky/LLM-Brained-GUI-Agents-Survey: https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey
wendell0218/GVA-Survey: https://github.com/wendell0218/GVA-Survey
WebAgentLab: https://webagentlab.notion.site/homepage
PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
ranpox/awesome-computer-use: https://github.com/ranpox/awesome-computer-use

Contact

The repo is still being updated rapidly🚀. If you notice any mistakes or would like any work related to OS Agents to be included in our list, please let us know by e-mail: [email protected].

Citation

If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.

@article{huagents,
  title={OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use},
  author={Hu, Xueyu and Xiong, Tao and Yi, Biao and Wei, Zishu and Xiao, Ruixuan and Chen, Yurun and Ye, Jiasheng and Tao, Meiling and Zhou, Xiangxin and Zhao, Ziyu and others}
}
