Multimodal Agent
Gemini Robotics: Bringing AI into the Physical World
Paper
• 2503.20020
• Published
• 31
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper
• 2311.05437
• Published
• 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published
• 49
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 90
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Paper
• 2503.12533
• Published
• 68
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper
• 2503.21620
• Published
• 62
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Paper
• 2505.06111
• Published
• 25
Visual Agentic Reinforcement Fine-Tuning
Paper
• 2505.14246
• Published
• 32
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Paper
• 2505.15517
• Published
• 4
Interactive Post-Training for Vision-Language-Action Models
Paper
• 2505.17016
• Published
• 6
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Paper
• 2505.10887
• Published
• 10
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Paper
• 2505.21497
• Published
• 109
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
Paper
• 2505.20289
• Published
• 10
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Paper
• 2505.23747
• Published
• 69
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published
• 23
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper
• 2506.01844
• Published
• 153
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
Paper
• 2506.00411
• Published
• 31
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
Paper
• 2506.02387
• Published
• 58
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper
• 2506.03143
• Published
• 53
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework
Paper
• 2506.02454
• Published
• 7
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Paper
• 2506.09937
• Published
• 9
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
Paper
• 2506.10357
• Published
• 21
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Paper
• 2506.10821
• Published
• 19
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
Paper
• 2506.07961
• Published
• 11
EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Paper
• 2506.10100
• Published
• 10
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Paper
• 2506.09930
• Published
• 8
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published
• 28
WorldVLA: Towards Autoregressive Action World Model
Paper
• 2506.21539
• Published
• 40
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper
• 2507.01925
• Published
• 39
RoboBrain 2.0 Technical Report
Paper
• 2507.02029
• Published
• 35
PresentAgent: Multimodal Agent for Presentation Video Generation
Paper
• 2507.04036
• Published
• 11
A Survey on Vision-Language-Action Models for Autonomous Driving
Paper
• 2506.24044
• Published
• 14
PyVision: Agentic Vision with Dynamic Tooling
Paper
• 2507.07998
• Published
• 33
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Paper
• 2507.16815
• Published
• 42
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Paper
• 2507.22827
• Published
• 100
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Paper
• 2507.23682
• Published
• 24
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Paper
• 2507.17520
• Published
• 15
RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems
Paper
• 2508.01415
• Published
• 8
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
Paper
• 2508.04482
• Published
• 9
MolmoAct: Action Reasoning Models that can Reason in Space
Paper
• 2508.07917
• Published
• 44
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
UI-Venus Technical Report: Building High-performance UI Agents with RFT
Paper
• 2508.10833
• Published
• 45
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Paper
• 2508.09736
• Published
• 58
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper
• 2508.05748
• Published
• 141
Mobile-Agent-v3: Foundamental Agents for GUI Automation
Paper
• 2508.15144
• Published
• 65
Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Paper
• 2508.16292
• Published
• 9
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Paper
• 2508.21046
• Published
• 9
Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Paper
• 2508.19493
• Published
• 11
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Paper
• 2508.21112
• Published
• 77
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper
• 2508.21767
• Published
• 12
Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Paper
• 2509.01106
• Published
• 52
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Paper
• 2509.02544
• Published
• 125
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Paper
• 2509.09372
• Published
• 248
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
• 2509.09674
• Published
• 80
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper
• 2509.06951
• Published
• 32
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Paper
• 2509.11543
• Published
• 49
Nav-R1: Reasoning and Navigation in Embodied Scenes
Paper
• 2509.10884
• Published
• 9
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Paper
• 2509.15566
• Published
• 14
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
Paper
• 2509.15233
• Published
• 2
A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
Paper
• 2509.15937
• Published
• 20
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Paper
• 2509.15212
• Published
• 22
D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents
Paper
• 2509.21799
• Published
• 9
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
Paper
• 2509.23866
• Published
• 14
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
Paper
• 2509.25185
• Published
• 5
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
Paper
• 2509.22570
• Published
• 4
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published
• 12
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
Paper
• 2510.00406
• Published
• 66
Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
Paper
• 2510.14300
• Published
• 12
VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation
Paper
• 2510.14902
• Published
• 17
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Paper
• 2510.13778
• Published
• 17
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Paper
• 2510.10274
• Published
• 16
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Paper
• 2510.12801
• Published
• 13
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Paper
• 2510.11027
• Published
• 23
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Paper
• 2510.11498
• Published
• 11
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Paper
• 2510.19430
• Published
• 52
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Paper
• 2510.19336
• Published
• 17
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments
Paper
• 2510.21111
• Published
• 3
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
Paper
• 2510.25889
• Published
• 66
RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Paper
• 2510.23763
• Published
• 56
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Paper
• 2510.17439
• Published
• 28
World Simulation with Video Foundation Models for Physical AI
Paper
• 2511.00062
• Published
• 45
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Paper
• 2511.01718
• Published
• 7
A Survey on Efficient Vision-Language-Action Models
Paper
• 2510.24795
• Published
• 6
iFlyBot-VLA Technical Report
Paper
• 2511.01914
• Published
• 7
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Paper
• 2510.25616
• Published
• 105
DeepEyesV2: Toward Agentic Multimodal Model
Paper
• 2511.05271
• Published
• 45
Paper
• 2511.05491
• Published
• 52
Robot Learning from a Physical World Model
Paper
• 2511.07416
• Published
• 32
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Paper
• 2511.08521
• Published
• 38
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Paper
• 2511.10017
• Published
• 7
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Paper
• 2511.09515
• Published
• 20
WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
Paper
• 2511.06251
• Published
• 14
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
Paper
• 2511.14210
• Published
• 21
MiMo-Embodied: X-Embodied Foundation Model Technical Report
Paper
• 2511.16518
• Published
• 26
EvoVLA: Self-Evolving Vision-Language-Action Model
Paper
• 2511.16166
• Published
• 6
RynnVLA-002: A Unified Vision-Language-Action and World Model
Paper
• 2511.17502
• Published
• 28
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Paper
• 2511.16175
• Published
• 12
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Paper
• 2511.17199
• Published
• 8
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Paper
• 2511.17889
• Published
• 5
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
Paper
• 2511.19900
• Published
• 48
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Paper
• 2511.21678
• Published
• 12
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Paper
• 2511.22134
• Published
• 22
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Paper
• 2511.20785
• Published
• 187
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
Paper
• 2512.01031
• Published
• 26
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Paper
• 2512.02425
• Published
• 25
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Paper
• 2512.00903
• Published
• 7
SIMA 2: A Generalist Embodied Agent for Virtual Worlds
Paper
• 2512.04797
• Published
• 25
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Paper
• 2512.02834
• Published
• 41
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
Paper
• 2512.06963
• Published
• 4
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
Paper
• 2512.09928
• Published
• 14
UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving
Paper
• 2512.09864
• Published
• 12
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Paper
• 2512.13080
• Published
• 17
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Paper
• 2512.11891
• Published
• 10
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Paper
• 2512.11362
• Published
• 22
AdaTooler-V: Adaptive Tool-Use for Images and Videos
Paper
• 2512.16918
• Published
• 14
Step-GUI Technical Report
Paper
• 2512.15431
• Published
• 132
Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Paper
• 2512.16760
• Published
• 15
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Paper
• 2512.14666
• Published
• 10
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Paper
• 2512.13043
• Published
• 5
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Paper
• 2512.22615
• Published
• 49
OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
Paper
• 2512.23646
• Published
• 15
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Paper
• 2512.24330
• Published
• 36
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
Paper
• 2601.06943
• Published
• 215
VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory
Paper
• 2601.08665
• Published
• 8
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Paper
• 2601.11404
• Published
• 26
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
Paper
• 2601.12346
• Published
• 49
BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Paper
• 2601.15197
• Published
• 54
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
Paper
• 2601.14133
• Published
• 61
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Paper
• 2601.16973
• Published
• 40
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Paper
• 2601.22153
• Published
• 73
A Pragmatic VLA Foundation Model
Paper
• 2601.18692
• Published
• 49
Shallow-π: Knowledge Distillation for Flow-based VLAs
Paper
• 2601.20262
• Published
• 2
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
Paper
• 2602.00919
• Published
• 315
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Paper
• 2601.22060
• Published
• 154