SIMA

Scalable Instructable Multiworld Agent

A Generalist AI for 3D Environments

Unlike AI models trained to play one specific game, Google DeepMind's SIMA is a generalist agent. It learns to follow natural language instructions to carry out tasks across a variety of 3D virtual environments and video games. By simply looking at the screen and reading text commands, SIMA acts like a human player, navigating complex worlds, interacting with objects, and collaborating to achieve high-level goals.

9+ Diverse 3D Worlds
600+ Basic Skills Learned
No API or Game Code Needed
Multimodal Architecture

Capabilities Across Domains

SIMA handles a broad range of embodied-agent tasks, combining visual perception with language understanding to carry out instructed goals.

How SIMA Interacts

SIMA requires only two inputs: the images on screen and a natural-language instruction. From these it outputs keyboard and mouse commands.

Observation: Pixels + Language
Processes visual screen output and user instructions.
Reasoning: Vision-Language Model
Translates goals into actionable steps based on the current context.
Action: Keyboard/Mouse
Outputs keyboard and mouse commands that control the agent within the environment.
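
The observation, reasoning, and action stages above can be pictured as a simple loop. The Python sketch below is illustrative only: the names Observation, Action, toy_policy, and run_episode are hypothetical, and the keyword rules merely stand in for the learned vision-language model. It is not SIMA's actual code or interface, and the toy policy ignores the pixels entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy grayscale frame; a real agent would receive the game's screen pixels.
Frame = List[List[int]]


@dataclass(frozen=True)
class Observation:
    frame: Frame        # what is currently on screen
    instruction: str    # natural-language command from the user


@dataclass(frozen=True)
class Action:
    keys: Tuple[str, ...] = ()  # keys pressed this step
    mouse_dx: int = 0           # horizontal mouse movement
    mouse_dy: int = 0           # vertical mouse movement
    click: bool = False         # whether the mouse button is clicked


Policy = Callable[[Observation], Action]


def toy_policy(obs: Observation) -> Action:
    """Hypothetical stand-in for the model: simple keyword rules (assumption)."""
    text = obs.instruction.lower()
    if "forward" in text:
        return Action(keys=("w",))
    if "left" in text:
        return Action(mouse_dx=-10)
    if "open" in text or "pick up" in text:
        return Action(click=True)
    return Action()  # no-op when the instruction is not recognized


def run_episode(policy: Policy, frames: List[Frame], instruction: str) -> List[Action]:
    """Pair each frame with the instruction and collect one action per frame."""
    return [policy(Observation(frame, instruction)) for frame in frames]


if __name__ == "__main__":
    blank = [[0, 0], [0, 0]]
    for action in run_episode(toy_policy, [blank, blank], "Go forward"):
        print(action)
```

The point of the sketch is the interface: the policy sees only a frame and an instruction, and it returns only keyboard and mouse output, so nothing in the loop depends on a game-specific API.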