SIMA: Scalable Instructable Multiworld Agent

A Generalist AI for 3D Environments

Unlike AI models trained to play one specific game, Google DeepMind's SIMA is a generalist agent. It learns to follow natural language instructions to carry out tasks across a variety of 3D virtual environments and video games. By simply looking at the screen and reading text commands, SIMA acts like a human player, navigating complex worlds, interacting with objects, and collaborating to achieve high-level goals.

Diverse 3D Worlds

600+ Basic Skills Learned

No API or Game Code Needed

Multimodal Architecture

Capabilities Across Domains

SIMA handles a broad range of embodied-agent tasks, combining visual perception with language understanding to carry out instructed goals.

How SIMA Interacts

SIMA requires only two inputs: the on-screen image and a natural-language instruction. It then outputs keyboard and mouse commands.

Observation (Pixels + Language): Processes visual screen output and user instructions.

Reasoning (Vision-Language Model): Translates the instructed goal into actionable steps based on the current context.

Action (Keyboard/Mouse): Outputs keyboard and mouse control commands within the environment.
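
The observation, reasoning, and action stages described above can be pictured as a simple loop. The following is a minimal sketch in Python, not SIMA's actual implementation: every name in it (Observation, Action, toy_policy, run_episode) is hypothetical, and the "policy" is a keyword lookup standing in for the vision-language model. It only illustrates the interface: pixels plus an instruction go in, keyboard and mouse commands come out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# All names here are hypothetical illustrations; none come from SIMA itself.


@dataclass(frozen=True)
class Observation:
    pixels: Sequence[Sequence[int]]  # stand-in for one screen frame
    instruction: str                 # natural-language command from the user


@dataclass(frozen=True)
class Action:
    keys: Tuple[str, ...] = ()  # keys held down during this step
    mouse_dx: int = 0           # horizontal mouse movement
    mouse_dy: int = 0           # vertical mouse movement


Policy = Callable[[Observation], Action]


def toy_policy(obs: Observation) -> Action:
    """Placeholder for the reasoning stage: a keyword lookup, not a learned model."""
    text = obs.instruction.lower()
    if "forward" in text:
        return Action(keys=("w",))
    if "left" in text:
        return Action(mouse_dx=-10)
    return Action()  # no-op when the instruction is not recognized


def run_episode(
    frames: Sequence[Sequence[Sequence[int]]],
    instruction: str,
    policy: Policy,
) -> List[Action]:
    """Pair each frame with the instruction and collect the policy's actions."""
    actions = []
    for frame in frames:
        obs = Observation(pixels=frame, instruction=instruction)
        actions.append(policy(obs))
    return actions


if __name__ == "__main__":
    blank_frame = [[0, 0], [0, 0]]
    for action in run_episode([blank_frame, blank_frame], "Move forward", toy_policy):
        print(action)
```

The sketch never calls into a game API or reads game state: the policy sees only the frame and the instruction, mirroring the two-input design described above. In a real agent the policy would be a learned model and the actions would be sent to the operating system's input layer.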