A Generalist AI for 3D Environments
Unlike AI models trained to play one specific game, Google DeepMind's SIMA (Scalable Instructable Multiworld Agent) is a generalist agent. It learns to follow natural-language instructions to carry out tasks across a variety of 3D virtual environments and video games. Working only from the screen image and a text instruction, SIMA acts much as a human player would: navigating complex worlds, interacting with objects, and working toward high-level goals.
9+ Diverse 3D Worlds
600+ Basic Skills Learned
No API or Game Code Needed
Multimodal Architecture
Capabilities Across Domains
SIMA handles a broad range of embodied-agent tasks, combining visual perception with language understanding to carry out instructions.
How SIMA Interacts
SIMA requires only two inputs: the image on screen and a natural-language instruction. From these it outputs keyboard and mouse commands.
Observation: Pixels + Language
Processes visual screen output and user instructions.
➔
Reasoning: Vision-Language Model
Translates goals into steps that fit the current context.
➔
Action: Keyboard/Mouse
Outputs keyboard and mouse control commands within the environment.
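
The observation, reasoning, and action stages above can be sketched as a simple control loop. This is an illustrative sketch only, not SIMA's implementation: every name here (`Observation`, `Action`, `KeywordPolicy`, `run_episode`, `capture_screen`, `send_input`) is hypothetical, and the toy policy matches keywords in the instruction instead of running a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple

# Stand-in for a screen capture: rows of RGB pixels (assumption for illustration).
Frame = List[List[Tuple[int, int, int]]]


@dataclass(frozen=True)
class Observation:
    frame: Frame        # what is currently on screen
    instruction: str    # natural-language instruction from the user


@dataclass(frozen=True)
class Action:
    keys: Tuple[str, ...] = ()  # keys pressed during this step
    mouse_dx: int = 0           # horizontal mouse movement
    mouse_dy: int = 0           # vertical mouse movement
    click: bool = False         # whether to click the primary button


class Policy(Protocol):
    def act(self, obs: Observation) -> Action: ...


class KeywordPolicy:
    """Hypothetical toy policy: ignores the pixels and matches keywords.

    A real agent would feed both the frame and the instruction to a
    vision-language model; this stub only shows where that model sits.
    """

    def act(self, obs: Observation) -> Action:
        text = obs.instruction.lower()
        if "forward" in text:
            return Action(keys=("w",))
        if "left" in text:
            return Action(mouse_dx=-10)
        if "open" in text or "pick" in text:
            return Action(click=True)
        return Action()  # no-op when the instruction is not recognised


def run_episode(
    policy: Policy,
    capture_screen: Callable[[], Frame],
    send_input: Callable[[Action], None],
    instruction: str,
    steps: int,
) -> List[Action]:
    """Observe the screen, pick an action, send it; repeat for `steps` steps."""
    actions: List[Action] = []
    for _ in range(steps):
        obs = Observation(frame=capture_screen(), instruction=instruction)
        action = policy.act(obs)
        send_input(action)  # the only channel into the game: keyboard/mouse
        actions.append(action)
    return actions


if __name__ == "__main__":
    blank_frame: Frame = [[(0, 0, 0)] * 4 for _ in range(3)]
    sent: List[Action] = []
    run_episode(KeywordPolicy(), lambda: blank_frame, sent.append, "Move forward", steps=2)
    print(sent)
```

The loop touches the environment only through `capture_screen` and `send_input`, mirroring the "no API or game code" constraint: swapping in a different game would mean changing those two callables, not the policy interface.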