When the video game No Man’s Sky was first released in 2016, it boasted a universe containing more than 18 quintillion planets. Granted, it took another six years of development to make any of those planets worth exploring, but this enormous, virtually limitless universe was created using a development method called “procedural generation.”
Procedural generation allows computers to create video game content by combining human-made assets (such as textures and prefabricated objects) with those generated by an algorithm. Other popular games have used this method either to ease the strain of manually constructing vast game worlds (Skyrim and The Witcher III) or to generate unique worlds for each new play session (Minecraft and Dead Cells).
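To see the basic idea in miniature, here is a toy sketch in Python (illustrative only, not from any shipped game): hand-made tiles stand in for the human assets, and a seeded random roll stands in for the algorithm’s rules and parameters, so the same seed always rebuilds the same world.

```python
import random

# Hand-made "assets": a tiny tile set a human artist would provide.
TILES = {"ground": "█", "gap": " ", "coin": "o", "enemy": "x"}

def generate_level(seed: int, length: int = 40) -> str:
    """Build a unique level from a seed; the same seed always yields the same level."""
    rng = random.Random(seed)  # seeded randomness, as in Minecraft world generation
    level = []
    for _ in range(length):
        roll = rng.random()
        if roll < 0.70:
            level.append(TILES["ground"])
        elif roll < 0.80:
            level.append(TILES["gap"])
        elif roll < 0.92:
            level.append(TILES["coin"])
        else:
            level.append(TILES["enemy"])
    return "".join(level)

print(generate_level(seed=42))
print(generate_level(seed=2016))  # a different seed yields a different world
```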
As No Man’s Sky’s example suggests, procedural generation can create a near-infinite number of interactive environments, so long as human developers are available to provide the initial assets alongside the rules and parameters. Google, however, has brought procedurally generated environments to a whole new level.
The company’s AI research lab, DeepMind, recently announced an AI model that learned to craft 2D video games by analyzing internet videos. Once trained, the only asset a human needs to provide is a single image. Even a napkin drawing will do.
Named Genie (Generative Interactive Environments), the AI is currently a proof of concept, but it may signal more than a seismic shift in how we develop video games: it could unlock new potential in what we can do with interactive environments.
Now you’re playing with AI power
Genie’s training regimen is sure to be the envy of every middle schooler: It watched 6.8 million video clips of video games being played online. The AI focused specifically on 2D platformers like the classic Super Mario Bros. and Sonic the Hedgehog. Each clip was 16 seconds long, so in total Genie watched the equivalent of roughly 30,000 hours of Twitch streams (minus the commentary).
Genie analyzed the videos to determine what “latent actions” were taking place between each frame of animation. The analysis had to target latent actions because internet videos don’t come with labels explaining what’s happening in the game; the model had to infer that information for itself. To keep things simple, the researchers limited the potential actions to eight possibilities (up, down, left, and right, plus the diagonals).
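DeepMind’s actual latent action model is learned end to end, but a toy Python sketch can illustrate the core idea of inferring an unlabeled action from a pair of frames:

```python
# Toy illustration of "latent actions" (not DeepMind's actual architecture):
# infer which of eight discrete actions best explains the change between
# two frames by checking which shift of the previous frame matches the next.
import numpy as np

# The eight allowed actions: four cardinal directions plus four diagonals.
ACTIONS = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
    "up-left": (-1, -1), "up-right": (-1, 1),
    "down-left": (1, -1), "down-right": (1, 1),
}

def infer_latent_action(prev_frame: np.ndarray, next_frame: np.ndarray) -> str:
    """Pick the action whose shift of prev_frame best matches next_frame."""
    best_action, best_error = None, float("inf")
    for name, (dy, dx) in ACTIONS.items():
        shifted = np.roll(prev_frame, shift=(dy, dx), axis=(0, 1))
        error = float(np.abs(shifted - next_frame).sum())
        if error < best_error:
            best_action, best_error = name, error
    return best_action

# A sprite moves one pixel right between frames; no labels needed.
prev = np.zeros((8, 8)); prev[4, 3] = 1.0
nxt = np.zeros((8, 8)); nxt[4, 4] = 1.0
print(infer_latent_action(prev, nxt))  # -> "right"
```

The real system never sees names like “right”; it simply learns that the same discrete code tends to explain the same kind of transition between frames.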
After extracting data from tons of video games and analyzing it for potential latent actions, Genie could generate playable 2D platformers using a single image as a foundation. The image could be a real-world photo, a hand-drawn sketch, or an AI text-to-image creation.
Whatever the source, the image becomes the game’s initial frame. The player then specifies an action, such as move right or jump, and the model predicts and generates the next frame in the sequence. This cycle continues for the duration of play, with previous frames serving as context for predicting the next one from the player’s input. So unlike traditional video games, where developers have to create animations in advance for every potential player input, Genie makes it up as it goes.
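In rough pseudocode, the loop the researchers describe looks something like the sketch below. To be clear, `genie_model` and its `predict_next_frame` method are hypothetical stand-ins for illustration, not a real API.

```python
# Schematic sketch of Genie's play loop (hypothetical names, not a real API).
def play(genie_model, start_image, get_player_action, steps=100):
    frames = [start_image]            # the prompt image is frame one
    for _ in range(steps):
        action = get_player_action()  # e.g., one of the eight latent actions
        # Condition on recent frames plus the chosen action to produce the next frame.
        next_frame = genie_model.predict_next_frame(
            context=frames[-16:],     # the model only attends to a short window of past frames
            action=action,
        )
        frames.append(next_frame)     # generated frames become context for later steps
    return frames
```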
“Genie introduces the era of being able to generate entire interactive worlds from images or text,” Google writes in a blog post.
More than just a game
The results are impressive. In the post, Google shared GIFs of the video games in action. One showed a clay character jumping across a world that looks straight out of a stylized Super Nintendo game. Another turned a young child’s refrigerator art into something playable. And another had a LEGO Thor leaping across a sketch on a dry-erase board.
Genie also displayed some unexpected emergent properties. For instance, some of the environments emulated an animation technique known as “parallax scrolling,” in which game developers move the background more slowly than the foreground elements to give the illusion of depth. It’s an advanced design technique for an AI to pick up without explicit instructions to do so.
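The effect itself is simple arithmetic, as this small illustrative snippet (not Genie’s internals) shows: each layer scrolls by a fraction of the camera’s movement, and smaller fractions read as farther away.

```python
# Parallax scrolling in one line of arithmetic: layers farther from the
# camera scroll by a smaller fraction of the camera's movement.
LAYERS = {               # scroll factor per layer: 1.0 moves with the camera
    "far_mountains": 0.2,
    "near_hills": 0.5,
    "foreground": 1.0,
}

def layer_offsets(camera_x: float) -> dict[str, float]:
    """Horizontal draw offset for each layer at a given camera position."""
    return {name: camera_x * factor for name, factor in LAYERS.items()}

print(layer_offsets(camera_x=100))
# {'far_mountains': 20.0, 'near_hills': 50.0, 'foreground': 100.0}
```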
The researchers also wanted to determine if Genie could learn outside of gaming. In a side experiment, they trained the AI on videos of robot arms moving and manipulating objects. Without any additional knowledge of how these robots should operate, Genie developed an interactive environment where a user could control a virtual robot arm just like a playable character in a video game.
As a bonus, the researchers found another emergent property: the AI model simulated object deformation, such as a bag of chips being squashed by the robot arm’s grip.
“[W]e believe Genie opens up vast potential for future research. Given its generality, the model could be trained from an even larger proportion of Internet videos to simulate diverse, realistic, and imagined environments,” the researchers write in their technical report.
For instance, the researchers speculate that Genie could create massive numbers of interactive environments for other AI models to learn in. Rather than training self-driving car AIs through trial and error on real urban streets, a model like Genie could supply a wide and diverse array of virtual, interactive environments in which to train them.
“Given that a lack of rich and diverse environments is one of the key limitations in [reinforcement learning], we could unlock new paths to creating more generally capable [AI] agents,” the researchers add.
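What might that look like in practice? Here is a speculative sketch, with `WorldModelEnv` and `predict_next_frame` as hypothetical placeholders: a generated world wrapped in the standard agent-environment loop used in reinforcement learning. (Where the reward signal would come from is a separate open question.)

```python
# Speculative sketch: wrapping a Genie-like world model in a standard
# reinforcement-learning interface. All names here are hypothetical.
class WorldModelEnv:
    def __init__(self, world_model, prompt_image):
        self.model = world_model
        self.frame = prompt_image

    def reset(self):
        return self.frame

    def step(self, action):
        # The world model, not hand-written game logic, produces the next state.
        self.frame = self.model.predict_next_frame(self.frame, action)
        reward, done = 0.0, False  # placeholder: a reward signal must come from elsewhere
        return self.frame, reward, done

# An agent could then train across many generated worlds instead of one
# hand-built simulator:
#   for prompt in diverse_prompt_images:
#       agent.train(WorldModelEnv(genie_like_model, prompt))
```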
Potential in need of a power-up
But like No Man’s Sky at launch, Genie has limitations that must be overcome before its worlds can be utilized to the fullest. Like all AI models, Genie is prone to making errors. In one example, two birds fly backward before smashing into each other and melding into a completely new character.
The AI model is also limited in what it can produce. It can only manage “16 frames of memory,” which “makes it challenging to get consistent environments over long horizons.” In other words, the levels are short or get weird fast. The environments also run at only around one frame per second; to put that in perspective, most 2D platformers target a frame rate of 60 frames per second. (The examples released by Google speed the frames up to GIF standards.)
With those limitations in mind, it may take more than a few years of development before Genie can generate a limitless stream of new gaming content. But as Jeff Clune, a co-author of the paper and an associate professor of computer science at the University of British Columbia, stated on X: “Don’t forget: This is the worst it will ever get. Soon it will work perfectly.”