Generative AI Terrains in Simulations

Unleashing the Power of Generative AI for Terrain Generation in Games

Creating realistic and engaging terrain for video games has always been a challenge for game developers. Traditionally, terrain generation has relied on procedural algorithms and manual artist input to design in-game landscapes. However, recent advancements in artificial intelligence (AI) have opened the door to a new world of possibilities for terrain generation. In this blog post, we'll explore how generative AI is revolutionizing the process of terrain creation, enabling the development of more complex, diverse, and immersive environments for players to explore.

What is Generative AI?

Generative AI refers to a class of artificial intelligence algorithms designed to create new content or data based on existing examples. These algorithms learn patterns and structures from the input data and then generate new instances that exhibit similar properties. Among the most popular generative AI techniques are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and systems that build upon VAEs, like Stable Diffusion (SD). These models have shown remarkable success in generating realistic images, audio, and even 3D models from text inputs: a user simply describes what they are looking for and the model produces surprisingly impressive results.

Heightmaps

Heightmap diagram via Bonfire Studios

One of the primary technologies used to create terrain in video games is the heightmap. A heightmap is a 2D grayscale image where each pixel corresponds to the elevation at a point on a grid. By applying this heightmap to a 3D mesh, developers can create a 3D representation of terrain that accurately reflects the elevations and contours of the original heightmap.
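To make that concrete, here's a minimal Python sketch of the idea using numpy and Pillow; the file name and height scale are placeholders, not values from our pipeline:

```python
# Minimal sketch: turning a grayscale heightmap into a grid of 3D vertices.
# HEIGHTMAP_PATH and MAX_HEIGHT are illustrative placeholders.
import numpy as np
from PIL import Image

HEIGHTMAP_PATH = "heightmap.png"  # hypothetical input file
MAX_HEIGHT = 100.0                # world-space height of a pure-white pixel

heights = np.asarray(Image.open(HEIGHTMAP_PATH).convert("L"), dtype=np.float32) / 255.0
rows, cols = heights.shape

# One vertex per pixel: x/z come from the grid position, y from the pixel value.
xs, zs = np.meshgrid(np.arange(cols), np.arange(rows))
vertices = np.stack([xs, heights * MAX_HEIGHT, zs], axis=-1).reshape(-1, 3)
print(vertices.shape)  # (rows * cols, 3)
```

A game engine's terrain system does essentially this sampling for you, plus triangulation, normals, and level-of-detail.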

Heightmaps are typically created using specialized software or tools that allow developers to sculpt the height values for each pixel so that the terrain fits the needs of the game. In the Unity and Unreal terrain systems, you can also use “stamps”, which are essentially brushes for terrain. Other tools let you create a rough outline of the landscape and then procedurally erode it to look more realistic.

The problem with these tools is that they are not very accessible, and it often takes a proper 3D artist to create a satisfactory result. As one step toward our goal of democratizing the creation of 3D experiences, we leveraged the amazing work in AI image generation and applied it to creating terrains.

Training AI to Generate Terrain Heightmaps

Starting with a training set of ~600 terrains of various styles, we needed to convert them to a format that Stable Diffusion could work with. That meant taking the 4K, 16-bit, single-channel terrains and converting them to 512x512 RGB images (8 bits per channel). The conversion lost a lot of detail, but it wasn’t clear how well the underlying VAE would handle these rather strange images in the first place, so a simple conversion made sense as a starting point.
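The conversion boils down to a downsample plus a bit-depth reduction. Here’s a rough sketch of that kind of step (filenames and the source format are placeholders, not our exact pipeline):

```python
# Sketch of the conversion described above: a 4K 16-bit single-channel
# heightmap downsampled to a 512x512 8-bit RGB image.
import numpy as np
from PIL import Image

src = Image.open("terrain_4k_16bit.tif")   # hypothetical 16-bit grayscale source
arr16 = np.asarray(src, dtype=np.uint16)

# Drop to 8 bits per channel (this is where most of the detail is lost).
arr8 = (arr16 >> 8).astype(np.uint8)

# Downsample to the 512x512 resolution SD-1.5 expects and replicate into RGB.
img = Image.fromarray(arr8, mode="L").resize((512, 512), Image.LANCZOS)
img.convert("RGB").save("terrain_512_rgb.png")
```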

Using the Stable Diffusion WebUI running locally on an RTX 3090, we trained a custom embedding on SD-1.5 so that, at generation time, we could instruct Stable Diffusion to generate only terrain images. The process we used for this is called “Textual Inversion”.

Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new “words” in the embedding space of a frozen text-to-image model. These “words” can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts.
— Rinon Gal et al., ’22

So, we expanded the vocabulary of the CLIP text encoder with a new and unique word that we made up, “FableTerrainGenv1”.

That word is set to use 8 vectors per token so we can capture a lot of visual variation in what we mean by it. The more vectors we use, the less room remains in the prompt for the rest of the description (the text encoder has a fixed token limit), but luckily the other concepts are not affected. That means you could ask for terrain images with a prompt like “A series of sharp mountains surrounding a flat lake in the middle by FableTerrainGenv1” and the other words are still understood.

We trained this embedding over 10,000 steps with a batch size of 8, which took the entirety of the GPU’s 24 GB of memory to hold the 8 images in the batch and the three models involved in the Stable Diffusion pipeline (VAE, UNet, CLIP).
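Conceptually, textual inversion boils down to something like the sketch below: every pretrained weight stays frozen, and only the embedding rows for the new “word” get optimized. This is illustrative Python using Hugging Face’s CLIP classes, not the WebUI’s actual training code:

```python
# Conceptual sketch of textual inversion (not the WebUI's training code):
# the SD models stay frozen; only the embedding rows for the new token train.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register 8 new "words" that together stand for the terrain concept.
new_tokens = [f"<FableTerrainGenv1_{i}>" for i in range(8)]
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze everything...
for p in text_encoder.parameters():
    p.requires_grad_(False)

# ...except the embedding table, then only the new rows are actually updated.
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
new_ids = tokenizer.convert_tokens_to_ids(new_tokens)
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-3)

# Inside the training loop, the usual diffusion loss is backpropagated and the
# gradients for every embedding row except `new_ids` are zeroed before
# optimizer.step(), so the rest of the vocabulary is untouched.
```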

Here are some examples of the results at various milestones during training.

For more information on this process, especially using the handy stable-diffusion-webUI project, we recommend checking out their documentation.

The beauty of this method is that instead of retraining the network itself, you are simply finding the right set of embedding vectors to represent the concept you want to concentrate on. That means the resulting embedding only takes up about 25 KB instead of a few gigabytes.

Results

Generating the Terrain

We now ask the model to generate “sharp mountains by FableTerrainGenv1”, and Stable Diffusion produces the following image.
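If you’d rather script this step than use the WebUI, the diffusers library offers a similar flow. The snippet below is a rough sketch with placeholder paths, not our exact setup:

```python
# Rough diffusers equivalent of the WebUI generation step (paths are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the trained textual-inversion embedding under its trigger word.
pipe.load_textual_inversion("FableTerrainGenv1.pt", token="FableTerrainGenv1")

image = pipe("sharp mountains by FableTerrainGenv1", num_inference_steps=30).images[0]
image.save("terrain_heightmap.png")
```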

Importing Into Unity

The generated 512x512 heightmap is then imported into your game engine. Here we’re using Unity 2021.3 with a simple texture set applied to highlight the features.
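One convenient route (a sketch with placeholder filenames, not necessarily what we did) is to convert the generated image into a 16-bit RAW file that Unity’s terrain inspector can read via “Import Raw”:

```python
# Sketch: converting the generated 512x512 image into a 16-bit RAW heightmap
# for Unity's "Import Raw". Filenames are placeholders.
import numpy as np
from PIL import Image

# Unity heightmap resolutions are 2^n + 1, so resize 512 -> 513.
img = Image.open("terrain_heightmap.png").convert("L").resize((513, 513))

# Expand 8-bit values to the 16-bit range and write little-endian
# ("Windows" byte order in Unity's import dialog).
heights16 = np.asarray(img, dtype=np.uint16) << 8
heights16.astype("<u2").tofile("terrain_heightmap.raw")
```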

First Person View

It’s also helpful to see the bare terrain up close, though it’s still missing textures, vegetation, rocks, and more. These can be placed procedurally using features of the terrain itself, for instance placing rocks at the bottom of slopes and spawning trees within the valleys.
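As a rough illustration of that kind of rule, here’s a sketch that derives slope from the heightmap and flags candidate spawn points; the thresholds are made up for the example:

```python
# Sketch of rule-based placement driven by terrain features: slope and height
# decide where rocks vs. trees might spawn. Thresholds are illustrative only.
import numpy as np
from PIL import Image

heights = np.asarray(Image.open("terrain_heightmap.png").convert("L"), dtype=np.float32) / 255.0
dy, dx = np.gradient(heights)
slope = np.sqrt(dx ** 2 + dy ** 2)

steep = slope > 0.04                               # steep faces: candidate rock spots
low_and_flat = (heights < 0.35) & (slope < 0.01)   # valley floors: candidate tree spots

rock_spots = np.argwhere(steep)
tree_spots = np.argwhere(low_and_flat)
print(len(rock_spots), "rock candidates,", len(tree_spots), "tree candidates")
```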

Next Steps

We’re continuing to explore this space, extending these ideas to many other elements in the pipeline required to create simulation experiences. Here, we use an Img2Img model on the image above to skip the generation of 3D objects like vegetation and imagine what it would look like to simply describe the environment we want.
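For reference, an Img2Img pass like the one described can be scripted with diffusers roughly as follows; the model ID, prompt, and strength here are illustrative choices, not our production settings:

```python
# Sketch of an img2img pass over a rendered screenshot, in the spirit of the
# step described above.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("unity_screenshot.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="lush alpine valley, pine forests, scattered boulders, golden hour",
    image=init,
    strength=0.6,          # how far to stray from the original render
    guidance_scale=7.5,
).images[0]
result.save("imagined_environment.png")
```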

Frank Carey

Fable CTO
