Nanite Explained: How Modern Game Engines Render Billions of Triangles Without Exploding
From draw calls to culling to virtual geometry. Everything you need to know about the system that killed the triangle budget.

Introduction
I went down a bit of a rabbit hole studying how Nanite works. Here are some of the things I learned and the write-up I wish existed.
The Four Problems
Every triangle you render costs something. When you have millions of them four problems start screaming at you.
Draw Calls
A draw call is a command. The CPU tells the GPU "draw this object". One object. One draw call. Sounds simple.
The problem is the coordination. Every draw call requires the CPU and GPU to sync up. The CPU prepares the command. The GPU receives it. They handshake. This takes time. Not because the drawing is slow. But because the communication is slow.
1000 objects means 1000 draw calls. The CPU spends all its time just talking to the GPU instead of doing useful work. The GPU sits there waiting. This is a bottleneck. The slowest part that holds everything else back.
Overhead
Overhead is everything that is NOT drawing pixels. Setting up shaders. Binding textures. Switching materials. State changes. Every time the GPU switches from one material to another there is a cost. All this extra work adds up.
More objects with different materials means more overhead. The GPU spends time preparing instead of rendering.
Memory
Triangles take space. A single vertex needs about 32 to 44 bytes. That includes its position in 3D space. Its normal direction. Its texture coordinates. Each triangle also needs 12 bytes for index data.
A 1 million triangle mesh uses roughly 30 to 140 megabytes depending on how it is structured. That lives in VRAM. "VRAM" is the GPU's own memory. It is fast but limited. Fill it up and performance falls off a cliff.
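Rough numbers, sketched in Python. The 32 and 44 byte figures come from above. The half-a-vertex-per-triangle ratio for indexed meshes is a common rule of thumb for closed surfaces, not a law.

```python
# Back-of-the-envelope VRAM cost of a triangle mesh.

def mesh_bytes(triangles, bytes_per_vertex=32, indexed=True):
    if indexed:
        vertices = triangles // 2          # shared vertices: ~0.5 per triangle
        index_bytes = triangles * 12       # 3 indices x 4 bytes per triangle
    else:
        vertices = triangles * 3           # every triangle owns its own corners
        index_bytes = 0
    return vertices * bytes_per_vertex + index_bytes

MB = 1024 * 1024
print(mesh_bytes(1_000_000, 32, indexed=True) / MB)    # ~27 MB, the lean end
print(mesh_bytes(1_000_000, 44, indexed=False) / MB)   # ~126 MB, the fat end
```

That is where the 30 to 140 megabyte range comes from. Same triangle count. Wildly different memory cost depending on structure.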
Computation
Every triangle goes through a pipeline. The vertex shader transforms it into screen space. The rasterizer figures out which pixels it covers. The fragment shader colors those pixels. Multiply that by millions of triangles and you have an enormous amount of math every single frame.
If you want 60 frames per second you have about 16 milliseconds to do ALL of this. For the entire scene. Every object. Every triangle. Every pixel. 16 milliseconds. That is not a lot.
The Techniques We Had Before Nanite
People have been fighting these four problems for decades. Here are the main tools they built.
Normal Baking
You start with two models. A high detail version with millions of triangles. And a low detail version with a few thousand. You "bake" the surface detail from the high model into a texture called a normal map. This texture stores the direction each tiny surface patch faces. When light hits the low model it reads the normal map and PRETENDS the surface has all that detail.
The result looks almost the same. But the GPU only processes a few thousand triangles instead of millions.
The good: Huge geometry reduction. Low computation. Low overhead.
The bad: Medium memory cost because you need to store the normal map textures. Medium manual labor because an artist has to create both the high and low poly versions and bake the map carefully. Also the silhouette of the object is still simple. The edges of the shape look low poly because normal maps only fake surface detail. They cannot change the actual outline.
LOD. Level of Detail
You create multiple versions of the same model. LOD 0 is the full detail version. LOD 1 has fewer triangles. LOD 2 even fewer. LOD 3 is very simple. The engine swaps between them based on distance from the camera.
Close up you see LOD 0. Far away you see LOD 3. The player never notices.
The good: High geometry reduction. Saves a lot of GPU work for distant objects.
The bad: HIGH manual labor. Someone has to create 3 or 4 versions of every single model. That is a lot of work across hundreds of assets. Also "popping". When the engine swaps LOD levels the model visibly changes for a frame. Players can see it. It breaks immersion. And you need to store all versions in memory.
Subdivision Modeling
The opposite approach. You start with a simple low poly cage. The computer subdivides it. "Subdivide" means split each face into smaller faces and smooth the result. You control how many times it subdivides. More subdivisions means more detail. More detail means more triangles.
The good: Low manual labor. You build one simple model and the computer generates the detail.
The bad: Low geometry reduction. It ADDS triangles. It does not remove them. High computation because subdividing at runtime is expensive. It can only add detail. It cannot reduce it based on distance.
Voxels
Instead of triangles you represent the world as a 3D grid of cubes. Like Minecraft. Each cell in the grid is either filled or empty. You can vary the grid resolution. Coarse grid for far away. Fine grid for close up.
The good: Low manual labor. Voxel worlds can be generated by code. Dynamic detail adjustment by changing grid resolution.
The bad: High memory because you store a big 3D grid. High computation to convert voxels into something renderable. Blocky look. Sharp edges are hard to represent. Texturing is problematic because voxels do not naturally support UV coordinates.
The Summary
The first two techniques handle most problems but require a lot of manual work. The second two are more automatic but have other costs. None of them solve everything.
What if there was a system that handled all four problems. Automatically. No manual LOD creation. No popping. No wasted triangles.
The Fastest Triangle to Render
Before diving into Nanite there is one idea you need to accept.
The fastest triangle to render is the one you never send to the GPU.
Not "the smallest triangle". Not "the simplest triangle". The one you skip entirely. If you can figure out which triangles the player will never see and throw them away before the GPU even touches them you win. Every triangle you skip saves draw call time. Overhead. Memory. Computation. All four problems at once.
Think about a statue with 33 million triangles. Now put 500 of those statues in a room. That is over 16 billion triangles. You cannot render all of them. But you do not need to. Most of those triangles are either too far away to matter. Or facing away from the camera. Or hidden behind other objects.
The entire job is figuring out which ones to skip.
Clustering. Organizing the Chaos
Checking 33 million triangles one by one is way too slow. You need to group them.
A cluster is a small group of triangles. Usually around 128 of them. Instead of asking "should I render this triangle" 33 million times you ask "should I render this cluster" about 250,000 times. Much better.
Bounding Volumes
Each cluster gets wrapped in a simple shape. A sphere or a box. This is called a bounding volume. It is a quick approximation of where the cluster is in space.
Testing "does this simple box intersect the camera view" is way faster than testing 128 individual triangles. If the box is not visible you skip all 128 triangles inside it in one check.
Bounding Volume Hierarchy. BVH
You can nest these bounding volumes. Wrap two clusters in a bigger box. Wrap two of those bigger boxes in an even bigger box. Keep going until you have one box that contains the whole object.
This creates a tree structure. At the top is the whole object. At the bottom are individual clusters.
To check visibility you start at the top. If the big box is not visible you skip EVERYTHING inside it. One check eliminates thousands of clusters. If it IS visible you go one level deeper and check the two medium boxes. And so on.
This reduces the number of checks from N to log(N). In a scene with 1000 clusters that means roughly 10 checks instead of 1000. Massive speedup.
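Here is the traversal as a toy sketch. One caveat: a real engine tests each bounding box against the six frustum planes. This sketch cheats and uses a box-vs-box overlap test instead, which is enough to show the skip-whole-subtrees behavior.

```python
# BVH traversal sketch. Boxes are (min_xyz, max_xyz) tuples.

class Node:
    def __init__(self, box, children=None, cluster=None):
        self.box = box
        self.children = children or []
        self.cluster = cluster        # leaf nodes carry a cluster id

def boxes_overlap(a, b):
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))

def collect_visible(node, view_box, out):
    if not boxes_overlap(node.box, view_box):
        return                        # one check skips this whole subtree
    if node.cluster is not None:
        out.append(node.cluster)      # leaf: this cluster survives culling
        return
    for child in node.children:
        collect_visible(child, view_box, out)

# Two clusters on the left, two far to the right, one root around everything.
leaf = lambda cid, lo, hi: Node(((lo, 0, 0), (hi, 1, 1)), cluster=cid)
left  = Node(((0, 0, 0), (2, 1, 1)), [leaf("A", 0, 1), leaf("B", 1, 2)])
right = Node(((8, 0, 0), (10, 1, 1)), [leaf("C", 8, 9), leaf("D", 9, 10)])
root  = Node(((0, 0, 0), (10, 1, 1)), [left, right])

visible = []
collect_visible(root, ((0, 0, 0), (3, 1, 1)), visible)
print(visible)   # ['A', 'B'] -- C and D were skipped with ONE box check
```

The right box fails a single test and clusters C and D are never even looked at. That is the log(N) behavior.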
Automated LOD With Clusters
Take your clusters of 128 triangles. That is LOD 0. Full detail.
Now take two neighboring clusters. 256 triangles total. Merge them into one group. Simplify down to 128 triangles. Split into two new clusters.
You went from 256 to 128. Half the detail. One LOD level up. No artist involved. Fully automatic.
Repeat at every level. Each level roughly halves the triangle count until you have a very coarse version of the whole mesh at the top.
The order matters. Merge first. Then simplify. If you simplify each cluster alone the shared edges between neighbors change independently. They no longer line up. You get cracks. Visible gaps in the mesh.
By merging first the shared edge is no longer a boundary. It is in the middle of the merged group. Just regular geometry. The simplification cleans it up like any other edge. No cracks.
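The loop itself, as a toy sketch. Real simplification collapses edges on actual geometry. Here a cluster is just a triangle count, so you can watch the counts halve level by level.

```python
# Merge two neighbors, simplify the group to half its triangles,
# split back into clusters. Repeat until one coarse cluster remains.

CLUSTER_SIZE = 128

def build_lod_levels(cluster_count):
    levels = [[CLUSTER_SIZE] * cluster_count]   # LOD 0: full detail
    while len(levels[-1]) > 1:
        current, next_level = levels[-1], []
        for i in range(0, len(current), 2):
            group = sum(current[i:i + 2])       # merge two neighbors
            simplified = max(group // 2, 1)     # simplify to half the triangles
            while simplified > 0:               # split back into clusters
                next_level.append(min(simplified, CLUSTER_SIZE))
                simplified -= CLUSTER_SIZE
        levels.append(next_level)
    return levels

levels = build_lod_levels(8)
print([sum(level) for level in levels])   # [1024, 512, 256, 128]
```

Eight clusters become four, then two, then one. Triangle count halves each level. No artist touched anything.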
But here is the problem. Two clusters can only be merged if something can reach both of them. In a tree each cluster has one parent. If two clusters have different parents nobody can merge them. Their shared boundary is stuck forever. Level after level these stuck edges pile up. The mesh fills with dense geometry that cannot be simplified. That is mesh cruft.
The fix is simple. Let a cluster have two parents. Both sides of any boundary can reach the shared cluster. Both sides can merge across it. Every boundary gets cleaned up. Nothing is stuck.
That is a DAG. A Directed Acyclic Graph. "Directed" means connections flow one way. Parent to child. "Acyclic" means no loops. The only difference from a tree is that one child can have two parents.
Nanite does not use a DAG because it is fancy. Merging clusters naturally creates shared children. The DAG is just what you end up with when you allow that.
Screen Space Error
Every simplified mesh is slightly different from the original. That difference is the error. A fixed number measured in world units. It never changes.
What changes is how big that error looks on screen. 0.5 centimeters of error up close might cover 8 pixels. The player sees it. 200 meters away the same 0.5 centimeters covers less than one pixel. The player cannot see it. The monitor cannot even display it.
The rule. If the error is smaller than one pixel use the cheaper version.
Each cluster has multiple LOD levels. Each level has a known error. The engine picks the cheapest level where the error is below one pixel. Done.
This happens per cluster. Not per object. Front of a rock might be at LOD 0. Back of the same rock at LOD 3. Each cluster picks independently every frame. Changes are always smaller than one pixel. That is why there is no popping. The switches are invisible by definition.
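Here is what the pick looks like as a sketch. The projection formula is the standard perspective relation: a length on screen shrinks with distance and with a wider field of view. The error values, field of view, and resolution below are made up for illustration.

```python
import math

def error_in_pixels(error_m, distance_m, fov_y_deg=60, screen_height=1080):
    # How many pixels a world-space error covers at a given distance.
    return error_m * screen_height / (
        2 * distance_m * math.tan(math.radians(fov_y_deg) / 2))

def pick_lod(lod_errors_m, distance_m):
    """lod_errors_m[i] is the baked error of LOD i, ascending.
    Return the coarsest (cheapest) level whose error stays under one pixel."""
    best = 0
    for level, err in enumerate(lod_errors_m):
        if error_in_pixels(err, distance_m) < 1.0:
            best = level
    return best

errors = [0.0, 0.005, 0.02, 0.08]   # meters of error per LOD level
print(pick_lod(errors, 2.0))        # 0: up close, full detail
print(pick_lod(errors, 200.0))      # 3: far away, coarsest level, still sub-pixel
```

Same cluster. Same error values. Only the distance changed, and the pick flipped from full detail to the cheapest level.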
One Giant Draw Call
Traditional rendering. The CPU tells the GPU "draw this rock". Then "draw this wall". Then "draw this floor". One command per object. 1000 objects means 1000 commands. The CPU spends all its time talking.
Nanite packs all surviving cluster triangles into one big block of data in VRAM. Sends one command. "Draw all of this". Done.
The GPU does the rest. It runs the culling. Picks the LOD levels. Decides what to render. All on its own. The CPU barely participates. This works because GPUs have thousands of cores that run in parallel. Checking 250,000 clusters is 250,000 tiny tasks. Perfect GPU work. A CPU with 16 cores would choke on that. A GPU with thousands of cores eats it.
The only split is materials. Metal needs different shader code than wood. So triangles with different materials still need separate commands. But all the triangles within one material get batched together. Way fewer commands than one per object.
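The batching itself is simple. A hypothetical sketch, with made-up cluster records:

```python
from collections import defaultdict

# Group surviving clusters by material. One draw command per material,
# not one per object.

def batch_by_material(clusters):
    batches = defaultdict(list)
    for cluster in clusters:
        batches[cluster["material"]].append(cluster["id"])
    return dict(batches)

survivors = [
    {"id": 1, "material": "stone"},
    {"id": 2, "material": "stone"},
    {"id": 3, "material": "metal"},
]
print(batch_by_material(survivors))   # {'stone': [1, 2], 'metal': [3]}
```

Three objects. Two commands. Scale that to a thousand objects sharing a dozen materials and the CPU goes quiet.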
Culling. Throwing Away What You Cannot See
Even with automated LODs and batching you still have too many clusters. The next step is figuring out which ones to skip entirely. This is culling.
Culling means removing. Deciding what NOT to render. Three types. Each one catches different things.
Frustum Culling
The frustum is the shape of what your camera can see. It looks like a pyramid with the tip cut off. The near end is small and close to the camera. The far end is wide and far away.
Anything outside this shape is off screen. Do not render it. A building behind you. Gone. A tree far to the left. Gone.
You test each cluster's bounding volume against the frustum. If the bounding box does not intersect the frustum the whole cluster is skipped. Fast and simple.
But frustum culling has a blind spot. It keeps everything INSIDE the frustum. Even objects hidden behind a wall. You can see the wall but not the building behind it. Yet both pass the frustum test.
Backface Culling
Every triangle has two sides. A front face and a back face. If a triangle faces away from the camera you are looking at its back. You cannot see it. Skip it.
How does the GPU know which side faces you. Triangle vertices are listed in a specific order. If they appear on screen in the winding order the graphics API treats as front facing (counter clockwise by default in OpenGL, clockwise in Direct3D) the triangle faces you. The other order means it faces away. The GPU checks this with very fast math.
The other way to check is the dot product. Each triangle has a normal vector pointing outward. Take the view direction from the camera toward the triangle. If the dot product of the view direction and the normal is positive the triangle faces away. "Dot product" is a math operation that tells you how much two directions agree. Positive means same direction. Negative means opposite directions.
About half of the triangles on a closed, opaque mesh face away from you at any given time. So backface culling alone removes roughly 50 percent of that work.
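The dot product test as a sketch, under the convention just stated: view direction goes from the camera toward the triangle, normal points out of the front face.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def faces_away(view_dir, normal):
    # Positive dot: the normal agrees with the view direction,
    # so we are looking at the triangle's back. Cull it.
    return dot(view_dir, normal) > 0

# Camera looks down +z. A normal pointing back at the camera (-z)
# means front facing. A normal pointing along +z means we see the back.
print(faces_away((0, 0, 1), (0, 0, -1)))   # False: front facing, keep
print(faces_away((0, 0, 1), (0, 0, 1)))    # True: facing away, cull
```

Flip either convention (view direction pointing the other way, or inward normals) and the sign of the test flips with it. That is why the convention matters.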
Occlusion Culling
The hardest one. A cluster passes the frustum test. Its triangles face the camera. But there is a wall in front of it. The player cannot see the cluster because something else blocks it. That is occlusion.
To figure this out you need to know what is in front of what. That requires the depth buffer.
The Depth Buffer
The depth buffer is an image the GPU creates while rendering. Every pixel stores how far away the closest surface is. It is a grayscale image. Close things are dark. Far things are bright.
If you want to check whether a cluster is hidden you compare its depth against the depth buffer. If every pixel where the cluster would appear already has something closer then the cluster is fully hidden. Skip it.
But here is a problem. You need the depth buffer to decide what to cull. But you need to render things to build the depth buffer. Chicken and egg.
Hierarchical Z Buffer. Hi-Z
This is the trick that makes occlusion culling fast.
First a quick reminder. The depth buffer is an image the GPU builds while rendering. Every pixel stores how far away the closest visible surface is at that spot on screen. Keyword is visible. Only things that were actually drawn get recorded. If something was culled it is not in there. This took me a while to grasp. It is the key thing. Visible!
Now the problem. Say you want to check if a cluster is hidden behind something. That cluster might cover 10,000 pixels on screen. To check properly you would have to read 10,000 depth values and compare each one. That is slow.
Hi-Z fixes this. You take the depth buffer and build a mipmap chain. A mipmap is a series of smaller copies of the image. Full resolution. Half. Quarter. Eighth. All the way down to one pixel.
But here is the key difference from normal mipmaps. Normal texture mipmaps average the pixels together. Hi-Z takes the MAXIMUM depth value. The furthest point.
Why maximum. This is the part that matters for occlusion.
Say four pixels in the depth buffer have values 5. 8. 3. 10. Those numbers are distances. Something visible was rendered at distance 5 at one pixel. Something at distance 8 at another. And so on. The max is 10. That means the FURTHEST visible thing in that group of four pixels is at distance 10. Everything else there is even closer.
Now you have a cluster at distance 12. It is further away than 10. That means it is further away than EVERYTHING rendered in that group. Even the furthest thing there is still closer than the cluster. So the cluster is behind all of it. Fully hidden. That is occlusion. One check covered four pixels.
Go up another mipmap level. One pixel now covers 16 original pixels. Same logic. One check covers 16 pixels. Next level. 64 pixels. Then 256. Then 1024. A cluster that covers a large area on screen can be tested with just one or two reads at a coarse mipmap level instead of thousands of reads.
To test a cluster you figure out how big it would be on screen. Pick the mipmap level where one pixel roughly covers that area. Read the max depth value. Compare it to the cluster's distance. Cluster is further away. It is behind everything there. Hidden. Cut it. Cluster is closer. It might be visible. Keep it.
This is conservative. It plays it safe. Sometimes it says "maybe visible" when the cluster is actually hidden. That is fine. You render a little extra. But it NEVER says "hidden" when the cluster is actually visible. That would mean missing objects on screen. That would be a bug. So it errs on the side of doing a little extra work rather than skipping something it should not.
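Here is the whole mechanism as a 1D sketch. Two simplifications: real buffers are 2D and each coarse texel takes the max of a 2x2 block, and this sketch assumes a power-of-two buffer where the cluster's span starts at texel 0, just to keep the code short.

```python
def build_hiz(depth):
    """Mip chain of MAX depths: each level halves the size."""
    chain = [list(depth)]
    while len(chain[-1]) > 1:
        prev = chain[-1]
        chain.append([max(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return chain

def maybe_visible(chain, pixel_span, cluster_near_depth):
    level = max(pixel_span.bit_length() - 1, 0)   # mip where span ~ one texel
    furthest = chain[level][0]                    # sketch: span starts at texel 0
    # Conservative: closer than the FURTHEST visible thing -> might be visible.
    return cluster_near_depth <= furthest

chain = build_hiz([5, 8, 3, 10])
print(chain)                          # [[5, 8, 3, 10], [8, 10], [10]]
print(maybe_visible(chain, 4, 12))    # False: behind everything there, cull
print(maybe_visible(chain, 4, 7))     # True: might be visible, keep
```

Those are exactly the numbers from the example above. One read at the coarse level replaced four comparisons, and the answer only ever errs toward "keep it".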
Nanite's Two Pass Occlusion
Nanite takes this further with two passes.
Pass 1: Use last frame's depth buffer. Test all clusters against it. Most things that were hidden last frame are still hidden this frame. This catches the majority of occluded clusters immediately.
Pass 2: Render the clusters that survived pass 1. Build a new depth buffer. Now test any remaining uncertain clusters against this fresh depth buffer. This catches things that just moved into view or edge cases the old depth buffer missed.
Two passes sounds like more work. But it means almost nothing gets rendered unnecessarily. The total work is less than doing it in one pass with guesswork.
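A toy version of the flow. The screen is 1D and eight pixels wide, and the cluster shapes are made up. It shows the one thing that matters: a statue that fails against last frame's depth buffer but passes against the fresh one.

```python
INF = float("inf")
W = 8   # 1D "screen", 8 pixels wide

def render(clusters):
    """Write each cluster's depth into the pixels it covers. Closest wins."""
    depth = [INF] * W
    for _, (x0, x1), d in clusters:
        for x in range(x0, x1):
            depth[x] = min(depth[x], d)
    return depth

def passes(cluster, depth):
    _, (x0, x1), d = cluster
    return any(d <= depth[x] for x in range(x0, x1))

def two_pass(clusters, prev_depth):
    pass1 = [c for c in clusters if passes(c, prev_depth)]   # vs LAST frame
    fresh = render(pass1)                                    # draw survivors
    pass2 = [c for c in clusters if c not in pass1 and passes(c, fresh)]
    return pass1 + pass2

# Last frame a wall covered the whole screen at depth 5. This frame the
# wall moved left, uncovering a statue that was hidden behind it.
prev_depth = [5] * W
clusters = [
    ("wall",   (0, 4), 5),   # still visible
    ("statue", (5, 8), 8),   # hidden last frame, uncovered now
    ("buried", (0, 4), 9),   # still behind the wall
]
print([name for name, _, _ in two_pass(clusters, prev_depth)])
# ['wall', 'statue'] -- buried stays culled, statue is caught by pass 2
```

The statue would have been wrongly culled by a single pass using the old buffer. The buried cluster gets culled cheaply in both. That is the trade.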
The Small Triangle Problem
There is one more thing Nanite had to solve.
The GPU processes pixels in groups of 2x2. These groups are called quads. When a triangle is so small it only covers one pixel in that quad the GPU still runs the fragment shader for all four pixels. Three of those four pixels are wasted work. Up to 75 percent waste.
With millions of small triangles this waste adds up fast.
Nanite's solution. For clusters where triangle edges are smaller than 32 pixels it uses a software rasterizer. "Software rasterizer" means custom code running on the GPU that replaces the normal hardware rasterization pipeline. This custom code knows how to handle tiny triangles efficiently. No wasted quads.
For larger triangles the normal hardware rasterizer runs as usual. It is already fast for big triangles.
This split made Nanite's rasterization three times faster.
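The decision itself is a one-liner. The 32 pixel cutoff is the figure from the text. Everything else here is a stand-in: a real implementation measures projected edge lengths per cluster on the GPU, not per triangle like this toy does.

```python
SOFTWARE_RASTER_MAX_EDGE_PX = 32

def pick_rasterizer(longest_edge_px):
    if longest_edge_px < SOFTWARE_RASTER_MAX_EDGE_PX:
        return "software"   # tiny triangle: avoid 2x2 quad waste
    return "hardware"       # big triangle: hardware is already fast

print(pick_rasterizer(2))     # software
print(pick_rasterizer(300))   # hardware
```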
The Full Picture
Nanite is not the GPU. Nanite is software built by Epic Games. A system of code that runs on the GPU and tells it exactly what to do. The GPU is the muscle. Nanite is the brain.
It works in two phases.
Phase 1. Before The Game Runs
This happens once. When the artist imports a mesh into Unreal Engine.
Clustering
An artist imports a 33 million triangle statue. Nanite splits it into clusters. Small groups of 128 triangles each.
Building The DAG
It builds the DAG. The structure where clusters can share boundaries with multiple parents. This is what allows clean simplification without cracks.
Creating The LOD Levels
Merge neighboring clusters. Simplify. Split into new clusters. Repeat. Each level has roughly half the triangles of the level below it. Old boundaries get cleaned up because they become interior edges after merging. No manual work. Fully automatic.
Calculating The Error
For each LOD level it measures how many centimeters the simplified version differs from the original. This number is baked in. It never changes.
Storing The Data
Two things get stored. The metadata. Bounding boxes. Error values. Facing directions. DAG connections. This is lightweight. A few megabytes. And the actual triangle data for every LOD level. This is heavy. Sits on disk.
Done. Never repeated.
Phase 2. Every Frame
The metadata sits in RAM. Always available. The GPU runs through it and filters the clusters down step by step.
Frustum Culling
Is this cluster on screen. Test the bounding box against the camera frustum. Off screen. Cut it.
Backface Culling
Is this cluster facing the camera. Check the stored normal direction against the view direction. Facing away. Cut it.
Screen Space Error
Which LOD level does this cluster need. Convert the error value to pixels based on distance from camera. Pick the cheapest level where the error is below one pixel. Invisible to the player.
Occlusion Culling Pass 1
Was this cluster hidden last frame. Check against last frame's Hi-Z depth buffer. Further away than everything at that spot. Cut it.
Occlusion Culling Pass 2
Render everything that made it this far. Build a fresh depth buffer. Test uncertain clusters again. Still hidden. Cut it.
Streaming
Now the GPU has a small list of clusters that passed every check. Only their triangle data gets streamed from disk into VRAM. Everything else stays on disk untouched.
Rasterization
Clusters with tiny triangles where edges are smaller than 32 pixels go through the software rasterizer. Custom code. No wasted pixels. Bigger triangles go through the normal hardware rasterizer.
The Draw Call
Here is the important part. By the time the draw call happens the GPU already threw away most of the scene. The culling. The LOD selection. All of that happened before this step. The buffer only contains the clusters that passed every single check.
So the CPU sends one command per material. "Draw this". But "this" is not 33 million triangles. It is maybe 50,000. The GPU is not drawing everything. It is drawing the tiny fraction that actually matters. One command. Small amount of real work.
Frame done. 16 milliseconds. Next frame. Do it all again.
When to Use Nanite. And When Not To
Nanite is not the answer to everything.
Use it for: Scenes with extremely high poly static assets. Photogrammetry scans. Rocks. Walls. Architectural models. Anything that is large. Opaque. And does not move.
Do not use it for: Dynamic or animated objects like characters. Transparent materials. Masked materials like foliage and leaves. Very small detailed objects. In these cases traditional LOD and culling techniques still work better.






