Indirect draws in WebGPU

1. The two kinds of passes

Every frame, CPU records commands grouped into passes. Two kinds matter here.

Render pass = pixels come out. Runs the standard graphics pipeline:

Vertex shader: runs once per vertex (corner of a triangle). Decides where each point ends up on screen.
Fragment shader: runs once per pixel covered by a triangle. Decides what color that pixel is.

Inside a render pass you issue draw calls. Output goes to a texture (your canvas, a shadow map, etc).

Compute pass = numbers come out. Runs a compute shader. No vertices, no pixels, no triangles. Just custom logic running in parallel over a buffer. Output is whatever you write to storage buffers. Culling, physics, particle updates, GPU-side logic all live here.

Typical frame:

commandEncoder
  → beginComputePass()        // prep (decide what to draw)
    dispatchWorkgroups(...)
    end()
  → beginRenderPass(...)      // actually draw it
    drawIndirect(...)
    end()
  → finish()                  // then device.queue.submit([commandBuffer])

Compute pass writes into buffers. Render pass reads those same buffers. That's the handoff.
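
Here is roughly what that frame looks like with real WebGPU calls. It's a sketch, not a full renderer: `recordFrame` and every field on `res` (pipelines, bind groups, buffers, pass descriptor, object count) are hypothetical names for things you'd create elsewhere.

```javascript
// Sketch: record one GPU-driven frame. `res` is a hypothetical bundle of
// resources created elsewhere (pipelines, bind groups, buffers, descriptors).
function recordFrame(device, res) {
  const encoder = device.createCommandEncoder();

  // Compute pass: decide what to draw and write draw args into res.argsBuffer.
  const compute = encoder.beginComputePass();
  compute.setPipeline(res.cullPipeline);
  compute.setBindGroup(0, res.cullBindGroup);
  // One invocation per object, rounded up to whole workgroups.
  compute.dispatchWorkgroups(Math.ceil(res.objectCount / res.workgroupSize));
  compute.end();

  // Render pass: draw using whatever the compute pass wrote.
  const render = encoder.beginRenderPass(res.renderPassDescriptor);
  render.setPipeline(res.drawPipeline);
  render.setBindGroup(0, res.drawBindGroup);
  render.setVertexBuffer(0, res.vertexBuffer);
  render.drawIndirect(res.argsBuffer, 0); // args start at byte offset 0
  render.end();

  // Nothing runs until the finished command buffer is submitted.
  device.queue.submit([encoder.finish()]);
}
```

Note that no counts appear anywhere on the CPU side. The only number the CPU picks is how many workgroups to launch.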

2. How a normal draw works (at the hardware level)

When CPU calls renderPass.draw(1000, 5), here's what happens under the hood:

The driver packs a command like CMD_DRAW vertexCount=1000 instanceCount=5 and queues it for the GPU.
The GPU has a unit at the front called the command processor. It reads the command stream one entry at a time.
Command processor sees CMD_DRAW, reads the numbers (1000, 5) baked into the command, kicks off the work: vertex fetch, vertex shader, rasterization, fragment shader, framebuffer write.

Key point: those numbers (1000, 5) were baked in when CPU recorded the command.

3. What indirect changes

Normal draw: CPU bakes the numbers (count, instance count) into the command.

Indirect draw: CPU points at a buffer instead. The command looks like this:

CMD_DRAW_INDIRECT bufferAddress=0x... offset=0

No counts in the command itself.

When the command processor hits this, it does one extra step: read 16 or 20 bytes from that buffer. Those bytes ARE the draw arguments. Then it draws normally.

Same shaders, same rasterization, same output. Only difference: the numbers came from GPU memory.

Why it matters: something else on the GPU can write those numbers. Usually a compute shader.

4. The buffer layout

When a compute shader fills in draw args, it has to write the right fields in the right order. So you need to know what's in that buffer.

Each draw's args take a fixed number of bytes:

drawIndirect (16 bytes): vertexCount, instanceCount, firstVertex, firstInstance
drawIndexedIndirect (20 bytes): indexCount, instanceCount, firstIndex, baseVertex, firstInstance

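Every field is a 32-bit integer (unsigned, except baseVertex, which is signed). Pack as many as you want, one after another. The byte offset you pass to drawIndirect or drawIndexedIndirect tells the GPU where to start reading. It has to be a multiple of 4.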

[draw0: 20 bytes][draw1: 20 bytes][draw2: 20 bytes]
 offset=0         offset=20         offset=40
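
To make the layout concrete, here's a small sketch that packs indexed draw args on the CPU in exactly that order. `packDrawIndexedIndirectArgs` and `argsOffset` are illustrative helper names, not WebGPU API, and the little-endian writes assume the byte order WebGPU platforms use in practice.

```javascript
const DRAW_INDEXED_INDIRECT_BYTES = 20; // 5 fields x 4 bytes each

// Pack drawIndexedIndirect args back to back, in the field order listed above.
function packDrawIndexedIndirectArgs(draws) {
  const bytes = new ArrayBuffer(draws.length * DRAW_INDEXED_INDIRECT_BYTES);
  const view = new DataView(bytes);
  draws.forEach((d, i) => {
    const base = i * DRAW_INDEXED_INDIRECT_BYTES;
    view.setUint32(base + 0, d.indexCount, true);
    view.setUint32(base + 4, d.instanceCount, true);
    view.setUint32(base + 8, d.firstIndex, true);
    view.setInt32(base + 12, d.baseVertex, true); // the only signed field
    view.setUint32(base + 16, d.firstInstance, true);
  });
  return bytes;
}

// Byte offset to pass to drawIndexedIndirect for the i-th packed draw.
function argsOffset(drawIndex) {
  return drawIndex * DRAW_INDEXED_INDIRECT_BYTES;
}
```

You could upload the result with device.queue.writeBuffer(argsBuffer, 0, bytes). A compute shader writing the args has to produce the same byte layout.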

Two buffer usage flags to remember:

INDIRECT (always required)
STORAGE (add this if a compute shader writes to the buffer)

Forget INDIRECT and the draw call fails validation. Forget STORAGE and you can't bind the buffer as writable storage for the compute shader. You don't need to memorize the field layout. Just know it exists when you sit down to write the compute shader.
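
A minimal sketch of picking the usage flags. The numeric values are the ones the WebGPU spec assigns to GPUBufferUsage; in a browser you'd use the global GPUBufferUsage instead of this table. `indirectBufferDescriptor` and its parameters are hypothetical helpers, and the COPY_DST branch assumes that when no compute shader writes the args, the CPU uploads them.

```javascript
// Numeric values of the relevant GPUBufferUsage flags (from the WebGPU spec).
// In a browser, use GPUBufferUsage.INDIRECT etc. instead of this table.
const BufferUsage = { COPY_DST: 0x0008, STORAGE: 0x0080, INDIRECT: 0x0100 };

// Build a descriptor for device.createBuffer().
// drawCount, bytesPerDraw and writtenByCompute are illustrative parameters.
function indirectBufferDescriptor(drawCount, bytesPerDraw, writtenByCompute) {
  let usage = BufferUsage.INDIRECT; // always required for indirect draws
  usage |= writtenByCompute
    ? BufferUsage.STORAGE   // a compute shader writes the args
    : BufferUsage.COPY_DST; // the CPU uploads them with queue.writeBuffer
  return { size: drawCount * bytesPerDraw, usage };
}
```

Usage would look like device.createBuffer(indirectBufferDescriptor(3, 20, true)) for three indexed draws authored by a compute shader.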

5. The barrier

GPUs run work in parallel. If your compute pass writes to a buffer and your render pass reads from it, the GPU needs to know "finish the writes before the reads start." That's a barrier.

WebGPU inserts barriers automatically between passes. You don't write them manually. They just need to exist, and they do.

6. A real frame, step by step

The full chain in a GPU driven frame:

CPU submits three commands:
  CMD_DISPATCH       → run the compute pass
  CMD_BARRIER        → wait for compute writes to land
  CMD_DRAW_INDIRECT  → run the render pass

GPU command processor does:
  1. Read CMD_DISPATCH.
     Spawn compute shader invocations.
     Compute shader decides which objects survive culling,
     writes their instance data into a buffer,
     writes instanceCount into the indirect args buffer.

  2. Read CMD_BARRIER.
     Wait for those writes to be visible to subsequent reads.

  3. Read CMD_DRAW_INDIRECT.
     Read 16 bytes from the indirect args buffer (20 for an indexed draw).
     See instanceCount=3000 (written by the compute shader).
     Spawn vertex shaders, rasterize, run fragment shaders.

CPU submitted 3 commands. Never touched individual objects. GPU decided how many to draw.
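
To see what step 1 actually computes, here's a CPU reference of the culling logic. It's only a model: on the GPU each loop iteration would be one compute invocation and the counter bump would be an atomic add. The data layout (bounding spheres, frustum planes as [nx, ny, nz, d]) and the name `cullToIndirectArgs` are assumptions for illustration.

```javascript
// CPU reference for what the culling compute shader does in step 1.
// Assumed inputs: spheres = [{ x, y, z, radius }], planes = [[nx, ny, nz, d]]
// with each plane normal pointing into the frustum.
function cullToIndirectArgs(spheres, planes, vertexCount) {
  const visible = [];              // stands in for the instance data buffer
  const args = new Uint32Array(4); // vertexCount, instanceCount, firstVertex, firstInstance
  args[0] = vertexCount;           // instanceCount (args[1]) starts at 0

  for (let i = 0; i < spheres.length; i++) {
    const s = spheres[i];
    // Keep the sphere unless it lies fully behind one of the planes.
    const inside = planes.every(([nx, ny, nz, d]) =>
      nx * s.x + ny * s.y + nz * s.z + d >= -s.radius);
    if (inside) {
      visible.push(i); // write instance data for the survivor
      args[1] += 1;    // on the GPU: an atomic add on instanceCount
    }
  }
  return { visible, args };
}
```

One thing the GPU version has to handle that this model hides: instanceCount must be reset to 0 before each frame's dispatch, or the count keeps growing.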

7. Why this is a win

a) CPU per-object work disappears

Normal renderer looks like this:

for (const obj of scene.objects) {           // CPU loops
  if (isInFrustum(obj)) {                     // CPU tests
    renderPass.setBindGroup(0, obj.bg);       // CPU sets state
    renderPass.drawIndexed(obj.indexCount);   // CPU issues draw
  }
}

100k objects = CPU loops 100k times. Even skipped objects cost a check. More objects = more CPU work.

GPU driven + indirect: loop is gone. CPU dispatches one compute pass and one indirect draw regardless of object count. 1k objects or 1M objects look identical from the CPU side.

b) Driver overhead shrinks

Every draw call has cost beyond drawing: validation, state setup, command translation, tracking. This cost is per-call, not per-triangle.

1000 draws of 10 triangles each is way more CPU-expensive than 1 draw of 10,000 triangles. Same GPU work, much more driver overhead.

Indirect + instancing collapses many draws into few. Less overhead per frame.
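
A rough sketch of the "collapse" idea, done on the CPU for clarity: group objects by mesh so each group becomes one instanced draw. `batchByMesh` and the `mesh` / `transform` fields are hypothetical. In a GPU-driven renderer the compute shader does the equivalent bookkeeping.

```javascript
// Group objects that share a mesh so each group becomes one instanced draw.
// `mesh` and `transform` are hypothetical fields on the scene objects.
function batchByMesh(objects) {
  const batches = new Map();
  for (const obj of objects) {
    if (!batches.has(obj.mesh)) batches.set(obj.mesh, []);
    batches.get(obj.mesh).push(obj.transform);
  }
  return batches; // one draw per entry, instanceCount = transforms.length
}
```

With 1000 trees and 5000 grass blades this yields two entries, so two draws instead of 6000.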

c) No readback, no stall

Traditional CPU culling sometimes needs data that lives on the GPU (animated bone positions, depth buffer for occlusion). CPU has to ask GPU for that data and wait for it. The wait is a stall. The transfer is a readback.

Readbacks are slow. Milliseconds sometimes. A 60fps frame budget is about 16.7ms. One bad readback can wreck it.

With GPU driven rendering, CPU never asks. Data lives on the GPU. Compute shader uses it directly, writes results to another GPU buffer, render pass reads that buffer. CPU is never in the loop.

8. The remaining problem: many different meshes

A single draw call (indirect or not) draws one mesh. It uses the currently bound vertex buffer, index buffer, shader, and textures, all set by the CPU before the call.

Indirect lets the GPU decide how many of that mesh to draw (the instanceCount). It does NOT let the GPU decide which mesh. The mesh is locked in by whatever the CPU bound right before the draw.

So if you have 1000 trees and 5000 grass blades, that's 2 indirect draws. Fine. The GPU decides the per-frame counts.

But if you have 500 different meshes (tree, rock, grass, mushroom, stump, flower...), you need 500 indirect draws, because each one needs different bound geometry:

for (const meshType of allMeshTypes) {  // 500 iterations on CPU
  renderPass.setVertexBuffer(0, meshType.vbo);
  renderPass.setIndexBuffer(meshType.ibo, 'uint32');  // or 'uint16'
  renderPass.setBindGroup(0, meshType.bg);
  renderPass.drawIndexedIndirect(argsBuffer, meshType.argsOffset);
}

The args are GPU authored, but CPU still walks a list of 500 to swap geometry between draws.

Better than 500 normal draws (you skipped the per-object loop). Still not ideal at extreme scale.

That's what multi-draw indirect fixes next.

9. Multi-draw indirect

Quick refresher on vertex buffers first.

A vertex buffer is just a chunk of GPU memory holding the points (vertices) that make up a mesh. A tree mesh has, say, 512 vertices. Those 512 points sit in a vertex buffer. When you draw the tree, the GPU reads from that buffer to know where the triangles are.

Each mesh normally has its own vertex buffer. To draw a different mesh, the CPU has to swap which buffer is bound. That swap is the slow part.

Now the trick.

renderPass.multiDrawIndexedIndirect(argsBuffer, offset, countBuffer, countOffset, maxCount)

Step 1: put every mesh into ONE big vertex buffer.

Instead of 500 separate buffers (one per mesh), you concatenate all of them into one big buffer:

big vertex buffer:
[tree verts][rock verts][grass verts][mushroom verts][...]
 0           512          640          704

The tree's vertices live at the start. The rock's vertices live right after. And so on. Every mesh is a slice of the same big buffer.

Step 2: each draw points at its slice.

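Each draw arg has fields called baseVertex and firstIndex. baseVertex is added to every index before the vertex is fetched, so it says where the mesh's vertices start in the big vertex buffer. firstIndex does the same job for the index buffer, which you concatenate the same way. So: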

draw 0: baseVertex=0     → tree slice
draw 1: baseVertex=512   → rock slice
draw 2: baseVertex=640   → grass slice

Same buffer for all of them. Different starting positions. No swapping needed.
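
Steps 1 and 2 can be sketched on the CPU like this. It assumes each mesh is { name, positions, indices } with three floats per vertex and indices relative to the mesh's own first vertex. `packMeshes` and that mesh shape are made up for the example.

```javascript
// Concatenate per-mesh geometry and record where each slice starts.
// Assumed mesh shape: { name, positions: Float32Array (xyz per vertex),
// indices: Uint32Array (relative to that mesh's own first vertex) }.
function packMeshes(meshes) {
  const totalFloats = meshes.reduce((n, m) => n + m.positions.length, 0);
  const totalIndices = meshes.reduce((n, m) => n + m.indices.length, 0);
  const positions = new Float32Array(totalFloats);
  const indices = new Uint32Array(totalIndices);
  const slices = [];
  let baseVertex = 0; // counted in vertices, not bytes
  let firstIndex = 0; // counted in indices, not bytes
  for (const m of meshes) {
    positions.set(m.positions, baseVertex * 3);
    indices.set(m.indices, firstIndex); // indices stay mesh-relative
    slices.push({ name: m.name, baseVertex, firstIndex, indexCount: m.indices.length });
    baseVertex += m.positions.length / 3;
    firstIndex += m.indices.length;
  }
  return { positions, indices, slices };
}
```

Because the indices stay mesh-relative, baseVertex does the relocation at draw time. The slices table is what the compute shader would read when it fills in draw args.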

Step 3: GPU writes the args.

A compute shader figures out what's visible. For each visible mesh, it writes a draw arg pointing at the right slice. It also writes the total number of draws into a small buffer called the count buffer.

Step 4: one CPU call does everything.

multiDrawIndexedIndirect(argsBuffer, offset, countBuffer, countOffset, maxCount)

argsBuffer = all the draw args the compute shader wrote
countBuffer = how many of them are real (GPU decided)
maxCount = upper safety cap

GPU reads the count. Issues that many draws. Each one pulls geometry from its own slice of the big buffer. CPU never loops, never swaps state, never finds out how many draws happened.
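
Here's a CPU model of what the GPU does with those arguments, mirroring the parameter order used above. It is not the WebGPU API, just a sketch of the semantics: read the count, clamp it to maxCount, decode that many 20-byte entries. `decodeMultiDraw` is a hypothetical name and little-endian layout is assumed.

```javascript
const MDI_STRIDE = 20; // bytes per drawIndexedIndirect entry

// CPU model of a multi-draw with a count buffer:
// read the count, clamp it to maxCount, then decode that many entries.
function decodeMultiDraw(argsBuffer, offset, countBuffer, countOffset, maxCount) {
  const count = new DataView(countBuffer).getUint32(countOffset, true);
  const drawCount = Math.min(count, maxCount); // maxCount is the safety cap
  const args = new DataView(argsBuffer);
  const draws = [];
  for (let i = 0; i < drawCount; i++) {
    const base = offset + i * MDI_STRIDE;
    draws.push({
      indexCount: args.getUint32(base, true),
      instanceCount: args.getUint32(base + 4, true),
      firstIndex: args.getUint32(base + 8, true),
      baseVertex: args.getInt32(base + 12, true),
      firstInstance: args.getUint32(base + 16, true),
    });
  }
  return draws; // the GPU would issue one indexed draw per entry
}
```

The clamp is why maxCount matters: if the compute shader writes a garbage count, the GPU still never reads past the entries you allocated.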

Pair this with bindless textures (so different draws can use different textures without the CPU swapping those either) and you have a fully GPU driven renderer.

This is the same GPU driven idea that UE5 Nanite and other modern AAA renderers build on. Flat CPU cost, massive scenes.

10. WebGPU status

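Multi-draw indirect is not part of core WebGPU yet. Where it exists, it's an optional feature you have to request. The feature name depends on the implementation. At the time of writing, Chrome exposes it as an experimental feature named 'chromium-experimental-multi-draw-indirect':

const MDI = 'chromium-experimental-multi-draw-indirect';  // name is implementation specific
const hasMDI = adapter.features.has(MDI);
const device = await adapter.requestDevice({
  requiredFeatures: hasMDI ? [MDI] : []
});
// No feature? Fall back to one drawIndexedIndirect per mesh type (section 8).

Experimental in Chrome (behind a flag), solid in Dawn. Not universal on the web yet, and the parameter order of the call can differ from the signature sketched above, so check your implementation's docs. For native targets (wgpu, Dawn native) it's production ready.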


11. Summary in order

Two kinds of passes. Render pass = draws pixels. Compute pass = runs logic that writes into buffers.
Normal draw. CPU picks the numbers (how many indices, how many instances) and bakes them into the command.
Indirect draw. Same draw, but the numbers live in a GPU buffer. Command processor reads them right before drawing. CPU no longer picks the numbers.
Barrier. Makes sure the compute pass finishes writing before the render pass starts reading. WebGPU adds it for you.
Typical frame. Compute pass decides what's visible and writes the draw args. Render pass issues one indirect draw using those args. CPU never touches individual objects.
The win. CPU cost stops growing with scene size. Fewer driver calls. No CPU/GPU sync stalls.
The catch. One indirect draw = one mesh. 500 different meshes still means 500 CPU calls.
Multi-draw indirect. Pack all meshes into one big vertex buffer. GPU writes args pointing at different slices. One CPU call issues thousands of draws, count and contents both GPU decided. The same GPU driven idea that engines like UE5 Nanite build on.
