CPU Friendly JavaScript. A Visual Guide.


How A CPU Actually Runs Code

A CPU core has a clock. It ticks about 3 to 4 billion times per second. Each tick, the core tries to do one unit of work. That is where your frames come from.

Inside one core there are four parts that matter. The Control Unit reads the next instruction. The ALU (Arithmetic Logic Unit) is the calculator that does the math. Registers are tiny slots right next to the ALU holding the numbers being worked on. Caches are fast memory nearby that feed the registers.

All math happens in the ALU. The ALU only touches registers. So every number has to travel from RAM through the caches into a register and out. The math is fast. The travel is slow.

When your data is not in the cache at the moment the ALU needs it, the core stalls. A stalled core still ticks. It just does not do anything on those ticks. Wasted work. That is where most slow code actually loses.

The Memory Hierarchy

Modern CPUs can run about a hundred math operations in the time it takes to fetch one value from RAM. So the thing that makes code fast is not the math. It is keeping your data close to the ALU.

Caches work in chunks called cache lines, usually 64 bytes. When you touch one byte, the CPU pulls 64 bytes around it into the cache. Touch the next 63 bytes right after. Free.

Touch memory randomly across the heap and every access is a new cache miss. Every miss is tens of wasted ticks while the ALU waits.

Sequential access is fast. Random access is slow. Most of this post is about making your JavaScript sequential.
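The gap is easy to demonstrate. Below is a minimal micro-benchmark sketch: it sums the same buffer twice, once in order and once through a shuffled index table, so the math and the data are identical and only the access pattern differs. Exact timings vary by machine and engine, and the effect only shows up once the buffer outgrows the caches.

```javascript
const N = 1 << 24; // 16M floats, far larger than any cache level
const data = new Float32Array(N).fill(1);

// A shuffled visit order forces a cache miss on most accesses.
const order = Uint32Array.from({ length: N }, (_, i) => i);
for (let i = N - 1; i > 0; i--) {
  const j = (Math.random() * (i + 1)) | 0;
  [order[i], order[j]] = [order[j], order[i]];
}

function sumSequential() {
  let s = 0;
  for (let i = 0; i < N; i++) s += data[i];
  return s;
}

function sumShuffled() {
  let s = 0;
  for (let i = 0; i < N; i++) s += data[order[i]];
  return s;
}

console.time("sequential");
const seq = sumSequential();
console.timeEnd("sequential");

console.time("shuffled");
const shuf = sumShuffled();
console.timeEnd("shuffled");
// Both sums are identical. The shuffled walk is typically several times slower.
```

Same additions, same result, different memory traffic. The slowdown you see is pure cache misses.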

SIMD. One Instruction, Many Numbers.

Your CPU has special instructions called SIMD, Single Instruction Multiple Data. They do the same math on a chunk of numbers in a single tick. Usually 4 or 8 at a time on a mainstream CPU.

Adding two arrays of 8 floats without SIMD takes 8 ticks, one pair per tick. The same work with SIMD takes 1 tick for all 8.

A lane is one parallel slot inside a SIMD register. A modern CPU can pack 8 floats side by side in one register. Each float slot is a lane. One SIMD add touches all 8 lanes at once. Scalar JS math uses only lane 0. The other 7 are physically there but sitting idle.

Same CPU. Same clock. Same silicon. The SIMD lanes are there either way. Without SIMD you leave 7 of the 8 lanes idle. You pay for the whole chip and use one eighth of it.

JavaScript does not expose SIMD directly, but modern engines sometimes auto vectorize tight loops over typed arrays.
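You cannot force vectorization from JS, but you can write loops the JIT has a realistic chance of vectorizing: typed arrays, a simple counted loop, no branches or calls in the body. A sketch of that shape (whether a given engine actually vectorizes it is not guaranteed):

```javascript
// A loop shape that gives the JIT its best shot:
// typed arrays, simple index math, nothing in the body but arithmetic.
function scaleAdd(out, a, b, k, n) {
  for (let i = 0; i < n; i++) {
    out[i] = a[i] + k * b[i];
  }
}

const n = 8;
const a = Float32Array.of(1, 2, 3, 4, 5, 6, 7, 8);
const b = Float32Array.of(8, 7, 6, 5, 4, 3, 2, 1);
const out = new Float32Array(n);

scaleAdd(out, a, b, 2, n); // out = a + 2b, so out[0] = 1 + 16 = 17
```

Pull anything branchy or polymorphic into the body and the engine falls back to one lane at a time.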

Branch Prediction

Modern CPUs run a pipeline. The next instruction starts before the previous one finishes. Several are always in flight at the same time.

The problem comes with if statements. The CPU does not know which branch to take until the condition is evaluated. Rather than stall, it guesses. This is branch prediction. Guess right, free. Guess wrong, the CPU throws away the work it did on the wrong path and restarts. A mispredict costs around 15 ticks.

A condition that takes the same path almost every time is basically free. The predictor learns it. A condition that flips 50/50 every iteration forces a mispredict half the time and turns a tight loop into a stumble.

Fix it by removing data dependent branches from inner loops, or by sorting your data so the branches become predictable.
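A common form of the first fix: replace a data dependent if with an arithmetic equivalent. The sketch below clamps negative values to zero both ways. Whether Math.max actually compiles down to a conditional move or a SIMD max is engine dependent, but either way the unpredictable branch is gone from the loop body.

```javascript
// Branchy: if the sign of v[i] is random, the predictor
// mispredicts about half the iterations.
function clampBranchy(v, n) {
  for (let i = 0; i < n; i++) {
    if (v[i] < 0) v[i] = 0;
  }
}

// Branchless: same result, no data dependent branch in the body.
function clampBranchless(v, n) {
  for (let i = 0; i < n; i++) {
    v[i] = Math.max(v[i], 0);
  }
}

const v = Float32Array.of(-1, 2, -3, 4);
clampBranchless(v, 4); // v is now [0, 2, 0, 4]
```

On sorted or mostly uniform data the branchy version is fine, because the predictor learns it. It is the random case where the branchless form wins.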

The Big Trick. Structure of Arrays.

This is the one that matters most. Every serious JS game engine, physics library, and particle system does it. Most regular JS does not. The gap in speed is an order of magnitude.

You have an array of objects. Each object has many fields. You loop over them doing math on a few fields.

type Particle = { x: number; y: number; vx: number; vy: number; color: number; age: number; life: number }
const particles: Array<Particle> = [...]

for (let i = 0; i < particles.length; i++) {
  particles[i].x += particles[i].vx
  particles[i].y += particles[i].vy
}

This looks clean. It is also cache hostile. Each particle is a scattered object on the JS heap. When you read particles[i].x, the cache line you get contains x, y, vx, vy, color, age, life, and chunks of V8 metadata. This loop uses four of those seven fields. The rest of every cache line is dead weight.

Flip the layout. Same data, different memory arrangement.

const x = new Float32Array(count)
const y = new Float32Array(count)
const vx = new Float32Array(count)
const vy = new Float32Array(count)

for (let i = 0; i < count; i++) {
  x[i] += vx[i]
  y[i] += vy[i]
}

Now every cache line is full of numbers you actually read. No metadata. No waste. You blaze through memory in order and the prefetcher pulls the next 64 bytes before you need them.

This pattern has a name. Structure of Arrays, or SoA. The opposite is Array of Structures, AoS. AoS is how most JavaScript is written. SoA is how fast JavaScript is written.

Every serious engine does this. Unity DOTS and ECS. Bevy. Unreal Mass. Box2D and Rapier physics. Three.js instanced meshes. They all store components in parallel typed arrays under the hood.

The Entity Component System pattern game engines ship with is largely a way to enforce SoA automatically. You write code that looks like "for every entity with a Position and Velocity," and the engine guarantees those components live in parallel arrays you never see. You think in terms of "the data I read together should live together," and the engine does the layout for you.
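A toy sketch of that idea, with made up names rather than any real ECS API: components live in parallel typed arrays inside a world object, and a system is just a loop over the columns it needs.

```javascript
// Toy ECS-style store. All names are invented for illustration.
class World {
  constructor(capacity) {
    this.count = 0;
    // Each component field is its own column. SoA by construction.
    this.x = new Float32Array(capacity);
    this.y = new Float32Array(capacity);
    this.vx = new Float32Array(capacity);
    this.vy = new Float32Array(capacity);
  }
  spawn(x, y, vx, vy) {
    const id = this.count++;
    this.x[id] = x; this.y[id] = y;
    this.vx[id] = vx; this.vy[id] = vy;
    return id; // an entity is just an index into the columns
  }
}

// The "movement system": touches only the columns it reads and writes.
function moveSystem(world, dt) {
  const { x, y, vx, vy, count } = world;
  for (let i = 0; i < count; i++) {
    x[i] += vx[i] * dt;
    y[i] += vy[i] * dt;
  }
}

const world = new World(1024);
world.spawn(0, 0, 1, 2);
moveSystem(world, 1); // entity 0 is now at (1, 2)
```

Callers never see the columns. They spawn entities and run systems, and the sequential layout falls out for free.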

When people say data oriented design, this is what they mean.

JavaScript Specifics

Two things V8 cares about on top of the CPU rules.

Typed arrays hold raw numbers. A regular Array stores values as tagged pointers. Every element could be anything, so every read checks the type first. A Float32Array or Int32Array is a raw packed block of numbers. No type checks. Cache friendly. Use them for every hot numeric buffer.
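Two typed array behaviors worth seeing once, in a small sketch: elements are stored as raw 32-bit floats rather than tagged values, and writes coerce with ToNumber.

```javascript
const buf = new Float32Array(2);

// 0.1 is rounded to the nearest float32 on write; reading it back
// widens that float32 to a JS double, which exposes the rounding.
buf[0] = 0.1;
console.log(buf[0]); // 0.10000000149011612

// Writes coerce with ToNumber. A non-numeric value becomes NaN,
// it is never stored as a string. The array stays a packed block of floats.
buf[1] = "oops";
console.log(Number.isNaN(buf[1])); // true
```

That rounding is the price of Float32Array. If you need full double precision in a hot buffer, Float64Array gives you the same packed layout at twice the bytes per element.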

Stable object shapes. V8 watches every object. If you always set the same fields in the same order at construction, V8 assigns a fast hidden class and field access becomes a direct memory read. Add fields later or delete them and you drop to the slow path.

// slow. Shape changes after construction.
const slow = { x: 0, y: 0 }
slow.vx = 0
slow.vy = 0

// fast. Shape is final at birth.
const fast = { x: 0, y: 0, vx: 0, vy: 0 }

For hot code, combine these two. SoA layout, typed arrays for the numeric columns.

Before And After

Add two arrays of 100 000 3D positions each, every frame.

Naive AoS version.

type Vec3 = { x: number, y: number, z: number }
const a: Array<Vec3> = [...]
const b: Array<Vec3> = [...]
const out: Array<Vec3> = a.map(() => ({ x: 0, y: 0, z: 0 }))

for (let i = 0; i < a.length; i++) {
  out[i].x = a[i].x + b[i].x
  out[i].y = a[i].y + b[i].y
  out[i].z = a[i].z + b[i].z
}

300 000 object allocations. Random cache misses every iteration. 8 to 15 ms per frame on a modern laptop.

SoA version with typed arrays.

const ax = new Float32Array(100_000)
const ay = new Float32Array(100_000)
const az = new Float32Array(100_000)
const bx = new Float32Array(100_000)
const by = new Float32Array(100_000)
const bz = new Float32Array(100_000)
const ox = new Float32Array(100_000)
const oy = new Float32Array(100_000)
const oz = new Float32Array(100_000)

for (let i = 0; i < 100_000; i++) {
  ox[i] = ax[i] + bx[i]
  oy[i] = ay[i] + by[i]
  oz[i] = az[i] + bz[i]
}

Zero allocation in the hot loop. Nine arrays streamed sequentially: six read, three written. 0.3 to 0.8 ms. More than ten times faster for the same math.