Sound Is a Wave

Sound is vibration. A speaker pushes air forward and pulls it back. That creates a pressure wave. Your ear catches that wave and your brain turns it into something you hear.

        ┌─╮       ┌─╮
       │   │     │   │
      │     │   │     │
─────│───────│─│───────│──── silence
              │         │
               │       │
                ╰─────╯
          one cycle

That is a waveform. The height is how loud it is. How many cycles happen each second is the pitch. A deep bass note is a slow wave. A high whistle is a fast one.

Sound in the real world is smooth and continuous. It flows. A computer cannot store that. Computers only understand numbers. So we need to turn the wave into numbers.

Sampling. Catching the Wave

To record sound digitally you measure the wave over and over again. Each measurement is called a sample. You write down how high the wave is at that exact moment.

     ●           ●
   ●   ●       ●   ●
  ●     ●     ●     ●
 ●       ●   ●       ●
●         ● ●         ●
           ●

Each ● is one sample.
A snapshot of the wave at that moment.

Do this thousands of times per second and you get a copy of the original wave. Not perfect. But close enough that your ears cannot tell.
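Here is a minimal sketch of that idea in Python. It measures a pure sine wave at evenly spaced moments. The function name `sample_sine` and the 440 Hz test tone are assumptions made for illustration. A real recording measures a microphone signal, not a formula.

```python
import math

def sample_sine(frequency_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Measure a sine wave at evenly spaced moments and return the readings."""
    count = int(round(sample_rate_hz * duration_s))
    samples = []
    for n in range(count):
        t = n / sample_rate_hz  # time of the n-th snapshot, in seconds
        samples.append(amplitude * math.sin(2 * math.pi * frequency_hz * t))
    return samples

# 10 milliseconds of a 440 Hz tone, measured 8,000 times per second.
samples = sample_sine(frequency_hz=440, sample_rate_hz=8000, duration_s=0.01)
print(len(samples))  # 80
```

Each entry in the list is one ● from the picture. The height of the wave at one instant.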

Sample Rate. How Often You Measure

The sample rate is how many snapshots you take per second.

Sample Rate      Samples Per Second     Used For
──────────────────────────────────────────────────────
8,000 Hz         8,000                  Phone calls
44,100 Hz        44,100                 Music and CDs
48,000 Hz        48,000                 Video and pro audio
96,000 Hz        96,000                 High-res audio

CD quality is 44,100 Hz. That is 44,100 snapshots every second.

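Why that number? Humans can hear up to about 20,000 Hz. There is a rule in signal theory, the Nyquist-Shannon sampling theorem, that says you need to sample at more than twice the highest frequency you want to capture. 20,000 times 2 is 40,000. The extra 4,100 is not rounding. It leaves room for the filter that removes everything above 20,000 Hz before sampling, and 44,100 happened to fit the video recorders that were used to store early digital audio.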
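A small sketch of what the rule means in practice. The helper names `min_sample_rate` and `alias_frequency` are made up for this example. The second one assumes a pure tone and ideal sampling, and shows the frequency a tone appears at once it has been sampled. Tones above half the sample rate get folded down to a wrong, lower frequency.

```python
def min_sample_rate(highest_hz):
    """Lower bound from the sampling rule: twice the highest frequency."""
    return 2 * highest_hz

def alias_frequency(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling (folded into 0..rate/2)."""
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)

print(min_sample_rate(20_000))          # 40000
print(alias_frequency(1_000, 44_100))   # 1000, captured correctly
print(alias_frequency(30_000, 44_100))  # 14100, too high for this rate
```

That false 14,100 Hz tone is why everything above the limit is filtered out before sampling.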

Bit Depth. How Precisely You Measure

Sample rate is how often. Bit depth is how precise each sample is.

Think of it like a ruler. An 8 bit sample gives you a ruler with 256 marks. The wave has to snap to the nearest one. Not very accurate.

A 16 bit sample gives you 65,536 marks. Way more detail. The wave can be described much more closely to what it actually sounds like.

8-bit:   256 levels. Sounds grainy and rough.
16-bit:  65,536 levels. Sounds clean and smooth.
24-bit:  16,777,216 levels. Studio quality.

Low bit depth means the wave gets rounded a lot. You hear that rounding as noise and crunch. 16 bit is the standard for music. 24 bit is used in studios.
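The ruler idea can be sketched in a few lines. This assumes samples are values between -1.0 and 1.0 and that the marks are evenly spaced signed integers. Both are assumptions for illustration, and `quantize` is a made-up name. Real converters also add dither, which is left out here.

```python
def levels(bits):
    """Number of marks on the ruler for a given bit depth."""
    return 2 ** bits

def quantize(value, bits):
    """Snap a value in the range -1.0..1.0 to the nearest of 2**bits marks."""
    half = 2 ** (bits - 1)                   # e.g. 128 for 8 bits
    code = round(value * half)               # nearest integer mark
    code = max(-half, min(half - 1, code))   # stay inside the signed range
    return code / half

for bits in (8, 16, 24):
    print(bits, levels(bits))

x = 0.123456789
error_8 = abs(x - quantize(x, 8))
error_16 = abs(x - quantize(x, 16))
print(error_8 > error_16)  # True
```

The rounding error is the noise you hear. More bits means smaller steps and a smaller error.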

Channels. How Many Waves

Mono is one channel. One stream of samples. One speaker.

Stereo is two channels. Left and right. That is how sound can move between your ears.

Surround sound is usually six channels (called 5.1) or eight (called 7.1). One for each speaker in the room, plus one for the subwoofer.

More channels means more data. Double the channels. Double the size.

How Big Is Audio Actually

Take CD quality. 44,100 samples per second. 2 bytes per sample. Stereo.

44,100 x 2 bytes x 2 channels = 176,400 bytes per second
176,400 x 60 = 10,584,000 bytes per minute
≈ 10 MB per minute

About 10 megabytes per minute. A four minute song is roughly 40 MB raw.

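A full album of 60 minutes is around 600 MB. That is why an audio CD holds 74 to 80 minutes of sound. It was built around this math. The 700 MB figure on a data CD is the same disc with some of the space given over to extra error correction.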
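The same arithmetic as a small function, so you can try other settings. The name `pcm_bytes` is an assumption for this sketch. It only covers raw, uncompressed audio.

```python
def pcm_bytes(sample_rate_hz, bytes_per_sample, channels, seconds):
    """Size of raw, uncompressed audio in bytes."""
    return sample_rate_hz * bytes_per_sample * channels * seconds

per_second = pcm_bytes(44_100, 2, 2, 1)
per_minute = pcm_bytes(44_100, 2, 2, 60)
mono_minute = pcm_bytes(44_100, 2, 1, 60)

print(per_second)   # 176400
print(per_minute)   # 10584000
print(mono_minute)  # 5292000, half the channels and half the size
```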

How the Data Is Laid Out

The simplest layout is sample by sample. In order. Left then right. Repeating.

[Left] [Right] [Left] [Right] [Left] [Right] ...

Each pair is one moment in time.

That is raw PCM audio. No compression. No tricks. Just a flat stream of numbers that describe the wave.
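A sketch of that layout using Python's `struct` module. It assumes 16 bit signed samples stored little-endian, which is the usual convention in WAV files but an assumption here. The function names are made up for the example.

```python
import struct

def interleave_pcm16(left, right):
    """Pack two channels as [Left][Right][Left][Right]... in 16 bit little-endian."""
    if len(left) != len(right):
        raise ValueError("both channels need the same number of samples")
    out = bytearray()
    for l, r in zip(left, right):
        out += struct.pack("<hh", l, r)  # one moment in time: left, then right
    return bytes(out)

def deinterleave_pcm16(data):
    """Split an interleaved 16 bit stereo stream back into two channels."""
    values = struct.unpack("<%dh" % (len(data) // 2), data)
    return list(values[0::2]), list(values[1::2])

data = interleave_pcm16([1, 2, 3], [-1, -2, -3])
print(len(data))                 # 12 bytes: 3 moments x 2 channels x 2 bytes
print(deinterleave_pcm16(data))  # ([1, 2, 3], [-1, -2, -3])
```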

WAV files are basically this. A small header with the sample rate and bit depth and channel count. Then the raw numbers. That is why WAV files are huge.
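Python ships a `wave` module that writes exactly this kind of file. The sketch below builds a tiny WAV in memory and reads the header back. The helper `write_wav` and the four sample values are assumptions for illustration.

```python
import io
import struct
import wave

def write_wav(frames, sample_rate_hz=44_100, channels=2, bytes_per_sample=2):
    """Wrap raw PCM bytes in a WAV header and return the whole file as bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(frames)
    return buffer.getvalue()

# Two moments of stereo audio: (0, 0) and (1000, -1000).
frames = struct.pack("<4h", 0, 0, 1000, -1000)
wav_bytes = write_wav(frames)

# Read the header back to see what it stores.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    info = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())
print(info)  # (2, 2, 44100, 2)
```

The raw samples sit untouched at the end of the file. The header only tells a player how to read them.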

So Why Are Music Files Not 40 MB

Because compression.

Lossless Compression

FLAC is the most common lossless audio format. It looks at the raw samples and finds patterns. Audio waves are predictable. Each sample is usually close to the one before it. FLAC uses that to store the data in fewer bytes.

A 40 MB song becomes about 25 MB as a FLAC file.

Decompress it and you get back every single original number. Nothing lost. Nothing changed.
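To see why predictable samples compress well, here is a deliberately simplified sketch. It stores the difference from the previous sample instead of the sample itself. This is not FLAC's actual algorithm, which uses more elaborate prediction and a dedicated coding step. It only shows the core idea, and the function names are made up.

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    previous = 0
    deltas = []
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas

def delta_decode(deltas):
    """Undo delta_encode by adding the differences back up."""
    total = 0
    samples = []
    for d in deltas:
        total += d
        samples.append(total)
    return samples

samples = [1000, 1004, 1009, 1011, 1008, 1002]
deltas = delta_encode(samples)
print(deltas)                           # [1000, 4, 5, 2, -3, -6]
print(delta_decode(deltas) == samples)  # True
```

After the first value the numbers are small. Small numbers need fewer bits. And decoding gives back the exact original list.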

Lossy Compression

MP3 and AAC and OGG (the Vorbis codec, strictly speaking) throw away sounds you probably will not notice.

They use models of human hearing. They know what your ear is bad at. A quiet sound next to a loud sound. You will not hear the quiet one. Very high frequencies. Most people cannot hear those anyway. Small details in busy parts. Your brain fills those in.

Remove all of that and a 40 MB song drops to about 4 MB. Most people cannot hear the difference on normal headphones.

But that data is gone forever. Encode to MP3 and the numbers change. Do it again and it gets worse. Every pass loses more.
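Lossy files are usually sized by bitrate, not by sample count. The sketch below assumes a constant 128 kbit/s encode, a common setting that this article does not name, so treat the number as an assumption. The function name `encoded_bytes` is made up too.

```python
def encoded_bytes(bitrate_kbps, seconds):
    """Approximate size of a constant-bitrate file, ignoring headers and tags."""
    return bitrate_kbps * 1000 // 8 * seconds

per_minute = encoded_bytes(128, 60)
four_minutes = encoded_bytes(128, 240)
raw_four_minutes = 44_100 * 2 * 2 * 240  # CD quality, uncompressed

print(per_minute)        # 960000
print(four_minutes)      # 3840000
print(raw_four_minutes)  # 42336000
```

That is where the roughly 4 MB song and the roughly 1 MB per minute come from under this assumption. About a tenth of the raw size.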

The Tradeoff

Format    Compression    Quality     Size Per Minute
────────────────────────────────────────────────────
WAV       None           Perfect     ~10 MB
FLAC      Lossless       Perfect     ~6 MB
AAC       Lossy          Great       ~1 MB
MP3       Lossy          Good        ~1 MB
OGG       Lossy          Good        ~1 MB

Lossy compression means much smaller files but some of the sound is lost. Lossless and uncompressed formats mean bigger files but perfect quality. Every format sits somewhere on that line.

Metadata. The Extra Stuff

Audio files carry more than just sound. Artist name. Album. Track number. Genre. Year. Sometimes lyrics. Sometimes album art.

These are stored as tags inside the file. Usually a few kilobytes. Unless there is album art embedded. Then you might have a megabyte of image data sitting inside your audio file.

The Full Picture

Sound is a wave in the air. To store it you measure that wave thousands of times per second. Each measurement is a number. You write those numbers down in order. That is digital audio.

Everything else is just how often you measure. How precise each measurement is. How many channels you record. And how cleverly you shrink the result.

Next time you press play, remember your speakers are just reading a list of numbers and pushing air back and forth to match. Your ears and brain do the rest.
