SLOs, SLIs and SLAs

Understanding Service Reliability
When a business offers a service, like a website or API, it often makes promises to its customers. These promises are called Service Level Agreements (SLAs). An SLA might say "Our website will be available 99.9% of the time, or you get a refund." This is a legal and business agreement.
But how do you make sure you can keep this promise? You need a system of measurement and targets.
Setting Internal Targets
To avoid breaking your SLA promise, you need internal targets that are stricter than your external promise. These are called Service Level Objectives (SLOs). If you promise customers 99.9% availability in your SLA, you might set an internal SLO of 99.95%. This gives you room to handle problems before they affect your promises to customers.
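The arithmetic behind that buffer is worth making concrete: each availability target translates into a fixed amount of allowed downtime per period, often called an error budget. A minimal sketch (the function name is mine; the targets are the ones from the example above):

```python
def allowed_downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of downtime permitted over a window at a given availability target."""
    total_minutes = days * 24 * 60
    return (1 - target) * total_minutes

# The 99.9% SLA allows ~43.2 minutes of downtime per 30-day month,
# while the stricter 99.95% SLO allows only ~21.6 minutes.
sla_budget = allowed_downtime_minutes(0.999)
slo_budget = allowed_downtime_minutes(0.9995)
```

The roughly 21-minute gap between the two budgets is exactly the room you have to detect and fix problems before the customer-facing promise is at risk.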
What to Measure
Now we need to decide what to measure to know if we're meeting our targets. These measurements are called Service Level Indicators (SLIs). For a web service, you might measure:
How fast pages load (latency)
How often the service is available (uptime)
How many requests fail (error rate)
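These indicators are typically computed from raw request records. A sketch of what that could look like (the `Request` schema and `compute_slis` helper are hypothetical, not a standard API):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    succeeded: bool

def compute_slis(requests: list[Request]) -> dict:
    """Derive basic SLIs (error rate, availability, average latency) from a batch of requests."""
    total = len(requests)
    failures = sum(1 for r in requests if not r.succeeded)
    return {
        "error_rate": failures / total,
        "availability": 1 - failures / total,
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total,
    }

sample = [Request(100, True), Request(120, True), Request(500, False), Request(90, True)]
slis = compute_slis(sample)  # one failure out of four: 25% error rate, 75% availability
```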
The Problem with Simple Measurements
Many teams start by measuring averages. "Our average page load time is 200ms." But averages hide important problems. Here's why:
Imagine a service where:
9 users get responses in 100ms
1 user gets a response in 1000ms
The average is 190ms
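The scenario above takes only a few lines to verify:

```python
# Nine fast requests and one slow one, as in the example above.
latencies_ms = [100] * 9 + [1000]

average = sum(latencies_ms) / len(latencies_ms)
# average == 190.0 — looks healthy on a dashboard
# max(latencies_ms) == 1000 — but one user waited a full second,
# and the average gives no hint of it
```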
This looks good on paper, but one user had a terrible experience. In the real world, those outliers matter: they could be your most important customers, or an early sign of a bigger problem that will soon reach other customers.
A Better Way: Percentiles
Instead of averages, we use percentiles to understand the full picture. Let's measure three key points:
50th percentile (median): Half of all requests are faster than this
95th percentile: 95% of requests are faster than this
99th percentile: 99% of requests are faster than this
Now we can see patterns that averages miss. For example:
Normal operation:
50th: 100ms
95th: 200ms
99th: 300ms
This shows consistent, good performance. But what if we see:
50th: 100ms
95th: 200ms
99th: 3000ms
In simple words: 95% of users still get responses in 200ms or less, and half still get them in 100ms or less. But the slowest few percent of requests now take up to 3000ms. That spike, even though it hits only a small group of users, can be an indication of a problem:
A problem in one geographic region
A specific type of request that's failing
An early warning of a bigger problem that will soon affect more users
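Once you track percentiles, you can compare each one against its own SLO threshold and flag the tail regression directly. A sketch, with hypothetical threshold values and a made-up function name:

```python
def check_latency_slo(measured: dict, thresholds: dict) -> list[str]:
    """Return a human-readable violation for each percentile above its threshold."""
    return [
        f"{label}: {measured[label]}ms exceeds {limit}ms"
        for label, limit in thresholds.items()
        if measured[label] > limit
    ]

# Illustrative SLO targets per percentile (not from the article).
thresholds = {"p50": 150, "p95": 250, "p99": 500}

healthy = {"p50": 100, "p95": 200, "p99": 300}
degraded = {"p50": 100, "p95": 200, "p99": 3000}

# check_latency_slo(healthy, thresholds)  -> no violations
# check_latency_slo(degraded, thresholds) -> only the p99 threshold fires,
# catching the tail spike while the median still looks perfect
```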
Using This Knowledge
This system of measurement lets you:
Make realistic promises to customers (SLAs)
Set appropriate internal targets (SLOs)
Measure what matters (SLIs)
Catch problems before they affect everyone
Understand the real user experience, not just averages
When you see problems in different percentiles, they tell you different stories:
All percentiles rising: System-wide problem
Only high percentiles affected: A specific issue hitting some users, and possibly the start of a bigger problem that will eventually affect everyone.
Percentiles spread far apart: Inconsistent performance, which users notice and dislike.
Percentiles close together: Consistent (good or bad) performance
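These patterns can even drive a rough automated triage step. A heuristic sketch (the ratios, thresholds, and labels are illustrative choices of mine, not a standard):

```python
def classify_percentiles(p50: float, p99: float, baseline_p50: float) -> str:
    """Map a percentile shape to one of the stories above, using crude thresholds."""
    if p50 > 2 * baseline_p50:
        # The median itself has risen: everyone is affected.
        return "system-wide problem"
    if p99 > 5 * p50:
        # Median is fine but the tail is far away: a subset of users is suffering.
        return "tail latency problem"
    return "consistent performance"

# classify_percentiles(100, 3000, baseline_p50=100) -> "tail latency problem"
# classify_percentiles(400, 1200, baseline_p50=100) -> "system-wide problem"
# classify_percentiles(100, 300, baseline_p50=100)  -> "consistent performance"
```

In practice you would tune these ratios against your own traffic, but the idea is the same: the shape of the percentile curve, not any single number, tells the story.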
This approach helps you maintain reliable services by understanding exactly how they're performing for all users, not just the average case.






