System Design Requirements

What defines a good system?

Introduction

In this post, we'll dive into what a system fundamentally does and how to measure whether a system is well designed.

What are we doing?

At the most fundamental level, a system is moving, storing, and transforming data.

Moving Data

The main job of any system is to move data smoothly from one place to another, like from a user to a server, server to database, or database back to the user. This easy flow of information is key to how well the system works.

Storing Data

Good data storage means keeping information safe and easy to get when needed. For example, using databases for organized data and blob storage for less structured data, making sure both are quick and efficient to access.

Transforming Data

The real worth of data often comes from how we interpret it. Changing raw data into easier-to-understand formats, like graphs or lists, is crucial for getting valuable insights and making better decisions.
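As a tiny illustration of this idea, here is a minimal Python sketch (the event records are made up) that transforms raw records into an easier-to-read summary:

```python
from collections import Counter

# Transform raw event records into a summary that's easier to interpret.
# The events below are hypothetical, just to illustrate the idea.
raw_events = [
    {"user": "alice", "action": "search"},
    {"user": "bob", "action": "search"},
    {"user": "alice", "action": "purchase"},
]

# Count how many times each action occurred.
actions_per_type = Counter(event["action"] for event in raw_events)
print(actions_per_type)  # Counter({'search': 2, 'purchase': 1})
```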

Key Metrics for a Good Design

When designing a system, there is no single right or wrong answer; it's all about trade-offs. Understanding those trade-offs is important, but how do we measure them?

Availability

Availability measures how likely it is that a system is working and available when needed. It's not realistic to expect a system to be up 100% of the time because unexpected events like disasters can happen. Aiming for high availability, which is often talked about in terms of "nines" (for example, 99.9% or "three nines"), greatly increases how reliable the system is.

For example, improving availability from 99% to 99.99% cuts allowed downtime from roughly 3.65 days per year to about 53 minutes per year, showing how much even a small-looking improvement matters.
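To make the numbers concrete, here is a minimal Python sketch that converts an availability target ("nines") into allowed downtime per year:

```python
# Convert an availability percentage into the maximum downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum minutes of downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):,.1f} minutes of downtime/year")

# 99.0%  -> ~5,256 minutes (~3.65 days)
# 99.99% -> ~52.6 minutes
```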

Service Level Objectives (SLO) and Agreements (SLA)

SLOs (Service Level Objectives) are goals for how well a system should work. They are like promises we make to ourselves to make sure our users are happy.

SLAs, on the other hand, are promises we make to our users about how well the system will work. They are like a deal we make, saying our service will meet certain standards, such as being up 99.99% of the time.

For example, AWS S3 has an SLA (Service Level Agreement) that commits to a certain level of uptime, such as 99.99%. If AWS doesn't meet this commitment, the customer receives a service credit.

But an SLO for AWS S3 might say that it should respond to requests in less than 300 milliseconds 99% of the time. This SLO is about making sure the system is not just up, but also fast and reliable for users.
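As a rough illustration of how such an SLO could be checked, here is a small Python sketch. It assumes you already collect per-request latencies; the threshold and sample values below are made up:

```python
# Check a latency SLO of the form "99% of requests complete in under 300 ms".
# In practice the latency samples would come from your monitoring system.

def meets_latency_slo(latencies_ms, threshold_ms=300.0, target_fraction=0.99):
    """Return True if at least target_fraction of requests finish under threshold_ms."""
    if not latencies_ms:
        return True  # no traffic, nothing to violate
    under = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return under / len(latencies_ms) >= target_fraction

sample = [120, 95, 310, 250, 180, 90, 600, 140, 210, 175]
print(meets_latency_slo(sample))  # False: only 8/10 = 80% of requests are under 300 ms
```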

Reliability

Reliability and availability are both important for a system, but they focus on different things. Reliability is about a system's ability to do its job right every time under usual conditions. For example, a reliable email service would send and receive emails without errors. Availability, however, is about the system being ready to use when needed. A system is available if you can access it, like a website that loads when you want to visit it.

Even if a system is available (you can access it), it might not be reliable if it doesn't work as it should. Imagine a messaging app that opens every time you try, showing it's available. But, if it often fails to send messages or delays in showing them, it's not reliable.

To increase a system's reliability, adding redundancy is a common strategy. This means putting in extra parts, like more servers, so if one part fails, the others keep the system running smoothly. Take a streaming service as an example. If it uses multiple servers around the world, viewers can still watch videos even if one server goes offline. This setup helps the service stay reliable (videos play correctly) and available (the service is up and running) for users everywhere.
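One simple way to picture redundancy is a client that falls back to another replica when a request fails. The sketch below is only illustrative; the server names and the fetch function are hypothetical stand-ins for real network calls:

```python
# Failover across redundant replicas: try each server in turn and
# return the first successful response. All names here are made up.

REPLICAS = ["server-us-east", "server-eu-west", "server-ap-south"]
DOWN = {"server-us-east"}  # pretend this replica is currently offline

def fetch(server: str, resource: str) -> str:
    """Stand-in for a real network call."""
    if server in DOWN:
        raise ConnectionError(f"{server} is unreachable")
    return f"{resource} served by {server}"

def fetch_with_failover(resource: str) -> str:
    for server in REPLICAS:
        try:
            return fetch(server, resource)
        except ConnectionError:
            continue  # this replica is down, try the next one
    raise RuntimeError("all replicas are down")

print(fetch_with_failover("video-123"))  # video-123 served by server-eu-west
```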

Throughput

Throughput is about how many tasks a system can handle in a certain amount of time, like how many searches it can do every second (queries per second or QPS).

When you add more servers to the system (this is called horizontal scaling), you help it do more tasks at once, which improves throughput. This means the system can process more requests faster.

For example, if a system can handle 100 tasks per second, doubling the number of servers might bring that to roughly 200 tasks per second (in practice, scaling is rarely perfectly linear). This increase helps the system manage more work without slowing down or crashing.
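A back-of-the-envelope way to reason about this is to estimate how many servers a target throughput requires. The per-server capacity and headroom figures in this Python sketch are made-up assumptions:

```python
import math

# Rough capacity planning: how many servers are needed for a target throughput?
# qps_per_server and headroom are illustrative assumptions, not real benchmarks.

def servers_needed(target_qps: float, qps_per_server: float, headroom: float = 0.7) -> int:
    """Number of servers to sustain target_qps while running each at `headroom` utilization."""
    return math.ceil(target_qps / (qps_per_server * headroom))

print(servers_needed(target_qps=1000, qps_per_server=100))  # 15 servers at 70% utilization
```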

Latency

Latency is the time it takes for a system to respond to a request, from the moment it's sent by the user to when the user gets a response back.

To make this response time faster, you can place servers closer to where the users are or use edge locations. This helps reduce the travel time of data back and forth, leading to quicker responses.

For example, if a server is in the same city as the user, the data doesn't have to go as far, which means the user can get information or complete actions much faster. This improvement in speed makes the experience better for the user because they spend less time waiting.
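To see where the time actually goes, a simple starting point is to measure latency around a single request. This Python sketch times one call using only the standard library; the URL is just a placeholder for an endpoint you actually run:

```python
import time
import urllib.request

# Measure end-to-end latency of a single GET request in milliseconds.
# Replace the placeholder URL with your own endpoint.

def timed_get(url: str) -> float:
    """Return the elapsed time (in ms) for one request."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as response:
        response.read()
    return (time.perf_counter() - start) * 1000

print(f"latency: {timed_get('https://example.com'):.1f} ms")
```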