The most simple guide to Database Partitioning

Introduction

I assume you've read my post on Replication or at least know what it is.

In this post, I'm rewriting chapter 6 of Designing Data-Intensive Applications.

I struggled to understand the ins and outs of database partitioning, so I hope you find this easy to understand.

What is Database Partitioning?

Database Partitioning is when you split the data inside the database into smaller "databases". One partition holds a part of your data. All partitions together form the entire database.

Partitioning is often used with replication, which means each partition's data is copied onto multiple nodes. This way, even though each record is in only one partition, it can be stored on different nodes for fault tolerance. A node may hold more than one partition.

When using a leader-follower replication model, the mix of partitioning and replication might look like this:

Every partition has a leader on one node and followers on other nodes. Each node can be the leader for some partitions and a follower for others.
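
To make that concrete, here's a toy layout (the node and partition names are made up) showing how the same node can lead some partitions while following others:

```python
# Hypothetical cluster: 4 partitions spread over 3 nodes.
# Each partition has exactly one leader; the other copies are followers.
cluster = {
    "partition_1": {"leader": "node_a", "followers": ["node_b", "node_c"]},
    "partition_2": {"leader": "node_b", "followers": ["node_a", "node_c"]},
    "partition_3": {"leader": "node_c", "followers": ["node_a", "node_b"]},
    "partition_4": {"leader": "node_a", "followers": ["node_b", "node_c"]},
}

# node_a leads partitions 1 and 4, but is only a follower for 2 and 3.
print([p for p, roles in cluster.items() if roles["leader"] == "node_a"])
# ['partition_1', 'partition_4']
```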

Why do we need Database Partitioning?

You've implemented replication, and it's improved data availability and read performance. However, when the amount of data grows and the number of transactions increases, you might see:

  1. Limited Write Scalability: Replication doesn't usually help with write scalability. Every write still has to go through the leader before it's replicated, so the leader can become a bottleneck.

  2. Massive Single-Point Datasets: With replication, every replica holds a full copy of the dataset. As the data grows, that single large dataset becomes hard for any one machine to store and process efficiently.

  3. Inefficient Resource Usage: Replication doesn't necessarily optimize the use of server resources. Some servers might be idle, while others are pushed to their limits.

Partitioning directly addresses these areas where replication falls short:

  • Distributed Writes: Partitioning enables distributing writes across different servers, each handling only a portion of the data. This way, no single server becomes a bottleneck.

  • Smaller Datasets: Each partition is a smaller, more manageable chunk of the entire dataset, making it easier to perform maintenance operations like backups and indexing.

  • Resource Optimization: By dividing the database, each server can be utilized more efficiently, leading to a balanced workload distribution.

How is partitioning done?

We have to be careful

When partitioning data, there are several strategies we can take. Each has its pros and cons. Our aim with partitioning is to evenly distribute data and query load among nodes. Before we dive into the strategies, let's talk about two fundamental terms: Skew and Hotspot.

  • Skew: When partitioning isn't equal and some parts have more data or queries than others, we call it skewed.

  • Hotspot: A partition that has a much higher load than others is called a hotspot.

To avoid hotspots, data could be placed on nodes at random, which would distribute it evenly. However, this makes locating a specific item difficult, because every read would have to query all nodes in parallel.

Partitioning by Key Range

In key-range partitioning, the database is divided based on set ranges of keys. For example, imagine partition 1 holds all customer IDs from 1 to 1000, and partition 2 holds IDs from 1001 to 2000.

This method makes it easy to find data because you know which partition holds which IDs. However, some partitions might end up with more data if certain ID ranges have more customers.
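
Here's a minimal sketch of how a key-range lookup might work, assuming the partition boundaries above (the boundary values and function name are just for illustration):

```python
import bisect

# Illustrative boundaries: partition 0 covers IDs 1-1000, partition 1 covers 1001-2000, ...
upper_bounds = [1000, 2000, 3000]

def partition_for(customer_id: int) -> int:
    """Binary-search the sorted boundaries to find the partition whose range holds this key."""
    return bisect.bisect_left(upper_bounds, customer_id)

print(partition_for(42))    # 0 -> range 1-1000
print(partition_for(1500))  # 1 -> range 1001-2000
```

Because the boundaries are sorted, keys that are close together land in the same partition, which is what makes range scans cheap here.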

Partitioning by Hash of Key

Hash partitioning takes each key, runs it through a formula (hash function), and assigns the data to partitions based on the result. This method spreads data more evenly.

For instance, hashed customer IDs might end up spread across all partitions, even if the original numbers are close together. This helps prevent any one partition from filling up. But range queries, like fetching all IDs from 1001 to 2000, become harder because adjacent keys are no longer stored together.
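
A minimal sketch of the idea, using MD5 purely as an example hash function (real databases pick their own hash function and partition boundaries):

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Hash the key and map the digest onto one of the partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Consecutive IDs scatter across partitions, which spreads load evenly --
# but a range scan over IDs 1001-2000 now has to consult every partition.
for customer_id in ["1001", "1002", "1003", "1004"]:
    print(customer_id, "->", partition_for(customer_id))
```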

Skewed Workloads and Relieving Hot Spots

Sometimes, even with hash partitioning, too many requests can focus on data in one partition, leading to a hotspot.

This happens, for example, when a lot of users want to access information about a very popular product, causing the partition holding this product's data to become overloaded. To spread the requests more evenly, you might adjust how keys are distributed, such as by adding a random number to the end of product IDs.

This approach makes sure no single partition gets too overloaded. However, it comes at a cost: once writes are split across many keys, reads have to do extra work to gather the pieces back together, so it's worth monitoring whether the trade-off is paying off.

For instance, if we give each sale of the same product a slightly different ID by adding random numbers, these sales might be stored in different places. While each sale's details stay in one partition, to get all sales information for one product, you might have to look in many partitions.

So, it's best to apply this technique sparingly, only to the small number of keys that are actually hot, and keep reads simple for everything else.
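
Here's a minimal sketch of that key-salting idea (the key format and bucket count are made up for illustration):

```python
import random

SALT_BUCKETS = 10  # how many sub-keys one hot product is spread across

def salted_key(product_id: str) -> str:
    """On write: append a random suffix so writes for a hot product land on many partitions."""
    return f"{product_id}#{random.randrange(SALT_BUCKETS)}"

def keys_to_read(product_id: str) -> list[str]:
    """On read: every salted variant must be fetched and the results merged."""
    return [f"{product_id}#{i}" for i in range(SALT_BUCKETS)]

print(salted_key("product_42"))    # e.g. 'product_42#7'
print(keys_to_read("product_42"))  # ['product_42#0', ..., 'product_42#9']
```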

Partitioning and Secondary Indexes

So far, we've focused on partitioning based on primary keys, which uniquely identify records. However, secondary indexes are also important. These indexes don't uniquely identify records but allow searching for records based on non-primary key attributes, like finding all posts by a specific user or all cars of a certain color.

Local Indexing

Local (or Document-Based) Indexing means each partition maintains its own index for the data it holds.

Writing to a Secondary Index

For example, consider adding a new car to a database that has two secondary indexes: one on color and another on type (for example, sports car).

To add a car, we first determine its partition based on the primary key. Once identified, we send the car data to that partition for storage. The secondary indexes, being local to the partition, are then updated accordingly. Since each partition possesses its own set of secondary indexes, writing to these indexes is efficient in local indexing.

Reading from a Secondary Index

On the other hand, reading is less efficient. As each partition has its own secondary index, searching for all cars of a particular color, like "red," could involve querying multiple, or in the worst-case scenario, all partitions. This process, known as scatter/gather, essentially means we must check each partition to compile a complete list of red cars.
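
A toy in-memory sketch of scatter/gather over local indexes (the data and structures are made up to illustrate the idea):

```python
# Each partition keeps its own secondary index (color -> car IDs),
# covering only the cars stored in that partition.
partitions = [
    {"cars": {1: "red", 2: "blue"},  "color_index": {"red": [1], "blue": [2]}},
    {"cars": {3: "red", 4: "green"}, "color_index": {"red": [3], "green": [4]}},
    {"cars": {5: "blue", 6: "red"},  "color_index": {"red": [6], "blue": [5]}},
]

def find_cars_by_color(color: str) -> list[int]:
    """Scatter/gather: query every partition's local index and merge the results."""
    results = []
    for partition in partitions:
        results.extend(partition["color_index"].get(color, []))
    return results

print(find_cars_by_color("red"))  # [1, 3, 6] -- all three partitions had to be checked
```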

Summary

Local indexing offers efficiency for writing to secondary indexes but poses challenges for reading, often requiring a scatter/gather approach to collect data spread across multiple partitions.

Global Indexing

Global (or Term-Based) Indexing creates a unified index that spans all partitions, organizing data based on secondary attributes across the entire database.

Writing to a Secondary Index

In global indexing, when a record is added, such as a sale whose customer ID serves as the secondary index, the process involves several key steps.

First, the record is sent to the partition determined by its primary key, which means it's stored in the corresponding node for that partition.

However, the complexity arises with the global secondary indexes, which are organized independently of the primary key partitions. For instance, if customer IDs 0-1000 are managed by partition 1 and IDs 1001-2000 by partition 2, but the sale record (based on its primary key) is directed to partition 3, the system faces a challenge. The sale's customer ID, serving as a secondary index, might belong to a range handled by another partition.

Here's where cross-node communication becomes critical. The node holding partition 3, which initially receives and stores the record, identifies that the secondary index, such as a customer ID, isn't managed locally. It then initiates communication with the other nodes to ensure the global index is accurately updated. This entails that if the record, like a sale with a customer ID assigned to partition 1's range, is stored in partition 3, this node must work together with the node managing partition 1 to ensure the global index reflects this new information.

This cross-partition and cross-node coordination is what makes writing to global secondary indexes more complex compared to local indexing. Each write operation might require communication across multiple nodes to update parts of the global index, ensuring that the secondary index accurately reflects the newly added or modified record, regardless of where in the database the primary record resides.
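
A minimal sketch of that write path, with made-up routing rules (hash partitioning for the record, range partitioning for the index terms):

```python
def data_partition_for(sale_id: int) -> int:
    """The record itself is routed by its primary key."""
    return hash(sale_id) % 4

def index_partition_for(customer_id: int) -> int:
    """The global secondary index is routed by the indexed term (customer ID ranges)."""
    return 0 if customer_id <= 1000 else 1

def write_sale(sale_id: int, customer_id: int) -> None:
    # Step 1: store the record on the partition chosen by its primary key.
    record_partition = data_partition_for(sale_id)
    # Step 2: update the index entry, which may live on a different node entirely,
    # so the write needs cross-node coordination (often done asynchronously in practice).
    index_partition = index_partition_for(customer_id)
    print(f"record -> data partition {record_partition}, "
          f"index entry -> index partition {index_partition}")

write_sale(sale_id=987, customer_id=42)
```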

Reading from a Secondary Index

Reading from a global secondary index is efficient. To find all cars with the color "red," the query can directly target the specific part of the global index that catalogs car colors, bypassing the need to query every partition.

This targeted method significantly reduces steps and resources for data retrieval, resulting in faster reading and less resource usage compared to local indexing's scatter/gather approach.
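
Continuing the toy example, a read only needs to contact the one index partition that owns the term:

```python
# Toy global index, partitioned by term: partition 0 owns colors a-m, partition 1 owns n-z.
global_color_index = {
    0: {"blue": [2, 5], "green": [4]},
    1: {"red": [1, 3, 6]},
}

def index_partition_for_color(color: str) -> int:
    return 0 if color[0] <= "m" else 1

def find_cars_by_color(color: str) -> list[int]:
    """Only the single partition that owns this term is queried -- no scatter/gather."""
    return global_color_index[index_partition_for_color(color)].get(color, [])

print(find_cars_by_color("red"))  # [1, 3, 6], fetched from one partition
```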

Summary

Global indexing makes the writing process more complex because it needs to update different index parts. But, it makes reading data much faster and easier. It allows direct access to indexed data across all parts, avoiding the need to check each part one by one.

Rebalancing Partitions

As a database evolves, sometimes we need to move data and requests between nodes. This can happen when there are more queries, needing more computing power, or when there's more data, needing more storage space. Also, if a node stops working, the system has to give the failed node's tasks to other nodes to keep running smoothly.

The process to adjust and redistribute data and load across the cluster is known as rebalancing.

Main points about rebalancing

  • Fair Load Distribution: Ensures data storage and read/write requests are evenly shared across nodes.

  • Continuous Operation: The database should remain operational, handling reads and writes during rebalancing.

  • Efficiency: Aims to move the minimum amount of data necessary to reduce network and disk strain.

Rebalancing strategies

We're gonna cover three important rebalancing strategies:

  • Fixed Number of Partitions

  • Dynamic Partitioning

  • Partitioning Proportional to Nodes

Fixed Number of Partitions

In this approach, the database is split into many partitions from the beginning, with the number of partitions being much larger than the number of nodes. Each node is responsible for several partitions. For example, in a 10-node cluster, you might have 1,000 partitions, with each node initially managing about 100 partitions.

How it works

  • When the cluster needs to grow and a new node is added, this node takes some partitions from the existing nodes. This sharing helps keep the load balanced across all nodes without changing the total number of partitions or how data is assigned to them.

  • On the other hand, if a node is taken away, its partitions get shared among the other nodes. This makes sure the data can still be reached and the workload stays even.

  • The data movement is kept simple by moving whole partitions, making the rebalancing process easier. This way, the cluster can adapt to changes in size or capacity while keeping problems to a minimum.
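
Here's a minimal sketch of a node joining under this scheme (the node names, partition count, and "steal from the busiest node" policy are all just illustrative):

```python
NUM_PARTITIONS = 12  # chosen up front and never changed

# Toy starting layout: 12 partitions spread over 3 nodes.
assignment = {
    "node_a": [0, 1, 2, 3],
    "node_b": [4, 5, 6, 7],
    "node_c": [8, 9, 10, 11],
}

def add_node(assignment: dict[str, list[int]], new_node: str) -> None:
    """The joining node steals whole partitions from existing nodes until the load
    is roughly even. The key -> partition mapping never changes."""
    target = NUM_PARTITIONS // (len(assignment) + 1)
    assignment[new_node] = []
    while len(assignment[new_node]) < target:
        # take one partition from whichever node currently holds the most
        donor = max(assignment, key=lambda n: -1 if n == new_node else len(assignment[n]))
        assignment[new_node].append(assignment[donor].pop())

add_node(assignment, "node_d")
print(assignment)
# node_d now owns 3 partitions; each existing node gave up one whole partition.
```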

Pros

  • Makes scaling operations easier by planning the partition layout ahead of time, so adding or removing nodes is simpler.

  • Spreads the load evenly across the cluster by moving partitions instead of single records, making it more efficient and less likely to cause errors.

Cons

  • The first setup needs to plan for future growth, because the number of partitions decides how well the data can be spread across the cluster.

  • Each partition has overhead, so too many can reduce efficiency. Finding the right balance in partition count is crucial for optimizing performance and resource use.

Visual

Dynamic Partitioning

This method adjusts the number of partitions in the database as it grows or shrinks.

For instance, a database initially set up with 100 partitions can dynamically increase to 200 partitions if the total data size doubles, ensuring that each partition maintains a manageable size. Instead of starting with many partitions, the system creates or removes partitions based on the amount of data.

How it works

  • When a partition gets too big, the system splits it into two smaller partitions. This keeps the data in each partition at a manageable size.

  • If a lot of data is deleted and a partition becomes very small, the system can merge it with a neighboring partition to avoid having too many tiny partitions.

  • Each partition lives on a node, and a node can have many partitions. When the system splits a partition because it's too large, it can move the new partition to another node to spread out the workload.
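
A toy sketch of the split rule (the threshold and data are made up; real systems typically split based on on-disk size and by key range):

```python
MAX_PARTITION_SIZE = 4  # tiny threshold, just for illustration

def maybe_split(partition: list[int]) -> list[list[int]]:
    """If a partition grows past the threshold, split it into two halves around the
    middle key; a real system could then move one half to a less-loaded node."""
    if len(partition) <= MAX_PARTITION_SIZE:
        return [partition]
    keys = sorted(partition)
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]

partitions = [[1, 2, 9, 14, 20, 23]]            # this partition has grown too large
partitions = [half for p in partitions for half in maybe_split(p)]
print(partitions)                               # [[1, 2, 9], [14, 20, 23]]
```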

Pros

  • The system can handle data growth smoothly by making more partitions as needed. This means that no matter how much data you have, the system can keep it organized without any one part getting too overloaded.

  • It's also flexible. If the amount of data decreases, the system can combine partitions to keep things efficient.

Cons

  • At the very start, when the database is small, all data might be in just one partition. This means only one node is doing all the work until the data grows enough to be split into more partitions. Having a single node handle all transactions can lead to bottlenecks, where this node becomes a performance limit for the entire database. To avoid bottlenecks, some systems enable "pre-splitting," allowing administrators to create multiple partitions beforehand, spreading the initial load evenly across the cluster.

  • Deciding when to split or merge partitions requires careful planning to make sure the system remains balanced and performs well.

Visual

Partitioning Proportional to Nodes

This method adjusts partitions in the database based on the number of nodes, aiming to keep workloads even across the cluster.

How it works

  • Suppose a database grows from 10 to 12 nodes to handle more data. The system automatically splits some existing partitions to create new ones, spreading the data across the new nodes.

  • If a node is taken out due to less demand or a failure, its data is distributed among the remaining nodes. This keeps the database running smoothly without any loss of data.
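
Here's a rough sketch of the idea, assuming a fixed number of partitions per node (the numbers and the random choice of which partitions to split are illustrative):

```python
import random

PARTITIONS_PER_NODE = 4  # fixed per node, so the total grows with the cluster

def add_node(partition_owners: dict[int, str], new_node: str) -> None:
    """The joining node picks a fixed number of existing partitions at random,
    splits each one, and takes ownership of one half."""
    next_id = max(partition_owners) + 1
    for victim in random.sample(sorted(partition_owners), PARTITIONS_PER_NODE):
        # half of the victim's key range becomes a brand-new partition on the new node
        print(f"split partition {victim}; new partition {next_id} -> {new_node}")
        partition_owners[next_id] = new_node
        next_id += 1

# Toy cluster: 2 nodes with 4 partitions each.
owners = {i: ("node_a" if i < 4 else "node_b") for i in range(8)}
add_node(owners, "node_c")
print(sum(1 for node in owners.values() if node == "node_c"))  # 4
```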

Decision making

  • The trigger for adding or removing nodes often comes from monitoring the database's performance. If it's getting overloaded or running out of space, adding nodes can help. If many nodes are underused, removing some can cut costs.

  • Some databases can automatically scale up or down based on set rules, like when disk space gets low or the load gets too high, reducing the need for constant human oversight.

Pros

  • Keeps data and workload evenly spread across the cluster, avoiding overburdening any single node.

  • Makes it easier to scale the database by adding or removing nodes based on demand, ensuring efficient resource use.

Cons

  • Changing the number of nodes means moving data around, which can temporarily affect the database's performance.

  • Although automation can handle scaling, setting up and tuning the rules for when and how to scale requires careful planning.

Visual

Manual vs Automatic Rebalancing

Automatic Rebalancing happens without needing manual intervention, ensuring data is evenly spread across the cluster as soon as imbalances occur. This reduces constant monitoring and speeds up adjustments, but may impact performance if system thresholds aren't well-tuned.

Manual Rebalancing requires administrators to decide when to redistribute data, offering more control over the timing and impact of the process. This method prevents unexpected performance problems but demands more monitoring and management, possibly causing delays in fixing imbalances.

Request Routing

Once we've spread our dataset across multiple nodes on different machines, we face a critical question: how does a client know which node to contact for a request?

As the system rebalances partitions among nodes, the mapping of partitions to nodes can change. It's important to track these changes to route requests correctly.

This is an example of a broader issue called service discovery, which isn't only for databases. Any software that can be accessed through a network faces this problem, especially if it aims for high availability (running on multiple machines for backup).

Routing Methods

  1. Direct Node Contact: Clients can connect to any node. If this node has the requested partition, it handles the request directly. Otherwise, it forwards the request to the right node, then sends the response back to the client.

  2. Routing Tier: All client requests first go to a "routing tier," which doesn't process requests but acts as a smart load balancer. It identifies the correct node for each request and forwards it accordingly.

  3. Client Awareness: Clients themselves know about the partitioning and nodes. With this knowledge, they directly connect to the node responsible for the partition they need.
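
To make option 2 above concrete, here's a minimal sketch of a routing tier (the addresses and partition map are hypothetical; real systems typically learn this map from a coordination service such as ZooKeeper):

```python
NUM_PARTITIONS = 4

# The routing tier's current view of which node owns which partition.
# This map must be refreshed whenever rebalancing moves partitions around.
node_for_partition = {0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.3", 3: "10.0.0.1"}

def partition_of_key(key: str) -> int:
    """Must match the partitioning scheme the database nodes themselves use."""
    return hash(key) % NUM_PARTITIONS

def route(key: str) -> str:
    """Return the address of the node currently responsible for this key."""
    return node_for_partition[partition_of_key(key)]

print(route("customer:42"))  # forward the client's request to this node
```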

Visualization of Routing Methods

Keeping Track of Changes

The challenge across these methods is staying updated on the node-partition assignment changes.

Solutions differ:

  • External Coordination Services: Tools like ZooKeeper handle this metadata, offering a trustworthy source for routing choices. Systems like LinkedIn's Espresso and Kafka use ZooKeeper to tell their routing layers or clients about the current node-partition connections.

  • Gossip Protocols: Cassandra and Riak use a gossip protocol, where nodes communicate among themselves to spread the latest cluster state. This model allows requests to be sent to any node, with that node then forwarding the request as needed, eliminating reliance on external services.

  • Routing Tiers and DNS: Systems with a routing tier or the option to connect to any node typically use DNS for clients to find node IP addresses. Since IP addresses change less frequently than partition assignments, DNS provides a stable method for clients to locate nodes.

Conclusion

I'm not gonna lie, this was the hardest topic to understand in the book. I wasn't expecting this post to get so long. My biggest confusion was around secondary indexes and how they're partitioned.

Anyways, I hope you found this insightful!