Distributed Systems: Setting Up New Followers in a Leader-Based Replication System
Introduction
When managing databases in a leader-based replication setup, adding new followers is a common task. This might be necessary to increase the system's resilience, expand its capacity, or replace nodes that have failed.
However, introducing a new follower to the system is not as straightforward as simply copying data from one node to another.
In this post, we'll go over the steps of how it's done.
What's a snapshot?
A snapshot, when talking about a database, is a picture of the data at a specific time. It shows the exact state of all data at that moment, making sure everything in the snapshot matches up for that point in time. This is important in a changing environment where data is always being read and added to the database.
Step 1: Create a Consistent Snapshot
The first difficulty in adding a new follower is getting a consistent snapshot of the leader's database. Since databases are always changing with new data being added, just copying files won't work. This would cause different parts of the database to be captured at different times, creating inconsistencies.
To avoid this, you need to create a snapshot that accurately reflects the database at a specific moment. Fortunately, most database systems offer a way to do this without locking the entire database, which would prevent new data writes and compromise availability. This snapshot serves as the starting point for the new follower.
The problems with copying files
When you initiate a file copy process, you're capturing the state of each file at the start of its copy operation. If you have multiple files to copy, you're likely copying them one at a time. Because the database is live and receiving new updates, the state of the database can change between copying the first file and the last file.
As a result, the first file you copy might represent the database's state at time T1, and the last file might represent the state at a slightly later time, T2. If changes occur between T1 and T2, those changes will be reflected in the files copied at different times. This results in an inconsistency because the copied files reflect the database's state at different moments, not a single, unified point in time.
How is Creating a Snapshot Different from Copying Files?
A snapshot captures the entire state of the database at a single point in time, ensuring consistency across all data. Even if the database is live and changes are happening.
When initiating the creation of a consistent snapshot, it's a common misconception that the entire database must be frozen in a literal sense, stopping all reads and writes. This is not the case. Current database systems use advanced techniques (like WAL) to make sure a snapshot shows a clear state of the data at a certain moment, without needing to stop ongoing database actions.
Step 2: Transfer the Snapshot
Once you have a consistent snapshot, the next step is to transfer this snapshot to the new follower node. This operation is usually straightforward but can involve significant data transfer, depending on the size of your database.
Step 3: Catch Up with the Leader
Once the new follower has the snapshot, it needs to sync with the leader to get all the data changes that happened since the snapshot was made. This is when replication logs are used (I'll write about them in the future). The snapshot is linked to a specific spot in the leader's replication log, which might have different names depending on the database system (like "log sequence number" in PostgreSQL or "binlog coordinates" in MySQL).
The new follower uses this information to request the necessary data changes from the leader. Once it has processed these changes, the follower is considered to have "caught up" with the leader. At this point, it can start receiving and applying new changes in real time, just like the other followers.
Practical Considerations
Setting up a new follower can be different for each database system. Some systems make this process easy, while others need more manual work. This might include many detailed steps that an administrator has to do.
No matter what database technology you use, the main ideas explained here still work. The important thing is to make sure the new follower begins with a matching snapshot of the leader's database and then handles all data changes up to now. This method lets the new follower join the replication system easily without downtime or losing data accuracy.
Summary
In summary, adding a new follower to a leader-based replication system involves creating a consistent snapshot, transferring this snapshot to the new node, and then catching up with the leader's ongoing changes. By following these steps, you can expand your replication setup effectively while maintaining high availability and data consistency.