A friendly introduction to Replication (distributed systems)

A friendly introduction to Replication (distributed systems)

Introduction

This will be fun!

I guess you're familiar with the concept of distributed systems, but not sure about the fancy term "Replication".

I want to briefly explain Replication, and then we'll dive into a story where you and I try to scale our e-commerce app.

Replication

Replication is when you have the same data on more than one computer, and they're connected through a network.

There are several reasons why you might want to replicate data:

  • To keep data near your users (reduce latency)

  • To let the system keep working even if some parts break (which improves availability)

  • To increase the number of computers that can handle read requests (improve read speed)

Chapter 1: The Beginnings

The start

We begin building an e-commerce app aiming for ease of use and handling various activities like browsing products and processing payments.

Initially, we use a single database for user accounts, products, orders, and payments, which is simple and cost-effective.

Upon launch, the platform works well initially, but as user numbers grow, we encounter slow response times and issues with multiple transactions, indicating the need for a more robust solution.

For example, during a big holiday sale, our website experienced heavy traffic. The single database struggled to handle it, resulting in longer page loading times and transaction timeouts.

One customer tried to buy a limited-edition item but missed out due to slow response time. They were upset and threatened to shop elsewhere.

This event highlighted the need for a more robust solution to handle high traffic and prevent lost sales and unhappy customers.

Chapter 2: Discovering Replication

Aha Moment!

As we dive into solving our e-commerce app's scaling issues, we encounter the concept of database replication. This method makes many copies of our main database. It's like having extra backup dancers ready to help the main dancer, making sure the performance never fails.

Implementing Leaders and Followers

We choose to use a leader-follower model. Imagine a chef (the leader) in a kitchen, making a recipe that the assistant chefs (followers) copy. In our database system, one database is the 'leader' and takes care of all write actions (like adding new items or managing orders). The 'followers' copy these changes and deal with read actions (like looking at products).

Now, the leader can handle read operations. However, in a read-heavy system, it's better to let the leader focus on write operations and have replicas focus on read operations.

Having a leader and followers makes our operations easier. The leader database takes care of all updates, while the follower databases deal with queries. This way, no single database gets too busy, making everything more efficient.

Observation

With this setup, we see clear improvements. Pages load quicker, even at busy times, and transactions go smoothly. The leader-follower model not only fixes our first issues but also prepares us for more growth.

Chapter 3: Overcoming New Challenges

Oh no, Replication lag

As our e-commerce platform scales up, we face the inevitable issue of replication lag. The delay between data being updated in the leader database and its replication in the follower databases. This becomes evident when a customer doesn't see their recent order in their account immediately.

These delays, even though small, can really affect user trust and happiness. It's very important for our platform to show up-to-date information, especially for important actions like order confirmations and account changes.

Balancing Sync and Async Replication

We explore two primary replication methods: synchronous and asynchronous.

Synchronous replication guarantees data consistency, but at the cost of speed.

Asynchronous replication is quicker, but it might cause data to be briefly out-of-sync.

After careful consideration, we decide on a hybrid approach.

For critical operations, such as processing payments and placing orders, we use synchronous replication to ensure data consistency and user trust.

For less critical operations, like browsing products or reading reviews, we go with asynchronous replication to maintain high performance and a smooth user experience.

This method helps us find a balance between having accurate data for important tasks and keeping things fast for general browsing. It's a custom solution that fits our needs and what users expect.

It's often referred to as semi-synchronous replication.

Use Cases

Synchronous replication is great for tasks where data accuracy is very important, like financial transactions or updating user accounts.

Asynchronous replication works well for situations where having the most recent data isn't as crucial, like showing product categories or user reviews.

Chapter 4: Scaling Up and Out

Expanding the Follower Base

As our e-commerce platform grows, so does the demand for data access. Picture our database system as a team of workers, with each follower database handling customer requests like data queries. To manage increasing data requests and ensure fast, accurate processing, we must add more follower databases.

Implementing Effective Replication Strategies

We encounter new synchronization challenges as we add more followers. To manage the challenges posed by additional followers, we consider two main strategies: parallel replication and delayed replication.

  1. Parallel Replication: This approach lets the main database do many replication tasks at once. It's like having multiple workers filling shelves at the same time, making sure everything is refilled fast. Parallel replication greatly speeds up data syncing between followers and is great for managing lots of read-and-write actions happening together.

  2. Delayed Replication: Delayed replication purposely slows down the replication process for some followers. It's like a buffer zone. Imagine a live event being broadcasted with a delay, so you can fix mistakes before the audience sees them. This method helps with disaster recovery, giving a chance to fix problems in the main database before they spread to all followers.

Choosing Replication Strategy

For our platform, parallel replication is the better choice.

The main reason is that it can handle more followers without slowing down. It makes sure all followers are updated fast and consistently, which is important for a real-time user experience.

Delayed replication is good for certain situations like disaster recovery, but it doesn't match our need for quick data consistency across all nodes, especially for a dynamic e-commerce platform.

Handling Failures and Failovers

In a complicated system, like our e-commerce platform, there's a chance that the main database (the leader) might fail. We need to be ready for this, just like having backup plans for important tasks. A failover strategy is our backup plan for databases.

What is a failover?

Failover is what we do if the main database stops working because of problems like broken hardware or network issues. It means we automatically move operations to a backup system. In our situation, making one of the follower databases the new leader. This way, our platform keeps working, like when a vice-captain takes over if the captain can't do their job.

Automating the Failover Process

To reduce service disruptions, our failover process is completely automatic. We use tools that always monitor our main database's health. If they find a problem, the system starts the failover steps right away. This means choosing the best backup database to become the new leader, using rules in place.

Criteria for Electing a New Leader

For the new leader, we choose a database with the newest data, lowest network delay, and best performance. This makes sure the new leader is a great copy of the original and can manage the work well.

Challenges During Failover

Keep in mind that failover isn't perfect. There might be a short time when data isn't completely in sync, which could impact ongoing transactions. Our system aims to reduce these issues and quickly get back to normal once the new leader is in charge.

Post-Failover Actions and Reintegration

After the new leader starts working, the other followers sync with it. At the same time, we fix the original leader database. When it's back online and stable, we decide if it should be a leader or follower, depending on how reliable it is and the system's current state.

Chapter 5: Advanced Replication Strategies

Let's take a look at some advanced strategies we may need to use as our app grows. We won't dive too deep into those, considering this is an introduction to replication. However, I don't want to leave you hanging or lost.

Overview of Multi Leader Replication

As our e-commerce platform grows worldwide, we face new challenges in handling data across different regions. To manage this, we explore a strategy called multi-leader replication.

What is Multi Leader Replication

This method involves having more than one main database (called leaders). Each of these is in a different part of the world. It's similar to having several head offices in different countries, each taking care of its local area.

Benefits

  • Faster Access: With databases located closer to where users are, information moves more quickly, making the website faster for them.

  • Better Local Service: Each database focuses on the users near it, leading to a smoother experience in each region.

  • Safer and More Reliable: Having several main databases means if one has a problem, the others keep working.

New challenges to deal with

  • Keeping Data Consistent: When using multiple main databases, it's important to make sure they all have the same information. This can be tricky, especially when they are far apart.

  • Handling Conflicts: Sometimes, different databases might have conflicting information. We need special methods to find and fix these differences.

  • Extra Costs and Effort: This setup can be more complex and expensive because it needs more resources and careful management.

  • Customizing for Our Needs: We have to customize this system to suit our specific e-commerce platform, particularly in how we synchronize data between these databases.

Overview of Leaderless Replication

After exploring multi-leader replication, we now dive into the concept of leaderless replication.

This method is very different from the usual leader-follower systems. In leaderless replication, each node (or database server) in the network can manage both reading and writing tasks.

Think of a group where all members can do the same jobs and are responsible for the same tasks. There's no specific leader giving out tasks. Instead, each person works on their own, helping to reach the main goal. This is what leaderless replication is like.

Benefits

  • High Availability and Fault Tolerance: One of the key benefits of leaderless replication is its high availability. Since every node can handle writes, the system doesn’t rely on a single point of failure, making it highly fault-tolerant.

  • Data Consistency: Leaderless systems often use consensus algorithms (like Paxos or Raft) to ensure consistency across nodes. These algorithms help in synchronizing data, although they can add complexity to the system.

New challenges to deal with

  • Conflict Resolution: Similar to multi-leader systems, leaderless replication requires effective conflict resolution mechanisms, especially in scenarios where two nodes receive write requests for the same data at the same time.

  • Latency Considerations: Leaderless replication can improve availability, but it might sometimes make things slower, especially when using consensus algorithms to confirm writes across many nodes.

When to consider multi or leaderless?

Multi-leader replication works well for global apps, systems with high availability, and those with lots of writing.

On the other hand, leaderless replication is great for distributed databases like Amazon's DynamoDB and Apache Cassandra. It provides high availability and fault tolerance. This method is made to deal with node failures smoothly and is used in systems that need fault tolerance and high availability.

Conclusion

In conclusion, Replication is a powerful strategy to scale our system. With great power comes great responsibility.

As your app grows, you may realize the need for Replication. Add replication and you'll slowly keep encountering more challenges.

Thankfully, many services/managed databases handle Replication for us.

In the future, I hope to cover Partitioning too.