Problems with Replication Lag

Problems with Replication Lag

Introduction

Replication lag is a critical aspect of distributed systems that affects data consistency, user experience, and overall system reliability.

My aim here is to simplify the topic, explaining it concisely without leaving important information out.

What Is Replication Lag?

Replication lag occurs when there is a delay between a write operation being executed on the primary (leader) node and the same operation being replicated to the secondary (follower) nodes.

Replication lag is a basic issue in leader-based replication systems that are made for better availability, scalability, and faster response times by spreading out read operations across replicas in different locations.

Why Does Replication Lag Happen?

The main reasons for replication lag are network delays, the time needed to copy data between nodes, and the system's workload.

Replication lag gets worse with asynchronous replication because the system keeps working without waiting for all copies to confirm the changes. This trade-off means less consistency right away, but better availability and performance.

The Impact of Replication Lag on User Experience

Replication lag can lead to temporary inconsistencies across the database, affecting user experiences in various ways. Users might encounter outdated information or seemingly lost updates, leading to confusion and mistrust in the system's reliability.

Problems with Replication Lag

Reading Your Own Writes

The challenge

Users expect to see their updates (e.g., posts or comments) immediately after submission. However, with replication lag, a user's subsequent read request might be served by a follower that hasn't yet received the update, making it seem as though the update was lost.

Example

A user posts a comment on a social media platform and refreshes the page, only to find their comment missing because the read was served by a lagging follower.

Solutions

Direct Reads After Write: Ensure that any read operation following a user's write is directed to the leader or a sufficiently updated replica.

Timestamp Tracking: Utilize the timestamp of the user's last write to ensure reads are served by replicas that have been updated beyond this timestamp.

Client-Side Handling: The client tracks the last write timestamp and requests data from replicas known to be up-to-date.

Monotonic Reads

The Challenge

Users see newer updates first and older ones later because they get data from replicas that have different levels of lag.

Example

A user checks the latest comments on a post, sees new comments, and upon refreshing, older comments appear, making it seem like recent comments disappeared.

Solutions

Sticky Sessions: Bind user sessions to certain replicas to make sure they see updates in the right order without going back to older ones.

Read from Updated Replicas: Implement logic to serve reads only from replicas that are not lagging beyond a certain threshold.

Consistent Prefix Reads

The Challenge

Users perceive violations of causality when related updates are seen out of order due to different replication lags across partitions.

Example

In a chat application, a message responding to a question appears before the question itself due to the response being replicated faster than the question.

Solutions

Partition-Aware Writes: Ensure causally related writes go to the same partition.

Causal Dependency Tracking: Use algorithms to track and enforce the replication order based on causal relationships between operations.

Conclusion

We shouldn't pick the strategy to solve the replication lags we have blindly. We need to evaluate the situation, trade offs and how the end experience would be for the user. That's why each problem had a different example.

They all address "Replication Lag" but each should not be used for the same things.