Introduction
High availability is a core requirement of modern distributed systems: a service should keep responding even when some of its components fail.
This post, based on Chapter 5 of Designing Data-Intensive Applications by Martin Kleppmann, explores how replicated systems cope when individual nodes stop working.
Any Node May Go Down
In a distributed system, any node can go down at any time, whether because of a hardware fault or because it must be taken offline for planned maintenance. The tricky part is restarting failed nodes, or tolerating their absence for a while, without hurting the system's overall availability.
The main goal is to keep the system running and to minimize the impact of any single node's failure.
Follower Failure: Catch-Up Recovery
In a leader-based replication setup, followers are designed to be resilient: each follower keeps a local log of the data changes it has received from the leader, so a crash or a dropped connection is not a catastrophe.
Upon restart or reconnection, it can quickly catch up by requesting from the leader all changes it missed during its downtime. This catch-up process ensures that once the follower is up-to-date, it can seamlessly resume its role in the replication architecture.
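To make catch-up recovery concrete, here is a minimal Python sketch. The Leader and Follower classes, the in-memory log of (offset, change) pairs, and the changes_since method are illustrative assumptions rather than any particular database's API.

```python
class Leader:
    def __init__(self):
        self.log = []  # ordered list of (offset, change) entries

    def append(self, change):
        offset = len(self.log)
        self.log.append((offset, change))
        return offset

    def changes_since(self, offset):
        # Everything the follower has not seen yet.
        return self.log[offset:]


class Follower:
    def __init__(self):
        self.applied_offset = 0  # position in the leader's log applied so far
        self.state = {}

    def catch_up(self, leader):
        # On restart or reconnection, request only the missed changes.
        for offset, (key, value) in leader.changes_since(self.applied_offset):
            self.state[key] = value
            self.applied_offset = offset + 1


leader = Leader()
leader.append(("user:1", "alice"))
leader.append(("user:2", "bob"))

follower = Follower()
follower.catch_up(leader)     # follower missed both writes while it was down
leader.append(("user:3", "carol"))
follower.catch_up(leader)     # picks up only the third write
print(follower.state)         # {'user:1': 'alice', 'user:2': 'bob', 'user:3': 'carol'}
```

Because the follower records how far it has applied the log, repeated catch-ups are cheap: each one transfers only the changes made since the last successful sync.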
Leader Failure: Failover
Handling a failed leader is more complicated. Failover, promoting a follower to become the new leader, requires careful planning. It can be performed manually, by an operator who is notified and takes the necessary steps, or automatically.
Automatic failover involves several steps, sketched in code after the list:
Detecting Leader Failure: Typically done through timeouts, where a lack of response from the leader within a set period triggers failover procedures.
Choosing a New Leader: This might involve an election among followers or designation by a pre-chosen controller node. The replica with the most up-to-date data is usually the best candidate.
Reconfiguring the System: All nodes, including clients, need to recognize and align with the new leader to maintain operations.
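Here is a minimal sketch of those three steps, assuming a hypothetical Replica class with last_heartbeat and log_position fields; the 30-second timeout is likewise an illustrative choice.

```python
import time


class Replica:
    def __init__(self, name, log_position):
        self.name = name
        self.log_position = log_position       # how far this replica has replicated
        self.last_heartbeat = time.monotonic()
        self.is_leader = False


def detect_leader_failure(leader, timeout_seconds):
    # Step 1: treat the leader as dead if it has been silent longer than the timeout.
    return time.monotonic() - leader.last_heartbeat > timeout_seconds


def choose_new_leader(followers):
    # Step 2: promote the most up-to-date follower to minimize lost writes.
    return max(followers, key=lambda r: r.log_position)


def reconfigure(replicas, new_leader):
    # Step 3: every node (and, in practice, every client) must route writes
    # to the new leader from now on.
    for r in replicas:
        r.is_leader = (r is new_leader)


followers = [Replica("f1", log_position=120), Replica("f2", log_position=135)]
old_leader = Replica("old-leader", log_position=140)
old_leader.last_heartbeat -= 31        # simulate 31 seconds of silence

if detect_leader_failure(old_leader, timeout_seconds=30):
    new_leader = choose_new_leader(followers)
    reconfigure(followers, new_leader)
    print(f"promoted {new_leader.name}")   # promoted f2
```

Note that f2 is chosen because it has replicated further than f1, yet it still lags behind the old leader's position of 140, which is exactly where the data-loss risk discussed below comes from.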
What Can Go Wrong with Failover?
While crucial, failover is fraught with potential pitfalls:
Data Loss Risk: With asynchronous replication, the new leader might not have received some of the old leader's latest writes. Those missing updates may never be recovered, resulting in data loss.
Split Brain Syndrome: A scenario where two nodes believe they're the leader, causing data corruption or loss due to conflicting writes.
A real-world example is the GitHub incident: an out-of-date follower was mistakenly promoted to leader and reused primary keys that were also referenced in Redis, leading to inconsistencies between the MySQL and Redis data.
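One common safeguard against split brain (a general technique, not something specific to the GitHub incident) is to tag each leadership term with a monotonically increasing epoch number and have storage nodes reject writes from a stale leader. A minimal sketch with hypothetical names:

```python
class Storage:
    def __init__(self):
        self.highest_epoch_seen = 0
        self.data = {}

    def write(self, key, value, leader_epoch):
        # Reject writes from any node whose leadership term is stale.
        if leader_epoch < self.highest_epoch_seen:
            raise PermissionError(f"rejected write from stale leader (epoch {leader_epoch})")
        self.highest_epoch_seen = leader_epoch
        self.data[key] = value


storage = Storage()
storage.write("x", 1, leader_epoch=1)       # original leader
storage.write("y", 2, leader_epoch=2)       # newly promoted leader
try:
    storage.write("x", 99, leader_epoch=1)  # old leader still believes it is in charge
except PermissionError as err:
    print(err)                              # the conflicting write is fenced off
```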
Choosing the right timeout for declaring the leader dead is also important. If it is too short, temporary slowdowns such as load spikes or network glitches can trigger unnecessary failovers; if it is too long, recovery from a genuine failure is delayed.
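A tiny illustration of that tradeoff, using made-up numbers for the timeouts and the length of a transient hiccup:

```python
def leader_declared_dead(unresponsive_seconds, timeout_seconds):
    # The leader is declared dead once it has been silent longer than the timeout.
    return unresponsive_seconds > timeout_seconds

transient_hiccup = 12  # seconds of silence caused by a load spike; the leader is actually fine

print(leader_declared_dead(transient_hiccup, timeout_seconds=5))   # True: triggers a needless failover
print(leader_declared_dead(transient_hiccup, timeout_seconds=30))  # False: no false alarm, but a real
                                                                   # crash would take 30 s to detect
```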
Conclusion
Maintaining a leader-based replication system requires careful handling of node failures.
There are no simple answers to these problems, which is why some operations teams prefer to perform failovers manually even when the software supports automatic failover.