Hashing in System Design

Hashing in System Design

Consistent and Rendezvous explained.

Introduction

Hashing is a basic method in system design that changes data of any size to a fixed size. This helps spread client requests across servers, improve load balancing, and make caching better.

The Naive Hashing Approach

At first, think about a basic situation: many clients send requests to one server. When the number of requests goes up, the server gets too busy, which can cause the service to get worse or even fail.

To prevent this, systems scale in two primary ways:

  • Vertical Scaling (Scaling Up): Enhancing the server's capacity with more powerful resources.

  • Horizontal Scaling (Scaling Out): Adding more servers to distribute the load.

Horizontal scaling is a popular approach because of it's flexibility and how well it scales. However, now we have multiple servers. We run into the problem of needing to distribute the load.

This is where load balancers come in. A load balancer is placed between clients and servers to evenly distribute the load when clients make requests.

The naive approach here is to map every client to a server by hashing e.g. the client's IP address and modding it to a server (using % operator).

But, this method doesn't work well when servers are added or removed. It requires rehashing all clients, which disturbs the distribution and makes cached data on servers less useful.

Consistent Hashing

Consistent hashing minimizes the redistribution of requests when the server pool changes. It places both clients and servers on a hash ring (also called circle), assigning each request to the nearest server in the clockwise direction.

Notes:

  • C are the clients, hashed onto a specific position on the circle.

  • Servers are also hashed.

  • Virtual nodes are excluded from the diagram, but you may have e.g. server B placed between C and D if server B is more powerful and you want to utilize it's resources.

  • This is nice because if e.g. server D goes down, Client 4 and 3 simply point to server A. We don't have to redistribute the entire system just because one server is added/removed.

How It Works

  1. Hash Space: Both servers and clients are hashed onto a virtual circle.

  2. Mapping: Clients are assigned to the nearest server on the circle in a clockwise direction.

  3. Server Addition/Removal: Changes affect only the immediate neighbors on the circle, significantly reducing cache invalidation.

Virtual Nodes

Virtual nodes help balance the workload, especially when servers have different capacities. A server can have multiple points on the circle, making request distribution more even.

Rendezvous Hashing

Rendezvous hashing, also known as Highest Random Weight (HRW) hashing, offers a different approach. It calculates a unique score for each server-client pair and assigns the client to the server with the highest score.

How It Works

  1. Score Calculation: Each server computes a score based on both the server and client identifiers.

  2. Server Selection: The client is assigned to the server with the highest score.

  3. Dynamic Adjustments: Adding or removing servers requires recalculating scores without major problems.

Note: Each client calculates a score with every server and selects the server with the highest score.

Choosing the Right Hashing Strategy

  • Consistent Hashing is ideal for frequently scaling systems as it minimizes cache inconsistency by evenly distributing load. It's useful in dynamic environments, limiting the impact of server changes and ensuring smooth transitions.

  • In contrast, Rendezvous Hashing is your go-to choice when you need a straightforward and quick way to connect clients to servers. Unlike Consistent Hashing, Rendezvous Hashing makes it easier to reassign clients to different servers without affecting the overall system. This approach is most effective in scenarios where servers are similar in performance and capacity. It allows for flexible server assignment, making it simpler to manage without the worry of unbalancing the system. However, if servers have varying capabilities, Rendezvous Hashing may not distribute work evenly, as it doesn't adapt to server performance or workload. Then you should go with Consistent Hashing.