Introduction

We're gonna do a frontend chat system design exercise together.

This is a really good exercise because it tackles a lot of the nuances of frontend development: real-time updates, state, offline support, and security.

The goal is to go through the entire process of designing this and think about all the trade-offs and considerations.

Requirements

Component Structure: Sketch out the main components of the frontend architecture for a real-time chat application. Identify the key components and their relationships.

State Management: Discuss how you would handle state management within the frontend application. What data needs to be stored and managed?

Real-Time Updates: Explain how you would achieve real-time updates in the chat interface. What technologies or techniques would you use for real-time communication between users?

User Authentication: Describe how you would handle user authentication on the frontend to ensure secure access to the chat.

UI/UX Considerations: Discuss any user interface and user experience considerations, such as message rendering, chat history, and responsive design.

Taking it to production

In the end, we'll discuss how we could take this to production.

Points to keep in mind

You wanna be thorough but keep it at a high level.

This means focusing on what matters. This is also what shows seniority in frontend system design interviews.

1. Clarifying the requirements

Functional requirements

Some immediate questions you might start with to clarify functional requirements:

"What platforms need to be supported? (web/mobile/desktop)" -> Focus on web for now.
"Is this for 1:1 chats, group chats, or both?" -> We can stick to 1:1 chats.
"Do we need to support file attachments or just text messages?" -> Just text is ok.

Some more that show attention to UX:

"Do we want to show typing indicators?" -> Yes.
"Do we want to show the status of messages, e.g. read, delivered, sent?" -> Yes.

Non-functional requirements

Non-functional requirements are things like performance, scalability, security, etc.

When you think about chat systems, there are several important things to consider:

Security: Messages should be encrypted.
Performance: Latency should be low. Back-and-forth chatting should feel good and be fast.
Reliability: If you send a message and go offline, it shouldn't get lost.

2. High level architecture

Points about our high level architecture:

Need both REST and WebSocket for this.
For sockets, we'd need to handle reconnection.
Offline store to not lose messages.
In-memory store for what's needed currently.
For authentication, we'll use JWT. We'd have a refresh and access token. Refresh is long lived and access is short lived.
This needs to be secure. The communication between two people should be end-to-end encrypted. We'll use the Signal Protocol for this, through an existing library rather than our own implementation. We also need to encrypt messages when storing them in the offline store.
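
For the reconnection point above, a common approach is exponential backoff with jitter, so that many clients don't all reconnect at the same instant after an outage. Here is a minimal sketch; the function name, the base delay, and the cap are assumptions for illustration, not values from this design:

```typescript
// Hypothetical helper: delay before reconnect attempt number `attempt` (0-based).
// The delay doubles with each attempt, is capped, and then gets "full jitter".
function reconnectDelayMs(
  attempt: number,
  baseMs = 500, // assumed starting delay
  maxMs = 30000, // assumed upper bound
  random: () => number = Math.random // injectable so it can be tested
): number {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  // Pick a random delay in [0, exponential) to spread out reconnects.
  return Math.floor(random() * exponential);
}
```

A caller would schedule the next connection attempt with something like `setTimeout(connect, reconnectDelayMs(attempt))` and reset `attempt` to 0 once the socket's open event fires.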

3. Data Model

Let's think about the data model.

If you've used WhatsApp or other chat apps, think about your experience there.

Seeing when someone is online or offline.
Seeing when someone is typing.
Seeing messages you've not read yet.
Seeing the status of messages. WhatsApp uses checkmarks for this, e.g. a single grey one means sent, double grey ones mean delivered, and double blue ones mean read.

Chat list is a list of conversations:

interface User {
  id: string;
  name: string;
  image?: string;
  status: "online" | "offline"; // Whether user is online or offline
  lastSeen?: Date; // Last time user was seen online
}

interface Conversation {
  id: string;
  participants: User[]; // For 1:1 chats, there's only two participants.
  lastMessage?: Message; // Last message in the conversation
  lastMessageAt?: Date;
}

// State for each user's conversation
// unread messages is different per user
// this way you can show a number of unread messages in the chat list
interface UserConversationState {
  conversationId: string;
  userId: string;
  unreadCount: number;
  lastReadMessageId?: string;
  isTyping?: boolean; // Works for 1:1 chats
  lastTypingAt?: Date;
}

interface Message {
  id: string;
  clientId: string; // client side generated id to help with idempotency/deduplication
  conversationId: string;
  content: string;
  sentAt: Date;
  sender: User["id"];
  status: "sending" | "sent" | "delivered" | "read" | "failed";
  retryCount?: number; // Number of times the message has been retried
}

// For our offline storage
interface OfflineMessage extends Message {
  syncStatus: "pending" | "synced" | "failed";
  attempts: number;
  lastAttempt?: Date;
}

Some interesting points:

User has a status: offline or online. Last seen helps us communicate when they were last online.
We have a user conversation state. This is different per user. They're in the same conversation but this doesn't mean e.g. everyone has read the same message.
- isTyping tells us whether the user is typing.
- lastTypingAt is used to clear the typing state after a timeout.
- unreadCount is the number of unread messages in the conversation.
Message has a status. Depending on the status we show different indicators.
Offline message is the message in our offline store. When sending, it'd be pending, when sent, it'd be synced.
clientId is used for idempotency/deduplication. We don't wanna add the same message twice. So we'd check both server and client id to see if a message is already in the store.
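
To make the deduplication check concrete, here is a minimal sketch. It uses a trimmed-down message shape, and it assumes the server id is missing on the optimistic copy until the server acknowledges it; `StoredMessage` and `upsertMessage` are hypothetical names for this example:

```typescript
// Trimmed-down version of Message, just for this sketch.
interface StoredMessage {
  id?: string; // server id; assumed to be missing until the server acknowledges
  clientId: string; // generated on the client before sending
  content: string;
}

// Insert the message, or merge it into the copy we already have.
// A match on either clientId or server id counts as "already in the store".
function upsertMessage(
  messages: StoredMessage[],
  incoming: StoredMessage
): StoredMessage[] {
  const index = messages.findIndex(
    (m) =>
      m.clientId === incoming.clientId ||
      (m.id !== undefined && m.id === incoming.id)
  );
  if (index === -1) return [...messages, incoming];
  const next = [...messages];
  next[index] = { ...next[index], ...incoming }; // server copy wins
  return next;
}
```

This way the optimistic message and the server's echo of it end up as one entry instead of two.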

4. API design

REST API

interface ChatAPI {
  // GET /conversations
  getConversations(): Promise<Conversation[]>;

  // GET /conversations/:id/messages?cursor=xyz&limit=50
  getMessages(
    conversationId: string,
    cursor?: string,
    limit?: number
  ): Promise<{
    messages: Message[];
    nextCursor?: string;
  }>;

  // POST /conversations/:id/messages
  sendMessage(conversationId: string, content: string): Promise<Message>;
}

We have a cursor for getMessages to paginate through messages. It wouldn't be efficient to fetch all messages at once. Think of a conversation with a long history.
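
A rough sketch of how the client could walk through the history with that cursor. It assumes each page is ordered oldest to newest and that following nextCursor returns older messages; `HistoryState`, `applyPage`, and `loadOlder` are hypothetical names:

```typescript
interface Page<T> {
  messages: T[];
  nextCursor?: string;
}

interface HistoryState<T> {
  messages: T[]; // oldest first
  cursor: string | undefined; // cursor for the next (older) page
  hasMore: boolean;
}

// Pure step: put the older page in front and remember the next cursor.
function applyPage<T>(state: HistoryState<T>, page: Page<T>): HistoryState<T> {
  return {
    messages: [...page.messages, ...state.messages],
    cursor: page.nextCursor,
    hasMore: page.nextCursor !== undefined,
  };
}

// Same signature as getMessages in ChatAPI, injected so it can be faked.
type GetMessages<T> = (
  conversationId: string,
  cursor?: string,
  limit?: number
) => Promise<Page<T>>;

// Called when the user scrolls near the top of the chat.
async function loadOlder<T>(
  getMessages: GetMessages<T>,
  conversationId: string,
  state: HistoryState<T>,
  limit = 50
): Promise<HistoryState<T>> {
  if (!state.hasMore) return state; // nothing left to fetch
  const page = await getMessages(conversationId, state.cursor, limit);
  return applyPage(state, page);
}
```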

We could also paginate getConversations, depending on the scale of the app. It's something to consider, but a conversation isn't too heavy. Fetching all conversations and then using virtualization should be fine for most users.

WebSocket API

The WebSocket API is the bridge to real-time communication. So we'd use REST for the initial request and to get older messages as you scroll up the chat.

But sockets are what we need to know what's happening in real time.

// Events we LISTEN to
interface WebSocketEvents {
  "message.new": Message; // New message received
  // Status updates (delivered/read); this payload shape is an assumption
  "message.status": { messageId: string; status: Message["status"] };
  "user.typing": { conversationId: string }; // User is typing
  "user.online": { userId: string }; // User came online
}

// Events we SEND
interface WebSocketEmits {
  "message.received": { messageId: string }; // Acknowledge message receipt
  "message.read": { messageId: string }; // Mark message as read
  "user.typing": { conversationId: string }; // User started typing
}

message.new: New message received. Update the in-memory store and acknowledge receipt by emitting message.received.

message.status: Status of a message has changed. One caveat is that we need to respect the order of statuses. If a message is already in the read state and we get a delivered status, we shouldn't update it, because delivered comes earlier in the chain and events can arrive out of order.
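
One way to enforce that ordering is to give each status a rank and only ever move forward. This is a sketch; the rank values and the choice to treat the local-only statuses (sending, failed) as the lowest rank are assumptions:

```typescript
type LocalStatus = "sending" | "sent" | "delivered" | "read" | "failed";
type ServerStatus = "sent" | "delivered" | "read";

// Assumed ranking. "sending" and "failed" are set locally, so any
// server-confirmed status is allowed to replace them.
const STATUS_RANK: Record<LocalStatus, number> = {
  sending: 0,
  failed: 0,
  sent: 1,
  delivered: 2,
  read: 3,
};

// Returns the status to store after a message.status event arrives.
function nextStatus(current: LocalStatus, incoming: ServerStatus): LocalStatus {
  return STATUS_RANK[incoming] > STATUS_RANK[current] ? incoming : current;
}
```

With this, a late delivered event for a message that's already read is simply ignored.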

user.typing: User is typing, we can listen and send this when we're typing ourselves.
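
Typing events need two small rules: on the receiving side, the indicator should disappear when no event has arrived for a while (that's what lastTypingAt is for), and on the sending side, we shouldn't emit on every keystroke. A sketch of both checks; the two durations are assumed values:

```typescript
const TYPING_TIMEOUT_MS = 5000; // assumed: hide indicator after 5s of silence
const TYPING_EMIT_INTERVAL_MS = 2000; // assumed: emit at most every 2s

// Receiving side: is the other user still considered to be typing?
function isStillTyping(
  lastTypingAt: Date | undefined,
  now: Date,
  timeoutMs = TYPING_TIMEOUT_MS
): boolean {
  if (!lastTypingAt) return false;
  return now.getTime() - lastTypingAt.getTime() < timeoutMs;
}

// Sending side: should this keystroke emit a user.typing event?
function shouldEmitTyping(
  lastEmittedAt: number | undefined,
  now: number,
  intervalMs = TYPING_EMIT_INTERVAL_MS
): boolean {
  return lastEmittedAt === undefined || now - lastEmittedAt >= intervalMs;
}
```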

5. Optimizations & Edge cases

Idempotency

I briefly mentioned idempotency/deduplication. This is a common problem in chat systems. We need to ensure that we don't add the same message twice. Because we're showing the sent message optimistically, we need to generate a client-side id, so the server's copy can be matched to the optimistic one instead of being added again.

Client side storage

We'd go with IndexedDB for offline storage. The nice thing about IndexedDB is that it can store large structured objects and is shared across tabs of the same origin. So if the user opens multiple tabs, they all read from the same store. To tell the other tabs that something changed, we can pair it with something like BroadcastChannel.

Network

The user's network is not reliable. If they're offline, we should tell them that.

If the user writes messages while offline, we should queue them up and send them when they come back online. How we send them matters, because ordering matters. One solution is to send them in a batch (per conversation) and let the server handle the ordering based on the client-side timestamp.

Now, the client-side timestamp may initially be used to order the sender's messages, but it then gets replaced by a proper server timestamp. If you've used Discord for a while, you know that messages written while offline can end up out of order once they're sent (e.g. the 2nd message appears after the 3rd). It's a trade-off.

One thing I forgot to mention: we're gonna need some sort of sync manager to deal with this. It should deal with queuing up those batches, one batch per conversation. For the queue, we can use something as simple as an Array. You can think of the sync manager as its own class. This doesn't need to be anything fancy. But to be fair, it'd be good if I had included it in the architecture for clarity.

Are we online again? We can check navigator.onLine and listen for the window's online event.
Get all messages with status pending from IndexedDB (remember OfflineMessage from Data model)
Group by conversationId
Send in batches
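
The grouping part of these steps can be sketched as a pure function. The online check and the IndexedDB read are left out here; `PendingMessage` is a trimmed-down stand-in for OfflineMessage and `buildBatches` is a hypothetical name:

```typescript
// Trimmed-down stand-in for OfflineMessage.
interface PendingMessage {
  clientId: string;
  conversationId: string;
  sentAt: Date; // client-side timestamp
  syncStatus: "pending" | "synced" | "failed";
}

// Keep only pending messages, group them per conversation, and order each
// batch by the client-side timestamp so the sender's order is preserved.
function buildBatches(
  messages: PendingMessage[]
): Map<string, PendingMessage[]> {
  const batches = new Map<string, PendingMessage[]>();
  for (const message of messages) {
    if (message.syncStatus !== "pending") continue;
    const batch = batches.get(message.conversationId) ?? [];
    batch.push(message);
    batches.set(message.conversationId, batch);
  }
  batches.forEach((batch) =>
    batch.sort((a, b) => a.sentAt.getTime() - b.sentAt.getTime())
  );
  return batches;
}
```

The sync manager would then send one request per entry in the map and mark the messages as synced or failed based on the response.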

Performance

Lazy load code that's not needed.
Use virtualization for chat messages. As the user scrolls up, we might end up with a lot of messages in the DOM.
Prefetch the next messages when the user starts to scroll. This way, it doesn't feel like they need to wait for the batch of messages to load.
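
The core of virtualization is working out which slice of the list needs to be in the DOM. Here is a simplified sketch that assumes every row has the same height; real chat messages vary in height, so a virtualization library would measure rows instead. `visibleRange` and its parameters are hypothetical:

```typescript
interface VisibleRange {
  start: number; // index of the first row to render
  end: number; // index after the last row to render (exclusive)
}

// Which rows should be rendered for the current scroll position?
// `overscan` renders a few extra rows on each side to avoid flicker.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number, // assumed fixed height per message
  total: number,
  overscan = 5
): VisibleRange {
  const first = Math.floor(scrollTop / rowHeight);
  const last = Math.ceil((scrollTop + viewportHeight) / rowHeight);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(total, last + overscan),
  };
}
```

For example, with 1000 messages of 60px each in a 600px viewport, only around 20 rows are rendered at any time. The same range can also drive prefetching: once start gets close to 0, it's time to request the next page of older messages.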

6. Taking it to production

Feature flag

This would be built behind a feature flag.

We can gradually roll it out to users. This means not letting everyone use it at once, but only a subset of users. This way, we can get feedback and ensure it's working as expected without major things going wrong.
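
In practice a feature flag service would make this decision for us, but the idea behind a percentage rollout can be sketched in a few lines: hash the user id into a stable bucket from 0 to 99 and enable the feature for buckets below the rollout percentage. The function names and the choice of an FNV-1a-style hash are assumptions for this example:

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hashString(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// The same user always lands in the same bucket for a given flag, so
// raising the percentage only ever adds users and never removes them.
function isInRollout(userId: string, flag: string, percent: number): boolean {
  const bucket = hashString(`${flag}:${userId}`) % 100; // 0..99
  return bucket < percent;
}
```

Including the flag name in the hash means a user who is early for one feature isn't automatically early for every feature.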

In the case of building a full product, maybe you wanna do it incrementally per feature. You wouldn't have the entire product behind a feature flag.

Testing

We need to make sure this works as expected. We need tests that resemble the real world from a user's perspective as much as possible.

It'd be good to have E2E tests here where we test with multiple users and browsers. You can achieve this with Playwright.

Error tracking

We need to know when things go wrong. We should track errors that are happening in production. Sentry works well for this. It gives us the entire stack trace plus the breadcrumbs leading up to the error, so we can understand what happened.

Performance monitoring

Monitoring performance would be good. It's important here to use percentiles and not just averages.

I can imagine we're interested in a few things:

interface MessageMetrics {
  // Time from clicking send to server acknowledgment
  sendLatencyMs: number;

  // Time from send to delivery confirmation
  deliveryLatencyMs: number;

  // WebSocket health
  wsConnectionTime: number;
  wsReconnectionCount: number;

  // API latencies
  getMessagesLatencyMs: number;
  getConversationsLatencyMs: number;
}
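
To show why percentiles matter more than averages here, a small sketch using the nearest-rank method. The `percentile` helper is hypothetical; a monitoring tool would normally compute this for us:

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Example: four fast sends and one very slow one (values are made up).
const sendLatencies = [100, 110, 120, 130, 2000];
const p50 = percentile(sendLatencies, 50); // 120
const p95 = percentile(sendLatencies, 95); // 2000
```

The average of those samples is 492ms, which describes nobody's experience: most sends took around 120ms and one took 2 seconds. The p50 and p95 show both of those facts.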

User behavior tracking

We're gonna want analytics to understand user behavior.

We can use tools like PostHog for this.

Behaviors we might wanna track:

interface UserEvent {
  // When user sends a message and we got a response from server
  "message.sent": {
    conversationId: string;
    success: boolean;
  };

  // When user opens a conversation
  "conversation.opened": {
    conversationId: string;
    // How user found the conversation
    source: "notification" | "list" | "direct";
  };

  // When user scrolls through messages
  "messages.scrolled": {
    conversationId: string;
    direction: "up" | "down";
    // How far they scrolled back in history
    oldestMessageTimestamp: number;
  };
}

Frontend System Design: Chat Application