15+ System Design Interview Questions 2025: Scalability, Caching & Load Balancing

·14 min read

system-design · interview-questions · architecture · scalability · backend

Twitter handles 500 million tweets per day. Netflix streams to 230 million subscribers simultaneously. Google processes 8.5 billion searches daily. How do these systems actually work? More importantly—can you explain how to build one in 45 minutes? System design interviews at FAANG companies have a 60% fail rate, often because candidates dive into details without a structured approach.

Table of Contents

  1. Framework Questions
  2. Requirements Gathering Questions
  3. Scale Estimation Questions
  4. High-Level Design Questions
  5. Deep Dive Questions
  6. Common Patterns Questions
  7. Classic Problems Questions
  8. Follow-Up Questions
  9. Quick Reference

Framework Questions

These questions test your ability to approach system design systematically.

How should you structure a system design interview?

System design interviews are conversations, not coding tests. Your goal is to demonstrate structured thinking, make reasonable trade-offs, and communicate clearly. A proven 6-step framework allocates 45 minutes effectively.

The framework breaks down as follows:

  • Step 1: Clarify Requirements (5 min) - Ask about functional and non-functional requirements
  • Step 2: Estimate Scale (5 min) - Back-of-envelope calculations for storage, bandwidth, QPS
  • Step 3: Define API and Data Model (5 min) - Core endpoints and database schema
  • Step 4: High-Level Design (10 min) - Draw main components and data flow
  • Step 5: Deep Dive (15 min) - Detail 2-3 specific components with trade-offs
  • Step 6: Address Bottlenecks (5 min) - Scaling, reliability, and monitoring

Requirements Gathering Questions

These questions test whether you understand requirements before designing.

What questions should you ask before starting a design?

Never start designing without asking questions. The first 5 minutes should be spent understanding what you're building and the constraints you're working within.

"Before I dive in, I'd like to understand the requirements better..."

Functional requirements - What should the system do?

  • Core features only (don't over-scope)
  • User actions and flows
  • Input/output expectations

Non-functional requirements - How well should it do it?

  • Scale: How many users? Requests per second?
  • Performance: Acceptable latency?
  • Availability: What uptime is required?
  • Consistency: Can data be eventually consistent or must it be strong?

Constraints:

  • Are we building from scratch or integrating with existing systems?
  • Any technology preferences or restrictions?
  • Budget or team size considerations?

Scale Estimation Questions

These questions test your ability to think about real-world numbers.

How do you estimate scale for a system like Twitter?

Back-of-envelope calculations show you think about real-world constraints. These estimates inform your architecture decisions—a read-heavy system needs different optimization than a write-heavy one.

Example: Designing Twitter

Users: 500M monthly active users
Daily active: 200M (40%)
Tweets per day: 500M (avg 2.5 per active user)
Reads per day: 200M users × 100 tweets viewed = 20B reads

Tweets per second: 500M / 86400 ≈ 6000 TPS (write)
Reads per second: 20B / 86400 ≈ 230,000 QPS (read)

Storage per tweet: 280 chars + metadata ≈ 500 bytes
Daily storage: 500M × 500 bytes = 250GB
Yearly storage: 250GB × 365 = ~90TB (just text, media is much more)

Key insight: This is a read-heavy system (230K reads vs 6K writes). Design accordingly—optimize for reads with caching and pre-computation.
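As a sanity check, the arithmetic above can be scripted. The inputs are the same assumptions used in the estimate; only the variable names are new:

```javascript
// Back-of-envelope estimates for a Twitter-scale system.
// All inputs are the assumptions stated in the text above.
const DAU = 200e6;                // daily active users
const tweetsPerDay = 500e6;
const tweetsViewedPerUser = 100;
const bytesPerTweet = 500;        // 280 chars + metadata
const SECONDS_PER_DAY = 86400;

const writeTps = tweetsPerDay / SECONDS_PER_DAY;             // ≈ 5,800 (≈ 6K)
const readsPerDay = DAU * tweetsViewedPerUser;               // 20B
const readQps = readsPerDay / SECONDS_PER_DAY;               // ≈ 231,000 (≈ 230K)
const dailyStorageGB = (tweetsPerDay * bytesPerTweet) / 1e9; // 250 GB
const yearlyStorageTB = (dailyStorageGB * 365) / 1e3;        // ≈ 91 TB

console.log({ writeTps: Math.round(writeTps), readQps: Math.round(readQps),
              dailyStorageGB, yearlyStorageTB });
```

Rounding to one significant figure (6K writes, 230K reads) is deliberate: the inputs are guesses, so more precision would be false precision.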


High-Level Design Questions

These questions test your ability to design system architecture.

How do you design a high-level architecture for Twitter?

Start by drawing the main components and explaining the data flow. Walk the interviewer through how a request travels through your system, explaining each component's purpose.

flowchart TB
    CDN["CDN<br/>(static assets)"]
    CLIENT["Client"]
    LB["Load Balancer"]
    API["API Servers"]
    CACHE["Cache<br/>(Redis)"]
    TWEET_SVC["Tweet Service"]
    TIMELINE_SVC["Timeline Svc"]
    TWEET_DB["Tweet DB<br/>(Sharded)"]
    TIMELINE_CACHE["Timeline Cache<br/>(Redis)"]
 
    CDN --> CLIENT
    CLIENT --> LB
    LB --> API
    API --> CACHE
    API --> TWEET_SVC
    API --> TIMELINE_SVC
    TWEET_SVC --> TWEET_DB
    TIMELINE_SVC --> TIMELINE_CACHE

Walk through the flow:

"When a user posts a tweet: request hits the load balancer, goes to an API server, which writes to the Tweet database. Then we need to update timelines - this is where it gets interesting.

For reading timelines, we want to avoid expensive database queries, so we pre-compute timelines and store them in Redis. When you open the app, we just read from cache.

The challenge is: when should we update these cached timelines?"

How do you define APIs and data models?

APIs and data models should be defined early to clarify what the system does and how data is structured. Keep APIs RESTful and data models normalized but practical.

API Design:

POST /tweets
  body: { text, media_ids }
  returns: { tweet_id, created_at }

GET /timeline
  params: ?cursor=xxx&limit=20
  returns: { tweets: [...], next_cursor }

GET /users/{id}/tweets
  returns: { tweets: [...] }

POST /follow/{user_id}
DELETE /follow/{user_id}

Data Model:

User
  - id (PK)
  - username
  - email
  - created_at

Tweet
  - id (PK)
  - user_id (FK)
  - text
  - created_at
  - media_urls

Follow
  - follower_id (PK, FK)
  - followee_id (PK, FK)
  - created_at

Deep Dive Questions

These questions test your ability to go deep on specific components.

What is the difference between fan-out on write and fan-out on read?

This is the classic Twitter timeline design problem. The choice between push and pull models has significant implications for write latency, read latency, and storage requirements.

Fan-out on Write (Push model):

When user posts tweet:
1. Write tweet to DB
2. Get all followers (could be millions)
3. Push tweet to each follower's timeline cache

Pros: Fast reads - timeline is pre-computed
Cons: Slow writes for users with many followers (celebrities)
      High storage - tweet duplicated N times

Fan-out on Read (Pull model):

When user reads timeline:
1. Get list of who they follow
2. Fetch recent tweets from each
3. Merge and sort

Pros: Fast writes - just store the tweet once
Cons: Slow reads - must query multiple users
      High compute at read time

Hybrid approach (what Twitter actually uses):

- Regular users: Fan-out on write
- Celebrities (>10K followers): Fan-out on read

When building timeline:
1. Read pre-computed timeline (regular users' tweets)
2. Merge with celebrity tweets fetched on-demand
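The merge step above can be sketched as a pure function. `precomputed` and `celebrityTweets` stand in for data fetched from the timeline cache and the tweet store; the names are illustrative, not from any real Twitter API:

```javascript
// Hybrid timeline read: merge the pre-computed feed (regular users,
// fan-out on write) with tweets fetched on demand for followed
// celebrities (fan-out on read). Tweets carry a `createdAt` timestamp.
function buildTimeline(precomputed, celebrityTweets, limit = 20) {
  return [...precomputed, ...celebrityTweets]
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, limit);
}

const feed = buildTimeline(
  [{ id: 1, createdAt: 100 }, { id: 2, createdAt: 300 }],
  [{ id: 3, createdAt: 200 }],
  2
);
// feed holds tweets 2 and 3, newest first
```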

How do you shard a database?

With 500M tweets per day, a single database won't scale. Sharding horizontally partitions data across multiple databases, but choosing the right shard key is critical.

"Shard key options:

  • User ID: All tweets from a user on same shard. Good for user profile pages, bad for timeline (must query all shards)
  • Tweet ID: Even distribution. Good for single tweet lookups, bad for user's tweet list
  • Time-based: Recent data on 'hot' shards. Good for recent queries, needs re-sharding over time

I'd recommend user_id sharding with a separate index for tweet_id lookups."
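A minimal sketch of hash-based routing on user_id. The modulo scheme shown is the simplest option; production systems usually prefer consistent hashing, because under modulo, adding a shard reshuffles most keys:

```javascript
// Route a user's data to a shard by hashing user_id.
// FNV-1a-style string hash, for illustration only.
const NUM_SHARDS = 16;

function shardFor(userId) {
  let hash = 2166136261;
  for (const ch of String(userId)) {
    hash ^= ch.charCodeAt(0);
    hash = Math.imul(hash, 16777619) >>> 0; // keep it a 32-bit unsigned int
  }
  return hash % NUM_SHARDS;
}
```

The key property is determinism: the same user_id always maps to the same shard, so every API server agrees on where a user's tweets live.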

How do you address bottlenecks and ensure reliability?

The final step is identifying potential bottlenecks and explaining how to address them. This shows you think about production-ready systems.

Scaling:

  • Horizontal scaling of API servers behind load balancer
  • Database read replicas for read-heavy workload
  • Sharding for write scaling

Reliability:

  • Multiple data centers for geographic redundancy
  • Database replication (primary-replica)
  • Circuit breakers for failing services

Monitoring:

  • Request latency percentiles (p50, p95, p99)
  • Error rates
  • Database query times
  • Cache hit rates

Common Patterns Questions

These questions test your knowledge of reusable system design patterns.

How does cache-aside pattern work?

Caching is essential for read-heavy systems. The cache-aside pattern (also called lazy loading) is the most common caching strategy, where the application manages the cache explicitly.

flowchart TB
    REQ["Read Request"]
    CHECK["Check Cache"]
    HIT["Return Data"]
    MISS["Query Database"]
    UPDATE["Update Cache"]
    RETURN["Return Data"]
 
    REQ --> CHECK
    CHECK -->|"Cache Hit"| HIT
    CHECK -->|"Cache Miss"| MISS
    MISS --> UPDATE
    UPDATE --> RETURN

Cache-aside (Lazy Loading):

async function getUser(userId) {
  // `cache` and `db` are assumed pre-configured clients
  // (e.g. a Redis client and a SQL driver)
  // Try cache first
  let user = await cache.get(`user:${userId}`);
 
  if (!user) {
    // Cache miss - load from DB
    user = await db.query('SELECT * FROM users WHERE id = ?', userId);
 
    // Update cache
    await cache.set(`user:${userId}`, user, { ttl: 3600 });
  }
 
  return user;
}

Write-through:

async function updateUser(userId, data) {
  // Update DB
  await db.query('UPDATE users SET ... WHERE id = ?', userId);
 
  // Update cache immediately
  await cache.set(`user:${userId}`, data);
}

How does rate limiting work?

Rate limiting protects your system from abuse and ensures fair resource usage. The token bucket algorithm is one of the most common approaches because it allows controlled bursting.

flowchart TB
    REQ["Request"]
    LIMITER["Rate Limiter<br/>(Token Bucket)"]
    ALLOWED["Process Request"]
    REJECTED["Return 429"]
 
    REQ --> LIMITER
    LIMITER -->|"Allowed"| ALLOWED
    LIMITER -->|"Rejected"| REJECTED

Token Bucket Algorithm:

class RateLimiter {
  constructor(capacity, refillRate) {
    this.capacity = capacity;      // Max tokens
    this.tokens = capacity;        // Current tokens
    this.refillRate = refillRate;  // Tokens per second
    this.lastRefill = Date.now();
  }
 
  allowRequest() {
    this.refill();
 
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
 
    return false;
  }
 
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

When should you use message queues?

Message queues decouple services and handle asynchronous processing. They're essential for building resilient systems that can handle traffic spikes and service failures gracefully.

flowchart LR
    PRODUCER["Producer"] --> QUEUE["Message Queue<br/>(Kafka/SQS)"]
    QUEUE --> CONSUMER["Consumer"]

Use cases:

  • Decoupling services (post service → notification service)
  • Handling traffic spikes (queue absorbs burst)
  • Async processing (image resizing, email sending)
  • Event sourcing (log all changes for replay)
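The decoupling idea can be shown with a toy in-memory queue. Real systems would use Kafka or SQS, whose client APIs differ, so treat this purely as an illustration of the pattern:

```javascript
// Minimal in-memory FIFO queue illustrating producer/consumer decoupling.
// The producer never calls the consumer directly: the queue absorbs
// bursts, and the consumer drains at its own pace.
class MessageQueue {
  constructor() { this.messages = []; }
  publish(msg) { this.messages.push(msg); }
  consume() { return this.messages.shift(); }
  get size() { return this.messages.length; }
}

const queue = new MessageQueue();
queue.publish({ type: 'tweet_posted', tweetId: 1 });
queue.publish({ type: 'tweet_posted', tweetId: 2 });
// A consumer (e.g. the notification service) processes later:
const next = queue.consume();
```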

How does database replication work?

Database replication creates copies of your data across multiple servers. This improves read performance (distribute read traffic) and provides fault tolerance (if primary fails, promote a replica).

flowchart TB
    WRITES["Writes"]
    PRIMARY["Primary<br/>Database"]
    R1["Replica 1"]
    R2["Replica 2"]
    READS["Reads"]
 
    WRITES --> PRIMARY
    PRIMARY -->|"Replication"| R1
    PRIMARY -->|"Replication"| R2
    R1 --> READS
    R2 --> READS
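At the application layer, the read/write split in the diagram comes down to a routing decision. A sketch, where `primary` and `replicas` are hypothetical connection handles:

```javascript
// Send writes to the primary; spread reads across replicas.
// Falls back to the primary when no replicas are available.
function pickConnection(isWrite, primary, replicas) {
  if (isWrite || replicas.length === 0) return primary;
  return replicas[Math.floor(Math.random() * replicas.length)];
}
```

One caveat worth mentioning in an interview: replication is usually asynchronous, so a read from a replica may briefly return stale data (replication lag).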

Classic Problems Questions

These questions test your ability to apply patterns to common interview problems.

How do you design a URL shortener?

URL shorteners are a classic interview problem because they're simple enough to design in 45 minutes but have interesting scaling challenges around ID generation and redirection performance.

Requirements:

  • Shorten long URLs
  • Redirect short URLs
  • Custom short codes (optional)
  • Analytics (optional)

Scale:

  • 100M URLs created per month
  • 10:1 read:write ratio
  • 7-character codes: 62^7 ≈ 3.5 trillion combinations

Design:

POST /shorten
  body: { long_url, custom_code? }
  returns: { short_url }

GET /{short_code}
  returns: 302 redirect to long_url

Key decisions:

  • ID generation: Counter + base62 encode, or random generation with collision check
  • Database: NoSQL (Cassandra/DynamoDB) - simple key-value, high write throughput
  • Caching: Hot URLs in Redis
  • Analytics: Async logging to Kafka → Analytics DB
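The counter + base62 option from the first bullet can be sketched as a simple encoder over the 62-character alphabet (digits, lowercase, uppercase):

```javascript
// Encode an auto-incrementing counter ID as a base62 short code.
// 62^7 ≈ 3.5 trillion codes fit in 7 characters or fewer.
const ALPHABET =
  '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function encodeBase62(id) {
  if (id === 0) return ALPHABET[0];
  let code = '';
  while (id > 0) {
    code = ALPHABET[id % 62] + code; // least-significant digit last
    id = Math.floor(id / 62);
  }
  return code;
}
```

The counter itself must be globally unique, which is why a single incrementing value is often replaced by per-server ID ranges or a dedicated ID-generation service at scale.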

How do you design a chat system?

Chat systems require real-time communication, which introduces WebSockets, presence tracking, and delivery guarantees. These are more complex than request-response systems.

Requirements:

  • 1-on-1 and group messaging
  • Online status
  • Message history
  • Push notifications

Key components:

  • WebSocket servers for real-time communication
  • Message queue for delivery guarantee
  • User presence service (heartbeat-based)
  • Push notification service (APNs/FCM)

Message flow:

flowchart TB
    USER_A["User A sends message"]
    WS["WebSocket Server"]
    QUEUE["Message Queue (Kafka)"]
    STORE["Message Store<br/>(Cassandra)"]
    DELIVERY["Delivery Service"]
    ONLINE["User B online<br/>→ WebSocket push"]
    OFFLINE["User B offline<br/>→ Push notification"]
 
    USER_A --> WS
    WS --> QUEUE
    QUEUE --> STORE
    QUEUE --> DELIVERY
    DELIVERY --> ONLINE
    DELIVERY --> OFFLINE

How do you design a distributed rate limiter?

Distributed rate limiting is more complex than single-server rate limiting because you need to coordinate state across multiple servers. Redis provides atomic operations that make this tractable.

Requirements:

  • Limit requests per user/IP
  • Different limits for different APIs
  • Distributed (multiple servers)

Algorithms:

  • Token Bucket: Smooth rate limiting, allows burst
  • Leaky Bucket: Fixed rate output
  • Fixed Window: Simple but edge case at window boundaries
  • Sliding Window: Most accurate, more complex

Distributed implementation:

Using Redis:

count = INCR rate_limit:{user_id}
if count == 1:
  # first request in this window - start the 60s TTL
  EXPIRE rate_limit:{user_id} 60

if count > limit:
  reject request
else:
  allow request

INCR is atomic, so no MULTI/EXEC transaction is needed. Set EXPIRE only on the first increment: calling it on every request would push the window forward on each hit, so under steady traffic the counter would never reset.
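A single-process analogue of the Redis logic, with an in-memory map standing in for Redis keys and TTLs. In production the state must live in Redis (or similar) so that all API servers share the same counters:

```javascript
// Fixed-window rate limiter: in-memory analogue of the Redis
// INCR + EXPIRE pattern shown above.
const WINDOW_MS = 60000; // 60-second window
const LIMIT = 100;       // max requests per window
const windows = new Map(); // userId -> { count, windowStart }

function allowRequest(userId, now = Date.now()) {
  const w = windows.get(userId);
  if (!w || now - w.windowStart >= WINDOW_MS) {
    // First request in a fresh window - like INCR creating the key.
    windows.set(userId, { count: 1, windowStart: now });
    return true;
  }
  w.count += 1;
  return w.count <= LIMIT;
}
```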

Follow-Up Questions

These questions test your ability to handle curveballs and edge cases.

How would you handle a celebrity posting a tweet?

Celebrities have millions of followers, so fan-out on write is too slow. This is a classic follow-up that tests whether you understand the trade-offs of your design.

"Celebrities have millions of followers, so fan-out on write is too slow. I'd use a hybrid approach: for accounts over a threshold (say 10K followers), we don't fan out immediately. Instead, when a user loads their timeline, we fetch their pre-computed feed AND query recent tweets from celebrities they follow, then merge them. This trades off some read latency for much better write latency."

What happens if the database goes down?

This question tests your understanding of fault tolerance and disaster recovery. A production system needs to handle failures gracefully.

"We need to plan for this. First, database replication - a primary with multiple replicas in different availability zones. If the primary fails, we promote a replica. For read traffic, we can route to replicas immediately. For writes, we might need brief downtime during failover.

For critical data, we could use a message queue as a write-ahead log - even if the database is down, we accept writes to the queue and process them when the database recovers."

How do you ensure consistency in a distributed system?

This question tests your understanding of CAP theorem and consistency models. The answer depends on the use case—different data has different consistency requirements.

"It depends on the use case. For financial transactions, we need strong consistency - I'd use a distributed transaction or the Saga pattern with compensating transactions.

For social media features like like counts, eventual consistency is fine - the count might be slightly off for a few seconds, but it will converge. We can use async replication and accept temporary inconsistencies.

The CAP theorem tells us we have to choose. During a network partition, do we want availability (serve potentially stale data) or consistency (reject requests)? Most social applications choose availability."


Quick Reference

Component Selection Guide

  • Load Balancer: Multiple servers, high availability
  • CDN: Static assets, global users
  • Cache (Redis): Read-heavy, acceptable staleness
  • Message Queue: Async processing, decoupling
  • Database Sharding: Single DB can't handle write load
  • Read Replicas: Single DB can't handle read load
  • NoSQL: Simple queries, high scale, flexible schema
  • SQL: Complex queries, transactions, relationships

Key Concepts Summary

  • CAP Theorem: During a network partition, choose consistency or availability
  • Horizontal scaling: Add more machines (preferred, scales further)
  • Vertical scaling: Bigger machine (hits hard limits)
  • Fan-out on write: Pre-compute, fast reads, slow writes
  • Fan-out on read: Compute on demand, fast writes, slow reads

Practice Questions

Test yourself before your interview:

1. Design a parking lot system. What are the key components and how do you handle multiple entry/exit points?

2. Design Instagram. How would you handle image storage and delivery at scale?

3. You're designing a notification system. How do you ensure notifications are delivered even if the user's device is offline?

4. Design a web crawler. How do you avoid crawling the same page twice?


Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides