15+ System Design Interview Questions 2025: Scalability, Caching & Load Balancing

·14 min read

system-design · interview-questions · architecture · scalability · backend

Twitter handles 500 million tweets per day. Netflix streams to 230 million subscribers simultaneously. Google processes 8.5 billion searches daily. How do these systems actually work? More importantly—can you explain how to build one in 45 minutes? System design interviews at FAANG companies have a 60% fail rate, often because candidates dive into details without a structured approach.

Table of Contents

  1. Framework Questions
  2. Requirements Gathering Questions
  3. Scale Estimation Questions
  4. High-Level Design Questions
  5. Deep Dive Questions
  6. Common Patterns Questions
  7. Classic Problems Questions
  8. Follow-Up Questions
  9. Quick Reference

Framework Questions

These questions test your ability to approach system design systematically.

How should you structure a system design interview?

System design interviews are conversations, not coding tests. Your goal is to demonstrate structured thinking, make reasonable trade-offs, and communicate clearly. A proven 6-step framework allocates 45 minutes effectively.

The framework breaks down as follows:

  • Step 1: Clarify Requirements (5 min) - Ask about functional and non-functional requirements
  • Step 2: Estimate Scale (5 min) - Back-of-envelope calculations for storage, bandwidth, QPS
  • Step 3: Define API and Data Model (5 min) - Core endpoints and database schema
  • Step 4: High-Level Design (10 min) - Draw main components and data flow
  • Step 5: Deep Dive (15 min) - Detail 2-3 specific components with trade-offs
  • Step 6: Address Bottlenecks (5 min) - Scaling, reliability, and monitoring

Requirements Gathering Questions

These questions test whether you understand requirements before designing.

What questions should you ask before starting a design?

Never start designing without asking questions. The first 5 minutes should be spent understanding what you're building and the constraints you're working within.

"Before I dive in, I'd like to understand the requirements better..."

Functional requirements - What should the system do?

  • Core features only (don't over-scope)
  • User actions and flows
  • Input/output expectations

Non-functional requirements - How well should it do it?

  • Scale: How many users? Requests per second?
  • Performance: Acceptable latency?
  • Availability: What uptime is required?
  • Consistency: Can data be eventually consistent or must it be strong?

Constraints:

  • Are we building from scratch or integrating with existing systems?
  • Any technology preferences or restrictions?
  • Budget or team size considerations?

Scale Estimation Questions

These questions test your ability to think about real-world numbers.

How do you estimate scale for a system like Twitter?

Back-of-envelope calculations show you think about real-world constraints. These estimates inform your architecture decisions—a read-heavy system needs different optimization than a write-heavy one.

Example: Designing Twitter

Users: 500M monthly active users
Daily active: 200M (40%)
Tweets per day: 500M (avg 2.5 per active user)
Reads per day: 200M users × 100 tweets viewed = 20B reads

Tweets per second: 500M / 86400 ≈ 6000 TPS (write)
Reads per second: 20B / 86400 ≈ 230,000 QPS (read)

Storage per tweet: 280 chars + metadata ≈ 500 bytes
Daily storage: 500M × 500 bytes = 250GB
Yearly storage: 250GB × 365 = ~90TB (just text, media is much more)

Key insight: This is a read-heavy system (230K reads vs 6K writes). Design accordingly—optimize for reads with caching and pre-computation.
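As a sanity check, the arithmetic above can be scripted. The inputs are the same assumptions used in the estimate; only the variable names are new:

```javascript
// Back-of-envelope estimates for a Twitter-scale system.
// All inputs are the assumptions stated in the text above.
const DAU = 200e6;                // daily active users
const tweetsPerDay = 500e6;
const tweetsViewedPerUser = 100;
const bytesPerTweet = 500;        // 280 chars + metadata
const SECONDS_PER_DAY = 86400;

const writeTps = tweetsPerDay / SECONDS_PER_DAY;             // ≈ 5,800 (≈ 6K)
const readsPerDay = DAU * tweetsViewedPerUser;               // 20B
const readQps = readsPerDay / SECONDS_PER_DAY;               // ≈ 231,000 (≈ 230K)
const dailyStorageGB = (tweetsPerDay * bytesPerTweet) / 1e9; // 250 GB
const yearlyStorageTB = (dailyStorageGB * 365) / 1e3;        // ≈ 91 TB

console.log({ writeTps: Math.round(writeTps), readQps: Math.round(readQps),
              dailyStorageGB, yearlyStorageTB });
```

Rounding to one significant figure (6K writes, 230K reads) is deliberate: the inputs are guesses, so more precision would be false precision.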


High-Level Design Questions

These questions test your ability to design system architecture.

How do you design a high-level architecture for Twitter?

Start by drawing the main components and explaining the data flow. Walk the interviewer through how a request travels through your system, explaining each component's purpose.

flowchart TB
    CDN["CDN<br/>(static assets)"]
    CLIENT["Client"]
    LB["Load Balancer"]
    API["API Servers"]
    CACHE["Cache<br/>(Redis)"]
    TWEET_SVC["Tweet Service"]
    TIMELINE_SVC["Timeline Svc"]
    TWEET_DB["Tweet DB<br/>(Sharded)"]
    TIMELINE_CACHE["Timeline Cache<br/>(Redis)"]
 
    CDN --> CLIENT
    CLIENT --> LB
    LB --> API
    API --> CACHE
    API --> TWEET_SVC
    API --> TIMELINE_SVC
    TWEET_SVC --> TWEET_DB
    TIMELINE_SVC --> TIMELINE_CACHE

Walk through the flow:

"When a user posts a tweet: request hits the load balancer, goes to an API server, which writes to the Tweet database. Then we need to update timelines - this is where it gets interesting.

For reading timelines, we want to avoid expensive database queries, so we pre-compute timelines and store them in Redis. When you open the app, we just read from cache.

The challenge is: when should we update these cached timelines?"

How do you define APIs and data models?

APIs and data models should be defined early to clarify what the system does and how data is structured. Keep APIs RESTful and data models normalized but practical.

API Design:

POST /tweets
  body: { text, media_ids }
  returns: { tweet_id, created_at }

GET /timeline
  params: ?cursor=xxx&limit=20
  returns: { tweets: [...], next_cursor }

GET /users/{id}/tweets
  returns: { tweets: [...] }

POST /follow/{user_id}
DELETE /follow/{user_id}

Data Model:

User
  - id (PK)
  - username
  - email
  - created_at

Tweet
  - id (PK)
  - user_id (FK)
  - text
  - created_at
  - media_urls

Follow
  - follower_id (PK, FK)
  - followee_id (PK, FK)
  - created_at

Deep Dive Questions

These questions test your ability to go deep on specific components.

What is the difference between fan-out on write and fan-out on read?

This is the classic Twitter timeline design problem. The choice between push and pull models has significant implications for write latency, read latency, and storage requirements.

Fan-out on Write (Push model):

When user posts tweet:
1. Write tweet to DB
2. Get all followers (could be millions)
3. Push tweet to each follower's timeline cache

Pros: Fast reads - timeline is pre-computed
Cons: Slow writes for users with many followers (celebrities)
      High storage - tweet duplicated N times

Fan-out on Read (Pull model):

When user reads timeline:
1. Get list of who they follow
2. Fetch recent tweets from each
3. Merge and sort

Pros: Fast writes - just store the tweet once
Cons: Slow reads - must query multiple users
      High compute at read time

Hybrid approach (what Twitter actually uses):

- Regular users: Fan-out on write
- Celebrities (>10K followers): Fan-out on read

When building timeline:
1. Read pre-computed timeline (regular users' tweets)
2. Merge with celebrity tweets fetched on-demand
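The merge step above can be sketched as a pure function. `precomputed` and `celebrityTweets` stand in for data fetched from the timeline cache and the tweet store; the names are illustrative, not from any real Twitter API:

```javascript
// Hybrid timeline read: merge the pre-computed feed (regular users,
// fan-out on write) with tweets fetched on demand for followed
// celebrities (fan-out on read). Tweets carry a `createdAt` timestamp.
function buildTimeline(precomputed, celebrityTweets, limit = 20) {
  return [...precomputed, ...celebrityTweets]
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, limit);
}

const feed = buildTimeline(
  [{ id: 1, createdAt: 100 }, { id: 2, createdAt: 300 }],
  [{ id: 3, createdAt: 200 }],
  2
);
// feed holds tweets 2 and 3, newest first
```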

How do you shard a database?

With 500M tweets per day, a single database won't scale. Sharding horizontally partitions data across multiple databases, but choosing the right shard key is critical.

"Shard key options:

  • User ID: All tweets from a user on same shard. Good for user profile pages, bad for timeline (must query all shards)
  • Tweet ID: Even distribution. Good for single tweet lookups, bad for user's tweet list
  • Time-based: Recent data on 'hot' shards. Good for recent queries, needs re-sharding over time

I'd recommend user_id sharding with a separate index for tweet_id lookups."
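A minimal sketch of hash-based routing on user_id. The modulo scheme shown is the simplest option; production systems usually prefer consistent hashing, because under modulo, adding a shard reshuffles most keys:

```javascript
// Route a user's data to a shard by hashing user_id.
// FNV-1a-style string hash, for illustration only.
const NUM_SHARDS = 16;

function shardFor(userId) {
  let hash = 2166136261;
  for (const ch of String(userId)) {
    hash ^= ch.charCodeAt(0);
    hash = Math.imul(hash, 16777619) >>> 0; // keep it a 32-bit unsigned int
  }
  return hash % NUM_SHARDS;
}
```

The key property is determinism: the same user_id always maps to the same shard, so every API server agrees on where a user's tweets live.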

How do you address bottlenecks and ensure reliability?

The final step is identifying potential bottlenecks and explaining how to address them. This shows you think about production-ready systems.

Scaling:

  • Horizontal scaling of API servers behind load balancer
  • Database read replicas for read-heavy workload
  • Sharding for write scaling

Reliability:

  • Multiple data centers for geographic redundancy
  • Database replication (primary-replica)
  • Circuit breakers for failing services

Monitoring:

  • Request latency percentiles (p50, p95, p99)
  • Error rates
  • Database query times
  • Cache hit rates

Common Patterns Questions

These questions test your knowledge of reusable system design patterns.

How does cache-aside pattern work?

Caching is essential for read-heavy systems. The cache-aside pattern (also called lazy loading) is the most common caching strategy, where the application manages the cache explicitly.

flowchart TB
    REQ["Read Request"]
    CHECK["Check Cache"]
    HIT["Return Data"]
    MISS["Query Database"]
    UPDATE["Update Cache"]
    RETURN["Return Data"]
 
    REQ --> CHECK
    CHECK -->|"Cache Hit"| HIT
    CHECK -->|"Cache Miss"| MISS
    MISS --> UPDATE
    UPDATE --> RETURN

Cache-aside (Lazy Loading):

async function getUser(userId) {
  // `cache` and `db` are assumed pre-configured clients
  // (e.g. a Redis client and a SQL driver)
  // Try cache first
  let user = await cache.get(`user:${userId}`);
 
  if (!user) {
    // Cache miss - load from DB
    user = await db.query('SELECT * FROM users WHERE id = ?', userId);
 
    // Update cache
    await cache.set(`user:${userId}`, user, { ttl: 3600 });
  }
 
  return user;
}

Write-through:

async function updateUser(userId, data) {
  // Update DB
  await db.query('UPDATE users SET ... WHERE id = ?', userId);
 
  // Update cache immediately
  await cache.set(`user:${userId}`, data);
}

How does rate limiting work?

Rate limiting protects your system from abuse and ensures fair resource usage. The token bucket algorithm is one of the most common approaches because it allows controlled bursting.

flowchart TB
    REQ["Request"]
    LIMITER["Rate Limiter<br/>(Token Bucket)"]
    ALLOWED["Process Request"]
    REJECTED["Return 429"]
 
    REQ --> LIMITER
    LIMITER -->|"Allowed"| ALLOWED
    LIMITER -->|"Rejected"| REJECTED

Token Bucket Algorithm:

class RateLimiter {
  constructor(capacity, refillRate) {
    this.capacity = capacity;      // Max tokens
    this.tokens = capacity;        // Current tokens
    this.refillRate = refillRate;  // Tokens per second
    this.lastRefill = Date.now();
  }
 
  allowRequest() {
    this.refill();
 
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
 
    return false;
  }
 
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

When should you use message queues?

Message queues decouple services and handle asynchronous processing. They're essential for building resilient systems that can handle traffic spikes and service failures gracefully.

flowchart LR
    PRODUCER["Producer"] --> QUEUE["Message Queue<br/>(Kafka/SQS)"]
    QUEUE --> CONSUMER["Consumer"]

Use cases:

  • Decoupling services (post service → notification service)
  • Handling traffic spikes (queue absorbs burst)
  • Async processing (image resizing, email sending)
  • Event sourcing (log all changes for replay)
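The decoupling idea can be shown with a toy in-memory queue. Real systems would use Kafka or SQS, whose client APIs differ, so treat this purely as an illustration of the pattern:

```javascript
// Minimal in-memory FIFO queue illustrating producer/consumer decoupling.
// The producer never calls the consumer directly: the queue absorbs
// bursts, and the consumer drains at its own pace.
class MessageQueue {
  constructor() { this.messages = []; }
  publish(msg) { this.messages.push(msg); }
  consume() { return this.messages.shift(); }
  get size() { return this.messages.length; }
}

const queue = new MessageQueue();
queue.publish({ type: 'tweet_posted', tweetId: 1 });
queue.publish({ type: 'tweet_posted', tweetId: 2 });
// A consumer (e.g. the notification service) processes later:
const next = queue.consume();
```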

How does database replication work?

Database replication creates copies of your data across multiple servers. This improves read performance (distribute read traffic) and provides fault tolerance (if primary fails, promote a replica).

flowchart TB
    WRITES["Writes"]
    PRIMARY["Primary<br/>Database"]
    R1["Replica 1"]
    R2["Replica 2"]
    READS["Reads"]
 
    WRITES --> PRIMARY
    PRIMARY -->|"Replication"| R1
    PRIMARY -->|"Replication"| R2
    R1 --> READS
    R2 --> READS
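At the application layer, the read/write split in the diagram comes down to a routing decision. A sketch, where `primary` and `replicas` are hypothetical connection handles:

```javascript
// Send writes to the primary; spread reads across replicas.
// Falls back to the primary when no replicas are available.
function pickConnection(isWrite, primary, replicas) {
  if (isWrite || replicas.length === 0) return primary;
  return replicas[Math.floor(Math.random() * replicas.length)];
}
```

One caveat worth mentioning in an interview: replication is usually asynchronous, so a read from a replica may briefly return stale data (replication lag).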

Classic Problems Questions

These questions test your ability to apply patterns to common interview problems.

How do you design a URL shortener?

URL shorteners are a classic interview problem because they're simple enough to design in 45 minutes but have interesting scaling challenges around ID generation and redirection performance.

Requirements:

  • Shorten long URLs
  • Redirect short URLs
  • Custom short codes (optional)
  • Analytics (optional)

Scale:

  • 100M URLs created per month
  • 10:1 read:write ratio
  • 7-character codes: 62^7 ≈ 3.5 trillion combinations

Design:

POST /shorten
  body: { long_url, custom_code? }
  returns: { short_url }

GET /{short_code}
  returns: 302 redirect to long_url

Key decisions:

  • ID generation: Counter + base62 encode, or random generation with collision check
  • Database: NoSQL (Cassandra/DynamoDB) - simple key-value, high write throughput
  • Caching: Hot URLs in Redis
  • Analytics: Async logging to Kafka → Analytics DB
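The counter + base62 option from the first bullet can be sketched as a simple encoder over the 62-character alphabet (digits, lowercase, uppercase):

```javascript
// Encode an auto-incrementing counter ID as a base62 short code.
// 62^7 ≈ 3.5 trillion codes fit in 7 characters or fewer.
const ALPHABET =
  '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function encodeBase62(id) {
  if (id === 0) return ALPHABET[0];
  let code = '';
  while (id > 0) {
    code = ALPHABET[id % 62] + code; // least-significant digit last
    id = Math.floor(id / 62);
  }
  return code;
}
```

The counter itself must be globally unique, which is why a single incrementing value is often replaced by per-server ID ranges or a dedicated ID-generation service at scale.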

How do you design a chat system?

Chat systems require real-time communication, which introduces WebSockets, presence tracking, and delivery guarantees. These are more complex than request-response systems.

Requirements:

  • 1-on-1 and group messaging
  • Online status
  • Message history
  • Push notifications

Key components:

  • WebSocket servers for real-time communication
  • Message queue for delivery guarantee
  • User presence service (heartbeat-based)
  • Push notification service (APNs/FCM)

Message flow:

flowchart TB
    USER_A["User A sends message"]
    WS["WebSocket Server"]
    QUEUE["Message Queue (Kafka)"]
    STORE["Message Store<br/>(Cassandra)"]
    DELIVERY["Delivery Service"]
    ONLINE["User B online<br/>→ WebSocket push"]
    OFFLINE["User B offline<br/>→ Push notification"]
 
    USER_A --> WS
    WS --> QUEUE
    QUEUE --> STORE
    QUEUE --> DELIVERY
    DELIVERY --> ONLINE
    DELIVERY --> OFFLINE

How do you design a distributed rate limiter?

Distributed rate limiting is more complex than single-server rate limiting because you need to coordinate state across multiple servers. Redis provides atomic operations that make this tractable.

Requirements:

  • Limit requests per user/IP
  • Different limits for different APIs
  • Distributed (multiple servers)

Algorithms:

  • Token Bucket: Smooth rate limiting, allows burst
  • Leaky Bucket: Fixed rate output
  • Fixed Window: Simple but edge case at window boundaries
  • Sliding Window: Most accurate, more complex

Distributed implementation:

Using Redis:

count = INCR rate_limit:{user_id}
if count == 1:
  # first request in this window - start the 60s TTL
  EXPIRE rate_limit:{user_id} 60

if count > limit:
  reject request
else:
  allow request

INCR is atomic, so no MULTI/EXEC transaction is needed. Set EXPIRE only on the first increment: calling it on every request would push the window forward on each hit, so under steady traffic the counter would never reset.
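A single-process analogue of the Redis logic, with an in-memory map standing in for Redis keys and TTLs. In production the state must live in Redis (or similar) so that all API servers share the same counters:

```javascript
// Fixed-window rate limiter: in-memory analogue of the Redis
// INCR + EXPIRE pattern shown above.
const WINDOW_MS = 60000; // 60-second window
const LIMIT = 100;       // max requests per window
const windows = new Map(); // userId -> { count, windowStart }

function allowRequest(userId, now = Date.now()) {
  const w = windows.get(userId);
  if (!w || now - w.windowStart >= WINDOW_MS) {
    // First request in a fresh window - like INCR creating the key.
    windows.set(userId, { count: 1, windowStart: now });
    return true;
  }
  w.count += 1;
  return w.count <= LIMIT;
}
```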

Follow-Up Questions

These questions test your ability to handle curveballs and edge cases.

How would you handle a celebrity posting a tweet?

Celebrities have millions of followers, so fan-out on write is too slow. This is a classic follow-up that tests whether you understand the trade-offs of your design.

"Celebrities have millions of followers, so fan-out on write is too slow. I'd use a hybrid approach: for accounts over a threshold (say 10K followers), we don't fan out immediately. Instead, when a user loads their timeline, we fetch their pre-computed feed AND query recent tweets from celebrities they follow, then merge them. This trades off some read latency for much better write latency."

What happens if the database goes down?

This question tests your understanding of fault tolerance and disaster recovery. A production system needs to handle failures gracefully.

"We need to plan for this. First, database replication - a primary with multiple replicas in different availability zones. If the primary fails, we promote a replica. For read traffic, we can route to replicas immediately. For writes, we might need brief downtime during failover.

For critical data, we could use a message queue as a write-ahead log - even if the database is down, we accept writes to the queue and process them when the database recovers."

How do you ensure consistency in a distributed system?

This question tests your understanding of CAP theorem and consistency models. The answer depends on the use case—different data has different consistency requirements.

"It depends on the use case. For financial transactions, we need strong consistency - I'd use a distributed transaction or the Saga pattern with compensating transactions.

For social media features like like counts, eventual consistency is fine - the count might be slightly off for a few seconds, but it will converge. We can use async replication and accept temporary inconsistencies.

The CAP theorem tells us we have to choose. During a network partition, do we want availability (serve potentially stale data) or consistency (reject requests)? Most social applications choose availability."


Quick Reference

Component Selection Guide

  • Load Balancer: Multiple servers, high availability
  • CDN: Static assets, global users
  • Cache (Redis): Read-heavy, acceptable staleness
  • Message Queue: Async processing, decoupling
  • Database Sharding: Single DB can't handle write load
  • Read Replicas: Single DB can't handle read load
  • NoSQL: Simple queries, high scale, flexible schema
  • SQL: Complex queries, transactions, relationships

Key Concepts Summary

  • CAP Theorem: During a network partition, choose consistency or availability
  • Horizontal scaling: Add more machines (preferred, scales further)
  • Vertical scaling: Bigger machine (hits hard limits)
  • Fan-out on write: Pre-compute, fast reads, slow writes
  • Fan-out on read: Compute on demand, fast writes, slow reads

Practice Questions

Test yourself before your interview:

1. Design a parking lot system. What are the key components and how do you handle multiple entry/exit points?

2. Design Instagram. How would you handle image storage and delivery at scale?

3. You're designing a notification system. How do you ensure notifications are delivered even if the user's device is offline?

4. Design a web crawler. How do you avoid crawling the same page twice?


Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides