Observability is how you understand what's happening in production. When something breaks at 3 AM, your observability stack determines whether you fix it in minutes or spend hours guessing.
Interviewers test observability knowledge because it separates engineers who've operated real systems from those who've only built them. You can write perfect code, but if you can't debug it in production, you're not ready for senior roles.
This guide covers the three pillars of observability, the tools you'll be asked about, and the interview questions that test whether you actually understand monitoring or just know the buzzwords.
Monitoring vs Observability
These terms are often used interchangeably, but they're different.
Monitoring: Predefined Questions
Monitoring answers questions you've already thought of:
- Is CPU above 80%?
- Is the service responding?
- Are there more than 10 errors per minute?
You set up dashboards and alerts for known failure modes. Monitoring tells you that something is wrong.
Observability: Arbitrary Questions
Observability lets you ask questions you haven't thought of yet:
- Why is this specific user's request slow?
- What changed between yesterday and today?
- Which service is causing the cascade failure?
You can explore your system's state without deploying new code. Observability tells you why something is wrong.
The Key Difference
Monitoring: "Is the system healthy?" (yes/no)
Observability: "Why did user X's request fail at 2:34 PM?" (investigation)
A well-monitored system has good dashboards. An observable system lets you debug novel problems without adding instrumentation first.
The Three Pillars of Observability
Observability is built on three complementary data types. Each answers different questions.
| Pillar | What It Is | What It Answers |
|---|---|---|
| Metrics | Numerical measurements over time | What's happening? How much? |
| Logs | Discrete event records | Why did it happen? What exactly? |
| Traces | Request flow across services | Where did it happen? Which path? |
How They Work Together
A typical debugging flow:
- Metrics alert: Error rate spiked to 5%
- Logs investigate: Errors show "database connection timeout"
- Traces pinpoint: Requests to /api/orders are slow at the inventory service
No single pillar is sufficient. Metrics are cheap but lack detail. Logs are detailed but hard to aggregate. Traces show flow but only for sampled requests.
Metrics: Prometheus & Grafana
Metrics are numerical measurements collected over time. They're the foundation of monitoring because they're cheap to store and fast to query.
Metric Types
Understanding metric types is essential for interviews:
| Type | Description | Example |
|---|---|---|
| Counter | Only increases (resets on restart) | Total requests, errors, bytes sent |
| Gauge | Can go up or down | Temperature, queue size, active connections |
| Histogram | Samples in configurable buckets | Request latency distribution |
| Summary | Calculates quantiles client-side | Request latency percentiles |
Interview trap: "When would you use a histogram vs a summary?"
Histograms are aggregatable across instances (you can calculate percentiles server-side). Summaries calculate percentiles client-side and can't be aggregated. Use histograms for most cases.
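To make the four types concrete, here is a minimal Python sketch using the official prometheus_client library; the metric names, labels, and buckets are illustrative choices, not values taken from this guide.

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever increases (the client appends _total, exposing http_requests_total)
REQUESTS = Counter("http_requests", "Total HTTP requests", ["method", "status"])

# Gauge: can go up or down
ACTIVE_CONNECTIONS = Gauge("active_connections", "Currently open connections")

# Histogram: observations land in buckets server-side, so they can be aggregated across instances
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def handle_request():
    ACTIVE_CONNECTIONS.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.3))  # simulate work
        REQUESTS.labels(method="GET", status="200").inc()
    finally:
        LATENCY.observe(time.time() - start)
        ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```

The histogram's buckets are what make the histogram_quantile() queries below possible.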
PromQL Basics
PromQL is Prometheus's query language, and many Prometheus-compatible backends (Thanos, Cortex, VictoriaMetrics) support it as well. Know these patterns:
Rate of increase (for counters):
rate(http_requests_total[5m])
Current value (for gauges):
node_memory_available_bytes
Percentiles (for histograms):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Filtering and aggregation:
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
The RED Method
For request-driven services (APIs, microservices), monitor:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Latency distribution (p50, p95, p99)
# Rate
sum(rate(http_requests_total[5m]))
# Errors
sum(rate(http_requests_total{status=~"5.."}[5m]))
# Duration (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
The USE Method
For infrastructure resources (CPU, memory, disk, network), monitor:
- Utilization: Percentage of resource capacity used
- Saturation: Queue depth when resource is fully utilized
- Errors: Error events for the resource
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU usage % | Run queue length | - |
| Memory | Memory used % | Swap usage | OOM events |
| Disk | Disk usage % | I/O queue depth | Read/write errors |
| Network | Bandwidth usage | Dropped packets | Interface errors |
Grafana Dashboards
Grafana visualizes metrics from Prometheus and other sources. Interview topics include:
- Dashboard design: Group related metrics, use consistent colors
- Variables: Make dashboards reusable across environments
- Alerting: Grafana can alert based on query thresholds
- Annotations: Mark deployments and incidents on graphs
Best practice: Four golden signals per service dashboard—latency, traffic, errors, saturation.
Logging: ELK Stack & Structured Logging
Logs record discrete events. They provide the detail metrics lack but are expensive to store and query at scale.
Structured vs Unstructured Logs
Unstructured (hard to parse):
2026-01-07 10:15:32 ERROR Failed to process order #12345 for user john@example.com
Structured (queryable):
{
"timestamp": "2026-01-07T10:15:32Z",
"level": "error",
"message": "Failed to process order",
"order_id": "12345",
"user_email": "john@example.com",
"error": "payment_declined"
}
Structured logs let you query: "Show me all errors for order_id 12345" or "Count errors by error type."
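As a sketch of how an application emits logs in that shape, here is one way to do it with Python's standard logging module alone; the formatter and field names mirror the example above and are not tied to any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via the `extra=` argument become attributes on the record
        for key in ("order_id", "user_email", "error"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to process order",
    extra={"order_id": "12345", "user_email": "john@example.com", "error": "payment_declined"},
)
```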
Log Levels
Use levels consistently:
| Level | When to Use |
|---|---|
| DEBUG | Detailed diagnostic info, disabled in production |
| INFO | Normal operations worth recording (startup, requests) |
| WARN | Something unexpected but handled (retry succeeded) |
| ERROR | Operation failed, needs attention |
| FATAL | Application cannot continue |
Interview tip: "What's the difference between WARN and ERROR?"
WARN: Something unusual happened, but the system handled it. ERROR: Something failed and needs investigation or action.
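A small sketch of how that distinction tends to look in code; the gateway object, its charge() method, and the exception type are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("payments")

def charge(order, gateway, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return gateway.charge(order)
        except TimeoutError:
            # Unexpected but handled: we will retry, so WARN
            logger.warning("Gateway timeout, retrying", extra={"attempt": attempt})
    # Retries exhausted: the operation failed and needs attention, so ERROR
    logger.error("Payment failed after retries", extra={"order_id": order.id})
    raise RuntimeError("payment failed")
```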
ELK Stack Architecture
The ELK stack is a common logging solution:
Application → Logstash (collect) → Elasticsearch (store/index) → Kibana (visualize)
Elasticsearch: Distributed search and analytics engine. Stores logs in indices, enables full-text search and aggregations.
Logstash: Data processing pipeline. Collects logs from multiple sources, parses and transforms them, sends to Elasticsearch.
Kibana: Visualization layer. Dashboards, log exploration, alerting.
Modern alternative: Many teams replace Logstash with Filebeat (lighter) or use cloud solutions (Datadog, Splunk).
Log Aggregation Patterns
Centralized logging: All services send logs to a central system
- Pros: Single place to search, correlation across services
- Cons: Network dependency, potential bottleneck
Sidecar pattern: Log collector runs alongside each service
- Used in Kubernetes (Fluentd/Fluent Bit sidecars)
- Handles log rotation, buffering, retries
Correlation IDs
A correlation ID (or trace ID) links related logs across services:
{"correlation_id": "abc-123", "service": "api", "message": "Received order request"}
{"correlation_id": "abc-123", "service": "inventory", "message": "Checking stock"}
{"correlation_id": "abc-123", "service": "payment", "message": "Processing payment"}Pass the correlation ID in HTTP headers (e.g., X-Correlation-ID) and include it in every log entry. This is the bridge between logging and tracing.
Distributed Tracing: Jaeger & OpenTelemetry
Tracing shows how requests flow through distributed systems. Essential for microservices where a single request might touch dozens of services.
Why Tracing Matters
In a monolith, a stack trace shows you where things went wrong. In microservices, a request might:
- Hit the API gateway
- Call the auth service
- Call the user service
- Call the order service
- Call the inventory service
- Call the payment service
If the request is slow, which service is the bottleneck? Logs from each service don't show the full picture. Tracing does.
Trace Anatomy
Trace: The complete journey of a request
Span: A single operation within a trace (one service call)
Context propagation: Passing trace IDs between services
Trace: order-request-abc123
├── Span: api-gateway (210ms)
│ ├── Span: auth-service (5ms)
│ └── Span: order-service (200ms)
│ ├── Span: inventory-check (50ms)
│ └── Span: payment-process (140ms) ← bottleneck
Each span includes:
- Start time and duration
- Service name and operation
- Tags (metadata)
- Logs (events within the span)
- Parent span ID
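Each of those fields maps to a call in the OpenTelemetry API. A minimal Python sketch with illustrative span and attribute names (exporter setup is shown in the next section):

```python
from opentelemetry import trace

# With only the API (no SDK configured) these calls are no-ops;
# see the SDK/exporter setup sketch under OpenTelemetry below.
tracer = trace.get_tracer("order-service")

def process_order(order_id: str):
    with tracer.start_as_current_span("order-request") as parent:
        parent.set_attribute("order.id", order_id)  # tag (metadata)

        with tracer.start_as_current_span("inventory-check") as span:
            span.add_event("stock reserved", {"sku": "ABC-1"})  # log/event within the span

        with tracer.start_as_current_span("payment-process") as span:
            span.set_attribute("payment.provider", "example")
            # start/end times, durations, and parent span IDs are recorded automatically

process_order("abc123")
```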
OpenTelemetry
OpenTelemetry (OTel) is the industry standard for observability instrumentation. It provides:
- APIs: For manual instrumentation
- SDKs: Language-specific implementations
- Auto-instrumentation: Automatic tracing for common frameworks
- Collector: Receives, processes, exports telemetry data
App (OTel SDK) → OTel Collector → Jaeger/Zipkin/Datadog
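A minimal sketch of that pipeline from the application side, using the OpenTelemetry Python SDK and the OTLP gRPC exporter; the service name is illustrative and the endpoint assumes a Collector listening on the default localhost:4317.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("startup-check"):
    pass  # spans now flow: app -> OTel Collector -> Jaeger/Zipkin/Datadog
```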
Interview tip: OpenTelemetry merged OpenTracing and OpenCensus. If asked about either, mention this consolidation.
Sampling Strategies
Tracing every request is expensive. Sampling reduces volume while maintaining visibility:
| Strategy | Description | Use Case |
|---|---|---|
| Head-based | Decide at request start | Simple, consistent |
| Tail-based | Decide after request completes | Keep errors/slow requests |
| Rate limiting | Fixed samples per second | Predictable cost |
| Probabilistic | Random percentage | Simple to configure |
Tail-based sampling is powerful: sample 100% of errors and slow requests, 1% of successful fast requests. You keep the interesting data.
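Head-based, probabilistic sampling is typically configured on the SDK's tracer provider; here is a minimal Python sketch with an assumed 1% ratio. Tail-based sampling usually lives in the OpenTelemetry Collector (its tail sampling processor) or the backend, because the keep/drop decision needs the finished trace.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based, probabilistic: keep roughly 1% of traces, decided at the root span.
# ParentBased makes downstream services follow the root's decision so traces stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
provider = TracerProvider(sampler=sampler)
```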
Jaeger
Jaeger is a popular open-source tracing backend:
- Stores and indexes traces
- Provides UI for trace visualization
- Supports multiple storage backends (Elasticsearch, Cassandra)
- Integrates with Kubernetes and service meshes
Key features to mention:
- Service dependency graph
- Trace comparison
- Performance optimization analysis
Alerting & Incident Response
Observability data is useless if no one sees it when things break. Alerting bridges the gap between data and action.
Alert Design Principles
Alert on symptoms, not causes:
- Bad: "CPU above 80%"
- Good: "Error rate above 1%"
CPU might be high because you're doing useful work. Error rate means users are affected.
Every alert must be actionable:
- If you can't do anything about it, don't page someone
- Include runbook links in alert descriptions
Use appropriate severity:
- Critical: User-facing impact, immediate action needed
- Warning: Degradation, investigate soon
- Info: Worth knowing, no action needed
Avoiding Alert Fatigue
Alert fatigue is when teams ignore alerts because there are too many. It's dangerous—real issues get missed.
Prevention strategies:
- Review alerts regularly: Delete alerts nobody acts on
- Set proper thresholds: Include hysteresis (alert at 90%, clear at 80%); see the sketch after this list
- Group related alerts: Don't send 50 alerts for one incident
- Distinguish pages from notifications: Not everything needs to wake someone up
- Track alert metrics: Mean time to acknowledge, false positive rate
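To illustrate the hysteresis point above, a small sketch of the fire/clear logic; the thresholds and the commented-out notify() call are hypothetical.

```python
ALERT_THRESHOLD = 0.90   # fire above 90%
CLEAR_THRESHOLD = 0.80   # clear only once back under 80%

alerting = False

def evaluate(usage: float) -> bool:
    """Return the new alert state, with hysteresis so the alert doesn't flap."""
    global alerting
    if not alerting and usage >= ALERT_THRESHOLD:
        alerting = True    # notify("disk usage high")      -- hypothetical pager call
    elif alerting and usage <= CLEAR_THRESHOLD:
        alerting = False   # notify("disk usage recovered")
    return alerting
```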
SLIs, SLOs, and SLAs
These terms are frequently confused in interviews:
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator - a metric | 99.2% of requests succeed |
| SLO | Service Level Objective - internal target | 99.9% success rate |
| SLA | Service Level Agreement - contractual | 99.5% uptime or credits |
Relationship:
- SLIs measure actual performance
- SLOs set targets (with error budget)
- SLAs are promises to customers (usually looser than SLOs)
Error budget: If your SLO is 99.9%, you have 0.1% error budget. Use it for deployments and experiments. When it's exhausted, focus on reliability.
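The arithmetic is worth being able to do on a whiteboard; a quick sketch assuming a 30-day window:

```python
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60          # 43,200 minutes in a 30-day window

error_budget_fraction = 1 - SLO        # 0.1%
budget_minutes = WINDOW_MINUTES * error_budget_fraction

print(f"Error budget: {budget_minutes:.1f} minutes of downtime per 30 days")
# -> roughly 43.2 minutes; once it's spent, pause risky deployments
```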
On-Call Best Practices
Interview questions often cover on-call practices:
- Rotation: Regular schedules, fair distribution
- Escalation: Clear paths when primary doesn't respond
- Handoffs: Document ongoing issues between shifts
- Runbooks: Step-by-step guides for common alerts
- Blameless postmortems: Learn from incidents without finger-pointing
Common Interview Questions
Conceptual Questions
Q: Explain the three pillars of observability.
A: Metrics (numerical time-series data for dashboards and alerts), logs (detailed event records for debugging), and traces (request flow visualization for distributed systems). Each serves a different purpose: metrics show what is happening, logs explain why, traces show where in distributed systems.
Q: What's the difference between monitoring and observability?
A: Monitoring answers predefined questions through dashboards and alerts. Observability lets you investigate novel problems without deploying new code. An observable system has sufficient instrumentation to debug issues you haven't anticipated.
Q: When would you use the RED method vs the USE method?
A: RED (Rate, Errors, Duration) for request-driven services like APIs—it measures user-facing behavior. USE (Utilization, Saturation, Errors) for infrastructure resources like CPU, memory, disk—it measures resource health.
Scenario Questions
Q: Your API's p99 latency suddenly increased from 200ms to 2s. How would you investigate?
A:
- Check if it's all endpoints or specific ones (metrics)
- Look for recent deployments or config changes (annotations)
- Check downstream service latencies (traces)
- Look for resource saturation—CPU, memory, connections (USE metrics)
- Check for database slow queries (logs)
- Verify external dependencies (third-party API latency)
Q: How would you design alerting for a new microservice?
A:
- Start with RED metrics—alert on error rate and latency SLOs
- Add resource alerts only if they correlate with user impact
- Include runbook links in alert descriptions
- Set up PagerDuty/OpsGenie integration with escalation
- Review and tune thresholds after collecting baseline data
Q: You're getting 1000 alerts per day and the team is ignoring them. How do you fix this?
A:
- Audit alerts—delete ones nobody acts on
- Consolidate related alerts (alert grouping)
- Separate pages (immediate action) from notifications (FYI)
- Add hysteresis to prevent flapping
- Track metrics: acknowledge rate, time to resolution
- Establish alert review process (weekly pruning)
Tool-Specific Questions
Q: Explain Prometheus metric types and when to use each.
A: Counter (only increases—requests, errors), Gauge (can go up/down—temperature, queue size), Histogram (distribution in buckets—latency), Summary (client-side percentiles—legacy use). Use histograms over summaries because they're aggregatable.
Q: What is context propagation in distributed tracing?
A: Passing trace and span IDs between services so spans can be correlated into a complete trace. Usually done via HTTP headers (W3C Trace Context standard) or gRPC metadata. Without propagation, you get isolated spans instead of connected traces.
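A minimal sketch of both sides using OpenTelemetry's propagation helpers; the session object, URL, and span names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("example")

# Client side: inject the current trace context into outbound headers
def call_inventory(session, url):
    headers = {}
    inject(headers)  # adds the W3C `traceparent` header to the dict
    return session.get(url, headers=headers)

# Server side: extract the context so this span joins the caller's trace
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("check-stock", context=ctx):
        ...  # without extract(), this span would start a new, disconnected trace
```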
Q: How does the ELK stack handle high log volume?
A: Elasticsearch scales horizontally with sharding. Index lifecycle management moves old logs to cheaper storage or deletes them. Ingest pipelines (Logstash/Beats) buffer and batch writes. Sampling can reduce volume for high-traffic services.
Quick Reference
Three Pillars:
- Metrics: What's happening (Prometheus, Grafana)
- Logs: Why it happened (ELK, structured logging)
- Traces: Where in the system (Jaeger, OpenTelemetry)
RED Method (for services):
- Rate, Errors, Duration
USE Method (for resources):
- Utilization, Saturation, Errors
Metric Types:
- Counter (increases), Gauge (up/down), Histogram (distribution), Summary (percentiles)
SLI/SLO/SLA:
- SLI measures, SLO targets, SLA promises
Alert Best Practices:
- Symptoms not causes
- Actionable with runbooks
- Regular review and pruning
Related Articles
If you found this helpful, explore our other DevOps guides:
- Complete DevOps Engineer Interview Guide - Full DevOps interview preparation
- Docker Interview Guide - Container fundamentals
- Kubernetes Interview Guide - Container orchestration
- Linux Commands Interview Guide - Essential Linux skills
What's Next?
Observability is a skill that separates operators from developers. Anyone can write code that works in development. Understanding how to debug it in production—at scale, under pressure, at 3 AM—is what makes you valuable.
Start with the three pillars. Learn Prometheus and Grafana for metrics. Understand structured logging. Get familiar with distributed tracing concepts. Then practice: set up observability for a side project and intentionally break things to practice debugging.
