Monitoring & Observability Interview Guide: Metrics, Logs, and Traces


Observability is how you understand what's happening in production. When something breaks at 3 AM, your observability stack determines whether you fix it in minutes or spend hours guessing.

Interviewers test observability knowledge because it separates engineers who've operated real systems from those who've only built them. You can write perfect code, but if you can't debug it in production, you're not ready for senior roles.

This guide covers the three pillars of observability, the tools you'll be asked about, and the interview questions that test whether you actually understand monitoring or just know the buzzwords.


Monitoring vs Observability

These terms are often used interchangeably, but they're different.

Monitoring: Predefined Questions

Monitoring answers questions you've already thought of:

  • Is CPU above 80%?
  • Is the service responding?
  • Are there more than 10 errors per minute?

You set up dashboards and alerts for known failure modes. Monitoring tells you that something is wrong.

Observability: Arbitrary Questions

Observability lets you ask questions you haven't thought of yet:

  • Why is this specific user's request slow?
  • What changed between yesterday and today?
  • Which service is causing the cascade failure?

You can explore your system's state without deploying new code. Observability tells you why something is wrong.

The Key Difference

Monitoring: "Is the system healthy?" (yes/no)
Observability: "Why did user X's request fail at 2:34 PM?" (investigation)

A well-monitored system has good dashboards. An observable system lets you debug novel problems without adding instrumentation first.


The Three Pillars of Observability

Observability is built on three complementary data types. Each answers different questions.

Pillar  | What It Is                       | What It Answers
--------|----------------------------------|----------------------------------
Metrics | Numerical measurements over time | What's happening? How much?
Logs    | Discrete event records           | Why did it happen? What exactly?
Traces  | Request flow across services     | Where did it happen? Which path?

How They Work Together

A typical debugging flow:

  1. Metrics alert: Error rate spiked to 5%
  2. Logs investigate: Errors show "database connection timeout"
  3. Traces pinpoint: Requests to /api/orders are slow at the inventory service

No single pillar is sufficient. Metrics are cheap but lack detail. Logs are detailed but hard to aggregate. Traces show flow but only for sampled requests.


Metrics: Prometheus & Grafana

Metrics are numerical measurements collected over time. They're the foundation of monitoring because they're cheap to store and fast to query.

Metric Types

Understanding metric types is essential for interviews:

Type      | Description                        | Example
----------|------------------------------------|---------------------------------------------
Counter   | Only increases (resets on restart) | Total requests, errors, bytes sent
Gauge     | Can go up or down                  | Temperature, queue size, active connections
Histogram | Samples in configurable buckets    | Request latency distribution
Summary   | Calculates quantiles client-side   | Request latency percentiles

Interview trap: "When would you use a histogram vs a summary?"

Histograms are aggregatable across instances (you can calculate percentiles server-side). Summaries calculate percentiles client-side and can't be aggregated. Use histograms for most cases.
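
To make the types concrete, here is a minimal instrumentation sketch using the Python prometheus_client library; the metric names, labels, and buckets are illustrative choices, not a prescribed scheme:

from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random
import time

# Counter: only goes up; rate() in PromQL turns it into requests per second
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])

# Gauge: a point-in-time value that can rise and fall
ACTIVE_CONNECTIONS = Gauge("active_connections", "Currently open connections")

# Histogram: observations are counted into buckets, so percentiles can be
# computed server-side with histogram_quantile()
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

def handle_request():
    ACTIVE_CONNECTIONS.inc()
    with REQUEST_LATENCY.time():       # records the elapsed time as an observation
        time.sleep(random.uniform(0.01, 0.3))   # simulated work
    REQUESTS.labels(status="200").inc()
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)            # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()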

PromQL Basics

PromQL, the Prometheus query language, has become a de facto standard that many other monitoring systems also support. Know these patterns:

Rate of increase (for counters):

rate(http_requests_total[5m])

Current value (for gauges):

node_memory_MemAvailable_bytes

Percentiles (for histograms):

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Filtering and aggregation:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

The RED Method

For request-driven services (APIs, microservices), monitor:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency distribution (p50, p95, p99)

# Rate
sum(rate(http_requests_total[5m]))
 
# Errors
sum(rate(http_requests_total{status=~"5.."}[5m]))
 
# Duration (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

The USE Method

For infrastructure resources (CPU, memory, disk, network), monitor:

  • Utilization: Percentage of resource capacity used
  • Saturation: Queue depth when resource is fully utilized
  • Errors: Error events for the resource

Resource | Utilization     | Saturation       | Errors
---------|-----------------|------------------|-------------------
CPU      | CPU usage %     | Run queue length | -
Memory   | Memory used %   | Swap usage       | OOM events
Disk     | Disk usage %    | I/O queue depth  | Read/write errors
Network  | Bandwidth usage | Dropped packets  | Interface errors

Grafana Dashboards

Grafana visualizes metrics from Prometheus and other sources. Interview topics include:

  • Dashboard design: Group related metrics, use consistent colors
  • Variables: Make dashboards reusable across environments
  • Alerting: Grafana can alert based on query thresholds
  • Annotations: Mark deployments and incidents on graphs

Best practice: Build each service dashboard around the four golden signals: latency, traffic, errors, and saturation.


Logging: ELK Stack & Structured Logging

Logs record discrete events. They provide the detail metrics lack but are expensive to store and query at scale.

Structured vs Unstructured Logs

Unstructured (hard to parse):

2026-01-07 10:15:32 ERROR Failed to process order #12345 for user john@example.com

Structured (queryable):

{
  "timestamp": "2026-01-07T10:15:32Z",
  "level": "error",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error": "payment_declined"
}

Structured logs let you query: "Show me all errors for order_id 12345" or "Count errors by error type."
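
A custom JSON formatter on Python's standard logging module is one dependency-free way to produce logs like the example above; the field names here simply mirror that example and are a convention, not a fixed schema:

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via extra={...} land on the record and become queryable keys
        for key in ("order_id", "user_email", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to process order",
    extra={"order_id": "12345", "user_email": "john@example.com", "error": "payment_declined"},
)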

Log Levels

Use levels consistently:

Level | When to Use
------|--------------------------------------------------------
DEBUG | Detailed diagnostic info, disabled in production
INFO  | Normal operations worth recording (startup, requests)
WARN  | Something unexpected but handled (retry succeeded)
ERROR | Operation failed, needs attention
FATAL | Application cannot continue

Interview tip: "What's the difference between WARN and ERROR?"

WARN: Something unusual happened, but the system handled it. ERROR: Something failed and needs investigation or action.

ELK Stack Architecture

The ELK stack is a common logging solution:

Application → Logstash → Elasticsearch → Kibana
              (collect)    (store/index)   (visualize)

Elasticsearch: Distributed search and analytics engine. Stores logs in indices, enables full-text search and aggregations.

Logstash: Data processing pipeline. Collects logs from multiple sources, parses and transforms them, sends to Elasticsearch.

Kibana: Visualization layer. Dashboards, log exploration, alerting.

Modern alternative: Many teams replace Logstash with Filebeat (lighter) or use cloud solutions (Datadog, Splunk).

Log Aggregation Patterns

Centralized logging: All services send logs to a central system

  • Pros: Single place to search, correlation across services
  • Cons: Network dependency, potential bottleneck

Sidecar pattern: Log collector runs alongside each service

  • Used in Kubernetes (Fluentd/Fluent Bit sidecars)
  • Handles log rotation, buffering, retries

Correlation IDs

A correlation ID (or trace ID) links related logs across services:

{"correlation_id": "abc-123", "service": "api", "message": "Received order request"}
{"correlation_id": "abc-123", "service": "inventory", "message": "Checking stock"}
{"correlation_id": "abc-123", "service": "payment", "message": "Processing payment"}

Pass the correlation ID in HTTP headers (e.g., X-Correlation-ID) and include it in every log entry. This is the bridge between logging and tracing.
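
A minimal sketch of that flow in Python, assuming the X-Correlation-ID header convention above; contextvars keeps the ID available to every log call on the request path, and the downstream URL is purely hypothetical:

import contextvars
import json
import uuid

# Holds the correlation ID for the current request, including across async tasks
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def handle_incoming_request(headers: dict):
    # Reuse the caller's ID if present, otherwise start a new one
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    log("api", "Received order request")
    call_downstream("http://inventory.internal/stock")   # hypothetical downstream service

def call_downstream(url: str):
    # Forward the ID so the next service logs under the same correlation_id
    outgoing_headers = {"X-Correlation-ID": correlation_id.get()}
    log("api", f"Calling {url} with {outgoing_headers}")
    # e.g. requests.get(url, headers=outgoing_headers)

def log(service: str, message: str):
    print(json.dumps({
        "correlation_id": correlation_id.get(),
        "service": service,
        "message": message,
    }))

handle_incoming_request({})   # no inbound header, so a new ID is generated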


Distributed Tracing: Jaeger & OpenTelemetry

Tracing shows how requests flow through distributed systems. Essential for microservices where a single request might touch dozens of services.

Why Tracing Matters

In a monolith, a stack trace shows you where things went wrong. In microservices, a request might:

  1. Hit the API gateway
  2. Call the auth service
  3. Call the user service
  4. Call the order service
  5. Call the inventory service
  6. Call the payment service

If the request is slow, which service is the bottleneck? Logs from each service don't show the full picture. Tracing does.

Trace Anatomy

  • Trace: The complete journey of a request
  • Span: A single operation within a trace (one service call)
  • Context propagation: Passing trace IDs between services

Trace: order-request-abc123
├── Span: api-gateway (220ms)
│   ├── Span: auth-service (5ms)
│   └── Span: order-service (200ms)
│       ├── Span: inventory-check (50ms)
│       └── Span: payment-process (140ms)  ← bottleneck

Each span includes:

  • Start time and duration
  • Service name and operation
  • Tags (metadata)
  • Logs (events within the span)
  • Parent span ID

OpenTelemetry

OpenTelemetry (OTel) is the industry standard for observability instrumentation. It provides:

  • APIs: For manual instrumentation
  • SDKs: Language-specific implementations
  • Auto-instrumentation: Automatic tracing for common frameworks
  • Collector: Receives, processes, exports telemetry data

App (OTel SDK) → OTel Collector → Jaeger/Zipkin/Datadog

Interview tip: OpenTelemetry merged OpenTracing and OpenCensus. If asked about either, mention this consolidation.
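
A rough sketch of manual instrumentation with the OpenTelemetry Python SDK; it exports spans to the console to stay self-contained, where a real setup would export to an OTel Collector, and the span and attribute names are only illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK: console exporter here, OTLP to a Collector in production
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

def process_order(order_id: str):
    # Parent span for the whole operation
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)        # tag on the span
        with tracer.start_as_current_span("inventory-check"):
            pass   # call the inventory service; context propagates automatically
        with tracer.start_as_current_span("payment-process"):
            pass   # call the payment service

process_order("12345")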

Sampling Strategies

Tracing every request is expensive. Sampling reduces volume while maintaining visibility:

Strategy      | Description                    | Use Case
--------------|--------------------------------|---------------------------
Head-based    | Decide at request start        | Simple, consistent
Tail-based    | Decide after request completes | Keep errors/slow requests
Rate limiting | Fixed samples per second       | Predictable cost
Probabilistic | Random percentage              | Simple to configure

Tail-based sampling is powerful: sample 100% of errors and slow requests, 1% of successful fast requests. You keep the interesting data.
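
The decision rule itself is simple. Here is a hypothetical sketch of a tail-based keep/drop decision (in practice this logic lives in the collector's configuration, not in application code):

import random

def keep_trace(spans: list, slow_threshold_ms: float = 500,
               baseline_rate: float = 0.01) -> bool:
    """Decide, after the whole trace has completed, whether to keep it."""
    has_error = any(span.get("error") for span in spans)
    # The root span covers the full request, so the longest span approximates
    # end-to-end latency
    duration_ms = max((span.get("duration_ms", 0) for span in spans), default=0)

    if has_error or duration_ms > slow_threshold_ms:
        return True                             # keep 100% of errors and slow traces
    return random.random() < baseline_rate      # keep ~1% of fast, successful traces

# A fast, successful trace is usually dropped
print(keep_trace([{"duration_ms": 40}, {"duration_ms": 15}]))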

Jaeger

Jaeger is a popular open-source tracing backend:

  • Stores and indexes traces
  • Provides UI for trace visualization
  • Supports multiple storage backends (Elasticsearch, Cassandra)
  • Integrates with Kubernetes and service meshes

Key features to mention:

  • Service dependency graph
  • Trace comparison
  • Performance optimization analysis


Alerting & Incident Response

Observability data is useless if no one sees it when things break. Alerting bridges the gap between data and action.

Alert Design Principles

Alert on symptoms, not causes:

  • Bad: "CPU above 80%"
  • Good: "Error rate above 1%"

CPU might be high because you're doing useful work. Error rate means users are affected.

Every alert must be actionable:

  • If you can't do anything about it, don't page someone
  • Include runbook links in alert descriptions

Use appropriate severity:

  • Critical: User-facing impact, immediate action needed
  • Warning: Degradation, investigate soon
  • Info: Worth knowing, no action needed

Avoiding Alert Fatigue

Alert fatigue is when teams ignore alerts because there are too many. It's dangerous—real issues get missed.

Prevention strategies:

  1. Review alerts regularly: Delete alerts nobody acts on
  2. Set proper thresholds: Include hysteresis (alert at 90%, clear at 80%; see the sketch after this list)
  3. Group related alerts: Don't send 50 alerts for one incident
  4. Distinguish pages from notifications: Not everything needs to wake someone up
  5. Track alert metrics: Mean time to acknowledge, false positive rate
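
Hysteresis is easier to see in code than to describe; a toy sketch of an alert state that fires at one threshold and clears at a lower one, so it doesn't flap around a single boundary:

class HysteresisAlert:
    """Fires at or above fire_at, clears only once the value drops below clear_at."""

    def __init__(self, fire_at: float, clear_at: float):
        assert clear_at < fire_at
        self.fire_at = fire_at
        self.clear_at = clear_at
        self.firing = False

    def observe(self, value: float) -> bool:
        if not self.firing and value >= self.fire_at:
            self.firing = True
        elif self.firing and value < self.clear_at:
            self.firing = False
        return self.firing

alert = HysteresisAlert(fire_at=90, clear_at=80)
for value in (85, 91, 88, 85, 79):
    print(value, alert.observe(value))   # stays firing through 88 and 85, clears at 79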

SLIs, SLOs, and SLAs

These terms are frequently confused in interviews:

Term | Definition                                 | Example
-----|--------------------------------------------|---------------------------
SLI  | Service Level Indicator - a metric         | 99.2% of requests succeed
SLO  | Service Level Objective - internal target  | 99.9% success rate
SLA  | Service Level Agreement - contractual      | 99.5% uptime or credits

Relationship:

  • SLIs measure actual performance
  • SLOs set targets (with error budget)
  • SLAs are promises to customers (usually looser than SLOs)

Error budget: If your SLO is 99.9%, you have 0.1% error budget. Use it for deployments and experiments. When it's exhausted, focus on reliability.
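
Error budgets are just arithmetic; a quick illustration of how much downtime, or how many failed requests, a 99.9% SLO allows:

def error_budget(slo: float, total: float) -> float:
    """Allowed 'bad' units (minutes, requests, ...) for a given SLO."""
    return (1 - slo) * total

minutes_per_month = 30 * 24 * 60                           # 43,200 minutes in a 30-day month
print(round(error_budget(0.999, minutes_per_month), 1))    # 43.2 minutes of downtime
print(round(error_budget(0.999, 10_000_000)))              # 10,000 failed requests out of 10M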

On-Call Best Practices

Interview questions often cover on-call practices:

  • Rotation: Regular schedules, fair distribution
  • Escalation: Clear paths when primary doesn't respond
  • Handoffs: Document ongoing issues between shifts
  • Runbooks: Step-by-step guides for common alerts
  • Blameless postmortems: Learn from incidents without finger-pointing


Common Interview Questions

Conceptual Questions

Q: Explain the three pillars of observability.

A: Metrics (numerical time-series data for dashboards and alerts), logs (detailed event records for debugging), and traces (request flow visualization for distributed systems). Each serves a different purpose: metrics show what is happening, logs explain why, traces show where in distributed systems.

Q: What's the difference between monitoring and observability?

A: Monitoring answers predefined questions through dashboards and alerts. Observability lets you investigate novel problems without deploying new code. An observable system has sufficient instrumentation to debug issues you haven't anticipated.

Q: When would you use the RED method vs the USE method?

A: RED (Rate, Errors, Duration) for request-driven services like APIs—it measures user-facing behavior. USE (Utilization, Saturation, Errors) for infrastructure resources like CPU, memory, disk—it measures resource health.

Scenario Questions

Q: Your API's p99 latency suddenly increased from 200ms to 2s. How would you investigate?

A:

  1. Check if it's all endpoints or specific ones (metrics)
  2. Look for recent deployments or config changes (annotations)
  3. Check downstream service latencies (traces)
  4. Look for resource saturation—CPU, memory, connections (USE metrics)
  5. Check for database slow queries (logs)
  6. Verify external dependencies (third-party API latency)

Q: How would you design alerting for a new microservice?

A:

  1. Start with RED metrics—alert on error rate and latency SLOs
  2. Add resource alerts only if they correlate with user impact
  3. Include runbook links in alert descriptions
  4. Set up PagerDuty/OpsGenie integration with escalation
  5. Review and tune thresholds after collecting baseline data

Q: You're getting 1000 alerts per day and the team is ignoring them. How do you fix this?

A:

  1. Audit alerts—delete ones nobody acts on
  2. Consolidate related alerts (alert grouping)
  3. Separate pages (immediate action) from notifications (FYI)
  4. Add hysteresis to prevent flapping
  5. Track metrics: acknowledge rate, time to resolution
  6. Establish alert review process (weekly pruning)

Tool-Specific Questions

Q: Explain Prometheus metric types and when to use each.

A: Counter (only increases—requests, errors), Gauge (can go up/down—temperature, queue size), Histogram (distribution in buckets—latency), Summary (client-side percentiles—legacy use). Use histograms over summaries because they're aggregatable.

Q: What is context propagation in distributed tracing?

A: Passing trace and span IDs between services so spans can be correlated into a complete trace. Usually done via HTTP headers (W3C Trace Context standard) or gRPC metadata. Without propagation, you get isolated spans instead of connected traces.

Q: How does the ELK stack handle high log volume?

A: Elasticsearch scales horizontally with sharding. Index lifecycle management moves old logs to cheaper storage or deletes them. Ingest pipelines (Logstash/Beats) buffer and batch writes. Sampling can reduce volume for high-traffic services.


Quick Reference

Three Pillars:

  • Metrics: What's happening (Prometheus, Grafana)
  • Logs: Why it happened (ELK, structured logging)
  • Traces: Where in the system (Jaeger, OpenTelemetry)

RED Method (for services):

  • Rate, Errors, Duration

USE Method (for resources):

  • Utilization, Saturation, Errors

Metric Types:

  • Counter (increases), Gauge (up/down), Histogram (distribution), Summary (percentiles)

SLI/SLO/SLA:

  • SLI measures, SLO targets, SLA promises

Alert Best Practices:

  • Symptoms not causes
  • Actionable with runbooks
  • Regular review and pruning


What's Next?

Observability is a skill that separates operators from developers. Anyone can write code that works in development. Understanding how to debug it in production—at scale, under pressure, at 3 AM—is what makes you valuable.

Start with the three pillars. Learn Prometheus and Grafana for metrics. Understand structured logging. Get familiar with distributed tracing concepts. Then practice: set up observability for a side project and intentionally break things to practice debugging.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.
