Mahmoud Abdelwahab

Monitoring & Observability: Using Logs, Metrics, Traces, and Alerts to Understand System Failures

When your application ships to production, it becomes partly opaque. You own the code, but the runtime, network, and platform behaviors often fall outside your direct line of sight. That’s where Monitoring and Observability come in.

Monitoring warns you when predefined thresholds break. Observability lets you explore unknowns, asking new questions in real time and getting meaningful answers without redeploying.

For engineers running software in production, observability rests on three pillars: logs, metrics, and traces. Each offers a different lens into system behavior. Understanding where each excels and where it doesn’t is essential for building a practical, scalable visibility strategy.

  • Logs: the detailed narrative and audit trail. Use structured logs, centralize them, tag each request with a correlation or trace ID, and avoid logging sensitive data.
  • Metrics: the fast, aggregated signal. Great for dashboards, trends, SLOs, and real-time alerting. Easy to query, but light on context.
  • Traces: the request's journey across your distributed system. A trace is a tree of spans, each representing a single operation in one service. Ideal for pinpointing bottlenecks and mapping dependencies.
  • Alerts: the early warning system. Alert on user-impacting symptoms aligned to SLOs (Service Level Objectives), route by severity, and attach runbooks to reduce mean time to recover.

Use them together: an alert points to a metric spike, a trace isolates the slow hop, and logs reveal the root cause and exact error payload.

+---------+          +-----------+          +---------+          +------+
|  ALERT  |  --->    |  METRIC   |  --->    |  TRACE  |  --->    | LOG  |
+---------+          +-----------+          +---------+          +------+
     |                     |                     |                 |
     |                     |                     |                 |
     v                     v                     v                 v
  SLO breach → metric spike → bottleneck found → root cause confirmed

| Pillar | What It Captures | Strengths | Limitations | Primary Use |
| --- | --- | --- | --- | --- |
| Logs | Discrete events with full context | Detailed debugging, audits, forensics | Weak for real-time or cross-service insight | Root cause analysis and compliance records |
| Metrics | Aggregated numeric signals over time | Fast detection, trend analysis, SLO tracking | Lacks context and per-user granularity | System health, capacity planning, alerting |
| Traces | Request paths across services | Dependency mapping, latency analysis, bottleneck isolation | Gaps without full instrumentation, limited trend visibility | Distributed performance and latency debugging |
| Alerts | Threshold-based notifications | Proactive incident response, SLO enforcement | Noise, false positives, tuning overhead | On-call operations and early warning signals |

Logs are the most familiar pillar of observability. They're discrete records of events that happened in your system, typically written as text lines with timestamps. When something goes wrong, logs are often your first stop for understanding what happened.

2025-11-10T14:23:47.612Z INFO [auth-service] User login succeeded user_id=4821 ip=192.168.10.45 duration=142ms

Logs excel at providing detailed context when you need to understand exactly what happened during a specific request or transaction. They capture the sequence of events, error messages, stack traces, and any contextual data your application chose to record. This makes them indispensable for debugging: you can trace through the exact execution path that led to a problem.

For compliance and audit requirements, logs are essential. They provide an immutable record of what actions were taken, by whom, and when. This is particularly important for applications handling sensitive data or operating under regulatory frameworks like GDPR. When you need to demonstrate that certain data was accessed or modified, logs provide that proof.

Logs also shine when you need to understand user behavior or business events. If you want to know why a specific user encountered an error, or track a particular transaction through your system, logs are your tool.

Setting up centralized logging is one of the first observability tasks engineers tackle when deploying to production. Your application instances generate logs locally, but you need them aggregated in one place where you can search, filter, and analyze them. This becomes especially important in containerized environments where containers can be ephemeral: if a container crashes, its local logs disappear with it.

Structured logging makes logs even more useful. Instead of free-form text, structured logs use a consistent format (often JSON) that makes them machine-readable, enabling advanced querying and filtering capabilities. You can quickly find all logs related to a specific user ID, transaction ID, or error type, and the structure makes it easier to build dashboards and visualizations.

{
  "timestamp": "2025-11-10T14:23:47.612Z",
  "level": "INFO",
  "service": "auth-service",
  "event": "user_login_succeeded",
  "user_id": 4821,
  "ip": "192.168.10.45",
  "duration_ms": 142
}
  • Stream to stdout and stderr, not local files
  • Attach a correlation or trace ID to every request and include it in each log line
  • Use structured JSON with consistent keys; capture multi-line errors as single JSON lines
  • Sanitize at the source: avoid secrets and PII, mask sensitive fields, and add automated redaction
  • Make CI/CD and deployment logs searchable with application logs to connect releases to behavior
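
To make these practices concrete, here is a minimal structured-logging sketch in TypeScript. It is illustrative only: the field names (service, trace_id, duration_ms) are assumptions chosen to match the example above, and in a real service you would more likely reach for an established logger such as pino or winston.

// Minimal structured logger: one JSON object per line, written to stdout.
// Field names (level, service, trace_id, ...) are illustrative, not a standard.
type Level = "debug" | "info" | "warn" | "error";

interface LogFields {
  [key: string]: string | number | boolean | undefined;
}

function log(level: Level, event: string, fields: LogFields = {}): void {
  const line = {
    timestamp: new Date().toISOString(),
    level,
    service: "auth-service", // hypothetical service name
    event,
    ...fields,
  };
  // Write to stdout, never a local file, so the platform can collect and centralize it.
  process.stdout.write(JSON.stringify(line) + "\n");
}

// Usage: tag every line with the request's correlation/trace ID so logs
// can be stitched back together across services later.
log("info", "user_login_succeeded", {
  trace_id: "9f1c7a32b4f94c89a7e6c2d01b8b1234",
  user_id: 4821,
  duration_ms: 142,
});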

Logs have a major weakness: real-time analysis at scale. Searching millions of log lines for trends or anomalies is slow. Logs are not built to answer time-sensitive questions like "What’s the current error rate?" or "Is latency rising right now?" For that kind of visibility, metrics are the better fit.

Logs also fall short when requests span multiple services. Each service writes its own logs, so reconstructing a full request path means stitching together records from different sources and aligning them by timestamps or IDs. Distributed tracing solves this problem by following a request end to end across every service it touches.

Metrics are numerical measurements collected over time. Unlike logs, which preserve individual events, metrics aggregate data into time-series measurements. Think of metrics as the dashboard of your car: they give you a high-level view of how your system is performing right now.

Example Grafana dashboard

Metrics excel at providing real-time visibility into system health. You can answer questions like "What's the current request rate?" or "What's the 95th percentile response time?" instantly, without searching through logs. This makes metrics ideal for dashboards that give you an at-a-glance view of your system's state.

Metrics are perfect for trend analysis, capacity planning, and alerting. By tracking metrics over time, you can identify patterns and predict future needs.

You can set thresholds on metrics and get notified when values exceed acceptable ranges.

  • System metrics: CPU, memory, disk I/O, network throughput and errors
  • Application metrics: request rate, error rate, latency percentiles, queue depth, cache hit rates
  • Prefer percentiles (p95, p99) over averages to reflect real user experience
  • Avoid unbounded label cardinality; do not label by user ID or raw URL when it explodes series count
  • Define SLIs and SLOs, then alert on burn rate across short and long windows
  • Pair metrics with deploy markers so regressions are visible at the moment of change
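
To see why the percentile guidance above matters, here is a small TypeScript sketch comparing an average with p50/p99 over a window of latency samples. It is a toy: production systems compute percentiles from histograms (as Prometheus does) rather than sorting raw samples.

// A few slow requests can hide behind a healthy-looking average.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

// 95 fast requests at 40 ms and 5 slow requests at 2000 ms:
const latenciesMs = [...Array(95).fill(40), ...Array(5).fill(2000)];

console.log("avg:", latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length); // 138 ms, looks tolerable
console.log("p50:", percentile(latenciesMs, 50)); // 40 ms
console.log("p99:", percentile(latenciesMs, 99)); // 2000 ms, the tail users actually feel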

The aggregation that makes metrics efficient is also their weakness. Metrics lose the detail that logs preserve. If your error rate metric shows a spike, you know something is wrong, but you can't see the individual errors that caused it. You'll need to dive into logs to understand what actually happened.

Metrics also don't help much with debugging specific issues. If a user reports a problem, metrics won't tell you what happened to that particular user's request. You need logs or traces for that level of detail.

Understanding user behavior is another area where metrics fall short. While metrics can tell you how many users are active or what the average session duration is, they can't tell you what a specific user did or why they took a particular action. For that, you need logs or specialized analytics tools.

Distributed tracing, or simply traces, tracks a request as it flows through your distributed system. A trace is a collection of spans: each span represents a single operation within a service. Together, spans form a tree structure showing the complete path a request took through your system.

Trace ID: 9f1c7a32b4f94c89a7e6c2d01b8b1234

gateway:      ───────────────────────────────────────────── 520ms
              │
auth-service:  ──────────────────────────────────────── 380ms
                   │
db:                 ─────────────── 120ms
                           (SELECT * FROM users WHERE id=4821)

Traces are invaluable for understanding request flow in distributed systems. When a request touches multiple services (which is common in microservices architectures, serverless platforms, or any distributed setup), traces show you the complete journey. You can see which services were involved, how long each operation took, and where bottlenecks occurred.

Traces pinpoint which service or operation causes latency when users report slowness. This level of visibility is difficult to achieve with logs alone, especially when requests span multiple services.

Tagging log entries with trace IDs is a common pattern that combines the strengths of logs and traces. When you include a trace ID in your logs, you can start with a trace to see the high-level flow, then use the trace ID to find all related logs for detailed context. Traces also help identify dependencies and understand system architecture by showing which services call which other services, how frequently, and with what latency.
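
Here is a minimal sketch of what that looks like in code, using the OpenTelemetry API for illustration. It assumes an OpenTelemetry SDK and exporter are configured elsewhere (without one, the API falls back to no-op spans), and the trace_id log field is a naming convention, not a requirement.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("auth-service");

async function handleLogin(userId: number): Promise<void> {
  await tracer.startActiveSpan("handle_login", async (span) => {
    span.setAttribute("user.id", userId);
    try {
      await lookupUser(userId); // creates a child span below

      // Tag the log line with this span's trace ID so you can jump
      // from the trace to every related log entry, and back.
      const { traceId } = span.spanContext();
      console.log(JSON.stringify({
        level: "info",
        event: "user_login_succeeded",
        trace_id: traceId,
        user_id: userId,
      }));
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

async function lookupUser(userId: number): Promise<void> {
  // Child span for the database call; a real query would go here.
  await tracer.startActiveSpan("db.select_user", async (span) => {
    span.setAttribute("db.statement", "SELECT * FROM users WHERE id = ?");
    span.end();
  });
}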

  • Head-based sampling: decide at the start of a trace whether to record it; cheap, but can miss rare failures
  • Tail-based sampling: keep traces that exhibit errors or high latency; better for incident analysis, higher cost
  • Environment-aware sampling: higher sampling in staging or canary, lower in steady-state production
  • Propagate context: pass trace and span IDs across services, threads, and async boundaries
  • Instrument dependencies: external APIs, databases, queues, and caches, so spans cover the full path
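
As a sketch of the head-based option above: a probability sampler can derive a deterministic keep-or-drop decision from the trace ID, so every service reaches the same verdict for the same trace. Real SDK samplers (such as OpenTelemetry's ratio-based sampler) implement this idea more carefully; the arithmetic here is only illustrative.

// Head-based sampling sketch: decide once, at the start of the trace,
// using only the trace ID so all services agree on the decision.
function shouldSample(traceId: string, ratio: number): boolean {
  // Treat the low 8 hex digits of the trace ID as a pseudo-random number in [0, 1].
  const bucket = parseInt(traceId.slice(-8), 16) / 0xffffffff;
  return bucket < ratio;
}

// Keep roughly 10% of traces:
console.log(shouldSample("9f1c7a32b4f94c89a7e6c2d01b8b1234", 0.1));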

Traces introduce storage and processing overhead, especially in high-traffic systems. Every request creates a trace with multiple spans, and storing all of this data can be expensive. Many teams sample traces, only storing a percentage of requests, to manage costs while still maintaining visibility.

Traces also add some overhead to your application. The instrumentation required to create traces adds latency, though modern tracing libraries keep this overhead minimal. Still, in extremely latency-sensitive applications, even small overheads matter.

For simple monolithic applications, traces provide less value. If your entire application runs in a single process and you don't have distributed components, logs and metrics might be sufficient. Traces become more valuable as your system becomes more distributed.

Alerts are notifications triggered when specific conditions are met. They're your system's way of telling you that something needs attention, ideally before users notice a problem.

Alerts detect issues proactively by notifying you when metrics breach thresholds or critical services fail. They also catch gradual degradations such as slow memory leaks.

For on-call engineers, well-configured alerts are essential. They reduce the time between when a problem occurs and when someone starts investigating it. Alerts are also crucial for SLA monitoring, helping you track compliance and respond quickly when you're at risk of violating commitments.

  • Alert on user-impacting symptoms aligned to SLOs, not only on low-level causes
  • Use multi-window thresholds to catch fast regressions and slow burns without noise
  • Route by severity and ownership; page for critical issues, create tickets for non-urgent work
  • Attach runbooks that name the likely cause, the first commands to run, and rollback steps
  • Deduplicate and group related alerts to reduce noise during incidents
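
As a sketch of the multi-window idea above: with a 99.9% availability SLO, the error budget is 0.1% of requests, and burn rate is the observed error rate divided by that budget. The 14.4x threshold over paired 5-minute and 1-hour windows below follows a commonly cited fast-burn policy from SRE practice; it is not tied to any particular platform.

// Multi-window burn-rate check for a 99.9% SLO (error budget = 0.001).
// Page only when BOTH the short and long windows burn at >= 14.4x the sustainable rate.
const ERROR_BUDGET = 0.001; // 1 - 0.999

interface Window { errors: number; requests: number; }

function burnRate(w: Window): number {
  if (w.requests === 0) return 0;
  return w.errors / w.requests / ERROR_BUDGET;
}

function shouldPage(shortWindow: Window, longWindow: Window): boolean {
  const threshold = 14.4;
  return burnRate(shortWindow) >= threshold && burnRate(longWindow) >= threshold;
}

// A brief spike that has not been sustained over the long window does not page;
// a spike present in both windows does.
console.log(shouldPage({ errors: 30, requests: 1000 }, { errors: 50, requests: 40000 }));  // false
console.log(shouldPage({ errors: 30, requests: 1000 }, { errors: 900, requests: 40000 })); // true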

The biggest challenge with alerts is alert fatigue. When alerts fire too frequently, especially for non-critical issues, engineers start ignoring them. False positives are another problem: if alerts fire for conditions that aren't actually problems, engineers lose trust in the alerting system. This often happens when thresholds are set too aggressively or when alerts don't account for normal variations in system behavior.

Alerts require proper threshold configuration and are only as good as the data they're based on. Set thresholds too tight, and you'll be overwhelmed with noise. Set them too loose, and real problems will go undetected. Finding the right balance takes time and iteration, and it changes as your system evolves.

Understanding the concepts behind logs, metrics, traces, and alerts is essential, but putting them into practice requires the right tools. Railway provides built-in observability features that address many of the challenges engineers face when deploying to production, integrating all four pillars into a unified platform.

You can try these observability tools in your own environment — deploy a service on Railway and inspect logs.

Railway automatically captures all logs emitted to standard output or standard error from your applications. Any console.log() statements, error messages, or application output are immediately available for viewing and searching without additional configuration.

You can access logs in different ways:

Service logs

Drill into a single deployment's build, deployment and runtime logs

Service-level build, deployment and HTTP logs on Railway

The Log Explorer

The Log Explorer enables environment-wide search across all services. It also supports advanced filtering syntax: search for partial text matches, filter by service or replica, or use structured log attributes like @level:error to find all error-level logs. Railway's environment logs let you query logs from all services simultaneously, addressing the challenge of correlating events across services.

Railway Log Explorer

Structured logging is fully supported. When you emit JSON-formatted logs with fields like level, message, and custom attributes, Railway automatically parses and indexes them. You can filter by custom attributes using @attributeName:value, making it easy to find logs related to a specific user ID, transaction, or any metadata you include.

Railway Log Explorer filtering

Filtering examples:

  • request: find logs that contain the word request
  • "POST /api": find logs that contain the substring POST /api
  • @level:error: filter by error level
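
For example, a log line emitted like this from application code (the order_id field is purely illustrative) would be matched by @level:error, or by @order_id:ord_91x when filtering on the custom attribute:

console.log(JSON.stringify({
  level: "error",
  message: "payment declined",
  order_id: "ord_91x", // custom attribute, queryable as @order_id:ord_91x
}));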

The CLI

You can use the Railway CLI for quick checks from the terminal by running railway logs.

~ railway logs --help
View the most-recent deploy's logs

Usage: railway logs [OPTIONS]

Options:
  -d, --deployment  Show deployment logs
  -b, --build       Show build logs
      --json        Output in JSON format
  -h, --help        Print help
  -V, --version     Print version

If you want to see how structured logs behave in production, deploy a small service on Railway and stream its output.

Railway provides real-time metrics for CPU, memory, disk usage, and network traffic for each service, available directly in the service dashboard with up to 30 days of historical data.

Service-level metrics on Railway, which include CPU/memory utilization, number of requests broken down by status code, and egress

If a service has multiple replicas, you can view metrics as a combined sum or per replica.

Railway's Observability Dashboard brings logs, metrics, and project usage together in a single customizable view. It is scoped per environment, and you can create widgets that display specific metrics, filtered logs, or project spend data.

Railway Observability Dashboard

Railway provides two complementary approaches to alerting: monitors for metric-based alerts and webhooks for deployment notifications.

Monitors allow you to configure email alerts when metrics exceed thresholds for CPU, RAM, disk usage, or network egress. This addresses proactive issue detection: instead of waiting for users to report problems, you're notified when resource usage indicates potential issues. Monitors are configured directly on dashboard widgets.

Set up monitoring on Railway

Webhooks provide a flexible notification mechanism for deployment state changes and custom events. Railway automatically transforms payloads for popular destinations like Discord and Slack, so you can integrate notifications into your existing team communication channels.

Whether you're deploying a side project or running a production SaaS, Railway’s observability features give you full visibility into your system. Logs are centralized automatically, metrics are collected with no setup, and alerts are easy to configure. Request tracing support is coming soon. Railway handles the infrastructure so you can focus on your application, not the tooling. Start a project and see it in action.