Mahmoud Abdelwahab

Logs, Metrics, and Traces: What Does Each Signal Tell You?

Logs, metrics, and traces are the three core signals of observability. Each captures a different dimension of what's happening inside a running system, and together they give you the visibility you need to understand, debug, and maintain production applications.

A metric fires the alert but names no suspect. A trace narrows the suspect down to one service but can't explain what went wrong inside it. The logs could explain it, but you can't find the right ones fast enough. When the three signals can't point at the same event, a fix that should take 20 minutes can turn into a 3-hour investigation.

This guide explains what each signal captures, where each one falls short on its own, and how the three work together to cut incident response from hours to minutes.

Observability is how well you can diagnose what's going wrong inside your system from the signals (specifically logs, metrics, and traces) it produces.

It differs from traditional threshold-based monitoring, which catches only the failures you predicted in advance. Observability ties logs, metrics, and traces together so that when a failure falls outside every rule you wrote, you can still find the root cause.

Logs capture what happened inside your services by recording individual events with full context. They are timestamped records that an application writes as events occur, and each entry keeps the details of the event it describes.

Written as JSON with consistent field names, logs can be read by machines and searched by attribute. A stable schema makes structured logs easy to validate, parse, line up with traces and metrics, and analyze at scale. A structured log entry looks like the following:

{
	"timestamp": "2024-03-15T13:26:23.505892Z",
	"level": "ERROR",
	"message": "Payment processing failed: timeout after 5000ms",
	"service": "payment-service",
	"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
	"span_id": "00f067aa0ba902b7",
	"order_id": "ord-8821",
	"http_status_code": 504
}

The trace_id and span_id fields link a log entry to the trace it came from, which matters when the three signals work together.
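As a rough illustration of how an application might emit entries in that shape, here is a minimal sketch. The helper names (`buildLogEntry`, `writeLog`) and the `TraceContext` type are hypothetical, and in a real service the trace and span IDs would come from your tracing library rather than being hard-coded.

```typescript
// Hypothetical helpers; field names mirror the sample entry above.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";
type AttributeValue = string | number | boolean;

interface TraceContext {
  trace_id: string;
  span_id: string;
}

interface LogEntry extends TraceContext {
  timestamp: string;
  level: LogLevel;
  message: string;
  service: string;
  [attribute: string]: AttributeValue;
}

// Build one entry with a stable set of base fields plus custom attributes.
// Base fields are written last so a custom attribute can never overwrite them.
function buildLogEntry(
  service: string,
  level: LogLevel,
  message: string,
  ctx: TraceContext,
  attributes: Record<string, AttributeValue> = {},
): LogEntry {
  return {
    ...attributes,
    timestamp: new Date().toISOString(),
    level,
    message,
    service,
    trace_id: ctx.trace_id,
    span_id: ctx.span_id,
  };
}

// One JSON object per line on stdout, so a collector can parse each line.
function writeLog(entry: LogEntry): void {
  console.log(JSON.stringify(entry));
}

const sampleEntry = buildLogEntry(
  "payment-service",
  "ERROR",
  "Payment processing failed: timeout after 5000ms",
  { trace_id: "4bf92f3577b34da6a3ce929d0e0e4736", span_id: "00f067aa0ba902b7" },
  { order_id: "ord-8821", http_status_code: 504 },
);
writeLog(sampleEntry);
```

Keeping the base fields in one function is one way to hold the schema stable: every service that calls it produces the same field names, which is what makes cross-service search by attribute practical.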


Use logs once you know which service is failing, as they show exactly what a specific request or session went through.


The limitation is volume. Without request IDs in log entries, searching logs across dozens of services is slow and manual. Logs from a crashed container also only live on the node for a short time and can be lost unless they're collected and sent to a central store.


Railway automatically captures all logs emitted to standard output or standard error from your applications, so logs from crashed containers aren't lost. Any console.log() statements, error messages, or application output are immediately available for viewing and searching without additional configuration.

You can access logs in four ways:

  • Service logs: drill into a single deployment's build, deployment, and runtime logs
  • Log Explorer: environment-wide search across all services, with filtering by service, level, trace ID, or custom attributes
  • CLI: railway logs streams the latest deployment in real time
  • Railway Remote MCP server: connect your AI coding agent (Claude, Cursor, Codex, GitHub Copilot, and others) to mcp.railway.com over OAuth and invoke the railway-agent tool to investigate logs in natural language

The CLI takes flags to select the log type or output format:

railway logs -d      # deployment logs
railway logs -b      # build logs
railway logs --json  # JSON output

Structured logging is fully supported. You can filter by custom attributes using @attributeName:value, making it easy to find logs related to a specific user ID, transaction, or any metadata you include.

Metrics detect shifts in your system's behavior by rolling many events into a single number you can query. They are numeric measurements of how your system behaves, aggregated over time.

Common examples include CPU, memory, disk, and network usage. Application-level metrics like request latency or error rates require a third-party tool connected via OpenTelemetry or a vendor SDK.

Metrics tell you when something changed, but not why it changed. Use them for alerting, capacity planning, and setting service-level objectives (SLOs) because they reduce many events to a few numeric series you can query and compare against thresholds. A CPU spike shows up as a rising line on a graph, but the number alone can't tell you which code path caused it. A metric alert is the start of an investigation.
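To make "rolling many events into a single number" concrete, the sketch below reduces a batch of request records to a count, two latency percentiles, and an error rate. The `RequestSample` shape, the function names, and the sample data are assumptions for illustration; a metrics backend would do this continuously over time windows, and may use a different percentile method than the nearest-rank one shown here.

```typescript
// Hypothetical per-request record; a real system would stream these.
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Collapse many events into a handful of numbers you can alert on.
function summarize(samples: RequestSample[]) {
  const durations = samples.map((s) => s.durationMs);
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    count: samples.length,
    p50Ms: percentile(durations, 50),
    p99Ms: percentile(durations, 99),
    errorRate: errors / samples.length,
  };
}

// 98 fast, successful requests and 2 slow gateway timeouts.
const windowSamples: RequestSample[] = [
  ...Array.from({ length: 98 }, () => ({ durationMs: 200, statusCode: 200 })),
  { durationMs: 4000, statusCode: 504 },
  { durationMs: 4000, statusCode: 504 },
];

const windowSummary = summarize(windowSamples);
console.log(windowSummary);
```

The summary shows that p99 jumped while p50 stayed flat, but nothing in it identifies which requests were slow or why. That loss of detail is the trade-off the paragraph above describes.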

Railway builds metrics in with per-service graphs for CPU, memory, disk, and network that include deploy markers, so you can see at a glance when a code push lines up with a resource spike. Metrics are available for up to 30 days. For services with multiple replicas, toggle between a combined sum or a per-replica view to isolate which instance is running hot.

Traces pinpoint where a request slowed down or failed across service boundaries. A trace is a record of a request's path through all the services in a system, captured as a timeline of operations.

Each operation along the path is a span, a named step with a start time, duration, and metadata. Spans form a tree through parent and child relationships, with the full tree making up the trace.

Here's a concrete example. A user submits a checkout request, and the trace shows every hop across the API gateway, payment service, inventory service, and notification service, how long each took, and where the chain slowed down or broke.

POST /checkout  (api-gateway)              880ms
├── PaymentService.Charge  (payment-svc)   500ms  ← bottleneck
│   └── db.query: INSERT payments          120ms
├── InventoryService.Reserve (inv-svc)     240ms
│   └── db.query: UPDATE inventory          80ms
└── NotificationService.Send (notif-svc)    40ms

PaymentService.Charge accounts for 500ms of the 880ms total, so the slow step is visible right from the trace, with no need to check dashboards for each downstream service one by one.
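The same reading can be done programmatically. The sketch below models the example trace as a tree of spans and ranks spans by self time, meaning a span's duration minus the time spent in its direct children. The `Span` shape and function names are illustrative, not a tracing backend's API, and the self-time calculation assumes children run one after another, as they appear to in the example.

```typescript
// Illustrative span shape; real spans also carry IDs, start times, and attributes.
interface Span {
  name: string;
  service: string;
  durationMs: number;
  children: Span[];
}

// Time spent in the span itself, excluding its direct children.
// Assumes children run sequentially.
function selfTimeMs(span: Span): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(span.durationMs - childTotal, 0);
}

// Depth-first walk that returns every span in the tree.
function flattenSpans(span: Span): Span[] {
  const out: Span[] = [span];
  for (const child of span.children) out.push(...flattenSpans(child));
  return out;
}

// The span with the largest self time is the first place to look.
function slowestBySelfTime(root: Span): Span {
  return flattenSpans(root).reduce((worst, s) =>
    selfTimeMs(s) > selfTimeMs(worst) ? s : worst,
  );
}

// The checkout trace from the example above.
const checkoutTrace: Span = {
  name: "POST /checkout", service: "api-gateway", durationMs: 880,
  children: [
    {
      name: "PaymentService.Charge", service: "payment-svc", durationMs: 500,
      children: [{ name: "db.query: INSERT payments", service: "payment-svc", durationMs: 120, children: [] }],
    },
    {
      name: "InventoryService.Reserve", service: "inv-svc", durationMs: 240,
      children: [{ name: "db.query: UPDATE inventory", service: "inv-svc", durationMs: 80, children: [] }],
    },
    { name: "NotificationService.Send", service: "notif-svc", durationMs: 40, children: [] },
  ],
};

const slowestSpan = slowestBySelfTime(checkoutTrace);
console.log(slowestSpan.name, selfTimeMs(slowestSpan));
```

Ranking by self time rather than raw duration keeps the root span from always winning, since a parent's duration includes everything beneath it.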

Traces answer the location question: which service in the chain is responsible, and how much of the total latency it owns.

However, traces require setup work. Services pass trace context to each other in the W3C traceparent header, and a service that receives a traceparent header must forward that context on its outgoing requests unless it deliberately starts a new trace. An end-to-end trace exists only if every hop passes the context along. The effort grows in systems written in multiple languages, where each language's SDK has to be configured separately.
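For a sense of what passing context involves, here is a minimal sketch of parsing an incoming traceparent value and building the value for an outgoing request. It handles only the version 00 layout (`version-traceid-parentid-flags`, lowercase hex), and the function names are hypothetical. In practice an OpenTelemetry SDK's propagator does this for you, so treat this as an explanation of the header rather than code to ship.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the caller's span ID
  flags: string;    // 2 hex chars, e.g. "01" when the trace is sampled
}

// Strict match for the version 00 layout only.
const TRACEPARENT_V00 = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // The spec treats version "ff" and all-zero IDs as invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Outgoing header: keep the trace ID and flags, and put this service's
// own span ID in the parent position so the next hop becomes its child.
function buildOutgoingTraceparent(incoming: TraceParent, ownSpanId: string): string {
  return `00-${incoming.traceId}-${ownSpanId}-${incoming.flags}`;
}

const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsedContext = parseTraceparent(incomingHeader);
// ownSpanId is hard-coded here; a real service generates a fresh one per span.
const outgoingHeader = parsedContext
  ? buildOutgoingTraceparent(parsedContext, "b7ad6b7169203331")
  : null;
console.log(outgoingHeader);
```

If a header fails to parse, the usual choice is to start a new trace rather than forward a malformed value.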

Railway doesn't collect traces natively. Instrument your services with OpenTelemetry, point the exporter at your backend of choice (such as Honeycomb, Grafana Tempo, Jaeger, or Datadog), and carry trace_id in your structured logs. Railway's third-party observability guide covers the setup.

Logs, metrics, and traces differ across four dimensions: data structure, granularity, query method, and the question each best answers.

| Dimension | Logs | Metrics | Traces |
| --- | --- | --- | --- |
| Data structure | Timestamped text events (ideally structured JSON) | Numeric time-series with labels | Trees of spans, each with timestamps, duration, and metadata |
| Granularity | Individual events | Many events rolled up into a single number | Full path of one specific request across services |
| Query method | Search by attribute or full text | Roll up over time windows, set thresholds | Follow a trace_id across service boundaries |
| Question each best answers | Why did it happen? | What happened, and at what rate? | Where in the service graph did it happen? |

No single signal gives a complete picture of system health on its own. The three are most useful when an investigation can move from one to the next.

Your checkout endpoint runs on top of several services, and an outside fraud-check vendor starts timing out. Here's how the investigation unfolds:

Your alert fires when p99 checkout latency goes from 200ms to 4 seconds at 14:23 UTC, so you know when the spike started and which top-level endpoint is affected, but you can't tell which of six downstream services is responsible.

You filter traces for the checkout service in the 14:23 to 14:35 window and sort by duration. In the waterfall view of a slow trace, the fraud-check span takes up 3.0 seconds of a 3.4-second request, narrowing the slow step to one service and one call. The trace also gives you its trace_id (abc123...).

You query logs filtered by the trace_id and the fraud-check service, and instead of thousands of log lines from hundreds of requests running at the same time, you get exactly the lines from that one request.

14:23:47.232 DEBUG Calling external fraud API         trace_id=abc123...
14:23:50.187 WARN  External API timeout after 2955ms  trace_id=abc123...
14:23:50.188 ERROR Falling back to synchronous DB lookup trace_id=abc123...

The logs confirm the root cause: the external fraud vendor API is timing out, and the fallback to a synchronous database lookup adds to the delay.
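The log lookup in that last step comes down to a filter on two fields. The sketch below shows the idea over in-memory entries. The `LogLine` shape, the `logsForTrace` helper, and the trace IDs (`trace-a`, `trace-b`) are hypothetical placeholders, and a log platform would run the equivalent query server-side.

```typescript
// Hypothetical parsed log line; field names follow the structured log example.
interface LogLine {
  timestamp: string; // ISO 8601, so string order matches time order
  level: string;
  message: string;
  service: string;
  trace_id: string;
}

// Keep only the lines for one request, optionally narrowed to one service,
// in chronological order.
function logsForTrace(lines: LogLine[], traceId: string, service?: string): LogLine[] {
  return lines
    .filter((l) => l.trace_id === traceId && (service === undefined || l.service === service))
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
}

// Interleaved lines from two concurrent requests (placeholder trace IDs).
const allLines: LogLine[] = [
  { timestamp: "2024-03-15T14:23:50.187Z", level: "WARN", message: "External API timeout after 2955ms", service: "fraud-check", trace_id: "trace-a" },
  { timestamp: "2024-03-15T14:23:47.300Z", level: "INFO", message: "Order accepted", service: "order-service", trace_id: "trace-b" },
  { timestamp: "2024-03-15T14:23:47.232Z", level: "DEBUG", message: "Calling external fraud API", service: "fraud-check", trace_id: "trace-a" },
  { timestamp: "2024-03-15T14:23:50.188Z", level: "ERROR", message: "Falling back to synchronous DB lookup", service: "fraud-check", trace_id: "trace-a" },
];

const requestLines = logsForTrace(allLines, "trace-a", "fraud-check");
for (const line of requestLines) {
  console.log(line.timestamp, line.level, line.message);
}
```

This only works if every service writes the trace_id into its log entries, which is why the structured log schema earlier includes it.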

Across more than 15 million monthly deployments, Railway keeps logs and metrics on a single platform while supporting third-party connections for tracing via OpenTelemetry.

Two tools tie that together in practice: the Observability Dashboard for investigation, and Monitors and Webhooks for alerts and notifications.

The Observability Dashboard gives you a customizable per-environment view of logs, metrics, and project usage in one place. Create widgets for specific services, filter logs by attribute, or track spend across a billing period.

The dashboard is supported by the Railway Agent. It lets you query and interact with your Railway services and telemetry in natural language, and acts as a debugging companion when you're working through a problem. It can surface failed deployment issues on request, but it isn't a background monitor; it acts when you ask it to, not on its own.

Bilt, the nationwide loyalty program for renters, runs that workflow at scale, handling more than 1,500 requests per second at under 50ms on the first of the month. At that volume, jumping from a metric spike to the matching logs inside Railway’s Observability Dashboard, without switching tools, is the difference between a 20-minute fix and a three-hour investigation.

Railway provides two complementary approaches to alerting: monitors for metric-based alerts and webhooks for deployment notifications.

Monitors proactively detect issues by notifying you when resource usage indicates potential problems before users report them. Configure threshold-based alerts directly on dashboard widgets for CPU, memory, disk, or network egress, and Railway notifies you via email, in-app notifications, or webhooks.
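As an illustration of what a threshold-based rule evaluates, here is a small sketch. It is not Railway's implementation. The `MetricSample` shape, the function name, and the requirement that the value stay above the threshold for several consecutive samples are all assumptions, the last one chosen to show one common way of avoiding alerts on a single noisy reading.

```typescript
interface MetricSample {
  timestamp: string;
  value: number; // e.g. CPU utilization in percent
}

// Returns the sample at which the value has stayed above the threshold
// for minConsecutive samples in a row, or null if that never happens.
function firstSustainedBreach(
  samples: MetricSample[],
  threshold: number,
  minConsecutive: number,
): MetricSample | null {
  let run = 0;
  for (const sample of samples) {
    run = sample.value > threshold ? run + 1 : 0;
    if (run >= minConsecutive) return sample;
  }
  return null;
}

// Hypothetical CPU readings, one per minute.
const cpuSamples: MetricSample[] = [
  { timestamp: "14:20", value: 40 },
  { timestamp: "14:21", value: 85 }, // single spike, resets on the next sample
  { timestamp: "14:22", value: 60 },
  { timestamp: "14:23", value: 90 },
  { timestamp: "14:24", value: 92 },
  { timestamp: "14:25", value: 95 }, // third consecutive breach
];

const breach = firstSustainedBreach(cpuSamples, 80, 3);
console.log(breach ? `alert at ${breach.timestamp}` : "no alert");
```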

Webhooks cover deployment state changes and custom events. Railway automatically formats payloads for popular destinations like Discord and Slack, so you can integrate notifications into your existing team communication channels.

Whether you're running a side project or a production SaaS, Railway gives you visibility into your system from day one. Logs are automatically centralized, metrics are collected, and traces can be added via OpenTelemetry as your architecture grows and needs them. With the infrastructure handled, you can focus on the application itself. Get started for free.