Phin Walton

Networking Is a Black Box, So We Used eBPF to Open It

Networking is the black box of application development. You get an IP address, you expect it to work, and when it doesn't, you have no idea where to start.

I'm Phin, a network engineer at Railway. I just shipped Network Flows, a feature that lets you see, in real time, how traffic moves between your services. Pipes on a canvas (editor's note: like Factorio) with throughput visualized, and detailed logs just a click away. No tcpdump, no guesswork. In this post, I'll cover why we built this feature, how we implemented it, and what I think the future holds for networking products.

Developers treat networking like a magical force. They get an IP address and they expect it to work.

The problem starts when something actually goes wrong: a misconfigured firewall, a networking layer you don't understand, or an experimental kernel that silently drops your packets. With traditional tooling, it's extremely hard for a developer to know what's going on.

High latency. Timeouts. The app appears slow. It looks like an application bug. And that's where most people start debugging, because that's the only layer they know.

You see this with ORMs and applications all the time.

Things manifest as "connection: cannot connect to database." But why can't it connect? Was the port wrong? Was the connection refused? Prisma doesn't tell you. When a syscall like connect() fails, the Linux kernel actually gives you a real error: connection refused, connection reset, address not available, or host unreachable (EHOSTUNREACH) when an ICMP message from the network says the peer can't be reached. But most frameworks and ORMs just throw that information away.
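
To make that concrete, here's a small, hypothetical Go sketch (not tied to any particular framework) showing that the errno from connect() is still sitting in the error chain and can be surfaced instead of being flattened into a generic message:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func main() {
	// Almost certainly nothing is listening on this local port, so the
	// kernel's connect() fails quickly with ECONNREFUSED.
	_, err := net.DialTimeout("tcp", "127.0.0.1:5999", 2*time.Second)
	if err == nil {
		return
	}

	// The errno from connect() is still inside the wrapped error chain.
	var errno syscall.Errno
	if errors.As(err, &errno) {
		switch errno {
		case syscall.ECONNREFUSED:
			fmt.Println("host reachable, but nothing listening:", errno)
		case syscall.EHOSTUNREACH:
			fmt.Println("no route to host (often signaled via ICMP):", errno)
		case syscall.ETIMEDOUT:
			fmt.Println("connection timed out:", errno)
		default:
			fmt.Println("kernel error:", errno)
		}
		return
	}
	fmt.Println("error without an errno in the chain:", err)
}
```

That one switch statement is the difference between "cannot connect to database" and an error message that actually points at the problem.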

Networking is foreign to a lot of developers. And I think it’s only going to get more foreign. People are vibe coding, the layers of abstraction keep getting higher, and the gap between what developers understand and what's actually happening on the wire is just going to widen. Figuring out that your problem is even a networking problem is going to take more and more effort.

Consider what it usually takes to work out a network issue. Say you're on AWS and you want to understand how networking works between your services. There's not much you get out of the box; you'd have to set up your own observability stack. It's not a primitive they offer.

You'd have to SSH into each box, run tcpdump, and know what to look for. You'd need to figure out which services are communicating with each other, know where to run the capture, and know what to filter for. Most developers don't even know where to start.

So the journey goes something like this: you've exhausted application-level debugging, you realize it might be networking, you google around, you learn about tcpdump, you run it, and you see packets leaving one side but never arriving at the other box. It's tedious, and it assumes a level of systems knowledge that most app developers don't have.

Traditional platforms approach the problem bottom-up. You start from raw packet captures and try to work your way up to understanding. We wanted to flip that. Give you the problem first, then let you dig into the details if you need to.

Then there were the support tickets.

Our support team deals with hundreds of tickets a day, and it was difficult for them to tell where an issue actually originated.

We had so many threads from customers saying "why is your networking broken? Look at this high latency. I can't connect to my database." These all got escalated to the infrastructure team, because support can't spend time diagnosing a customer's application. So the tickets land on our desks, and we're spending hours running tcpdump and doing network diagnosis ourselves.

Almost all of them were application issues, not problems in Railway's networking stack. The underlying network was operating fine: the customer was hitting the wrong port, or had too many open connections to their database, or had something misconfigured in their app.

We needed a way to show users what's actually happening on the network so they could see it themselves. And we needed it to lighten the load on our team, because we were drowning in tickets that weren't ours to fix. (Although we do fix issues when they're our fault; ask me about peering relationships.)

Beyond the support load, I'd wanted for a while to make the canvas more interactive and useful. Customers really liked the variable reference lines, and offering them another view was something I genuinely wanted to do but couldn't until our machines were running certain kernels. Which leads me to my next point…

eBPF is still relatively new. Legacy cloud platforms are running VMs, and the hooks and APIs for this kind of kernel-level tracing didn't exist until recently. Even now, VMs make it harder. We run a combination of containers and VMs, so hooking into packet egress and ingress is more straightforward for us. When we launched the beta, some of our more DevOps-forward customers immediately asked why a public cloud can't do this. The above answers that.

But the bigger reason is the abstraction model. On traditional platforms, you create machines. You don't necessarily create services. You might have a Kubernetes cluster where one node is running hundreds of services. The data is muddy. It's hard to plumb that data and show it in a way that makes sense to a developer.

I figured that Railway already has the canvas: a 2D view where your services are laid out logically and you can organize them however you want. That gave us something to paint on. We could draw connections between services in a way that actually maps to how developers think about their architecture, not how the infrastructure happens to be organized.

I got inspired by UniFi's network observability page, where you can see packets flowing between devices. (Editor’s note: Phin has a solid homelab) I took it further and built pipes that actually represent the traffic between services.

The more throughput between two services, the larger the pipe. The more packets per second, the more dots flowing through the pipe, and the faster they move. Purple is ingress, blue is egress. Eventually I want pipes to turn red when there's a high error rate.

Because you've already arranged your services on the canvas, the pipes just connect them. You see at a glance what's talking to what and how much. If something looks off, you click into a pipe and you're looking at the real logs, actual Linux kernel drop codes, if you want that level of detail.

The whole thing follows what I think of as Railway's approach: on the surface it's simple and you get what you need out of the box, but the deeper information is there if you want it. We just don't shove it in your face, because most people don't need it.

I already mentioned eBPF. The way we make this work is with eBPF programs that hook into container packet egress and ingress. What you get back is a raw socket buffer, and that's not useful on its own.

So we enrich it.

We pull out destination port, source port, source and destination addresses, and we paint the packet with context about what it's actually doing.

These enriched packets get buffered into batches, about 10,000 flows per second on each host. The enrichment engine resolves source and destination addresses, looks up the cgroup, figures out which Railway service sent the packet, and resolves the peer service with internal lookups on the destination address. Then it flushes those batches to a writer service in 10,000-row chunks.
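
To make the shape of that pipeline concrete, here's a simplified Go sketch of what an enrichment step like this could look like. The type names, fields, and the Resolver interface are illustrative stand-ins, not our actual code:

```go
package flows

import (
	"net/netip"
	"time"
)

// RawFlow is roughly what the eBPF hook hands us: addresses, ports, byte and
// packet counts pulled out of the socket buffer, plus the cgroup ID of the
// sending container. (Field names are illustrative.)
type RawFlow struct {
	SrcAddr, DstAddr netip.Addr
	SrcPort, DstPort uint16
	Bytes, Packets   uint64
	CgroupID         uint64
	Timestamp        time.Time
}

// EnrichedFlow is what gets stored: the raw tuple plus the services on each
// end, resolved from platform metadata.
type EnrichedFlow struct {
	RawFlow
	SrcService string
	DstService string
}

// Resolver abstracts the internal lookups: cgroup ID -> service for the
// sender, destination address -> service for the peer.
type Resolver interface {
	ServiceByCgroup(cgroupID uint64) (string, bool)
	ServiceByAddr(addr netip.Addr) (string, bool)
}

// Enrich paints a raw flow with service context. Flows that can't be
// attributed are kept; they just stay unlabeled.
func Enrich(r Resolver, raw RawFlow) EnrichedFlow {
	e := EnrichedFlow{RawFlow: raw}
	if svc, ok := r.ServiceByCgroup(raw.CgroupID); ok {
		e.SrcService = svc
	}
	if svc, ok := r.ServiceByAddr(raw.DstAddr); ok {
		e.DstService = svc
	}
	return e
}
```

The important part is that attribution happens close to the source, while the cgroup and address mappings are cheap to look up, rather than later when all you have is a row in a database.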

The writer service flushes to ClickHouse. About a million rows per second. Every second, across all of Railway, we're writing roughly a million rows of network flow data.
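
If you're curious what a write path like that can look like, here's a rough sketch using the clickhouse-go v2 client over the native protocol. To be clear, this is an assumption-laden illustration: the client choice, table name, columns, and row struct are hypothetical, not our actual schema or code.

```go
package main

import (
	"context"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// flowRow is a hypothetical stand-in for one enriched flow record.
type flowRow struct {
	Ts                     time.Time
	SrcService, DstService string
	SrcPort, DstPort       uint16
	Bytes, Packets         uint64
}

func main() {
	ctx := context.Background()

	// Connect over the native protocol; the address is a placeholder.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"clickhouse:9000"},
	})
	if err != nil {
		panic(err)
	}

	rows := []flowRow{{
		Ts: time.Now(), SrcService: "api", DstService: "postgres",
		SrcPort: 53211, DstPort: 5432, Bytes: 4096, Packets: 12,
	}}

	// One PrepareBatch + Send per chunk: rows are buffered client-side and
	// land in ClickHouse as a single insert, i.e. a single part to merge.
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO network_flows")
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		if err := batch.Append(r.Ts, r.SrcService, r.DstService,
			r.SrcPort, r.DstPort, r.Bytes, r.Packets); err != nil {
			panic(err)
		}
	}
	if err := batch.Send(); err != nil {
		panic(err)
	}
}
```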

It sounds straightforward when you describe it like that, but it wasn't.

Our initial implementation wrote directly from every host machine: every Railway stacker, about a thousand of them, each writing to ClickHouse independently.

We use ClickHouse's ReplacingMergeTree engine, which means every batch you write has to be merged later for efficient storage. With a thousand writers each pushing 10,000 rows per second, ClickHouse was spending so much CPU on merge operations that it degraded the entire cluster.

Alex, one of our engineers, built an ingestion pipeline that sits between the stackers and ClickHouse. Instead of a thousand writers, we now have three. All the stackers write to the ingestion service, which batches everything together and does far fewer, much larger inserts. Same data throughput, dramatically less merge pressure on ClickHouse. So far it's holding up well for the millions of services live on Railway.
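
Here's a minimal sketch of that fan-in pattern, assuming a simple size-or-interval flush. The thresholds, types, and sizes are made up for illustration and are not Railway's actual ingestion service:

```go
package main

import (
	"fmt"
	"time"
)

// batcher fans in row chunks from many stackers and flushes them downstream
// in far fewer, much larger inserts.
type batcher struct {
	in      chan []string // chunks arriving from stackers (rows as placeholders)
	maxRows int           // flush when this many rows are buffered
	maxWait time.Duration // ...or when this much time has passed
	flush   func(rows []string)
}

func (b *batcher) run() {
	buf := make([]string, 0, b.maxRows)
	ticker := time.NewTicker(b.maxWait)
	defer ticker.Stop()

	emit := func() {
		if len(buf) == 0 {
			return
		}
		b.flush(buf)
		buf = make([]string, 0, b.maxRows)
	}

	for {
		select {
		case rows, ok := <-b.in:
			if !ok {
				emit()
				return
			}
			buf = append(buf, rows...)
			if len(buf) >= b.maxRows {
				emit()
			}
		case <-ticker.C:
			// Flush on an interval so quiet periods still land promptly.
			emit()
		}
	}
}

func main() {
	b := &batcher{
		in:      make(chan []string, 1024),
		maxRows: 500_000, // illustrative: many stacker chunks per insert
		maxWait: time.Second,
		flush: func(rows []string) {
			fmt.Printf("inserting %d rows in one ClickHouse batch\n", len(rows))
		},
	}
	go b.run()

	// Simulate a few stackers sending 10,000-row chunks.
	for i := 0; i < 5; i++ {
		b.in <- make([]string, 10_000)
	}
	close(b.in)
	time.Sleep(100 * time.Millisecond) // let the goroutine drain (demo only)
}
```

The merge pressure comes from the number of inserts, not the number of rows, so collapsing a thousand small writers into a handful of large ones is what actually moved the needle.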

The plan is to keep making networking accessible to application developers. That's what Railway is doing broadly: taking lower-layer concepts and abstracting them so developers don't have to think about them, while keeping the data available for the people who do want to go deep. I'm now working on our CDN to make hosting front-ends on Railway better. (Yes, you can host front-ends on Railway.)

That said, even though I built this, I still think application developers should be thinking harder about how they surface errors. The kernel gives you rich information when connections fail, and most of it gets thrown away at the application layer. With LLMs in the mix now, a detailed error is something that can actually be reasoned about; "cannot connect to database" helps nobody.

And from the network engineering side, we have to accept that most application developers do not understand layer four and below. I have to assume that now. So Railway needs to build systems that surface this information in a way that makes sense to someone who thinks in HTTP and SQL, not TCP flags and kernel return codes.

Network Flows is a start. There's a lot more we can do with this data.


Network Flows is available now. Check it out on Railway →