Slack Overflow: How We Scaled Slack to Support 1000s of Developers
Railway makes software infrastructure for humans. Our pitch is simple. You give us a Docker image or GitHub repo. We deploy and scale it, no friction.
“/end-pitch”
As of this writing, we’ve crossed 4,000 teams, some of whom are logos like Automattic and Alchemy.
It’s been a massive effort to get there.
When a customer signs up on Railway and hits certain growth thresholds, they now automatically trigger creation of a Slack channel and increased priority in our home-rolled help center software that two-way syncs to Slack. We can then build a solutions relationship on Slack in less time than it takes a Google Cloud Support Engineer to acknowledge your request.
Railway doesn’t use Slack for internal comms but there is nothing else like it when it comes to support and sales. It gives us 50x better engagement than email and 8x better response times to customers.
It’s not uncommon for a customer to ask a few questions on the Slack channel and then hit a 5-figure-a-year deployment — all with minimal involvement from the Success team at Railway.
This is how we’ve gone from conversation underflow, to Slack Overflow.
In a previous blog post, we mentioned that the more support we give a customer, the more money they tend to give us.
However, developers hate email.
When we enter a world full of AI agents, you can bet that we’ll get more automated slop delivered right into our inboxes. Never has it been more important to know there is a real human behind the screen.
Initially, we would make Slack Connect channels artisanally, by hand, after a demo. Customers loved this; it was like a nice hand-written note.
However, after 100+ channels, we became overwhelmed. Especially when the person responsible for sales was also responsible for compliance and also responsible for making channels by hand. (We’re hiring btw.)
In parallel, we were busy in-housing our support tooling. There is a litany of reasons why, but that’s better explained by Ray, who leads Support at Railway. Stay tuned.
The short story is that we were juggling email, Slack, Discord, Linear, forums, and our roadmap software…and had to keep all conversations in context next to any revenue conversations. It was all too much and none of the off-the-shelf solutions were going to work with the level of human touch we needed to supply.
Trust us, we didn’t want to build this ourselves. But we were heavily context smashed.
Time to build.
Help Station is great because compared to our old support tools, a logged-in Railway user can see all their projects, explain the issue, and then fling their logs over to the Support Engineer on the other side.
That context helps Railway Engineers deliver a truly first class support experience.
Help Station’s slick new thread flow
The back-of-napkin requirement was simple: get that same 5-star support experience within Slack.
To do so we would need to:
- Find a way to associate a Slack User to a Railway User
- Create Slack Connect channels
- Have a form that matches the structure of the Help Station Support form
- Sync messages from Slack to Help Station and back
- Allow Railway employees to edit and delete messages
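The first and fourth requirements boil down to keeping a few lookups in sync. A minimal sketch of that bookkeeping, assuming an in-memory store; every name here is hypothetical and not taken from Help Station:

```typescript
// Hypothetical sketch: none of these names come from Help Station itself.
interface BridgeState {
  // Slack user ID -> Railway user ID (the account link)
  railwayUserBySlackUser: Map<string, string>;
  // Slack thread key -> Help Station thread ID, plus the reverse lookup
  helpThreadBySlackThread: Map<string, string>;
  slackThreadByHelpThread: Map<string, string>;
}

function createBridgeState(): BridgeState {
  return {
    railwayUserBySlackUser: new Map(),
    helpThreadBySlackThread: new Map(),
    slackThreadByHelpThread: new Map(),
  };
}

// A Slack thread is identified by its channel plus the ts of its root message.
function threadKey(channelId: string, threadTs: string): string {
  return `${channelId}:${threadTs}`;
}

// Record both directions so replies can flow Slack -> Help Station and back.
function linkThread(
  state: BridgeState,
  channelId: string,
  threadTs: string,
  helpThreadId: string,
): void {
  const key = threadKey(channelId, threadTs);
  state.helpThreadBySlackThread.set(key, helpThreadId);
  state.slackThreadByHelpThread.set(helpThreadId, key);
}
```

A real bridge would persist these mappings in a database rather than in memory, so they survive a restart.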
I had a rough implementation ready within a week.
Using the Slack Bolt SDK in socket mode, I listened for modal, message, and edit events and performed the requisite actions within Help Station’s server for a generated Slack thread.
All done.
This was then the start of a three month long project. You know what they say about assumptions…
When prepping the first version for release, I neglected to properly simulate the external party within a Slack Connect channel in local dev. Apparently, non-workspace members aren’t allowed to call the bot hosted by the parent Slack Connect channel, which means that someone from an external company can’t type “/support” and get the modal. Egg, meet face.
Okay, what about a workaround?
Message interactions can’t trigger a modal for the user, but, despite the org permissions, your bot could reply to any and all messages. We then put the flow behind the “!support” command.
The bot would reply with a button that you could press that would trigger the support form modal for you to submit an issue to us.
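The reason a button helps: clicking one produces an interaction payload that carries a trigger_id, which is what Slack's views.open call needs to show a modal. Here is a rough sketch of the reply side, assuming a Block Kit actions block; the action_id and the copy are invented for illustration:

```typescript
// Hypothetical sketch: the action_id and button copy are made up.
interface ButtonReply {
  thread_ts: string; // reply in the thread of the triggering message
  text: string; // fallback text for notifications
  blocks: Array<Record<string, unknown>>;
}

function buildSupportReply(messageText: string, messageTs: string): ButtonReply | null {
  // Only react when the whole message is the bang command.
  if (messageText.trim().toLowerCase() !== "!support") {
    return null;
  }
  return {
    thread_ts: messageTs,
    text: "Open a support request",
    blocks: [
      {
        type: "actions",
        elements: [
          {
            type: "button",
            text: { type: "plain_text", text: "Open support form" },
            action_id: "open_support_modal",
          },
        ],
      },
    ],
  };
}
```

The returned object has the shape of a chat.postMessage payload minus the channel, which the caller would fill in from the incoming event.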
The first iteration of the Slack bridge required hitting the !support bang tag …
Except people had no idea that this command existed.
… It wasn’t super effective
If it did work, any replies from Railway would be in the form of: “Railway Support (Employee Name)” — it was cold, and anything but human.
Back to the drawing board.
We wanted to maintain the same human warmth of manually making channels for users. Our implementation so far didn’t offer that. So we scrapped the account link requirement and scrapped “!support”.
We figured that customers who had critical questions, either sales or support, would ask us directly. So we took an extreme measure: every message would become a thread by default. To make sure that messages would always come from the team member talking to the customer, we implemented an impersonation flow for Railway employees.
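One way Slack allows this kind of impersonation is through the username and icon_url arguments of chat.postMessage, which require the chat:write.customize scope. Whether Help Station does exactly this is an assumption; the sketch below only shows how the payload could be built, with hypothetical type and function names:

```typescript
// Hypothetical shapes; only the Slack argument names are real.
interface SlackMessageEvent {
  channel: string;
  ts: string;
  thread_ts?: string; // absent on top-level messages
  text: string;
}

interface Employee {
  displayName: string;
  avatarUrl: string;
}

interface PostMessageArgs {
  channel: string;
  thread_ts: string;
  text: string;
  username: string;
  icon_url: string;
}

// A top-level message has no thread_ts, so its own ts becomes the thread root.
// That is what makes "every message is a thread" work.
function resolveThreadTs(event: SlackMessageEvent): string {
  return event.thread_ts ?? event.ts;
}

// Build a reply that shows the employee's own name and avatar.
function buildImpersonatedReply(
  event: SlackMessageEvent,
  employee: Employee,
  text: string,
): PostMessageArgs {
  return {
    channel: event.channel,
    thread_ts: resolveThreadTs(event),
    text,
    username: employee.displayName,
    icon_url: employee.avatarUrl,
  };
}
```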
Impersonation is now available to Railway Support Engineers
Using GPT-4o-mini, we would then take the contents of the customer’s message, summarize the intent, and put it into one of two buckets, sales or support.
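The model call itself is omitted here, but the pieces around it are easy to sketch. This assumes a prompt that asks for a one-word answer and a parser that falls back to support whenever the reply is anything else; both functions are hypothetical:

```typescript
type Bucket = "sales" | "support";

// Hypothetical prompt; the real one is not shown in this post.
function buildClassifierPrompt(message: string): string {
  return [
    "Classify the customer message as exactly one word: sales or support.",
    "Message:",
    message,
  ].join("\n");
}

// Normalise whatever the model returns. Anything unexpected falls back to
// "support" so that a human still sees the message.
function parseBucket(modelReply: string): Bucket {
  const cleaned = modelReply.trim().toLowerCase().replace(/[^a-z]/g, "");
  return cleaned === "sales" ? "sales" : "support";
}
```

Defaulting to one bucket is a design choice: a misrouted sales question costs a handoff, while a dropped message costs a customer.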
Except those buckets began to leak.
Somewhere buried at the start of the Slack Bolt SDK docs, there’s some fine copy that says:
“HTTP is more useful for apps being deployed to hosting environments (like AWS or Heroku) to stably respond within a large corporate Slack workspaces/organization, or apps intended for distribution via the Slack Marketplace.”
After a period of consistent message delivery, the application deployed (to Railway of course) in Socket Mode would stop receiving messages. It was extremely challenging to debug until I found that block of text in the docs.
We then switched to registering HTTP routes on the bot with the Bolt SDK.
Except that the Slack Bolt default server you get is Express, and we use Fastify for our API. My colleague Ray wasn’t happy with the amount of unknown castings I had to do to get those routes to play nice with each other.
To implement this, I revisited the organization of the handlers. Now data flowed cleanly from request to controller with all of the logic neatly contained in the handlers.
// Handler pattern for Railway's Slack bot
export async function startSlackBot() {
  const ctx = createSystemContext("system/slack-bridge");

  slackClient.event("message", async args => {
    await handleMessageEvent(ctx, args);
  });

  slackClient.action("open_details_modal", async args => {
    await handleAddDetailsButton(ctx, args);
  });

  slackClient.view("update_details_modal", async args => {
    await handleUpdateDetailsModalSubmission(ctx, args);
  });

  slackClient.shortcut("announce_message", async args => {
    await handleAnnounceMessageShortcut(ctx, args);
  });

  slackClient.action("send_all", async args => {
    await handleSendAllAnnouncement(ctx, args);
  });

  slackClient.action("send_test", async args => {
    await handleSendTestAnnouncement(ctx, args);
  });
}
Moving from socket mode to API routes was a tiny hurdle compared to the challenge of processing around 100 messages a second at peak.
Assume you are a bad developer like I am and you introduce a bug. Say you cause the server to crash. For the period the server was down, we would have to replay the missed message events to get everything back in sync.
What if we had a message queue that gave us guarantees on execution?
That sounds like a job for Temporal.
Temporal is an open-source workflow orchestration platform. It handles retries, timeouts, and state persistence for Railway, and it was time to bring it to Help Station.
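To make "handles retries" concrete, here is a small sketch of how an exponential backoff schedule falls out of a retry policy. The field names mirror Temporal's retry policy options (initial interval, backoff coefficient, maximum interval, maximum attempts), but the interface, the function, and the values are assumptions for illustration:

```typescript
// Hypothetical shape modelled on Temporal's retry policy fields.
interface RetryPolicySketch {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number;
}

// Returns the wait before each retry. With N attempts there are N - 1 waits.
function retryDelaysMs(policy: RetryPolicySketch): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const raw = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    // Cap the delay so a long outage does not push retries out forever.
    delays.push(Math.min(raw, policy.maximumIntervalMs));
  }
  return delays;
}

const exampleDelays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 5000,
  maximumAttempts: 5,
});
// exampleDelays is [1000, 2000, 4000, 5000]
```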
At a high level, the work needed was simple: just start a workflow within the messageHandler and have Temporal and its pool of Temporal workers do the things we want it to do.
await startWorkflowOnDefaultQueue({
  workflow: processInboundSlackMessageWorkflow,
  options: {
    workflowExecutionTimeout: "5m",
    workflowId: WORKFLOW_ID.processInboundSlackMessage({
      slackChannelId: params.slackChannelId,
      slackThreadTs: params.slackThreadTs ?? params.slackMessageTs,
      slackMessageTs: params.slackMessageTs,
    }),
    args: [params],
    retry: slackMessageRetryPolicy,
  },
});
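The workflowId above is built from the channel, thread, and message timestamps, so the same Slack message always maps to the same ID. Temporal will not start a second workflow with an ID that is still running, which means a redelivered event cannot fan out into duplicate work. A minimal sketch of that idea, with a hypothetical ID format and an in-memory set standing in for Temporal:

```typescript
interface InboundSlackMessage {
  slackChannelId: string;
  slackThreadTs?: string;
  slackMessageTs: string;
}

// Hypothetical stand-in for WORKFLOW_ID.processInboundSlackMessage.
function inboundWorkflowId(message: InboundSlackMessage): string {
  // Same fallback as the snippet above: a top-level message is its own thread.
  const threadTs = message.slackThreadTs ?? message.slackMessageTs;
  return ["slack-inbound", message.slackChannelId, threadTs, message.slackMessageTs].join("/");
}

// In-memory stand-in for the workflow service: refuses a duplicate start.
const runningWorkflowIds = new Set<string>();

function startOnce(workflowId: string): boolean {
  if (runningWorkflowIds.has(workflowId)) {
    return false; // already running, so the redelivered event is dropped
  }
  runningWorkflowIds.add(workflowId);
  return true;
}
```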
Except, it needs to be held very carefully.
The way Temporal works is that Workflows call Activities, which are functions that MUST be stateless, idempotent, and retryable by design. That means you have to be careful about what your function does and how it imports things. If you mess up, you get the most opaque error of your career. (…And I used to have to interact with Java libs from Clojure.)
I was not careful with my imports in a workflow. Learn from my mistakes. Days lost.
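For anyone wondering what "idempotent and retryable" means in practice, here is a toy sketch: the activity is keyed, and a retry with the same key reuses the earlier result instead of repeating the side effect. The names are hypothetical and the in-memory map stands in for a durable store:

```typescript
// Hypothetical in-memory stand-in for a durable record of finished work.
const completedByKey = new Map<string, string>();

// Counts real side effects, purely so the behaviour is observable here.
let sideEffectCount = 0;

function syncMessageActivity(idempotencyKey: string, text: string): string {
  const previous = completedByKey.get(idempotencyKey);
  if (previous !== undefined) {
    return previous; // a retry: hand back the earlier result, do nothing new
  }
  sideEffectCount += 1; // the actual write to Slack or Help Station would go here
  const result = `synced:${text}`;
  completedByKey.set(idempotencyKey, result);
  return result;
}
```

Calling syncMessageActivity twice with the same key performs the side effect once, which is what makes it safe for a workflow engine to retry blindly.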
So after decomposing my massive workflow into atomic activities like this example:
await sendEphemeralMessageActivity({
  channelId: args.slackChannelId,
  userId: args.slackUserId,
  message: `Your message has been added to the recent support thread.`,
});
We were in business.
(Shameless plug — I mean, this is the Railway blog, but deploying Temporal on Railway wasn’t an issue. There’s a template for that.)
Railway hosts most of Railway on (you guessed it) Railway
The current architecture is holding up well for us, processing thousands of messages a day at peak, which isn’t, say… Erlang/Elixir scale, but is respectable when it comes to scaling Node services.
Now, with zero-downtime deploys using Railway Healthchecks, even if I introduce a runtime error (likely), I can sleep easy knowing that we won’t lose any messages.
We think non-core-product experiences deserve first class treatment. I feel like the Slack bridge meets that standard — the integration should feel as seamless as the integration with GitHub… it’s arguably just as important.
Companies who serve developers understand that developers are averse to being sold to. My theory is that the people responsible for GTM usually aren’t engineers and will use terms that trip a developer’s Bad Salesperson alarm. This goes for disclosure forms and the like.
Heck, even I don’t want to talk to anyone when I use a product.
However, for revenue teams, a user base that doesn’t want to talk to you is difficult to work with, even if you have nothing to sell to them. For Railway, when a developer deploys on Railway, we have little to no understanding of that user’s commercial intent. (Barring fraud.)
This is by design.
We purposely get completely out of the way. We strongly believe that developers should deploy what they want without a demo. This presents a challenge for a company with revenue goals. If everyone can just simply use the product without having to talk to anyone, what’s the value of a traditional sales team?
The answer we came to in the last blog post: there is none.
Instead, we built a data pipeline of significant customer events that signal expansion or contraction in an account, telling the GTM team when to jump in to offer help or be on guard for a question.
It’s “pre-cog” sales.
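As an illustration only, a signal like that could be as simple as comparing spend between two periods. The fields, the 25% threshold, and the function below are all hypothetical and not a description of Railway's actual pipeline:

```typescript
type AccountSignal = "expansion" | "contraction" | "steady";

// Hypothetical snapshot of one team's usage across two billing periods.
interface UsageSnapshot {
  teamId: string;
  previousMonthlySpend: number;
  currentMonthlySpend: number;
}

// Flag an account when spend moves by more than the threshold (a fraction).
function classifyAccount(snapshot: UsageSnapshot, threshold: number = 0.25): AccountSignal {
  if (snapshot.previousMonthlySpend <= 0) {
    // No baseline to compare against: any spend at all counts as growth.
    return snapshot.currentMonthlySpend > 0 ? "expansion" : "steady";
  }
  const change =
    (snapshot.currentMonthlySpend - snapshot.previousMonthlySpend) /
    snapshot.previousMonthlySpend;
  if (change >= threshold) return "expansion";
  if (change <= -threshold) return "contraction";
  return "steady";
}
```

An "expansion" result is the kind of event that could page the GTM team or trigger a Slack channel invite.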
However, over the last year, we also learned that it was our responsibility to increase the surface area for serendipitous conversation. The Slack bridge we built is the other half: it facilitates those conversations once we have high confidence that it’s time to build a relationship with our customers.
For other companies reading this, we highly recommend that you rally your customer communications around Slack. (Or any platform where your customers are.)
Railway Slack support autoresponder
We feel the 50x better engagement and 8x response time improvement have shown in our revenue and our customers’ satisfaction. As far as I know, no other platform gives you as much access to its engineers as we do.
When our customers do well, so do we.
Internal notification that a team has reached the threshold to get a Slack channel airdropped to them
…and we are pleased to announce that we are opening up the Slack Connect experience to all Pro Teams on the platform.
Now you can spin up a Slack Connect channel for you and your team right from the Workspace settings page. We’ll be waiting for you there.
We can’t wait to hear what you are building for your companies.
All aboard.