Kyryl Truskovskyi

Deploy Triton Inference Server on Railway

Machine learning—have you heard about it?

Of course, you have! That’s how I discovered Railway - ChatGPT recommended it for my deployment. And naturally, you might think you need tons of GPUs and hardware to deploy ML models, right? But what if I told you that’s not always the case?

Believe it or not, there are many ML models outside of the multi-billion parameter LLMs - the ones that predict your ETA in navigation apps, decide whether you qualify for a loan, detect malfunctions in rocket engineering, and predict anomalies in security systems.

And you know what they all have in common? They’re fast and don’t require GPUs. A powerful CPU is more than enough to run and operate them. Even in the GPU-hungry world of LLMs, there’s a big push to move inference to CPUs to reduce operational costs - see Llama 3.2 1B or SmolLM.

So today, we’re going to explore how to serve ML models on Railway and scale them efficiently!

Let’s be more concrete. How exactly are we going to do this?

Fry from Futurama feeling suspicious

We’ll need models of course, a way to deploy them, and a way to store checkpoints. Let’s talk about each.

First, the model itself. You can bring your own, train one from scratch, or just use an existing one. For example, on HuggingFace, there are more than 1M open-weight models, many of which can be served on a CPU.

For this example, I am going to use two models:

  • A dummy model that simply adds and subtracts arrays.
  • ResNet-18, a computer vision model trained on the ImageNet dataset.

P.S. Please feel free to replace them with any others relevant to your use case.

Now let’s choose how to serve them via an API.

Should we write our own FastAPI web server around it like in this guide?

That’s a perfectly good option, but what if you need some advanced functionality, such as serving multiple models, adding and removing them dynamically, version support, queue buffers, caching, and so on?

NVIDIA Triton inference server

For this purpose, inference servers exist. There are quite a few of them, but today we will focus on one specific and very popular option: the NVIDIA Triton Inference Server.

It has one of the most comprehensive sets of features for serving ML models and is supported by many ML platforms.

The downside is that it might be quite complex to start with. NVIDIA software is never easy to work with, but at the end of the day, this is why we have this guide for you.

Last but not least, we will also need object storage to keep model files (serialized checkpoints of the trained model), and for this, we are going to use MinIO and connect it with Triton.

So!

  • ML models: AddSub and ResNet - ✅
  • Inference server: Triton - ✅
  • Model storage: MinIO - ✅
  • Platform to host & connect everything together: Railway - ✅

Let’s get into building!

Even the Triton team recognizes that Triton is hard to start with, so they put PyTriton together!

PyTriton is a simplified Python wrapper around the inference server. This is exactly where we are going to start - just to get a feel for how to work with it.

Let’s begin by defining our AddSub model and PyTriton inference.

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


# @batch collects incoming requests and passes them to the function as batched numpy arrays
@batch
def _add_sub(**inputs):
    a_batch, b_batch = inputs.values()
    add_batch = a_batch + b_batch
    sub_batch = a_batch - b_batch
    return {"add": add_batch, "sub": sub_batch}


with Triton() as triton:
    print("Loading AddSub model")
    # Bind the inference function to a named model with typed inputs and outputs
    triton.bind(
        model_name="AddSub",
        infer_func=_add_sub,
        inputs=[
            Tensor(name="input_a", dtype=np.float32, shape=(-1,)),
            Tensor(name="input_b", dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="add", dtype=np.float32, shape=(-1,)),
            Tensor(name="sub", dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128),
        strict=True,
    )
    print("Serving model")
    triton.serve()

Easy!

Now, let’s define a Dockerfile for it.

FROM python:3.13

WORKDIR /app
RUN pip install nvidia-pytriton==0.5.12

ENV PYTHONPATH=/app
COPY server.py server.py

CMD ["python", "/app/server.py"]

I built and pushed it via these commands:

docker build --platform=linux/amd64 -t ghcr.io/kyryl-opens-ml/test:latest .
docker push ghcr.io/kyryl-opens-ml/test:latest

Make sure the image is accessible to Railway: either make it public or provide registry credentials so Railway can pull it. In my case, this Docker image is public, but Railway supports private registries as well.

And finally, we are really ready to deploy it via Docker image!

Simply create a new project in Railway, add a new service, and link the service to the image you published above. See Railway’s Quickstart guide for more information if needed.

Successful deployment of Pytriton image in Railway

Also be sure to generate a domain for the service.

Pytriton service in Railway with public domain generated

So now you have a simple PyTriton deployment on Railway. Let’s check if the model is available. Triton provides some information about the model, which you can find via this endpoint:

https://<your-generated-domain>/v2/models/AddSub/config

Data about the model from the config endpoint

Here you can see all configurations (mostly default values) for our AddSub model.

It exposes both HTTP and gRPC endpoints for inference.
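The HTTP side is plain REST (Triton implements the KServe v2 inference protocol), so a quick sanity check with the requests library also works. Here is a small sketch, with the domain placeholder standing in for whatever Railway generated for you:

import requests

# Base URL of the PyTriton service on Railway - replace with your generated domain
HOST = "https://<your-generated-domain>"

# A 200 response means the server is up and ready to accept inference requests
ready = requests.get(f"{HOST}/v2/health/ready")
print("server ready:", ready.status_code == 200)

# The same model configuration we just looked at in the browser, as JSON
config = requests.get(f"{HOST}/v2/models/AddSub/config")
print(config.json())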

Now let’s call PyTriton

For actual inference calls, though, the most convenient route is Triton’s own client library, tritonclient, rather than hand-crafting the request payloads. Simple code to use it might look like this:

import numpy as np
import tritonclient.http as httpclient    

def call_add_sub_model(host: str):
    batch_size = 2
    a_batch = np.ones((batch_size, 1), dtype=np.float32)
    b_batch = np.ones((batch_size, 1), dtype=np.float32)

    # The Railway domain is served over HTTPS, hence ssl=True
    triton_client = httpclient.InferenceServerClient(url=host, verbose=True, ssl=True)

    inputs = []
    inputs.append(httpclient.InferInput('input_a', a_batch.shape, "FP32"))
    inputs.append(httpclient.InferInput('input_b', b_batch.shape, "FP32"))

    inputs[0].set_data_from_numpy(a_batch)
    inputs[1].set_data_from_numpy(b_batch)

    outputs = []
    outputs.append(httpclient.InferRequestedOutput('add'))
    outputs.append(httpclient.InferRequestedOutput('sub'))
    results = triton_client.infer(
        model_name="AddSub",
        inputs=inputs,
        outputs=outputs
    )

    add_result = results.as_numpy('add')
    sub_result = results.as_numpy('sub')
    print(f"add_result = {add_result}; sub_result = {sub_result}")

if __name__ == '__main__':
    host = "py-triton-production.up.railway.app"
    call_add_sub_model(host=host)

Great! Now we can query our simple model on Railway!

But this is just the start! We haven’t even scratched the surface. Let’s scale up our serving game with multiple models and model registry concepts.

Let’s define our advanced use case!

Leo DiCaprio saying we need to go deeper to scale up

So, we know how to serve one model - great success! But now we want to scale up, support multiple models, experiment with versions, and do it really, really fast. Welcome to model management! And, surprise surprise, Triton has amazing support for it.

This particular feature is called the Model Repository, and the workflow around it is called Model Management. Let’s see how to use them with examples.

Migrate from PyTriton to Triton

To put it simply, the Model Repository is just a folder that Triton scans to load and unload models, depending on how you configure it. To enable this, let’s first migrate from PyTriton to plain Triton by running this command:

tritonserver --model-control-mode=poll --repository-poll-secs=5 --model-repository=/models

This is exactly what happens under the hood when you run PyTriton, so we’re removing one layer in our journey. What does this command do?

  1. Starts the Triton Inference Server
  2. Loads all models from the folder /models
  3. Scans this folder every 5 seconds to check if any models are added, removed, or updated

What’s inside this --model-repository=/models folder? It’s essentially just a list of folders, each corresponding to its own model.

For AddSub, it’s:

add_sub (name of the model)
│   config.pbtxt (configuration)
└───1 (version folder)
    │   model.py (python implementation of TritonPythonModel interface)

where config looks like:

name: "add_sub"
platform: "python"
default_model_filename: "model.py"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
input [
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

instance_group [{ kind: KIND_CPU }]
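The version folder holds model.py, which implements Triton’s TritonPythonModel interface. The real implementation lives in the example repo; a minimal sketch consistent with the config above could look like this:

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Triton hands us a list of requests; we must return one response per request
        responses = []
        for request in requests:
            # Input names must match config.pbtxt
            input_a = pb_utils.get_input_tensor_by_name(request, "input_a").as_numpy()
            input_b = pb_utils.get_input_tensor_by_name(request, "input_b").as_numpy()

            # Output names must match config.pbtxt as well
            out_add = pb_utils.Tensor("add", (input_a + input_b).astype(np.float32))
            out_sub = pb_utils.Tensor("sub", (input_a - input_b).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_add, out_sub]))
        return responses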

For ResNet, it’s:

resnet (name of the model)
│   config.pbtxt (configuration)
└───1 (version folder)
    │   model.pt (serialized libtorch model file)

where config looks like:

name: "resnet"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 1, 3, 224, 224 ] }
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 1000, 1, 1 ]
    reshape { shape: [ 1, 1000 ] }
  }
]

Implementation options

In Triton, the implementation depends on which kind of model you are serving.

For the Python backend, you write the inference code yourself - great if you need a custom implementation. For the pytorch_libtorch platform, you simply add the saved model weights and specify the input/output schema.
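Producing that model.pt file is a short TorchScript export. It isn’t shown in this guide’s repo walkthrough, but a sketch of how the resnet checkpoint above might be created (assuming torch and torchvision are installed) looks roughly like this:

import torch
import torchvision

# Load the pretrained resnet18 mentioned earlier and switch it to inference mode
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()

# Trace with a dummy batch matching the reshape in config.pbtxt: [1, 3, 224, 224]
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Triton expects the file to be named model.pt inside the version folder
# (assumes the model_registry/resnet/1/ directory already exists locally)
traced.save("model_registry/resnet/1/model.pt")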

Beyond Python and PyTorch, Triton supports a wide range of backends, including:

  • TensorRT
  • ONNX Runtime
  • TensorFlow
  • OpenVINO
  • vLLM and TensorRT-LLM (for LLM workloads)

And many more. You can also chain these models into a “meta” model, which people in ML refer to as ensembles, see Model Ensembles. But we will leave that for now.

For the full code of my example check this repo.

Okay, let’s say we have our model repository, whether you used my example, NVIDIA’s examples (the Conceptual Guides and Python examples), or created your own.

Where is this folder going to live?

For this, we’re going to use MinIO and take advantage of the persistence features of the Railway platform. And to avoid starting from scratch, Railway already has a great template for deploying it:

MinIO template overview page

If you’re not familiar with it, MinIO is an S3-compatible object storage system that can be connected to Triton as the primary location for your model files using this command:

tritonserver --model-control-mode=poll --repository-poll-secs=5 --model-repository=s3://https://<generated-minio-url>:443/triton-bucket/model_registry/  --log-verbose=1

Where https://<generated-minio-url> is your MinIO URI from Railway, triton-bucket is the bucket name, and model_registry is the folder where you store the models.

Update your service to pull the Triton server image: nvcr.io/nvidia/tritonserver:24.09-py3

Triton service settings in Railway

Ensure that you set the start command appropriately - it’s the same tritonserver command as above, with --model-repository pointing at your MinIO bucket:

tritonserver --model-control-mode=poll --repository-poll-secs=1 --model-repository=s3://https://<generated-minio-url>:443/triton-bucket/model_registry/

Triton service custom start command

Make sure to add two environment variables to your Triton deployment, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which should correspond to the MinIO username and password.

Triton service variables

And now you’re ready for action!

Our Triton server is pulling models from MinIO, and the MinIO bucket setup looks like this:

Triton-bucket in MinIO

Let’s create a folder for our registry and upload the add_sub model first.

Folder setup in MinIO

If you check the Triton logs, you’ll find a new message:

Triton service logs in Railway

This means Triton was able to successfully pull the model, and we can run inference against it.

Let’s do the same with the ResNet model.

Folder setup in MinIO

The same happens in the Triton logs:

More data in Triton service logs in Railway

Oh, I love Railway’s logs feature!

So now you see how you can dynamically add and remove ML models with the combination of Triton and MinIO. Time to run some inference - here’s the full code for the client:

from io import BytesIO
from pprint import pprint

import numpy as np
import requests
import tritonclient.http as httpclient
import typer
from PIL import Image
from torchvision import transforms

DEFAULT_HOST = "triton.up.railway.app"
app = typer.Typer()

def resnet_preprocess(img_url="https://www.railway-technology.com/wp-content/uploads/sites/13/2023/05/shutterstock_2123019209.jpg"):
    response = requests.get(img_url)
    img = Image.open(BytesIO(response.content))
    preprocess = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    return preprocess(img).numpy()

@app.command()
def list_of_models(host: str = DEFAULT_HOST):
    triton_client = httpclient.InferenceServerClient(url=host, ssl=True)
    d = triton_client.get_model_repository_index()
    pprint(d)

@app.command()
def call_add_sub_model(host: str = DEFAULT_HOST):
    triton_client = httpclient.InferenceServerClient(url=host, ssl=True)

    shape = [4]
    a_batch = np.random.rand(*shape).astype(np.float32)
    b_batch = np.random.rand(*shape).astype(np.float32)

    inputs = [httpclient.InferInput('input_a', a_batch.shape, "FP32"), httpclient.InferInput('input_b', b_batch.shape, "FP32")]
    inputs[0].set_data_from_numpy(a_batch)
    inputs[1].set_data_from_numpy(b_batch)

    outputs = [httpclient.InferRequestedOutput('add'), httpclient.InferRequestedOutput('sub')]
    results = triton_client.infer(model_name="add_sub", model_version="1", inputs=inputs, outputs=outputs)
    print(f"add_result = {results.as_numpy('add')};\nsub_result = {results.as_numpy('sub')}")

@app.command()
def call_resnet(host: str = DEFAULT_HOST):
    transformed_img = resnet_preprocess()

    # Setting up client
    client = httpclient.InferenceServerClient(url=host, ssl=True)

    inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
    inputs.set_data_from_numpy(transformed_img, binary_data=True)

    outputs = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)
    results = client.infer(model_name="resnet", model_version="1", inputs=[inputs], outputs=[outputs])
    inference_output = results.as_numpy('output__0')
    print(inference_output[:5])

if __name__ == '__main__':
    app()

You can check the list of all models:

python client.py list-of-models
[{'name': 'add_sub', 'state': 'READY', 'version': '1'},
 {'name': 'resnet', 'state': 'READY', 'version': '1'}]

Run inference against the add_sub model:

python client.py call-add-sub-model
add_result = [1.5022974 1.0049845 1.1029736 1.4775192];
sub_result = [ 0.1715591   0.753664   -0.84188515 -0.1987344 ]

Run inference against the resnet model:

python client.py call-resnet
[b'19.045240:466' b'16.321390:705' b'15.594160:547' b'14.090655:829'
 b'13.497784:627']

Those ResNet outputs are score:class-index pairs, which is what the class_count argument asks Triton for. And of course, you can do all of this dynamically - add more models, remove existing ones, and experiment with different versions - simply by uploading folders to MinIO on Railway! Pretty neat, I would say!
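In the screenshots above I used the MinIO console, but since MinIO is S3-compatible you can script the whole thing. Here is a small sketch with boto3; the endpoint, credentials, and local paths are placeholders you’d swap for your own Railway MinIO values:

import boto3

# S3 client pointed at the MinIO instance on Railway (placeholders - use your own values)
s3 = boto3.client(
    "s3",
    endpoint_url="https://<generated-minio-url>",
    aws_access_key_id="<minio-username>",
    aws_secret_access_key="<minio-password>",
)

# Add (or update) a model: Triton picks it up on the next repository poll
s3.upload_file("model_registry/add_sub/config.pbtxt", "triton-bucket", "model_registry/add_sub/config.pbtxt")
s3.upload_file("model_registry/add_sub/1/model.py", "triton-bucket", "model_registry/add_sub/1/model.py")

# Remove a model: deleting its folder unloads it on the next poll
listing = s3.list_objects_v2(Bucket="triton-bucket", Prefix="model_registry/add_sub/")
for obj in listing.get("Contents", []):
    s3.delete_object(Bucket="triton-bucket", Key=obj["Key"])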

As software engineers, we all love to draw rectangles and arrows between them, right? So let’s summarize everything we’ve discussed into one diagram. The Triton Inference Server acts as the main engine serving the models, MinIO serves as the model registry, and there are two clients: the Triton client for sending inference requests, and an S3-compatible client for adding and removing models.

Final architecture diagram

And this is how it looks in the Railway UI.

Final project in Railway canvas

I would love for you to take away 4 key learnings and 1 tangible artifact from this read:

  • Many ML models need only a CPU, not a GPU
  • Don’t be afraid to self-host your model serving
  • Triton is a solid option
  • Railway is your best friend for this!

Start with this code.