386. Packaging ML Models

▮ Advantages of Packaging

Packaging an ML model means putting it into a container, which brings the following advantages.

Fig. 1 – Container Deployment
  1. You can run the container locally, as long as a container runtime is installed
  2. You can easily distribute and share the container
  3. There are many options for deploying a container in the cloud and scaling it if necessary
  4. Deployment becomes less complicated, which makes the system easier to debug and maintain

Here is an example of containerizing an NLP model using Docker and Flask.

Reference: Practical MLOps

▮ Directory Structure

This is the final directory structure we want to end up with.

Working Directory:.
│  Dockerfile
│  requirements.txt
│
└─webapp
        app.py
        roberta-sequence-classification-9.onnx

▮ Dockerfile

First, we create a Dockerfile that installs everything the app needs inside the container.

# Use the official Python 3.8 image as the base
FROM python:3.8

# Copy the dependency list first so the install layer is cached
COPY ./requirements.txt /webapp/requirements.txt

WORKDIR /webapp

RUN pip install -r requirements.txt

# Copy the Flask app and the ONNX model into the image
COPY webapp/* /webapp

# Start the container with "python app.py"
ENTRYPOINT [ "python" ]

CMD ["app.py"]
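Splitting ENTRYPOINT and CMD this way fixes the interpreter while leaving the script overridable: by default the container runs python app.py, but another script name passed to docker run would replace app.py without rebuilding the image.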

▮ requirements.txt

Here we pin the versions of the libraries needed inside the container.

simpletransformers==0.4.0
tensorboardX==1.9
transformers==2.1.0
flask==2.1.0
torch==1.7.1
onnxruntime==1.7.0
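To confirm that these pins resolved correctly inside the image, a throwaway script like the one below can print a few of the installed versions (version_check.py is a hypothetical helper, not part of the app):

# version_check.py (hypothetical helper): print a few installed versions
import flask
import onnxruntime
import torch
import transformers

for module in (flask, onnxruntime, torch, transformers):
    print(module.__name__, module.__version__)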

▮ app.py

We will create a simple Flask app with a predict() function bound to the /predict route.

from flask import Flask, request, jsonify
import torch
import numpy as np
from transformers import RobertaTokenizer
import onnxruntime


app = Flask(__name__)

# Load the tokenizer and the ONNX inference session once at startup
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
session = onnxruntime.InferenceSession(
    "roberta-sequence-classification-9.onnx")


@app.route("/predict", methods=["POST"])
def predict():
    # Tokenize the first string in the JSON body and add a batch dimension
    input_ids = torch.tensor(tokenizer.encode(
        request.json[0], add_special_tokens=True)).unsqueeze(0)

    # ONNX Runtime expects NumPy arrays rather than torch tensors
    if input_ids.requires_grad:
        input_array = input_ids.detach().cpu().numpy()
    else:
        input_array = input_ids.cpu().numpy()

    # Run the model, feeding the array to its first (and only) input
    inputs = {session.get_inputs()[0].name: input_array}
    out = session.run(None, inputs)

    # argmax over the output logits; the app treats class 1 as positive
    result = np.argmax(out)
    return jsonify({"positive": bool(result)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)
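Note that app.run() binds to 0.0.0.0 rather than localhost; inside a container this is what allows connections from the host through the published port. For quick iteration you can also start the app outside Docker with python app.py from the webapp directory, as long as the model file sits next to it.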

▮ ONNX model

For this post, I am using an ONNX model called RoBERTa (a sequence classification model). Download roberta-sequence-classification-9.onnx from the ONNX Model Zoo and place it under the "webapp" directory.
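After downloading, it is worth checking what the model expects. Below is a minimal sketch, run from the working directory, assuming only onnxruntime is installed locally:

# inspect_model.py (throwaway script): print the model's input/output signature
import onnxruntime

session = onnxruntime.InferenceSession(
    "webapp/roberta-sequence-classification-9.onnx")

for tensor in session.get_inputs():
    print("input: ", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)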

▮ Getting Inference Result

Now that everything is set up, we will first build the Docker image and tag it as "roberta".

docker build -t roberta .

Then, run the container:

docker run -it -p 5000:5000 --rm roberta
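Here -p 5000:5000 publishes the container's port 5000 on the host, -it attaches an interactive terminal so the Flask logs are visible, and --rm removes the container when it exits.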

If the container is running successfully, it should show logs like the ones below.

 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.17.0.2:5000

Finally, use curl to send an HTTP request.
I am using the Windows command prompt; the escaping in the command below may change slightly depending on your OS.

curl -X POST -H "Content-Type: application/json" --data "[\"Packing ML Models is important\"]" http://127.0.0.1:5000/predict
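Note that the payload is a JSON array rather than an object, because predict() reads the input text from request.json[0].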

The Flask app should return a response like the one below.

{
  "positive": true
}
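If your shell's quoting gets in the way, the same request can be sent from Python using the requests library (a client-side extra, assumed to be installed with pip install requests):

# client.py (hypothetical helper): call the /predict endpoint
import requests

response = requests.post(
    "http://127.0.0.1:5000/predict",
    json=["Packing ML Models is important"],  # a JSON array, matching request.json[0]
)
print(response.json())  # expected: {'positive': True}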