390. Monitoring GPU Metrics

▮ Monitoring

After you have successfully deployed your machine learning model, it is crucial to collect not only metrics such as throughput and latency but also GPU usage and utilization (since your ML model is most likely using GPUs to run inference).

So for this post, I’d like to share several tools that can help you collect GPU metrics and gain insights by visualizing them.

▮ Steps

There are mainly 3 steps to go from metric collection to metric visualization:
1. Export
2. Store
3. Visualize

▮ 1. Export

In our first step, we want to export the GPU metrics so that they can later be accessed by other tools.

The tool you should choose depends on where your model is being deployed. In some cases, you won’t even need to do this step manually. For example, if you are deploying your model on NVIDIA’s Triton Inference Server, passing additional arguments when running the server makes it expose an HTTP port that exports GPU metrics for you.

One of the easiest exporter tools to use is NVIDIA’s DCGM-Exporter.

Fig.1 – DCGM Exporter

Running the dcgm-exporter command exposes an HTTP endpoint (in the example above, localhost:9400/metrics) that serves the GPU metrics.
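
If you want to sanity-check what the exporter is serving, you can read that endpoint directly. Below is a minimal Python sketch, assuming DCGM-Exporter is running locally on its default port 9400; the DCGM_FI_DEV_GPU_UTIL name in the comment is just one example of the fields you may see.

```python
# Sketch: read the raw Prometheus-format metrics served by DCGM-Exporter.
# Assumes the exporter is reachable at localhost:9400 (its default port).
from urllib.request import urlopen

EXPORTER_URL = "http://localhost:9400/metrics"

with urlopen(EXPORTER_URL) as resp:
    body = resp.read().decode("utf-8")

# Each non-comment line looks like: METRIC_NAME{label="value", ...} 42
for line in body.splitlines():
    if line.startswith("DCGM_"):  # GPU metrics exported by DCGM (e.g. DCGM_FI_DEV_GPU_UTIL)
        print(line)
```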

▮ 2. Store

Now that the metrics are exposed at an HTTP endpoint, we need to store them in a database, since the exporter itself will not store the metrics for us. For this phase, you can leverage tools such as Prometheus and InfluxDB. Since DCGM-Exporter supports exporting to Prometheus, I’ll focus on Prometheus for this post.

What is Prometheus?

For example, let’s say there is an app that runs on multiple servers, with multiple containers running within each server.

Fig.2 – Monitoring Services

One day, one container crashes due to some kind of error, which then crashes other containers that depend on it. When this happens, however, the user will only see an error such as “Cannot log in”. If no infrastructure collects metrics from each container, developers will have to look at each crashed container one by one, which makes finding the root cause time-consuming.

Tools such as Prometheus can unify metric collection and trigger alerts when certain conditions are met, which helps developers monitor and debug system failures.

Prometheus mainly consists of 3 components:

  1. Retrieval
    Pulls metrics from target services such as DCGM-Exporter over HTTP (see the configuration sketch below).
  2. Storage
    Stores the pulled metrics.
  3. HTTP Server
    Serves as the endpoint for other tools to access and run queries against the data stored in Prometheus.

Fig.3 – Prometheus Components
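
For the Retrieval step, you tell Prometheus where to pull from in its scrape configuration. The following is a rough sketch that writes such a minimal prometheus.yml; the job name and the 15-second scrape interval are arbitrary choices of mine, and localhost:9400 assumes the exporter from step 1 is running on the same host.

```python
# Sketch: generate a minimal prometheus.yml that scrapes DCGM-Exporter.
# The job name and scrape interval are arbitrary; adjust the target to
# wherever your exporter is actually running.
MINIMAL_CONFIG = """\
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: dcgm-exporter
    static_configs:
      - targets: ['localhost:9400']
"""

with open("prometheus.yml", "w") as f:
    f.write(MINIMAL_CONFIG)
```

With this file in place, Prometheus pulls the exporter’s metrics on every scrape interval and hands them to its Storage component.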

▮ 3. Visualize

Finally, after the exported metrics have been pulled by the monitoring system, we want to visualize them by accessing the endpoint exposed by the monitoring service (such as Prometheus). We can use tools such as Grafana for this step.

Grafana can run queries against multiple data sources besides Prometheus, such as InfluxDB, MySQL, etc.
After setting up the data source and the endpoint it exposes, you can easily create dashboards and add panels to build numerous types of visualizations.
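
Grafana itself is configured through its UI, but it may help to see what a panel essentially does under the hood: it sends a query to Prometheus’s HTTP server and draws the returned time series. The sketch below does the same thing by hand, assuming Prometheus is running at localhost:9090 and that DCGM-Exporter is providing the DCGM_FI_DEV_GPU_UTIL metric (the requests and matplotlib packages are also assumed to be installed).

```python
# Sketch: query GPU utilization from Prometheus and plot it, roughly what a
# Grafana panel does behind the scenes. Assumes Prometheus at localhost:9090
# and that DCGM-Exporter is providing the DCGM_FI_DEV_GPU_UTIL metric.
import time
import requests
import matplotlib.pyplot as plt

PROMETHEUS_URL = "http://localhost:9090"
end = time.time()
start = end - 3600  # last hour

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": "DCGM_FI_DEV_GPU_UTIL",
        "start": start,
        "end": end,
        "step": "15s",
    },
)
resp.raise_for_status()

# Each result is one time series (e.g. one GPU), with [timestamp, value] pairs.
for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "unknown")
    timestamps = [float(t) for t, _ in series["values"]]
    values = [float(v) for _, v in series["values"]]
    plt.plot(timestamps, values, label=f"GPU {gpu}")

plt.xlabel("Unix time")
plt.ylabel("GPU utilization (%)")
plt.legend()
plt.show()
```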