51. Model Optimization with TensorRT

OPTIMIZING YOUR MODEL
TensorRT is an NVIDIA SDK for “optimizing” trained models for faster inference. I’ve been converting models to TensorRT engines for a while, but I didn’t really know how it was optimizing them. So in this post, I’d like to share what I’ve learned about what happens during the optimization phase.

KEY FEATURES OF TENSORRT
1. Layer/Tensor Fusion
2. Precision Calibration
3. Dynamic Tensor Memory
4. Kernel Auto-Tuning
5. Multi-Stream Execution

1. LAYER/TENSOR FUSION
There are 2 types of fusion, horizontal and vertical (a sketch of the idea follows below).
Horizontal (orange): layers that take the same input and perform the same operation are merged into one wider layer, even when they sit on different routes through the graph.
Vertical (blue): sequential layers such as convolution, bias, and activation are fused into a single kernel, so intermediate results never have to leave the GPU between steps.
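
TensorRT performs these fusions internally during the engine build, but the arithmetic behind vertical fusion can be shown by hand. Here is a minimal NumPy sketch of folding a batch-norm layer into the preceding convolution’s weights and bias (illustrative only, not TensorRT code):

import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(conv(x)) into a single convolution.

    w: conv weights, shape (out_ch, in_ch, kh, kw); b: conv bias, (out_ch,).
    gamma, beta, mean, var: per-channel batch-norm parameters, each (out_ch,).
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    w_folded = w * scale[:, None, None, None]   # scale the conv weights
    b_folded = (b - mean) * scale + beta        # scale and shift the bias
    return w_folded, b_folded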
2. PRECISION CALIBRATION
When running inference there is no need to backpropagate (unlike the training phase), so you can prioritize speed over precision.
According to NVIDIA, compared to the original model, the TensorRT-optimized FP32 model is 1.5x faster, and the FP16 model 5.5x faster (precision DOES make a difference).

So the question is, how much should you lower your precision?
You only tell TensorRT which lower precisions it is allowed to use; it then finds the balance between precision and speed for you, deciding layer by layer where reduced precision pays off (and, for INT8, calibrating against sample data to keep accuracy loss in check).
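
As an example of how this looks in practice, here is a minimal sketch of enabling reduced precision with the TensorRT Python API (names as of TensorRT 8.x; MyCalibrator is a hypothetical calibrator class you would implement yourself):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Allow FP16 kernels wherever the hardware supports them.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# INT8 additionally needs a calibrator that feeds representative data:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyCalibrator(...)  # hypothetical calibrator class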

3. DYNAMIC TENSOR MEMORY
TensorRT allocates just enough memory for each tensor and holds it only for the tensor’s lifetime, which keeps memory usage down. Note also that the optimized engine is tied to the hardware (and TensorRT version) it was built with, so you can’t reuse engine files between different devices.
(I tried, just in case, but it didn’t work. JetsonTX2⇔JetsonNano⇔JetsonXavier)
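
Because of that, the usual workflow is to build and serialize the engine directly on the target device. A minimal sketch, reusing the builder, network, and config from the precision sketch above (“model.engine” is a placeholder file name):

# Build and save the engine on the SAME device you will run it on.
serialized = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized)

# Later, on that same device, load it back:
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())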

4. KERNEL AUTO-TUNING
For each layer, TensorRT benchmarks the available kernel implementations on the target GPU and picks the fastest one for the model’s batch size, filter size, and input data size.
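
One knob that directly affects auto-tuning is the workspace size: the more scratch memory you allow, the more kernel tactics the tuner can consider. A minimal sketch, assuming TensorRT 8.4 or newer (older versions set config.max_workspace_size instead):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Give the auto-tuner up to 1 GiB of scratch space for trying tactics.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)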

5. MULTI-STREAM EXECUTION
TensorRT can process multiple input streams in parallel, running independent inference requests concurrently on separate CUDA streams.
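
As a sketch of what that looks like on the user side, here is how two requests could run concurrently (assumes an already deserialized engine; bindings_a and bindings_b are hypothetical pre-allocated device buffer pointers, and PyCUDA provides the streams):

import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on import

# One execution context and one CUDA stream per concurrent request.
context_a = engine.create_execution_context()
context_b = engine.create_execution_context()
stream_a, stream_b = cuda.Stream(), cuda.Stream()

# Both inferences are enqueued without blocking and overlap on the GPU.
context_a.execute_async_v2(bindings=bindings_a, stream_handle=stream_a.handle)
context_b.execute_async_v2(bindings=bindings_b, stream_handle=stream_b.handle)
stream_a.synchronize()
stream_b.synchronize()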

EXPERIMENTING
After getting an overall picture of how the optimization works, I experimented with a faster_rcnn model from TensorFlow 1 (I’m using import_pb_to_tensorboard.py, which comes with the TensorFlow 1 installation).
The image on the top is the original model, and the bottom is the TensorRT-optimized model:
python PATH\TO\import_pb_to_tensorboard.py --model_dir PATH\TO\frozen_inference_graph.pb --log_dir ./visualize_model.log
tensorboard --logdir=visualization:visualize_model.log
You can clearly see that the network is MUCH cleaner.