OPTIMIZING YOUR MODEL TensorRT is NVIDIA’s framework for “optimizing” a trained model for faster inference. I’ve been converting models to TensorRT engines for a while, but I didn’t really know how it was optimizing them. So for this post, I’d like to share what I’ve learned about what happens during the optimization phase.
KEY FEATURES OF TENSORRT
1. Layer/Tensor Fusion
2. Precision Calibration
3. Dynamic Tensor Memory
4. Kernel Auto-Tuning
5. Multi-Stream Execution
1.LAYER/TENSOR FUSION There are two types of fusion, horizontal and vertical. Horizontal (orange): layers that read the same input tensor and perform the same kind of operation are merged into one wider layer, even if they sit on different branches. Vertical (blue): consecutive layers such as convolution, bias, and ReLU are fused into a single kernel, so intermediate results don’t have to be written back to memory between layers. Both kinds of fusion reduce the number of kernel launches and the memory traffic.
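To make vertical fusion concrete, here is a toy numpy sketch of one such merge: folding a batch-normalization (scale/shift) layer into the weights of the convolution before it, so two layers collapse into one op. TensorRT does this kind of algebraic merging automatically at build time; the shapes and values below are made up purely for illustration.

import numpy as np

# Toy vertical fusion: conv -> batch-norm collapsed into a single conv.
rng = np.random.default_rng(0)
cout, cin, k = 4, 3, 3
w = rng.standard_normal((cout, cin, k, k)).astype(np.float32)  # conv weights
b = rng.standard_normal(cout).astype(np.float32)               # conv bias
gamma = rng.standard_normal(cout).astype(np.float32)           # BN scale
beta = rng.standard_normal(cout).astype(np.float32)            # BN shift
mean = rng.standard_normal(cout).astype(np.float32)            # BN running mean
var = rng.random(cout).astype(np.float32) + 0.1                # BN running variance
eps = 1e-5

# Fold BN into the conv: conv followed by BN == one conv with these parameters.
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None, None, None]
b_fused = (b - mean) * scale + beta

# Check at a single spatial position (one k x k patch).
x = rng.standard_normal((cin, k, k)).astype(np.float32)
y_two_ops = gamma * (np.tensordot(w, x, axes=3) + b - mean) / np.sqrt(var + eps) + beta
y_fused = np.tensordot(w_fused, x, axes=3) + b_fused
assert np.allclose(y_two_ops, y_fused, atol=1e-4)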
2.PRECISION CALIBRATION When running inference, there is no need to backpropagate (unlike in the training phase), so you can trade some numerical precision for speed.
According to NVIDIA, compared to the original model, a TensorRT-optimized FP32 model is 1.5x faster and an FP16 model is 5.5x faster (precision DOES make a difference).
So the question is: how far should you lower the precision? You tell TensorRT which reduced precisions it is allowed to use (FP16, or INT8 together with a small calibration dataset), and it chooses the precision per layer, balancing speed against accuracy for you.
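A minimal sketch of how this looks with the TensorRT Python API (exact calls vary between TensorRT versions; the ONNX file name and workspace size are placeholders I made up, not something from the original post):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:        # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30        # scratch space for tactic timing (newer versions use set_memory_pool_limit)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
# INT8 additionally needs representative input data via a calibrator:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator   # a subclass of trt.IInt8EntropyCalibrator2

engine_bytes = builder.build_serialized_network(network, config)  # TensorRT 8+
with open("model.engine", "wb") as f:
    f.write(engine_bytes)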
3.DYNAMic TENSOR MEMORY TensorRT allocates memory for each tensor only for the duration that the tensor is actually needed, so buffers can be reused and the memory footprint stays small. Note that the resulting engine is also tied to the hardware it was built on: the memory plan and the kernels chosen during auto-tuning are device-specific, so you can’t reuse an engine file across different GPUs. (I tried, just in case, but it didn’t work. JetsonTX2⇔JetsonNano⇔JetsonXavier)
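In practice that means you build and serialize an engine on each target device and only ever deserialize it there. A minimal sketch, assuming an engine file called model.engine (a placeholder name) was built on this same GPU:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:      # must have been built on this GPU
    engine = runtime.deserialize_cuda_engine(f.read())

# Device memory for activations is allocated when the execution context is
# created, sized from the memory plan TensorRT computed at build time.
context = engine.create_execution_context()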
4.KERNEL AUTO-TUNING TensorRT times multiple candidate kernel implementations on the target GPU and selects the fastest one for each layer, taking the model’s batch size, filter size, and input data size into account.
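Because the tuning is done against concrete shapes, a model with dynamic input shapes needs to be told which shape range to tune for. A sketch using an optimization profile (the tensor name "input" and the shapes are assumptions for illustration):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

profile = builder.create_optimization_profile()
profile.set_shape("input",            # input tensor name in your network
                  (1, 3, 224, 224),   # min shape
                  (8, 3, 224, 224),   # opt shape: kernels are tuned for this
                  (16, 3, 224, 224))  # max shape
config.add_optimization_profile(profile)
# ...then parse the network and build the engine as in the precision sketch above.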
5.MULTI-STREAM EXECUTION Allows multiple input streams to be processed in parallel on the same engine, each with its own execution context and CUDA stream.
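A rough sketch of what that looks like with pycuda and the TensorRT 7/8 binding API (the engine path is a placeholder, and the engine is assumed to have static input shapes):

import numpy as np
import pycuda.autoinit                     # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:      # placeholder engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

def alloc_bindings(engine):
    # One device buffer per binding, sized from its (static) shape and dtype.
    bufs = []
    for i in range(engine.num_bindings):
        shape = engine.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        bufs.append(cuda.mem_alloc(trt.volume(shape) * np.dtype(dtype).itemsize))
    return [int(b) for b in bufs]

# One execution context + one CUDA stream + one set of buffers per input stream.
ctx_a, ctx_b = engine.create_execution_context(), engine.create_execution_context()
stream_a, stream_b = cuda.Stream(), cuda.Stream()
bindings_a, bindings_b = alloc_bindings(engine), alloc_bindings(engine)

# Enqueue both inferences; they can overlap on the GPU.
ctx_a.execute_async_v2(bindings=bindings_a, stream_handle=stream_a.handle)
ctx_b.execute_async_v2(bindings=bindings_b, stream_handle=stream_b.handle)
stream_a.synchronize()
stream_b.synchronize()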
EXPERIMENTING After getting an overall picture of how the optimization works, I experimented with a faster_rcnn model from TensorFlow 1 (I’m using import_pb_to_tensorboard.py, which comes with TensorFlow 1, to visualize the graphs). The image on the top is the original model and the one on the bottom is the TensorRT-optimized model.
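The post doesn’t show the exact conversion steps, so here is a hedged sketch of one common path for a TensorFlow 1 frozen graph: TF-TRT’s TrtGraphConverter (TF 1.14/1.15). The file names and output node names below are placeholders, not the ones from the original experiment:

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Load the frozen graph (placeholder path).
with tf.io.gfile.GFile("frozen_faster_rcnn.pb", "rb") as f:
    frozen_graph = tf.compat.v1.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Replace TensorRT-compatible subgraphs with TRTEngineOp nodes.
converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=["detection_boxes", "detection_scores"],  # placeholder output node names
    precision_mode="FP16")
trt_graph = converter.convert()

with tf.io.gfile.GFile("frozen_faster_rcnn_trt.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())

# Both graphs can then be loaded into TensorBoard with something like:
#   python import_pb_to_tensorboard.py --model_dir frozen_faster_rcnn.pb --log_dir logs/original
#   python import_pb_to_tensorboard.py --model_dir frozen_faster_rcnn_trt.pb --log_dir logs/trt
#   tensorboard --logdir logs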