55. Performance Comparison: Quantization

When detecting objects, the method used to produce and display the detected boxes can affect inference speed. One of the models I’m working on uses a clustering method to generate those boxes and it is SUPER SLOW.
I noticed something while trying to make the model faster, so I’d like to share it here!


Quantization

You can apply quantization when you want faster inference from a model. It also helps when you want to run models on edge devices, which have far fewer hardware resources than the cloud.

There are 2 main methods to apply quantization:
A) Quantization After Training
Applying quantization after the training phase (often called post-training quantization) can reduce model size and increase inference speed, but there is a high chance of a relatively large precision loss.
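
The post doesn’t say which framework the model uses, but as an illustration, here is a minimal post-training quantization sketch with TensorFlow Lite. The model file names here are hypothetical placeholders, not the actual model from the post.

```python
import tensorflow as tf

# Hypothetical trained Keras detector; stands in for whatever model you trained.
model = tf.keras.models.load_model("my_detector.h5")

# Post-training (dynamic range) quantization: no retraining needed,
# the converter simply quantizes the already-trained weights.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_ptq_model = converter.convert()

with open("detector_ptq.tflite", "wb") as f:
    f.write(tflite_ptq_model)
```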

B) Quantization Aware Training
As the name suggests, this makes the model aware of quantization during the training phase, which helps minimize the precision loss that the previous method suffers from.
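
And here is a sketch of the quantization-aware route, again assuming TensorFlow plus the tensorflow_model_optimization package; build_model() and train_ds are hypothetical stand-ins for your own model and dataset.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Hypothetical model builder and dataset; replace with your own.
base_model = build_model()

# Insert fake-quantization ops so the model learns weights that are
# robust to quantization error during training.
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam", loss="mse")
qat_model.fit(train_ds, epochs=5)

# After training, convert to a quantized TFLite model as before.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()

with open("detector_qat.tflite", "wb") as f:
    f.write(tflite_qat_model)
```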

Considering this, I understood that precision would be higher if I took the latter approach, but what about the SPEED?
So… to see whether post-training quantization can be as fast as quantization-aware training, I gave it a shot and compared the same model quantized both ways to see which method does better on edge devices.


Experimenting

In the figure, the left plot is quantization DURING training (quantization-aware), and the right plot is quantization AFTER training.
As a result, both stayed at around 2.00 for the whole run.
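
The post doesn’t show how the speed numbers were collected; if both models were exported to TFLite, a rough latency comparison on the device could look something like this (the .tflite file names are the hypothetical ones from the sketches above):

```python
import time

import numpy as np
import tensorflow as tf

def benchmark_tflite(model_path, runs=100):
    """Average per-inference latency of a .tflite model on this device."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.random_sample(inp["shape"]).astype(inp["dtype"])

    # Warm-up run, then time repeated invocations.
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

for path in ("detector_ptq.tflite", "detector_qat.tflite"):
    print(f"{path}: {benchmark_tflite(path) * 1000:.2f} ms per inference")
```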

Maybe not much of a big deal, but it turns out that quantization AFTER training can be just as fast as quantization-aware-trained models, which leaves the large precision loss as the real trade-off. (I mean, it’s as fast as quantization-aware-trained models despite not considering quantization during the training process at all.)