Quantization
Quantization is a technique that lowers the precision of the data types used to compute a neural network in order to speed up inference.
After a model has been deployed, there is no need to backpropagate (which is sensitive to precision). This means that, if a slight loss of precision is acceptable, we can compute the network at lower precision by applying quantization and get faster inference.
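As a minimal illustration of the idea, the snippet below maps a float32 tensor to 8-bit integers and back using PyTorch's `torch.quantize_per_tensor`; the scale and zero point are arbitrary values chosen for this sketch rather than ones computed from a real model.

```python
import torch

# A float32 tensor, as an ordinary model would produce.
x = torch.tensor([0.05, -1.20, 0.73, 2.40])

# Map floats to 8-bit integers: q = round(x / scale) + zero_point.
# scale and zero_point are illustrative values chosen for this sketch;
# in practice they are derived from the observed value range.
q = torch.quantize_per_tensor(x, scale=0.02, zero_point=0, dtype=torch.qint8)

print(q.int_repr())    # the underlying int8 values
print(q.dequantize())  # approximate reconstruction of x (small rounding error)
```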
Three quantization methods supported by PyTorch
- Dynamic Quantization
  Applies quantization at run time. Only layers belonging to the set of types we pass to the function are quantized (see the code sketch after this list).
- Post-Training Quantization
  Applies quantization after training and before run time. Quantization is applied to all layers automatically.
- Quantization-Aware Training
  Applies quantization at train time. The most tedious of the three, but gives the best accuracy.
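The sketch below shows roughly how each of the three methods is invoked, assuming PyTorch's eager-mode quantization API (`torch.quantization`, also exposed as `torch.ao.quantization` in newer releases), the "fbgemm" x86 backend, and a toy `nn.Sequential` model; it illustrates the workflow rather than a production-ready recipe.

```python
import copy

import torch
import torch.nn as nn

# A toy float32 model; any network built from supported layer types would do.
float_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# 1) Dynamic quantization: applied at run time. Only the layer types in the
#    set we pass (here nn.Linear) are converted; weights are quantized ahead
#    of time and activations are quantized on the fly.
dynamic_model = torch.quantization.quantize_dynamic(
    copy.deepcopy(float_model).eval(), {nn.Linear}, dtype=torch.qint8
)

# 2) Post-training quantization: applied after training and before run time.
#    Observers are inserted into all layers, calibration data is run through
#    the model, and the model is then converted. (A real model also needs
#    QuantStub/DeQuantStub around its inputs and outputs.)
ptq_model = copy.deepcopy(float_model).eval()
ptq_model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(ptq_model)
prepared(torch.randn(4, 128))           # calibration pass with sample data
static_model = torch.quantization.convert(prepared)

# 3) Quantization-aware training: applied at train time. Fake quantization is
#    inserted so the network learns to compensate for rounding error.
qat_model = copy.deepcopy(float_model).train()
qat_model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared_qat = torch.quantization.prepare_qat(qat_model)
prepared_qat(torch.randn(4, 128))       # stand-in for a real fine-tuning loop
trained_quantized = torch.quantization.convert(prepared_qat.eval())
```

Dynamic quantization is the easiest to drop in because it needs no calibration data or retraining; post-training quantization and quantization-aware training trade extra work for better accuracy.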
Reference: PyTorch quantization documentation