▮ 4 Mainstream Methods
Originally, the main objective of model compression was to fit models onto edge devices. In most cases, however, compressing a model also speeds up inference, so these techniques are now used when the goal is to reduce inference latency as well.
There are mainly 4 popular model compression methods.
▮ 1. Low-Rank Factorization
The name itself sounds intimidating, but all it is doing is replacing high-dimensional weight tensors with products of lower-dimensional ones.
Here are a couple of examples.
Ex. 1 SqueezeNet
By replacing layers such as 3×3 convolutions with 1×1 convolutions, this network was able to achieve AlexNet-level performance with 50x fewer parameters.
Ex. 2 MobileNet
By replacing the standard convolution of size (K×K×C) with a depth-wise convolution (K×K×1) and a point-wise convolution (1×1×C), it was able to reduce the number of parameters from K^2*C to K^2+C.
For example, if there is a convolution of size (3×3×10), the original would have 90 parameters while the new version would only have 19.
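As a quick sanity check of this arithmetic, here is a minimal sketch (assuming PyTorch; the layer sizes are illustrative) that builds a standard 3×3 convolution over 10 channels and its depth-wise + point-wise replacement, then counts parameters. For C output channels the totals are K^2*C*C versus K^2*C + C*C, which amortizes to the K^2*C versus K^2+C per output channel quoted above.

```python
import torch.nn as nn

K, C = 3, 10

# Standard convolution: C filters of size K x K x C -> K^2 * C * C = 900 weights
standard = nn.Conv2d(C, C, kernel_size=K, padding=1, bias=False)

# Depth-wise (K x K x 1 per channel) followed by point-wise (1 x 1 x C) convolution
# -> K^2 * C + C * C = 190 weights in total, i.e. K^2 + C = 19 per output channel
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=K, padding=1, groups=C, bias=False),
    nn.Conv2d(C, C, kernel_size=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 900
print(count(separable))  # 190
```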
▮ 2. Knowledge Distillation
Knowledge distillation helps reduce the model size by “distilling” the knowledge in a huge model (the “teacher” model) into a smaller model (the “student” model) that is much easier to deploy.
3 Forms of Knowledge
There are 3 different forms of knowledge that can be transferred:
1. Relation-Based Knowledge
In addition to knowledge represented in the output layers and the intermediate layers of a neural network, the knowledge that captures the relationship between feature maps can also be used to train a student model.
2. Response-Based Knowledge
This focuses on the final output layer of the teacher model. The hypothesis is that the student model will learn to mimic the predictions of the teacher model (see the sketch after this list).
3. Feature-Based Knowledge
The intermediate layers learn to discriminate specific features and this knowledge can be used to train a student model.
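As a concrete illustration of response-based knowledge, here is a minimal sketch (assuming PyTorch) of a common distillation loss: the student is trained to match the teacher's softened output distribution in addition to the ground-truth labels. The temperature T and weight alpha are illustrative hyperparameters, not values from this article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between the softened teacher and student outputs
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside a training step (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
# loss.backward()
```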
▮ 3. Pruning
“Pruning” means sparsifying the network for faster inference. Many of the weights inside a network contribute very little to its output, so removing them can help reduce the model size.
Methods
There are mainly 2 methods to prune a model.
1. Unstructured Pruning
This method simply removes individual unnecessary weights. All neurons remain, which means some neurons might stay fully connected while others end up sparsely connected.
2. Structured Pruning
This method removes entire neurons (or channels) whose weights are unnecessary, so that all remaining neurons stay fully connected; a sketch of both approaches follows.
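Here is a minimal sketch (assuming PyTorch's torch.nn.utils.prune module) contrasting the two approaches on a single linear layer: unstructured pruning zeroes individual weights, while structured pruning removes whole rows of the weight matrix (i.e., whole neurons). The pruning amounts are arbitrary illustrative values.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured: zero out the 30% of weights with the smallest magnitude,
# leaving a sparsely connected layer of the same shape.
fc1 = nn.Linear(128, 64)
prune.l1_unstructured(fc1, name="weight", amount=0.3)

# Structured: remove entire output neurons (rows of the weight matrix),
# here the 25% of rows with the smallest L2 norm.
fc2 = nn.Linear(128, 64)
prune.ln_structured(fc2, name="weight", amount=0.25, n=2, dim=0)

# prune.remove(fc1, "weight") would make the pruning permanent by folding
# the mask into the weight tensor.
```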
▮ 4. Quantization
Quantization is a technique that changes the data type (e.g., from 32-bit floats to 8-bit integers) used to compute a neural network for faster inference.
After you’ve deployed your model, there is no need to back-propagate (which is sensitive to precision). This means that, if a slight loss of precision is acceptable, we can compute the neural network at lower precision for faster inference by applying quantization.
Methods
There are mainly 3 methods to apply quantization.
1. Dynamic Quantization
Apply quantization at run time. Only layers belonging to the set of types we pass to the quantization function are quantized (see the sketch after this list).
2. Post-Training Quantization
Apply quantization after training and before run time. Automatically applies quantization to all layers.
3. Quantization-Aware Training
Apply quantization at train time. This is the most tedious approach, but it preserves the most accuracy.
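Taking dynamic quantization as an example, here is a minimal sketch (assuming PyTorch): only layers whose types are in the set passed to torch.quantization.quantize_dynamic (here nn.Linear) are converted to INT8, while everything else keeps its original precision. The model architecture is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

quantized = torch.quantization.quantize_dynamic(
    model,              # the trained FP32 model
    {nn.Linear},        # set of layer types to quantize
    dtype=torch.qint8,  # target precision
)
print(quantized)  # the Linear layers are now dynamically quantized
```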