375. Model Compression Methods

▮ 4 Mainstream Methods

Originally, the main objective of model compression was to fit models onto edge devices. In most cases, though, compressing a model speeds up inference as well, so lately the technique is also used when the goal is simply to reduce inference latency.

There are 4 main model compression methods:

  1. Low-Rank Factorization
  2. Knowledge Distillation
  3. Pruning
  4. Quantization

▮ 1. Low-Rank Factorization

The name itself is quite intimidating, but all it does is replace high-dimensional tensors with (products of) lower-dimensional ones.
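
As a minimal sketch (assuming PyTorch; the layer sizes and rank are arbitrary), a large fully-connected layer can be approximated by two smaller ones via a truncated SVD of its weight matrix:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one big Linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                          # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:rank, :]               # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank]    # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

big = nn.Linear(1024, 1024)
small = factorize_linear(big, rank=64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big), "->", count(small))              # 1,049,600 -> 132,096 parameters
```

The product of the two small weight matrices approximates the original weight matrix, so the layer computes roughly the same function with far fewer parameters; how much accuracy you lose depends on the chosen rank.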

Here are a couple of examples of networks that use this idea.

Ex. 1 SqueezeNet

By replacing layers such as 3×3 convolutions with 1×1 convolutions, this network was able to achieve AlexNet-level performance with 50x fewer parameters.

Ex. 2 MobileNet

By replacing the standard convolution of size (KxKxC) with a depth-wise convolution (KxKx1) followed by a point-wise convolution (1x1xC), it reduces the number of parameters per filter from (K^2 * C) to (K^2 + C).

For example, for a convolution filter of size (3x3x10), the original has 90 parameters while the factorized version has only 19.
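
A quick sanity check of that arithmetic (PyTorch assumed; a full layer with 10 output channels, so the counts are 10x the per-filter numbers above):

```python
import torch.nn as nn

C, K = 10, 3   # channels and kernel size

# Standard convolution: C filters of size KxKxC.
standard = nn.Conv2d(C, C, kernel_size=K, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    # Depth-wise: one KxKx1 filter per input channel (groups=C).
    nn.Conv2d(C, C, kernel_size=K, padding=1, groups=C, bias=False),
    # Point-wise: C filters of size 1x1xC that mix the channels back together.
    nn.Conv2d(C, C, kernel_size=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))             # 900 = C * (K^2 * C), i.e. 10 filters of 90 parameters
print(count(depthwise_separable))  # 190 = C * (K^2 + C), i.e. 10 "filters" of 19 parameters
```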

▮ 2. Knowledge Distillation

Knowledge distillation helps reduce the model size by “distilling” the knowledge of a huge model (the “teacher” model) into a smaller model (the “student” model) that is much easier to deploy.

3 Forms of Knowledge

There are 3 different forms of knowledge:

1. Relation-Based Knowledge

In addition to knowledge represented in the output layers and the intermediate layers of a neural network, the knowledge that captures the relationship between feature maps can also be used to train a student model.

2. Response-Based Knowledge

This focuses on the final output layer of the teacher model. The student model is trained to mimic the teacher's predictions (a minimal loss for this is sketched below, after the three forms).

3. Feature-Based Knowledge

The intermediate layers learn to discriminate specific features and this knowledge can be used to train a student model.
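
Here is a minimal sketch of response-based distillation (assuming PyTorch; the temperature `T` and mixing weight `alpha` are typical but arbitrary choices): the student is trained to match the teacher's softened output distribution while still fitting the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: KL divergence between teacher and student distributions at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale so gradients match the hard loss
    # Hard part: the usual cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside a training step (teacher frozen, student being optimized):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```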

▮ 3. Pruning

“Pruning” means making the network sparse for faster inference. Many of the weights inside a trained network contribute very little to its output, so removing them can reduce the model size.

Methods

There are mainly 2 methods to prune a model.

1. Unstructured Pruning

This method simply removes (zeroes out) individual unnecessary weights. All neurons remain, which means some neurons stay fully connected while others end up only sparsely connected.

2. Structured Pruning

This method removes entire neurons (together with all of their weights) that are deemed unnecessary, so that all remaining neurons stay fully connected.
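
A minimal sketch of both variants using PyTorch's built-in pruning utilities (the library choice is mine; the idea is general):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured: zero out the 30% of individual weights with the smallest magnitude.
sparse_layer = nn.Linear(128, 64)
prune.l1_unstructured(sparse_layer, name="weight", amount=0.3)

# Structured: remove half of the output neurons (entire rows of the weight matrix),
# ranked by their L2 norm.
slim_layer = nn.Linear(128, 64)
prune.ln_structured(slim_layer, name="weight", amount=0.5, n=2, dim=0)

print((sparse_layer.weight == 0).float().mean())  # ~0.30, scattered zeros
print((slim_layer.weight == 0).float().mean())    # ~0.50, whole rows zeroed out
```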

▮ 4. Quantization

Quantization is a technique that changes the data type used to compute the neural network, typically to a lower-precision one, for faster inference.

After you’ve deployed your model, there is no need to back-propagate (which is the part that is sensitive to precision). This means that, if a slight decrease in precision is acceptable, we can compute the neural network at lower precision for faster inference by applying quantization.
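
As a back-of-the-envelope sketch of what “lowering the precision” means, here is one common way (an affine mapping; other schemes exist) to squeeze 32-bit floats into 8-bit integers and back:

```python
import torch

x = torch.randn(4) * 3                      # float32 values (weights or activations)

scale = (x.max() - x.min()) / 255           # width of one step on the 8-bit grid
zero_point = torch.round(-x.min() / scale)  # integer that represents the float 0.0

q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)   # 8-bit representation
x_hat = (q - zero_point) * scale            # dequantized approximation of x

print(x)
print(x_hat)                                # equal to x up to roughly scale / 2
```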

Methods

There are mainly 3 methods to apply quantization.

1. Dynamic Quantization

Apply quantization at run time. Only layers belonging to the set of types we pass to the quantization function are quantized (see the sketch after this list).

2. Post-Training Quantization

Apply quantization after training and before run time. Automatically applies quantization to all layers.

3. Quantization-Aware Training

Apply quantization at train time. This is the most tedious approach, but it gives the best accuracy.
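
For example, dynamic quantization in PyTorch (the framework choice is an assumption) takes a trained model and the set of layer types to quantize:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Only the layer types in the set are quantized; here the Linear layers
# get int8 weights, while activations are quantized on the fly at run time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)   # same interface as before, smaller and usually faster on CPU
```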