373. Knowledge Distillation

The Challenges

While training large models helps improve state-of-the-art performance, such cumbersome models often fail to meet performance requirements at inference time on real-world data, which makes them hard to deploy.

Knowledge distillation helps overcome these challenges by “distilling” the knowledge in a large “teacher” (or “parent”) model into a smaller “student” model that is much easier to deploy.

Knowledge

There are three different forms of knowledge:

  1. Response-Based Knowledge
    This focuses on the final output layer of the teacher model: the student model learns to mimic the teacher’s predictions (its logits or softened class probabilities).
  2. Feature-Based Knowledge
    The teacher’s intermediate layers learn to discriminate specific features, and these intermediate representations can be used to supervise the student model.
  3. Relation-Based Knowledge
    Beyond the knowledge in the output layer and the intermediate layers, the relationships between feature maps (for example, pairwise similarities) can also be used to train a student model. A minimal sketch of one loss per form follows this list.
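
To make the three forms concrete, here is a minimal PyTorch-style sketch of one common loss for each. The temperature value, the choice of layers, and the helper names are illustrative assumptions, not something prescribed by the article.

```python
import torch
import torch.nn.functional as F

def response_loss(student_logits, teacher_logits, T=4.0):
    """Response-based: match the teacher's softened output distribution via KL divergence."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 rescales gradients back to the scale of the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def feature_loss(student_feat, teacher_feat):
    """Feature-based: match an intermediate representation ("hint") of the teacher."""
    # Assumes the two feature maps already have the same shape; in practice a
    # small projection layer is often added to the student to align dimensions.
    return F.mse_loss(student_feat, teacher_feat)

def relation_loss(student_feats, teacher_feats):
    """Relation-based: match pairwise similarities between samples in a batch."""
    def similarity(x):
        x = x.flatten(start_dim=1)   # (batch, features)
        g = x @ x.t()                # (batch, batch) similarity matrix
        return F.normalize(g, p=2, dim=1)
    return F.mse_loss(similarity(student_feats), similarity(teacher_feats))
```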

Knowledge distillation is applied by minimizing a “distillation loss” that measures the discrepancy between the “teacher” model and the “student” model for the chosen form of knowledge.
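
As a sketch of how this minimization typically looks in practice, the student can be trained on a weighted sum of the usual hard-label loss and the response-based distillation loss. The weighting `alpha`, the temperature `T`, and the model/optimizer names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, images, labels,
                      T=4.0, alpha=0.5):
    """One training step: hard-label loss + soft-target distillation loss."""
    teacher.eval()
    with torch.no_grad():            # the teacher is frozen during distillation
        teacher_logits = teacher(images)

    student_logits = student(images)

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between softened student and teacher distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```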

Reference

Knowledge Distillation: Principles, Algorithms, Applications