The Challenges
While training large models helps improve state-of-the-art performance, deploying such cumbersome models often fails to meet performance requirements at inference time on real-world data.
Knowledge distillation helps overcome these challenges by “distilling” the knowledge in a large model (the “teacher” model) into a smaller model (the “student” model) that is much easier to deploy.
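As an illustration of the idea, here is a minimal sketch, assuming PyTorch, of the most common distillation objective: the student is trained to match the teacher’s temperature-softened output distribution. The function name and the temperature value are illustrative choices, not something prescribed by the referenced article.

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    # Soften both output distributions with temperature T, then measure how
    # far the student's distribution is from the teacher's.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # Scaling by T**2 keeps the gradient magnitude comparable to a hard-label loss.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)
```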
Knowledge
There are three different forms of knowledge:
- Response-Based Knowledge: This focuses on the final output layer of the teacher model. The hypothesis is that the student model will learn to mimic the predictions of the teacher model.
- Feature-Based Knowledge: The intermediate layers of the teacher learn to discriminate specific features, and this knowledge can be used to train a student model.
- Relation-Based Knowledge: In addition to the knowledge represented in the output layers and the intermediate layers of a neural network, knowledge that captures the relationships between feature maps can also be used to train a student model. Loss sketches for the feature-based and relation-based forms appear after this list.
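These forms of knowledge can be turned into training losses in several ways. Below is a minimal sketch of one common choice for the feature-based and relation-based forms, assuming PyTorch; the projection layer `proj` and the batch-wise similarity matrix used for the relation term are illustrative assumptions, not the only possible formulations.

```python
import torch.nn.functional as F

def feature_loss(student_feat, teacher_feat, proj):
    # Feature-based: match an intermediate student feature map to the teacher's.
    # `proj` is an assumed learnable layer (e.g. a 1x1 conv or nn.Linear) that
    # maps student features to the teacher's dimensionality.
    return F.mse_loss(proj(student_feat), teacher_feat)

def relation_loss(student_feat, teacher_feat):
    # Relation-based: match pairwise similarities between samples in the batch,
    # one common way to encode relationships between feature maps.
    def batch_similarity(feat):
        flat = feat.flatten(1)          # (batch, features)
        sim = flat @ flat.t()           # (batch, batch) similarity matrix
        return F.normalize(sim, p=2, dim=1)
    return F.mse_loss(batch_similarity(student_feat),
                      batch_similarity(teacher_feat))
```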
Knowledge distillation is applied by minimizing a “distillation loss” for each form of knowledge between the “teacher” model and the “student” model.
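For example, a single training step might minimize a weighted sum of the usual supervised loss and a response-based distillation term. The sketch below assumes PyTorch and reuses `soft_target_loss` from the earlier sketch; `alpha` and `T` are illustrative hyperparameters, not prescribed values.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, labels, optimizer, alpha=0.5, T=4.0):
    # The teacher is frozen and only provides targets for the student.
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    hard_loss = F.cross_entropy(student_logits, labels)              # ground-truth labels
    soft_loss = soft_target_loss(student_logits, teacher_logits, T)  # response-based term
    loss = alpha * hard_loss + (1 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Feature-based or relation-based terms can be added to the weighted sum in the same way when intermediate activations of both models are available.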
Reference
Knowledge Distillation: Principles, Algorithms, Applications