DistilBERT
Prior work
Earlier work has distilled a fine-tuned BERT into a Bi-LSTM student model.
Loss
The final training objective is a linear combination of the distillation loss \(L_{ce}\), the masked language modeling loss \(L_{mlm}\), and a cosine embedding loss \(L_{\cos}\): \(L = 5.0\, L_{ce} + 2.0\, L_{mlm} + 1.0\, L_{\cos}\).
\(L_{ce}\): KL divergence over the teacher's soft labels. The student is trained with a distillation loss over the soft target probabilities of the teacher: \(L_{ce}=\sum_{i} t_{i} \log \left(s_{i}\right)\), where \(t_{i}\) (resp. \(s_{i}\)) is a probability estimated by the teacher (resp. the student).
This objective results in a rich training signal by leveraging the full teacher distribution. Following Hinton et al. [2015] we used a softmax-temperature: \(p_{i}=\frac{\exp \left(z_{i} / T\right)}{\sum_{j} \exp \left(z_{j} / T\right)}\) where \(T\) controls the smoothness of the output distribution and \(z_{i}\) is the model score for the class \(i\).
The same temperature \(T\) is applied to the student and the teacher at training time, while at inference, \(T\) is set to 1 to recover a standard softmax.
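As a concrete illustration, here is a minimal PyTorch sketch of the distillation term \(L_{ce}\) with temperature scaling. It is not the official DistilBERT implementation; the temperature value and tensor names are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """Temperature-softened distillation loss over teacher soft targets."""
    # Soften both distributions with the same temperature T.
    s_log_probs = F.log_softmax(student_logits / T, dim=-1)  # log(s_i)
    t_probs = F.softmax(teacher_logits / T, dim=-1)          # t_i
    # sum_i t_i * (log t_i - log s_i): equivalent to -sum_i t_i log(s_i)
    # up to a constant in the student's parameters.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * (T ** 2)
```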
\(L_{mlm}\): cross-entropy against the hard labels, i.e. the standard masked language modeling loss.
\(L_{\cos}\): cosine embedding loss between the hidden-state vectors of the student and the teacher, which tends to align the directions of the student and teacher hidden-state vectors.
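A minimal PyTorch sketch of the cosine alignment term and the weighted combination above; it is not the official DistilBERT code. The names `student_hidden` and `teacher_hidden` (shape `(batch, seq_len, dim)`) are illustrative, and padding masks are ignored for brevity.

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(student_hidden: torch.Tensor,
                          teacher_hidden: torch.Tensor) -> torch.Tensor:
    """Push cosine similarity between student and teacher hidden states toward 1."""
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(s.size(0), device=s.device)  # label +1 => align directions
    return F.cosine_embedding_loss(s, t, target)

def total_loss(l_ce: torch.Tensor,
               l_mlm: torch.Tensor,
               l_cos: torch.Tensor) -> torch.Tensor:
    # L = 5.0 * L_ce + 2.0 * L_mlm + 1.0 * L_cos (weights from the text above)
    return 5.0 * l_ce + 2.0 * l_mlm + 1.0 * l_cos
```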