Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
This is the PyTorch implementation of sparse progressive distillation (SPD). For details on the motivation, techniques, and experimental results, refer to our paper here.
-
Environment Preparation (using Python 3)
pip install -r requirements.txt
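After installation, an optional sanity check can confirm that the interpreter and key packages are visible. The sketch below is not part of this repository: `check_environment` is a hypothetical helper, and the package names `torch` and `transformers` are assumptions based on the description above, since the contents of requirements.txt are not listed here.

```python
import importlib.util
import sys

# Assumed package names; adjust to match the actual requirements.txt.
ASSUMED_PACKAGES = ["torch", "transformers"]


def check_environment(packages=ASSUMED_PACKAGES, min_python=(3,)):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    # The README only asks for Python 3, so the default minimum is (3,).
    report = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING suggests rerunning the pip command above in the active environment.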
-
Dataset Preparation
The original GLUE dataset can be downloaded here.
We use a finetuned BERT_base as the teacher. For each task in the GLUE benchmark, we obtain the finetuned model using the original Hugging Face Transformers code with the following script.