Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

This is the PyTorch implementation of sparse progressive distillation (SPD). For details on the motivation, techniques, and experimental results, refer to our paper here.

Environment Preparation (using python3)
```
pip install -r requirements.txt
```
Dataset Preparation

The original GLUE dataset can be downloaded here.

We use a finetuned BERT_base model as the teacher. For each task in the GLUE benchmark, we obtain the finetuned model using the original Hugging Face Transformers code with the following script.
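As an illustration only, and not the authors' script, the Python sketch below assembles a command line for the `run_glue.py` example from Hugging Face Transformers, one GLUE task at a time. The checkpoint name `bert-base-uncased`, the hyperparameter values (taken from the Transformers example defaults), the helper name `build_finetune_command`, and the `teacher_models` output directory are all assumptions and may differ from the settings used for the paper.

```python
import shlex

# GLUE task names as accepted by the --task_name option of run_glue.py.
GLUE_TASKS = ("cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli")


def build_finetune_command(task, output_root="teacher_models"):
    """Return an argument list that finetunes a BERT base teacher on one GLUE task.

    Hypothetical helper: the checkpoint and hyperparameters below are
    assumptions, not values confirmed by the SPD authors.
    """
    task = task.lower()
    if task not in GLUE_TASKS:
        raise ValueError(f"unknown GLUE task: {task!r}")
    return [
        "python", "run_glue.py",
        "--model_name_or_path", "bert-base-uncased",  # assumed checkpoint
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        "--output_dir", f"{output_root}/{task}",  # one teacher per task
    ]


# Print a shell-ready command for a single task.
print(shlex.join(build_finetune_command("rte")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `run_glue.py`, so that each task gets its own teacher checkpoint directory.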
