Frustratingly Simple Pretraining Alternatives to Masked Language Modeling
This is the official implementation for “Frustratingly Simple Pretraining Alternatives to Masked Language Modeling” (EMNLP 2021).
Requirements
- torch
- transformers
- datasets
- scikit-learn
- tensorflow
- spacy
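A quick way to confirm these dependencies are importable is the minimal check below (a sketch only; exact version pins live in `requirements.txt`):

```python
# Minimal environment check: verifies the packages listed above are importable
# and prints their versions. Note that scikit-learn is imported as "sklearn".
import importlib

for name in ("torch", "transformers", "datasets", "sklearn", "tensorflow", "spacy"):
    module = importlib.import_module(name)
    print(f"{name}: {getattr(module, '__version__', 'unknown')}")
```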
How to pre-train
1. Clone this repository
```sh
git clone https://github.com/gucci-j/light-transformer-emnlp2021.git
```
2. Install required packages
```sh
cd ./light-transformer-emnlp2021
pip install -r requirements.txt
```
`requirements.txt` is located just under `light-transformer-emnlp2021`.
We also need spaCy’s `en_core_web_sm` model for preprocessing. If you have not installed it yet, please run `python -m spacy download en_core_web_sm`.
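If you prefer to handle the spaCy model from Python rather than the shell, a minimal sketch equivalent in effect to the command above is:

```python
# Minimal sketch: load en_core_web_sm, downloading it first if it is missing.
# Equivalent in effect to `python -m spacy download en_core_web_sm`.
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)  # confirm the pipeline is available
```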
3. Preprocess datasets
```sh
cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/
```
You need to specify the following argument:
- `path` (`str`): Where to save the processed data.
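For reference, the documented interface of `preprocess_roberta.py` is that single `--path` argument. The sketch below only illustrates this interface and is not the actual script:

```python
# Hedged sketch of the documented command-line interface only; the real
# preprocess_roberta.py performs the actual corpus preprocessing.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Preprocess the pre-training data.")
parser.add_argument("--path", type=str, required=True,
                    help="Where to save the processed data.")
args = parser.parse_args()

Path(args.path).mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
print(f"Processed data will be written under: {args.path}")
```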
4. Pre-training
You need to specify configs as command-line arguments. Sample configs for MLM pre-training are shown below; `python pretrainer.py --help` will display the help messages.
```sh
cd ../
python pretrainer.py \
    --data_dir=/path/to/dataset/ \
    --do_train \
    --learning_rate=1e-4 \
    --weight_decay=0.01 \
    --adam_epsilon=1e-8 \
    --max_grad_norm=1.0 \
    --num_train_epochs=1 \
    --warmup_steps=12774
```
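The optimizer-related flags above correspond to standard PyTorch / `transformers` settings. The sketch below is not the repository’s actual training loop (the model is a placeholder and the total step count is illustrative); it only shows roughly what each flag controls:

```python
# Hedged sketch of what the optimizer flags map to; not pretrainer.py itself.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # placeholder for the actual Transformer encoder

optimizer = AdamW(
    model.parameters(),
    lr=1e-4,            # --learning_rate
    eps=1e-8,           # --adam_epsilon
    weight_decay=0.01,  # --weight_decay
)

# --warmup_steps: linear warmup, then linear decay; the total number of
# training steps here is purely illustrative.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=12774, num_training_steps=127740
)

loss = model(torch.randn(8, 768)).pow(2).mean()  # dummy loss for the sketch
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # --max_grad_norm
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```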