Don’t Sweep Your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers
Abstract
Self-supervised pre-training of large-scale transformer models on text corpora, followed by fine-tuning, has achieved state-of-the-art results on a number of natural language processing tasks. Recently, Lu et al. (2021) claimed that frozen pretrained transformers (FPTs) match or outperform both training from scratch and unfrozen (fine-tuned) pretrained transformers on a set of transfer tasks to other modalities. In our work, we find that this result is, in fact, an artefact of not tuning the learning rates. After carefully redesigning the empirical setup, we