Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

The code is coming soon.

Figure 1: Pipeline of token-based pre-training.

Figure 2: The visualization of the proposed 5 tasks.
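
As a rough illustration of what such tasks might look like in code, the sketch below builds degraded inputs for three of the five task names (de-colorized, blurred, zoomed-in) with plain NumPy. It is not the authors' implementation: the function names `decolorize`, `box_blur`, and `zoom_in`, the luminance weights, the box-blur kernel, and the zoom factor are all assumptions chosen for simplicity, and the distorted and zoomed-out tasks are left out because their exact definitions are not given here.

```python
import numpy as np


def decolorize(img):
    """Replace RGB with a luminance map replicated over 3 channels (assumed weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = img @ weights  # (H, W, 3) @ (3,) -> (H, W)
    return np.repeat(gray[..., None], 3, axis=-1)


def box_blur(img, radius=1):
    """Average each pixel over its (2r+1) x (2r+1) neighborhood, with edge padding."""
    k = 2 * radius + 1
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def zoom_in(img, factor=2):
    """Center-crop by `factor`, then nearest-neighbor upsample back to the input size."""
    h, w = img.shape[:2]
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h  # source row for each output row
    cols = np.arange(w) * cw // w  # source column for each output column
    return crop[rows][:, cols]


# Hypothetical usage: a random float image in [0, 1) standing in for a real sample.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
degraded = {
    "de-colorized": decolorize(image),
    "blurred": box_blur(image),
    "zoomed-in": zoom_in(image),
}
```

Each degraded image keeps the shape of the original, so it can be split into patches in the same way as an unmodified input.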

All models are pre-trained for 300 epochs, using ViT-Base by default.

| | zoomed-in | zoomed-out | distorted | blurred | de-colorized |
|---|---|---|---|---|---|
| finetune | `82.7` | `82.5` | `82.1` | `81.8` | `81.4` |

| | zoomed-in (a) | mask (m) | (a)+(m) |
|---|---|---|---|
| finetune | `82.7` | `82.9` | `83.2` |

We note that the integrated version does not require