ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Code for the ICML 2021 (long talk) paper: “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”

[Figure]

Install

pip install -r requirements.txt
pip install -e .
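
To confirm the editable install succeeded, a quick import check can be run; this is a minimal sketch assuming the repository's top-level package is importable as vilt:

python -c "import vilt; print(vilt.__file__)"   # should print a path inside this repository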

Download Pretrained Weights

We provide five pretrained weights (a quick sanity check for the downloaded files is sketched after this list):

  1. ViLT-B/32 Pretrained with MLM+ITM for 200k steps on GCC+SBU+COCO+VG (ViLT-B/32 200k) link
  2. ViLT-B/32 200k finetuned on VQAv2 link
  3. ViLT-B/32 200k finetuned on NLVR2 link
  4. ViLT-B/32 200k finetuned on COCO IR/TR link
  5. ViLT-B/32 200k finetuned on F30K IR/TR link
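
The demo commands below take a local checkpoint path via load_path. Here is a minimal sketch of staging and sanity-checking a downloaded checkpoint, assuming the .ckpt files are placed under a weights/ directory (the path used in the example further down) and load as ordinary PyTorch checkpoints, possibly with a Lightning-style state_dict entry:

mkdir -p weights
# after downloading the desired .ckpt files from the links above into weights/:
python -c "import torch; c = torch.load('weights/vilt_200k_mlm_itm.ckpt', map_location='cpu'); sd = c.get('state_dict', c); print(len(sd), 'tensors loaded')"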

Out-of-the-box MLM + Visualization Demo

[Figure: MLM demo]

pip install gradio==1.6.4
python demo.py with num_gpus=<0 if you have no gpus else 1> load_path="/vilt_200k_mlm_itm.ckpt"

ex)
python demo.py with num_gpus=0 load_path="weights/vilt_200k_mlm_itm.ckpt"
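
For reference, the same command on a machine with a GPU, filling in the num_gpus placeholder shown above:

python demo.py with num_gpus=1 load_path="weights/vilt_200k_mlm_itm.ckpt"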

Out-of-the-box VQA Demo

[Figure: VQA demo]

pip install gradio==1.6.4
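
The original VQA demo invocation is not preserved in this copy. A plausible command, mirroring the MLM demo above and assuming a hypothetical demo_vqa.py entry point together with the VQAv2-finetuned checkpoint from the list above (script and file names are assumptions, not confirmed by this document):

# hypothetical script and checkpoint names
python demo_vqa.py with num_gpus=0 load_path="weights/vilt_vqa.ckpt"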