Seeing Out of tHe bOx
End-to-End Pre-training for Vision-Language Representation Learning [CVPR’21, Oral]
By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu
arXiv: https://arxiv.org/pdf/2104.03135.pdf
This is the official implementation of the paper. In this paper, we propose SOHO (“See Out of tHe bOx”), which takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.
Release Progress
Installation
conda create -n soho python=3.7
conda activate soho
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext
cd ../ && rm -rf apex
git clone https://github.com/researchmm/soho.git
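Once the environment steps above have run, a quick sanity check can confirm that PyTorch, torchvision and apex import cleanly and that a CUDA device is visible. The script below is only a minimal sketch and is not part of the SOHO repository: the helper names `check_environment` and `cuda_available` are hypothetical, and it assumes the packages are importable as `torch`, `torchvision` and `apex`.

```python
import importlib


def check_environment(packages=("torch", "torchvision", "apex")):
    """Return {package: version string or None} for each requested package.

    Hypothetical helper for illustration; not part of the SOHO code base.
    """
    report = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers a missing package as well as a broken binary build.
            report[name] = None
            continue
        # Not every package defines __version__, so fall back to "unknown".
        report[name] = getattr(module, "__version__", "unknown")
    return report


def cuda_available():
    """Return True only if PyTorch imports and reports a usable CUDA device."""
    try:
        import torch
    except Exception:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
    print("CUDA available:", cuda_available())
```

Run it inside the activated `soho` environment. A package reported as `NOT INSTALLED`, or `CUDA available: False` on a GPU machine, usually means one of the install steps above should be repeated.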