VinVL: Advancing the state of the art for vision-language models
Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input. One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to learn effectively from multimodal data, such as vision and language, to make sense of the world around us. For example, vision-language (VL) systems can search for images relevant to a text query (or vice versa) and describe the content of an image in natural language.
As illustrated in Figure 1, a typical VL system uses a modular architecture consisting of two modules to achieve VL understanding:
- An image encoding module, also known as a visual feature extractor,