Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning
Understanding video is one of the most challenging problems in AI. An important underlying requirement is learning multimodal representations that capture information about objects, actions, and sounds, along with their long-range statistical dependencies, from audio-visual signals. Recently, transformers have been successful in vision-and-language tasks such as image captioning and visual question answering because of their ability to learn multimodal contextual representations. However, training multimodal transformers end-to-end is difficult because of their excessive memory requirements; in fact, most existing vision-and-language transformers rely on pretrained language transformers for successful training.
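To make the memory pressure concrete, here is a minimal sketch, not the paper's architecture, that counts the trainable parameters of a typical multimodal design built from three BERT-base-sized encoders, one per modality plus one cross-modal encoder. The dimensions, layer counts, and module names are illustrative assumptions.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Illustrative BERT-base-sized configuration (hypothetical, not the paper's exact setup).
d_model, n_heads, ffn_dim, n_layers = 768, 12, 3072, 12

def make_encoder() -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model, n_heads, dim_feedforward=ffn_dim, batch_first=True
    )
    # nn.TransformerEncoder deep-copies the layer, so each of the
    # n_layers layers holds its own independent weights.
    return nn.TransformerEncoder(layer, num_layers=n_layers)

# A common multimodal design stacks three such encoders:
# one per modality (audio and visual) plus one cross-modal encoder.
audio_enc, visual_enc, fusion_enc = make_encoder(), make_encoder(), make_encoder()

total = sum(count_parameters(m) for m in (audio_enc, visual_enc, fusion_enc))
print(f"~{total / 1e6:.0f}M trainable parameters")  # roughly 255M
```

During end-to-end training, each of those parameters also carries a gradient and optimizer state, and activation memory grows with the length of the audio-visual input sequence, which is why such models are hard to fit in GPU memory without starting from a pretrained language transformer.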
Today, in collaboration with NVIDIA Research, we are excited to announce our work: “Parameter Efficient Multimodal Transformers for Video Representation Learning.” In this paper,