ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos
The natural association between visual observations and their corresponding sounds provides powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online videos often provide imperfectly aligned audio-visual signals because of overdubbed audio; models trained on uncurated videos have been shown to learn suboptimal representations due to such misalignment. Therefore, existing approaches rely almost exclusively on manually curated datasets with a predetermined taxonomy of semantic concepts, where there is a high chance of audio-visual correspondence. This severely limits the utility of online videos for self-supervised learning, raising the question: How can we fully leverage online videos without extensive human effort?