Facestar dataset: High-quality audio-visual recordings of human conversational speech
Description

Existing audio-visual datasets for human speech are either captured in a clean, controlled environment but contain only a small amount of speech data without natural conversations, or are collected in the wild with unreliable audio quality, interfering sounds, low face resolution, and unreliable or occluded lip motion. The Facestar dataset aims to enable research on audio-visual modeling in a large-scale and high-quality setting.

Core dataset features:
- 10 hours of high-quality audio-visual speech data
- audio recordings in a quiet environment at 16 kHz
- […]