Facestar dataset. High-quality audio-visual recordings of human conversational speech
Description
Existing audio-visual datasets for human speech fall into two groups. Datasets captured in clean, controlled environments contain only small amounts of speech data and lack natural conversations; datasets collected in the wild suffer from unreliable audio quality, interfering sounds, low face resolution, and unreliable or occluded lip motion.
The Facestar dataset aims to enable research on audio-visual modeling in a large-scale and high-quality setting. Core dataset features:
- 10 hours of high-quality audio-visual speech data
- audio recorded in a quiet environment at 16 kHz
- video at 1300 x 1600 resolution and 60 frames per second
- one female and one male speaker
- natural speech: all data is conversational speech in a video-conferencing setup
- full face visibility: speakers are facing the camera while talking
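As a minimal sketch of how the specs above could be used in practice, the snippet below validates per-clip metadata against the stated recording properties. The metadata field names and the `validate_clip` helper are assumptions for illustration, not part of the dataset's actual tooling:

```python
# Expected recording properties taken from the dataset description above.
EXPECTED = {
    "sample_rate_hz": 16_000,  # audio sample rate
    "width": 1300,             # video width in pixels
    "height": 1600,            # video height in pixels
    "fps": 60,                 # video frame rate
}

def validate_clip(meta: dict) -> list[str]:
    """Return a list of mismatches between a clip's metadata and the spec."""
    return [
        f"{key}: expected {expected}, got {meta.get(key)}"
        for key, expected in EXPECTED.items()
        if meta.get(key) != expected
    ]

# Hypothetical metadata for one clip (field names are assumptions).
clip = {"sample_rate_hz": 16_000, "width": 1300, "height": 1600, "fps": 60}
print(validate_clip(clip))  # [] -> clip matches the spec
```

A check like this is useful when downsampled or re-encoded copies of the recordings circulate alongside the originals.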
See the paper for more details.