This model addresses the challenge of limited audio-visual data by encoding long-term audio context and predicting animated 3D face meshes. The use of self-supervised pre-trained speech ...