This model addresses the challenge of limited audio-visual data by encoding long-term audio context and predicting animated 3D face meshes. The use of self-supervised pre-trained speech ...