Controlled setup: same base model, same audio, same transcript. Only the prompt/control layer differs.

Comparison panels:

- Baseline
- Realistic animation prompt with AI audio, enhanced with Animini
- Pixar/Disney-styled animation with real-voice audio prompt, enhanced with Animini

- Baseline
- Prompt with AI audio, enhanced with Animini
- Real-voice audio prompt, enhanced with Animini
Speech-driven facial animation has a long lineage, from early audio-to-control approaches to modern neural and 3D face models conditioned on speech audio. Yet character performance in dialogue close-ups frequently fails in predictable ways:
The result: faces that are technically animated but feel lifeless, even when lip sync is correct.
Instead of retraining models, this work introduces instruction-level controls that translate an animator’s diagnosis loop into structured constraints:
Control Primitives:
This fits a broader pattern in generative modeling: prompt/instruction changes and auxiliary conditioning can steer pretrained models without full retraining, as seen in instruction-following diffusion editing, prompt-to-prompt attention control, and conditional control modules such as ControlNet.
| Term | Definition |
|---|---|
| Beat | A change in intention/thought the audience should notice (new idea, realization, emotional turn) |
| Accent | A motion peak that marks a beat or stressed word (head nod/tilt, brow pop, lid change, jaw “hit”) |
| Hold | Intentional stillness (or near-stillness) that creates contrast and makes the next accent readable |
| Even timing | Accents at near-uniform intervals with low stress/unstress contrast |
| Micro-motion noise | Continuous small movement that does not communicate intention |
Four metrics quantify temporal structure without claiming “objective Disney”:
| Metric | Abbr. | What it measures | Goal |
|---|---|---|---|
| Accent Alignment Error | AAE | Distance between motion peaks and stress anchors | ↓ Lower |
| Even Timing Score | ETS | Coefficient of variation of accent gaps | ↑ Higher (less uniform) |
| Hold Ratio | HR | % frames below motion threshold | ↑ Higher (more stillness) |
| Contrast Ratio | CR | Motion magnitude on stressed vs unstressed words | ↑ Higher |
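The four metrics above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the function names, the 0.05 stillness threshold, and the input formats (peak/anchor times in seconds, per-frame motion magnitudes, a boolean stressed-frame mask) are choices made here, not the project's actual code.

```python
# Hypothetical sketch of the four timing metrics (AAE, ETS, HR, CR).
# Thresholds and input conventions are illustrative assumptions.
import numpy as np

def accent_alignment_error(peak_times, stress_times):
    """AAE: mean distance (s) from each motion peak to its nearest stress anchor. Lower is better."""
    anchors = np.asarray(stress_times, dtype=float)
    return float(np.mean([np.min(np.abs(anchors - p)) for p in peak_times]))

def even_timing_score(peak_times):
    """ETS: coefficient of variation of inter-accent gaps. Higher means less uniform timing."""
    gaps = np.diff(np.sort(np.asarray(peak_times, dtype=float)))
    return float(np.std(gaps) / np.mean(gaps))

def hold_ratio(motion, threshold=0.05):
    """HR: fraction of frames whose motion magnitude falls below a stillness threshold."""
    return float(np.mean(np.asarray(motion, dtype=float) < threshold))

def contrast_ratio(motion, stressed_mask):
    """CR: mean motion on stressed frames divided by mean motion on unstressed frames."""
    m = np.asarray(motion, dtype=float)
    s = np.asarray(stressed_mask, dtype=bool)
    return float(m[s].mean() / m[~s].mean())
```

For example, motion peaks at 0.1 s, 1.0 s, and 2.05 s against stress anchors at 0.0 s, 1.0 s, and 2.0 s give an AAE of 0.05 s, and perfectly metronomic accents give an ETS of 0 (maximally even timing).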
Controlled Comparison Design:
This isolation supports causal attribution: any improvement is attributable to the control layer rather than to confounding variables.
Human Evaluation: Paired A/B (baseline vs control), randomized order, identical audio, raters blinded. Report preference rate + rubric deltas with uncertainty estimates.
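One concrete way to report "preference rate with uncertainty estimates" is a Wilson score interval over paired A/B wins. A minimal sketch, assuming a simple wins-out-of-n tally; the function name and the 95% z-value are choices made here, not part of the protocol above:

```python
# Illustrative: preference rate with a Wilson score confidence interval.
import math

def preference_rate_ci(wins, n, z=1.96):
    """Return (rate, lo, hi): observed preference rate and its ~95% Wilson interval."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, center - half, center + half
```

For instance, 34 wins out of 50 paired trials gives a 68% preference rate with an interval of roughly 54% to 79%, which makes clear whether the preference is distinguishable from chance (50%).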
@misc{chen2026prosody,
title = {Prompt-Level Controls for Prosody-Aligned Facial Performance},
author = {Alice Chen},
year = {2025},
howpublished = {\url{https://floweralicee.github.io/lipsync-ai-demo}}
}