Prompt-Level Controls for Prosody-Aligned Facial Performance

TL;DR: Speech-driven facial animation often feels unnatural even with correct lip sync because accents hit at uniform intervals with low stress contrast. I introduce a prompt-level control layer that snaps accents to speech emphasis and adds intentional holds, without retraining the model.

Controlled setup: Same base model, same audio, same transcript. Only the prompt/control layer differs.


Baseline

Realistic animation prompt with AI audio enhanced with Animini

Pixar/Disney styled animation with real voice audio prompt enhanced with Animini


The Problem: Even Timing

Speech-driven facial animation has a long lineage, from early audio-to-control approaches to modern neural and 3D face models conditioned on speech audio. Yet character performance in dialogue close-ups frequently fails in predictable ways:

- Accents land at near-uniform intervals instead of on speech emphasis
- Stressed and unstressed words receive similar motion magnitude (low contrast)
- Continuous micro-motion noise replaces intentional holds

The result: faces that are technically animated but feel lifeless, even when lip sync is correct.
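The even-timing failure can be made concrete with a toy example. A minimal sketch (the accent timestamps below are made up for illustration): when accents are metronomic, the coefficient of variation of the inter-accent gaps collapses toward zero, whereas speech-like emphasis produces irregular gaps and a higher value.

```python
import numpy as np

# Hypothetical accent timestamps (seconds). "Even timing" = near-uniform gaps.
uniform_accents = np.array([0.5, 1.0, 1.5, 2.0, 2.5])  # metronomic
natural_accents = np.array([0.4, 0.7, 1.6, 1.9, 3.1])  # speech-like emphasis

def gap_cv(accent_times):
    """Coefficient of variation (std/mean) of inter-accent gaps."""
    gaps = np.diff(accent_times)
    return float(gaps.std() / gaps.mean())

print(gap_cv(uniform_accents))  # 0.0 for perfectly even timing
print(gap_cv(natural_accents))  # noticeably larger for irregular timing
```

The gap between these two numbers is exactly what the Even Timing Score below is designed to measure.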


The Method: Prompt-Level Control Layer

Instead of retraining models, this work introduces instruction-level controls that translate an animator's diagnosis loop into structured constraints.

Control Primitives:

- Accent snapping: align motion peaks to stress anchors in the speech
- Intentional holds: insert near-stillness between accents so the next accent reads clearly

This fits a broader pattern in generative modeling: prompt/instruction changes and auxiliary conditioning can steer pretrained models without full retraining, as seen in instruction-following diffusion editing, prompt-to-prompt attention control, and conditional control modules such as ControlNet.
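One way to picture the control layer's output is as a small structured spec handed to the generation step. The names below (`AccentConstraint`, `HoldConstraint`, `ControlSpec`) and their fields are illustrative assumptions, not the actual schema used in this work:

```python
from dataclasses import dataclass, field

@dataclass
class AccentConstraint:
    anchor_time: float       # stressed-word onset, e.g. from forced alignment
    channel: str             # e.g. "head_nod", "brow_pop", "jaw_hit"
    magnitude: float = 1.0   # relative stress intensity

@dataclass
class HoldConstraint:
    start: float             # seconds
    end: float
    max_motion: float = 0.05  # near-stillness threshold (placeholder value)

@dataclass
class ControlSpec:
    accents: list = field(default_factory=list)
    holds: list = field(default_factory=list)

# A hypothetical spec: one accent snapped to a stress anchor, one hold after it.
spec = ControlSpec(
    accents=[AccentConstraint(anchor_time=1.32, channel="head_nod")],
    holds=[HoldConstraint(start=2.0, end=2.6)],
)
```

The key property is that the constraints are declarative: the same spec can condition any backbone that accepts prompt- or instruction-level steering.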


Key Terms

| Term | Definition |
| --- | --- |
| Beat | A change in intention/thought the audience should notice (new idea, realization, emotional turn) |
| Accent | A motion peak that marks a beat or stressed word (head nod/tilt, brow pop, lid change, jaw “hit”) |
| Hold | Intentional stillness (or near-stillness) that creates contrast and makes the next accent readable |
| Even timing | Accents at near-uniform intervals with low stress/unstress contrast |
| Micro-motion noise | Continuous small movement that does not communicate intention |

Proxy Metrics

Four metrics quantify temporal structure without claiming “objective Disney”:

| Metric | Abbr. | What it measures | Goal |
| --- | --- | --- | --- |
| Accent Alignment Error | AAE | Distance between motion peaks and stress anchors | ↓ Lower |
| Even Timing Score | ETS | Coefficient of variation of accent gaps | ↑ Higher (less uniform) |
| Hold Ratio | HR | % of frames below motion threshold | ↑ Higher (more stillness) |
| Contrast Ratio | CR | Motion magnitude on stressed vs. unstressed words | ↑ Higher |
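The four metrics are simple enough to sketch directly. The implementations below are illustrative, assuming a per-frame scalar motion-magnitude signal, detected motion-peak times, stress-anchor times from prosody analysis, and per-word stressed/unstressed motion samples; the hold threshold is a placeholder:

```python
import numpy as np

def aae(peak_times, anchor_times):
    """Accent Alignment Error: mean distance from each motion peak
    to its nearest stress anchor (lower is better)."""
    anchors = np.asarray(anchor_times)
    return float(np.mean([np.min(np.abs(anchors - p)) for p in peak_times]))

def ets(peak_times):
    """Even Timing Score: coefficient of variation of inter-accent gaps
    (higher = less uniform, i.e. less metronomic)."""
    gaps = np.diff(peak_times)
    return float(gaps.std() / gaps.mean())

def hold_ratio(motion, threshold=0.05):
    """Hold Ratio: fraction of frames below a motion threshold."""
    motion = np.asarray(motion)
    return float((motion < threshold).mean())

def contrast_ratio(stressed_motion, unstressed_motion):
    """Contrast Ratio: mean motion magnitude on stressed words
    divided by mean motion magnitude on unstressed words."""
    return float(np.mean(stressed_motion) / np.mean(unstressed_motion))
```

In practice the peak and anchor times would come from a motion-peak detector and a forced aligner respectively; the metrics themselves stay model-agnostic.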

Evaluation Protocol

Controlled Comparison Design: same base model, same audio, same transcript; only the prompt/control layer differs between conditions.

This isolation allows causal attribution: improvements come from the control layer, not confounding variables.

Human Evaluation: Paired A/B (baseline vs control), randomized order, identical audio, raters blinded. Report preference rate + rubric deltas with uncertainty estimates.
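For the preference rate's uncertainty estimate, one standard choice is a 95% Wilson score interval for a binomial proportion. A minimal sketch (the trial counts below are made-up placeholders, not results from this work):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. raters prefer the control-layer output in 36 of 48 paired A/B trials
lo, hi = wilson_interval(36, 48)
print(f"preference rate 0.75, 95% CI ({lo:.2f}, {hi:.2f})")
```

The Wilson interval behaves better than the naive normal approximation at small sample sizes and near 0 or 1, which matters for modest rater pools.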


Citation

@misc{chen2025prosody,
  title   = {Prompt-Level Controls for Prosody-Aligned Facial Performance},
  author  = {Alice Chen},
  year    = {2025},
  howpublished = {\url{https://floweralicee.github.io/lipsync-ai-demo}}
}