Prompt-Level Controls for Prosody-Aligned Facial Performance
floweralicee.github.io/lipsync-ai-demo

Abstract
Speech-driven facial performance often feels unnatural even when mouth shapes match phonemes (Suwajanakorn et al., 2017; Prajwal et al., 2020). A common failure mode is even timing: motion accents land at near-uniform intervals with low contrast between stressed and unstressed words, so the face looks "busy" but nothing lands. This paper frames weak animation choices as operational sub-problems (beats, timing, spacing/noise, contrast, eye intention, staging) and focuses on the subset most responsible for perceived intent in dialogue closeups: rhythm of emphasis, holds, and stress contrast. The method introduces a prompt-level acting control layer (instruction structure, not weight updates) and evaluates it under controlled comparisons (same audio, same base pipeline; only the instructions change). Four proxy metrics quantify temporal structure: Accent Alignment Error (AAE), Even Timing Score (ETS), Hold Ratio (HR), and Contrast Ratio (CR). A case study provides an auditable stress-timestamp annotation table derived from the audio waveform, plus a rubric and an iteration loop that convert subjective "taste" into measurable progress.
1. Introduction
Modern video systems can produce impressive lighting, environments, and camera aesthetics. Yet character performance in dialogue closeups frequently fails in predictable ways: timing is flat, stillness is missing, accents do not land on meaning, and facial motion becomes constant low-amplitude noise. The result is often perceived as "dead" even when lip sync is correct.
Speech-driven facial animation has a long lineage, from early audio-to-control approaches to modern neural and 3D face models conditioned on speech audio (Brand, 1999; Cudeiro et al., 2019; Thies et al., 2020; Zhou et al., 2020).
This paper targets a specific, repeatable defect—even timing—and asks a narrow question:
Can instruction-level controls reliably shift facial motion toward stress-aligned accents and readable holds, under controlled comparisons, without changing model weights?
1.1 Problem Decomposition: Operationalizing "Weak Animation Choices"
To make "this animation doesn't look good" actionable, the work decomposes perceived quality into separable sub-problems, each mapped to (i) an annotation target, (ii) a controllable instruction primitive, or (iii) a metric.
How this decomposition was derived.
This list did not come from theory first; it came from repeated diagnosis. The author developed the decomposition through scene-by-scene performance analysis of thousands of shots from feature films (e.g., Zootopia, Frozen, Soul) and by comparing that benchmark against hundreds of student shots that failed in consistent ways (Buck & Lee, 2013; Docter & Powers, 2020; Howard & Moore, 2016). Across these comparisons, "bad" outputs that looked different on the surface collapsed into a small set of recurring failure modes: unclear beats, uniform rhythm, missing holds, excess micro-motion noise, drifting eye intention, and poor staging/silhouette. In other words, the list below is a practical attempt to translate production principles into a diagnostic framework consistent with classic animation guidance (Thomas & Johnston, 1981/1995; Williams, 2001/2009).
Decomposed sub-problems.
- A. Beat clarity (staging + beats): Are intention changes readable as discrete beats?
  Label: beat boundaries. Rubric: beat readability.
- B. Timing (rhythm of emphasis): Do accents land on stressed words / beat boundaries rather than at uniform spacing?
  Labels: stress anchors. Metrics: AAE, ETS.
- C. Spacing / micro-motion noise: Is motion dominated by jitter that adds busyness without intent?
  Metric: HR (with noise thresholds). Rubric: noise vs. intent.
- D. Contrast (stillness vs. movement): Are there holds that make beats readable?
  Metrics: HR, CR. Rubric: hold placement.
- E. Eye path / intention: Do eyes lead thought changes, or drift without purpose?
  Label: gaze shift events. Rubric: eye-leads-thought.
- F. Silhouette / readability (for wider shots): Does a single frame communicate the beat?
  Rubric item: pose/silhouette (optional in closeup-only evaluation).
Scope: This paper focuses on B + D (timing and contrast) and tracks A/E as rubric items for future expansion.
2. Key Terms
- Beat: a change in intention/thought the audience should notice (new idea, realization, emotional turn).
- Accent: a motion peak that marks a beat or a stressed word (head nod/tilt, brow pop, lid change, jaw "hit").
- Hold: intentional stillness (or near-stillness) that creates contrast and makes the next accent readable.
- Even timing: accents occur at near-uniform intervals with low stress/non-stress contrast.
- Micro-motion noise: continuous small movement that does not communicate intention (wiggle that reduces readability).
3. Benchmark: Feature-Film Performance Rubric (v0.1)
Instead of claiming "cinematic," the paper uses a practical rubric aligned with what viewers perceive in dialogue closeups. The rubric is informed by classic animation principles emphasizing clarity, timing, staging, and readable intention (Thomas & Johnston, 1981/1995; Williams, 2001/2009).
| Dimension | Description |
|---|---|
| Beat readability | Can a viewer point to the thought changes? |
| Emphasis correctness | Do accents land on stressed words? |
| Hold placement | Is stillness used to make beats readable? |
| Micro-motion noise (reverse) | Is there purposeless jitter? (lower score = more noise) |
| Eye intention (optional) | Do eyes lead thought changes? |
4. Method: A Prompt-Level "Acting Control Layer"
The method translates an animator's diagnosis loop into structured instruction constraints.
Animator diagnosis loop (conceptual): watch the shot, identify the dominant failure mode (Section 1.1), map it to a control primitive, apply the change, and re-evaluate.
Control primitives (instruction-level):
- Rhythm of emphasis: Assign accents to stressed words and beat boundaries; suppress filler-word motion.
- Holds: Insert short stillness before key beats to increase contrast and readability.
- Stress contrast: Amplify motion near stressed words, reduce motion elsewhere, and avoid constant micro-motion.
Key idea.
These controls live in instruction structure (prompt-level). No weight updates are required, enabling rapid ablations.
This fits a broader pattern in generative modeling: prompt/instruction changes and auxiliary conditioning can steer pretrained models without full retraining, as seen in instruction-following diffusion editing (Brooks et al., 2023), prompt-to-prompt attention control (Hertz et al., 2023), and conditional control modules such as ControlNet (Zhang et al., 2023).
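To make "instruction structure" concrete, one plausible encoding is a small structured object that a prompt is rendered from. The field names and the `render_instruction` helper below are hypothetical illustrations, not the paper's actual interface; the example values come from the case-study table in Section 5.3.

```python
from dataclasses import dataclass, field

@dataclass
class AccentSpec:
    word: str        # stressed word the accent should land on
    time_s: float    # target peak time in seconds

@dataclass
class HoldSpec:
    start_s: float   # beginning of the near-stillness window
    end_s: float     # end of the near-stillness window

@dataclass
class ActingControlLayer:
    accents: list = field(default_factory=list)  # rhythm of emphasis
    holds: list = field(default_factory=list)    # contrast via stillness
    suppress_filler: bool = True                 # keep filler words quiet

def render_instruction(layer: ActingControlLayer) -> str:
    """Render the control layer as plain instruction text (prompt-level)."""
    lines = ["Accent timing (place a single clear motion peak at each):"]
    lines += [f"- '{a.word}' near {a.time_s:.2f}s" for a in layer.accents]
    lines.append("Holds (near-stillness, no micro-motion):")
    lines += [f"- {h.start_s:.2f}s to {h.end_s:.2f}s" for h in layer.holds]
    if layer.suppress_filler:
        lines.append("Keep unstressed/filler words visually quiet.")
    return "\n".join(lines)

# Example using two anchors and one silence window from Section 5
layer = ActingControlLayer(
    accents=[AccentSpec("don't", 0.835), AccentSpec("ME", 4.505)],
    holds=[HoldSpec(1.472, 1.700)],
)
print(render_instruction(layer))
```

Because the whole layer lives in the rendered text, an ablation is just a change to one field followed by re-rendering, which is what enables the one-factor-at-a-time loop of Section 8.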
5. Data & Annotation (Auditable, Lightweight)
This paper uses a one-clip case study designed to be reproducible in a single day, even without forced alignment tooling.
5.1 Case Study Transcript
"You don't understand, I am a celebrity, it is all about me, it has been for decades, hahaha, that is the point of celebrity."
5.2 Annotation Schema
Minimal labels that capture performance structure (not aesthetic preference):
- Stress anchors (word-level timestamps)
- Beat boundaries (timestamps for intention shifts)
- Pause/inhale windows (silence gaps)
- Optional acting events: gaze shifts, blink clusters, laughter burst, hold segments
5.3 Stress-Timestamp Table (Derived from Audio Waveform)
These anchors are approximate (audio energy/prosody peaks + silence segmentation), intended to be auditable and "good enough" for controlled proxy metrics. Peak time is the recommended target for the primary accent.
| Token (anchor) | Type | Window (s) | Peak (s) |
|---|---|---|---|
| don't | stress | 0.715–0.955 | 0.835 |
| I | stress | 1.700–1.845 | 1.725 |
| celebriTY (first) | stress | 2.285–2.525 | 2.405 |
| ME | stress | 4.385–4.608 | 4.505 |
| deCADES | stress | 6.005–6.245 | 6.125 |
| hahaha | event (laugh) | 6.445–6.671 | 6.565 |
| POINT | stress | 7.215–7.455 | 7.335 |
| celebriTY (final) | stress | 8.445–8.685 | 8.565 |
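A minimal sketch of how such anchors and silence gaps could be derived from a waveform using short-time energy (numpy only). The frame sizes, thresholds, and peak rule here are illustrative assumptions, not the exact procedure used for the table above.

```python
import numpy as np

def short_time_energy(x, sr, frame_ms=25, hop_ms=10):
    """RMS energy per frame of a mono waveform x at sample rate sr."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    e = np.array([np.sqrt(np.mean(x[i*hop:i*hop+frame] ** 2)) for i in range(n)])
    t = np.arange(n) * hop / sr  # frame start times in seconds
    return t, e

def silence_windows(t, e, thresh_ratio=0.1, min_dur=0.12):
    """Contiguous low-energy spans longer than min_dur seconds (Section 5.4)."""
    low = e < thresh_ratio * e.max()
    spans, start = [], None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = t[i]
        elif not flag and start is not None:
            if t[i] - start >= min_dur:
                spans.append((start, t[i]))
            start = None
    if start is not None and t[-1] - start >= min_dur:
        spans.append((start, t[-1]))
    return spans

def energy_peaks(t, e, thresh_ratio=0.5):
    """Local maxima above a fraction of the global max: stress-anchor candidates."""
    idx = [i for i in range(1, len(e) - 1)
           if e[i] > e[i-1] and e[i] >= e[i+1] and e[i] > thresh_ratio * e.max()]
    return [t[i] for i in idx]
```

In practice these candidates would be checked against the transcript by hand, or replaced with forced alignment plus prosody features as proposed in Section 10.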
5.4 Pause/Inhale Candidates (Silence Gaps)
| Silence Window (s) | Duration | Use |
|---|---|---|
| 1.472–1.700 | 228 ms | inhale / pre-"I" beat |
| 2.858–3.400 | 542 ms | inhale / beat reset |
| 4.608–4.767 | 159 ms | micro-pause (pre-"decades" ramp) |
| 6.671–6.877 | 206 ms | pause (post-laugh) |
| 7.546–7.670 | 124 ms | micro-pause (pre-final) |
5.5 Beat Boundary Proposal (For This Line)
| Beat | Description | Approx. Time (s) |
|---|---|---|
| B1 | dismissal: "you don't understand" | ~0.63–1.27 |
| B2 | self-assertion: "I am a celebrity" | ~1.69–2.88 |
| B3 | grandiose peak: "it is all about me" | ~3.39–4.62 |
| B4 | history claim + crack: "it has been for decades" | ~4.76–6.46 |
| B5 | mask slip: "hahaha" | ~6.54–6.69 |
| B6 | strained regain control: "that is the point of celebrity" | ~6.87–8.77 |
6. Evaluation Protocol
6.1 Controlled Comparison Design
To claim causality, evaluation must hold everything constant except instruction structure:
- Same audio
- Same base generation pipeline/model
- Same sampling/seed where possible (if not, report variability across N runs)
- Only change: baseline instruction vs control-layer instruction
6.2 Proxy Metrics
These metrics do not claim "objective Disney." They quantify whether the output better matches the intended structure (emphasis + holds + contrast).
6.2.1 Accent Alignment Error (AAE)
Lower is better.
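The section does not pin down a formula; one plausible formalization, given stress anchors $a_i$ ($i = 1, \dots, N$) and detected motion peaks $p_j$, is the mean distance from each anchor to its nearest peak:

$$\mathrm{AAE} = \frac{1}{N} \sum_{i=1}^{N} \min_{j} \left| a_i - p_j \right|$$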
6.2.2 Even Timing Score (ETS)
Compute accent gaps $\Delta t$ between consecutive motion peaks.
Low CV ⇒ too even ⇒ worse. Higher CV (within reason) ⇒ better rhythm.
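One formalization consistent with this description, with gaps $\Delta t_k = t_{k+1} - t_k$ between consecutive motion peaks, is the coefficient of variation:

$$\mathrm{ETS} = \mathrm{CV}(\Delta t) = \frac{\sigma(\Delta t)}{\mu(\Delta t)}$$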
6.2.3 Hold Ratio (HR)
Higher is better up to a point.
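A plausible definition, assuming a motion-energy curve $E(t)$, a stillness threshold $\theta$, and clip duration $T$, is the fraction of time spent below threshold:

$$\mathrm{HR} = \frac{1}{T} \int_0^T \mathbf{1}\!\left[ E(t) < \theta \right] dt$$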
6.2.4 Contrast Ratio (CR)
Higher is better (within reason).
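A plausible definition, assuming the stressed-word windows $S$ from the anchor table (Section 5.3) and the same motion-energy curve $E(t)$:

$$\mathrm{CR} = \frac{\operatorname{mean}_{t \in S} E(t)}{\operatorname{mean}_{t \notin S} E(t)}$$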
| Metric | Abbreviation | Desired Direction |
|---|---|---|
| Accent Alignment Error | AAE | ↓ (lower is better) |
| Even Timing Score (CV) | ETS | ↑ (higher CV = less uniform) |
| Hold Ratio | HR | ↑ (more stillness) |
| Contrast Ratio | CR | ↑ (more stress differentiation) |
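As an executable sketch, all four metrics can be computed from a 1-D motion-energy curve sampled at times `t`. The peak rule, stillness threshold, and window handling below are illustrative assumptions rather than the paper's fixed choices.

```python
import numpy as np

def motion_peaks(t, energy, thresh_ratio=0.5):
    """Times of local maxima above a fraction of the global max."""
    return [t[i] for i in range(1, len(energy) - 1)
            if energy[i] > energy[i-1] and energy[i] >= energy[i+1]
            and energy[i] > thresh_ratio * energy.max()]

def aae(anchor_times, peak_times):
    """Mean distance (s) from each stress anchor to its nearest motion peak."""
    return float(np.mean([min(abs(a - p) for p in peak_times)
                          for a in anchor_times]))

def ets_cv(peak_times):
    """Coefficient of variation of inter-peak gaps; low CV = too even."""
    gaps = np.diff(peak_times)
    return float(np.std(gaps) / np.mean(gaps))

def hold_ratio(energy, still_ratio=0.1):
    """Fraction of frames whose motion energy is below a stillness threshold."""
    return float(np.mean(energy < still_ratio * energy.max()))

def contrast_ratio(t, energy, stressed_windows):
    """Mean motion energy inside stressed windows vs. outside them."""
    inside = np.zeros(len(t), dtype=bool)
    for a, b in stressed_windows:
        inside |= (t >= a) & (t <= b)
    return float(energy[inside].mean() / energy[~inside].mean())
```

A synthetic curve with sharp bumps exactly on the anchors would score low AAE, low CV (if the bumps are evenly spaced), high HR, and high CR, which matches the desired directions in the table above only after the rhythm is made uneven.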
6.3 Human Evaluation Protocol (Blinded A/B + Rubric)
- Design: Paired A/B (baseline vs control), randomized order, identical audio, raters blinded.
- Task: Preference + rubric scores (Section 3).
- Report: Preference rate + rubric deltas with uncertainty estimates.
6.4 Inter-Rater Reliability
- Preference agreement: Fleiss' κ or Cohen's κ
- Rubric consistency: ICC(2,k) or Krippendorff's α
- Calibration: A small "gold" set aligns rubric interpretation across raters; calibration trials are excluded from final reporting.
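As a sketch of the preference-agreement computation, a minimal Cohen's κ for two raters can be written with numpy alone; this mirrors the standard definition (observed vs. chance agreement) rather than any specific statistics toolkit.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels of the same items."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    po = np.mean(a == b)  # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return float((po - pe) / (1.0 - pe))

# Example: two raters labeling six A/B preferences (0 = baseline, 1 = control)
print(cohens_kappa([1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]))
```

For more than two raters, Fleiss' κ (preference) or ICC / Krippendorff's α (rubric scores) would replace this, as listed above.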
7. Case Study: Applying the Control Layer to the Clip
7.1 Baseline Diagnosis (Typical Signature)
- Accents near-uniform intervals (low CV)
- Holds missing (low HR)
- Low stress contrast (low CR)
- Motion peaks drift relative to stress anchors (high AAE)
- Micro-motion noise fills gaps (rubric penalty)
7.2 Intervention (Instruction-Level)
- Place primary accents near: don't, I, celebriTY (first), ME, deCADES, POINT, celebriTY (final)
- Insert holds in silence windows: 1.472–1.700 s, ~4.60 s, 6.67–6.88 s
- Suppress filler-word motion to protect contrast
- Treat "hahaha" as an event: body stillness with minimal, deliberate facial intent
7.3 Expected Measurable Outcome
AAE ↓, CV ↑ (less even), HR ↑, CR ↑, rubric deltas improve on emphasis correctness and hold placement.
| Metric | Baseline | Expected (Control Layer) |
|---|---|---|
| AAE | High | ↓ (improved alignment) |
| CV (ETS) | Low (too even) | ↑ (less uniform) |
| HR | Low | ↑ (more holds) |
| CR | Low | ↑ (better contrast) |
8. Iteration: Metrics → Diagnosis → Next Experiment (Closed Loop)
Iteration summary (research framing):
The method uses measurement to localize failure, then modifies one control factor at a time to attribute improvements to specific causes.
Closed-Loop Process
- Step 1 — Measure: Compute AAE/ETS/HR/CR + rubric deltas.
- Step 2 — Diagnose (metric → cause mapping):
- AAE high ⇒ accents late/early → shift accent timing constraints by Δt
- ETS too low ⇒ too even → add holds at beat boundaries; suppress filler motion more
- HR too low ⇒ not enough stillness → enforce explicit holds; reduce micro-motion
- CR too low ⇒ stress not readable → amplify stressed-word accents; reduce unstressed motion
- Step 3 — Ablate: Change one factor at a time:
- head-only vs head+eyes+brows coupling
- hold length sweep (2/3/5/8 frames)
- stress threshold sweep (which words are primary)
- Step 4 — Select (decision rule summary): Accept a change only if target metrics improve by ≥ΔX while regressions stay below ε (e.g., HR ↑ without AAE worsening beyond ε).
| Metric Signal | Diagnosis | Control Adjustment |
|---|---|---|
| AAE high | Accents misaligned | Shift accent timing constraints by Δt |
| ETS/CV low | Rhythm too uniform | Add holds at beat boundaries; suppress filler |
| HR low | Insufficient stillness | Enforce explicit holds; reduce micro-motion |
| CR low | Poor stress contrast | Amplify stressed accents; reduce unstressed motion |
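The Step 4 acceptance rule can be made explicit. The metric directions follow the table in Section 6.2; the `min_gain` (ΔX) and `max_regress` (ε) values below are placeholder thresholds to be tuned, not values the paper fixes.

```python
# Desired direction per metric: +1 means higher is better, -1 means lower is better
DIRECTION = {"AAE": -1, "ETS": +1, "HR": +1, "CR": +1}

def accept_change(baseline, candidate, targets, min_gain=0.05, max_regress=0.02):
    """Accept only if every target metric improves by at least min_gain and
    no non-target metric regresses by more than max_regress (sign-adjusted)."""
    for m, sign in DIRECTION.items():
        delta = sign * (candidate[m] - baseline[m])  # positive = improvement
        if m in targets and delta < min_gain:
            return False
        if m not in targets and delta < -max_regress:
            return False
    return True

base = {"AAE": 0.30, "ETS": 0.10, "HR": 0.05, "CR": 1.1}
cand = {"AAE": 0.18, "ETS": 0.35, "HR": 0.20, "CR": 1.1}
print(accept_change(base, cand, targets={"AAE", "ETS", "HR"}))  # True here
```

Encoding the rule this way makes Step 4 auditable: every accept/reject decision in an ablation sweep reduces to a logged comparison against fixed thresholds.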
9. Limitations
- Proxy metrics approximate perceived quality; blinded human preference remains primary.
- Stress anchors derived from waveform peaks are approximate; forced alignment would improve precision.
- Prompt-level controls improve consistency but may fail for extreme styles or noisy audio.
- Eye intention and staging require additional labels and signal extraction.
10. Next Steps
- Replace approximate anchors with forced alignment + prosody features ($F_0$, energy, duration).
- Convert anchors into an explicit emphasis track for conditioning (not only instruction).
- Expand evaluation across speaking styles and emotional contexts.
- Add dedicated metrics for gaze alignment and micro-motion noise.
- Build a small benchmark set with labeled beats/stress + rubric + reliability as standard.
References
- Brand, M. (1999). Voice puppetry. In Proceedings of SIGGRAPH 1999 (pp. 21–28). ACM. https://doi.org/10.1145/311535.311537
- Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Buck, C., & Lee, J. (2013). Frozen [Film]. Walt Disney Pictures.
- Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., & Black, M. J. (2019). Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10093–10103). https://doi.org/10.1109/CVPR.2019.01034
- Docter, P., & Powers, K. (2020). Soul [Film]. Pixar Animation Studios.
- Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2023). Prompt-to-prompt image editing with cross-attention control. In Proceedings of the International Conference on Learning Representations (ICLR).
- Howard, B., & Moore, R. (2016). Zootopia [Film]. Walt Disney Animation Studios.
- Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V., & Jawahar, C. V. (2020). A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM). https://doi.org/10.1145/3394171.3413532
- Suwajanakorn, S., Seitz, S. M., & Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4). https://doi.org/10.1145/3072959.3073640
- Thies, J., Elgharib, M., Tewari, A., Theobalt, C., & Niessner, M. (2020). Neural voice puppetry: Audio-driven facial reenactment. In Computer Vision – ECCV 2020 (LNCS 12361, pp. 716–731). Springer. https://doi.org/10.1007/978-3-030-58517-4_42
- Thomas, F., & Johnston, O. (1981/1995). The Illusion of Life: Disney Animation. Disney Editions.
- Williams, R. (2001/2009). The Animator's Survival Kit. Faber & Faber.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 3813–3824). https://doi.org/10.1109/ICCV51070.2023.00355
- Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., & Li, D. (2020). MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6), Article 221. https://doi.org/10.1145/3414685.3417774