Prompt-Level Controls for Prosody-Aligned Facial Performance
floweralicee.github.io/lipsync-ai-demo

Abstract
Speech-driven facial performance often feels unnatural even when mouth shapes match phonemes (Suwajanakorn et al., 2017; Prajwal et al., 2020). A common failure mode is even timing: motion accents land at near-uniform intervals with low contrast between stressed and unstressed words, so the face looks "busy" but nothing lands. This paper frames weak animation choices as operational sub-problems (beats, timing, spacing/noise, contrast, eye intention, staging) and focuses on the subset most responsible for perceived intent in dialogue closeups: rhythm of emphasis, holds, and stress contrast. The method introduces a prompt-level acting control layer (instruction structure, not weight updates) and evaluates it under controlled comparisons (same audio, same base pipeline; only the instructions change). Four proxy metrics quantify temporal structure: Accent Alignment Error (AAE), Even Timing Score (ETS), Hold Ratio (HR), and Contrast Ratio (CR). A case study provides an auditable stress-timestamp annotation table derived from the audio waveform, plus a rubric and an iteration loop that convert subjective "taste" into measurable progress.
1. Introduction
Modern video systems can produce impressive lighting, environments, and camera aesthetics. Yet character performance in dialogue closeups frequently fails in predictable ways: timing is flat, stillness is missing, accents do not land on meaning, and facial motion becomes constant low-amplitude noise. The result is often perceived as "dead" even when lip sync is correct.
Speech-driven facial animation has a long lineage, from early audio-to-control approaches to modern neural and 3D face models conditioned on speech audio (Brand, 1999; Cudeiro et al., 2019; Thies et al., 2020; Zhou et al., 2020).
This paper targets a specific, repeatable defect—even timing—and asks a narrow question:
Can instruction-level controls reliably shift facial motion toward stress-aligned accents and readable holds, under controlled comparisons, without changing model weights?
1.1 Problem Decomposition: Operationalizing "Weak Animation Choices"
To make "this animation doesn't look good" actionable, the work decomposes perceived quality into separable sub-problems, each mapped to (i) an annotation target, (ii) a controllable instruction primitive, or (iii) a metric.
How this decomposition was derived.
This list did not come from theory first; it came from repeated diagnosis. The author developed the decomposition through scene-by-scene performance analysis of thousands of shots from feature films (e.g., Zootopia, Frozen, Soul) and by comparing that benchmark against hundreds of student shots that failed in consistent ways (Buck & Lee, 2013; Docter & Powers, 2020; Howard & Moore, 2016). Across these comparisons, "bad" outputs that looked different on the surface collapsed into a small set of recurring failure modes: unclear beats, uniform rhythm, missing holds, excess micro-motion noise, drifting eye intention, and poor staging/silhouette. In other words, the list below is a practical attempt to translate production principles into a diagnostic framework consistent with classic animation guidance (Thomas & Johnston, 1981/1995; Williams, 2001/2009).
Decomposed sub-problems.
- A. Beat clarity (staging + beats): Are intention changes readable as discrete beats?
  Label: beat boundaries. Rubric: beat readability.
- B. Timing (rhythm of emphasis): Do accents land on stressed words / beat boundaries rather than at uniform spacing?
  Labels: stress anchors. Metrics: AAE, ETS.
- C. Spacing / micro-motion noise: Is motion dominated by jitter that adds busyness without intent?
  Metric: HR (with noise thresholds). Rubric: noise vs. intent.
- D. Contrast (stillness vs. movement): Are there holds that make beats readable?
  Metrics: HR, CR. Rubric: hold placement.
- E. Eye path / intention: Do eyes lead thought changes, or drift without purpose?
  Label: gaze shift events. Rubric: eye-leads-thought.
- F. Silhouette / readability (for wider shots): Does a single frame communicate the beat?
  Rubric item: pose/silhouette (optional in closeup-only evaluation).
Scope: This paper focuses on B + D (timing and contrast) and tracks A/E as rubric items for future expansion.
2. Key Terms
- Beat: a change in intention/thought the audience should notice (new idea, realization, emotional turn).
- Accent: a motion peak that marks a beat or a stressed word (head nod/tilt, brow pop, lid change, jaw "hit").
- Hold: intentional stillness (or near-stillness) that creates contrast and makes the next accent readable.
- Even timing: accents occur at near-uniform intervals with low stress/non-stress contrast.
- Micro-motion noise: continuous small movement that does not communicate intention (wiggle that reduces readability).
3. Benchmark: Feature-Film Performance Rubric (v0.1)
Instead of claiming "cinematic," the paper uses a practical rubric aligned with what viewers perceive in dialogue closeups. The rubric is informed by classic animation principles emphasizing clarity, timing, staging, and readable intention (Thomas & Johnston, 1981/1995; Williams, 2001/2009).
| Dimension | Description |
|---|---|
| Beat readability | Can a viewer point to the thought changes? |
| Emphasis correctness | Do accents land on stressed words? |
| Hold placement | Is stillness used to make beats readable? |
| Micro-motion noise (reverse) | Is there purposeless jitter? (lower score = more noise) |
| Eye intention (optional) | Do eyes lead thought changes? |
4. Method: A Prompt-Level "Acting Control Layer"
The method translates an animator's diagnosis loop into structured instruction constraints.
Animator diagnosis loop (conceptual): watch the shot, identify the dominant failure mode (Section 1.1), map it to a control primitive, apply the change, and re-evaluate.
Control primitives (instruction-level):
- Rhythm of emphasis: Assign accents to stressed words and beat boundaries; suppress filler-word motion.
- Holds: Insert short stillness before key beats to increase contrast and readability.
- Stress contrast: Amplify motion near stressed words, reduce motion elsewhere, and avoid constant micro-motion.
Key idea.
These controls live in instruction structure (prompt-level). No weight updates are required, enabling rapid ablations.
This fits a broader pattern in generative modeling: prompt/instruction changes and auxiliary conditioning can steer pretrained models without full retraining, as seen in instruction-following diffusion editing (Brooks et al., 2023), prompt-to-prompt attention control (Hertz et al., 2023), and conditional control modules such as ControlNet (Zhang et al., 2023).
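To make "instruction structure" concrete, one plausible encoding is a small structured object that a prompt is rendered from. The field names and the `render_instruction` helper below are hypothetical illustrations, not the paper's actual interface; the example values come from the case-study table in Section 5.3.

```python
from dataclasses import dataclass, field

@dataclass
class AccentSpec:
    word: str        # stressed word the accent should land on
    time_s: float    # target peak time in seconds

@dataclass
class HoldSpec:
    start_s: float   # beginning of the near-stillness window
    end_s: float     # end of the near-stillness window

@dataclass
class ActingControlLayer:
    accents: list = field(default_factory=list)  # rhythm of emphasis
    holds: list = field(default_factory=list)    # contrast via stillness
    suppress_filler: bool = True                 # keep filler words quiet

def render_instruction(layer: ActingControlLayer) -> str:
    """Render the control layer as plain instruction text (prompt-level)."""
    lines = ["Accent timing (place a single clear motion peak at each):"]
    lines += [f"- '{a.word}' near {a.time_s:.2f}s" for a in layer.accents]
    lines.append("Holds (near-stillness, no micro-motion):")
    lines += [f"- {h.start_s:.2f}s to {h.end_s:.2f}s" for h in layer.holds]
    if layer.suppress_filler:
        lines.append("Keep unstressed/filler words visually quiet.")
    return "\n".join(lines)

# Example using two anchors and one silence window from Section 5
layer = ActingControlLayer(
    accents=[AccentSpec("don't", 0.835), AccentSpec("ME", 4.505)],
    holds=[HoldSpec(1.472, 1.700)],
)
print(render_instruction(layer))
```

Because the whole layer lives in the rendered text, an ablation is just a change to one field followed by re-rendering, which is what enables the one-factor-at-a-time loop of Section 8.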
5. Data & Annotation (Auditable, Lightweight)
This paper uses a one-clip case study designed to be reproducible in a single day, even without forced alignment tooling.
5.1 Case Study Transcript
"You don't understand, I am a celebrity, it is all about me, it has been for decades, hahaha, that is the point of celebrity."
5.2 Annotation Schema
Minimal labels that capture performance structure (not aesthetic preference):
- Stress anchors (word-level timestamps)
- Beat boundaries (timestamps for intention shifts)
- Pause/inhale windows (silence gaps)
- Optional acting events: gaze shifts, blink clusters, laughter burst, hold segments
5.3 Stress-Timestamp Table (Derived from Audio Waveform)
These anchors are approximate (audio energy/prosody peaks + silence segmentation), intended to be auditable and "good enough" for controlled proxy metrics. Peak time is the recommended target for the primary accent.
| Token (anchor) | Type | Window (s) | Peak (s) |
|---|---|---|---|
| don't | stress | 0.715–0.955 | 0.835 |
| I | stress | 1.700–1.845 | 1.725 |
| celebriTY (first) | stress | 2.285–2.525 | 2.405 |
| ME | stress | 4.385–4.608 | 4.505 |
| deCADES | stress | 6.005–6.245 | 6.125 |
| hahaha | event (laugh) | 6.445–6.671 | 6.565 |
| POINT | stress | 7.215–7.455 | 7.335 |
| celebriTY (final) | stress | 8.445–8.685 | 8.565 |
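A minimal sketch of how such anchors and silence gaps could be derived from a waveform using short-time energy (numpy only). The frame sizes, thresholds, and peak rule here are illustrative assumptions, not the exact procedure used for the table above.

```python
import numpy as np

def short_time_energy(x, sr, frame_ms=25, hop_ms=10):
    """RMS energy per frame of a mono waveform x at sample rate sr."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    e = np.array([np.sqrt(np.mean(x[i*hop:i*hop+frame] ** 2)) for i in range(n)])
    t = np.arange(n) * hop / sr  # frame start times in seconds
    return t, e

def silence_windows(t, e, thresh_ratio=0.1, min_dur=0.12):
    """Contiguous low-energy spans longer than min_dur seconds (Section 5.4)."""
    low = e < thresh_ratio * e.max()
    spans, start = [], None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = t[i]
        elif not flag and start is not None:
            if t[i] - start >= min_dur:
                spans.append((start, t[i]))
            start = None
    if start is not None and t[-1] - start >= min_dur:
        spans.append((start, t[-1]))
    return spans

def energy_peaks(t, e, thresh_ratio=0.5):
    """Local maxima above a fraction of the global max: stress-anchor candidates."""
    idx = [i for i in range(1, len(e) - 1)
           if e[i] > e[i-1] and e[i] >= e[i+1] and e[i] > thresh_ratio * e.max()]
    return [t[i] for i in idx]
```

In practice these candidates would be checked against the transcript by hand, or replaced with forced alignment plus prosody features as proposed in Section 10.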
5.4 Pause/Inhale Candidates (Silence Gaps)
| Silence Window (s) | Duration | Use |
|---|---|---|
| 1.472–1.700 | 228 ms | inhale / pre-"I" beat |
| 2.858–3.400 | 542 ms | inhale / beat reset |
| 4.608–4.767 | 159 ms | micro-pause (pre-"decades" ramp) |
| 6.671–6.877 | 206 ms | pause (post-laugh) |
| 7.546–7.670 | 124 ms | micro-pause (pre-final) |
5.5 Beat Boundary Proposal (For This Line)
| Beat | Description | Approx. Time (s) |
|---|---|---|
| B1 | dismissal: "you don't understand" | ~0.63–1.27 |
| B2 | self-assertion: "I am a celebrity" | ~1.69–2.88 |
| B3 | grandiose peak: "it is all about me" | ~3.39–4.62 |
| B4 | history claim + crack: "it has been for decades" | ~4.76–6.46 |
| B5 | mask slip: "hahaha" | ~6.54–6.69 |
| B6 | strained regain control: "that is the point of celebrity" | ~6.87–8.77 |
6. Evaluation Protocol
6.1 Controlled Comparison Design
To claim causality, evaluation must hold everything constant except instruction structure:
- Same audio
- Same base generation pipeline/model
- Same sampling/seed where possible (if not, report variability across N runs)
- Only change: baseline instruction vs control-layer instruction
6.2 Proxy Metrics
These metrics do not claim "objective Disney." They quantify whether the output better matches the intended structure (emphasis + holds + contrast).
6.2.1 Accent Alignment Error (AAE)
Lower is better.
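The section does not pin down a formula; one plausible formalization, given stress anchors $a_i$ ($i = 1, \dots, N$) and detected motion peaks $p_j$, is the mean distance from each anchor to its nearest peak:

$$\mathrm{AAE} = \frac{1}{N} \sum_{i=1}^{N} \min_{j} \left| a_i - p_j \right|$$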
6.2.2 Even Timing Score (ETS)
Compute accent gaps $\Delta t$ between consecutive motion peaks.
Low CV ⇒ too even ⇒ worse. Higher CV (within reason) ⇒ better rhythm.
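One formalization consistent with this description, with gaps $\Delta t_k = t_{k+1} - t_k$ between consecutive motion peaks, is the coefficient of variation:

$$\mathrm{ETS} = \mathrm{CV}(\Delta t) = \frac{\sigma(\Delta t)}{\mu(\Delta t)}$$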
6.2.3 Hold Ratio (HR)
Higher is better up to a point.
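A plausible definition, assuming a motion-energy curve $E(t)$, a stillness threshold $\theta$, and clip duration $T$, is the fraction of time spent below threshold:

$$\mathrm{HR} = \frac{1}{T} \int_0^T \mathbf{1}\!\left[ E(t) < \theta \right] dt$$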
6.2.4 Contrast Ratio (CR)
Higher is better (within reason).
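A plausible definition, assuming the stressed-word windows $S$ from the anchor table (Section 5.3) and the same motion-energy curve $E(t)$:

$$\mathrm{CR} = \frac{\operatorname{mean}_{t \in S} E(t)}{\operatorname{mean}_{t \notin S} E(t)}$$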
| Metric | Abbreviation | Desired Direction |
|---|---|---|
| Accent Alignment Error | AAE | ↓ (lower is better) |
| Even Timing Score (CV) | ETS | ↑ (higher CV = less uniform) |
| Hold Ratio | HR | ↑ (more stillness) |
| Contrast Ratio | CR | ↑ (more stress differentiation) |
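As an executable sketch, all four metrics can be computed from a 1-D motion-energy curve sampled at times `t`. The peak rule, stillness threshold, and window handling below are illustrative assumptions rather than the paper's fixed choices.

```python
import numpy as np

def motion_peaks(t, energy, thresh_ratio=0.5):
    """Times of local maxima above a fraction of the global max."""
    return [t[i] for i in range(1, len(energy) - 1)
            if energy[i] > energy[i-1] and energy[i] >= energy[i+1]
            and energy[i] > thresh_ratio * energy.max()]

def aae(anchor_times, peak_times):
    """Mean distance (s) from each stress anchor to its nearest motion peak."""
    return float(np.mean([min(abs(a - p) for p in peak_times)
                          for a in anchor_times]))

def ets_cv(peak_times):
    """Coefficient of variation of inter-peak gaps; low CV = too even."""
    gaps = np.diff(peak_times)
    return float(np.std(gaps) / np.mean(gaps))

def hold_ratio(energy, still_ratio=0.1):
    """Fraction of frames whose motion energy is below a stillness threshold."""
    return float(np.mean(energy < still_ratio * energy.max()))

def contrast_ratio(t, energy, stressed_windows):
    """Mean motion energy inside stressed windows vs. outside them."""
    inside = np.zeros(len(t), dtype=bool)
    for a, b in stressed_windows:
        inside |= (t >= a) & (t <= b)
    return float(energy[inside].mean() / energy[~inside].mean())
```

A synthetic curve with sharp bumps exactly on the anchors would score low AAE, low CV (if the bumps are evenly spaced), high HR, and high CR, which matches the desired directions in the table above only after the rhythm is made uneven.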
6.3 Human Evaluation Protocol (Blinded A/B + Rubric)
- Design: Paired A/B (baseline vs control), randomized order, identical audio, raters blinded.
- Task: Preference + rubric scores (Section 3).
- Report: Preference rate + rubric deltas with uncertainty estimates.
6.4 Inter-Rater Reliability
- Preference agreement: Fleiss' κ or Cohen's κ
- Rubric consistency: ICC(2,k) or Krippendorff's α
- Calibration: A small "gold" set aligns rubric interpretation across raters; calibration trials are excluded from final reporting.
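As a sketch of the preference-agreement computation, a minimal Cohen's κ for two raters can be written with numpy alone; this mirrors the standard definition (observed vs. chance agreement) rather than any specific statistics toolkit.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels of the same items."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    po = np.mean(a == b)  # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return float((po - pe) / (1.0 - pe))

# Example: two raters labeling six A/B preferences (0 = baseline, 1 = control)
print(cohens_kappa([1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]))
```

For more than two raters, Fleiss' κ (preference) or ICC / Krippendorff's α (rubric scores) would replace this, as listed above.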
7. Case Study: Applying the Control Layer to the Clip
7.1 Baseline Diagnosis (Typical Signature)
- Accents near-uniform intervals (low CV)
- Holds missing (low HR)
- Low stress contrast (low CR)
- Motion peaks drift relative to stress anchors (high AAE)
- Micro-motion noise fills gaps (rubric penalty)
7.2 Intervention (Instruction-Level)
- Place primary accents near: don't, I, celebriTY (first), ME, deCADES, POINT, celebriTY (final)
- Insert holds in silence windows: 1.472–1.700 s, ~4.60 s, 6.67–6.88 s
- Suppress filler-word motion to protect contrast
- Treat "hahaha" as an event: body stillness with minimal, deliberate facial intent
7.3 Expected Measurable Outcome
AAE ↓, CV ↑ (less even), HR ↑, CR ↑, rubric deltas improve on emphasis correctness and hold placement.
| Metric | Baseline | Expected (Control Layer) |
|---|---|---|
| AAE | High | ↓ (improved alignment) |
| CV (ETS) | Low (too even) | ↑ (less uniform) |
| HR | Low | ↑ (more holds) |
| CR | Low | ↑ (better contrast) |
8. Iteration: Metrics → Diagnosis → Next Experiment (Closed Loop)
Iteration summary (research framing):
The method uses measurement to localize failure, then modifies one control factor at a time to attribute improvements to specific causes.
Closed-Loop Process
- Step 1 — Measure: Compute AAE/ETS/HR/CR + rubric deltas.
- Step 2 — Diagnose (metric → cause mapping):
- AAE high ⇒ accents late/early → shift accent timing constraints by Δt
- ETS too low ⇒ too even → add holds at beat boundaries; suppress filler motion more
- HR too low ⇒ not enough stillness → enforce explicit holds; reduce micro-motion
- CR too low ⇒ stress not readable → amplify stressed-word accents; reduce unstressed motion
- Step 3 — Ablate: Change one factor at a time:
- head-only vs head+eyes+brows coupling
- hold length sweep (2/3/5/8 frames)
- stress threshold sweep (which words are primary)
- Step 4 — Select (decision rule summary): Accept a change only if target metrics improve by ≥ΔX while regressions stay below ε (e.g., HR ↑ without AAE worsening beyond ε).
| Metric Signal | Diagnosis | Control Adjustment |
|---|---|---|
| AAE high | Accents misaligned | Shift accent timing constraints by Δt |
| ETS/CV low | Rhythm too uniform | Add holds at beat boundaries; suppress filler |
| HR low | Insufficient stillness | Enforce explicit holds; reduce micro-motion |
| CR low | Poor stress contrast | Amplify stressed accents; reduce unstressed motion |
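The Step 4 acceptance rule can be made explicit. The metric directions follow the table in Section 6.2; the `min_gain` (ΔX) and `max_regress` (ε) values below are placeholder thresholds to be tuned, not values the paper fixes.

```python
# Desired direction per metric: +1 means higher is better, -1 means lower is better
DIRECTION = {"AAE": -1, "ETS": +1, "HR": +1, "CR": +1}

def accept_change(baseline, candidate, targets, min_gain=0.05, max_regress=0.02):
    """Accept only if every target metric improves by at least min_gain and
    no non-target metric regresses by more than max_regress (sign-adjusted)."""
    for m, sign in DIRECTION.items():
        delta = sign * (candidate[m] - baseline[m])  # positive = improvement
        if m in targets and delta < min_gain:
            return False
        if m not in targets and delta < -max_regress:
            return False
    return True

base = {"AAE": 0.30, "ETS": 0.10, "HR": 0.05, "CR": 1.1}
cand = {"AAE": 0.18, "ETS": 0.35, "HR": 0.20, "CR": 1.1}
print(accept_change(base, cand, targets={"AAE", "ETS", "HR"}))  # True here
```

Encoding the rule this way makes Step 4 auditable: every accept/reject decision in an ablation sweep reduces to a logged comparison against fixed thresholds.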
9. Limitations
- Proxy metrics approximate perceived quality; blinded human preference remains primary.
- Stress anchors derived from waveform peaks are approximate; forced alignment would improve precision.
- Prompt-level controls improve consistency but may fail for extreme styles or noisy audio.
- Eye intention and staging require additional labels and signal extraction.
10. Next Steps
- Replace approximate anchors with forced alignment + prosody features ($F_0$, energy, duration).
- Convert anchors into an explicit emphasis track for conditioning (not only instruction).
- Expand evaluation across speaking styles and emotional contexts.
- Add dedicated metrics for gaze alignment and micro-motion noise.
- Build a small benchmark set with labeled beats/stress + rubric + reliability as standard.
References
- Brand, M. (1999). Voice puppetry. In Proceedings of SIGGRAPH 1999 (pp. 21–28). ACM. https://doi.org/10.1145/311535.311537
- Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Buck, C., & Lee, J. (2013). Frozen [Film]. Walt Disney Pictures.
- Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., & Black, M. J. (2019). Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 10093–10103). https://doi.org/10.1109/CVPR.2019.01034
- Docter, P., & Powers, K. (2020). Soul [Film]. Pixar Animation Studios.
- Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2023). Prompt-to-prompt image editing with cross-attention control. In Proceedings of the International Conference on Learning Representations (ICLR).
- Howard, B., & Moore, R. (2016). Zootopia [Film]. Walt Disney Animation Studios.
- Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V., & Jawahar, C. V. (2020). A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM). https://doi.org/10.1145/3394171.3413532
- Suwajanakorn, S., Seitz, S. M., & Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 36(4). https://doi.org/10.1145/3072959.3073640
- Thies, J., Elgharib, M., Tewari, A., Theobalt, C., & Niessner, M. (2020). Neural voice puppetry: Audio-driven facial reenactment. In Computer Vision – ECCV 2020 (LNCS 12361, pp. 716–731). Springer. https://doi.org/10.1007/978-3-030-58517-4_42
- Thomas, F., & Johnston, O. (1981/1995). The Illusion of Life: Disney Animation. Disney Editions.
- Williams, R. (2001/2009). The Animator's Survival Kit. Faber & Faber.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 3813–3824). https://doi.org/10.1109/ICCV51070.2023.00355
- Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., & Li, D. (2020). MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6), Article 221. https://doi.org/10.1145/3414685.3417774