Prompt-Level Controls for Prosody-Aligned Facial Performance

Alice Chen

floweralicee.github.io/lipsync-ai-demo

Abstract

Speech-driven facial performance often feels unnatural even when mouth shapes match phonemes (Suwajanakorn et al., 2017; Prajwal et al., 2020). A common failure mode is even timing: motion accents land at near-uniform intervals with low contrast between stressed and unstressed words, so the face is "busy" but nothing lands. This paper frames weak animation choices as operational sub-problems (beats, timing, spacing/noise, contrast, eye intention, staging) and focuses on the subset most responsible for perceived intent in dialogue closeups: rhythm of emphasis, holds, and stress contrast. The method introduces a prompt-level acting control layer (instruction structure, not weight updates) and evaluates it under controlled comparisons (same audio, same base pipeline; only the instruction changes). Four proxy metrics quantify temporal structure: Accent Alignment Error (AAE), Even Timing Score (ETS), Hold Ratio (HR), and Contrast Ratio (CR). A case study provides an auditable stress-timestamp annotation table derived from the audio waveform, plus a rubric and an iteration loop that convert subjective "taste" into measurable progress.

1. Introduction

Modern video systems can produce impressive lighting, environments, and camera aesthetics. Yet character performance in dialogue closeups frequently fails in predictable ways: timing is flat, stillness is missing, accents do not land on meaning, and facial motion becomes constant low-amplitude noise. The result is often perceived as "dead" even when lip sync is correct.

Speech-driven facial animation has a long lineage, from early audio-to-control approaches to modern neural and 3D face models conditioned on speech audio (Brand, 1999; Cudeiro et al., 2019; Thies et al., 2020; Zhou et al., 2020).

This paper targets a specific, repeatable defect—even timing—and asks a narrow question:

Can instruction-level controls reliably shift facial motion toward stress-aligned accents and readable holds, under controlled comparisons, without changing model weights?

1.1 Problem Decomposition: Operationalizing "Weak Animation Choices"

To make "this animation doesn't look good" actionable, the work decomposes perceived quality into separable sub-problems, each mapped to (i) an annotation target, (ii) a controllable instruction primitive, or (iii) a metric.

How this decomposition was derived.

This list did not come from theory first; it came from repeated diagnosis. The author developed the decomposition through scene-by-scene performance analysis of thousands of shots from feature films (e.g., Zootopia, Frozen, Soul) and by comparing that benchmark against hundreds of student shots that failed in consistent ways (Buck & Lee, 2013; Docter & Powers, 2020; Howard & Moore, 2016). Across these comparisons, different "bad" outputs often looked different on the surface, but they collapsed into a small set of recurring failure modes: unclear beats, uniform rhythm, missing holds, excess micro-motion noise, drifting eye intention, and poor staging/silhouette. In other words, the list below reflects a practical attempt to translate production-practice principles into a diagnostic framework consistent with classic animation guidance (Thomas & Johnston, 1981/1995; Williams, 2001/2009).

Decomposed sub-problems.

  1. Beat clarity (staging + beats): Are intention changes readable as discrete beats?
    Label: beat boundaries   Rubric: beat readability
  2. Timing (rhythm of emphasis): Do accents land on stressed words / beat boundaries rather than uniform spacing?
    Labels: stress anchors   Metrics: AAE, ETS
  3. Spacing / micro-motion noise: Is motion dominated by jitter that adds busyness without intent?
    Metric: HR (with noise thresholds)   Rubric: noise vs intent
  4. Contrast (stillness vs movement): Are there holds that make beats readable?
    Metrics: HR, CR   Rubric: hold placement
  5. Eye path / intention: Do eyes lead thought changes, or drift without purpose?
    Label: gaze shift events   Rubric: eye-leads-thought
  6. Silhouette / readability (for wider shots): Does a single frame communicate the beat?
    Rubric item: pose/silhouette (optional in closeup-only evaluation)

Scope: This paper focuses on sub-problems 2 and 4 (timing and contrast) and tracks sub-problems 1 and 5 (beat clarity and eye intention) as rubric items for future expansion.

2. Key Terms

Beat
A change in intention/thought the audience should notice (new idea, realization, emotional turn).
Accent
A motion peak that marks a beat or a stressed word (head nod/tilt, brow pop, lid change, jaw "hit").
Hold
Intentional stillness (or near-stillness) that creates contrast and makes the next accent readable.
Even timing
Accents occur at near-uniform intervals with low stress/non-stress contrast.
Micro-motion noise
Continuous small movement that does not communicate intention (wiggle that reduces readability).

3. Benchmark: Feature-Film Performance Rubric (v0.1)

Instead of claiming "cinematic," the paper uses a practical rubric aligned with what viewers perceive in dialogue closeups. The rubric is informed by classic animation principles emphasizing clarity, timing, staging, and readable intention (Thomas & Johnston, 1981/1995; Williams, 2001/2009).

Table 1: Feature-Film Performance Rubric (1–5 scale per dimension)
Dimension | Description
Beat readability | Can a viewer point to the thought changes?
Emphasis correctness | Do accents land on stressed words?
Hold placement | Is stillness used to make beats readable?
Micro-motion noise (reverse) | Is there purposeless jitter? (lower score = more noise)
Eye intention (optional) | Do eyes lead thought changes?

4. Method: A Prompt-Level "Acting Control Layer"

The method translates an animator's diagnosis loop into structured instruction constraints.

Animator diagnosis loop (conceptual):

find the beat → add a hold → place the accent → clean up in-betweens → check readability

Control primitives (instruction-level):

  1. Accent placement: align primary motion peaks (head, brows, lids, jaw) with labeled stress anchors.
  2. Holds: enforce near-stillness inside labeled silence windows and at beat boundaries.
  3. Filler suppression: damp micro-motion between accents so stillness reads as intent.
  4. Contrast shaping: amplify accents on stressed words relative to unstressed spans.

An illustrative encoding of these primitives appears at the end of this section.

Key idea.

These controls live in instruction structure (prompt-level). No weight updates are required, enabling rapid ablations.

This fits a broader pattern in generative modeling: prompt/instruction changes and auxiliary conditioning can steer pretrained models without full retraining, as seen in instruction-following diffusion editing (Brooks et al., 2023), prompt-to-prompt attention control (Hertz et al., 2023), and conditional control modules such as ControlNet (Zhang et al., 2023).
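
To make the control layer concrete, the sketch below encodes the primitives as structured constraints for the case-study clip. The field names and value vocabulary are illustrative assumptions rather than a fixed schema; the timestamps come from Tables 2 and 3.

    # Illustrative encoding of the instruction-level control layer.
    # Field names are assumptions; timestamps reference Tables 2-3.
    control_layer = {
        "accents": [   # align motion peaks with stress anchors
            {"t": 0.835, "token": "don't", "channels": ["head", "brows"]},
            {"t": 4.505, "token": "ME",    "channels": ["head", "jaw"]},
        ],
        "holds": [     # enforce near-stillness in silence windows
            {"start": 2.858, "end": 3.400},   # inhale / beat reset
            {"start": 6.671, "end": 6.877},   # pause after the laugh
        ],
        "suppress_filler": True,              # damp micro-motion between accents
        "contrast": {"stressed_gain": 1.5, "unstressed_gain": 0.7},  # illustrative
    }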

5. Data & Annotation (Auditable, Lightweight)

This paper uses a one-clip case study designed to be reproducible in a single day, even without forced alignment tooling.

5.1 Case Study Transcript

"You don't understand, I am a celebrity, it is all about me, it has been for decades, hahaha, that is the point of celebrity."

5.2 Annotation Schema

Minimal labels that capture performance structure (not aesthetic preference):

  1. Stress anchors: stressed-word windows with a recommended peak time (Table 2).
  2. Events: non-lexical moments such as the laugh (Table 2).
  3. Silence windows: pause/inhale gaps usable as holds (Table 3).
  4. Beat boundaries: the start and end of each intention change (Table 4).
  5. Gaze shift events (optional): moments where the eyes lead a thought change.
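
An illustrative record format for these labels follows; the field names are assumptions, and the values mirror Tables 2–4.

    # Illustrative annotation records; field names are assumptions.
    annotations = {
        "stress_anchors": [
            {"token": "don't", "window": (0.715, 0.955), "peak": 0.835},
        ],
        "events": [
            {"label": "laugh", "window": (6.445, 6.671), "peak": 6.565},
        ],
        "silences": [
            {"window": (2.858, 3.400), "use": "inhale / beat reset"},
        ],
        "beats": [
            {"id": "B2", "desc": "self-assertion", "window": (1.69, 2.88)},
        ],
    }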

5.3 Stress-Timestamp Table (Derived from Audio Waveform)

These anchors are approximate (audio energy/prosody peaks + silence segmentation), intended to be auditable and "good enough" for controlled proxy metrics. Peak time is the recommended target for the primary accent.

Table 2: Stress-Timestamp Annotations for Case Study Clip
Token (anchor) | Type | Window (s) | Peak (s)
don't | stress | 0.715–0.955 | 0.835
I | stress | 1.700–1.845 | 1.725
celebriTY (first) | stress | 2.285–2.525 | 2.405
ME | stress | 4.385–4.608 | 4.505
deCADES | stress | 6.005–6.245 | 6.125
hahaha | event (laugh) | 6.445–6.671 | 6.565
POINT | stress | 7.215–7.455 | 7.335
celebriTY (final) | stress | 8.445–8.685 | 8.565
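
The following is one plausible way to derive such anchors automatically, using RMS-energy peaks as candidates. It assumes librosa and scipy are available; the filename and peak-picking thresholds are illustrative, and candidates still need manual verification against the transcript.

    import numpy as np
    import librosa
    from scipy.signal import find_peaks

    y, sr = librosa.load("case_study_clip.wav", sr=None)  # hypothetical filename
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]     # per-frame energy
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)

    # Keep prominent energy peaks at least 200 ms apart (illustrative settings).
    peaks, _ = find_peaks(rms, prominence=0.3 * rms.max(),
                          distance=max(1, int(0.2 * sr / hop)))
    candidate_anchors = times[peaks]  # verify and label against the transcript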

5.4 Pause/Inhale Candidates (Silence Gaps)

Table 3: Silence Windows for Hold Placement
Silence Window (s) | Duration | Use
1.472–1.700 | 228 ms | inhale / pre-"I" beat
2.858–3.400 | 542 ms | inhale / beat reset
4.608–4.767 | 159 ms | micro-pause (pre-"decades" ramp)
6.671–6.877 | 206 ms | pause (post-laugh)
7.546–7.670 | 124 ms | micro-pause (pre-final)
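
A sketch of the corresponding silence segmentation, using librosa's energy-based split; the top_db threshold and minimum gap are illustrative assumptions.

    import librosa

    y, sr = librosa.load("case_study_clip.wav", sr=None)  # hypothetical filename
    speech = librosa.effects.split(y, top_db=30)          # non-silent sample spans

    # Gaps between consecutive speech spans are hold/inhale candidates.
    silences = [
        (end / sr, start / sr)
        for (_, end), (start, _) in zip(speech[:-1], speech[1:])
        if (start - end) / sr >= 0.1                      # keep gaps >= 100 ms
    ]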

5.5 Beat Boundary Proposal (For This Line)

Table 4: Proposed Beat Boundaries for Case Study
Beat | Description | Approx. Time (s)
B1 | dismissal: "you don't understand" | ~0.63–1.27
B2 | self-assertion: "I am a celebrity" | ~1.69–2.88
B3 | grandiose peak: "it is all about me" | ~3.39–4.62
B4 | history claim + crack: "it has been for decades" | ~4.76–6.46
B5 | mask slip: "hahaha" | ~6.54–6.69
B6 | strained regaining of control: "that is the point of celebrity" | ~6.87–8.77

6. Evaluation Protocol

6.1 Controlled Comparison Design

To claim causality, evaluation must hold everything constant except instruction structure:

  1. Same audio: the case-study clip from Section 5.1 drives every condition.
  2. Same base pipeline: identical model, settings, and (where applicable) seeds.
  3. Same measurement: identical stress anchors, thresholds, and rubric.
  4. Only the instruction changes: baseline prompt vs. baseline prompt plus the acting control layer.
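
A minimal sketch of the A/B harness under these constraints; generate() is a hypothetical stand-in for the fixed base pipeline, and both instruction strings are illustrative.

    # Hypothetical harness: only the instruction differs between conditions.
    AUDIO = "case_study_clip.wav"   # same audio clip (hypothetical filename)
    SEED = 1234                     # same sampler settings for both runs

    BASELINE = "Animate the face to match the audio."  # illustrative prompt
    CONTROLLED = (BASELINE +
                  " Place primary accents on the labeled stress anchors;"
                  " hold nearly still inside the labeled silence windows;"
                  " suppress filler motion between accents.")

    def generate(audio, instruction, seed):
        """Stand-in for the fixed base pipeline (hypothetical API)."""
        raise NotImplementedError

    conditions = {"baseline": BASELINE, "control_layer": CONTROLLED}
    # outputs = {name: generate(AUDIO, text, SEED) for name, text in conditions.items()}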

6.2 Proxy Metrics

These metrics do not claim "objective Disney." They quantify whether the output better matches the intended structure (emphasis + holds + contrast).

6.2.1 Accent Alignment Error (AAE)

$$\text{AAE} = \frac{1}{N}\sum_{i=1}^{N} |t^{\text{peak}}_i - t^{\text{stress}}_i|$$

Lower is better. Here $t^{\text{peak}}_i$ is the time of the $i$-th detected motion peak, $t^{\text{stress}}_i$ is its matched stress anchor, and $N$ is the number of matched pairs.
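
A minimal sketch, assuming motion peaks have already been detected and matched one-to-one, in order, with the stress anchors (the matching rule is an assumption):

    import numpy as np

    def accent_alignment_error(peak_times, stress_times):
        """Mean absolute offset (seconds) between matched motion peaks
        and stress anchors; assumes a one-to-one, in-order matching."""
        peaks = np.asarray(peak_times, dtype=float)
        anchors = np.asarray(stress_times, dtype=float)
        return float(np.mean(np.abs(peaks - anchors)))

    # Stress-anchor peak times from Table 2 (stress rows only); the
    # detected motion peaks here are made-up illustrative values.
    anchors = [0.835, 1.725, 2.405, 4.505, 6.125, 7.335, 8.565]
    detected = [0.90, 1.70, 2.55, 4.40, 6.20, 7.30, 8.70]
    print(accent_alignment_error(detected, anchors))  # ~0.084 s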

6.2.2 Even Timing Score (ETS)

Compute accent gaps $\Delta t$ between consecutive motion peaks.

$$\text{CV} = \frac{\text{std}(\Delta t)}{\text{mean}(\Delta t)}$$

Low CV ⇒ too even ⇒ worse. Higher CV (within reason) ⇒ better rhythm.
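
A minimal sketch of the CV computation over detected peak times:

    import numpy as np

    def even_timing_score(peak_times):
        """CV of gaps between consecutive motion peaks; a low value
        flags the near-uniform rhythm of the 'even timing' failure."""
        gaps = np.diff(np.sort(np.asarray(peak_times, dtype=float)))
        return float(np.std(gaps) / np.mean(gaps))

    # Perfectly even accents give CV = 0; varied rhythm raises it.
    print(even_timing_score([0.5, 1.5, 2.5, 3.5]))          # 0.0
    print(even_timing_score([0.835, 1.725, 2.405, 4.505]))  # ~0.51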

6.2.3 Hold Ratio (HR)

$$\text{HR} = \frac{\text{\# frames below motion threshold } \tau}{\text{total dialogue frames}}$$

Higher is better up to a point.
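
A minimal sketch, assuming a per-frame motion-magnitude signal restricted to dialogue frames; the threshold τ is clip-dependent:

    import numpy as np

    def hold_ratio(motion_mag, tau):
        """Fraction of dialogue frames whose motion magnitude is below tau."""
        m = np.asarray(motion_mag, dtype=float)
        return float(np.mean(m < tau))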

6.2.4 Contrast Ratio (CR)

$$\text{CR} = \frac{\text{avg motion magnitude near stressed words}}{\text{avg motion magnitude near unstressed words}}$$

Higher is better (within reason).
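
A minimal sketch, assuming per-frame motion magnitudes with frame timestamps; "near stressed words" is implemented here as "inside the Table 2 windows", which is one reasonable reading of the definition:

    import numpy as np

    def contrast_ratio(motion_mag, frame_times, stress_windows):
        """Mean motion magnitude inside stress windows vs. outside them."""
        m = np.asarray(motion_mag, dtype=float)
        t = np.asarray(frame_times, dtype=float)
        stressed = np.zeros(t.shape, dtype=bool)
        for start, end in stress_windows:
            stressed |= (t >= start) & (t <= end)
        # Assumes both regions are non-empty for the clip being scored.
        return float(m[stressed].mean() / m[~stressed].mean())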

Table 5: Summary of Proxy Metrics
Metric | Abbreviation | Desired Direction
Accent Alignment Error | AAE | ↓ (lower is better)
Even Timing Score (CV) | ETS | ↑ (higher CV = less uniform)
Hold Ratio | HR | ↑ (more stillness)
Contrast Ratio | CR | ↑ (more stress differentiation)

6.3 Human Evaluation Protocol (Blinded A/B + Rubric)

6.4 Inter-Rater Reliability

7. Case Study: Applying the Control Layer to the Clip

7.1 Baseline Diagnosis (Typical Signature)

7.2 Intervention (Instruction-Level)

7.3 Expected Measurable Outcome

AAE ↓, CV ↑ (less even), HR ↑, CR ↑, rubric deltas improve on emphasis correctness and hold placement.

Table 6: Expected Metric Changes After Intervention
Metric | Baseline | Expected (Control Layer)
AAE | High | ↓ (improved alignment)
CV (ETS) | Low (too even) | ↑ (less uniform)
HR | Low | ↑ (more holds)
CR | Low | ↑ (better contrast)

8. Iteration: Metrics → Diagnosis → Next Experiment (Closed Loop)

Iteration summary (research framing):

The method uses measurement to localize failure, then modifies one control factor at a time to attribute improvements to specific causes.

Closed-Loop Process

  1. Step 1 — Measure: Compute AAE/ETS/HR/CR + rubric deltas.
  2. Step 2 — Diagnose (metric → cause mapping):
    • AAE high ⇒ accents late/early → shift accent timing constraints by Δt
    • ETS too low ⇒ too even → add holds at beat boundaries; suppress filler motion more
    • HR too low ⇒ not enough stillness → enforce explicit holds; reduce micro-motion
    • CR too low ⇒ stress not readable → amplify stressed-word accents; reduce unstressed motion
  3. Step 3 — Ablate: Change one factor at a time:
    • head-only vs head+eyes+brows coupling
    • hold length sweep (2/3/5/8 frames)
    • stress threshold sweep (which words are primary)
  4. Step 4 — Select (decision rule summary): Accept a change only if target metrics improve by ≥ΔX while regressions stay below ε (e.g., HR ↑ without AAE worsening beyond ε).
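
A minimal sketch of this decision rule, with illustrative values for ΔX and ε (the paper leaves both unspecified); metric directions follow Table 5.

    # Metric directions from Table 5: AAE falls when better; ETS/HR/CR rise.
    DIRECTION = {"AAE": -1, "ETS": +1, "HR": +1, "CR": +1}

    def accept_change(before, after, targets, delta_x=0.05, eps=0.02):
        """Accept only if every target metric improves by >= delta_x and no
        other metric regresses by more than eps (both thresholds illustrative)."""
        for metric, sign in DIRECTION.items():
            gain = sign * (after[metric] - before[metric])  # positive = improvement
            if metric in targets and gain < delta_x:
                return False
            if metric not in targets and gain < -eps:
                return False
        return True

    # Example: target HR; tolerate only a tiny AAE regression.
    ok = accept_change(
        before={"AAE": 0.12, "ETS": 0.35, "HR": 0.10, "CR": 1.30},
        after={"AAE": 0.13, "ETS": 0.40, "HR": 0.22, "CR": 1.40},
        targets={"HR"},
    )  # True: HR gained 0.12 >= delta_x; AAE regressed only 0.01 <= eps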

Table 7: Diagnosis Mapping: Metric Failure → Control Adjustment
Metric Signal | Diagnosis | Control Adjustment
AAE high | Accents misaligned | Shift accent timing constraints by Δt
ETS/CV low | Rhythm too uniform | Add holds at beat boundaries; suppress filler
HR low | Insufficient stillness | Enforce explicit holds; reduce micro-motion
CR low | Poor stress contrast | Amplify stressed accents; reduce unstressed motion

9. Limitations

10. Next Steps

References