Troubleshooting · 4 min read · Apr 15, 2026

Debugging Lip Sync: Why the Mouth Is Out of Step

HappyHorse 1.0 ships with native lip sync across seven languages. When it works, mouth shapes track consonants and the timing lines up with the duration. When it does not, you get a jaw that flaps arbitrarily or a face that moves for the first word and then freezes.

Run through these four failure modes first. Each has a fix you control: three in the prompt, one in the render settings.

1. Dialogue buried under ambient

The most common mistake is writing a prompt that reads like a scene description with the line tacked on at the end. HappyHorse builds the scene in the order you describe it. If you open with atmosphere, the model spends its budget on atmosphere. Speech becomes an afterthought, and the mouth animation suffers first.

Bad:

    A quiet library at night, dim green banker's lamps.
    A librarian says, "closing in five minutes."

Good:

    SAY: "closing in five minutes."
    Speaker: a librarian at a reference desk.
    Setting: quiet library at night, dim green banker's lamps.

Lead with SAY: so the model treats the utterance as the payload and everything after it as staging. Mouth animation usually improves right away.
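If you assemble prompts in code, a small builder keeps the speech-first order from drifting. This is a minimal sketch: `SpeechShot` and `buildSpeechFirstPrompt` are hypothetical names for illustration, not part of any HappyHorse or fal API.

```typescript
// Hypothetical shape for a single spoken shot; field names are illustrative.
interface SpeechShot {
  line: string;    // the words spoken on screen
  speaker: string; // who says them
  setting: string; // ambient and staging description
}

// Assemble the prompt with the utterance first and the staging after it.
function buildSpeechFirstPrompt(shot: SpeechShot): string {
  return [
    `SAY: "${shot.line}"`,
    `Speaker: ${shot.speaker}.`,
    `Setting: ${shot.setting}.`,
  ].join("\n");
}

const prompt = buildSpeechFirstPrompt({
  line: "closing in five minutes.",
  speaker: "a librarian at a reference desk",
  setting: "quiet library at night, dim green banker's lamps",
});
console.log(prompt);
```

Because the builder owns the ordering, nobody on the team can accidentally put the atmosphere first.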

2. Language mismatch

You wrote the scene description in English but the spoken line is in Japanese, German, or Portuguese. The model now has to guess which language the phonemes belong to. Often it guesses wrong, and you get a mouth that moves to English phonemes while the audio is in Japanese, or vice versa.

The fix is to be explicit. Put the language of the utterance in the prompt, in English.

    SAY (Japanese): "mou sugu heiten desu."
    Speaker: a librarian at a reference desk.

Do this for every non-English line. All seven supported languages respond well when you declare them. Do not rely on the model to detect the language from the glyphs alone.
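One way to make the declaration hard to forget is to route every line through a formatter that takes the language as an argument. The sketch below is an assumption-laden example: `sayDirective` is a hypothetical helper, and leaving English untagged simply mirrors the English examples in this post.

```typescript
// Hypothetical helper: formats the SAY line with the language named in English.
// English lines are left untagged here, matching the examples in this post.
function sayDirective(line: string, language: string = "English"): string {
  const name = language.trim();
  const tag = name.toLowerCase() === "english" ? "" : ` (${name})`;
  return `SAY${tag}: "${line}"`;
}

const japanese = sayDirective("mou sugu heiten desu.", "Japanese");
const english = sayDirective("closing in five minutes.");
console.log(japanese);
console.log(english);
```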

3. Script too long for the duration

The audio and the visual track have to fit the same duration window. If you ask for a 3-second clip and hand the model a line that takes six seconds to say, the mouth will start on time and then either cut off mid-word or speed up to finish inside the window.

Rough rule: budget one second per two syllables of natural speech. A 3-second clip holds about six syllables. A 5-second clip holds about ten. If words are cut off at the end, the script is too long.

Failed (3 seconds, 13 syllables):

    SAY: "I will meet you at the station after the movie."
    Duration: 3s.

Fixed (3 seconds, 6 syllables):

    SAY: "meet me at the station."
    Duration: 3s.

Or keep the longer line and extend the duration until it fits the syllable budget, so the mouth has room to land each consonant.
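The budget is easy to check before you spend a render. Below is a rough sketch, assuming the two-syllables-per-second rule from this section. The syllable counter is a crude vowel-run heuristic for English only, not a real phonetic model, and all function names are hypothetical.

```typescript
// Assumption from the rule of thumb above: about two syllables per second.
const SYLLABLES_PER_SECOND = 2;

// Crude estimate: count runs of vowels in each word. English-only, and it
// often overcounts words with a silent "e", so treat the result as approximate.
function estimateSyllables(line: string): number {
  const words = line.toLowerCase().match(/[a-z']+/g) ?? [];
  return words.reduce((total, word) => {
    const groups = word.match(/[aeiouy]+/g);
    return total + Math.max(1, groups ? groups.length : 0);
  }, 0);
}

// True when the estimated syllable count fits the clip's budget.
function fitsDuration(line: string, durationSec: number): boolean {
  return estimateSyllables(line) <= durationSec * SYLLABLES_PER_SECOND;
}

// Shortest whole-second duration that holds the line under the same rule.
function minDurationSec(line: string): number {
  return Math.ceil(estimateSyllables(line) / SYLLABLES_PER_SECOND);
}

const longLine = "I will meet you at the station after the movie.";
const shortLine = "meet me at the station.";
console.log(estimateSyllables(longLine), fitsDuration(longLine, 3));
console.log(estimateSyllables(shortLine), fitsDuration(shortLine, 3));
console.log(minDurationSec(longLine));
```

On the two sample lines this estimate gives 13 and 6 syllables, so only the short line fits a 3-second clip. For other languages, count syllables by hand or swap in a counter you trust.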

4. Judging at 256p

This one catches new users constantly. At 256p a face is 40 to 60 pixels tall and the mouth is maybe 8 pixels wide. Any lip sync at that resolution looks approximate, regardless of how well it landed. Consonant-heavy languages like German suffer most; the stop sounds need real mouth shape contrast, and at 256p the whole lower face is a smudge.

Before you decide the sync is broken, regenerate at 720p or 1080p with the same seed so the composition is preserved. You will frequently find the sync was fine; the resolution was lying to you.

    import { fal } from "@fal-ai/client";

    // or fal-ai/happyhorse/v1/text-to-video once available
    const ENDPOINT = "fal-ai/seedance-2.0/text-to-video";

    // Re-render the same prompt and seed at 720p so the sync can be judged on a legible face.
    async function syncCheck(prompt: string, seed: number) {
      const result = await fal.subscribe(ENDPOINT, {
        input: { prompt, resolution: "720p", duration: 5, seed },
      });
      return result.data;
    }
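To see why the higher resolution matters, scale the pixel figures from this section. This is back-of-envelope arithmetic only, assuming that "256p", "720p", and "1080p" name the frame height in pixels and that the seed keeps the framing identical; `scaledPixels` is a hypothetical helper.

```typescript
// Assumption: the "p" figure is the frame height in pixels, and the same seed
// preserves the composition, so on-screen features scale linearly with height.
function scaledPixels(pixelsAt256: number, targetHeight: number): number {
  return (pixelsAt256 * targetHeight) / 256;
}

// Start from the roughly 8-pixel-wide mouth quoted for 256p.
const mouthAt720 = scaledPixels(8, 720);
const mouthAt1080 = scaledPixels(8, 1080);
console.log(mouthAt720, mouthAt1080);
```

Under those assumptions the mouth goes from about 8 pixels wide to a little over 22 at 720p, which is enough for distinct lip shapes to register.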

Triage checklist

When a sync looks broken, walk this list in order. Stop at the first check that fails and apply its fix.

1. Did you lead with SAY:? If not, rewrite.
2. Is the spoken language declared in the prompt, in English? If not, declare it.
3. Does the script length match the duration? If the ratio is off, shorten or extend.
4. Are you judging at 256p? If yes, regenerate at 720p with the same seed.
5. Is the face occluded by a hand, hair, or prop? The model cannot sync what it cannot show. Reframe.

Most of the time you stop at step 1 or 3. Most of the rest stop at step 4, where the fix needs no prompt changes because the seed carries over. Real model failures at full resolution with a declared language and a correctly sized script are rare and should be filed as bug reports.
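If you triage a lot of renders, the checklist can be written down as a function that returns the first failing step. This is only a sketch: the `SyncReport` fields are hypothetical, you fill them in by inspecting the prompt and the clip yourself, and the syllable check assumes the two-syllables-per-second rule from section 3.

```typescript
// Hypothetical record of what you know about a render that looks out of sync.
interface SyncReport {
  leadsWithSay: boolean;
  languageDeclared: boolean; // count English lines as declared
  syllables: number;
  durationSec: number;
  resolution: "256p" | "720p" | "1080p";
  faceOccluded: boolean;
}

// Returns the first check that fails, in checklist order, or a bug-report
// suggestion when every check passes.
function triage(r: SyncReport): string {
  if (!r.leadsWithSay) return "1: rewrite the prompt to lead with SAY:";
  if (!r.languageDeclared) return "2: declare the spoken language in English";
  if (r.syllables > r.durationSec * 2) return "3: shorten the script or extend the duration";
  if (r.resolution === "256p") return "4: regenerate at 720p with the same seed";
  if (r.faceOccluded) return "5: reframe so the mouth is visible";
  return "no prompt issue found: treat as a possible model bug";
}

const verdict = triage({
  leadsWithSay: true,
  languageDeclared: true,
  syllables: 13,
  durationSec: 3,
  resolution: "256p",
  faceOccluded: false,
});
console.log(verdict);
```

The early returns mirror the stop-at-the-first-failure rule: the sample report fails on script length, so the 256p check is never reached.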
