Multilingual Lip Sync: Seven Languages in One Pass
HappyHorse ships native audio across Chinese, English, Japanese, Korean, German, French, and Cantonese with ultra-low WER lip sync. Here is how the unified architecture handles it and what to write into your prompt.
HappyHorse 1.0 ships native audio synthesis across seven languages: Chinese, English, Japanese, Korean, German, French, and Cantonese. Word Error Rate on lip sync lands under 5 percent in the initial Artificial Analysis benchmarks. That is a different category of output from any other public video model. Veo 3.1 does English well and degrades outside it. Kling v3 Pro supports native audio but not multilingual lip targeting. Seedance 2.0 is silent and you add audio in post.
If you are shooting a talking head in any of those seven languages, HappyHorse becomes the first tool where you do not need a separate TTS pass plus a Wav2Lip cleanup.
Why seven languages
The model is a single 40-layer Transformer. The first four and last four layers are modality-specific, handling tokenization for text, audio, and video separately. The middle 32 layers are shared and have no cross-attention. You get a joint embedding space where a Japanese phoneme, a Korean mouth shape, and a video frame all sit in the same representational neighborhood. The team trained on a large bilingual Chinese and English base, then extended with targeted passes in Japanese, Korean, Cantonese, German, and French.
Spanish, Portuguese, and Arabic are expected in a later revision but are not in the April 2026 release.
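The 4 / 32 / 4 split described above can be sketched as a simple layer map. This is illustrative only; the names are hypothetical and HappyHorse's internals are not public.

```typescript
// Sketch of the 40-layer split: first and last 4 layers are
// modality-specific, the middle 32 form the shared trunk.
type LayerRole = "modality-specific" | "shared";

function layerRole(index: number, total = 40, edge = 4): LayerRole {
  // First `edge` and last `edge` layers handle per-modality tokenization;
  // everything in between is shared and has no cross-attention.
  if (index < edge || index >= total - edge) return "modality-specific";
  return "shared";
}

const sharedCount = Array.from({ length: 40 }, (_, i) => layerRole(i))
  .filter((r) => r === "shared").length; // 32 shared layers
```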

Writing a prompt that triggers native audio
Two things matter. Name the language, and put the dialogue in quotes. The first four layers read the quoted span as audio tokens and hand the phoneme sequence to the shared middle block. If you forget the quotes, you get a silent clip with lip motion that does not match anything.
Medium close shot, a chef in a white jacket leans over a steaming pot and says in French, "Tu sens ce romarin? C'est la cle." Warm kitchen lighting, soft steam.
The language tag goes right before the quoted dialogue. The quoted text itself must be in the target language. Mixing English instructions with a non-English dialogue line produces garbled lip sync.
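Both rules can be enforced mechanically before a prompt ever hits the API. `buildDialoguePrompt` below is a hypothetical helper, not part of any SDK; it assumes the dialogue string is already in the target language.

```typescript
// Hypothetical helper: places the language tag immediately before the
// quoted dialogue, per the two rules above.
function buildDialoguePrompt(opts: {
  shot: string; // camera and subject description
  language: string; // e.g. "French"
  dialogue: string; // must already be in the target language
  atmosphere?: string; // trailing lighting/mood notes
}): string {
  const tail = opts.atmosphere ? ` ${opts.atmosphere}` : "";
  return `${opts.shot} and says in ${opts.language}, "${opts.dialogue}."${tail}`;
}

const prompt = buildDialoguePrompt({
  shot: "Medium close shot, a chef in a white jacket leans over a steaming pot",
  language: "French",
  dialogue: "Tu sens ce romarin? C'est la cle",
  atmosphere: "Warm kitchen lighting, soft steam.",
});
```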
Keep the line short
Target 6 to 14 words per 5-second clip. The temporal prior runs at roughly 24 frames per second, and a natural speaking rate is 2 to 3 words per second. A 10-word line fits with breathing room. Longer lines compress into rushed delivery.
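The word budget is easy to check programmatically. A minimal sketch, scaling the 6-to-14-word window linearly with clip length:

```typescript
// Checks a dialogue line against the 6-14 words per 5-second guideline,
// which follows from a natural speaking rate of 2-3 words per second.
function fitsClip(dialogue: string, clipSeconds = 5): boolean {
  const words = dialogue.trim().split(/\s+/).filter(Boolean).length;
  const min = Math.ceil((6 * clipSeconds) / 5);
  const max = Math.floor((14 * clipSeconds) / 5);
  return words >= min && words <= max;
}
```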
If your scene needs a longer line, split into two 5-second clips with a cut. HappyHorse handles the second clip cleanly because you re-establish speaker identity through the image-to-video first-frame path.
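The split itself can be automated with a word-boundary cut. A naive sketch, splitting at the midpoint; a real pipeline would prefer clause boundaries:

```typescript
// Splits an over-budget line into two clips' worth of dialogue at a
// word boundary, keeping each half inside the 5-second budget.
function splitLine(dialogue: string): [string, string] {
  const words = dialogue.trim().split(/\s+/).filter(Boolean);
  const mid = Math.ceil(words.length / 2);
  return [words.slice(0, mid).join(" "), words.slice(mid).join(" ")];
}
```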

A working example
While the HappyHorse endpoint is pending on fal.ai, you can run the same prompt shape against a fallback model to validate the text structure. Audio on the fallback will be silent, but the shot composition and lip-motion grammar transfer.

```typescript
import { fal } from "@fal-ai/client";

// or fal-ai/happyhorse/v1/text-to-video once available
const result = await fal.subscribe("fal-ai/seedance-2.0/text-to-video", {
  input: {
    prompt: "Medium close shot, a chef in a white jacket leans over a steaming pot and says in French, 'Tu sens ce romarin? C'est la cle.' Warm kitchen lighting, soft steam.",
    duration: 5,
    resolution: "1080p",
    seed: 19,
  },
  logs: true,
});

console.log(result.data.video.url);
```
When HappyHorse lands on fal.ai, you swap the endpoint string and the same prompt produces the clip with audio attached. Pricing has not been announced but is expected to land in the top video band.
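The swap can be isolated in one place so the rest of the pipeline never changes. Note the HappyHorse path below is the one this article anticipates, not a confirmed fal.ai route:

```typescript
// Hypothetical endpoint routing: same prompt either way, only the
// endpoint string changes once HappyHorse goes live.
const HAPPYHORSE_ENDPOINT = "fal-ai/happyhorse/v1/text-to-video"; // anticipated, unconfirmed
const FALLBACK_ENDPOINT = "fal-ai/seedance-2.0/text-to-video";

function pickEndpoint(happyhorseLive: boolean): string {
  return happyhorseLive ? HAPPYHORSE_ENDPOINT : FALLBACK_ENDPOINT;
}
```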
Alternatives today
If you need audio now and HappyHorse is not yet available:
- Kling v3 Pro at $0.14 per second for 1080p with native audio. English is strong. Non-English lip sync is weaker. A 5-second clip costs $0.70.
- Veo 3.1 at $0.40 per second for 1080p with native audio. English is excellent. French, German, and Spanish are available at reduced quality. A 5-second clip costs $2.00.
- Seedance 2.0 at roughly $0.014 per second, silent, plus a separate TTS and lip-sync pass. A 5-second 720p clip is about $0.07 before audio post.
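The per-second rates above make clip budgeting a one-liner. A small sketch (rates in USD per second as quoted; rounding to cents avoids float noise):

```typescript
// Per-second 1080p/720p rates from the list above.
const ratesPerSecond = {
  "Kling v3 Pro": 0.14,
  "Veo 3.1": 0.4,
  "Seedance 2.0": 0.014,
} as const;

function clipCost(model: keyof typeof ratesPerSecond, seconds: number): number {
  // Round to cents to avoid floating-point noise.
  return Math.round(ratesPerSecond[model] * seconds * 100) / 100;
}
```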
What ultra-low WER buys you
WER under 5 percent is the number where viewers stop noticing sync as a sync problem. Above 10 percent, you see it. Under 5, dialogue reads as if recorded on set. HappyHorse is the first public model to clear that threshold across seven languages.
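For reference, WER is word-level edit distance over reference length. A minimal implementation, useful for spot-checking a model's dialogue against your script:

```typescript
// Word Error Rate: Levenshtein distance between reference and hypothesis
// word sequences, divided by the number of reference words.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // DP table: d[i][j] = edits to turn first i ref words into first j hyp words.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub);
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```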