
Native Audio and Joint Synthesis: When to Let the Model Do Both

HappyHorse 1.0 synthesizes audio and video in one pass, off one latent trajectory, with lip sync that holds across seven languages. That sounds like a feature you always want on. In practice, joint synthesis is a decision, not a default.

Why joint synthesis is different

The older pattern: render a silent clip, send it to a voice model, line up phonemes with mouth shapes, pay someone to mix. Each handoff is a seam, and seams leak. A sibilant arrives a frame late. A consonant hits without a jaw on it.

HappyHorse flattens those seams. A single 40-layer Transformer with roughly 15 billion parameters emits both modalities together, so the vowel shape and the waveform come from the same conditioning. Breath lands with a chest rise. A stop consonant arrives with a lip press on the exact frame.

Close up of speaker mid phrase with matching waveform

The Artificial Analysis leaderboard gives HappyHorse a top T2V Elo of 1333 and an I2V Elo of 1392, about 60 points ahead of Seedance 2.0. Word error rate, measured by running ASR on the generated audio and comparing the transcript against the scripted line, is described as very low, but the public card has not posted an exact figure yet, so none is quoted here.
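
The word error rate check mentioned above can be reproduced on your own takes. The following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), not the method used by the leaderboard or the model card, which this article does not describe. The function name `wordErrorRate` and the whitespace tokenization are assumptions; a language written without spaces, such as Japanese in kana, would need a real tokenizer or a character-level rate instead.

```javascript
// Word error rate = (substitutions + deletions + insertions) / reference words.
// Assumption: words are separated by whitespace. That does not hold for
// unspaced scripts, where a tokenizer or character error rate is needed.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // Levenshtein distance over words, keeping only the previous row.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Feed it the scripted line as `reference` and the ASR transcript of the generated audio as `hypothesis`. Normalizing case and punctuation on both sides before the call keeps trivial differences out of the score.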

When joint synthesis pays off

Let the model do both when all three are true: the scene is dialogue driven, you want the deliverable to land without a mix pass, and the final market is a supported language.

  • Talking head shorts. One speaker, medium close up, one take. You want the breath.
  • Product demos with a voiceover in frame. Joint synthesis keeps the click on the click.
  • Educational b-roll with narration and matched foley. Pencil scratch, page turn, kettle whistle.
  • Short form social in a supported language. The lip sync holds on a phone.
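
The three-part test above can be written down as a small gate in a render pipeline. This is only a sketch: the function name, the `scene` field names, and the language codes are hypothetical, and the set of supported languages is passed in by the caller because this article does not list all seven.

```javascript
// Returns true only when all three conditions from the text hold:
// the scene is dialogue driven, the deliverable should ship without a
// mix pass, and the target language is supported.
// Field names and language codes are illustrative assumptions.
function shouldUseJointSynthesis(scene, supportedLanguages) {
  return (
    scene.dialogueDriven === true &&
    scene.skipMixPass === true &&
    supportedLanguages.has(scene.language)
  );
}

// Example: Japanese appears in this article's prompts; fill in the rest
// of the set from the model card once it is public.
const supported = new Set(["ja"]);
const talkingHead = { dialogueDriven: true, skipMixPass: true, language: "ja" };
const useJoint = shouldUseJointSynthesis(talkingHead, supported);
```

Keeping the language set outside the function means the gate does not need to change when the supported list does.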

When to skip it

  • The brand music bed is final. A composer already delivered the track, and generated ambient would fight the mix.
  • Markets outside the seven languages. Arabic, Portuguese, Hindi, and Turkish are not yet supported.
  • Legal or brand approval requires a human voice actor.
  • Stitching into a longer edit with existing music. Generated audio clashes at the cut.

In those cases, treat HappyHorse as a high end visual model and drop the audio track on ingest.
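
One common way to drop the audio track on ingest is ffmpeg's `-an` flag, which disables audio in the output, combined with `-c:v copy` so the video stream is copied rather than re-encoded. The sketch below only builds the argument list; the helper name and the file paths are hypothetical, and it assumes ffmpeg is installed wherever the pipeline runs.

```javascript
// Build ffmpeg arguments that keep the video stream untouched and
// discard every audio stream.
//   -c:v copy  copy the video stream without re-encoding
//   -an        write no audio to the output
function stripAudioArgs(inputPath, outputPath) {
  return ["-i", inputPath, "-c:v", "copy", "-an", outputPath];
}

// Hypothetical paths, for illustration only.
const args = stripAudioArgs("take_01.mp4", "take_01_silent.mp4");
// In Node.js this could be run with:
//   require("node:child_process").spawnSync("ffmpeg", args, { stdio: "inherit" });
```

Because the video is stream-copied, the silent file keeps the original picture quality and the step is quick even on long takes.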

The prompt shape that works

Put speech on a SAY: label at the top, then AMBIENT:, then FOLEY:, then VIDEO:. The model treats each label as a separate instruction channel.

TEXT
SAY (Japanese, calm, female, 28): ...
AMBIENT: late afternoon cicadas, distant train, small room
FOLEY: teacup lifted from saucer on line two, one clink
VIDEO: tight medium, 35mm feel, soft window light, slow push in
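
If prompts are assembled in code, a small builder keeps the label order consistent across takes. This is a sketch under assumptions: `buildPrompt` and its field names are hypothetical, and the only thing taken from the article is the label order and the `SAY (hint): line` shape shown above.

```javascript
// Assemble a labeled prompt in the order SAY, AMBIENT, FOLEY, VIDEO.
// Channels that are not provided are skipped.
// `say.hint` is the parenthetical voice description, `say.line` the spoken text.
function buildPrompt({ say, ambient, foley, video }) {
  const lines = [];
  if (say) {
    const hint = say.hint ? ` (${say.hint})` : "";
    lines.push(`SAY${hint}: ${say.line}`);
  }
  if (ambient) lines.push(`AMBIENT: ${ambient}`);
  if (foley) lines.push(`FOLEY: ${foley}`);
  if (video) lines.push(`VIDEO: ${video}`);
  return lines.join("\n");
}

const prompt = buildPrompt({
  say: {
    hint: "Japanese, calm, female, 28",
    line: "ashita no asa, koohii o ippai nomi ni kite kudasai.",
  },
  ambient: "late afternoon cicadas, distant train, small room",
  foley: "teacup lifted from saucer on line two, one clink",
  video: "tight medium, 35mm feel, soft window light, slow push in",
});
```

The resulting string can be passed as the `prompt` field of a request. Leaving out `say` produces an ambient and foley only prompt without changing the order of the remaining labels.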

For the seven language claim, here is the SAY: line in Japanese, first in romaji, then kana. HappyHorse accepts both.

TEXT
SAY (romaji): ashita no asa, koohii o ippai nomi ni kite kudasai.
SAY (kana): あしたの あさ、コーヒーを いっぱい のみに きて ください。

Language label matrix showing seven supported tongues

A runnable request

Until the public HappyHorse endpoint lands, code falls back to Seedance 2.0, which renders the video without joint audio. Swap the model id when the public card goes live.

JAVASCRIPT
import { fal } from "@fal-ai/client";

// Falls back to Seedance 2.0 until the HappyHorse endpoint is public.
const result = await fal.subscribe("fal-ai/seedance-2.0/text-to-video", {
  // or fal-ai/happyhorse/v1/text-to-video once available
  input: {
    // One labeled channel per line: speech, ambience, foley, camera.
    prompt: [
      "SAY (Japanese, calm, female, 28): ashita no asa, koohii o ippai nomi ni kite kudasai.",
      "AMBIENT: late afternoon cicadas, distant train, small room",
      "FOLEY: teacup lifted from saucer on line two, one clink",
      "VIDEO: tight medium, 35mm feel, soft window light, slow push in"
    ].join("\n"),
    resolution: "1080p",
    duration: 8, // seconds
    aspect_ratio: "16:9"
  },
  logs: true
});

A 1080p take of roughly 38 seconds is the published ceiling on an H100. For an 8 second dialogue scene with joint audio, you are well inside the budget.

Quality pass checklist

  • Scrub lip sync at quarter speed on a hard consonant word. The jaw should meet the word, not chase it.
  • Mute the track and watch the subject breathe. If the chest rise tracks the spoken line, the joint model earned its keep.
  • Listen on a phone speaker. Most of your audience will.

If the take passes all three, ship it. If not, re-roll with a tighter SAY: demographic hint and a shorter dialogue span.

