What it does
Talking Avatar takes a single photo of a face — generated or real — plus either a script or an audio file, and produces a 15-second cinematic clip where that face speaks the words with correctly synced lips, natural head motion, and matching expression. There's no rigging, no model training, no manual keyframes. Drop the photo, type the line, hit generate.
The technology is ByteDance OmniHuman v1.5 — the same model enterprise studios use for synthetic spokesperson video. EGAKU exposes it as a single button with no setup.
The two engines under the hood
OmniHuman v1.5 handles the visual half: it analyzes the face structure in the photo, then animates it to match an audio waveform. Lip shapes, jaw motion, head tilts, eye blinks, micro-expressions — all derived from the audio.
TTS (text-to-speech) handles the audio half if you don't supply your own:
- Inworld TTS-1.5 Max — naturalistic narrator voices, English + Japanese, 9 curated picks (Sarah, Mark, James, Chloe, etc.)
- OpenAI HD voices — multilingual fallback (Nova, Onyx, Alloy, Echo, Fable, Shimmer)
You can also upload your own audio file (recorded voiceover, voice clone, MP3 from elsewhere). The avatar lip-syncs whatever audio you give it.
Step 1 — Upload a photo
Open /talking-avatar. Drop a portrait into Step 1. Best results come from:
- Front-facing or three-quarter angle (not pure profile)
- Both eyes visible
- Mouth closed or slightly parted
- Single subject, no obstructions over the face
- Reasonable resolution (1024×1024 or larger)
If you don't have a photo, generate one in Premium Studio first (GPT Image 2 or Nano Banana 2 work well) and the handoff button drops it straight into Talking Avatar.
Step 2 — Write a script (or upload audio)
In Step 2, pick either:
- Script mode — type up to ~5,000 characters of text. Pick a language (en / ja / es / zh) and a voice. EGAKU synthesizes the audio, then animates.
- Audio upload mode — drop an MP3 / WAV. The avatar lip-syncs to whatever you give it. Use this for branded voiceovers, voice clones, or pre-recorded content.
For scripts, six tone presets auto-fill the voice, language, and sample text: Natural / Presenter, Vlogger, Sultry / Whisper, Horror / Dark, Dramatic Monologue, and 日本語ナレーション (Japanese narration). Click one to skip the blank page.
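The Script mode limits from Step 2 can be checked before you submit. The following is a minimal sketch under stated assumptions: the function `build_script_request` and the field names (`mode`, `text`, `language`, `voice`) are hypothetical and do not describe a documented EGAKU API. It also treats the approximate 5,000-character limit as a hard cap.

```python
# Hypothetical pre-submit check for Script mode.
# Field names are assumptions, not a documented EGAKU API.
MAX_SCRIPT_CHARS = 5000  # the guide says "~5,000"; treated as a hard cap here
SUPPORTED_LANGUAGES = ("en", "ja", "es", "zh")


def build_script_request(text: str, language: str, voice: str) -> dict:
    """Validate a script and return an illustrative request payload."""
    script = text.strip()
    if not script:
        raise ValueError("script is empty")
    if len(script) > MAX_SCRIPT_CHARS:
        raise ValueError(
            f"script has {len(script)} characters, limit is {MAX_SCRIPT_CHARS}"
        )
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return {"mode": "script", "text": script, "language": language, "voice": voice}


request = build_script_request("Hello, and welcome to the demo.", "en", "Sarah")
print(request)
```

Rejecting an over-long or mislabeled script locally is cheaper than finding out after a generation has already been charged.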
Step 3 — Generate
Click Generate. The pipeline runs in three stages:
- TTS synthesizes the audio (~5 seconds; skipped when you upload your own audio)
- OmniHuman generates the lip-synced video (~2-3 minutes)
- The final MP4 (H.264 video + AAC audio) drops into your gallery
Each video costs around 65 credits, roughly $0.50. The result is downloadable, shareable, and auto-tagged with EGAKU AI metadata for attribution.
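For batch work, the per-video figures above turn into a quick estimate. This sketch uses only the approximate numbers quoted in this guide and assumes videos are generated one after another. Whether EGAKU can run several generations in parallel is not stated here, so treat the time range as an upper-bound guess. The function name `estimate_batch` is illustrative.

```python
# Back-of-envelope batch estimate using the approximate figures in this guide.
CREDITS_PER_VIDEO = 65        # "around 65 credits per video"
USD_PER_VIDEO = 0.50          # "roughly $0.50"
TTS_SECONDS = 5               # "~5 seconds"
RENDER_SECONDS = (120, 180)   # "~2-3 minutes"


def estimate_batch(videos: int) -> dict:
    """Estimate credits, cost, and wall-clock minutes for sequential generation."""
    low = videos * (TTS_SECONDS + RENDER_SECONDS[0])
    high = videos * (TTS_SECONDS + RENDER_SECONDS[1])
    return {
        "credits": videos * CREDITS_PER_VIDEO,
        "usd": round(videos * USD_PER_VIDEO, 2),
        "minutes": (round(low / 60, 1), round(high / 60, 1)),
    }


print(estimate_batch(4))    # e.g. one script in four languages
print(estimate_batch(100))  # e.g. a personalized outreach batch
```

Under these assumptions, four videos come to 260 credits and roughly 8 to 12 minutes, while 100 videos come to 6,500 credits, or about $50.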
5 use cases that actually convert
- Sales / cold outreach video: record yourself once to create a voice clone, then send 100 personalized LinkedIn videos, each with a different first name, without recording 100 times.
- Course / tutorial narration: keep the on-screen educator consistent across hours of material, with no studio time.
- VTuber-adjacent content: animate a static character image with episodic dialogue, no Live2D rig required.
- Multilingual product demo: take the same script in 4 languages, swap the voice and language, and generate 4 versions in roughly 10 minutes.
- Social media short: 15-second talking-head reels for Instagram / TikTok / X, with the EGAKU watermark driving traffic back.
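The outreach use case comes down to filling one script template with many first names. Here is a minimal sketch of that step, assuming you prepare the scripts yourself before generating. The template wording, the `$first_name` placeholder, and the function `personalized_scripts` are illustrative and not part of EGAKU.

```python
from string import Template

MAX_SCRIPT_CHARS = 5000  # approximate Script mode limit from Step 2

# Illustrative template; the wording and the $first_name field are assumptions.
OUTREACH = Template(
    "Hi $first_name, I recorded this short video because I think "
    "we could help your team this quarter."
)


def personalized_scripts(first_names: list[str]) -> dict[str, str]:
    """Return one script per recipient, keyed by first name."""
    scripts = {}
    for name in first_names:
        script = OUTREACH.substitute(first_name=name)
        if len(script) > MAX_SCRIPT_CHARS:
            raise ValueError(f"script for {name} exceeds {MAX_SCRIPT_CHARS} characters")
        scripts[name] = script
    return scripts


for name, script in personalized_scripts(["Aiko", "Ben", "Carla"]).items():
    print(f"{name}: {script}")
```

Each resulting script would then go through Step 2 as usual. If the voice clone is produced outside EGAKU, the scripts would be voiced there and the audio uploaded through Audio upload mode instead.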
Plans
Talking Avatar is available on the Lite plan (¥480/mo) and above. Each video costs about 65 credits. See the pricing page for plan details.