VideoRetalk Video Generate
- Audio-driven lip-sync video generation — replaces the lip movements of the person in the video with ones matching the target audio
- Asynchronous processing mode; use the returned task ID to query the result
- Generated video links are valid for 24 hours — save them promptly
Typical use cases:
- Multilingual dubbing: replace the lip movements in the original video with a dubbed audio track in another language
- Virtual presenter: drive a character video with TTS-generated audio
- Advertising: quickly produce multilingual versions of the same video asset
- Education and training: replace instructor videos with explanations in different languages
Notes:
- Input URLs must be publicly accessible
- The video must contain a human face; otherwise the task will fail
- Pass ref_image_url when multiple faces are present in the video
Authorizations
All endpoints require Bearer Token authentication
Get your API Key:
Visit the API Key management page to obtain your API Key
Add the following header to every request:
Authorization: Bearer YOUR_API_KEY
Body
Model name
videoretalk "videoretalk"
Input video URL containing the person whose lip movements will be replaced
Requirements:
- Publicly accessible video URL
- Formats: MP4, MOV, and other common formats
- The video must contain a clearly visible human face
- Recommended duration: 2~300 seconds
"https://example.com/speaker.mp4"
Target audio URL — the person in the video will lip-sync to this audio
Requirements:
- Publicly accessible audio URL
- Formats: WAV, MP3, M4A, and other common formats
- Human speech content is recommended
"https://example.com/target-speech.wav"
Reference face image URL
When the video contains multiple faces, use this image to specify the target face whose lip movements should be replaced
Requirements:
- The image should show a clear frontal view of the target person's face
- Only required when the video contains multiple faces
"https://example.com/target-person-face.jpg"
Whether to automatically extend the video to match the audio length when the audio is longer than the video
- true: output duration = audio duration (video extended automatically)
- false: output duration = min(video duration, audio duration)
false
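The duration rule above can be written as a small helper for estimating output length before submitting a task. This is only an illustrative sketch: the function name `expected_output_duration` and its parameters are hypothetical and not part of the API.

```python
def expected_output_duration(video_s: float, audio_s: float, extend: bool) -> float:
    """Estimate the output duration in seconds from the documented rule.

    Hypothetical helper for illustration only; not part of the API.
    """
    if extend:
        # true: the output follows the audio length (video is extended if shorter)
        return audio_s
    # false: the output is cut to the shorter of the two inputs
    return min(video_s, audio_s)
```

For a 10-second video and a 15-second audio track, the helper returns 15 with extension enabled and 10 with it disabled.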
Face matching confidence threshold
- Range: 120~200
- Lower values match more easily (may cause false matches)
- Higher values are stricter (may fail to match)
- If "no matching face found" is reported, try lowering the value (e.g. 140)
- If the wrong face is matched, try raising the value (e.g. 190)
120 <= x <= 200
170
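One way to apply the tuning advice above when retrying a task is to nudge the threshold in the suggested direction and clamp it to the allowed range. The sketch below is an assumption-laden illustration: the function name, the `problem` labels, and the step size of 10 are all hypothetical choices, not values defined by the API.

```python
MIN_THRESHOLD = 120  # documented lower bound
MAX_THRESHOLD = 200  # documented upper bound


def adjust_threshold(current: int, problem: str, step: int = 10) -> int:
    """Suggest the next threshold to try after a failed or wrong match.

    `problem` uses hypothetical labels: "no_match" when no matching face
    was found, "wrong_face" when the wrong face was matched.
    """
    if problem == "no_match":
        candidate = current - step  # lower values match more easily
    elif problem == "wrong_face":
        candidate = current + step  # higher values are stricter
    else:
        raise ValueError(f"unknown problem label: {problem!r}")
    # Keep the suggestion inside the documented 120~200 range.
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, candidate))
```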
HTTPS callback URL invoked when the task completes
Trigger conditions:
- Triggered when the task is completed, failed, or cancelled
- Sent after billing confirmation
Security restrictions:
- HTTPS only
- Internal IP addresses are blocked (127.0.0.1, 10.x.x.x, 172.16-31.x.x, 192.168.x.x, etc.)
- URL length must not exceed 2048 characters
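The restrictions above can be checked locally before a task is submitted. The following is a minimal sketch, assuming a hypothetical helper named `check_callback_url`; it only inspects the URL string, does not resolve DNS, and may not match the server's own validation exactly.

```python
import ipaddress
from typing import List
from urllib.parse import urlparse

MAX_CALLBACK_URL_LENGTH = 2048  # documented limit


def check_callback_url(url: str) -> List[str]:
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    if len(url) > MAX_CALLBACK_URL_LENGTH:
        problems.append("URL longer than 2048 characters")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("scheme is not https")
    host = parsed.hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # host is a domain name; what it resolves to is not checked here
    if ip is not None and (ip.is_private or ip.is_loopback):
        problems.append("host is an internal IP address")
    return problems
```

A domain that resolves to an internal address would pass this check but could still be rejected by the service, so treat the result as a quick pre-flight test only.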
Callback behavior:
- Timeout: 10 seconds
- Up to 3 retries after failure (at 1s / 2s / 4s intervals)
- Response body format matches the task query API response
- A 2xx status code is considered success; other codes trigger a retry
"https://your-domain.com/webhooks/video-task-completed"
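To show how the body parameters and the Authorization header fit together, here is a rough sketch that builds (but does not send) a request with the standard library. Several things are assumptions: the endpoint URL is a placeholder, and every JSON key except ref_image_url is a guessed field name, since the parameter names are not shown in this section. Check the API reference for the real values before use.

```python
import json
import urllib.request

# ASSUMPTION: placeholder endpoint; the real URL is not given in this section.
API_URL = "https://api.example.com/v1/videos/generations"


def build_request(api_key, video_url, audio_url, ref_image_url=None, callback_url=None):
    """Build a POST request for a lip-sync task without sending it."""
    # ASSUMPTION: all keys except "ref_image_url" are guessed field names.
    payload = {
        "model": "videoretalk",
        "video_url": video_url,
        "audio_url": audio_url,
    }
    if ref_image_url:
        # Only needed when the video contains multiple faces.
        payload["ref_image_url"] = ref_image_url
    if callback_url:
        payload["callback_url"] = callback_url
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once the endpoint and field names have been confirmed.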
Response
Lip-sync video generation task created successfully
Task creation timestamp
1775200000
Task ID
"task-unified-1775200000-xyz12345"
Actual model name used
"videoretalk"
Specific task type
video.generation.task
Task progress percentage (0-100)
0 <= x <= 100
0
Task status
| Status | progress | Description |
|---|---|---|
| pending | 0~10 | Waiting to be processed |
| processing | 10~80 | Processing |
| completed | 100 | Completed |
| failed | 0 | Failed |
pending, processing, completed, failed "pending"
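Because processing is asynchronous, a client typically polls until the status is completed or failed. The sketch below is one possible approach under stated assumptions: `fetch_task` is a caller-supplied function (the query endpoint is not described in this section), and the response is assumed to be a dict with a `status` key. The polling interval and limit are arbitrary.

```python
import time
from typing import Callable, Dict

# Statuses from the table above after which no further change is expected.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(fetch_task: Callable[[], Dict],
                  interval_s: float = 5.0,
                  max_polls: int = 120) -> Dict:
    """Poll until the task reaches a terminal status, then return it.

    `fetch_task` is a hypothetical callable that queries the task and
    returns its JSON body as a dict with a "status" key (assumed name).
    """
    for _ in range(max_polls):
        task = fetch_task()
        if task.get("status") in TERMINAL_STATUSES:
            return task
        time.sleep(interval_s)  # still pending or processing; wait and retry
    raise TimeoutError("task did not reach a terminal status in time")
```

Once a task comes back as completed, download the generated video right away, since the links expire after 24 hours.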
Video task details
Task output type
video "video"
Usage and billing information