Upload a Photo, Get a Video
Rapid developments in AI have unlocked new possibilities for digital representation. With the help of AI models, you can now achieve a remarkable feat: bringing characters to life with just an image and an audio clip.
Jointly developed by Tencent Hunyuan and Tencent Music, the newly released HunyuanVideo-Avatar is a multimodal diffusion-transformer-based model that generates dynamic, emotion-controllable, multi-character dialogue videos. It supports head-and-shoulder, half-body, and full-body framing, and covers multiple styles, species, and even dual-character scenes.
To put it simply, you upload a photo and a voice clip, and the model infers the context, emotion, and lip movements needed to create a realistic animated video.
For instance, if you upload an image of a woman sitting on a beach with a guitar, along with a piece of lyrical music, the model understands the scene as "a woman playing the guitar and singing a lyrical song by the sea," and subsequently generates a video of the woman performing the song.
The model gives video creators highly consistent and dynamic video generation capabilities. Its versatility unlocks a wide range of applications in fields such as entertainment, media, e-commerce, advertising, and education.
It has already been applied in multiple scenarios within Tencent Music, such as AI companions for music listening, long-form audio podcasts, and music videos (MVs).
For example, on the app QQ Music, when users listen to songs by "AI Leehom" (a fully AI-driven singer created by Tencent Music and Team Leehom), a lively and adorable AI Leehom avatar synchronizes its singing in real time in the player.
On WeSing, a popular karaoke singing app, users can upload their images to generate personalized MVs of themselves singing.
In subject consistency and audio-video synchronization, HunyuanVideo-Avatar delivers top-tier performance for the industry. In video dynamics and naturalness of body movement, it surpasses open-source solutions and rivals closed-source ones.
Currently, the model supports audio uploads of up to 14 seconds for video generation, with more capabilities to be released and open-sourced in the future.
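Because the model currently accepts audio clips of at most 14 seconds, it can be useful to check a clip's length before submitting it. Below is a minimal sketch of such a check for WAV files using only Python's standard library; the function names and the assumption that inputs arrive as WAV files are illustrative, not part of any HunyuanVideo-Avatar API.

```python
import wave

# Current audio-length limit stated for HunyuanVideo-Avatar (assumption:
# enforced client-side here; the names below are illustrative).
MAX_AUDIO_SECONDS = 14.0

def audio_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_within_limit(path: str, limit: float = MAX_AUDIO_SECONDS) -> bool:
    """Check whether a clip fits within the model's audio-length limit."""
    return audio_duration_seconds(path) <= limit
```

For compressed formats such as MP3, a third-party library (or a tool like ffprobe) would be needed to read the duration, but the length check itself stays the same.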