Traditional approaches tend to rely on a single driving signal, and each choice has a drawback. Methods driven only by audio can sometimes be unstable because the audio signal is relatively weak, while methods driven only by facial keypoints can produce unnatural videos because the keypoints control the result too tightly. EchoMimic, by contrast, can generate portrait videos from audio alone, from facial keypoints alone, or from a combination of the two.