FaceTime in iOS 14 also includes a feature that makes it appear that you are looking into the camera even when you are not.

https://appleinsider.com/articles/20/06/22/facetime-eye-contact-correction-feature-to-launch-with-ios-14
Basically it's not video anymore; it's motion capture applied to an avatar, where the default avatar is your original face.

It seems like this could /also/ be used for video by using this technique along with residual coding.
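To make that concrete, here is a minimal sketch of the keypoints-plus-residual idea. extract_keypoints and generate_frame are hypothetical stand-ins for learned models (something like a first-order-motion-style generator), not NVIDIA's actual pipeline:

    import numpy as np

    def extract_keypoints(frame: np.ndarray) -> np.ndarray:
        # Placeholder: a real system runs a keypoint-detection network here.
        return np.zeros((10, 2), dtype=np.float32)

    def generate_frame(reference: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
        # Placeholder: a real system warps/regenerates the reference via a GAN.
        return reference.copy()

    def encode(reference: np.ndarray, frame: np.ndarray):
        keypoints = extract_keypoints(frame)  # tens of bytes per frame
        predicted = generate_frame(reference, keypoints).astype(np.int16)
        residual = frame.astype(np.int16) - predicted
        # Residual coding: quantize coarsely; a real codec would transform-code it.
        return keypoints, (residual // 8).astype(np.int8)

    def decode(reference: np.ndarray, keypoints: np.ndarray,
               residual: np.ndarray) -> np.ndarray:
        predicted = generate_frame(reference, keypoints).astype(np.int16)
        frame = predicted + residual.astype(np.int16) * 8
        return np.clip(frame, 0, 255).astype(np.uint8)

The point is that the keypoints alone are tiny, and the residual only has to carry whatever the avatar reconstruction gets wrong.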
This is pretty phenomenal. The face realignment and the reduced bandwidth consumption alone could drastically change the videoconferencing experience.

Looking forward to having this available in consumer hardware soon.

More examples here:

https://developer.nvidia.com/maxine
Some comments have touched on possible issues, such as swapping in a keyframe of someone else's face, and funky effects from introducing other faces or objects into the camera image.

But I haven't seen anybody touch on the compute cost required to implement this. As I'm not in the machine learning field, I don't have a good idea what the compute cost is for something like this. Can anybody chime in on that?

If this "codec" were to require a somewhat beefy GPU, I don't see the benefits at all. Current H.264 decoding is usually done in hardware, and sometimes encoding is too. In areas where bandwidth is constrained, I would expect computing resources to be scarce as well, nullifying the entire premise. That said, in current times it would save a substantial amount of transmitted data. But I'm not sure we should lock our entire videoconferencing system into NVIDIA just to save some bandwidth.
What sort of latency are we looking at for these AI-regenerated videos?

I thought comparing it in KB per frame was a strange way to measure it, since video codecs are usually measured the way networks are, in kbps or Mbps.

So the video codec was actually running at 50 kbps, which is indeed a very low bitrate. But the comparison was against H.264, which is now nearly 20 years old. Modern codecs like HEVC and VP9, or state-of-the-art ones like AV1 and VVC, would have done much, much better.

Next problem: would this only work on NVIDIA GPUs? Apple is already doing something similar in FaceTime, but only for eye contact. Are we entering an era where even AI video codecs are bound to specific devices?

I used to hope and wish Apple would introduce these kinds of features to the iPhone. But their actions and responses around the App Store are making me wary.
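For anyone converting between the two units: KB per frame times 8 times the frame rate gives kbps. A quick sketch; the 30 fps and the 0.21 KB/frame input are illustrative assumptions, not figures from the demo:

    def kb_per_frame_to_kbps(kb_per_frame: float, fps: float = 30.0) -> float:
        # kilobytes -> kilobits (x8), frames -> seconds (x fps)
        return kb_per_frame * 8 * fps

    print(kb_per_frame_to_kbps(0.21))  # ~50 kbps, matching the quoted bitrate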
The GAN is similar to the unsupervised one used to create deepfakes by Aliaksandr Siarohin et al. (the first-order motion model). The catch is that if the subject moves a lot w.r.t. the original frame, it creates hilarious artefacts. But still, it's great if you have GPUs on each end.
The unspoken elephant in the room is that, obviously, it doesn't even have to be your face being animated in the video call. You could swap out the first keyframe image and appear to be any other real person during the call, with the same fidelity. Sounds great for corporate espionage and lurking on calls you shouldn't be on.

I don't think it's fair to call this video compression so much as real-time photorealistic animation via motion capture.
The magic is knowing when to take a new reference photo, in case someone else walks in, or I drink from a cup of coffee, or hold up an object to the camera. At which point we're almost back to H.264, except it's unclear if that will work without additional training.
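One way a sender could decide is to run the decoder locally and compare its output against the raw camera feed. A minimal sketch; both the error metric and the threshold are pure assumptions, not anything NVIDIA has published:

    import numpy as np

    def needs_new_reference(camera_frame: np.ndarray,
                            reconstructed: np.ndarray,
                            threshold: float = 15.0) -> bool:
        # Mean absolute pixel error between what the receiver would display
        # and what the camera actually sees; large drift => send a new keyframe.
        error = np.mean(np.abs(camera_frame.astype(np.float32)
                               - reconstructed.astype(np.float32)))
        return bool(error > threshold)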
One step closer to having decent VR meetings. I don’t want to see someone’s avatar; I want a virtual representation of their face that looks like they’re really talking.
I love seeing new technologies like this emerge during these changing and unfamiliar times.

FaceTime calls on a remote satellite internet setup will be revolutionary.
The limitations are pretty crippling for real-world use.

You need a fixed camera, one face, and a fairly static background, so there goes mobile or conference-room use.
Trading bandwidth for compute.

Reminds me of a section in Hofstadter's 'Gödel, Escher, Bach' about knowledge residing in the signal vs. in the receiver, or something akin to that.