> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.<p>A nitpick, perhaps, but isn't that three orders of magnitude?<p>We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish avatars and, with this new invention, actually <i>be</i> them in real time. There's been a lot of discussion centered around deep fakes and their problems, and this is essentially deep-faking yourself into whatever you want.<p>Video calls are a very important form of communication at the moment. If this becomes as accepted as background modification, it would open the societal door to a whole range of self-presentation that up till now was restricted to in-game virtual characters.<p>I wonder what kind of implications that could have. Would people come to identify strongly with a virtual avatar, perhaps more strongly than with their real-life "avatar"? It is an awesome freedom to have, to remake yourself.
A technology very similar to this is a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.<p>In his universe, both the interstellar net and the combat links between ships are low bandwidth, so video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".
Fundamentally, I don't know if people realise what we're on the verge of here.<p>It's effectively motion-mapped keypoints of a person projected onto a simulated model (see the sketch below). I'm assuming the cartoonish avatar was used as an example partly to avoid drawing direct lines to the full implications.<p>- There's no reason this couldn't extend to voice modelling as well (much clearer speech at much lower bandwidth).<p>- There's no reason this couldn't extend to replacing your sent projection with another image (or person).<p>- A professional-looking, suit-wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)<p>- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.<p>- Why couldn't we model the rest of the environment?<p>Not there today, but this future is closer than many realise.
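To make the keypoint idea above concrete, here is a minimal sender-side sketch, assuming MediaPipe Face Mesh for landmark extraction. The subset of indices, the float16 packing, and the payload format are illustrative assumptions, not NVIDIA's actual scheme, and the receiver-side generator that warps a reference keyframe with these points is omitted entirely.

```python
# Minimal sketch of the sender side of a keypoint-based video call.
# Assumes MediaPipe Face Mesh for landmarks; the index subset and the
# float16 packing are illustrative, not NVIDIA's actual format.
import struct

import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False)

# Evenly spaced stand-in for the curated eye/nose/mouth indices a real
# system would pick: 30 of the 468 Face Mesh landmarks.
KEY_IDS = list(range(0, 468, 16))

def encode_frame(bgr_frame):
    """Return a compact byte payload of (x, y) keypoints, or None if no face."""
    results = face_mesh.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    coords = []
    for i in KEY_IDS:
        coords += [lm[i].x, lm[i].y]                 # normalised 0..1 coordinates
    return struct.pack(f"{len(coords)}e", *coords)   # float16: 30 * 2 * 2 = 120 bytes

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    payload = encode_frame(frame)
    print(len(payload) if payload else 0, "bytes for this frame")
cap.release()
```

Even this naive packing of ~30 points lands in roughly the same ballpark (~0.12 KB/frame) as the figure quoted from the article.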
This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkiness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.<p>Compression was expensive, because finding good morph points is hard. But hardware has now caught up, and it can be done in real time on cheap devices.<p>As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.<p>You can be too early. Kerner Optical went bust a decade ago.<p>[1] <a href="https://youtu.be/VBfss0AaNaU" rel="nofollow">https://youtu.be/VBfss0AaNaU</a>
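A rough sketch of the playback half of that morph-point idea: once the decoder has matched point sets for two keyframes, intermediate frames come from interpolating the points (the actual pixel warping is omitted here, and the point data is made up for illustration).

```python
# Toy sketch of morph-point interpolation between two keyframes: the part
# that makes slow motion / frame-rate increase smooth. Real playback would
# also warp the keyframe pixels according to these interpolated points.
import numpy as np

def interpolate_points(points_a, points_b, t):
    """Linearly blend two (N, 2) arrays of morph points at time t in [0, 1]."""
    return (1.0 - t) * points_a + t * points_b

# Two keyframes' morph points (made-up data); generate 3 intermediate frames.
kf0 = np.array([[10.0, 12.0], [40.0, 15.0], [25.0, 50.0]])
kf1 = np.array([[12.0, 14.0], [43.0, 13.0], [26.0, 55.0]])
for t in (0.25, 0.5, 0.75):
    print(t, interpolate_points(kf0, kf1, t).tolist())
```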
This may be projecting expectations, but the example compressed video looks very slightly fake in a way that is just a little unsettling, uncanny-valley style.<p>Perhaps the nets they’re using are compressing out facial microexpressions, and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they’re missing is more obvious and less artificial. In other words, I’d rather be presented with something obviously flawed than something where I can’t quite tell what is wrong.
What I don't like about AI-processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it appears in reality, but at some smart approximation of the world that is not necessarily true.
Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.<p>Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.
I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.<p>In a work environment, I would expect the person I'm talking to to be presentable, i.e. their avatar would be presentable, so no goofy backgrounds or annoying accessories.<p>But the key for me is, I'd actually have something to see. So often at work I'm in meetings where three people have their cameras on and the rest don't. I don't really care what they look like; I care whether they're engaged, nodding their heads, their facial reactions.<p>I don't always have my video on either. I don't have great upload speeds, so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions.
My first thought was about the diversity of faces used in the demo and how ten years ago, computers didn't think black people were humans.<p><a href="https://www.youtube.com/watch?v=t4DT3tQqgRM" rel="nofollow">https://www.youtube.com/watch?v=t4DT3tQqgRM</a><p>But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.
If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.
I would imagine Apple doing this with FaceTime soon.<p>Using their own NPU ( Neural Processing Unit ), you could make FaceTime calls with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at a buttery-smooth 60fps ( I can literally hear Apple market the crap out of this ) is about 7 KB/s, or 56 Kbps! Remember when the industry was trying to compress CD-quality audio ( think 128 Kbps MP3 ) down to 64 Kbps? This FaceTime video call would be using even less!<p>And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else, they now have an even better excuse not to open it up and to further lock customers into their ecosystem. ( Not such a good thing with how Apple is acting right now )<p>Not so sure where Nvidia is heading with this, since not everyone will have a CUDA GPU.
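A quick back-of-envelope check of those numbers (the per-frame figure is the one quoted from the Nvidia example; 60fps is the commenter's assumption):

```python
# Sanity-check the bandwidth arithmetic in the comment above.
kb_per_frame = 0.1165              # from the NVIDIA example
fps = 60                           # assumed frame rate
kbytes_per_s = kb_per_frame * fps  # ~6.99 KB/s
kbits_per_s = kbytes_per_s * 8     # ~55.9 kbps, i.e. under a 64 kbps MP3
print(f"{kbytes_per_s:.2f} KB/s = {kbits_per_s:.1f} kbps")
```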
I would go to the original announcement rather than this reproduction with ads: <a href="https://developer.nvidia.com/maxine?ncid=so-yout-26905#cid=dl13_so-yout_en-us" rel="nofollow">https://developer.nvidia.com/maxine?ncid=so-yout-26905#cid=d...</a>
Isn’t this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it’s mapped to your actual face.<p><a href="https://blog.emojipedia.org/apples-new-animoji/" rel="nofollow">https://blog.emojipedia.org/apples-new-animoji/</a><p>And how well does that work when you switch to screen sharing?
I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.<p>Big question though - is this just substituting the problem of not having good internet with not having a really fast nVidia graphics card?
Now the person you are speaking to is going to be n% (partially) emulated. n is going to increase in the future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace you entirely, even without your knowledge, and even after you die.
At what resolution? And does the output actually resemble the original image? Any examples with a background other than a uniform one? It would be nice if they provided more than just screenshots.<p>It's not uncommon to see video calls at 100-150 kbps, which is ~10 KB/s, and that is for 7fps or so, including audio. So "per frame" that would be 1 KB or so (more for keyframes, less for the intermediate frames).<p>They say it can be 0.1 KB, so better than that... Exciting, if realistic.<p>Also, add audio and packet overhead on top :-) there is at least 0.1 KB of overhead just for sending a packet (bundle it with the audio if possible!).
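To illustrate the "audio and packet overhead on top" point, a tiny calculation with assumed (not measured) figures: roughly 0.1 KB of header overhead per packet, one packet per video frame, and an assumed ~24 kbps voice codec.

```python
# Rough illustration: once audio and per-packet overhead are added, they can
# dwarf a 0.1 KB/frame video payload. All figures below are assumptions.
video_kb_per_frame = 0.1      # claimed video payload
overhead_kb_per_packet = 0.1  # IP/UDP/RTP headers etc. (rough guess)
fps = 7                       # low-rate call, one packet per frame
audio_kbps = 24               # assumed voice codec bitrate

video_kbps = video_kb_per_frame * fps * 8
overhead_kbps = overhead_kb_per_packet * fps * 8
total_kbps = video_kbps + overhead_kbps + audio_kbps
print(f"video {video_kbps:.1f} + overhead {overhead_kbps:.1f} "
      f"+ audio {audio_kbps} = {total_kbps:.1f} kbps")
```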
Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, Nvidia will have no way to debug it, since the model is essentially a black box.
People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.
The stills look OK; it would be interesting to see it in motion. Risk of uncanny valley?<p>Petapixel is a blog-spam site, btw. Why not go to the source that is linked in the post?
In Vernor Vinge's book A Fire Upon the Deep, he describes how interstellar calls work in the future. In the lowest bandwidth tier you are watching an animated static 2D photo with text-to-speech. The book also touches on topics like translation and the different spectra and senses different species want from the call.
Looks neat. I wonder what the system requirements / license for the software will be.<p>There's a real network effect with things like codecs: unless some significant proportion of calls can use it, it'll remain a cool but obscure experiment.<p>I hope Nvidia have the foresight to release something that'll run on any hardware, and under a permissive license, but I suspect not.<p>The idea is out there already (it's basically deep-fake tech, right?), and I'm sure it won't be long before some open-source version of it gets released. Nvidia would be wise to get out in front of that and at least have their brand associated with a widely used variant on the theme.
Great concept. With higher-quality input, keyframes, point tracking, and ML models, the output should improve significantly with a similar bandwidth advantage. I think this could also smooth over dropped frames and free up bandwidth for the audio feed.<p>The issues are social. I would hope that the receiver is the one able to choose between the original and the AI stream, as I can understand some people being uncomfortable with the artifacts, gaze, expressions, and other edge cases. But when the quality is higher, I could see a lot of people preferring this option as a default.
Huge if true.<p>Recall that the promise of 5G is an order-of-magnitude increase in speed (among other things, like low latency).<p>If we can get there by reducing bandwidth requirements by an order of magnitude instead, that will be great. Wonder if it applies to Netflix...
One of the comments [1] in the article (not mine) is excellent: tongue-in-cheek but thought-provoking:<p>> Next step would be to just predict both sides of the conversation and sever the real-life link entirely.<p>Gmail already does a little bit of this. Google books appointments over the phone on your behalf.<p>We're on the road to this...<p>[1] <a href="http://disq.us/p/2ccckuy" rel="nofollow">http://disq.us/p/2ccckuy</a>
To reproduce something similar, use this Colab notebook: <a href="https://colab.research.google.com/github/eyaler/avatars4all/blob/master/fomm_live.ipynb" rel="nofollow">https://colab.research.google.com/github/eyaler/avatars4all/...</a><p>In the final form, use the "You=" image as the reference and just press it about once per second to simulate sending a keyframe.<p>AMAZING!
This includes a feature:<p>> Called “Free View,” this would allow someone who has a separate camera off-screen to seemingly keep eye contact with those on a video call.<p>Am I the only one who thinks eye contact on video calls feels creepy? I think I would prefer this feature to <i>remove</i> eye contact on video calls rather than add it.
So I can steal all your video data of 'you' and call someone else back, as you?<p>I could even get an accomplice to do it while I'm talking to you. They would have on the clothes you're wearing today, and you'd be tied up talking to me.<p>I'm dubious the tech is as good as they say right now. But it's getting exciting.
<i>instead of sending a stream of pixel-packed images, it sends specific reference points on the image around the eyes, nose, and mouth</i><p>So they're trading bandwidth for compute load at either end. I wonder what the tradeoff is in terms of energy? Would this result in higher CO2 emissions?
My tests already take twice as long to run when I’m on a Google hangout. For my case, I’d honestly rather use more bandwidth and do minimal local processing. If my machine is slowed down any more then I might have to stop working completely and focus on the meeting I’m in!
Essentially deep-faking yourself. There's no way to know whether the nuances of your emotions will be reliably passed on, as everything but the facial keypoints is hallucinated. And then, it's so lifelike that you have no plausible deniability.
Hmm... interesting. I think it looks really good. I wonder how soon it will be before a LEO agency uses tech similar to this to alter bodycam footage.<p>(Yes, I know this is realtime webcam footage, not recorded footage, I'm just curious).
Is it just me or do the videos no longer look natural? I feel like I see highly non-linear movements (parts of the video moving when they shouldn't, or vice-versa), and facial expressions don't really look quite the same.
How soon before this incorporates GPT3 and guesses what we were going to say anyway, so we no longer need to say it? Or doesn't quite guess right, and says something that gets you fired!
Do we actually need GPUs to run this? There is no training involved, only inference, and CPUs (or low-end GPUs) should comfortably run the workload, at least for a couple of faces.
When will we get VR goggles + this tech for couples, so we can shape-shift during sex, edit out the VR goggles, and explore scenes together while still viewing our partner?
Perhaps they are overdoing it (if you have a hammer, ...). I would think that the most useful way to use AI in this context would be to predict who is going to speak next.
Watching the video, it no longer feels like you're looking at a real person, but instead at just another NPC. It no longer feels as personal. The last thing remote relationships need is more impersonality. I hope this is used only when it's needed.
That is a cool use of AI!<p>Sidenote: I always had this idea for "video compression" (I'm by no means an expert in compression):<p>1) Take the top 10 most visually diverse movies (imagine Forrest Gump (slow drama) vs Toy Story (animation) vs Rambo (fast-paced visuals)).<p>2) "Encode/compress" (I know these terms are not interchangeable) the new movie as a "diff/ref" against the most similar movie from step 1.<p>2b) The "diff/ref" can take many forms, e.g. a sliding window over x sections of y frames.<p>3) The end user or destination has these 10 "master movies" locally on disk and, together with the local data, can reconstruct the original frame or movie from the compressed stream and the local movie.<p>Tl;DR: Try to compress a new movie by saying "the top corner of frames 1-120 is very similar to MasterMovie-2, frames XYZ-ABC." (A toy sketch of this is below.)
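For what it's worth, a toy Python sketch of that diff-against-a-master-library idea, with made-up data; the block size, the exhaustive search, and the lack of any entropy coding are all simplifications, not a real codec:

```python
# Toy sketch: encode each block of a new frame as a pointer into a small
# library of "master" reference blocks plus a residual.
import numpy as np

BLOCK = 16

def encode_block(block, references):
    """Return (ref_index, residual) for the closest matching reference block."""
    errors = [np.abs(block.astype(int) - r.astype(int)).sum() for r in references]
    best = int(np.argmin(errors))
    residual = block.astype(int) - references[best].astype(int)
    return best, residual      # a real codec would entropy-code the residual

# Fake data: three 16x16 "master movie" blocks and one block of a new frame.
rng = np.random.default_rng(0)
refs = [rng.integers(0, 256, (BLOCK, BLOCK), dtype=np.uint8) for _ in range(3)]
new_block = (refs[1].astype(int) + rng.integers(-5, 6, (BLOCK, BLOCK))).clip(0, 255)
idx, res = encode_block(new_block.astype(np.uint8), refs)
print("best reference:", idx, "max residual:", int(np.abs(res).max()))
```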