> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.<p>A nitpick, perhaps, but isn't that three orders of magnitude?<p>We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish avatars and, with this new invention, actually <i>be</i> them in real time. There's been a lot of discussion centered around deep fakes and their problems, and this is essentially deep-faking yourself into whatever you want.<p>Video calls are a very important form of communication at the moment. If this becomes as accepted as background modification, it would open the societal door to a whole range of self-presentation that up till now was restricted to in-game virtual characters.<p>I wonder what kind of implications that could have. Would people come to identify strongly with a virtual avatar, perhaps more strongly than with their real-life "avatar"? It is an awesome freedom to have, to remake yourself.
A technology very similar to this is a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.<p>In his universe, both the interstellar net and the combat links between ships are low bandwidth, so video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".
Fundamentally, I don't know if people realise what we're on the verge of here.<p>It's effectively motion-mapped keypoints of a person projected onto a simulated model (see the sketch below). I'm assuming the cartoonish avatar was used as an example partly to avoid drawing direct lines to the full implications.<p>- There's no reason this couldn't extend to voice modelling as well (much clearer speech at much lower bandwidth).<p>- There's no reason this couldn't extend to replacing your sent projection with another image (or person).<p>- A professional-looking, suit-wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)<p>- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.<p>- Why couldn't we model the rest of the environment?<p>Not there today, but this future is closer than many realise.
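To make the keypoint idea above concrete, here is a minimal sender-side sketch, assuming MediaPipe Face Mesh for landmark extraction. The subset of indices, the float16 packing, and the payload format are illustrative assumptions, not NVIDIA's actual scheme, and the receiver-side generator that warps a reference keyframe with these points is omitted entirely.

```python
# Minimal sketch of the sender side of a keypoint-based video call.
# Assumes MediaPipe Face Mesh for landmarks; the index subset and the
# float16 packing are illustrative, not NVIDIA's actual format.
import struct

import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False)

# Evenly spaced stand-in for the curated eye/nose/mouth indices a real
# system would pick: 30 of the 468 Face Mesh landmarks.
KEY_IDS = list(range(0, 468, 16))

def encode_frame(bgr_frame):
    """Return a compact byte payload of (x, y) keypoints, or None if no face."""
    results = face_mesh.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    coords = []
    for i in KEY_IDS:
        coords += [lm[i].x, lm[i].y]                 # normalised 0..1 coordinates
    return struct.pack(f"{len(coords)}e", *coords)   # float16: 30 * 2 * 2 = 120 bytes

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    payload = encode_frame(frame)
    print(len(payload) if payload else 0, "bytes for this frame")
cap.release()
```

Even this naive packing of ~30 points lands in roughly the same ballpark (~0.12 KB/frame) as the figure quoted from the article.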
This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkiness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.<p>Compression was expensive, because finding good morph points is hard. But hardware has now caught up, and it can be done in real time on cheap devices.<p>As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.<p>You can be too early. Kerner Optical went bust a decade ago.<p>[1] <a href="https://youtu.be/VBfss0AaNaU" rel="nofollow">https://youtu.be/VBfss0AaNaU</a>
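A rough sketch of the playback half of that morph-point idea: once the decoder has matched point sets for two keyframes, intermediate frames come from interpolating the points (the actual pixel warping is omitted here, and the point data is made up for illustration).

```python
# Toy sketch of morph-point interpolation between two keyframes: the part
# that makes slow motion / frame-rate increase smooth. Real playback would
# also warp the keyframe pixels according to these interpolated points.
import numpy as np

def interpolate_points(points_a, points_b, t):
    """Linearly blend two (N, 2) arrays of morph points at time t in [0, 1]."""
    return (1.0 - t) * points_a + t * points_b

# Two keyframes' morph points (made-up data); generate 3 intermediate frames.
kf0 = np.array([[10.0, 12.0], [40.0, 15.0], [25.0, 50.0]])
kf1 = np.array([[12.0, 14.0], [43.0, 13.0], [26.0, 55.0]])
for t in (0.25, 0.5, 0.75):
    print(t, interpolate_points(kf0, kf1, t).tolist())
```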
This may be projecting expectations, but the example compressed video looks very slightly fake in a way that is just a little unsettling, uncanny-valley style.<p>Perhaps the nets they’re using are compressing out facial microexpressions, and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they’re missing is more obvious and less artificial. In other words, I’d rather be presented with something obviously flawed than something where I can’t quite tell what is wrong.
What I don't like about AI-processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it appears in reality, but at some smart approximation of the world that is not necessarily true.
Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.<p>Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.
I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.<p>In a work environment, I would expect the person I'm talking to to be presentable, i.e. their avatar would be presentable, so no goofy backgrounds or annoying accessories.<p>But the key for me is, I'd actually have something to see. So often at work I'm in meetings where three people have their cameras on and the rest don't. I don't really care what they look like; I care whether they're engaged, nodding their heads, their facial reactions.<p>I don't always have my video on either. I don't have great upload speeds, so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions.
My first thought was about the diversity of faces used in the demo and how ten years ago, computers didn't think black people were humans.<p><a href="https://www.youtube.com/watch?v=t4DT3tQqgRM" rel="nofollow">https://www.youtube.com/watch?v=t4DT3tQqgRM</a><p>But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.
If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.
I would imagine Apple doing this with FaceTime soon.<p>Using their own NPU ( Neural Processing Unit ), you could make FaceTime calls with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at a buttery-smooth 60fps ( I can literally hear Apple market the crap out of this ) is about 7 KB/s, or 56 Kbps! Remember when the industry was trying to compress CD-quality audio ( think 128 Kbps MP3 ) down to 64 Kbps? This FaceTime video call would be using even less!<p>And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else, they now have an even better excuse not to open it up and to further lock customers into their ecosystem. ( Not such a good thing with how Apple is acting right now )<p>Not so sure where Nvidia is heading with this, since not everyone will have a CUDA GPU.
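A quick back-of-envelope check of those numbers (the per-frame figure is the one quoted from the Nvidia example; 60fps is the commenter's assumption):

```python
# Sanity-check the bandwidth arithmetic in the comment above.
kb_per_frame = 0.1165              # from the NVIDIA example
fps = 60                           # assumed frame rate
kbytes_per_s = kb_per_frame * fps  # ~6.99 KB/s
kbits_per_s = kbytes_per_s * 8     # ~55.9 kbps, i.e. under a 64 kbps MP3
print(f"{kbytes_per_s:.2f} KB/s = {kbits_per_s:.1f} kbps")
```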
I would go to the original announcement rather than this reproduction with ads: <a href="https://developer.nvidia.com/maxine?ncid=so-yout-26905#cid=dl13_so-yout_en-us" rel="nofollow">https://developer.nvidia.com/maxine?ncid=so-yout-26905#cid=d...</a>
Isn’t this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it’s mapped to your actual face.<p><a href="https://blog.emojipedia.org/apples-new-animoji/" rel="nofollow">https://blog.emojipedia.org/apples-new-animoji/</a><p>And how well does that work when you switch to screen sharing?
I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.<p>Big question though - is this just substituting the problem of not having good internet with not having a really fast nVidia graphics card?
Now the person you are speaking to is going to be n% (partially) emulated. n is going to increase in the future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace you entirely, even without your knowledge, and even after you die.
At what resolution? And does the output actually resemble the original image? Any examples with a background other than a uniform one? It would be nice if they provided more than just screenshots.<p>It's not uncommon to see video calls at 100-150 kbps, which is ~10 KB/s, and that is for 7fps or so, including audio. So "per frame" that would be 1 KB or so (more for keyframes, less for the intermediate frames).<p>They say it can be 0.1 KB, so better than that... Exciting, if realistic.<p>Also, add audio and packet overhead on top :-) there is at least 0.1 KB of overhead just for sending a packet (bundle it with the audio if possible!).
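To illustrate the "audio and packet overhead on top" point, a tiny calculation with assumed (not measured) figures: roughly 0.1 KB of header overhead per packet, one packet per video frame, and an assumed ~24 kbps voice codec.

```python
# Rough illustration: once audio and per-packet overhead are added, they can
# dwarf a 0.1 KB/frame video payload. All figures below are assumptions.
video_kb_per_frame = 0.1      # claimed video payload
overhead_kb_per_packet = 0.1  # IP/UDP/RTP headers etc. (rough guess)
fps = 7                       # low-rate call, one packet per frame
audio_kbps = 24               # assumed voice codec bitrate

video_kbps = video_kb_per_frame * fps * 8
overhead_kbps = overhead_kb_per_packet * fps * 8
total_kbps = video_kbps + overhead_kbps + audio_kbps
print(f"video {video_kbps:.1f} + overhead {overhead_kbps:.1f} "
      f"+ audio {audio_kbps} = {total_kbps:.1f} kbps")
```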
Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, Nvidia will have no way to debug it, since the model is essentially a black box.
People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.
The stills look OK; it would be interesting to see it in motion. Risk of uncanny valley?<p>Petapixel is a blog-spam site, btw. Why not go to the source that is linked in the post?
In Vernor Vinge's book A Fire Upon the Deep, he describes how interstellar calls work in the future. In the lowest bandwidth tier you are watching an animated static 2D photo with text-to-speech. The book also touches on topics like translation and the different spectra and senses different species want from the call.
Looks neat. I wonder what the system requirements / license for the software will be.<p>There's a real network effect with things like codecs: unless some significant proportion of calls can use it, it'll remain a cool but obscure experiment.<p>I hope Nvidia have the foresight to release something that'll run on any hardware, and under a permissive license, but I suspect not.<p>The idea is out there already (it's basically deep-fake tech, right?), and I'm sure it won't be long before some open-source version of it gets released. Nvidia would be wise to get out in front of that and at least have their brand associated with a widely used variant on the theme.
Great concept. With higher-quality input, keyframes, point tracking, and ML models, the output should improve significantly with a similar bandwidth advantage. I think this could also smooth over dropped frames and free up bandwidth for the audio feed.<p>The issues are social. I would hope that the receiver is the one able to choose between the original and the AI stream, as I can understand some people being uncomfortable with the artifacts, gaze, expressions, and other edge cases. But when the quality is higher, I could see a lot of people preferring this option as a default.
Huge if true.<p>Recall that the promise of 5G is an order-of-magnitude increase in speed (among other things, like low latency).<p>If we can get there by reducing bandwidth requirements by an order of magnitude instead, that will be great. Wonder if it applies to Netflix...
One of the comments [1] in the article (not mine) is excellent: tongue-in-cheek but thought-provoking:<p>> Next step would be to just predict both sides of the conversation and sever the real-life link entirely.<p>Gmail already does a little bit of this. Google books appointments over the phone on your behalf.<p>We're on the road to this...<p>[1] <a href="http://disq.us/p/2ccckuy" rel="nofollow">http://disq.us/p/2ccckuy</a>
To reproduce something similar, use this Colab notebook: <a href="https://colab.research.google.com/github/eyaler/avatars4all/blob/master/fomm_live.ipynb" rel="nofollow">https://colab.research.google.com/github/eyaler/avatars4all/...</a><p>In the final form, use the "You=" image as the reference and just press it about once per second to simulate sending a keyframe.<p>AMAZING!
This includes a feature:<p>> Called “Free View,” this would allow someone who has a separate camera off-screen to seemingly keep eye contact with those on a video call.<p>Am I the only one who thinks eye contact on video calls feels creepy? I think I would prefer this feature to <i>remove</i> eye contact on video calls rather than add it.
So I can steal all your video data of 'you' and call someone else back, as you?<p>I could even get an accomplice to do it while I'm talking to you. They would have on the clothes you're wearing today, and you'd be tied up talking to me.<p>I'm dubious the tech is as good as they say right now. But it's getting exciting.
<i>instead of sending a stream of pixel-packed images, it sends specific reference points on the image around the eyes, nose, and mouth</i><p>So they're trading bandwidth for compute load at either end. I wonder what the tradeoff is in terms of energy? Would this result in higher CO2 emissions?
My tests already take twice as long to run when I’m on a Google hangout. For my case, I’d honestly rather use more bandwidth and do minimal local processing. If my machine is slowed down any more then I might have to stop working completely and focus on the meeting I’m in!
Essentially deep-faking yourself. There's no way to know whether the nuances of your emotions will be reliably passed on, as everything but the facial keypoints is hallucinated. And then, it's so lifelike that you have no plausible deniability.
Hmm... interesting. I think it looks really good. I wonder how soon it will be before a LEO agency uses tech similar to this to alter bodycam footage.<p>(Yes, I know this is realtime webcam footage, not recorded footage, I'm just curious).
Is it just me or do the videos no longer look natural? I feel like I see highly non-linear movements (parts of the video moving when they shouldn't, or vice-versa), and facial expressions don't really look quite the same.
How soon before this incorporates GPT3 and guesses what we were going to say anyway, so we no longer need to say it? Or doesn't quite guess right, and says something that gets you fired!
Do we actually need GPUs to run this? There is no training involved, only inference, and CPUs (or low-end GPUs) should comfortably run the workload, at least for a couple of faces.
When will we get VR goggles + this tech for couples, so we can shape-shift during sex, edit out the VR goggles, and explore scenes together while still viewing our partner?
Perhaps they are overdoing it (if you have a hammer, ...). I would think that the most useful way to use AI in this context would be to predict who is going to speak next.
Watching the video, it no longer feels like you're looking at a real person, but instead at just another NPC. It no longer feels as personal. The last thing remote relationships need is more impersonality. I hope this is used only when it's needed.
That is a cool use of AI!<p>Sidenote: I always had this idea for "video compression" (I'm by no means an expert in compression):<p>1) Take the top 10 most visually diverse movies (imagine Forrest Gump (slow drama) vs Toy Story (animation) vs Rambo (fast-paced visuals)).<p>2) "Encode/compress" (I know these terms are not interchangeable) the new movie as a "diff/ref" against the most similar movie from step 1.<p>2b) The "diff/ref" can take many forms, e.g. a sliding window over x sections of y frames.<p>3) The end user or destination has these 10 "master movies" locally on disk and, together with the local data, can reconstruct the original frame or movie from the compressed stream and the local movie.<p>Tl;DR: Try to compress a new movie by saying "the top corner of frames 1-120 is very similar to MasterMovie-2, frames XYZ-ABC." (A toy sketch of this is below.)
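For what it's worth, a toy Python sketch of that diff-against-a-master-library idea, with made-up data; the block size, the exhaustive search, and the lack of any entropy coding are all simplifications, not a real codec:

```python
# Toy sketch: encode each block of a new frame as a pointer into a small
# library of "master" reference blocks plus a residual.
import numpy as np

BLOCK = 16

def encode_block(block, references):
    """Return (ref_index, residual) for the closest matching reference block."""
    errors = [np.abs(block.astype(int) - r.astype(int)).sum() for r in references]
    best = int(np.argmin(errors))
    residual = block.astype(int) - references[best].astype(int)
    return best, residual      # a real codec would entropy-code the residual

# Fake data: three 16x16 "master movie" blocks and one block of a new frame.
rng = np.random.default_rng(0)
refs = [rng.integers(0, 256, (BLOCK, BLOCK), dtype=np.uint8) for _ in range(3)]
new_block = (refs[1].astype(int) + rng.integers(-5, 6, (BLOCK, BLOCK))).clip(0, 255)
idx, res = encode_block(new_block.astype(np.uint8), refs)
print("best reference:", idx, "max residual:", int(np.abs(res).max()))
```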