
Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real

434 points by gregsadetsky over 1 year ago

20 comments

phire over 1 year ago

The "magic" of the fake Gemini demo was the way it seemed like the LLM was continually receiving audio + video input and knew when to jump in with a response.

It appeared to be able to wait until the user had finished the drawing, or even jump in slightly before the drawing finished. At one point the LLM was halfway through a response, saw that the user was now colouring the duck in blue, and started talking about how the duck appeared to be blue. The LLM also appeared to know when a response wasn't needed because the user was just agreeing with it.

I'm not sure how many people noticed that on a conscious level, but I'm positive everyone noticed it subconsciously, and felt the interaction was much more natural, and much more advanced than current LLMs.

-----------------

Checking the source code, the demo takes screenshots of the video feed every 800ms, waits until the user finishes talking, and then sends the last three screenshots.

While this demo is impressive, it kind of proves just how unnatural it feels to interact with an LLM in this manner when it doesn't have continuous audio-video input. It's been technically possible to do this kind of thing for a while, but there is a good reason why nobody has tried to present it as a product.
godelski over 1 year ago

I don't get why companies lie like this. How much do they have to gain? It seems like they actually have a lot to lose.

What's crazy to me is that these tools are wildly impressive without the hype. As an ML researcher, there are a lot of cool things we've done, but at the same time almost everything I see is vastly overhyped, from papers to products. I think there's a kind of race to the bottom we've created, and it's not helpful to any of us except maybe in the short term. Playing short-term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.

But then again, with the discussions in this thread[0], maybe there are a lot of people so ethically bankrupt that they don't even know that what they're doing is deceptive. Which is an entirely different and worse problem.

[0] https://news.ycombinator.com/item?id=38559582
sheepscreek over 1 year ago

Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.

Yes - GPT-4V is a beast. I’d even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (https://github.com/haotian-liu/LLaVA). I have been playing with the 7B q5_k variant the last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (I will have to check the license first, or I might only use it for the internal demo to drive a point).
swyx over 1 year ago

haha yes it was entirely possible with gpt4v. literally just screenshot and feed in the images and text in chat format, aka "interleaved". made something similar at a hackathon recently (https://x.com/swyx/status/1722662234680340823). the bizarre thing is that google could've done what you did, and we would've all been appropriately impressed, but instead google chose to make a misleading marketing video for the general public and leave the rest of us frustrated nerds to do the nasty work of having to explain why the technology isn't as seen on TV yet; making it seem somehow our fault

i am curious about the running costs of something like this
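For anyone wondering what "interleaved" means concretely: a single chat message whose content alternates text and image parts, in the content-parts shape OpenAI's vision chat endpoint accepts. The function name and the fake frame bytes below are illustrative, not from the demo's code:

```python
import base64

def interleaved_message(transcript, jpeg_frames):
    """Build one user message interleaving the speech transcript
    with base64-encoded JPEG frames from the webcam."""
    content = [{"type": "text", "text": transcript}]
    for jpeg in jpeg_frames:
        b64 = base64.b64encode(jpeg).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"role": "user", "content": content}

# Toy frames stand in for real JPEG bytes.
msg = interleaved_message("What am I drawing?", [b"frameA", b"frameB"])
print(msg["content"][1]["type"])  # image_url
```

This message object would then go in the `messages` list of a normal chat-completion request against a vision-capable model; no special video API is involved, which is the commenter's point.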
iamleppert over 1 year ago

I’ve recently been trying to actually use Google’s AI conversational translation app that was released a while back and has had many updates and iterations since.

It’s completely unusable for real conversation. I’m actually in a situation where I could benefit from it, and I was excited to use it because I remember watching the demo and how natural it looked, but I was never able to actually try it myself.

Now, having used it, I went back and watched their original demo and I’m 100% convinced all or part of it was faked. There is just no way this thing ever worked. If they can’t manage to make conversational live translation work (which is a lot more useful than drawing a picture of a duck), I have high doubts about this new AI.

Seems like the exact same situation to me. It’s insane to me how much nerve it must take to completely fake something like this.
adtac over 1 year ago

[tangential to this really cool demo] JPEG images being the only possible interface to GPT-4 feels wasteful. the human eye works on the delta between "frames", not the image itself. I wonder if the next big step that would allow real-time video processing at high resolutions is to have the model's internal state operate on keyframes and deltas, similar to how video codecs like MPEG work.
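A toy version of that idea on the sending side: keep a frame only when it differs enough from the last one kept, so near-duplicate frames are never transmitted. The threshold and the mean-absolute-difference metric are arbitrary illustrations, not anything from the demo or from MPEG itself:

```python
def frames_to_send(frames, threshold=0.1):
    """Keep the first frame as a "keyframe", then keep later frames only
    when they differ enough from the last kept frame.
    Frames are equal-length flat sequences of pixel values in [0, 255]."""
    kept = [frames[0]]
    for frame in frames[1:]:
        last = kept[-1]
        # Mean absolute pixel difference, normalised to [0, 1].
        delta = sum(abs(a - b) for a, b in zip(frame, last)) / (255 * len(frame))
        if delta >= threshold:
            kept.append(frame)
    return kept

# Static scene, then motion, then static again: only 2 of 4 frames survive.
frames = [[0, 0, 0], [1, 2, 1], [120, 130, 125], [121, 129, 126]]
print(len(frames_to_send(frames)))  # 2
```

This only deduplicates inputs; what the comment actually speculates about, having the model's internal state consume deltas directly, would be an architecture change, not a preprocessing step.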
sibeliuss over 1 year ago

Lol at choosing the name Sagittarius, which is exactly across from Gemini in the Zodiac
zainhoda over 1 year ago

Wow, this is super cool! From the code it seems like the speech-to-text and text-to-speech are using the browser’s built-in features. I always forget those capabilities even exist!
razodactyl over 1 year ago

The latency is excusable as this is through the API. Inference on local infrastructure is almost instant, so this demo would smoke everything else if this dude had access.
dvaun over 1 year ago

Great demo, I laughed at the final GPT response too.

Honestly: it would be fun to self-host some code hooked up to a mic and speakers to let kids, or whoever, play around with GPT4. I’m thinking of doing this on my own under an agency[0] I’m starting up on the side. Seems like a no-brainer as an application.

[0]: https://www.divinatetech.com
n8fr8too over 1 year ago

I had been working on an idea for an interactive "Sorting Hat" system to help kids at schools know whether something was for trash, compost, or recycling. While I had been hacking on it for a bit, Greg's "demo" was much better integrated than what I could do, so thanks Greg!

I did add ElevenLabs support to make it a little more snazzy sounding...

So, here it is: the "Compost/Trash/Recycle Sorting Hat, Built on Sagittarius": https://github.com/n8fr8/CompostSortingHatAI

You can see a realtime, unedited YouTube demo video of my kid testing it out here: https://www.youtube.com/watch?v=-9Ya5rLj64Q
dingclancy over 1 year ago

I am now convinced that Google DeepMind really had nothing in terms of state-of-the-art (SOTA) LLMs. They were just bluffing. I remember when ChatGPT was released; Google was saying that they had much better models they were not releasing due to AI safety. Then they released PaLM and PaLM 2, saying it's time to beat ChatGPT with these models. However, it was not a good model.

They then hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model.

Sundar's code red was genuinely alarming because they had to dig deep to make this Gemini model work, and they still ended up with a fake video. Even if Gemini was legitimate, it did not beat GPT-4 by leaps and bounds, and now GPT-5 is on the horizon, putting them a year behind. It makes me question whether they ever had a secret powerful model at all.
cylinder714 over 1 year ago

Snader's Law: "Any sufficiently advanced technology is indistinguishable from a rigged demo."
iandanforth over 1 year ago

Looks like, again, this doesn't have GPT-4 processing video so much as a stack of video frames, concatenated and sent as a single image. But much closer to real!
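A rough sketch of what "concatenated and sent as a single image" could look like, on toy nested-list "images" rather than real pixel data (the actual demo may tile its frames differently):

```python
def hstack_frames(frames):
    """Concatenate same-height frames side by side into one wider image.
    Each frame is a list of rows; each row is a list of pixel values."""
    heights = {len(f) for f in frames}
    assert len(heights) == 1, "frames must share a height"
    # Join row y of every frame into one long row of the output image.
    return [sum((f[y] for f in frames), []) for y in range(heights.pop())]

a = [[1, 1], [1, 1]]  # 2x2 "frame" of pixel value 1
b = [[2, 2], [2, 2]]  # 2x2 "frame" of pixel value 2
print(hstack_frames([a, b]))  # [[1, 1, 2, 2], [1, 1, 2, 2]]
```

The model then receives one ordinary still image whose panels happen to be successive moments in time, which is quite different from a model that natively consumes a video stream.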
razodactyl over 1 year ago

The part that really confuses me is the lack of a "*some sequences simulated" disclaimer.
ShamelessC over 1 year ago

Sad state of affairs for Google.
op00to over 1 year ago

Very cool!
jakderrida over 1 year ago

Lmao! So, presumably, they could have hired Greg to improvise almost the exact same demonstration, but with evidence that it works. I don't know how much Greg costs, but I'll bet my ass it's less than the cost in investor sentiment after getting caught committing fraud. Not saying you're cheap. Just cheaper.
frays over 1 year ago

Thanks for sharing!