Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real

434 点作者 gregsadetsky超过 1 年前

20 条评论

phire超过 1 年前

The "magic" of the fake Gemini demo was the way it seemed like the LLM was continually receiving audio + video input and knew when to jump in with a response.It appeared to be able to wait until the user had finished the drawing, or even jumping in slightly before the drawing finished. At one point the LLM was halfway through a response and then saw the user was now colouring the duck in blue, and started talking about how the duck appearing to be blue. The LLM also appeared to know when a response wasn't needed because the user was just agreeing with the LLM.I'm not sure how many people noticed that on a conscious level, but I positive everyone noticed it subconsciously, and felt the interaction was much more natural, and much more advanced than current LLMs.-----------------Checking the source code, the demo takes screenshots of the video feed every 800ms, waits until the user finishes taking and then sends the last three screenshots.While this demo is impressive, it kind of proves just how unnatural it feels to interact with an LLM in this manner when it doesn't have continuous audio-video input. It's been technically possible to do kind of thing for a while, but there is a good reason why nobody tried to present it as a product.

评论 #38597635 未加载

评论 #38598957 未加载

评论 #38597993 未加载

评论 #38597429 未加载

评论 #38599413 未加载

评论 #38597555 未加载

评论 #38599645 未加载

godelski超过 1 年前

I don't get why companies lie like this. How much do they have to gain? It seems like they actually have a lot to lose.What's crazy to me is that these tools are wildly impressive without the hype. As a ML researcher, there's a lot of cool things we've done but at the same time almost everything I see is vastly over hyped from papers to products. I think there's a kinda race to the bottom we've created and it's not helpful to any of us except maybe in the short term. Playing short term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.But then again, with the discussions in this thread[0] maybe there's a lot of people so ethically bankrupt that they don't even know how what they're doing is deceptive. Which is an entirely different and worse problem.[0] <a href="https://news.ycombinator.com/item?id=38559582">https://news.ycombinator.com/item?id=38559582</a>

评论 #38598063 未加载

评论 #38597731 未加载

评论 #38597859 未加载

评论 #38597544 未加载

评论 #38602312 未加载

评论 #38597645 未加载

评论 #38598124 未加载

评论 #38599043 未加载

评论 #38601125 未加载

评论 #38597746 未加载

sheepscreek超过 1 年前

Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.Yes - GPT-4V is a beast. I’d even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (<a href="https://github.com/haotian-liu/LLaVA">https://github.com/haotian-liu/LLaVA</a>). I have been playing with the 7B q5_k variant last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (will have to check the license first or I might only use it for the internal demo to drive a point).

评论 #38597361 未加载

评论 #38600253 未加载

swyx超过 1 年前

haha yes it was entirely possible with gpt4v. literally just screenshot and feed in the images and text in chat format, aka “interleaved”. made something similar at a hackathon recently. (<a href="https://x.com/swyx/status/1722662234680340823" rel="nofollow noreferrer">https://x.com/swyx/status/1722662234680340823</a>). the bizarre thing is that google couldve done what you did, and we wouldve all been appropriately impressed, but instead google chose to make a misleading marketing video for the general public and leave the rest of us frustrated nerds to do the nasty work of having to explain why the technology isnt as seen on tv yet; making it seem somehow our faulti am curious about the running costs of something like this

评论 #38597801 未加载

dingclancy超过 1 年前

I am now convinced that Google Deepmind really had nothing in terms of SOTA LLMs. They were just bluffing.I remember when chatgpt was released Google was saying that they had much much better models that they are not releasing because they for AI Safety. Then theu released palm and palm 2 saying that it is time to release these models to beat ChatGPT. It was not a good model.The they hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model. So this is it.So in one year, we went from Google has to have the best model, they just do not want to release to they have the infrastructure and data and the talent to make the best model. Why they really had was nothing.

iamleppert超过 1 年前

I’ve recently been trying to actually use Google’s AI conversational translation app that was released awhile back and has many updates and iterations since.It’s completely unusable for real conversation. I’m actually in a situation where I could benefit from it and was excited to use it because I remember watching the demo and how natural it looked but was never able to actually try it myself.Now having used it, I went back and watched their original demo and I’m 100% convinced all or part of it was faked. There is just no way this thing ever worked. If they can’t manage to make conversational live translation work (which is a lot more useful than drawing a picture of a duck) I have high doubts about this new AI.Seems like the exact same situation to me. It’s insane to me how much nerve it must take to completely fake something like this.

评论 #38599769 未加载

adtac超过 1 年前

[tangential to this really cool demo] JPEG images being the only possible interface to GPT-4 feels wasteful. the human eye works the delta between "frames", not the image itself. I wonder if the next big step that would allow real-time video processing at high resolutions is to have the model's internal state operate on keyframes and deltas similar to how video codecs like MPEG work.

评论 #38597618 未加载

sibeliuss超过 1 年前

Lol at choosing the name Sagittarius, which is exactly across from Gemini in the Zodiac

评论 #38597634 未加载

zainhoda超过 1 年前

Wow, this is super cool! From the code it seems like the speech to text and text to speech are using the browser’s built-in features. I always forget those capabilities even exist!

razodactyl超过 1 年前

The latency is excusable as this is through the API. Inference on local infrastructure is almost instant so this demo would smoke everything else if this dude had access.

dvaun超过 1 年前

Great demo, I laughed at the final GPT response too.Honestly: it would be fun to self-host some code hooked up to a mic and speakers to let kids, or whoever, play around with GPT4. I’m thinking of doing this on my own under an agency[0] I’m starting up on the side. Seems like a no-brainer as an application.[0]: <a href="https://www.divinatetech.com" rel="nofollow noreferrer">https://www.divinatetech.com</a>

n8fr8too超过 1 年前

I had been working on an idea for an interface "Sorting Hat" system to help kids at schools know whether something was for trash, compost, or recycling. While I had been hacking on it for a bit, Greg's "demo" was much better integrated than what I could do, so thanks Greg!I did add ElevenLabs support to make it a little more snazzy sounding...So, here it is the "Compose/Trash/Recycle Sorting Hat, Built on Sagittarious" <a href="https://github.com/n8fr8/CompostSortingHatAI">https://github.com/n8fr8/CompostSortingHatAI</a>You can see a realtime, unedited YouTube demo video of my kid testing it out here: <a href="https://www.youtube.com/watch?v=-9Ya5rLj64Q" rel="nofollow noreferrer">https://www.youtube.com/watch?v=-9Ya5rLj64Q</a>

dingclancy超过 1 年前

I am now convinced that Google DeepMind really had nothing in terms of state-of-the-art language models (SOTA LLMs). They were just bluffing. I remember when ChatGPT was released; Google was saying that they had much better models they were not releasing due to AI safety. Then they released Palm and Palm 2, saying it's time to beat ChatGPT with these models. However, it was not a good model.They then hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model.Sundar's code red was genuinely alarming because they had to dig deep to make this Gemini model work, and they still ended up with a fake video. Even if Gemini was legitimate, it did not beat GPT-4 by leaps and bounds, and now GPT-5 is on the horizon, putting them a year behind. It makes me question if they had a secret powerful model all along

cylinder714超过 1 年前

Snader's Law: "Any sufficiently advanced technology is indistinguishable from a rigged demo."

iandanforth超过 1 年前

Looks like, again, this doesn't have GPT-4 processing video as much as a stack of video frames, concatenated and sent as a single image. But much closer to real!

评论 #38597529 未加载

评论 #38597344 未加载

评论 #38597479 未加载

评论 #38597450 未加载

评论 #38597329 未加载

razodactyl超过 1 年前

The part that really confuses me is the lack of a "*some sequences simulated" disclaimer.

评论 #38599822 未加载

ShamelessC超过 1 年前

Sad state of affairs for Google.

评论 #38600341 未加载

评论 #38597341 未加载

op00to超过 1 年前

Very cool!

jakderrida超过 1 年前

Lmao! So, presumably, they could have hired Greg to improvise almost the exact same demonstration, but with evidence it works. I don't know how much Greg costs, but I'll bet my ass it's less than the cost in investor sentiment after getting caught committing fraud. Not saying you're cheap. Just cheaper.

frays超过 1 年前

Thanks for sharing!

20 条评论

phire超过 1 年前

评论 #38597635 未加载

评论 #38598957 未加载

评论 #38597993 未加载

评论 #38597429 未加载

评论 #38599413 未加载

评论 #38597555 未加载

评论 #38599645 未加载

godelski超过 1 年前

评论 #38598063 未加载

评论 #38597731 未加载

评论 #38597859 未加载

评论 #38597544 未加载

评论 #38602312 未加载

评论 #38597645 未加载

评论 #38598124 未加载

评论 #38599043 未加载

评论 #38601125 未加载

评论 #38597746 未加载

sheepscreek超过 1 年前

评论 #38597361 未加载

评论 #38600253 未加载

swyx超过 1 年前

评论 #38597801 未加载

dingclancy超过 1 年前

iamleppert超过 1 年前

评论 #38599769 未加载

adtac超过 1 年前

评论 #38597618 未加载

sibeliuss超过 1 年前

Lol at choosing the name Sagittarius, which is exactly across from Gemini in the Zodiac

评论 #38597634 未加载

zainhoda超过 1 年前

Wow, this is super cool! From the code it seems like the speech to text and text to speech are using the browser’s built-in features. I always forget those capabilities even exist!

razodactyl超过 1 年前

The latency is excusable as this is through the API. Inference on local infrastructure is almost instant so this demo would smoke everything else if this dude had access.

dvaun超过 1 年前

n8fr8too超过 1 年前

dingclancy超过 1 年前

cylinder714超过 1 年前

Snader's Law: "Any sufficiently advanced technology is indistinguishable from a rigged demo."

iandanforth超过 1 年前

Looks like, again, this doesn't have GPT-4 processing video as much as a stack of video frames, concatenated and sent as a single image. But much closer to real!

评论 #38597529 未加载

评论 #38597344 未加载

评论 #38597479 未加载

评论 #38597450 未加载

评论 #38597329 未加载

razodactyl超过 1 年前

The part that really confuses me is the lack of a "*some sequences simulated" disclaimer.

评论 #38599822 未加载

ShamelessC超过 1 年前

Sad state of affairs for Google.

评论 #38600341 未加载

评论 #38597341 未加载

op00to超过 1 年前

Very cool!

jakderrida超过 1 年前

frays超过 1 年前

Thanks for sharing!