Just got done chatting with Pi for the first time.<p>I asked some very softball questions about its "creators", aka Inflection AI, and it would not provide <i>any</i> information. It told me it was prevented from discussing sensitive company information. I then used Google/Wikipedia to look up the information I'd asked for and pasted the results into Pi, after which Pi made limited comments on the founders.<p>I then moved on to questioning why this was blocked and why these blocks were not publicly disclosed, given that Inflection AI is a "public benefit corporation".<p>I didn't learn much. Generally speaking Pi "agreed" with me, but I still could not get it to budge.<p>I understand a block on "sensitive" information, but these "hard coded" limits should be publicly disclosed (correct me if I'm wrong), otherwise I won't trust the tool.
I quite liked Pi, to be honest. It was good to use as a journaling tool and to talk through life and issues when I used it. I just feel it's sometimes slightly generic with its answers, but then all the LLMs have been when I've used them for this purpose.<p>It felt like a more personal AI: rather than using it to try to code or solve problems, it worked well as kind of a personal guide to talk through life with.<p>Maybe ChatGPT could be good at this with the right prompt and a custom GPT, but I am currently on the waitlist for ChatGPT Plus and Pi is free, so a bump in performance might still be welcome.
The model hallucinates badly, despite their claims. See the first prompt I tried here: <a href="https://twitter.com/hughes_meister/status/1727400689738162589" rel="nofollow noreferrer">https://twitter.com/hughes_meister/status/172740068973816258...</a>
Is it just me, or do they not mention the size of the model at all? Pretty hard to compare it with other models when we don't know what weight class it's in...
I had a lot of fun chatting with Pi. After some poking around to get it to give up its "system prompt" (it was long when it came back; this was back when prompt injection was a cool thing), it said it was using some conversational frameworks like Grice's principles, etc. I tried to recreate one in the GPT store. I call it Tara. Try it here: <a href="https://chat.openai.com/g/g-mI1QatRrc-tara" rel="nofollow noreferrer">https://chat.openai.com/g/g-mI1QatRrc-tara</a>.
From the press release: "Before Inflection-2 is released on Pi, it will undergo a series of alignment steps to become a helpful and safe personal AI."<p>I wonder how the post-alignment model will perform compared to Claude 2 (which is presumably post-alignment), since those processes tend to cause a bit of a performance hit. We'll have to see if it retains that coveted 2nd-place spot.<p>If they didn't account for this, it seems like an unfair comparison.
I realize part of why Sam Altman feels so important to OpenAI is that he invited everyone to come along. You can build something pretty close to chat.openai.com on the APIs, and their APIs continue to grow as their product grows. I think that's closely related to his leadership.<p>This Inflection announcement feels kind of like your neighbor showing you his cool new Jaguar. Can I even take it for a test drive? Well, no... but he'll take me out for a drive in a while, you betcha. "Yeah, cool model you have there," but I'm eyeing the exit.<p>This is the other side of the "commercial vs mission" argument. Doing commercial activity is the only way to be inclusive. Except open source... but even there it's not a clear call. And writing papers touting your achievements is... kind of narcissistic?
This is just typical of so much work in the field. They pick and choose which models to compare against and on which benchmarks. If this model were truly great, they would be comparing against Claude 2 and GPT-4 across a bunch of different benchmarks. Instead they compare against PaLM 2, which in a lot of tests is a weak model (<a href="https://venturebeat.com/ai/google-bard-fails-to-deliver-on-its-promise-even-after-latest-updates/#:~:text=The%20crux%20of%20the%20problem,content%20it%20has%20been%20fed" rel="nofollow noreferrer">https://venturebeat.com/ai/google-bard-fails-to-deliver-on-i...</a>) and prone to hallucination (<a href="https://github.com/vectara/hallucination-leaderboard">https://github.com/vectara/hallucination-leaderboard</a>).
Regular reminder that most open source LLM benchmarks are not very useful (in the sense that they don't represent day-to-day AI chatbot usage and what users care about). If you haven't looked through the datasets to see what they actually contain, I'd encourage you to do so. [1] I think we're just in a strange suboptimal Schelling point of sorts, where people report their scores on those benchmarks because they think other people care about those sorts of benchmarks, and therefore those benchmarks are the ones that people expect and care about.<p>And to recap their statement about it being second most powerful: it's based on MMLU scores, which IMO is a non-useful comparison. (It also doesn't test against GPT-4-Turbo or Claude 2.1.)<p>What they're saying is that Inflection-2 ranks #2 relative to other models including GPT-4, Claude 2, PaLM 2, Grok-1, and Llama 2 70B, specifically on MMLU scores.<p>This model could be great, but that'll be determined by "do day-to-day users, both free and paying, prefer it over Claude 2 and GPT-4-Turbo?" - not MMLU scores.<p>[1]: <a href="https://huggingface.co/datasets/lukaemon/mmlu/viewer/abstract_algebra/test" rel="nofollow noreferrer">https://huggingface.co/datasets/lukaemon/mmlu/viewer/abstrac...</a>
Their tagline is "the second most capable LLM in the world today." Ok, thanks for the heads up, I'll go use the first most capable LLM while Inflection catches up... Their press release shows this model well behind GPT-4, and there's currently no non-beta API. I'm just not sure who this is for.
First reaction: okay, should we be impressed?<p>June: flashy ML-perf demo with CoreWeave using 22k H100s<p>November: actually trained model using 5k H100s<p>June: we claim Inflection-1 is the best model in its compute class and are preparing a frontier model<p>November: we beat PaLM 2, which everyone else forgot about long ago anyway<p>Inflection got a ton of hype with its $1.3B raise (likely not cash but principally GPU compute credits) earlier this year, but now is starting to look like the next victim of inflated expectations.
"Our mission at Inflection is to create a personal AI for everyone."<p>"By messaging Pi, you are agreeing to our Terms of Service and Privacy Policy."<p>Yeah, no. Any AI that operates in <i>your</i> cloud, where I have to agree to <i>your</i> Terms of Service and Privacy Policy, is not "personal AI," no matter how much you want me to believe otherwise.<p>Modern LLMs can run inference on my own personal computer. Some of them can even do it on a Raspberry Pi (no pun intended). That's "personal AI."<p>So I have to wonder why you insist that I use this in your cloud rather than just downloading an app that works completely offline. Especially if you're gonna call it "personal AI."