Oof, it started strong and then went with the "who created you" question, which shows no understanding of how LLMs work. (They don't know a thing about themselves; they will either regurgitate their system prompt or hallucinate something, usually that they are ChatGPT, since that is the most likely LLM to appear in training data.)
To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it were public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?

On the Aider Polyglot leaderboard it's roughly mid-pack among the leaders, comparable to DeepSeek V3 and Claude 3.5 Sonnet. I ran NoLiMa (NoLi(teral)Ma(tching)), an unsaturated long-context benchmark, on it and was impressed though:

    Model                Base Score   8K Context   16K Context
    Quasar Alpha         >=97.8%      89.2%        85.1%
    GPT-4o               99.3%        89.2%        81.6%
    Llama 3.3 70B        97.3%        72.1%        59.5%
    Gemini 1.5 Pro       92.6%        63.9%        55.5%
    Claude 3.5 Sonnet    87.6%        61.7%        45.7%
    Gemini 1.5 Flash     84.7%        44.4%        35.5%
    GPT-4o mini          84.9%        32.6%        20.6%
    Llama 3.1 8B         76.7%        31.9%        22.6%
It also performs well (slightly better than o1) on the "hard" subset at 16K context, with 62.8%. Latency is quite good as well.

More details: https://old.reddit.com/r/LocalLLaMA/comments/1ju1czn/quasar_alpha_on_nolima_16k_effective_context_best/
The most fun way I've seen users explore its origin is to give it a single period (".") as your first query. Only OpenAI models answer that way, with a smiley at the end, and it's probably a more reliable check than asking about its architecture, because many models will incorrectly answer OpenAI and GPT-4, presumably due to tainted training data: ChatGPT has been so much in the news and became the de facto LLM early on.
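If you want to run the probe yourself, here's a minimal sketch against OpenRouter's OpenAI-compatible API (the environment variable name is an assumption; the model ID is the one Quasar Alpha is listed under):

    # Minimal sketch of the single-period probe via OpenRouter.
    # Assumes OPENROUTER_API_KEY is set in the environment.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter speaks the OpenAI API
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    response = client.chat.completions.create(
        model="openrouter/quasar-alpha",
        messages=[{"role": "user", "content": "."}],  # the single-period probe
    )

    # Compare the reply's style (short greeting plus smiley?) with e.g. gpt-4o's.
    print(response.choices[0].message.content)

Run the same script against a known OpenAI model and eyeball whether the replies match in style.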
The graph in the article about Quasar Alpha's coding skill is taken from my tweet [0]. It shows QA's results on the aider polyglot coding benchmark [1].

QA seems to be a skilled coder, and is very fast.

Aider supports Quasar Alpha as of v0.81, released last week.

[0] https://x.com/paulgauthier/status/1907996176605220995

[1] https://aider.chat/docs/leaderboards/
I ran an interesting benchmark/experiment yesterday, which did not do Quasar Alpha any favors (from best to worst; each score is an average of four runs):

    "google/gemini-2.5-pro-preview-03-25"    => 67.65
    "anthropic/claude-3.7-sonnet:thinking"   => 66.76
    "anthropic/claude-3.7-sonnet"            => 66.23
    "deepseek/deepseek-r1:free"              => 54.38
    "google/gemini-2.0-flash-001"            => 52.03
    "openai/o3-mini"                         => 47.82
    "qwen/qwen2.5-32b-instruct"              => 44.78
    "meta-llama/llama-4-maverick:free"       => 42.87
    "openrouter/quasar-alpha"                => 40.27
    "openai/chatgpt-4o-latest"               => 37.94
    "meta-llama/llama-3.3-70b-instruct:free" => 34.40
The benchmark is a bit specific, but challenging. It's a prompt-optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process happens in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do reinforcement learning on the fly.

Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.

Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took on average 4 retries on the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty, and almost all other retries were the result of a model re-using an existing prompt.

Based on the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweaked the prompts in subtle ways to see what happened, and consistently introspected on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer and it was eventually able to get a prompt to a score of 100.

EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.
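To make the setup concrete, here's a rough sketch of the loop (the model ID matches the table above, but the scoring function, system prompt, and round budget are hypothetical stand-ins, not my actual harness):

    # Hypothetical sketch of the single-conversation prompt-optimization loop.
    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ["OPENROUTER_API_KEY"])

    def score_prompt(prompt: str) -> float:
        # Toy stand-in scorer (0-100); substitute the real evaluator here.
        return max(0.0, 100.0 - abs(len(prompt) - 200) / 4)

    messages = [{"role": "system",
                 "content": "Write a prompt for the task. After each attempt you will "
                            "see its score (0-100). Keep improving on your best attempt."}]

    best_prompt, best_score = None, -1.0
    for _ in range(20):  # fixed round budget
        reply = client.chat.completions.create(
            model="openrouter/quasar-alpha", messages=messages)
        candidate = reply.choices[0].message.content
        score = score_prompt(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        # Feed the attempt and its score back into the same conversation,
        # so the model sees all previous attempts and their scores.
        messages.append({"role": "assistant", "content": candidate})
        messages.append({"role": "user",
                         "content": f"That prompt scored {score:.1f}/100. Try again."})

    print(best_score, best_prompt)

The retry logic, instruction-format checks, and the real scoring step are omitted here; the point is just the single-conversation feedback loop.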
More evidence: it uses fancy Unicode punctuation characters, like the typographic apostrophe.

It's very annoying, and I've only seen this in OpenAI models before (o3-mini).
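An easy way to spot this is to scan a response for typographic punctuation; a small illustrative check (the character list is not exhaustive):

    # Count typographic punctuation that an ASCII-leaning model would rarely emit.
    FANCY = {
        "\u2019": "right single quote (apostrophe)",
        "\u2018": "left single quote",
        "\u201c": "left double quote",
        "\u201d": "right double quote",
        "\u2014": "em dash",
        "\u2013": "en dash",
    }

    def fancy_punctuation(text: str) -> dict[str, int]:
        """Return counts of each typographic character found in the text."""
        return {name: text.count(ch) for ch, name in FANCY.items() if ch in text}

    print(fancy_punctuation("It\u2019s a test \u2014 with fancy punctuation."))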
I also believe the same! I asked a very specific question on both ChatGPT and Quasar, and, well, ChatGPT offered me two models to choose from, and one of them had a VERY similar answer to Quasar's...

https://x.com/vitor_dlucca/status/1908769236744384981
Here's a longer blog post I wrote on the same topic, with new updates daily:

https://prompt.16x.engineer/blog/quasar-alpha-openai-stealth-model