I really like the distinction between DeepSearch and DeepResearch proposed in this piece by Han Xiao: https://jina.ai/news/a-practical-guide-to-implementing-deepsearch-deepresearch/

> DeepSearch runs through an iterative loop of searching, reading, and reasoning until it finds the optimal answer. [...]

> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports

Given these definitions, I think DeepSearch is the more valuable and interesting pattern. It's effectively RAG built using tools in a loop, which is much more likely to answer questions effectively than more traditional RAG where there is only one attempt to find relevant documents to include in a single prompt to an LLM.

DeepResearch is a cosmetic enhancement that wraps the results in a "report" - it looks impressive but IMO is much more likely to lead to inaccurate or misleading results.

More notes here: https://simonwillison.net/2025/Mar/4/deepsearch-deepresearch/
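To make the "RAG with tools in a loop" idea concrete, here's a minimal sketch of the DeepSearch pattern. The call_llm, web_search, and fetch_page functions are hypothetical placeholders for whatever model API and search backend you use - this isn't taken from Jina's implementation:

    # Placeholders: wire these up to your actual LLM and search APIs.
    def call_llm(prompt: str) -> str: ...
    def web_search(query: str) -> str: ...
    def fetch_page(url: str) -> str: ...

    # Minimal DeepSearch-style loop: the model decides whether to search, read, or answer.
    def deep_search(question: str, max_steps: int = 8) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            # Ask the model for its next action, given everything gathered so far.
            action = call_llm(
                "Question: " + question
                + "\nNotes so far:\n" + "\n".join(notes)
                + "\nReply with SEARCH:<query>, READ:<url>, or ANSWER:<final answer>."
            )
            if action.startswith("SEARCH:"):
                notes.append("results: " + web_search(action[len("SEARCH:"):].strip()))
            elif action.startswith("READ:"):
                notes.append("page: " + fetch_page(action[len("READ:"):].strip()))
            else:
                return action.removeprefix("ANSWER:").strip()
        return "Ran out of steps without a confident answer."

The key difference from single-shot RAG is that a bad first search isn't fatal: the model can reformulate and try again before committing to an answer.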
One of my co-workers joked at the time that "sure, AlphaGo beat Lee Sedol at Go, but Lee has a much better self-driving algorithm."

I thought this was funny at the time, but as more time passes I think it highlights the stark gulf between the capability of the most advanced AI systems and what we expect as "normal competency" from the most average person.
I think this captures one of the bigger differences between what OpenAI offers and what others offer under the same name. Funnily enough, Google's Gemini 2.0 Flash also has a native integration with Google Search[1]. They have not done it with their Thinking model; when they do, we will have a good comparison.

One of the implications of OpenAI's DR is that frontier labs are more likely to train specific models for a bunch of tasks, resulting in the kind of quality that wrappers will find hard to replicate. This is leading towards model + post-training RL as a product, instead of keeping them separate from the final wrapper-as-product. Might be interesting times if the trajectory continues.

PS: There is also Genspark MoA[2], which creates an in-depth report on a given prompt using a mixture of agents. From what I have seen in 5-6 generations, this is very effective.

[1]: https://x.com/_philschmid/status/1896569401979081073 (I might be misunderstanding this, but it seems to be a native call rather than an explicit one)

[2]: https://www.genspark.ai/agents?type=moa_deep_research
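For what it's worth, the native search integration in [1] appears to be exposed as a grounding tool in the google-genai Python SDK - something roughly like the sketch below, though I may be getting details of the config wrong:

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Summarise the latest reporting on deep research agents.",
        config=types.GenerateContentConfig(
            # Let the model call Google Search natively, rather than you
            # wiring up an explicit search tool and feeding results back in.
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    print(response.text)

That "native call instead of explicit" distinction is the point: the model decides when to ground itself, instead of the wrapper orchestrating the search.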
Are you telling me that AIs are starting to diverge, and that we might get a combinatorial explosion of reasoning paths producing so many different agents that we won't know which one can actually become AGI?

https://leehanchung.github.io/assets/img/2025-02-26/05-quadrants.png
DR is a nice way to gather information, when it works, so you can then do the real research yourself from a concentrated launching point. It helps me avoid ADD-braining myself into oblivion every time I search the internet.
The fatal mistake is thinking that the LLM is now wiser for having done the research. When someone does their research, they are now marginally more of an authority on that topic than everyone else in the room, all else being equal. But for LLMs, it's not like they have suddenly acquired more expertise on the subject now that they've done this survey. So it's actually pretty shallow, not deep, research.
It's a cool capability, and a nifty way to concentrate information, but much deeper capabilities will be required to have models that not only truly synthesize all that information, but actively apply it to develop a thesis or further a research program.
Truthfully, I don't see how this is possible within the transformer architecture, with its lack of memory or genuine statefulness and therefore absence of persistent real time learning. But I am generally a big transformer skeptic.
> In natural language processing (NLP) terms, this is known as report generation.

I'm happy to see some acknowledgement of the world before LLMs. This is an old problem, and one I (or my team, really) was working on at the time of DALL-E & ChatGPT's explosion. As the article indicated, we deemed 3.5 unacceptable for Q&A almost immediately, as the failure rate was too high for operational reporting in such a demanding industry (legal). We instead employed SQuAD and polished up the output with an LLM.

These new reasoning models that effectively retrofit Q&A capabilities (an extractive task) onto a generative model are impressive, but I can't help but think that it's putting the cart before the horse and will inevitably give diminishing returns in performance. Time will tell, I suppose.
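For anyone unfamiliar with the extract-then-polish approach, here's a rough sketch of what such a pipeline can look like. The model name and the polish_with_llm helper are illustrative placeholders, not our actual stack:

    from transformers import pipeline

    # Extractive QA: the answer must be a span of the source document (SQuAD-style).
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    def polish_with_llm(text: str) -> str:
        # Placeholder: call whatever generative model you use, purely for rewording.
        ...

    def answer(question: str, context: str) -> str:
        span = qa(question=question, context=context)
        if span["score"] < 0.3:
            # Refuse rather than guess - the whole point of the extractive step.
            return "No confident answer found in the document."
        # The generative model only rephrases the extracted span; it never gets
        # an open-ended "answer this question" prompt, so it can't invent facts.
        return polish_with_llm(f"Rewrite as a complete sentence: {span['answer']}")

The attraction is that the failure mode is a refusal rather than a confident fabrication, which matters a lot for legal reporting.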
It's interesting that it says Grok excels at report generation, because I've found myself asking it to give me answers in a table format, to make it easier to 'grok' the output, since I'm usually asking it to give me comparisons I just can't do natively on Amazon or any other e-commerce site.

Funnily enough, Amazon will pick products for you to compare, but the compared items are usually terrible, and you can't just add whatever you want, or choose columns.

With Grok, I'll have it remove columns, add columns, shorten responses, so on and so forth.
As a user, I've found that researching the same topics in OpenAI Deep Research vs Perplexity's Deep Research results in "narrow and deep" vs "shallow and broad".<p>OpenAI tends to have something like 20 high quality sources selected and goes very deep in the specific topic, producing something like 20-50 pages of research in all areas and adjacent areas. It takes a lot to read but is quite good.<p>Perplexity tends to hit something like 60 or more sources, goes fairly shallow, answers some questions in general ways but is excellent at giving you the surface area of the problem space and thoughts about where to go deeper if needed.<p>OpenAI takes a lot longer to complete, perhaps 20x longer. This factors heavily into whether you want a surface-y answer now or a deep answer later.
I went through this journey myself with Deep Search / Research
<a href="https://github.com/btahir/open-deep-research">https://github.com/btahir/open-deep-research</a><p>I think it really comes down to your own workflow. You sometimes want to be more imperative (select the sources yourself to generate a report) and sometimes more declarative (let a DFS/BFS algo go and split a query into subqueries and go down rabbit holes until some depth and then aggregate).<p>Been trying different ways of optimizing the former but I am fascinated by the more end to end flows systems like STORM do.
The primary issues with deep research tools are veracity and accurate source attribution. My issue with tools relying on DeepSeek R1, for example, is the high hallucination rate.
It's amazing how these are the biggest information-organizing platforms on the internet and yet they fail to find different words to describe their products.
This gives STORM a high mark but didn't seem to get great results from GPT Researcher, which is the other open source project that was doing this before DeepResearch became the flavor of the day.

But there are so many ways to configure GPT Researcher for all kinds of budgets, so I wonder if this comparison really pushed the output or just went with defaults and got default midrange results for comparison.
Isn't this the worst possible case for an LLM? The integrity of the product is central to its value, yet the user is by definition unable to verify that integrity.
I noticed that these models very quickly start to underperform regular search like Perplexity Pro 3x. It might be somewhat thorough in how it goes line by line, but it's not very cognizant of good sources - you might ask for academic sources, but if your social media slider is turned on, it will overwhelmingly favour Reddit.

You may repeat instructions multiple times, but it ignores them or fails to understand a solution.
I think "deep research" is a misnomer, possibly deliberate. Research assumes the ability to determine quality, freshness, and veracity of the sources directly from their contents. It also quite often requires that you identify where the authors screwed up, lied, chose deliberately bad baselines, and omitted results in order to make their work "more impactful". You'd need AGI to do that. This is merely search - it will search for sources, summarize them for you, and write a report, but you then have to go in and _verify it's not bullshit_, which can take quite a bit of time and effort, if you care about the quality of the final result, which for big ticket questions you almost always do.