This actually feels like an amazing step in the right direction.<p>If AI can help spot obvious errors in published papers, it can do it as part of the review process. And if it can do it as part of the review process, authors can run it on their own work before submitting. It could massively raise the quality level of a lot of papers.<p>What's important here is that it's part of a process involving the experts themselves -- the authors, the peer reviewers. They can easily dismiss false positives, and, crucially, they get warnings about statistical mistakes or other aspects of the paper that aren't their primary area of expertise but can contain gotchas.
Needs more work.<p>>> Right now, the YesNoError website contains many false positives, says Nick Brown, a researcher in scientific integrity at Linnaeus University. Among 40 papers flagged as having issues, he found 14 false positives (for example, the model stating that a figure referred to in the text did not appear in the paper, when it did). “The vast majority of the problems they’re finding appear to be writing issues,” and a lot of the detections are wrong, he says.<p>>> Brown is wary that the effort will create a flood for the scientific community to clear up, as well as fuss about minor errors such as typos, many of which should be spotted during peer review (both projects largely look at papers in preprint repositories). Unless the technology drastically improves, “this is going to generate huge amounts of work for no obvious benefit”, says Brown. “It strikes me as extraordinarily naive.”
Don't forget that this is driven by present-day AI. Which means people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data. So it should be great for typos, misleading phrasing, and cross-checking facts and diagrams, but I would expect it to do little for manufactured data, plausible but incorrect conclusions, and garden variety bullshit (claiming X because Y, when Y only implies X because you have a reasonable-sounding argument that it ought to).<p>Not all of that is out of reach. Making the AI evaluate a paper in the context of a cluster of related papers might enable spotting some "too good to be true" things.<p>Hey, here's an idea: use AI for mapping out the influence of papers that were later retracted (whether for fraud or error, it doesn't matter). Not just via citation, but have it try to identify the no longer supported conclusions from a retracted paper, and see where they show up in downstream papers. (Cheap "downstream" is when a paper or a paper in a family of papers by the same team ever cited the upstream paper, even in preprints. More expensive downstream is doing it without citations.)
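As a minimal sketch of the "cheap downstream" half of that idea: the code below assumes the Semantic Scholar Graph API for the citation hop and a caller-supplied ask_llm callable standing in for whatever model does the judgment -- both are assumptions for illustration, not anything the article describes.

```python
# Sketch: trace the influence of a retracted paper on downstream work.
# Assumes the Semantic Scholar Graph API; ask_llm is a placeholder callable
# (prompt -> str) for whatever model does the actual judgment.
import requests

S2_API = "https://api.semanticscholar.org/graph/v1/paper"

def citing_papers(paper_id: str, limit: int = 100) -> list[dict]:
    """The 'cheap downstream' set: papers that directly cite the retracted one."""
    resp = requests.get(
        f"{S2_API}/{paper_id}/citations",
        params={"fields": "title,abstract", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return [row["citingPaper"] for row in resp.json().get("data", [])]

def relies_on_retracted_claim(retracted_claim: str, paper: dict, ask_llm) -> bool:
    """Ask the model whether a downstream abstract leans on the retracted conclusion."""
    prompt = (
        "The following conclusion comes from a retracted paper:\n"
        f"{retracted_claim}\n\n"
        "Does the abstract below appear to rely on that conclusion? Answer YES or NO.\n\n"
        f"Title: {paper.get('title')}\n"
        f"Abstract: {paper.get('abstract')}"
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```

The more expensive version would drop the citation requirement and instead search abstracts or full text for the retracted conclusion directly.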
Perhaps our collective memories are too short? Did we forget what curl just went through with AI-confabulated bug reports[1]?<p>[1]: <a href="https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/" rel="nofollow">https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...</a>
Here are 2 examples from the Black Spatula project where we were able to detect major errors:
- <a href="https://github.com/The-Black-Spatula-Project/black-spatula-project/issues/3">https://github.com/The-Black-Spatula-Project/black-spatula-p...</a>
- <a href="https://github.com/The-Black-Spatula-Project/black-spatula-project/issues/10">https://github.com/The-Black-Spatula-Project/black-spatula-p...</a><p>Some things to note: this didn't even require a complex multi-agent pipeline. A single-shot prompt was able to detect these errors.
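For anyone wondering what single-shot prompting looks like in practice, here is a rough sketch; the OpenAI SDK, model name, and prompt wording are illustrative assumptions, not the actual Black Spatula pipeline.

```python
# Illustrative single-shot check, not the actual Black Spatula prompt or model.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY set in the environment.
from openai import OpenAI

def check_paper(paper_text: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Review the following paper for major errors: unit mistakes, "
                "impossible quantities, internal inconsistencies, and arithmetic "
                "that does not add up. For each suspected error, quote the passage "
                "and explain the problem in one line.\n\n" + paper_text
            ),
        }],
    )
    return response.choices[0].message.content
```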
If anyone is not aware of Retraction Watch, their coverage of "tortured phrases" detection was a revelation. And it has exposed some serious flaws, like "vegetative electron microscopy". Some of the offending publications/authors have hundreds of papers.<p><a href="https://retractionwatch.com/2025/02/10/vegetative-electron-microscopy-fingerprint-paper-mill/" rel="nofollow">https://retractionwatch.com/2025/02/10/vegetative-electron-m...</a><p><a href="https://retractionwatch.com/2024/11/11/all-the-red-flags-scientific-reports-retracts-paper-sleuths-called-out-in-open-letter/" rel="nofollow">https://retractionwatch.com/2024/11/11/all-the-red-flags-sci...</a>
I'm extremely skeptical of the value in this. I've already seen hours wasted responding to baseless claims that are lent credence by AI "reviews" of open source codebases. The claims would have happened before, but these text generators know how to hallucinate in the correct verbiage to convince lay people and amateurs, and they are more annoying to deal with.
It’s a nice idea, and I would love to be able to use it for my own company reports (spotting my obvious errors before sending them to my boss’s boss).<p>But the first thing I noticed was the two approaches highlighted - one a small-scale approach that does not publish first but approaches the authors privately - and the other that publishes first, has no human review, and has <i>its own cryptocurrency</i>.<p>I don’t think anything speaks more clearly about the current state of the world and the choices in our political space.
I am using JetBrains' AI to do code analysis (find errors).<p>While it sometimes spots something I missed, it also gives a lot of confident 'advice' that is just wrong or not useful.<p>Current AI tools are still sophisticated search engines. They cannot reason or think.<p>So while I think it could spot some errors in research papers, I am still very sceptical that it is useful as a trusted source.
The role of LLMs in research is an ongoing, well, research topic of interest of mine. I think it's fine so long as 1. a pair of human eyes has validated any of the generated outputs, and 2. the "ownership rule": the human researcher is prepared to defend and own anything the AI model does on their behalf, implying that they have digested and understood it as well as anything else they may have read or produced in the course of conducting their research.
Rule #2 avoids this notion of crypto-plagiarism. If you prompted for a certain output, your thought, in a manner of speaking, was the cause of that output. If you agree with it, you should be able to use it.
In this case, using AI to fact-check is kind of ironic, considering these models' hallucination issues. However, infallibility is the mark of omniscience; it's pretty unreasonable to expect these models to be flawless. They can still play a supplementary role in the review process, a second line of defense for peer reviewers.
The push for AI is about controlling the narrative. By giving AI the editorial review process, it can control the direction of science, media and policy. Effectively controlling the course of human evolution.<p>On the other hand, I'm fully supportive of going through ALL of the rejected scientific papers to look for editorial bias, censorship, propaganda, etc.
Great start, but it will definitely require supervision by experts in the field. I routinely use Claude 3.7 to flag errors in my submissions. Here is a prompt I used yesterday:<p>“This is a paper we are planning to submit to Nature Neuroscience. Please generate a numbered list of significant errors with text tags I can use to find the errors and make corrections.”<p>It gave me a list of 12 errors, of which Claude labeled three as “inconsistencies”, “methods discrepancies”, and “contradictions”. When I requested that Claude reconsider, it said “You are right, I apologize” in each of these three instances.
Nonetheless it was still a big win for me and caught a lot of my Dummheiten.<p>Claude 3.7 running in standard mode does not use its context window very effectively. I suppose I could have demanded that Claude “internally review (wait: think again)” for each serious error it initially thought it had encountered. I’ll try that next time. Exposure of chain of thought would help.
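A rough sketch of that flag-then-recheck loop, using the Anthropic Python SDK; the model ID, token limits, and prompt wording here are assumptions, not the commenter's exact setup.

```python
# Sketch of the flag-then-recheck loop described above, via the Anthropic SDK.
# Model ID, token limits, and prompts are assumptions; ANTHROPIC_API_KEY must be set.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-20250219"  # assumed Claude 3.7 Sonnet model ID

def flag_errors(manuscript: str) -> str:
    """First pass: numbered list of significant errors, each with a text tag."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "This is a paper we are planning to submit. Please generate a "
                "numbered list of significant errors, each with a short text tag "
                "I can use to locate and correct it.\n\n" + manuscript
            ),
        }],
    )
    return msg.content[0].text

def recheck(manuscript: str, flagged_item: str) -> str:
    """Second pass: make the model re-verify one flagged item against the text."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": (
                "Re-examine this flagged issue against the manuscript and state "
                "whether it is a real error or a false positive, citing the exact "
                "passage:\n\n" + flagged_item + "\n\n---\n\n" + manuscript
            ),
        }],
    )
    return msg.content[0].text
```

Forcing a second pass per flagged item is a cheap way to approximate that "internally review" step without access to the chain of thought.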
This sounds way, way outside how LLMs work. They can't count the R's in strarwberrrrrry, but they can cross-reference multiple tables of data? Is there something else going on here?
Reality check: yesnoerror, the only part of the article that actually seems to involve any published AI reviewer comments, is just checking arXiv papers. Their website claims that they "uncover errors, inconsistencies, and flawed methods that human reviewers missed", but arXiv is of course famously NOT a peer-reviewed journal. At best they are finding "errors, inconsistencies, and flawed methods" in papers that human reviewers haven't looked at.<p>Let's then try to see if we can uncover any "errors, inconsistencies, and flawed methods" on their website. The "status" is pure made-up garbage. There's no network traffic related to it that would actually allow it to show a real status. The "RECENT ERROR DETECTIONS" list shows a single paper from today, but the queue you see when you click "submit a paper" lists the last completed paper as the 21st of February. The front page tells us that it found some math issue in a paper titled "Waste tea as absorbent for removal of heavy metal present in contaminated water", but if we navigate to that paper[1] the math error suddenly disappears. Most of the comments are also worthless, talking about minor typographical issues or misspellings that do not matter, but of course they still categorize those as an "error".<p>It's the same garbage as every time with crypto people.<p>[1]: <a href="https://yesnoerror.com/doc/82cd4ea5-4e33-48e1-b517-5ea3e2c5f268" rel="nofollow">https://yesnoerror.com/doc/82cd4ea5-4e33-48e1-b517-5ea3e2c5f...</a>
As a researcher, I say it is a good thing. Provided it gives a small number of errors that are easy to check, it is a no-brainer. I would say it is more valuable for authors, though, as a way to spot obvious issues.
I don't think it will drastically change research, but it is an improvement over a spell check or running Grammarly.
I know academics who use it to make sure their arguments are grounded, after a meaningful draft. This helps them lay out their arguments more clearly, and IMO it is no worse than the companies that used motivated graduate students to review the grammar and coherence of papers written by non-native speakers.
AI tools are hopefully going to eat lots of manual scientific research. This article looks at error spotting, but follow the path of getting better and better at error spotting to its conclusion and you essentially reproduce the work entirely from scratch. So AI study generation is really where this is going.<p>All my work could honestly be done instantaneously with better data harmonization & collection along with better engineering practices. Instead, it requires a lot of manual effort. I remember my professors talking about how they used to calculate linear regressions by hand back in the old days. Hopefully a lot of the data cleaning and study setup that is done now will sound just as quaint to a future set of scientists who use AI tools to run and check these basic programmatic and statistical tasks.
I expect that for truly innovative research, it might flag the innovative parts of the paper as mistakes if they're not fully elaborated upon... e.g. if the author assumed that the reader possesses certain niche knowledge.<p>With software design, I find many mistakes where the AI says things that are incorrect because it parrots common blanket statements and ideologies without actually checking whether the statement applies in this case by reasoning from first principles... Once you take the discussion down to first principles, it quickly acknowledges its mistake, but you had to have this deep insight in order to take it there... Someone who is trying to learn from AI would not get this insight from it; instead they would be taught a dumbed-down, cartoonish, wordcel version of reality.
While I don't doubt that AI tools can spot some errors that would be tedious for humans to look for, they are also responsible for far more errors. That's why proper understanding and application of AI is important.
Recently I used one of the reasoning models to analyze 1,000 functions in a very well-known open source codebase. It flagged 44 problems, which I manually triaged. Of the 44 problems, about half seemed potentially reasonable. I investigated several of these seriously and found one that seemed to have merit and a simple fix. This was, in turn, accepted as a bugfix and committed to all supported releases of $TOOL.<p>All in all, I probably put in 10 hours of work, I found a bug that was about 10 years old, and the open-source community had to deal with only the final, useful report.
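The shape of that workflow is roughly the sketch below; extract_functions and ask_model are placeholders for whatever parser and reasoning model is actually used, and the point is that the flagged list goes through human triage before anything reaches maintainers.

```python
# Sketch of the bulk-review-then-triage workflow described above.
# extract_functions(path) -> [(name, body)] and ask_model(prompt) -> str are
# placeholders for whatever parser and reasoning model you actually use.
def bulk_review(source_files: list[str], extract_functions, ask_model) -> list[dict]:
    """Run the model over every function and keep only the ones it flags."""
    flagged = []
    for path in source_files:
        for name, body in extract_functions(path):
            verdict = ask_model(
                "Does this function contain a likely bug? If yes, explain briefly "
                "and suggest a fix; if not, answer exactly NO_BUG.\n\n" + body
            )
            if "NO_BUG" not in verdict:
                flagged.append({"file": path, "function": name, "report": verdict})
    # Everything in `flagged` still needs human triage before anyone upstream
    # sees it -- the model output itself is never sent to maintainers.
    return flagged
```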
This could easily turn into a witch hunt [0], especially given how problematic certain fields have been, but I can't shake the feeling that it is still an interesting application and, like the top comment said, a step in the right direction.<p>[0] - Imagine a public ranking system for institutions or specific individuals who have been flagged by a system like this, with no verification or human in the loop, just a "shit list"
In the not-so-far future we should have AIs that have read all the papers and other info in a field. They can then review any new paper, as well as answer any questions in the field.<p>This then becomes the first sanity check for any paper author.<p>This should save a lot of time and effort, improve the quality of papers, and root out at least some fraud.<p>Don't worry, many problems will remain :)
I'm no member of the scientific community, but I fear this project or another will go beyond math errors and eventually establish some kind of incontrovertible AI entity giving a go/no-go on papers, ending all science in the process, because publishers will love it.
This is both exciting and a little terrifying. The idea of AI acting as a "first-pass filter" for spotting errors in research papers seems like an obvious win. But the risk of false positives and potential reputational damage is real...
This is going to be amazing for validation and debugging one day. Imagine having the fix PR opened by the system for you, with code to review including a unit test to reproduce/fix the bug that caused the prod exception @.@
Wait, I threw out my black plastic spatula for nothing? So there was all this media noise about it - but not a single article came across my (many) newsfeeds about a mistake or a retraction.
AI tools revolutionizing research by spotting errors is a game-changer for academia. With the ability to detect inconsistencies, plagiarism, and even fabricated data, AI is enhancing the credibility of scientific studies. This is especially crucial in a time when the volume of published research is growing exponentially, making manual fact-checking nearly impossible.<p>However, despite the benefits, AI-driven tools are facing increasing restrictions. Many schools and universities, as well as some governments, have started blocking access to AI services, fearing academic dishonesty or misuse. If you still need access to these tools for legitimate research purposes, proxy services like NodeMaven can help bypass these restrictions, ensuring you stay connected to the latest advancements. I can drop you a link, it helped me a lot while writing my thesis.: <a href="https://nodemaven.com/" rel="nofollow">https://nodemaven.com/</a>
Perhaps this is a naive question from a non-academic, but why isn't deliberately falsifying data, or using AI tools or Photoshop to create images, career-ending?<p>Wouldn't a more direct system be one in which journals refused submissions if one of the authors had committed deliberate fraud in a previous paper?
The low hanging fruit is to target papers cited in corporate media; NYT, WSJ, WPO, BBC, FT, The Economist, etc. Those papers are planted by politically motivated interlocutors and timed to affect political events like elections or appointments.<p>Especially those papers cited or promoted by well-known propagandists like Freedman of NYT, Eric Schmidt of Google or anyone on the take of George Soros' grants.
top two links at this moment are:<p>> AI tools are spotting errors in research papers: inside a growing movement (nature.com)<p>and<p>> Kill your Feeds – Stop letting algorithms dictate what you think (usher.dev)<p>so we shouldn't let the feed algorithms influence our thoughts, but also, AI tools need to tell us when we're wrong
I built this AI tool to spot "bugs" in legal agreements, which is harder than spotting errors in research papers because the law is open-textured and self-contradicting in many places. But no one seems to care about it on HN. Gladly, our early trial customers are really blown away by it.<p>Video demo with human wife narrating it: <a href="https://www.youtube.com/watch?v=346pDfOYx0I" rel="nofollow">https://www.youtube.com/watch?v=346pDfOYx0I</a><p>Cloudflare-fronted live site (hopefully that means it can withstand traffic): <a href="https://labs.sunami.ai/feed" rel="nofollow">https://labs.sunami.ai/feed</a><p>Free Account Prezi Pitch: <a href="https://prezi.com/view/g2CZCqnn56NAKKbyO3P5/" rel="nofollow">https://prezi.com/view/g2CZCqnn56NAKKbyO3P5/</a>