The authors of the study are all from the chemistry department at the University of Kansas. Is this really the sort of paper they should be authoring? https://www.cell.com/cell-reports-physical-science/fulltext/S2666-3864(23)00200-X#%20

The methodology is terrible. The prompting was as simple as: "Can you produce a 300- to 400-word summary on this topic: INSERT TOPIC HERE", where some example topics are:

A surprising fossil vertebrate

Stem cells remember insults

I can't see how that prompt is going to come up with anything comparable to the human text, which is based on Perspectives articles in Science.

They don't report these numbers, but I can work them out from the tables:

Document false positive rate (human assigned as AI): 0%
Document false negative rate (AI assigned as human): 0%
Paragraph false positive rate (human assigned as AI): 14%
Paragraph false negative rate (AI assigned as human): 3%

In summary, though, this is a garbage-tier study, for entertainment only.
> Scientists are more likely to have a richer vocabulary and write longer paragraphs containing more diverse words than machines. They also use punctuation like question marks, brackets, semicolons more frequently than ChatGPT, except for speech marks used for quotations.

> Real science papers also use more equivocal language – like "however", "but", "although" as well as "this" and "because".

You can prompt ChatGPT to write in that style. For example, I have a semi-standard prompt I often use (a sketch of wiring it into an API call follows the prompt):

“Don’t just give a list of information; write in engaging and well-written paragraph form with transitions. (It’s ok to use some bullet points or lists, but the entire piece should not be a list.) Don’t use emojis and don’t overuse any of the following words (or their conjugates or other tenses): ensure, enable, seamless, solution, crucial, critical, vital, invaluable, essential, welcome, game-changer, unleash, or streamline. You can use each once or twice at most. Vary sentence length to improve readability.”
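A minimal sketch of how a prompt like that can be applied programmatically, assuming the openai Python SDK; the model name, the trimmed-down style text, and the topic are placeholders, not anything from the study:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abbreviated version of the style prompt above, passed as a system message.
STYLE_PROMPT = (
    "Write in engaging, well-written paragraph form with transitions. "
    "Don't use emojis and don't overuse words like ensure, enable, seamless, "
    "crucial, critical, vital, or streamline. Vary sentence length."
)

def summarize(topic: str) -> str:
    # Same bare task as in the study, but with the style instruction layered on top.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat model will do
        messages=[
            {"role": "system", "content": STYLE_PROMPT},
            {"role": "user", "content": f"Can you produce a 300- to 400-word summary on this topic: {topic}"},
        ],
    )
    return response.choices[0].message.content

print(summarize("A surprising fossil vertebrate"))
```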
I have a lot of doubts about this research, both methodologically and in principle, after skimming the paper.

1. All this is only on GPT-3.5.

2. "Accuracy" is an ambiguous term here. What we really want is sensitivity (how likely the test is to identify true positives) and specificity (the same for true negatives). Maybe this is buried in the text somewhere, but I couldn't find it. (A quick sketch of the distinction follows the list.)

3. They only did a very narrow niche, namely academic papers. Doing machine learning on a highly selective dataset is definitely easier, and the results may not generalise well.

4. I have a strong feeling human language will evolve towards being harder and harder to distinguish from GPT. That means these results could be highly sensitive to the time period from which the non-GPT writing was taken.
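On point 2, a minimal worked example of why "accuracy" alone is ambiguous; the confusion-matrix counts here are made up, not taken from the paper:

```python
# Made-up confusion matrix for an AI-text detector ("positive" = flagged as AI).
tp = 90   # AI-written, correctly flagged as AI
fn = 10   # AI-written, missed (flagged as human)
tn = 99   # human-written, correctly flagged as human
fp = 1    # human-written, wrongly flagged as AI

accuracy    = (tp + tn) / (tp + fn + tn + fp)
sensitivity = tp / (tp + fn)   # true positive rate; 1 - sensitivity = false negative rate
specificity = tn / (tn + fp)   # true negative rate; 1 - specificity = false positive rate

print(f"accuracy={accuracy:.1%}  sensitivity={sensitivity:.1%}  specificity={specificity:.1%}")
# accuracy=94.5%  sensitivity=90.0%  specificity=99.0%
# A single "94.5% accurate" headline hides the asymmetry between the two error rates.
```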
The title was originally 100%, but the editor felt that was too unbelievable and ratcheted it back to just "greater than 99%". That way it still covers 100 but looks fancier: there's a math equation in the title, so it's gotta be legit!
There is no chance whatsoever that any tool will ever be able to reliably tell the difference between LLM and human content, and I can’t understand how anyone thinks such a thing is possible.

There's no mechanism of action for such a thing. The information would have to be encoded in the text, and it isn't.

Honestly, the best way to think about it is to invoke the infinite-monkeys scenario, since believing this *requires* you to disprove the infinite monkey theorem.

Consider this thought experiment.

1) We will start with a piece of text that your detector is 100% certain was created by a GPT tool.

2) Now, prove that there is no way whatsoever for at least one human being to independently create this text.

If you can’t *prove* that, then your tool is bullshit.

cite: https://en.wikipedia.org/wiki/Infinite_monkey_theorem
AI to detect AI. I’m skeptical that AI will lead to the end of the world, but who knows. At this rate, the AIs will go to war on each other and we’ll just be casualties of the crossfire.
I honestly hope they don’t crack this problem, since it’s created a lovely existential crisis for term papers. It’s forcing long-overdue innovation in how we assess knowledge.
The trouble with percentages like this is that the last <1% is the hardest to achieve but provides the most value. If you're a teacher looking to prevent cheating by students, you can't take the chance of falsely accusing one student out of 100.
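To put rough numbers on that (illustrative figures, not from the study):

```python
# Illustrative only: how often a class gets at least one false accusation,
# assuming each essay is checked independently against the detector.
fpr = 0.01        # assumed 1% false positive rate (human work flagged as AI)
students = 100

expected_false_accusations = fpr * students
p_at_least_one = 1 - (1 - fpr) ** students

print(f"expected false accusations per class: {expected_false_accusations:.1f}")
print(f"chance of at least one false accusation: {p_at_least_one:.0%}")
# expected false accusations per class: 1.0
# chance of at least one false accusation: 63%
```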
All these claims of ChatGPT content detection assume that the person using ChatGPT is just taking the raw output and not doing any editing or verification. We are treading on Ship of Theseus grounds here, but how much has to change before it no longer matters?

For example:

> "One of the biggest problems is that it assembles text from many sources and there isn't any kind of accuracy check – it's kind of like the game Two Truths and a Lie."

That's not a ChatGPT problem. It's an accuracy problem. If the output is edited to fix the accuracy, then it's no longer an issue.

Yes, blindly using the output can be an issue, but the issue is accuracy, not the method used to generate it. If ChatGPT output were 100% accurate, what would be the next problem?
Surely what matters more than detecting generated content is verifying that the information is true and consistent, and limiting the quantity and length of submissions per verified user of a system.

That takes care of the quality of the submission. The other part is determining attribution. Why not just ask the human submitter to defend their work in a controlled environment? If they can, it's their paper now.

I don't see a difference between unassisted people turning in crap vs assisted people still turning in crap.
Look at the features used in their classifier:

https://www.cell.com/action/showFullTableHTML?isHtml=true&tableId=tbl1&pii=S2666-3864%2823%2900200-X

20 features, most of them just checking whether a single character or word appears (a rough sketch of that kind of feature follows below). It seems pretty dumb, tbh.

I'm also curious how they could possibly justify a single feature of
> contains “others” or “researchers”
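For context, here's a rough reconstruction of what presence-style features of that kind look like in code; the feature names are illustrative guesses based on the descriptions in the article and table, not the authors' exact definitions:

```python
# Illustrative sketch of simple "does the text contain X" features,
# not the paper's actual feature extractor.
def extract_features(paragraph: str) -> dict:
    words = paragraph.split()
    lowered = paragraph.lower()
    return {
        "contains_question_mark": "?" in paragraph,
        "contains_semicolon": ";" in paragraph,
        "contains_parenthesis": "(" in paragraph,
        "contains_however": "however" in lowered,
        "contains_others_or_researchers": "others" in lowered or "researchers" in lowered,
        "paragraph_length_words": len(words),
        "unique_word_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

print(extract_features("Researchers, however, disagree; why is that?"))
```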
What if I take ChatGPT output and partially rewrite it in my own words, or add or mix in some original content? What result does this identification tool return?
The chance that LLMs won't be able to produce text in a certain style is extremely low.

What is true is that by default ChatGPT doesn't produce the same style as academic papers, which seems unsurprising. You could also compare it to a 15-year-old writing text messages and conclude that ChatGPT content is identifiable by that metric, which, again, seems unsurprising.
This is another reason to use OSS LLMs: the barrier to entry is much higher, so they're not targeted for countermeasures and they circumvent human pattern-matching.

You're less likely to be "found out", especially with there being LOTS of fine-tunes and even a couple of foundation models.

Though I don't think this study is accurate, people are trying, and they're only targeting "Open"AI's products.
Me to ChatGPT:

> Here is some text. Please rewrite it as follows: 1. DO NOT change the content more than the bare minimum needed to comply with the instructions. 2. Rewrite to add at least one question mark, and one semicolon. Add a "however", "but", "although" and "because", and rewrite the sentences so it flows naturally and without changing the meaning. Here is the paragraph you should rewrite:
>
> They believe their classifier is effective, because it homes in on a range of stylistic differences between human and AI writing. Scientists are more likely to have a richer vocabulary and write longer paragraphs containing more diverse words than machines. They also use punctuation like question marks, brackets, semicolons more frequently than ChatGPT, except for speech marks used for quotations.

ChatGPT output (GPT-4):

> Although they have confidence in their classifier's effectiveness due to its ability to identify a variety of stylistic differences between human and AI writing, a question emerges; why is this believed to be the case? It's because scientists are generally more likely to utilize a richer vocabulary and construct longer paragraphs filled with a diverse set of words, but can this solely account for the difference?
>
> They also employ punctuation, such as question marks, brackets, and semicolons, with a higher frequency compared to ChatGPT; however, an exception is the use of speech marks for quotations.

Of course their detector might well handle that too, but GPT-4 is perfectly capable of writing text with widely varying styles and vocabulary, and if their detection relies on that, odds are people will prompt their way around it quite easily before even looking at more advanced approaches like fine-tuning on scientific papers within the right field.
Do people really not understand that they're just building a GAN (generative adversarial network), and that the endgame of the arms race is that not even detector machines will be able to tell the difference within a margin of statistical error? Unless that is the goal, this just helps the machines in the long term.
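A toy numeric illustration of that arms-race dynamic (purely illustrative, unrelated to the paper's classifier): a "generator" keeps shrinking the stylistic tell a threshold "detector" relies on, and detector accuracy drifts back toward a coin flip:

```python
import random

random.seed(0)

def detector_accuracy(threshold: float, gen_mean: float, n: int = 10_000) -> float:
    """Detector labels a sample 'generated' when its style score exceeds the threshold."""
    human = [random.gauss(0.0, 1.0) for _ in range(n)]         # human style scores
    machine = [random.gauss(gen_mean, 1.0) for _ in range(n)]  # generator's style scores
    correct = sum(x <= threshold for x in human) + sum(x > threshold for x in machine)
    return correct / (2 * n)

gen_mean = 2.0  # the generator starts with an obvious stylistic tell
for round_no in range(5):
    threshold = gen_mean / 2  # detector picks the best split between the two means
    acc = detector_accuracy(threshold, gen_mean)
    print(f"round {round_no}: generator mean={gen_mean:.2f}, detector accuracy={acc:.2f}")
    gen_mean *= 0.3           # the generator adapts, closing the gap with human text

# Accuracy falls from roughly 0.84 toward 0.50 as the distributions converge.
```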
The real danger of this kind of work is that you'll have a shoddy product that misclassifies student and academic writing, and people will get tagged as cheaters when they aren't.
It only works for a highly specific type of article that's found in scientific journals ('perspectives' articles), and at the paragraph level it was closer to 92% correct.
It sounds like they didn't do any follow-up prompts. The first response from ChatGPT is usually inadequate, but with some follow-ups you can get a much better one.
What's the false positive rate?

It's already difficult enough to not be accused of being a bot instead of a human.

I do not want the future to be "rule by AI".