ChatGPT outperforms crowd-workers for text-annotation tasks

240 points by georgehill, about 2 years ago

17 comments

YeGoblynQueenne, about 2 years ago
Suppose you have two classifiers, A and B, and some un-annotated data, D. You want to know how good classifier B is at annotating the data, compared to classifier A.

One problem is that you don't have the ground truth for D. So you start by annotating D with the labels assigned by a third classifier, C:

    C(D) → D₁

Having thus established a modicum of "ground truth", ish, you proceed to annotate D with the two classifiers you are comparing, A and B:

    A(D) → D₂
    B(D) → D₃

Then you compare the classifications D₂ and D₃ to D₁, and find that D₃ better approximates D₁ than D₂.

What is the result of the experiment? I summarise it as follows:

    Classifier B better approximates the labelling of D by classifier C than classifier A does.

Now we can name the three classifiers as they were used in the experiments in the linked article:

A: Human annotators.

B: ChatGPT.

C: Human annotators.

So the result of the paper is that, plugging in the names:

    ChatGPT better approximates the labelling of D by human annotators than human annotators do.

And that is the finding of the paper.

Which is clearly absurd and a cause to re-think methodology.
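A minimal sketch of the comparison described above, with invented labels (none of this data is from the paper): the only quantity computed is agreement with C's labels, so the "winner" is simply whichever classifier best approximates C.

    # Hypothetical illustration of the evaluation setup described above.
    # The labels are made up; the point is that the metric only measures
    # agreement with classifier C, not correctness.

    def agreement(labels_x, labels_y):
        """Fraction of items on which two annotations agree."""
        return sum(x == y for x, y in zip(labels_x, labels_y)) / len(labels_x)

    # D1: labels from classifier C (trained annotators), the proxy "ground truth"
    d1_c = ["pos", "neg", "neu", "pos", "neg"]
    # D2: labels from classifier A (crowd-workers)
    d2_a = ["pos", "neu", "neu", "neg", "neg"]
    # D3: labels from classifier B (ChatGPT)
    d3_b = ["pos", "neg", "neu", "pos", "neu"]

    print(agreement(d2_a, d1_c))  # A's agreement with C
    print(agreement(d3_b, d1_c))  # B's agreement with C
    # Whichever score is higher only says which classifier better
    # approximates C's labelling of D, not which one is "correct".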
PostOnce, about 2 years ago
Before you ask, because I was curious, from the paper:

"For MTurk, we aimed to select the best available crowd-workers, notably by filtering for workers who are classified as 'MTurk Masters' by Amazon, who have an approval rate of over 90%, and who are located in the US."
Imnimo, about 2 years ago
My main takeaway here is that Turkers are terrible at some of these tasks. The "stance" task is, "Classify the tweet as having a positive stance towards Section 230, a negative stance, or a neutral stance.", and the Turkers' accuracy was around 20%.

Even in its best task, ChatGPT only got 75% accuracy.
Tepix, about 2 years ago
It's weird, I would have expected the MTurk workers' annotations to be of equal quality, because they have outsourced their tasks to ChatGPT...
danpalmer, about 2 years ago
At my previous job we had a human review stage in a data pipeline. 5-10 people at an outsourcing company in Bangladesh would review things via a simple web interface we provided for them. There were ~10 factors they were reviewing, all fixed options (no free text), but varying from 5 to 500 options per factor. It was all based on a few text fields and around 5 images.

On the surface of it, I'd expect ChatGPT to do very well at this. It's simple text and images, not many options, and theoretically very limited context.

However, the more I think about it the less sure I am. Firstly, these weren't crowd-sourced reviews, they were *trained* reviewers, paid hourly rather than per review. Incentives were definitely in favour of the long-term business relationship. Then there was the training doc: we maintained a vast disambiguation doc used to resolve things that were vague or could be interpreted multiple ways, and this was constantly being revised. All necessary context should have been in it, but it wasn't, and reviewers definitely found patterns that worked and didn't. Lastly, the reviewers were in a Slack channel where they would ask questions of their manager on our side, and while this might have covered only ~1% of tasks, it was an important process.

So maybe you could point ChatGPT at it and let it run, but the oversight process we had would still be necessary. The disambiguation doc would have been too long for ChatGPT's context at the moment, but that will likely change in the near future. Would the workflow be to keep tweaking the prompt to add special case after special case? How do you scale "do this, but not that, but add this, but..." in prompting, and would ChatGPT become as confused as a human after enough of that? I expect so, given that it's only a language model and that's not effective communication.
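A rough sketch of the "special case after special case" prompting workflow the comment is asking about. The factor, options, and disambiguation rules below are entirely invented for illustration; this is not the commenter's actual pipeline or interface.

    # Hypothetical: packing a fixed-option review factor plus an ever-growing
    # list of disambiguation rules into a single classification prompt.

    def build_prompt(item_text, factor, options, disambiguation_rules):
        """Assemble a fixed-option classification prompt with appended rules."""
        rules = "\n".join(f"- {rule}" for rule in disambiguation_rules)
        opts = ", ".join(options)
        return (
            f"Classify the item below for the factor '{factor}'.\n"
            f"Choose exactly one of: {opts}.\n"
            f"Apply these disambiguation rules:\n{rules}\n\n"
            f"Item:\n{item_text}\n\n"
            "Answer with the option name only."
        )

    prompt = build_prompt(
        item_text="Red cotton t-shirt, size M, minor stain on one sleeve.",
        factor="condition",
        options=["new", "like new", "used", "damaged"],
        disambiguation_rules=[
            "A stain smaller than a coin counts as 'used', not 'damaged'.",
            "Unworn items with tags are 'new' even if the packaging is missing.",
        ],
    )
    print(prompt)
    # Every newly discovered edge case becomes another rule line, so the
    # prompt grows without bound; that is exactly the scaling worry above.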
winrid, about 2 years ago
It does seem to work pretty well. I'm using it to analyze all US Congress bills:

https://govscent.org/bill/USA/118hres190ih

It extracts the topics and determines how on-topic the bill is. Soon we're adding a topic browser, and the homepage will have some fun stats :) It's all free.
jacknews, about 2 years ago
So is this 'AI trains AI better than people can'?

And presumably the better-trained AI will also be better again at training.

I think I've seen this movie.
bumbledraven, about 2 years ago
> We used the ChatGPT API with the 'gpt-3.5-turbo' version to classify the tweets.
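For context, a minimal sketch of what such a classification call could look like with the pre-1.0 openai Python client. The prompt wording is an assumption rather than the paper's exact prompt; the stance labels and the 0.2 temperature value are taken from other comments in this thread, not from my own reading of the paper.

    # Hypothetical sketch of classifying a tweet's stance with gpt-3.5-turbo.
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def classify_tweet(tweet):
        prompt = (
            "Classify the stance of the following tweet towards Section 230 "
            "as one of: positive, negative, neutral. Answer with one word.\n\n"
            f"Tweet: {tweet}"
        )
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,  # one of the two temperatures reportedly tried
        )
        return resp["choices"][0]["message"]["content"].strip().lower()

    print(classify_tweet("Section 230 is the only thing keeping small forums alive."))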
sandGorgon, about 2 years ago
Curious here: OpenAI talks a LOT about how RLHF (Reinforcement Learning from Human Feedback) is core to how GPT is tuned, including safety.

Are we getting to the point where GPT will be tuned by GPT, without the need for HF?
tasubotadas, about 2 years ago
Now you won't have to worry that those people are not paid enough.
dougb5, about 2 years ago
"ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks." -- wait, what? However good ChatGPT is at approximating trained annotators for these Twitter tasks, it's an algorithm, so the level of simulated "inter-annotator agreement" is in the authors' control (in the case of GPT, via the temperature parameter, for which they try just two values, 0.2 and 1).

And why does this paper not make any effort to describe the wide range of annotation tasks for which this kind of simulated annotation is not a good idea -- for example, where you care about the subjective opinions of specific subgroups of people at specific times? And even for the tasks they mention, what about the risks of reinforcing biases by using a model's output to train new models? Good grief, this paper is lazy!
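A small sketch of the intercoder-agreement point above, with invented label sequences (not the paper's data): agreement between two runs of the same model largely tracks the sampling temperature rather than annotation quality.

    # Hypothetical: percent agreement between two annotation runs of one model.

    def percent_agreement(run_a, run_b):
        """Share of items labelled identically by two annotation runs."""
        return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)

    # Two invented low-temperature runs: near-deterministic, so agreement is high.
    gpt_run1_t02 = ["neg", "neg", "pos", "neu", "pos", "neg"]
    gpt_run2_t02 = ["neg", "neg", "pos", "neu", "pos", "neg"]

    # Two invented high-temperature runs: more sampling noise, lower agreement.
    gpt_run1_t10 = ["neg", "pos", "pos", "neu", "neg", "neg"]
    gpt_run2_t10 = ["neu", "neg", "pos", "pos", "neg", "neg"]

    print(percent_agreement(gpt_run1_t02, gpt_run2_t02))  # 1.0
    print(percent_agreement(gpt_run1_t10, gpt_run2_t10))  # 0.5
    # A high "intercoder agreement" here says more about the decoding settings
    # than about the quality of the annotations, which is the objection above.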
whatever1, about 2 years ago
On average, no doubt ChatGPT will be great for annotation.

But annotation is mostly needed at the boundaries, at the very edge cases where it is not clear if something is a dog or a shape that looks like a dog.

I really doubt that a generic model like ChatGPT can really help in these tail cases.
visarga, about 2 years ago
In my experience GPT labeling is good, but it makes errors, about 10% in my case. Maybe it's better on average than a human, but not perfect for the task. I was doing open-ended schema matching, a hard task because of the vast number of field names.
alphabetatheta, about 2 years ago
I wonder how much we can extend this to image or video labelling tasks? This could really jeopardize the business moats of a lot of data labelling startups and services.
joshxyz, about 2 years ago
Oh dear, I am foreseeing prompt-writing mechanical Turks in the future, where their instance of AI has plugins tailored for their industry, task, and datasets.
karmasimida, about 2 years ago
It should come as no surprise to anyone. Text classification is an easy task.

ChatGPT is definitely overqualified for this task.
sorz, about 2 years ago
Is NLP a solved problem now?