AHAHAHAHAHA! There is approximately a 0% chance that the big companies paying for data annotation will be far-sighted enough to avoid LLM-automated labeling of their data, for several reasons:

1) It will work well at first, and only become low-quality after they (and their budgets) have grown accustomed to paying 1/20th as much for the service.

2) Even if they pay for "human" labeling, they will go with the lowest bid, from a far-away country, which will subcontract to an LLM service without telling them.

3) "Hey, we should pay more for this input in order to avoid not-yet-seen quality problems in the future" has practically never won an argument in any large corporation, ever. I won't say absolutely zero times, but pretty close.

Long story short, the use of LLMs by Big Tech may be doomed. Much like how SEO quickly degenerates into clickbait and link farms without urgent, high-priority efforts to combat it, LLMs (and other trendy forms of AI that require lots of labeled input) will quickly turn sour and produce even less impressive results than they already do.

The current wave of "AI" hype looks set to succeed about as well as IBM Watson.
At work we were facing this dilemma. Our team is working on a model to detect fraud/scam messages; in production it needs to label ~500k messages a day at low cost. We wanted to train a basic GBT/BERT model to run locally, but we considered using GPT-4 as a label source instead of our usual human labelers.

For us, human labeling is surprisingly cheap. The main advantage of GPT-4 would be that it is much faster: since scams are always changing, we could generate new labels regularly and continuously retrain our model.

In the end we didn't go down that route, for several reasons:

- GPT-4's accuracy wasn't as good as our human labelers'. I believe this is because scam messages are intentionally tricky and require a much more general understanding of the world than the datasets used in this article, which feature simpler labeling problems. I also don't trust that there was no funny business in generating the results for this blog post, since there is a clear conflict of interest with the business that owns it.

- GPT-4 would be consistently fooled by certain types of scams, whereas human annotators work off a consensus procedure. This could probably be solved in the future once there's a larger pool of high-quality LLMs available and we can pool them for consensus (roughly the majority-vote scheme sketched below).

- Concern that some PII would accidentally be sent to OpenAI; of course nobody trusts that those guys will treat our customers' data with any appropriate level of ethics.
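For anyone curious, the consensus idea was nothing fancier than majority voting across a few different models, with disagreements kicked back to humans. A minimal sketch, not anything we actually shipped; the model names, the stubbed API call, and the 2-vote threshold are all placeholders:

    # Majority-vote consensus over a pool of LLM labelers, mirroring the
    # consensus step we already use with human annotators.
    from collections import Counter

    def label_with_llm(model_name: str, message: str) -> str:
        """Stub: call whichever LLM API you use and map its answer to 'scam' or 'legit'."""
        raise NotImplementedError

    def consensus_label(message: str, models=("model-a", "model-b", "model-c")) -> str:
        votes = [label_with_llm(m, message) for m in models]
        winner, count = Counter(votes).most_common(1)[0]
        # If the pool can't agree, punt to a human -- this covers the case where
        # a single model gets consistently fooled by one family of scams.
        return winner if count >= 2 else "needs_human_review"

The point is the fallback: a single model that's reliably fooled by one scam type just loses the vote, the same way one confused human annotator does in our current process.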
This will probably work as long as the material being annotated is similar to the material the LLM was trained on. When it encounters novel data (arguably where the real value is), it will likely perform poorly.
I don't have experience with text/NLP problems, but some degree of automation/assistance in labeling is a fairly common practice in computer vision. If you have a task where the ML model gets you 90% of the way there, you can use its output as a starting point and have a human fix the remaining 10%. (Of course, this should be done in a way that keeps the overall effort lower than labeling from scratch, which is partly a UI problem.) If your model is so good that it completely outperforms humans (at least for now, before data drift kicks in), then that's a good problem to have, assuming your model evaluation is sane.
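To make the workflow concrete, the usual shape is: run the model, auto-accept the confident predictions, and queue the rest for a human to correct. A rough sketch; the predict() interface and the 0.9 threshold are made up for illustration:

    # Model-assisted labeling: accept confident predictions, queue the rest
    # for human correction.
    def prelabel(images, model, conf_threshold=0.9):
        auto_labeled, needs_review = [], []
        for img in images:
            preds = model.predict(img)  # e.g. [(class_id, bbox, confidence), ...]
            if preds and all(conf >= conf_threshold for _, _, conf in preds):
                auto_labeled.append((img, preds))   # accepted as-is
            else:
                needs_review.append((img, preds))   # the ~10% a human fixes
        return auto_labeled, needs_review

The threshold is where the effort trade-off lives: set it too low and reviewers spend more time deleting bad pre-labels than they would have spent labeling from scratch.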