
30% of Google's Emotions Dataset Is Mislabeled

334 points, by echen, almost 3 years ago

27 comments

nmfisher (almost 3 years ago)
Anyone who's dealt with any kind of human-annotated dataset will be familiar with these kinds of errors. It's hard enough to get good, clean labels from motivated, native-English-speaking annotators. Farm it out to low-paid non-native speakers, and these kinds of issues are inevitable.

Annotation isn't a low-skill/low-cost exercise. It needs serious commitment and attention to detail, and ideally it's not something you outsource (or if you do, you need an additional in-house validation pipeline to identify dirty labels).
kortex (almost 3 years ago)
> Hi dying, I'm dad! – mislabeled as NEUTRAL, likely because labelers don't understand dad jokes

In their defense, what could be more True Neutral alignment than dad jokes? Nothing to gain but the quiet enjoyment of making the room groan and roll their eyes.

Really, though, the issue here is context, but also the complexity of human communication. The sensitivity and tone depend heavily on the situation. Clearly the preceding moment is someone stating "I'm dying". But that itself is contextual. Are they literally facing mortality, merely inconvenienced and being hyperbolic, or laughing? If the former, is "Hi Dying, I'm Dad" being glib, to soften the blow of a dire confession, or being highly insensitive and poking fun in a serious moment? Is it in the context of a longer joke, which subverts the meanings yet again?

A lot of these comments are worse than useless without context. Reddit really likes improv-banter-style humor in comment chains. One comment builds on another builds on another, all referencing in-jokes, and usually slathered in sarcasm.

Honestly, Reddit comments are probably one of the worst sources to build a sentiment model from, from an engineering perspective.
NickRandom (almost 3 years ago)
I was born and bred in the land of the Bard, and yet I *also* mislabelled roughly the same 30% that they did. In my opinion that was mostly caused by the lack of context (e.g., 'Traps').

As an example of the above, I assumed the 'traps' one meant "his mouth is so big that it shuts out the sun" (i.e., an insult), since to 'shut your trap' means to shut your mouth / stop talking. Once there was a body-building context, I worked out that it was a reference to a person's trapezius muscles, and therefore the sentiment was (most likely) positive rather than the negative/confrontational/sarcastic label that I would first have assigned it.

There are similar examples, but that gives a rough idea of why context is important for sentiment dataset labelling.

But: #1, in a Mechanical Turk setup, who has time to scan through paragraphs? And #2, how far back do you go to get the full picture?

I don't think you can, so why not do it by hiring a temp for an in-house two-week gig? Cheaper, and you can directly monitor their performance. Win-win.
Beltalowda (almost 3 years ago)
Let's say you can label 2 comments a minute; you'd have to spend 3,625 work-hours to label the comments, or roughly twenty person-months of full-time work. How much money did they save by using cheaper labour from India? Basically bugger all, and the money is wasted, too. Penny wise, pound foolish.
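For reference, a minimal sketch of that back-of-the-envelope arithmetic. The 435,000-judgment figure is an assumption chosen because it reproduces the 3,625-hour number above; the dataset's true judgment count may differ:

```python
# Back-of-the-envelope labeling-cost estimate.
# ASSUMPTION: ~435,000 individual rating judgments (chosen to reproduce
# the 3,625-hour figure in the comment; the true count may differ).
judgments = 435_000
labels_per_minute = 2

work_hours = judgments / labels_per_minute / 60
person_months = work_hours / (40 * 4.33)  # ~173 work-hours per month

print(f"{work_hours:,.0f} work-hours ≈ {person_months:.0f} person-months")
# -> 3,625 work-hours ≈ 21 person-months
```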
malikolivier (almost 3 years ago)
I assumed I was quite fluent in English, even in slang, having seen a fair share of both American and British movies.

Now that I see the examples given, I think I would have mislabeled most of them too, even if I were highly motivated to label them.

Though it's normal for any language, it's very interesting how variable English slang is across dialects and time periods. There are so many regional slangs whose nuances I cannot fully understand.

A few examples from this dataset that I would not have labeled correctly:

- daaaaaamn girl! – mislabeled as ANGER

- [NAME] wept. – mislabeled as SADNESS

- [NAME] is bae, how dare you. – mislabeled as ANGER

And don't get me started on Australian/NZ slang. It's a completely different world.
nostrademons (almost 3 years ago)
Reminds me of the stat I heard that *humans* are only 70% accurate at sentiment analysis, because different people will not agree on the appropriate sentiment label. That sets a theoretical limit on the effectiveness of machine-learning algorithms: if humans can't agree, then any product that needs to take an opinion is going to be wrong 30% of the time. (This is probably also why Big Tech companies are leaning so heavily into personalization.)

Also reminds me of when I asked a veteran therapist what the most surprising part of his job was. It was:

1.) The variety of ways that different people perceive a given situation, and just how much neurodiversity is out there.

2.) How everybody *expects* that everyone else will see the situation exactly the same way they do.
scottlawson (almost 3 years ago)
This is a genuinely great read. The author does a great job providing examples where context is critical, and explains that the dataset not only has labeling errors, but an even deeper problem in how it models language in general.

Since words only have meaning within a context, your model should reflect that somehow.

What wasn't really explored in this article was to what quantitative degree context sensitivity matters. The counterexamples are great, but how can we measure the relationship between the amount of context and labelling accuracy? (See the sketch below for one crude approach.)
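One crude way to start measuring that relationship: score a classifier on the same comments with and without preceding thread context. This is only a sketch; the model name is a real off-the-shelf emotion classifier, but the toy examples and gold labels are made up for illustration:

```python
# Sketch: labeling accuracy as a function of how much preceding context
# the classifier sees. Examples and gold labels are illustrative toy data.
from transformers import pipeline

clf = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base")

# (preceding thread, comment, gold label)
examples = [
    (["I just finished my first marathon!"], "daaaaaamn girl!", "joy"),
    (["My dog died this morning."], "F", "sadness"),
]

for n_context in (0, 1):  # how many preceding comments to prepend
    correct = 0
    for thread, comment, gold in examples:
        prefix = thread[-n_context:] if n_context else []
        text = " ".join(prefix + [comment])
        pred = clf(text)[0]["label"]
        correct += (pred == gold)
    print(f"context={n_context}: accuracy={correct / len(examples):.2f}")
```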
alx__ (almost 3 years ago)
Language is hard! Even I, a seasoned native internet dork, have trouble knowing whether someone's comment is sarcasm, irony, or something in between. Also, new phrases emerge all the time that turn an old phrase on its head and carry a different emotion.

How many feelings can you evoke with a simple "FUCK!"?
raverbashing (almost 3 years ago)
> Let's look at the labeling methodology described in the paper. To quote Section 3.3:

> "Reddit comments were presented [to labelers] with no additional metadata (such as the author or subreddit)."

> "All raters are native English speakers from India."

This does not look good even on paper. No wonder the errors were abundant.

Also, a labeling system that has no entry for sarcasm is totally going to work, guys!!1 /s
stared (almost 3 years ago)
Strange – GPT-3 works well. I wrote a prompt:

Write an emotion that is expressed in a given image label.

Label: "[label]"

Emotion: [filled by GPT-3]

Then, for "you almost blew my fucking mind there." -> "Surprise"; for "hell yeah my brother" -> "Pride"; for "Nobody has the money to. What a joke" -> "Anger". Though, to be fair, for "Yay, cold McDonald's. My favorite." it was "Happiness". Still better than the crowdsourced human baseline.
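That prompt is easy to reproduce against the legacy OpenAI completions API of that era. A minimal sketch; the model name and decoding parameters are assumptions, not the commenter's exact setup:

```python
# Sketch: zero-shot emotion labeling with a GPT-3-style completion prompt.
# Model name and parameters are assumptions, not the commenter's exact setup.
import openai

def label_emotion(text: str) -> str:
    prompt = (
        "Write an emotion that is expressed in a given text.\n\n"
        f'Text: "{text}"\n'
        "Emotion:"
    )
    resp = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=5,
        temperature=0,  # deterministic single-word answers
    )
    return resp.choices[0].text.strip()

print(label_emotion("you almost blew my fucking mind there."))  # e.g. "Surprise"
```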
throwuxiytayq (almost 3 years ago)
Wow, this explains *a lot*. I wonder if they're as inept when it comes to their search tech. The search-result quality these days certainly speaks volumes.
COMMENT___ (almost 3 years ago)
I have three questions now:

* How much (per comment) are these "native speakers from India" paid?

* How many comments do they have to label in an hour (or in a minute)? I guess it's more than 2 comments a minute.

* What if the comment is sarcastic and this can only be understood from its context?
EarthLaunch (almost 3 years ago)
> LETS FUCKING GOOOOO

Could be either anger (let's fight) or enthusiasm (let's do it!). Hard problem.
bryanrasmussen (almost 3 years ago)
OK, but now this person has done a bunch of unpaid work for them, just to publish an article, and now they can write some easy scripts to label any occurrence of 'daa+amn girl' as approval (etc., etc.), and in the end only 28% of the dataset will be mislabeled. The system works!
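A sketch of what such a patch-the-symptom script might look like. The patterns and label names are illustrative, not from the actual dataset, and the joke stands: this approach scales terribly:

```python
# Sketch: the kind of one-off regex patch the comment is joking about.
# Patterns and label names are illustrative, not from the actual dataset.
import re

PATCHES = [
    (re.compile(r"\bda+mn girl\b", re.IGNORECASE), "admiration"),
    (re.compile(r"\bhell yeah\b", re.IGNORECASE), "excitement"),
]

def patched_label(comment: str, original_label: str) -> str:
    """Override known-bad labels with hand-written regex rules."""
    for pattern, label in PATCHES:
        if pattern.search(comment):
            return label
    return original_label

print(patched_label("daaaaaamn girl!", "anger"))  # -> "admiration"
```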
kyriefh (almost 3 years ago)
I have some familiarity with sentiment/intent detection in context-heavy environments (gaming and VR), and I absolutely agree that labeling is both a fundamental and very nuanced problem. An ML PhD was hired to work on toxicity detection, and a primary activity in his first several months was manually watching and labeling game replays – what a use of all that education!

There's something to be said for utilizing community-based reporting as a form of expert labeling for integrity issues specifically, but that's not a silver bullet and has its own baggage.
NelsonMinar (almost 3 years ago)
Indian English is just as valid as American English. The problem here is that they used Indian English speakers to rate Reddit comments, most of which use American English idioms.
estebarb (almost 3 years ago)
At our university we do the classification ourselves, usually with 3-5 labelers per item. And the rate of disagreement between labelers is surprisingly high (even in binary or 4-class classification). People on the internet don't always understand sarcasm, so maybe we need to benchmark humans at this task. But yeah, this and similar datasets (in Spanish, for example) have tons of misclassified texts.
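Disagreement like that is straightforward to quantify. A minimal sketch using Fleiss' kappa; the rating matrix is made-up toy data:

```python
# Sketch: quantify inter-annotator agreement with Fleiss' kappa.
# The ratings below are made-up toy data: 6 items, 4 raters each,
# binary labels (0 = negative, 1 = positive).
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = [  # rows = items, columns = raters
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],  # heavy disagreement
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]

table, _ = aggregate_raters(ratings)  # items x categories count matrix
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")  # < 0.4 suggests weak agreement
```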
whywhywhywhy (almost 3 years ago)
No shock there, really. It's a complete waste of time using a dataset that definitely requires good English fluency to understand the nuance, and even an understanding of the culture and memes of Reddit. You're literally burning money by getting anyone other than actual redditors to label it.
jasonlotito (almost 3 years ago)
The author "previously led AI, Data Science, and Human Computation orgs at Google, Facebook, and Twitter."

And is now writing an ad critical of the company he worked at, about an area he was involved in leading.

This is an interesting route to take in a career. Work for a company, make mistakes, move to another company, and use your old mistakes as a selling point for the new company.

I know this is a harsh take, but it doesn't instill any confidence in the results here. What happens when mistakes happen at Surge? Are the people who made them going to be around to fix them, or are they going to jet off to another position where they once again talk about their previous failures?
Wizrad (almost 3 years ago)
I work at a company that focuses on automating the data-labelling process for computer vision. It is clear that generating massive amounts of labels, either by hand or automatically, without the ability to ensure a consistent level of quality across the dataset, is a problem. That is why we are investing in automating the QA process for training data, so mistakes like these don't happen:

https://blog.encord.com/post/automating-the-assessment-of-training-data-quality-with-encord
fprog (almost 3 years ago)
Anyone interested in this application of machine learning might want to read *How Emotions Are Made* by Dr. Lisa Feldman Barrett. She makes a compelling case that emotions cannot be reliably understood through facial expressions alone, and that context must be included to improve our own human accuracy at the task, let alone machine accuracy. While this article is about a textual dataset and so not an exact parallel, I think some of the same principles apply – namely, that greater context is often needed to interpret the emotion in a message.
HWR_14 (almost 3 years ago)
I've worked on projects with more difficult labeling, and we were able to get fairly accurate results. There are tons of standard practices that produce better results, so why did Google ignore them?
unbalancedevh (almost 3 years ago)
This points to the much more ubiquitous problem of people simply misunderstanding one another. It's very, very common for someone to post a comment intending to emphasize or convey one idea, but it gets interpreted as emphasizing or meaning something different, just because it's read by a person other than the one who wrote it.

It's not limited to Reddit comments, or even to written communication. "You're ignoring me!" "No, I'm trying to give you space."
sebastianconcpt (almost 3 years ago)
I can see how this issue will happen frequently, in diverse domains, and abundantly.

In other words, the prediction is that the most likely outcome will be a lot of AI systems trained to be quite imbecilic, and they will be optimal at that.

The danger is that real people might be assumed to be guilty of things due to AI-trained and -automated imbecility.

It's an ethical problem for the AI community and product designers.
de6u99er (almost 3 years ago)
You get what you pay for!
kache_ (almost 3 years ago)
Step 1: train a model to deduce emotion from vocal tone.
Step 2: use a model to transcribe the text.
Step 3: use both as inputs to a sentiment model.

Use some int, people
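A rough sketch of that pipeline with off-the-shelf components. The three model names are illustrative choices, not the commenter's, and combining the two signals is hand-waved here as a naive print-out:

```python
# Sketch: speech emotion + transcription feeding a text sentiment model.
# All three model names are illustrative off-the-shelf choices; fusing
# the tone and text signals properly is left out of this sketch.
from transformers import pipeline

tone = pipeline("audio-classification",
                model="superb/wav2vec2-base-superb-er")        # emotion from vocal tone
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small")                   # speech -> text
sentiment = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

audio_file = "utterance.wav"  # hypothetical input clip
vocal_emotion = tone(audio_file)[0]        # e.g. {"label": "ang", "score": ...}
text = asr(audio_file)["text"]
text_sentiment = sentiment(text)[0]

print(f"tone: {vocal_emotion['label']}, text: {text!r}, "
      f"sentiment: {text_sentiment['label']}")
```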
xorcist (almost 3 years ago)
> (Who said you can't be a professional memelord?)

Ah, so there's hope for the kids after all!