
The Illustrated BERT: How NLP Cracked Transfer Learning

98 points by ghosthamlet over 6 years ago

5 comments

danieldk over 6 years ago
I think that the work that is done on ELMo, BERT and others is great and useful. Unfortunately, there are many grandiose claims circulating around these papers, such as the title of this blog post. For example:

*If we're using this GloVe representation, then the word "stick" would be represented by this vector no matter what the context was. "Wait a minute," said a number of NLP researchers (Peters et al., 2017, McCann et al., 2017, and yet again Peters et al., 2018 in the ELMo paper), "'stick' has multiple meanings depending on where it's used. Why not give it an embedding based on the context it's used in, to both capture the word meaning in that context as well as other contextual information?" And so, contextualized word embeddings were born.*

This is blatantly false. Contextualized word representations have been around for a very long time. For example, the neural probabilistic language model proposed by Bengio et al., 2003 produces contextual word representations. There have been many papers on neural language models since then. The idea is even older: Schütze's 1993 paper ("Word Space") produces context-dependent word representations from subword units (n-grams).

Researchers have been well aware for decades that ideally one would need context-sensitive representations, and that representations such as those produced by word2vec or GloVe have this shortcoming. However, one of the reasons word2vec became so popular is that it is damn cheap to train [1], and the possibility of pretraining on much larger corpora gave these simpler models an edge.

ELMo, BERT, and others (even though they differ quite a bit) are spiritual successors of earlier neural language models that rely on newer techniques (bidirectional LSTMs, convolutions over characters, Transformers, etc.), larger amounts of data, and the availability of *much* faster hardware than we had one or two decades ago (e.g. BERT was trained on 64 TPU chips, or as Ed Grefenstette called it, *blowing through a forest's worth of GPU-time*).

Disclaimer: I have nothing against this work. I very much enjoyed the ELMo paper. I am just objecting to all the hype/marketing out there.

[1] The skip-gram model with negative sampling is very similar to logistic regression, where one optimizes the parameters of two vectors rather than just one weight vector.
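To make the static-versus-contextual distinction concrete, here is a minimal sketch (not from the article or this thread) of the difference the comment describes: BERT's output vector for "stick" changes with the surrounding sentence, whereas a static lookup table (GloVe-style, or BERT's own input embedding, used here only for illustration) returns the identical row in every context. It assumes the Hugging Face transformers library and PyTorch are installed; the model name, sentences, and helper function are illustrative choices.

```python
# Sketch only: illustrates static vs. contextual word vectors.
# Assumes `pip install torch transformers`; bert-base-uncased is an arbitrary choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def stick_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual vector for the token 'stick' in this sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("stick")]

v1 = stick_vector("he threw the stick for the dog")
v2 = stick_vector("you should stick to the plan")

# A static embedding assigns "stick" one row regardless of context;
# here we read that row from BERT's own input embedding table.
static = model.get_input_embeddings().weight[tokenizer.convert_tokens_to_ids("stick")]

cos = torch.nn.functional.cosine_similarity
print("contextual, across sentences:", cos(v1, v2, dim=0).item())  # noticeably below 1.0
print("static, across sentences: 1.0 (identical vector by construction)")
```

The same contrast holds for GloVe or word2vec vectors loaded from disk: the lookup is a fixed table, so no amount of context changes what "stick" maps to.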
PaulHoule over 6 years ago
Has anyone developed commercial applications based on word embeddings?

It's clear that people are putting up better and better numbers on certain widely shared tasks, but for all I know these will always be a bridesmaid and never a bride when it comes to being useful for something.

Back in the 1970s it was clear that it wasn't going to be easy to make rule-based parsers that were "good enough", but it seems that now the task has been defined down so far that doing better than chance counts as a miracle. Thus people can kid themselves into thinking they are practicing what Thomas Kuhn called "normal science", since they are in the same shared reality even if it is a delusion.
andreyk over 6 years ago
See also "NLP's ImageNet moment has arrived" (https://thegradient.pub/nlp-imagenet/) by one of the researchers involved in the papers surveyed in this post.
deytempo over 6 years ago
It’s ironic that the study of language leads to the creation of a new one
julienfr112 over 6 years ago
How is this related to fastText?