
Word vectors are awesome but you don’t need a neural network to find them

276 points by blopeur over 7 years ago

24 comments

tensor over 7 years ago
After reading this I'm left wondering why anyone should stop using word2vec. The article makes the point that you can produce word vectors using other techniques, in this case by computing probabilities of unigrams and skip-grams and running SVD.

This is all well and good, but from an industry practitioner's standpoint it doesn't explain why one would avoid using, or actually stop using, word2vec.

1. Several known-good word2vec implementations exist; the complexity of the technique doesn't really matter, as you can just pick one of these and use it.

2. Pretrained word vectors produced by word2vec and newer algorithms exist for many languages.

Why should someone stop using these and instead spend time implementing a simpler method that produces *maybe good enough* vectors? Being a simpler method isn't a reason in and of itself.
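
For reference, here is a minimal sketch of the counting-plus-SVD pipeline the article describes (skip-gram counts, PMI, then truncated SVD). It is an illustration rather than the article's exact code, and assumes scikit-learn's TruncatedSVD and a toy corpus:

    import numpy as np
    from collections import Counter
    from sklearn.decomposition import TruncatedSVD

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
    window = 2

    # Count unigrams and skip-gram (word, context) pairs within the window.
    unigrams, pairs = Counter(), Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            unigrams[w] += 1
            for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                pairs[(w, c)] += 1

    vocab = sorted(unigrams)
    idx = {w: i for i, w in enumerate(vocab)}
    total_pairs = sum(pairs.values())
    total_words = sum(unigrams.values())

    # Positive PMI matrix: log P(w, c) / (P(w) P(c)), floored at zero.
    pmi = np.zeros((len(vocab), len(vocab)))
    for (w, c), n in pairs.items():
        p_wc = n / total_pairs
        p_w, p_c = unigrams[w] / total_words, unigrams[c] / total_words
        pmi[idx[w], idx[c]] = max(0.0, np.log(p_wc / (p_w * p_c)))

    # Truncated SVD turns the sparse counts into dense word vectors.
    vectors = TruncatedSVD(n_components=2).fit_transform(pmi)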
jdonaldson over 7 years ago
One other benefit of word2vec-style training is that you can also control the learning rate and gracefully handle new training data.

SVD must be done all at once, and you need to use sparse matrix abstractions for the raw word vectors. The implementation and abstractions you use actually make it *more* complex than word2vec, imho.

Word2vec can train off pretty much any type of sequence. You can adjust the learning rate on the fly (to emphasize earlier/later events), stop or start incremental training, and with Doc2Vec you can train embeddings for more abstract tokens in a much more straightforward manner (doc ids, user ids, etc.).

While word2vec embeddings are not always reproducible, they are much more stable with the addition of new training data. This is key if you want some stability in a production system over time.

Also, somebody edited the title of the article, thanks! The original title of "Stop using word2vec" is click-bait FUD rubbish. I think in this case we're trying too hard to wring a good discussion out of a bad article.
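
As a sketch of that incremental workflow, assuming gensim 4.x (the corpus, parameter values, and learning rates below are arbitrary examples, not recommendations):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    # Later, fold in new data without retraining from scratch.
    new_sentences = [["a", "new", "batch", "of", "text"]]
    model.build_vocab(new_sentences, update=True)        # extend the vocabulary
    model.train(new_sentences,
                total_examples=len(new_sentences),
                epochs=model.epochs,
                start_alpha=0.01, end_alpha=0.001)       # adjust the learning rate on the fly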
stared over 7 years ago
Well, the point that word2vec can (and should) be understood in terms of word co-occurrences (and pointwise mutual information) is important, but hardly new. I tried to explain it here: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html

There is a temptation to use just the word pair counts, skipping SVD, but it won't yield the best results. Creating vectors not only compresses the data, but also finds general patterns. This compression is super important for less frequent words (otherwise we get a lot of overfitting). See "Why do low dimensional embeddings work better than high-dimensional ones?" from http://www.offconvex.org/2016/02/14/word-embeddings-2/
arrmn over 7 years ago
Word embeddings are not just useful for text; they can be applied whenever you have relations between "tokens". You can use them to identify nodes in graphs that belong to the same group [0]. Another, in my opinion, really interesting idea is to apply them to relational databases [1]: you can simply ask for similar rows.

It's an interesting article, but the author didn't really provide good arguments for why I should stop using w2v.

[0] http://www.kdd.org/kdd2017/papers/view/struc2vec-learning-node-representations-from-structural-identity
[1] https://arxiv.org/abs/1603.07185
oh-kumudo over 7 years ago
> Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them

Word2vec is not deep learning (the skip-gram algorithm is basically one matrix multiplication followed by a softmax; there isn't even a place for an activation function, so why call it deep learning?), and it is simple and efficient. Most of all, there is no overhead to using word2vec, just a choice between pre-trained vectors and ones you train yourself.

I don't understand what this article is trying to say.
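
To make the "one matrix multiplication plus softmax" point concrete, a minimal sketch of the skip-gram forward pass (toy sizes; the variable names are illustrative, not from the article):

    import numpy as np

    vocab_size, dim = 10, 4
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(vocab_size, dim))      # input (center-word) embeddings
    W_out = rng.normal(size=(vocab_size, dim))     # output (context-word) embeddings

    center = 3                                     # index of the center word
    scores = W_out @ W_in[center]                  # the one matrix multiplication
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

    # Training nudges W_in and W_out so that probs puts mass on the observed
    # context words; there is no hidden nonlinearity anywhere.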
make3 over 7 years ago
This is Facebook's super efficient word2vec-like implementation: https://github.com/facebookresearch/fastText. I thought people might find it interesting.
kuschku over 7 years ago
So, where do I get premade versions of this, including all words of the 28 largest languages? This is one of the most valuable properties of word2vec and co: prebuilt versions for many languages, with every word of the dictionary in them.

Once you have that, we can talk about actually replacing word2vec and similar solutions.
serveboy over 7 years ago
word2vec yields better representations than PMI-SVD. If you want a better explicit PMI matrix factorization, have a look at https://github.com/alexandres/lexvec and the original paper, which explains why SVD performs poorly.

If you are looking for word embeddings for production use, check out fastText, LexVec, GloVe, or word2vec. Don't use the approach described in this article.
fnl over 7 years ago
SVD scales with the number of items cubed; w2v scales linearly. Typical real-world vocabularies are 1-10M words, not 10-100k. This article is FUD at best and, IMO, just plain BS.
rpedela over 7 years ago
This is a great, simple explanation of word vectors. However, I think the argument would have been stronger if there were numbers showing that this simplified method and word2vec are similarly accurate, as the author claims.
kirillkh over 7 years ago
Asking as someone who barely has any clue in this field: is there a way to use this for full-text search, e.g. Lucene? I know from experience that for some languages (e.g. Hebrew) there are no good stemmers available out of the box, so can you easily build a stemmer/lemmatizer (or even something more powerful? [1]) on top of word2vec or fastText?

[1] E.g., for each word in a document or a search string, it would generate not just its base form, but also a list of the top 3 base forms that are different but similar in meaning to this word's base form (where the meaning is inferred based on context).
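
Not a stemmer, but the nearest-neighbour half of that idea is easy to try with pretrained vectors. A sketch assuming gensim 4.x and its downloader (the model name and query word are just examples):

    import gensim.downloader as api

    # Any pretrained KeyedVectors model would work the same way.
    vectors = api.load("glove-wiki-gigaword-100")

    # Top 3 words closest in meaning to a query term.
    for word, score in vectors.most_similar("search", topn=3):
        print(word, round(score, 3))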
hellrich over 7 years ago
One argument for SVD is the low reliability of word2vec embeddings (results fluctuate across repeated experiments), which hampers (qualitative) interpretation of the resulting embedding spaces; see: http://www.aclweb.org/anthology/C/C16/C16-1262.pdf
Piezoid over 7 years ago
Random projection methods are cheaper alternatives to SVD. For example, you can bin contexts with a hash function and count collocations between words and binned contexts the same way this article does, then apply weighting and SVD if you really want the top n principal components.

What's nice about counting methods is that you can simply add matrices from different collections of documents.
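
A rough sketch of that hashed-context counting (illustrative only; the bin count and window are arbitrary, and a stable hash such as one from hashlib would be preferable to Python's randomized built-in hash in practice):

    import numpy as np

    corpus = [["the", "cat", "sat", "on", "the", "mat"]]
    num_bins, window = 64, 2

    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Rows are words; columns are hashed context bins instead of a full context vocabulary.
    counts = np.zeros((len(vocab), num_bins))
    for sent in corpus:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                counts[idx[w], hash(c) % num_bins] += 1

    # Matrices built this way from different corpora can simply be added,
    # then weighted (e.g. PMI) and reduced with SVD if desired.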
Radim over 7 years ago
Article explaining this relationship between matrix factorizations (SVD) and word2vec [2014]:

https://rare-technologies.com/making-sense-of-word2vec/

(Also contains benchmark experiments with concrete numbers and GitHub code -- author here.)
kevinalbert over 7 years ago
I must be missing something here. In step 3, PMI for x, y is calculated as:

log( P(x|y) / ( P(x)P(y) ) )

Because the skip-gram probabilities are sparse, P(x|y) is often going to be zero, so taking the log yields negative infinity. The result is a dense PMI matrix filled (mostly) with -Inf.

Should we be adding 1 before taking the log?
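
One common workaround (not necessarily what the article intends) is positive PMI: compute PMI only where a pair was actually observed and leave everything else at zero, for example:

    import numpy as np

    # counts: dense (words x contexts) co-occurrence matrix (toy values).
    counts = np.array([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 1.0]])

    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    p_wc = counts / total

    ppmi = np.zeros_like(counts)
    nz = counts > 0                              # only observed pairs
    ppmi[nz] = np.maximum(0.0, np.log(p_wc[nz] / (p_w @ p_c)[nz]))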
wodenokoto over 7 years ago
I thought w2v didn't have any hidden layers or non-linear activation function, making it essentially a linear regression.

Do I need to reread some papers?
make3 over 7 years ago
Also, word2vec is super fast and works great. The text has no convincing argument for why not to use it, unless you don't want to learn basic neural nets. Even then, just use Facebook's fastText: https://github.com/facebookresearch/fastText
justwantaccount over 7 years ago
I thought this article was going to talk about GloVe, which actually performs better than Google's word2vec without a neural network, according to its paper, but I guess not.
anentropic over 7 years ago
One of the tables in the article mentions something called "word2tensor", but Google doesn't throw up anything, except this tweet, which seems to be from a conference: https://twitter.com/ic/status/756918600846356480?lang=en

Does anyone have any info about it?
KasianFranks over 7 years ago
Here's one more reason:

Word2Vec is based on an approach from Lawrence Berkeley National Lab, as posted in "Bag of Words Meets Bags of Popcorn" 3 years ago: "Google silently did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let underlying predictive models get to work learning."

"This is a really, really, really big deal," said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. "... It's going to enable whole new classes of products that have never existed before." https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/

Spotify seems to be using it now: http://www.slideshare.net/AndySloane/machine-learning-spotify-madison-big-data-meetup (p. 34)

But here's the interesting part:

Lawrence Berkeley National Lab had been working on an approach more detailed than word2vec (in terms of how the vectors are structured) since 2005, going by the bottom of their patent: http://www.google.com/patents/US7987191 The Berkeley Lab method also seems much more exhaustive, using a Fibonacci-based distance decay for proximity between words, such that vectors contain up to thousands of scored and ranked feature attributes beyond the bag-of-words approach. They also use filters to control the context of the output. It was also made part of the search/knowledge-discovery tech that won a 2008 R&D 100 award: http://newscenter.lbl.gov/news-releases/2008/07/09/berkeley-lab-wins-four-2008-rd-100-awards/ & http://www2.lbl.gov/Science-Articles/Archive/sabl/2005/March/06-genopharm.html

A search company that competed with Google, called "seeqpod", was spun out of Berkeley Lab using the tech, but was then sued for billions by Steve Jobs https://medium.com/startup-study-group/steve-jobs-made-warner-music-sue-my-startup-9a81c5a21d68#.jw76fu1vo and a few media companies http://goo.gl/dzwpFq

We might combine these approaches, as there seems to be something fairly important happening in this area. Recommendations and sentiment analysis seem to be driving the bottom lines of companies today, including Amazon, Google, Netflix, Apple et al.

https://www.kaggle.com/c/word2vec-nlp-tutorial/discussion/12349
make3 over 7 years ago
Professionals don't use pretrained word2vec vectors in really complex deep learning models (like neural machine translation) anymore; they let the models train their own word embeddings directly, or let the models learn character-level embeddings.
phy6 over 7 years ago
I find these baiting titles tiresome, and I generally assume (even if it makes an ass out of me) that the author is splitting hairs or wants to grandstand on some inefficiency that most of us knew was there already. (I'm assuming that if they had a real argument, it would have been in a descriptive title.) With titles like these I'll go straight to the comments section before giving you any ad revenue. This is HN, not Buzzfeed, and we deserve better than this.
JKirchartz over 7 years ago
Wake me up when you have readily available libraries to implement this in multiple programming languages.
yters over 7 years ago
This is excellent. This is the kind of machine learning we need, the kind that provides understanding instead of "throw this NN at this lump of data and tweak parameters until the error is small enough."