Links to the relevant papers:<p>Bag of Tricks for Efficient Text Classification: <a href="https://arxiv.org/abs/1607.01759v2" rel="nofollow">https://arxiv.org/abs/1607.01759v2</a><p>Enriching Word Vectors with Subword Information: <a href="https://arxiv.org/abs/1607.04606" rel="nofollow">https://arxiv.org/abs/1607.04606</a><p>Both fantastic papers. For those who aren't aware, Mikolov also helped create word2vec.<p>One curious thing: this seems to use hierarchical softmax instead of the "negative sampling" described in their earlier paper <a href="http://arxiv.org/abs/1310.4546" rel="nofollow">http://arxiv.org/abs/1310.4546</a>, despite that paper reporting that "negative sampling" is more computationally efficient and of similar quality. Anyone know why that might be?
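For anyone weighing the two approximations, the standard back-of-the-envelope per-example costs (with $h$ the embedding dimension, $k$ the number of outputs, and $n$ the number of negative samples) are roughly $O(kh)$ for the full softmax, $O(h \log_2 k)$ for hierarchical softmax, and $O((n+1)h)$ for negative sampling. Both are far cheaper than the full softmax for large $k$; which of the two is faster in practice presumably comes down to $n$ versus $\log_2 k$ and to implementation details.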
What exactly does it do?<p>It says this:
fastText is a library for efficient learning of word representations and sentence classification.<p>What does that mean? Is it for sentiment analysis?
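For what it's worth, the README describes two separate things: learning word representations from raw text, and training a supervised text classifier (sentiment analysis being one example of the latter). A minimal sketch with the command-line tool, where the file names are just placeholders:<p>`# learn word vectors from unlabeled text (skipgram; a cbow mode exists too)
./fasttext skipgram -input data.txt -output wordvecs

# train a classifier; each line of train.txt starts with "__label__<label>" followed by the text
./fasttext supervised -input train.txt -output model

# predict the most likely label for each line of test.txt
./fasttext predict model.bin test.txt`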
I noticed that the C++ code has no comments whatsoever. Why would they do that? Is the code considered clear enough that you can read the papers to figure it out, or do they strip comments before releasing internal code to the public?
The classification format is a bit confusing to me. Given a
file that looks like this:<p>Help - how do I format blocks of code/bash output in this editor?<p>`fastText josephmisiti$ cat train.tsv | head -n 2
1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1
2 1 A series of escapades demonstrating the adage that what is good for the goose 2<p>Are they saying to reformat it like this?<p>cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'`<p>giving me<p>`fastText josephmisiti$ cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'
__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
__label__2 A series of escapades demonstrating the adage that what is good for the goose
__label__2 A series
__label__2 A
__label__2 series
__label__2 of escapades demonstrating the adage that what is good for the goose
__label__2 of
__label__2 escapades demonstrating the adage that what is good for the goose
__label__2 escapades
__label__2 demonstrating the adage that what is good for the goose`
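If it helps, I believe that's the right idea: each training line starts with the label prefixed by __label__ (the prefix is configurable via -label), followed by the text. With the data in that shape, training and evaluation look roughly like this - file names are placeholders, and the test file should of course be a held-out split rather than the training data:<p>`# train a classifier on the reformatted file
./fasttext supervised -input train.ft.txt -output sentiment_model

# print precision/recall at 1 on a held-out file in the same format
./fasttext test sentiment_model.bin valid.ft.txt

# or emit one predicted label per input line
./fasttext predict sentiment_model.bin valid.ft.txt`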
For supervised classification this tool is suitable when your dataset is large enough. I ran some tests with binary classification (Twitter sentiment) on a corpus of ~7,000 samples, and the result is not impressive (~0.77). Vowpal Wabbit performs slightly better here, with almost the same training time.<p>I'm looking forward to trying it on some bigger datasets.<p>I also wonder whether it's possible to use a separately trained word vector model for the supervised task?
This might be a naïve question, but does anyone know if this is suitable for online classification tasks? All the examples in the paper ([2] in the readme) seemed to be for offline classification. I'm not terribly well versed in this area so I don't know if the techniques used here allow the model to be updated incrementally.
Can this be used to do automatic summarization? I have been really interested in that topic, and I've played with TextRank and LexRank, but they don't provide as meaningful summaries as I would want.
Just to mirror what was said on the thread a month ago when the paper came out[1]: if you're interested in FastText, I'd strongly recommend checking out Vowpal Wabbit[2] and BIDMach[3].<p>My main issue is that the FastText paper[7] only compares against other intensive deep learning methods and not against comparable performance-focused systems like Vowpal Wabbit or BIDMach.<p>Many of the features implemented in FastText have existed in Vowpal Wabbit (VW) for many years. Vowpal Wabbit also serves as a test bed for many other interesting, but still highly performant, ideas, and has reasonably strong documentation. The command line interface is highly intuitive and it will burn through your datasets quickly. You can recreate FastText in VW with a few command line options[6].<p>BIDMach is focused on "rooflining", or working out the exact performance characteristics of the hardware and aiming to maximize them[4]. While VW doesn't have word2vec, BIDMach does[5], and more generally word2vec isn't going to be a major slow point in your system, since word2vec is actually pretty speedy.<p>To quote from my last comment in [1] regarding features:<p>Behind the speed of both methods [VW and FastText] is the use of ngrams^, the feature hashing trick (think Bloom filter, except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal), and using a shallow instead of a deep model.<p>^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information, but for many text classification tasks that's fine.<p>[1]: <a href="https://news.ycombinator.com/item?id=12063296" rel="nofollow">https://news.ycombinator.com/item?id=12063296</a><p>[2]: <a href="https://github.com/JohnLangford/vowpal_wabbit" rel="nofollow">https://github.com/JohnLangford/vowpal_wabbit</a><p>[3]: <a href="https://github.com/BIDData/BIDMach" rel="nofollow">https://github.com/BIDData/BIDMach</a><p>[4]: <a href="https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_Data" rel="nofollow">https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_D...</a><p>[5]: <a href="https://github.com/BIDData/BIDMach/blob/master/src/main/scala/BIDMach/networks/Word2Vec.scala" rel="nofollow">https://github.com/BIDData/BIDMach/blob/master/src/main/scal...</a><p>[6]: <a href="https://twitter.com/haldaume3/status/751208719145328640" rel="nofollow">https://twitter.com/haldaume3/status/751208719145328640</a><p>[7]: <a href="https://arxiv.org/abs/1607.01759" rel="nofollow">https://arxiv.org/abs/1607.01759</a>
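Since the hashing trick tends to be the least familiar piece, here's a toy shell illustration of the idea: every bigram is hashed straight into a fixed number of buckets, so there's no explicit vocabulary to build and unrelated ngrams occasionally collide. This is only a sketch of the concept - VW and fastText use proper hash functions over a much larger space (fastText exposes the size as a -bucket option, if I remember right), not cksum over 16 buckets:<p>`echo "the cat sat on the mat" \
  | awk '{ for (i = 1; i < NF; i++) print $i, $(i+1) }' \
  | while read -r ngram; do
      # fold each bigram's checksum into 16 buckets
      bucket=$(( $(printf '%s' "$ngram" | cksum | cut -d' ' -f1) % 16 ))
      printf '%-10s -> bucket %2d\n' "$ngram" "$bucket"
    done`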
I coded something like this before for personal use; it lets me evaluate my Facebook/Twitter status before posting online and classify it as "negative, sarcastic, positive, or helpful" so that I can be careful about what I'm posting. I use Bayesian filtering with trained word lists I gathered for negative, sarcastic, positive, and helpful, and then use scoring to work out what the sentence actually means.
The simultaneous training of word representations and a classifier seems like it ignores the typically much larger unsupervised portion of the corpus. Is there a way to train the word representations on the full corpus and then apply them to the smaller classification training set?
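As far as I can tell, the intended workflow for that is exactly what you describe: pretrain vectors on the big unlabeled corpus, then seed the supervised model with them. The supervised mode takes a -pretrainedVectors option pointing at a .vec file, with the requirement that -dim matches the pretrained dimension - though I'm not certain that option is in this initial release, so treat this as a sketch:<p>`# 1) unsupervised pretraining on the full, unlabeled corpus
./fasttext skipgram -input full_corpus.txt -output pretrained -dim 100

# 2) supervised training on the labeled subset, initialized from those vectors
./fasttext supervised -input train.ft.txt -output classifier \
    -dim 100 -pretrainedVectors pretrained.vec`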