Links to the relevant papers:<p>Bag of Tricks for Efficient Text Classification: <a href="https://arxiv.org/abs/1607.01759v2" rel="nofollow">https://arxiv.org/abs/1607.01759v2</a><p>Enriching Word Vectors with Subword Information: <a href="https://arxiv.org/abs/1607.04606" rel="nofollow">https://arxiv.org/abs/1607.04606</a><p>Both fantastic papers. For those who aren't aware, Mikolov also helped create word2vec.<p>One curious thing: this seems to use hierarchical softmax instead of the "negative sampling" described in their earlier paper <a href="http://arxiv.org/abs/1310.4546" rel="nofollow">http://arxiv.org/abs/1310.4546</a>, despite that paper reporting that "negative sampling" is more computationally efficient and of similar quality. Anyone know why that might be?
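For anyone weighing the two approximations, the standard back-of-the-envelope per-example costs (with $h$ the embedding dimension, $k$ the number of outputs, and $n$ the number of negative samples) are roughly $O(kh)$ for the full softmax, $O(h \log_2 k)$ for hierarchical softmax, and $O((n+1)h)$ for negative sampling. Both are far cheaper than the full softmax for large $k$; which of the two is faster in practice presumably comes down to $n$ versus $\log_2 k$ and to implementation details.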
What exactly does it do?<p>It says this:
fastText is a library for efficient learning of word representations and sentence classification.<p>What does that mean? Is it for sentiment analysis?
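For what it's worth, the README describes two separate things: learning word representations from raw text, and training a supervised text classifier (sentiment analysis being one example of the latter). A minimal sketch with the command-line tool, where the file names are just placeholders:<p>`# learn word vectors from unlabeled text (skipgram; a cbow mode exists too)
./fasttext skipgram -input data.txt -output wordvecs

# train a classifier; each line of train.txt starts with "__label__<label>" followed by the text
./fasttext supervised -input train.txt -output model

# predict the most likely label for each line of test.txt
./fasttext predict model.bin test.txt`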
I noticed that the C++ code has no comments whatsoever. Why would they do that? Is the code considered clear enough that you can read the papers to figure it out, or do they strip comments before releasing internal code to the public?
The classification format is a bit confusing to me. Given a
file that looks like this:<p>Help - how do I format blocks of code/bash output in this editor?<p>`fastText josephmisiti$ cat train.tsv | head -n 2
1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1
2 1 A series of escapades demonstrating the adage that what is good for the goose 2<p>Are they saying to reformat it like this?<p>cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'`<p>giving me<p>`fastText josephmisiti$ cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'
__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
__label__2 A series of escapades demonstrating the adage that what is good for the goose
__label__2 A series
__label__2 A
__label__2 series
__label__2 of escapades demonstrating the adage that what is good for the goose
__label__2 of
__label__2 escapades demonstrating the adage that what is good for the goose
__label__2 escapades
__label__2 demonstrating the adage that what is good for the goose`
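If it helps, I believe that's the right idea: each training line starts with the label prefixed by __label__ (the prefix is configurable via -label), followed by the text. With the data in that shape, training and evaluation look roughly like this - file names are placeholders, and the test file should of course be a held-out split rather than the training data:<p>`# train a classifier on the reformatted file
./fasttext supervised -input train.ft.txt -output sentiment_model

# print precision/recall at 1 on a held-out file in the same format
./fasttext test sentiment_model.bin valid.ft.txt

# or emit one predicted label per input line
./fasttext predict sentiment_model.bin valid.ft.txt`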
For supervised classification this tool is suitable when your dataset is large enough. I ran some tests with binary classification (Twitter sentiment) on a corpus of ~7,000 samples, and the result is not impressive (~0.77). Vowpal Wabbit performs slightly better here, with almost the same training time.<p>I'm looking forward to trying it on some bigger datasets.<p>I also wonder whether it's possible to use a separately trained word vector model for the supervised task?
This might be a naïve question, but does anyone know if this is suitable for online classification tasks? All the examples in the paper ([2] in the readme) seemed to be for offline classification. I'm not terribly well versed in this area so I don't know if the techniques used here allow the model to be updated incrementally.
Can this be used to do automatic summarization? I have been really interested in that topic, and I've played with TextRank and LexRank, but they don't provide as meaningful summaries as I would want.
Just to mirror what was said on the thread a month ago when the paper came out[1]: if you're interested in FastText, I'd strongly recommend checking out Vowpal Wabbit[2] and BIDMach[3].<p>My main issue is that the FastText paper[7] only compares against other intensive deep learning methods and not against comparable performance-focused systems like Vowpal Wabbit or BIDMach.<p>Many of the features implemented in FastText have existed in Vowpal Wabbit (VW) for many years. Vowpal Wabbit also serves as a test bed for many other interesting, but still highly performant, ideas, and has reasonably strong documentation. The command line interface is highly intuitive and it will burn through your datasets quickly. You can recreate FastText in VW with a few command line options[6].<p>BIDMach is focused on "rooflining", or working out the exact performance characteristics of the hardware and aiming to maximize them[4]. While VW doesn't have word2vec, BIDMach does[5], and more generally word2vec isn't going to be a major slow point in your system, since word2vec is actually pretty speedy.<p>To quote from my last comment in [1] regarding features:<p>Behind the speed of both methods [VW and FastText] is the use of ngrams^, the feature hashing trick (think Bloom filter, except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal), and using a shallow instead of a deep model.<p>^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information, but for many text classification tasks that's fine.<p>[1]: <a href="https://news.ycombinator.com/item?id=12063296" rel="nofollow">https://news.ycombinator.com/item?id=12063296</a><p>[2]: <a href="https://github.com/JohnLangford/vowpal_wabbit" rel="nofollow">https://github.com/JohnLangford/vowpal_wabbit</a><p>[3]: <a href="https://github.com/BIDData/BIDMach" rel="nofollow">https://github.com/BIDData/BIDMach</a><p>[4]: <a href="https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_Data" rel="nofollow">https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_D...</a><p>[5]: <a href="https://github.com/BIDData/BIDMach/blob/master/src/main/scala/BIDMach/networks/Word2Vec.scala" rel="nofollow">https://github.com/BIDData/BIDMach/blob/master/src/main/scal...</a><p>[6]: <a href="https://twitter.com/haldaume3/status/751208719145328640" rel="nofollow">https://twitter.com/haldaume3/status/751208719145328640</a><p>[7]: <a href="https://arxiv.org/abs/1607.01759" rel="nofollow">https://arxiv.org/abs/1607.01759</a>
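Since the hashing trick tends to be the least familiar piece, here's a toy shell illustration of the idea: every bigram is hashed straight into a fixed number of buckets, so there's no explicit vocabulary to build and unrelated ngrams occasionally collide. This is only a sketch of the concept - VW and fastText use proper hash functions over a much larger space (fastText exposes the size as a -bucket option, if I remember right), not cksum over 16 buckets:<p>`echo "the cat sat on the mat" \
  | awk '{ for (i = 1; i < NF; i++) print $i, $(i+1) }' \
  | while read -r ngram; do
      # fold each bigram's checksum into 16 buckets
      bucket=$(( $(printf '%s' "$ngram" | cksum | cut -d' ' -f1) % 16 ))
      printf '%-10s -> bucket %2d\n' "$ngram" "$bucket"
    done`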
I coded something like this before for personal use; it lets me evaluate my Facebook/Twitter status before posting online and classify it as "negative, sarcastic, positive, or helpful" so that I can be careful about what I'm posting. I use Bayesian filtering with trained word lists I gathered for negative, sarcastic, positive, and helpful, and then use scoring to work out what the sentence actually means.
The simultaneous training of word representations and a classifier seems like it ignores the typically much larger unsupervised portion of the corpus. Is there a way to train the word representations on the full corpus and then apply them to the smaller classification training set?
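As far as I can tell, the intended workflow for that is exactly what you describe: pretrain vectors on the big unlabeled corpus, then seed the supervised model with them. The supervised mode takes a -pretrainedVectors option pointing at a .vec file, with the requirement that -dim matches the pretrained dimension - though I'm not certain that option is in this initial release, so treat this as a sketch:<p>`# 1) unsupervised pretraining on the full, unlabeled corpus
./fasttext skipgram -input full_corpus.txt -output pretrained -dim 100

# 2) supervised training on the labeled subset, initialized from those vectors
./fasttext supervised -input train.ft.txt -output classifier \
    -dim 100 -pretrainedVectors pretrained.vec`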