TechEcho

Ask HN: Machine learning to classify tweets into topics – Where to start?

15 points by Sujan over 7 years ago

I want to build something that can classify tweets (from my timeline or lists) by topic. I can of course provide a dataset of tweets pre-classified for each variant, but then the software should be able to learn from that and apply that logic to new, future tweets.

Variant A) Take all tweets of a timeline and decide if one was "relevant" or "irrelevant".

Variant B) Decide which of x topics a tweet belongs to.

Variant C) Group similar tweets together that are related.

Where to start? What are the correct terms to Google? What libraries or software should I look at?

(I am most comfortable with JS and PHP, but of course this is only semi relevant.)

9 comments

mindcrime over 7 years ago
A) and B) are variations of a classification problem. You can use any kind of classifier algorithm / approach. Naive Bayes, ANN, etc. C) is arguably closer to a clustering problem, but you have to figure out how to define the notion of distance / density for the clustering.

Basically, from a "what to google for" perspective, I'd say read up on "classification algorithms" and "clustering algorithms". The respective Wikipedia pages aren't a bad place to start reading.

https://en.wikipedia.org/wiki/Statistical_classification

https://en.wikipedia.org/wiki/Cluster_analysis
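To make the classification side (Variants A and B) concrete, here is a minimal sketch of a multinomial naive Bayes classifier written with only the Python standard library. The training tweets, the `tech`/`sports` labels, and the whitespace tokenizer are all made-up assumptions for illustration; a real project would more likely use an existing library and far more data.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add log likelihoods per word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical, hand-labelled training tweets.
train_texts = [
    "new python release adds pattern matching",
    "rust compiler gets faster builds",
    "team wins the cup final",
    "striker scores twice in the derby",
]
train_labels = ["tech", "tech", "sports", "sports"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("python compiler release"))  # tech
print(model.predict("cup final derby"))          # sports
```

Variant A is the same setup with two labels ("relevant" and "irrelevant"); Variant B simply uses more labels.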
syllogism over 7 years ago
https://github.com/explosion/spaCy/blob/develop/examples/training/train_textcat.py

You'll need to `pip install spacy-nightly` -- it uses spaCy 2 alpha. Until spaCy 2 stabilises it'll be a bit unstable, and unfortunately it's not in the docs yet.

You'll likely get better results by running `spacy download en_vectors_web_lg` and doing `nlp = spacy.load('en_vectors_web_lg')`. This will download word vectors, trained on an enormous web dump by Stanford NLP using their GloVe algorithm.

Once the model is trained you can do `nlp.to_disk(output_directory)`, and then run `spacy package <model directory> <package directory>`. This will set up the model data as a Python package, so that you can run `setup.py sdist`. You'll then get a self-contained Python package that exposes a `load()` function, to give you back the `nlp` object. (Note that if you do base the model on the GloVe vectors the package will be enormous, like 1GB. Shrug?)

If you're starting one step back and don't have the data annotated yet, you might be interested in our annotation tool Prodigy. There's a demo video of the text classification workflow here: https://www.youtube.com/watch?time_continue=638&v=5di0KlKl0fE
chasedehan over 7 years ago
JS and PHP won't get you anywhere - you will need to use something else like R or Python to start looking at it.

Check out this course: https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words

This will help you figure out how to convert those words into variables which can be used for modeling.
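As a rough illustration of what "converting words into variables" means, the sketch below builds a bag-of-words count matrix with the standard library only. The function name `bag_of_words` and the two sample tweets are hypothetical; the linked course may approach this differently.

```python
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in text i."""
    tokenized = [text.lower().split() for text in texts]
    # One column per distinct word, in a fixed (sorted) order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

tweets = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(tweets)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is now a fixed-length numeric vector, which is the form most classifiers and clustering algorithms expect.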
itamarst over 7 years ago
http://www.nltk.org/book/ch06.html
arrmn over 7 years ago
I'm currently working on something similar; these were my first two ideas:

Try to train your own word2vec model on a Twitter dataset, and then you could use the tf-idf weighted average of these vectors. You get a vector for each tweet, and tweets that are about the same topic should be next to each other. Then try clustering algorithms; you can use the cosine distance to find the nearest X tweets.

The second idea would be to train doc2vec with Twitter data.

Another worthwhile idea could be to use LDA, though I haven't tried it myself.
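The first idea can be sketched as follows, assuming the word vectors already exist. The 2-D vectors in `WORD_VECTORS` are invented stand-ins for what a trained word2vec model would provide, and the idf formula (`log(n / df) + 1`) is one common variant, not necessarily the one the commenter used.

```python
import math
from collections import Counter

# Hypothetical 2-D "embeddings"; a trained word2vec model would supply real ones.
WORD_VECTORS = {
    "python": [1.0, 0.0], "code": [0.9, 0.1], "bug": [0.8, 0.2],
    "goal": [0.0, 1.0], "match": [0.1, 0.9], "team": [0.2, 0.8],
}

def idf_weights(tokenized_docs):
    """Inverse document frequency per token: rarer tokens weigh more."""
    n = len(tokenized_docs)
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    return {token: math.log(n / count) + 1.0 for token, count in df.items()}

def tweet_vector(tokens, idf):
    """Average the word vectors, weighting each word by tf * idf."""
    tf = Counter(t for t in tokens if t in WORD_VECTORS)
    total = [0.0, 0.0]
    weight_sum = 0.0
    for token, count in tf.items():
        weight = count * idf.get(token, 1.0)
        total = [a + weight * b for a, b in zip(total, WORD_VECTORS[token])]
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [t.split() for t in ["python code bug", "code bug python python", "team goal match"]]
idf = idf_weights(docs)
vectors = [tweet_vector(doc, idf) for doc in docs]

same_topic = cosine_similarity(vectors[0], vectors[1])
other_topic = cosine_similarity(vectors[0], vectors[2])
print(same_topic > other_topic)  # True
```

Cosine distance is just `1 - cosine_similarity`, so finding the nearest X tweets means sorting by that value.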
byoung2 over 7 years ago
You could look at naive Bayesian classifiers or logistic regression classifiers. There are libraries for both in most languages and they are suited for your application.
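For the logistic regression option, a minimal sketch for Variant A (relevant vs. irrelevant) might look like this. It trains with plain stochastic gradient descent using only the standard library; the four-word `VOCABULARY`, the sample tweets, and the learning rate are all assumptions chosen to keep the example tiny.

```python
import math

VOCABULARY = ["python", "release", "lunch", "coffee"]  # hypothetical feature words

def featurize(text):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCABULARY]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, learning_rate=0.5, epochs=200):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - target  # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def relevance(text, weights, bias):
    """Estimated probability that the tweet is relevant (label 1)."""
    features = featurize(text)
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

# Hypothetical labelled tweets: 1 = relevant, 0 = irrelevant.
tweets = ["python release today", "love python", "lunch and coffee", "more coffee please"]
labels = [1, 1, 0, 0]
weights, bias = train([featurize(t) for t in tweets], labels)

print(relevance("new python release", weights, bias) > 0.5)  # True
print(relevance("coffee after lunch", weights, bias) < 0.5)  # True
```

Unlike naive Bayes, this gives a probability you can threshold, which is handy if "relevant" should be a tunable cutoff.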
mmikeff over 7 years ago
I'm lazy and would start with TextRazor or AlchemyAPI.
gerenuk over 7 years ago
For topic modeling, take a look at gensim along with k-means. Also, you can use tf-idf to improve the accuracy.
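A small sketch of the tf-idf plus k-means combination for Variant C, written with the standard library rather than gensim so it stays self-contained. The sample tweets are made up, and the farthest-point initialisation is a simplification chosen to keep the run deterministic; library implementations typically use random or k-means++ starts.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Turn texts into L2-normalised tf-idf vectors over a shared vocabulary."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({tok for d in docs for tok in d})
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[w] * (math.log(n / df[w]) + 1.0) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    # Farthest-point initialisation: start from the first vector, then keep
    # adding the vector that is farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids)))
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

tweets = [
    "python code review tips",
    "python code bug fix",
    "football match tonight",
    "football match result",
]
assignments = kmeans(tfidf_vectors(tweets), k=2)
print(assignments)  # [0, 0, 1, 1]
```

Because the vectors are normalised to unit length, ranking by Euclidean distance here gives the same order as ranking by cosine distance.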
kk58 over 7 years ago
For tweets, naive Bayes with a bag-of-words (BoW) approach works very well.

You need to do a ton of preprocessing. Think text transformation.
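The comment does not say which transformations are meant, so the following is only one plausible sketch of tweet preprocessing: lowercasing, then stripping URLs, @-mentions, hashtag symbols, and punctuation before tokenizing. The regular expressions and the sample tweet are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

def preprocess(tweet):
    """Normalise a raw tweet into a list of lowercase word tokens."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @usernames
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = NON_WORD_RE.sub(" ", text)  # drop remaining punctuation
    return text.split()

tokens = preprocess("RT @alice: Loving the new #Python release!! https://example.com/x")
print(tokens)  # ['rt', 'loving', 'the', 'new', 'python', 'release']
```

Further steps such as removing stop words (including "rt") or stemming are common additions, depending on how the classifier behaves on real data.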