TechEcho

9 comments

mindcrime over 7 years ago

A) and B) are variations of a classification problem. You can use any kind of classifier algorithm / approach: Naive Bayes, ANN, etc. C) is arguably closer to a clustering problem, but you have to figure out how to define the notion of distance / density for the clustering.

Basically, from a "what to google for" perspective, I'd say read up on "classification algorithms" and "clustering algorithms". The respective Wikipedia pages aren't a bad place to start reading.

https://en.wikipedia.org/wiki/Statistical_classification

https://en.wikipedia.org/wiki/Cluster_analysis

syllogism over 7 years ago

https://github.com/explosion/spaCy/blob/develop/examples/training/train_textcat.py

You'll need to `pip install spacy-nightly`, since it uses spaCy 2 alpha. Until spaCy 2 stabilises it'll be a bit unstable, and unfortunately it's not in the docs yet.

You'll likely get better results by running `spacy download en_vectors_web_lg` and doing `nlp = spacy.load('en_vectors_web_lg')`. This will download word vectors, trained on an enormous web dump by Stanford NLP using their GloVe algorithm.

Once the model is trained you can do `nlp.to_disk(output_directory)`, and then run `spacy package <model directory> <package directory>`. This will set up the model data as a Python package, so that you can run `setup.py sdist`. You'll then get a self-contained Python package that exposes a `load()` function, to give you back the `nlp` object. (Note that if you do base the model on the GloVe vectors the package will be enormous, like 1GB. Shrug?)

If you're starting one step back and don't have the data annotated yet, you might be interested in our annotation tool Prodigy. There's a demo video of the text classification workflow here: https://www.youtube.com/watch?time_continue=638&v=5di0KlKl0fE

chasedehan over 7 years ago

JS and PHP won't get you anywhere - you will need to use something else like R or Python to start looking at it.

Check out this course: https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words

This will help you figure out how to convert those words into variables which can be used for modeling.
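To make "convert those words into variables" a bit more concrete, here is a minimal bag-of-words sketch using only the Python standard library. The function names and the three example tweets are made up for illustration, and whitespace tokenization is a simplifying assumption; real tweets need more careful handling.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (a deliberately naive tokenizer).
    return text.lower().split()


def build_vocabulary(documents):
    # Map each distinct token to a column index, sorted for reproducibility.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    # Count how often each vocabulary word occurs; unknown words are ignored.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Made-up example tweets.
tweets = ["great game tonight", "great new phone", "game game game"]
vocab = build_vocabulary(tweets)
matrix = [bag_of_words(t, vocab) for t in tweets]
```

Each row of `matrix` is one tweet and each column is one word, which is the shape most classifiers expect as input.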

itamarst over 7 years ago

arrmn over 7 years ago

I'm currently working on something similar. These were my first two ideas:

First, train your own word2vec model on a Twitter dataset, then use the tf-idf-weighted average of the word vectors in each tweet. You get one vector per tweet, and tweets about the same topic should end up close to each other. Then try clustering algorithms; you can use the cosine distance to find the nearest X tweets.

The second idea would be to train doc2vec on Twitter data.

Another worthwhile idea could be to use LDA, though I haven't tried it myself.
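A rough sketch of the first idea, in plain Python. The 2-d `WORD_VECTORS` table is a hypothetical stand-in for a trained word2vec model, and the smoothed idf formula is one common variant picked here as an assumption, not something the comment specifies.

```python
import math
from collections import Counter

# Hypothetical 2-d word vectors; a real setup would load these from a
# word2vec model trained on tweets.
WORD_VECTORS = {
    "goal": [1.0, 0.0],
    "match": [0.9, 0.1],
    "phone": [0.0, 1.0],
    "battery": [0.1, 0.9],
    "the": [0.5, 0.5],
}


def idf_weights(tokenized_docs):
    # Smoothed inverse document frequency: rarer words get larger weights.
    n = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tweet_vector(tokens, idf):
    # Weighted average of word vectors. Each occurrence contributes its idf,
    # so repeated words supply the term-frequency part of tf-idf.
    dim = len(next(iter(WORD_VECTORS.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for tok in tokens:
        if tok in WORD_VECTORS and tok in idf:
            w = idf[tok]
            total = [t + w * v for t, v in zip(total, WORD_VECTORS[tok])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Made-up, already tokenized tweets.
docs = [["the", "goal"], ["the", "match"], ["the", "phone", "battery"]]
idf = idf_weights(docs)
vectors = [tweet_vector(d, idf) for d in docs]
```

With these toy vectors, the two sports-like tweets score a higher `cosine_similarity` with each other than either does with the phone tweet; cosine distance is just one minus that similarity.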

byoung2 over 7 years ago

You could look at naive Bayesian classifiers or logistic regression classifiers. There are libraries for both in most languages and they are suited for your application.
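For a feel of what the naive Bayes option does under the hood, here is a small multinomial naive Bayes sketch with add-one (Laplace) smoothing. The `NaiveBayes` class and the four training examples are invented for illustration; in practice you would probably reach for one of the libraries mentioned rather than roll your own, and logistic regression is not covered here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter(labels)      # documents per class
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        # Words never seen in training are ignored.
        tokens = [t for t in document.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus summed log likelihoods of the tokens.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data with hypothetical topic labels.
train = [
    ("what a goal in the match", "sports"),
    ("the team won the match", "sports"),
    ("new phone has great battery", "tech"),
    ("the phone update is out", "tech"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
```

Calling `model.predict("great match for the team")` picks the class whose word counts best explain the tweet, working in log space to avoid underflow on longer texts.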

mmikeff over 7 years ago

I'm lazy and would start with TextRazor or AlchemyAPI.

gerenuk over 7 years ago

For topic modeling, take a look at gensim along with k-means. Also, you can use tf-idf to improve the accuracy.
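As a rough illustration of combining tf-idf with k-means, here is a standard-library sketch. It does not use gensim; the idf formula, the hand-picked initial centroids (real k-means usually starts from random or k-means++ initialization), and the four toy documents are all assumptions made to keep the example small and deterministic.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    # One L2-normalized tf-idf vector per document, columns in sorted order.
    vocab = sorted({t for d in tokenized_docs for t in d})
    n = len(tokenized_docs)
    df = Counter(t for d in tokenized_docs for t in set(d))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Made-up, already tokenized documents.
docs = [
    "goal match team".split(),
    "match team win".split(),
    "phone battery screen".split(),
    "phone screen update".split(),
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
```

Because the vectors are length-normalized, Euclidean distance here orders points the same way cosine distance would, which is a common reason to normalize before running k-means on text.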

kk58 over 7 years ago

For tweets, naive Bayes with a bag-of-words (BoW) approach works very well. You need to do a ton of preprocessing. Think text transformation.
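The comment doesn't say which transformations it has in mind, so the following is only a guess at a typical first pass: lowercasing, dropping URLs and @mentions, keeping hashtag words without the `#`, stripping punctuation, and removing a (hypothetical, deliberately tiny) stop-word list.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "rt"}


def preprocess(tweet):
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = NON_WORD_RE.sub(" ", text)  # strip punctuation and emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = preprocess("RT @nasa: The #Mars rover is AMAZING!!! https://t.co/abc123")
# tokens == ["mars", "rover", "amazing"]
```

The resulting token lists can be fed straight into a bag-of-words step. Whether to keep mentions, emoji, or stop words depends on the topics you care about, so treat each rule as a knob rather than a fixed recipe.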

Ask HN: Machine learning to classify tweets into topics – Where to start?
