
Ask HN: Sentiment Analysis – how to handle biased word list lengths?

51 points by markovbling about 10 years ago

I've tried posting this on Stack Exchange but no luck, so figured I might have more luck here:

I'm implementing a simple sentiment analysis algorithm where the authors of the paper have word lists for positive and negative words, count the number of occurrences of each in the analysed document, and score the document with:

sentiment = (#positive_matches - #negative_matches) / (document_word_count)

This normalises the sentiment score by document length, BUT the corpus of negative words is 6 times larger than the positive word corpus (around 300 positive words and 1800 negative words), so by the measure above the sentiment score will likely be negatively biased, since there are more negative words to match than positive words.

How can I correct for the imbalance in the length of the positive vs. negative corpora?

When I calculate the above sentiment score, I get around 70% of my 2000-document set with negative sentiment scores, BUT there is no a priori reason that my document set should be biased towards the negative, and I would expect the true 'unobserved' sentiment of the documents to be approximately symmetrical, with around half the documents positive and half negative.

I need to come up with a methodology that yields representative sentiment scores, removing the bias introduced by the asymmetrical word lists.

Any thoughts / ideas much appreciated :)
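For concreteness, the scoring rule described above can be sketched in Python; the tiny word lists here are hypothetical stand-ins for the paper's lexicons:

```python
# Lexicon-based sentiment score: (pos matches - neg matches) / total words.
POSITIVE = {"good", "great", "excellent"}   # hypothetical stand-in lists
NEGATIVE = {"bad", "poor", "terrible"}

def sentiment(document: str) -> float:
    words = document.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```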

13 comments

PaulHoule about 10 years ago

(1) Sentiment analysis is the one area where bag of words really goes to die; there is a limit to how good the results you get will be, and it won't be good.

(2) The right way to do this is to train a probability estimator on your scores: put +/- labels on some of your documents, then apply logistic regression.

http://en.wikipedia.org/wiki/Logistic_regression

A lot of machine learning people think this is harder than it is and worry more about regularization, overfitting and such, but when turning a score into a probability estimator you are (a) fitting a small number of variables, and (b) if you have a lot of data and make a histogram, you will ALWAYS get a logistic curve for any reasonable score; I think it has something to do with the central limit theorem.

This seems to be one of the best kept secrets in machine learning. I used to be the bagman who supplied data to people at the Cornell CS department, and we ran into a problem where there was an imbalance in the positive and negative set. In that case the 0 threshold for the SVM is not in the right place, because it gets the wrong idea about the prior distribution, and T. Joachims told us to do the logistic regression trick.

Also, if you read the papers about IBM Watson, they tried just about everything to fit probability estimators and wound up concluding that logistic regression "just works" almost all the time.
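A minimal sketch of the calibration step described here: a one-variable logistic regression fit to raw scores with plain gradient descent (pure Python; the sample data and the learning-rate/epoch settings are invented for illustration):

```python
import math
import random

def fit_logistic_1d(scores, labels, lr=0.5, epochs=2000):
    """Fit p(y=1 | s) = sigmoid(a*s + b) by gradient descent on log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s
            gb += p - y
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

# Toy data: raw lexicon scores skew negative, yet half the docs are truly positive.
rng = random.Random(0)
scores = [rng.gauss(-0.2, 0.1) for _ in range(100)] + \
         [rng.gauss(0.0, 0.1) for _ in range(100)]
labels = [0] * 100 + [1] * 100
a, b = fit_logistic_1d(scores, labels)
# The calibrated decision boundary p = 0.5 sits at s = -b/a,
# not at the naive raw-score threshold of 0.
boundary = -b / a
```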
wiresurfer about 10 years ago

I see you have mentioned TF-IDF as something you are planning to try. That should be interesting.

The way I see it (and I may very well be slightly off point), you have a corpus of 2000 docs and 2 lists -> [Wpos] & [Wneg], with count[Wneg] a factor larger than count[Wpos].

If you compute a [0-1] normalized TF-IDF score for each term in the sets [Wpos] & [Wneg] and sum them up for all words in each of those two sets, you get a score proportional to the count of positive and negative words. Normalized here means using relative frequencies rather than absolute frequencies [I prefer calling the latter term counts]. This puts document_word_count-based normalization out of the picture and makes it implicit in the TF-IDF step.

Now you have two numbers, Sum(positive normalized TF-IDFs) and Sum(negative normalized TF-IDFs), which you can individually normalize for your list sizes, and then use the two scores for sentiment classification. A dirty hack, and somewhat inefficient if you don't maintain a reverse index.

A second approach: use your word lists, both positive and negative, to do Okapi BM25 scoring against your docs, using each list as the query set. You get a BM25 score for each doc and can use that to define sentiment:

Corpus = D; Di = document in the corpus you want to classify; Query1 = {set of positive words}; Query2 = {set of negative words}

PositiveScore = BM25(Query1, Di)
NegativeScore = BM25(Query2, Di)

Then some combination to do classification: if PositiveScore > NegativeScore, call it positive!

Just a thought. BM25 has some flexibility in tuning it for length normalization. Check the footnote.

PS: There is the British National Corpus too for word frequencies :)

[1] BM25 and normalizations: http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
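The BM25 approach can be sketched as follows: a from-scratch Okapi BM25 scorer applied to a document with each word list as the query (the toy corpus and word lists are invented):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against a set of query terms."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0:
            continue  # term never occurs in the corpus: no contribution
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score

corpus = [d.split() for d in [
    "great phone great battery",
    "terrible screen awful battery",
    "the box was brown",
]]
pos_list = {"great", "awesome"}
neg_list = {"terrible", "awful"}
doc = corpus[0]
label = ("positive" if bm25_score(pos_list, doc, corpus) >
         bm25_score(neg_list, doc, corpus) else "negative")
```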
alexbecker about 10 years ago

You could use the word frequency lists at http://www.wordfrequency.info/ to normalize: e.g. add up the background frequencies of the positive words and of the negative words, and divide each list's number of matches by its total frequency.
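A sketch of this normalization, with made-up per-million frequencies standing in for the wordfrequency.info data:

```python
# Hypothetical per-million background frequencies standing in for
# the wordfrequency.info lists.
POS_FREQ = {"good": 900, "great": 600, "happy": 300}
NEG_FREQ = {"bad": 400, "awful": 100, "sad": 200, "poor": 300, "worse": 200, "ugly": 100}

def normalized_sentiment(words):
    """Divide each side's match count by that list's total background frequency."""
    pos = sum(w in POS_FREQ for w in words)
    neg = sum(w in NEG_FREQ for w in words)
    return pos / sum(POS_FREQ.values()) - neg / sum(NEG_FREQ.values())
```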
QuantumRoar about 10 years ago

If I understood correctly, you are trying to get a sentiment that is always correct for single sentences but that can extrapolate word frequencies when words don't appear in your list. I.e. for large neutral documents you want it to be neutral, although your negative words statistically match more often.

My intuition tells me that you can't really do both: either use only those words in your dictionary and get the behaviour right for single sentences, or extrapolate as if both sets were the same size (a weighted average would be the easiest). By extrapolating you assume that for each positive match you get, you'll miss other positive matches. That means you generally underestimate positive matches compared to negative matches. This only works on large datasets.

But really, how bad is it to get a sentiment of -0.17 for a single sentence? It tells you that it was a negative sentence but that there is a high chance a positive word in there was missed, which is exactly what you need to model to get neutral sentiment for large neutral documents.
wrath about 10 years ago

1. You can try using bi-grams or even tri-grams to make your word list a little more precise.

2. Create a validation set by manually labelling each review as positive or negative. Each time you modify your algorithm, run it through your validation set and note the results in a spreadsheet. If you don't do that, you'll never know if and how you've improved the results. The bigger the validation set, the better. Similarly, you can use part of your validation set as a training set for a classifier.

3. Find a scale that works to bias your score. For example, I would try to bias your negative score using a log scale: the fewer negative words you have, the more they are worth; the more you have, the less they are worth.
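Points 1 and 2 can be sketched as follows; the classifier and the hand-labelled set are toy examples:

```python
def bigrams(words):
    """Adjacent word pairs, e.g. to catch negations like ('not', 'bad')."""
    return list(zip(words, words[1:]))

def accuracy(predict, validation):
    """validation: list of (document, gold_label) pairs."""
    return sum(predict(doc) == gold for doc, gold in validation) / len(validation)

# Toy lexicon classifier plus a tiny hand-labelled validation set.
POS, NEG = {"good"}, {"bad"}

def predict(doc):
    words = doc.split()
    return "pos" if sum(w in POS for w in words) >= sum(w in NEG for w in words) else "neg"

validation = [("good movie", "pos"), ("bad plot", "neg"), ("good but bad", "pos")]
acc = accuracy(predict, validation)
```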
dhammack about 10 years ago

If you think the true sentiment is symmetric, you can just change the decision threshold so that your algorithm answers positively about half the time: say positive when the sentiment is greater than the mean sentiment over your training set.
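A sketch of this recentring trick, with invented scores:

```python
def threshold_labels(scores):
    """Call a document positive when its score exceeds the corpus mean."""
    mean = sum(scores) / len(scores)
    return [s > mean for s in scores]

# Raw scores skew negative, but roughly half still land above the mean.
labels = threshold_labels([-0.5, -0.4, -0.3, -0.1, 0.0, 0.1])
```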
barneso about 10 years ago

Two simple things you could do:

1. Insert each negative example six times into your training set (or weight negative examples accordingly, i.e. use (#positive_matches - 6 * #negative_matches) / (2 * positive word count) as your score).

2. Take your distribution of sentiment scores as calculated over held-out data (or the training set itself, but be warned that this will skew your results), and calculate the mean and standard deviation. Normalize your results by subtracting the mean and dividing by the standard deviation. You can then say that positive sentiment is > 0 and negative sentiment < 0, with the absolute value being the strength of the classification.
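Suggestion 2 is a standard z-score normalization, which can be sketched as (the raw scores are invented):

```python
import statistics

def zscores(scores):
    """Centre and scale scores so 'positive' simply means z > 0."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

raw = [-0.6, -0.4, -0.2, -0.1, 0.0]   # negatively skewed raw sentiment scores
z = zscores(raw)
```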
bulte-rs about 10 years ago

Only using match counts is IMO a bit simplistic (don't get me wrong, simplistic can be good). Do you have any information like "how many times does this negatively annotated word occur in a document"? Then you can use a simple calculation (like cosine similarity) to compute a measure of matching for each case.

Also, consider using bigrams (i.e. word pairs) for sentiment matching, which will make matching more precise.
peterhi about 10 years ago

Word counting can be pretty ropey, but there are some things you should check. Of the 1800 negative words, how many actually occurred in the documents?

Or you could simply count negative words as 0.44 rather than 1 (800/1800, if those numbers are correct).

Is "not" a negative word? This might be causing problems with things like "This cake is not bad", which has positive sentiment even though it contains 2 negative words.
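Downweighting negative matches by the list-size ratio can be sketched as follows; this uses the 300/1800 sizes from the question (a weight of about 0.167) rather than the 0.44 above:

```python
# Weight each negative match by the list-size ratio so a random match
# contributes the same expected mass to either side. List sizes are
# taken from the question: 300 positive words, 1800 negative words.
POS_N, NEG_N = 300, 1800
NEG_WEIGHT = POS_N / NEG_N   # 1/6

def weighted_sentiment(pos_matches, neg_matches, doc_len):
    return (pos_matches - NEG_WEIGHT * neg_matches) / doc_len
```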
ai_maker about 10 years ago

Do you have the gold-standard labels for your dataset? Can you ensure that the number of pos/neg labels is symmetrical?

You can heuristically tune the weights of your lexicon to fit your intuition, but evidence is necessary to progress adequately.

In case you find an unbalanced number of examples, apply an imbalance-aware effectiveness score like the F-measure to obtain a fair assessment of your system's performance.
var_eps about 10 years ago

Assuming there is no inherent bias in terms of sentiment and vocabulary, one approach would be to repeatedly sample 300 negative words at random from the corpus and generate a vector of sentiment scores. You could then average the elements of the vector to get an average sentiment, or use another metric from basic stats. That could decrease the bias.
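A sketch of this subsampling idea, with toy word lists in place of the real lexicons:

```python
import random

def subsampled_sentiment(words, pos_list, neg_list, trials=200, seed=0):
    """Average the score over random negative sub-lists of size |pos_list|."""
    rng = random.Random(seed)
    pos = sum(w in pos_list for w in words)
    total = 0.0
    for _ in range(trials):
        # Draw a negative sub-list the same size as the positive list.
        neg_sub = set(rng.sample(sorted(neg_list), len(pos_list)))
        neg = sum(w in neg_sub for w in words)
        total += (pos - neg) / len(words)
    return total / trials

pos_list = {"good", "great"}
neg_list = {"bad", "awful", "poor", "sad", "worse", "ugly"}  # 3x larger
score = subsampled_sentiment("good and bad and poor".split(), pos_list, neg_list)
```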
SergeyHack about 10 years ago

You can try to find a dataset that contains an equal number of positive and negative documents (sentences, etc.) and use it as the validation set, i.e. to tune your hyperparameters on it.

In the simplest case your hyperparameter can be α in:

sentiment = (α * #positive_matches - #negative_matches) / (document_word_count)
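Tuning α on a balanced validation set can be sketched as follows; the validation tuples and the candidate grid are invented:

```python
def sentiment(pos, neg, n, alpha):
    return (alpha * pos - neg) / n

def tune_alpha(validation, candidates):
    """Pick the alpha whose score sign agrees with gold labels most often.
    validation: (pos_matches, neg_matches, doc_len, gold) tuples, gold = +/-1."""
    def hits(a):
        return sum((sentiment(p, q, n, a) > 0) == (g > 0) for p, q, n, g in validation)
    return max(candidates, key=hits)

# Invented validation tuples: positive docs whose raw counts are
# swamped by matches against the larger negative list.
validation = [(1, 3, 50, +1), (0, 2, 40, -1), (2, 5, 60, +1)]
best_alpha = tune_alpha(validation, [1, 2, 4, 8])
```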
amelius about 10 years ago

Out of interest, did you define some kind of measure by which you can test how well the chosen method performs?

(There are a lot of suggestions here, so it would be nice if you could at least choose the "best" one.)