I enjoyed reading this chart but I hope it doesn't reinforce the bias that some fans have that word complexity is the only way to tell if a rapper is good or not. There are several ways to judge the strengths and weaknesses of a rapper. Complexity is one of them, flow is another. Storytelling ability is another very strong indicator. The best rappers are able to bring a mix, while some are so strong in one area that they explode even if they are really weak in others.
This is fascinating. I'm only a recent listener of hip-hop (primarily because of Earl Sweatshirt and Odd Future) and I'm in awe of the vernacular.<p>And similarly, as a boredom exercise a few weeks ago I did some lexical analysis of the song Timber (the monstrosity was being constantly played on the radio at the time) and here's what I came out with:<p>"83.1% of the words in the lyrics are five letters or less, 58.9% are four letters or less. The lexical density (the number of unique words divided by the total number of words, multiplied by one-hundred) is 29.1%. There is only one word in the song which has three or more syllables. Eleven people were involved with the writing of the song, each of them capable of producing just nine unique words each."
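For anyone who wants to reproduce numbers like the ones above on other lyrics, here is a minimal sketch. It assumes a "word" is any run of letters or apostrophes (so an apostrophe counts toward word length), and the function name `lyric_stats` and the sample line are made up for illustration. It does not attempt syllable counting.

```python
import re


def lyric_stats(lyrics: str) -> dict:
    """Rough word-length and lexical-density figures for a block of lyrics."""
    # Assumption: a word is a run of letters/apostrophes, compared case-insensitively.
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    if total == 0:
        raise ValueError("no words found")
    unique = len(set(words))
    return {
        "total_words": total,
        "unique_words": unique,
        "pct_five_letters_or_less": 100 * sum(len(w) <= 5 for w in words) / total,
        "pct_four_letters_or_less": 100 * sum(len(w) <= 4 for w in words) / total,
        # Lexical density as defined in the comment: unique / total * 100.
        "lexical_density": 100 * unique / total,
    }


# Made-up sample line, not real lyrics.
sample = "go go go down down now we go up up up again"
stats = lyric_stats(sample)
print(stats)
```

Different tokenization choices (contractions, hyphenated words, ad-libs) will move these percentages a bit, so treat the output as a ballpark figure.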
Looked for Canibus near the top and wasn't surprised to find him 4th. If anyone hasn't heard of him, highly suggest listening to his older stuff such as his first Can-I-Bus, 2000 BC and Mic Club.<p>He raps about science and space all the time which is cool.<p>Here's an example of his ridiculous lyrics: <a href="http://rapgenius.com/Canibus-poet-laureate-infinity-lyrics" rel="nofollow">http://rapgenius.com/Canibus-poet-laureate-infinity-lyrics</a>
Many here seem to be interpreting vocabulary size as a signal for quality. When it comes to rap I completely disagree. For one, repetition is rap's main ingredient. I read an article a while ago where researchers found that listening to a spoken phrase that is looped activates the same part of the brain as music, which helps explain this phenomenon.<p>Personally, if I want food for thought I read. Rap is not an intellectual pursuit. I've been perusing rappers on this list, and the top artists have not been good at all to my ears. It seems that the best rappers are in the middle, and being on either extreme is a negative signal.
> Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words<p>Why does that suggest he knew over 100k words? Maybe it means he knew 28,829 and used all of them? Would he really know over 70,000 words he never used in his works? What would those 70,000 words be? Probably very obscure ones. How can you know that many obscure ones?
It's a nice touch including portmanteaus and 'incorrect' ebonics on the list (like "ery'day"), since authors like Shakespeare, Joyce and others took the same liberties with language. Arguably, that's how language develops, and it's what makes it interesting to study and think about. The OP could have easily stuck to words in the OED, so kudos.
Really interesting, but not as representative as it should be. It's not clear why some have a larger vocabulary than others. It could come from using words like "zeitgeist" (in the case of Aesop Rock), or some clever wordplay (I don't know much about hip-hop, so I can't find an example for an artist on the list right off the bat, but I remember Marilyn Manson using the word "gloominati", for instance), or pretty meaningless made-up words like "schizzle" (in the case of Snoop Dogg), or the usual derivatives like "fuckedy fuck". Moreover, in many hip-hop transcripts people write words down as they are pronounced, which can be pretty distorted for some artists (ideally that shouldn't count as a "new word", but that's complicated, yeah).<p>While Aesop Rock and DMX are clearly extremes and not surprising at all, it's not that clear for some guys in the middle.<p>So, first off, for every data project the sources should be provided, or at least a more specific description of how the text was processed, tokenized, and analyzed. Second, several more "data slices" should be provided, for instance <i>the 100 most used words which are unique to that artist compared to the other artists in the list</i>.
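The "words unique to one artist" slice suggested above is easy to prototype. The sketch below is only an illustration: the tokenizer is a naive regex (an assumption, not how the original chart was built), `distinctive_words` is a hypothetical helper name, and the toy corpus is invented.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive assumption: words are runs of letters/apostrophes, lowercased.
    return re.findall(r"[a-z']+", text.lower())


def distinctive_words(lyrics_by_artist, top_n=100):
    """For each artist, the most used words that no other artist in the set uses."""
    counts = {artist: Counter(tokenize(text)) for artist, text in lyrics_by_artist.items()}
    result = {}
    for artist, own in counts.items():
        # Collect every word used by any other artist.
        others = set()
        for other, other_counts in counts.items():
            if other != artist:
                others.update(other_counts)
        exclusive = Counter({w: n for w, n in own.items() if w not in others})
        result[artist] = exclusive.most_common(top_n)
    return result


# Invented toy corpus, just to show the shape of the output.
toy = {
    "artist_a": "zeitgeist zeitgeist the the city",
    "artist_b": "the the the street city",
}
out = distinctive_words(toy, top_n=2)
print(out)
```

With phonetic transcriptions in play, spelling variants of the same word would show up as "unique" here, which is exactly the distortion described above.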
Maybe this is just me, but it's a little unfair to compare to literary <i>texts</i>.<p>Humour me for a moment.<p>When an artist writes a song, he (or she) has constraints. Most rappers would like to rhyme the ends of their sentences. I know sometimes they don't (like poetry), but it's certainly pleasing to the ear to have that constraint. Artists endeavour to make their songs catchy, that's highly correlated with the gross sales of the product.<p>When an artist writes a novel, this constraint is not weighted quite as highly. I know Shakespeare wrote poetry, too, and to call me out on this comparison is entirely fair. That said, there's also an argument to be made for eye rhymes. Shakespeare used these a lot. Eye rhymes are words that don't rhyme aurally, but <i>do</i> rhyme visually. It's the story that pleases the reader, not necessarily its aural 'catchiness'. I probably made that word up. But Shakespeare made words up too. The point is, you knew what I meant.<p>At the end of the day these comparisons, while certainly <i>interesting</i>, should be taken with a pinch of salt. While I'm at it, this advice can easily be extrapolated to any dataset. Always understand there may be unknown correlations.
This looks at the first so many lyrics in each rapper's career. Aesop Rock came out with some weird stuff right off the bat. I wonder if some of these other rappers became more sophisticated over time. Maybe an average per song, or average uniques per word, would be better.
For those who aren't familiar with Aesop Rock, I'd invite you to give him a listen sometime. His earlier albums, in particular, have been very influential to me in many ways. Both in my artistic and professional careers.<p>From comments on the conditions of the working man and the condition of feeling trapped in a "j-o-b"[1]:<p><pre><code> "Now we the American working population
Hate the fact that eight hours a day
Is wasted on chasing the dream of someone that isn't us
And we may not hate our jobs
But we hate jobs in general
That don't have to do with fighting our own causes
We the American working population
Hate the nine-to-five day-in day-out
When we'd rather be supporting ourselves
By being paid to perfect the pastimes
That we have harbored based solely on the fact
That it makes us smile if it sounds dope"
</code></pre>
To storytelling masterpieces regarding living and dreaming[2]:<p><pre><code> "Look, I've never had a dream in my life
Because a dream is what you wanna do, but still haven't pursued
I knew what I wanted and did it till it was done
So I've been the dream that I wanted to be since day one!"
</code></pre>
Aesop Rock takes language and linguistics to entirely different levels than one might expect from the single genre that is hip-hop. He even challenges himself and the listeners, playing fantastic word games, for instance re-using the letters L, S, and D in odd and rhythmical ways after a mention[3]:<p><pre><code> "Lazy summer days
Like some decrepit landshark dumb luck squad dog lurks sicker deluded
Last sturdy domino lean's secluded
Don't let stupid delusions lesson super-duty labor students
Dragnet lifer solutions
Daddy loved sloppy dimensions like son-daughter links
Such determinated lepers, successfully disheveled
Little soldiers developed like serpents despite life sentence ducking
Lemmings
Some don't like sobriety's dirty lenses
Some do"
</code></pre>
And then there are just incredible gems that stick with you like[4]:<p><pre><code> "I don't flick needles like my sick friend
I don't march like Beetle Bailey through a quick trend
I don't frequent church's steeples on my weekend
And I don't comment if you formulate a weak Zen"
</code></pre>
There's a lot to explore from Aesop Rock. Should you find this type of hip-hop interesting, a decent place to start is with the label you can find these songs on, Definitive Jux[5]. Incredible talent has been on and off that label over the years. So much good stuff.<p>[1] - "9-5ers Anthem" - <a href="http://rapgenius.com/Aesop-rock-9-5ers-anthem-lyrics" rel="nofollow">http://rapgenius.com/Aesop-rock-9-5ers-anthem-lyrics</a><p>[2] - "No Regrets" - <a href="http://rapgenius.com/Aesop-rock-no-regrets-lyrics" rel="nofollow">http://rapgenius.com/Aesop-rock-no-regrets-lyrics</a><p>[3] - "The Greatest Pac-Man Victory in History" - <a href="http://rapgenius.com/Aesop-rock-the-greatest-pac-man-victory-in-history-lyrics" rel="nofollow">http://rapgenius.com/Aesop-rock-the-greatest-pac-man-victory...</a><p>[4] - "Save Yourself" - <a href="http://rapgenius.com/Aesop-rock-save-yourself-lyrics" rel="nofollow">http://rapgenius.com/Aesop-rock-save-yourself-lyrics</a><p>[5] - <a href="http://en.wikipedia.org/wiki/Definitive_Jux" rel="nofollow">http://en.wikipedia.org/wiki/Definitive_Jux</a>
OP: Did your analysis of MF DOOM include his work alongside Madlib as Madvillain or his various other pseudonyms (King Geedorah, Viktor Vaughn, etc.)?<p>I find it a little hard to believe he's not at least in the Wu Tang/Canibus/KK cluster, if not #1 overall.
Makes me very happy to see Aesop Rock in the #1 spot. He isn't as underground as many people assume: still relatively unknown in the mainstream, but well known enough to sell records and sell out shows. I wasn't a big fan of his 2012 release Skelethon, but the way he structures his lyrics and the meaning behind them means he never writes a bad lyric.<p>Interestingly, Eminem, who I would have thought would rank pretty highly for his clever word bending and enunciation, is only in the middle of the scale. Still a whole lot better than some of his counterparts, but still surprising. Another interesting thing to note is Eminem being grouped in the same league as the likes of Jay-Z, Rakim and Lupe Fiasco, with only a couple of hundred unique words separating them from one another.
I find it hilarious that DMX is dead last.<p>I've now got empirical evidence of what I always thought.<p>I think DMX rhymes words with themselves more than any rapper I've ever heard.
This is a great graph, but I think it would be neat if a y-axis was thrown in. My first thought was album sales or some other metric of popularity that would help you find specific rappers quickly instead of going through the huge bunch of little pics.
This reminds me of a PyCon talk from this year on analyzing rap lyrics with some basic NLP techniques:<p><a href="http://pyvideo.org/video/2658/analyzing-rap-lyrics-with-python" rel="nofollow">http://pyvideo.org/video/2658/analyzing-rap-lyrics-with-pyth...</a><p>The author was trying to see whether rappers could be considered more hateful towards women based on their usage of "bitch" per song. The results are quite interesting.
This infographic doesn't take into account other rappers possibly copying earlier really influential artists, making the earlier influential artists rank lower. More generally, it would be cool to see this chart ranked by the amount of original words present in the first 35,000 lyrics <i>that were not present yet at the albums' time of publication</i>.
To put some perspective on this:
ryan@3G08:~/Desktop/bleh$ pdftotext David-Foster-Wallace-Infinite-Jest-v2.0.pdf
ryan@3G08:~/Desktop/bleh$ python dfw.py
size of vocabulary: 30725<p>The man passed Shakespeare by 1,896 words with that book.<p>code:<p><pre><code> import string
import nltk  # word_tokenize needs NLTK's punkt tokenizer data installed
from nltk.stem import PorterStemmer

# Read the plain text produced by pdftotext (assumes its default UTF-8 output)
path = "/home/ryan/Desktop/bleh/David-Foster-Wallace-Infinite-Jest-v2.0.txt"
with open(path, encoding="utf-8") as f:
    raw = f.read()
# Strip ASCII punctuation and lowercase everything
exclude = set(string.punctuation)
raw = ''.join(ch for ch in raw if ch not in exclude).lower()
tokens = nltk.word_tokenize(raw)
# Count distinct stems rather than distinct surface forms
stemmer = PorterStemmer()
stemmed_tokens = {stemmer.stem(token) for token in tokens}
print("size of vocabulary:", len(stemmed_tokens))</code></pre>
I've been wanting to do some NLP on rap genius's corpus for ages. This is a great analysis. What I had thought of is write a program to detect ghostwriting. Rappers probably have some sort of lyrical 'DNA' in the construction of their verses. How often they use certain words, number of words per line, number of unique words per song, ratio of adjectives to nouns, that kind of thing. You could probably unmask some ghost-writing secrets.<p>Looking at the analysis here, it's interesting to see some clustering in the results. IMO the second cluster is the sweet spot: Wu Tang's excessive invention of vocabulary is cool but probably detracts from the poetic effect. Meanwhile rappers like 2Pac are just kind of boring IMO, at least going by their lyrics alone.
I'm a big fan of the project and the way it is presented. Not sure why Wu-Tang features so prominently but I guess I'm okay with that. Kool Keith should be broken down further into his constituent parts. I also would have thought the Beastie Boys would have run higher.
I would have been rather surprised not to see Aesop Rock fairly high up the list. I was reading the Rap Genius pages for a few of his tracks the other week and the sheer density of wordplay was fairly overwhelming.<p>It is rap for geeks though ;)
Greatly enjoyed the analysis but while I was reading it I felt a lot like this guy:<p><a href="https://www.youtube.com/watch?v=GKlDBi0cyIA" rel="nofollow">https://www.youtube.com/watch?v=GKlDBi0cyIA</a>
All the rappers listed seem to be American.<p>Whack this through your Bowers and Wilkins:<p><a href="https://www.youtube.com/watch?v=p_SQEUZomug" rel="nofollow">https://www.youtube.com/watch?v=p_SQEUZomug</a>
I think the only problem I see is that some rap groups are listed as rappers. For instance Beastie Boys, De La Soul and Wu-Tang are listed, so there is some collective vocabulary being compared to single rappers. That said, this is cool and pretty telling. From what I could see it is probably loosely coupled to the intelligence of the rappers listed. I will echo the sentiments about DMX here. Looks like some shock jock rappers definitely are low on the list (Too Short).
This is an interesting analysis.<p>I love the fact that E-40 is about on par with Shakespeare. I'm sure he would take it as a compliment to be called the modern day Shakespeare.
I keep getting this error, in Firefox and Chrome:<p><Error> <Code>AccessDenied</Code> <Message>Access Denied</Message> <RequestId>3CB1F41D7DFDC794</RequestId> <HostId> wHCPzEYPDsmkMJX+YIgjU40YPrGYytHrk5B44dApi7663NkQQI0RKx9A/6EX7Iph </HostId> </Error>
How about a 2D visualization with a sliding 10,000-word window, with the y axis as unique words out of 10k and the x axis as time? Are there cultural trends that are time dependent? Did Young MC and Del use more words than contemporary artists? Did their trends as artists follow the global trend over time?
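A rough sketch of the sliding-window idea, in case anyone wants to try it. It assumes you already have one artist's words as a list in release order, so the word offset stands in for time (mapping offsets to actual dates is left out). The function name `sliding_unique_counts` and the step size are my own choices, not anything from the original analysis.

```python
def sliding_unique_counts(words, window=10_000, step=1_000):
    """Return (start_offset, unique_word_count) for each full window of words."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    points = []
    # Only full windows are counted; a trailing partial window is skipped.
    for start in range(0, len(words) - window + 1, step):
        points.append((start, len(set(words[start:start + window]))))
    return points


# Tiny toy input so the result is easy to check by hand.
words = "a b a c d d e a b f".split()
points = sliding_unique_counts(words, window=4, step=2)
print(points)  # [(0, 3), (2, 3), (4, 3), (6, 4)]
```

Plotting the second element against the first (or against the release date of the song containing that offset) gives the per-artist curve; overlaying several artists would show whether they track a shared trend.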
I wonder where things like classic rock / broadway musicals / opera / etc. fit on this spectrum.<p>I really appreciate including Shakespeare and Moby Dick on the spectrum, but I'd still like some more perspective. For that matter, I wonder how many unique words <i>I</i> use every day.
Just a note, those artists don't necessarily use all their vocabulary. Eminem for example clearly holds back on his vocabulary. Rap is as much an art as anything can be, so there are all sorts of factors. Be careful about what conclusions you draw here beyond curiosity.
I would love to see this analysis without filters. Who is <i>the</i> rapper with the largest vocabulary? What does the distribution look like at the top? Surely Antipop Consortium or MF DOOM have larger vocabularies than Aesop for instance.
I'm pretty sure E-40 scored so high because of all the made up words. He's highly regarded for being innovative and influential but you know for every piece of slang that stuck there's like ten that didn't.
Why isn't Jedi Mind Tricks counted? He'd be first on this list: <a href="https://www.youtube.com/watch?v=TlZgiK6FiO0" rel="nofollow">https://www.youtube.com/watch?v=TlZgiK6FiO0</a>
Not particularly surprised at the list. Aesop Rock, the whole Wu-tang Clan, and guys like Nas, Wale, all near the top. DMX and Too Short at the bottom...<p>Definitely comes out in their music...
Incredibly, a list about rapper vocabulary is missing anyone associated with nerdcore.<p>I'm interested to see where the likes of MC Frontalot, Wordburglar, YTCracker, etc. rank on that scale...
I'd really like to see this broken down by established vocabulary and made up vocabulary. I think that would really start to show who were the best lyricists on both ends. Rappers with a lot of made up words might be on the far left, and rappers with a lot of unique words that aren't made up would be on the far right. Both sides of the scale would show rapping talent on different dimensions. Influential rappers like E-40 who add new words to the vocabulary, and wordy rappers like Aes on the right who use a really dense and descriptive vocabulary.
Gotta wonder about the garbage-in factor of Rap Genius. From one randomly selected Aesop Rock cut:<p>"Please I want to donate my brain to the monstrous Panasonic profit"<p>I guess it could be. I always heard it as "monstrous Panasonic prophet." It would be in keeping with the previous lyric "Television, all hail grand pixelated god of fantasy."
We might all be self-confessed <i>hackers</i>, but we'll never explicitly confess our adoration for the gloriousness of the genre that is <i>gangster rap</i>.
The estimate of vocabulary size here is based on the number of unique words used. This seems like it is strongly biased: if two artists have the same size vocabulary, but one has released more albums and thus used more words, that artist will probably have used more unique words. To underscore this point, the number of unique words used by Aesop Rock is half of the estimated vocabulary size of the average college student, although to be fair that estimate is the number of words that an individual can recognize, not the number of words they use. (Edit: the bias is somewhat mitigated by the fact that the same number of words is used to estimate the vocabulary for each artist, but the bias is not dependent on sample size alone but also upon the size of the artist's underlying vocabulary; see my comments below.)<p>The underlying problem is one of estimating the cardinality of a multinomial distribution given a fixed number of samples. In isolation this problem is ill-posed, since it is always possible that there is a word in a given lyricist's vocabulary that he uses with very low frequency and that is unlikely to appear in any sample, but with appropriate prior information it may be possible to obtain an accurate estimate.<p>This is not my field, but a brief Google Scholar search shows that there are several papers on estimating vocabulary size, or equivalently, estimating the number of species based on sampling. There is a somewhat dated review (<a href="http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf" rel="nofollow">http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf</a>) that details some methods of estimation (in this case, I believe we are in the domain of "infinite population, multinomial sample" with unequal class sizes). The paper notes that there is no unbiased estimator available without assumptions on the distribution of word use frequencies, but some of the proposed estimators may be more accurate than the naive estimate used here.
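To make the species-estimation idea concrete, here is a small sketch of one well-known estimator from that literature, Chao1. This is my pick for illustration, not necessarily the one the review would recommend for lyrics. It takes the observed number of distinct words and adds a correction based on how many words were seen exactly once (f1) and exactly twice (f2). It is usually described as a lower bound on the true count, so it still carries the caveats above.

```python
from collections import Counter


def chao1(tokens):
    """Chao1 estimate of the total number of distinct words behind a sample."""
    freq = Counter(tokens)
    s_obs = len(freq)  # distinct words actually observed
    f1 = sum(1 for n in freq.values() if n == 1)  # words seen exactly once
    f2 = sum(1 for n in freq.values() if n == 2)  # words seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form commonly used when no word appears exactly twice.
    return s_obs + f1 * (f1 - 1) / 2


# Toy sample: 6 observed words, 3 singletons, 2 doubletons.
tokens = "a a a b b c c d e f".split()
estimate = chao1(tokens)
print(estimate)  # 8.25
```

The intuition is that many words seen only once suggests plenty more that were never sampled at all, which the naive unique-word count ignores.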