I did something similar with a 10M tweet dataset a couple of years ago:<p>* <a href="http://ktype.net/wiki/research:articles:progress_20110209#twitter_n-gram_results" rel="nofollow">http://ktype.net/wiki/research:articles:progress_20110209#tw...</a><p>* <a href="http://ktype.net/wiki/research:articles:progress_20110228?s#letter-pair_frequency_table" rel="nofollow">http://ktype.net/wiki/research:articles:progress_20110228?s#...</a><p>I would love to redo this analysis with newer tweets but, alas, I don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from Norvig's <a href="http://norvig.com/spell-correct.html" rel="nofollow">http://norvig.com/spell-correct.html</a> using the tweet dataset to build a better one-finger keyboard and word-prediction engine for iOS.
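The word-prediction part can be sketched very simply: count which words follow which in the corpus, then suggest the most frequent follower. This is a minimal toy illustration (not the actual KType or Norvig code), assuming the corpus is just a list of tweet strings:

```python
from collections import Counter, defaultdict

def build_model(tweets):
    """Word-bigram model: for each word, count the words that follow it."""
    following = defaultdict(Counter)
    for tweet in tweets:
        words = tweet.lower().split()
        for a, b in zip(words, words[1:]):
            following[a][b] += 1
    return following

def predict(model, word, k=1):
    """Return the k most likely next words after `word`."""
    return [w for w, _ in model[word].most_common(k)]

model = build_model(["good morning all", "good morning sunshine", "good night"])
print(predict(model, "good"))  # ['morning']
```

A real engine would smooth the counts and fall back to unigram frequencies for unseen words, but the frequency table is the core of it.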
Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time: e.g. which words are becoming more or less popular, whether average word length is shrinking, whether the variety of words in common use is increasing or decreasing, etc.
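Given per-year counts, the trend for a word is just its count divided by that year's total. A small sketch, assuming the data is available as (year, word, count) rows (a hypothetical format, not the dataset's actual schema):

```python
from collections import defaultdict

def yearly_frequency(rows, word):
    """Relative frequency of `word` per year from (year, word, count) rows."""
    totals = defaultdict(int)  # total tokens seen in each year
    hits = defaultdict(int)    # occurrences of the target word in each year
    for year, w, n in rows:
        totals[year] += n
        if w == word:
            hits[year] += n
    return {y: hits[y] / totals[y] for y in sorted(totals)}

rows = [(1990, "phone", 10), (1990, "letter", 40),
        (2010, "phone", 60), (2010, "letter", 20)]
print(yearly_frequency(rows, "phone"))  # {1990: 0.2, 2010: 0.75}
```

Normalizing by the year's total matters because corpus size varies wildly by year; raw counts alone would show almost everything "rising".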
I thought this might help in designing a new keyboard layout that is better (from a scientific/statistical perspective) than Dvorak (<a href="https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard" rel="nofollow">https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard</a>), which is based on research from more than 80 years ago.
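One way letter-pair frequencies feed into layout design is as a cost function: e.g. penalize pairs typed with the same finger. A toy sketch, assuming a hypothetical finger assignment and made-up bigram counts (neither comes from the dataset):

```python
# Hypothetical letter-to-finger assignment for a QWERTY-like layout.
FINGER = {}
for finger, keys in enumerate(["qaz", "wsx", "edc", "rfvtgb", "yhnujm", "ik", "ol", "p"]):
    for k in keys:
        FINGER[k] = finger

def same_finger_cost(bigram_freq):
    """Fraction of typed letter pairs that use the same finger (lower is better)."""
    total = sum(bigram_freq.values())
    bad = sum(n for (a, b), n in bigram_freq.items()
              if a in FINGER and FINGER[a] == FINGER.get(b))
    return bad / total

freq = {("t", "h"): 580, ("e", "d"): 120, ("h", "e"): 300}
print(same_finger_cost(freq))  # 0.12 — only "ed" shares a finger here
```

Searching over candidate layouts to minimize a cost like this (plus hand alternation, row jumps, etc.) is essentially how modern data-driven layouts are evaluated.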
It's not apparent to me what was used when calculating the frequency of bigrams and n-grams. Was this based on the overall dataset or on the dictionary alone?<p>I believe it would be useful to see a version based on dictionary words alone, since that would ensure no duplicate words affect the n-grams, acting as a control group.
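The distinction is whether each word contributes once (dictionary) or in proportion to how often it occurs (corpus). A minimal sketch of both, using made-up words and weights:

```python
from collections import Counter

def bigrams(words, weights=None):
    """Letter-bigram counts; weights=None counts each word once (dictionary control)."""
    counts = Counter()
    for i, w in enumerate(words):
        n = 1 if weights is None else weights[i]
        for a, b in zip(w, w[1:]):
            counts[a + b] += n
    return counts

words = ["the", "of", "theme"]
corpus_weights = [1000, 800, 5]        # hypothetical occurrence counts
print(bigrams(words))                  # dictionary: "th" appears in 2 words
print(bigrams(words, corpus_weights))  # corpus: "th" dominated by "the"
```

The two versions can rank bigrams very differently: in the corpus-weighted view a handful of very frequent words ("the", "of", "and") dominate, which is exactly why a dictionary-only control is informative.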
But "forschungsgemeinschaft" is German, yet it is counted as frequent: <i>at least 100,000 times each in the book corpus</i>.<p>I don't trust that his corpus is comparable to the original English corpus.