TechEcho

10 comments

chime over 12 years ago

I did something similar with a 10M-Tweet dataset a couple of years ago:

* <a href="http://ktype.net/wiki/research:articles:progress_20110209#twitter_n-gram_results" rel="nofollow">http://ktype.net/wiki/research:articles:progress_20110209#tw...</a>
* <a href="http://ktype.net/wiki/research:articles:progress_20110228?s#letter-pair_frequency_table" rel="nofollow">http://ktype.net/wiki/research:articles:progress_20110228?s#...</a>

I would love to redo this analysis with newer Tweets but, alas, I don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from Norvig's <a href="http://norvig.com/spell-correct.html" rel="nofollow">http://norvig.com/spell-correct.html</a> using the Tweet dataset to build a better one-finger keyboard and word-prediction engine for iOS.

martinpw over 12 years ago

Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time, e.g., which words are becoming more or less popular, whether average word length is decreasing over time, whether the variety of words in common use is increasing or decreasing, etc.
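As a rough illustration of the word-length question, here is a minimal sketch. It assumes the data has already been reduced to hypothetical `(word, year, count)` records; that layout, the function name, and the toy numbers are illustrative assumptions and are not taken from the original dataset.

```python
from collections import defaultdict

def mean_word_length_by_year(records):
    """Return {year: count-weighted mean word length}.

    `records` is an iterable of (word, year, count) tuples.
    This record layout is an assumption made for illustration.
    """
    total_letters = defaultdict(int)
    total_words = defaultdict(int)
    for word, year, count in records:
        # Weight each word's length by how often it occurred that year.
        total_letters[year] += len(word) * count
        total_words[year] += count
    return {year: total_letters[year] / total_words[year]
            for year in sorted(total_words)}

# Made-up toy records, only to show the shape of the computation.
records = [
    ("the", 1900, 3),
    ("letter", 1900, 1),
    ("the", 2000, 2),
    ("frequency", 2000, 2),
]
trend = mean_word_length_by_year(records)
```

Plotting `trend` against the year would show whether the weighted average drifts up or down; the same loop could track the number of distinct words per year to look at vocabulary variety.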

lispython over 12 years ago

I think this might help in designing a new keyboard layout that is better (from a scientific/statistical perspective) than Dvorak (<a href="https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard" rel="nofollow">https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard</a>), which is based on research from more than 80 years ago.

sriramk over 12 years ago

On a tangential note, does anyone know what he's using to generate those bar graphs automatically from those tiny images? Nifty trick.

new-world-order over 12 years ago

Mr Norvig, please share the code. I'm sure it's some interesting Lisp or Python.

jqueryin over 12 years ago

It's not apparent to me what was used when calculating the frequency of bigrams and other n-grams. Was this based on the overall dataset or on the dictionary alone? I believe it would be beneficial to see a version based on the dictionary words alone, since that would ensure no duplicate words exist to affect the n-gram counts, acting as a control group.
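To make the distinction concrete, here is a minimal sketch of the two ways of counting. It assumes a hypothetical `word_freqs` mapping from word to corpus count; the function names and toy numbers are illustrative and say nothing about how the article's counts were actually produced.

```python
from collections import Counter

def bigrams(word):
    """Return the adjacent letter pairs in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def bigram_counts(word_freqs, weighted=True):
    """Count bigrams over a {word: corpus_count} mapping.

    weighted=True  -> each word contributes corpus_count times
                      (counts over the overall dataset).
    weighted=False -> each distinct word contributes once
                      (counts over the dictionary alone).
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        weight = freq if weighted else 1
        for bg in bigrams(word.lower()):
            counts[bg] += weight
    return counts

# Made-up toy frequencies, for illustration only.
word_freqs = {"the": 100, "then": 10, "heat": 5}
by_corpus = bigram_counts(word_freqs, weighted=True)
by_dictionary = bigram_counts(word_freqs, weighted=False)
```

With the toy data, "th" scores 110 when weighted by corpus count but only 2 when each dictionary word counts once, which is the kind of gap a dictionary-only version would expose.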

wangweij over 12 years ago

Many years ago I read an article claiming the order is etoanirsh... I still use it in hangman games.
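For anyone who wants to check an ordering like that against their own text, a minimal sketch follows. The function name and the sample sentence are hypothetical, and the result depends entirely on the text you feed in.

```python
from collections import Counter

def letter_order(text):
    """Return the letters a-z found in `text`, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))

# Any sizeable English text could go here; this sentence is a placeholder.
sample = "the quick brown fox jumps over the lazy dog"
order = letter_order(sample)
```

A short sample will not reproduce a corpus-wide ranking; the ordering only stabilizes with a large body of text.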

jyhipu over 12 years ago

Service Temporarily Unavailable The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. Apache/1.3.42 Server at norvig.com Port 80

nodata over 12 years ago

But "forschungsgemeinschaft" is German, and it's mentioned frequently: at least 100,000 times in the book corpus. I don't trust that his corpus is comparable to the original English corpus.

ghubbard over 12 years ago

How do the new results compare to Mayzner's?

English Letter Frequency Counts: Mayzner Revisited
