English Letter Frequency Counts: Mayzner Revisited

180 points by phenylene over 12 years ago

10 comments

chime over 12 years ago
I did something similar with a 10M-tweet dataset a couple of years ago:

* http://ktype.net/wiki/research:articles:progress_20110209#twitter_n-gram_results
* http://ktype.net/wiki/research:articles:progress_20110228?s#letter-pair_frequency_table

I would love to redo this analysis with newer tweets but, alas, I don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from Norvig's http://norvig.com/spell-correct.html using the tweet dataset to build a better one-finger keyboard and word-prediction engine for iOS.
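A minimal sketch of the kind of letter-pair count described above, assuming the corpus is simply a list of short strings. The function name and the sample texts are hypothetical stand-ins, not the tweet dataset or the code behind the linked tables.

```python
from collections import Counter
import re


def letter_pair_counts(texts):
    """Count adjacent letter pairs inside words, ignoring case and non-letters."""
    counts = Counter()
    for text in texts:
        # Keep only runs of ASCII letters so pairs never span spaces or punctuation.
        for word in re.findall(r"[a-z]+", text.lower()):
            counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


# Hypothetical stand-in for a real tweet corpus.
sample = ["The cat sat on the mat", "That hat is theirs"]
pairs = letter_pair_counts(sample)
print(pairs.most_common(3))
```

Restricting pairs to within-word letters is a design choice; counting across spaces would instead measure which letters tend to end and start adjacent words.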
martinpw over 12 years ago
Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time, e.g. which words are becoming more or less popular, whether average word length is decreasing over time, whether the variety of words in common use is increasing or decreasing, etc.
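One of these trends, average word length per year, could be sketched as follows. This assumes the data has already been reduced to (word, year, count) tuples; that record shape, the function name, and the sample rows are illustrative assumptions, and the numbers are made up rather than taken from the corpus.

```python
from collections import defaultdict


def mean_word_length_by_year(records):
    """records: iterable of (word, year, count) tuples.

    Returns {year: count-weighted mean word length}, with years in ascending order.
    """
    total_letters = defaultdict(int)
    total_words = defaultdict(int)
    for word, year, count in records:
        total_letters[year] += len(word) * count
        total_words[year] += count
    return {y: total_letters[y] / total_words[y] for y in sorted(total_words)}


# Made-up rows for illustration only, not real corpus figures.
rows = [
    ("the", 1900, 6),
    ("letter", 1900, 2),
    ("the", 2000, 5),
    ("frequency", 2000, 5),
]
trend = mean_word_length_by_year(rows)
print(trend)
```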
lispython over 12 years ago
I thought this might help in designing a new keyboard layout that is better, from a scientific and statistical perspective, than Dvorak (https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard), which is based on research done more than 80 years ago.
sriramk over 12 years ago
On a tangential note, does anyone know what he's using to generate those bar graphs automatically from those tiny images? Nifty trick.
new-world-order over 12 years ago
Mr Norvig, please share the code. I'm sure it's some interesting Lisp or Python.
jqueryin over 12 years ago
It's not apparent to me what was used when calculating the frequency of bigrams and n-grams. Was this based on the overall dataset or on the dictionary alone?

I believe it would be beneficial to see a version based on the dictionary words alone, as that would ensure no duplicate words exist to affect the n-grams, acting as a control group.
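The distinction being asked about can be made concrete with a small sketch. It assumes a mapping from each distinct word to its corpus count; the function name and the toy vocabulary are hypothetical, and nothing here reflects how the article's counts were actually produced.

```python
from collections import Counter


def bigram_counts(word_counts, weighted=True):
    """Count letter bigrams from a {word: corpus_count} mapping.

    weighted=True  counts every occurrence of a word (overall dataset).
    weighted=False counts each distinct word once (dictionary alone).
    """
    counts = Counter()
    for word, n in word_counts.items():
        weight = n if weighted else 1
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += weight
    return counts


# Toy vocabulary with made-up counts.
vocab = {"the": 100, "then": 10, "hen": 1}
by_token = bigram_counts(vocab, weighted=True)
by_type = bigram_counts(vocab, weighted=False)
print(by_token.most_common(), by_type.most_common())
```

In the weighted version a few very common words dominate the bigram totals, while the dictionary-only version reflects how spellings are distributed across the vocabulary.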
wangweij over 12 years ago
Many years ago I read an article claiming the order is etoanirsh... I still use it in hangman games.
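An ordering like this can be recomputed from any text with a few lines. This is a sketch only: the function name is hypothetical and the input below is a toy string, so the result says nothing about real English frequencies.

```python
from collections import Counter


def letter_order(text):
    """Return the letters a-z that occur in text, most frequent first."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    # Sort by descending count, then alphabetically so ties are deterministic.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


print(letter_order("Hello, hello world"))
```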
jyhipu over 12 years ago
Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. Apache/1.3.42 Server at norvig.com Port 80
nodata over 12 years ago
But "forschungsgemeinschaft" is German, and it's mentioned frequently: "at least 100,000 times each in the book corpus".

I don't trust that his corpus is comparable to the original English corpus.
ghubbard over 12 years ago
How do the new results compare to Mayzner's?