English Letter Frequency Counts: Mayzner Revisited

180 points by phenylene, over 12 years ago

10 comments

chime, over 12 years ago
I did something similar with a 10m Tweet dataset a couple of years ago:

* http://ktype.net/wiki/research:articles:progress_20110209#twitter_n-gram_results

* http://ktype.net/wiki/research:articles:progress_20110228?s#letter-pair_frequency_table

I would love to redo this analysis with newer Tweets but, alas, I don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from Norvig's http://norvig.com/spell-correct.html using the Tweet dataset to build a better one-finger keyboard and word-prediction engine for iOS.
martinpw, over 12 years ago
Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time, e.g. which words are becoming more or less popular, whether average word length is decreasing over time, whether the variety of words in common use is increasing or decreasing, etc.
lispython, over 12 years ago
I think this could help design a new keyboard layout that is better (from a scientific/statistical perspective) than Dvorak (https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard), which is based on research from more than 80 years ago.
sriramk, over 12 years ago
On a tangential note, does anyone know what he's using to generate those bar graphs automatically from those tiny images? Nifty trick.
new-world-order, over 12 years ago
Mr Norvig, please share the code. I'm sure it's some interesting Lisp or Python.
jqueryin, over 12 years ago
It's not apparent to me what was used when calculating the frequency of bigrams and n-grams. Was this based on the overall dataset or on the dictionary alone?

I believe it would be beneficial to see a version based on the dictionary words alone, as that would ensure no duplicate words exist to affect the n-grams, acting as a control group.
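
To make the two options in this question concrete, here is a minimal Python sketch of both ways of counting bigrams: weighted by how often each word occurs in the corpus, or counted once per distinct dictionary word. The function names and the toy word counts are hypothetical and are not taken from the article; this is not a claim about which method the article used.

```python
from collections import Counter


def bigrams(word):
    """Return the adjacent letter pairs in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def token_weighted(word_counts):
    """Count bigrams, weighting each word by its corpus frequency."""
    totals = Counter()
    for word, count in word_counts.items():
        for bg in bigrams(word):
            totals[bg] += count
    return totals


def type_weighted(word_counts):
    """Count bigrams once per distinct word, ignoring corpus frequency."""
    totals = Counter()
    for word in word_counts:
        totals.update(bigrams(word))
    return totals


# Hypothetical toy counts for illustration only.
toy_counts = {"the": 100, "then": 10, "hen": 1}
by_token = token_weighted(toy_counts)
by_type = type_weighted(toy_counts)
```

With these toy counts, the corpus-weighted tally is dominated by the frequent word "the", while the dictionary-only tally treats all three words equally, which is the "control group" effect the comment describes.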
wangweij, over 12 years ago
Many years ago I read an article claiming the order is etoanirsh... I still use it in hangman games.
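
For anyone who wants to check such an ordering against a text of their own, a minimal Python sketch follows. The function name `letter_order` is hypothetical, ties are broken alphabetically as an arbitrary choice, and the resulting order depends entirely on the corpus supplied.

```python
from collections import Counter


def letter_order(text):
    """Return the letters a-z found in text, most frequent first.

    Case is ignored, non-letters are skipped, and ties are broken
    alphabetically (an arbitrary choice made for this sketch).
    """
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))
```

Calling `letter_order` on a large body of English text gives an ordering to compare with the one remembered above.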
jyhipu, over 12 years ago
Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. Apache/1.3.42 Server at norvig.com Port 80
nodata, over 12 years ago
But "forschungsgemeinschaft" is German, and it's mentioned frequently: "at least 100,000 times each in the book corpus".

I don't trust that his corpus is comparable to the original English corpus.
ghubbard, over 12 years ago
How do the new results compare to Mayzner's?