Do 20 pages of a book gives you 90% of its words?

113 points by kiechu, almost 8 years ago

19 comments

imron, almost 8 years ago
Although it sounds high, recognising 90% of words makes for a pretty horrible reading experience.

That's 1 word in 10 that you don't know (1-2 words per sentence), or, assuming as you did in that post a page length of 300 words, it's 30 new words a page.

I actually recently wrote an article discussing the same phenomenon in Chinese [0], where to get a reasonable level of new characters (e.g. no more than 1 a page) you'd need to know 99.8% of the text on any page.

And the level of recognition required to be able to recognise and learn new words completely from context is about 98%. [1]

0: https://www.chinesethehardway.com/article/hsk-6-gets-you-halfway/

1: https://www.youtube.com/watch?v=JbYMZZISPrU
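The arithmetic in this comment can be checked with a short sketch. The function name is made up for illustration, and the 300-words-per-page figure is the assumption quoted from the post, not a measured value.

```python
def unknown_words_per_page(coverage: float, words_per_page: int = 300) -> float:
    """Expected number of unknown words on one page, given the share of
    running words the reader recognises (coverage between 0 and 1)."""
    return (1.0 - coverage) * words_per_page


# Coverage levels mentioned in the comment: 90%, 98% and 99.8%.
for coverage in (0.90, 0.98, 0.998):
    unknown = unknown_words_per_page(coverage)
    print(f"{coverage:.1%} coverage: about {unknown:.1f} unknown words per page")
```

Read the other way round, one unknown item per page at 99.8% coverage corresponds to a page of roughly 1 / (1 - 0.998) = 500 items, so the Chinese figure presumably assumes a page of about 500 characters rather than 300 words.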

pealco, almost 8 years ago
This doesn't really address your teacher's claim about having to look words up, though. What you want to look at is the distribution of low-frequency words across the book. What do the plots look like when you remove proper nouns, function words (e.g., "the", "and", prepositions) and, say, the top 1000 most frequent words in English?
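One way to picture the analysis suggested here is the rough sketch below. It is not the post author's code: `COMMON_WORDS` is a deliberately tiny, hypothetical stand-in for a real function-word list plus a top-1000 frequency list, and proper nouns are not handled at all.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list. A real analysis would use a full
# function-word list and a top-1000 frequency list for English.
COMMON_WORDS = {"the", "and", "a", "of", "to", "in", "on", "it", "is", "was"}


def rare_words_per_page(text, words_per_page=300, max_count=1):
    """For each page, count tokens that are not in COMMON_WORDS and occur at
    most `max_count` times in the whole text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    pages = [
        tokens[start:start + words_per_page]
        for start in range(0, len(tokens), words_per_page)
    ]
    return [
        sum(1 for tok in page if tok not in COMMON_WORDS and counts[tok] <= max_count)
        for page in pages
    ]
```

Plotting the returned list against the page index would show whether rare words keep appearing at a steady rate or thin out as the book goes on.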

twoodfin, almost 8 years ago
FWIW, *Ulysses* isn't particularly incomprehensible. To the extent that it's difficult to read, it's much more the shifting narrative perspective, widely ranging references, and stream of consciousness than the vocabulary.

Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the origins of the wares in a tea shop:

*So warm. His right hand once more more slowly went over again: choice blend, made of the finest Ceylon brands. The far east. Lovely spot it must be: the garden of the world, big lazy leaves to float about on, cactuses, flowery meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese lobbing around in the sun, in dolce far niente. Not doing a hand's turn all day. Sleep six months out of twelve. Too hot to quarrel. Influence of the climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to. Sleeping sickness in the air.*

Hard to be too confused by the imagery and mood in this passage.

Now, *Finnegans Wake*...

kabdib, almost 8 years ago
My mom, an English teacher, once went through my library of science fiction and analyzed it for reading level. I had the usual collection: lots of Heinlein, Asimov, Niven, Andre Norton, etc.

Her assessment: most of the material was at about an 8th-grade level, based on word count.

From time to time I re-read one of those books and run across pages where she had penciled in notations and underlined words.

loeg, almost 8 years ago
> we turned words to their basic forms (went to go, cars to car, jumps to jump etc.)

FYI, this is called stemming. https://en.wikipedia.org/wiki/Stemming
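A toy sketch may help separate the two ideas that come up in this thread. Suffix-stripping stemmers handle "cars" and "jumps", but mapping an irregular form such as "went" to "go" needs a dictionary lookup, which is usually described as lemmatization. Both functions below are illustrative stand-ins with made-up names, not a real stemmer or lemmatizer.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical one-entry dictionary of irregular forms, for illustration only.
IRREGULAR_FORMS = {"went": "go"}


def toy_lemmatize(word):
    """Look the word up in the irregular-form table, else fall back to toy_stem."""
    return IRREGULAR_FORMS.get(word, toy_stem(word))


for word in ("cars", "jumps", "went"):
    print(word, "->", toy_stem(word), "(stem) /", toy_lemmatize(word), "(lemma)")
```

With these toy rules, "cars" and "jumps" come out the same either way, while "went" is left unchanged by the stemmer and only becomes "go" through the lookup table.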

dri_ft, almost 8 years ago
For the record, Ulysses is at least a full order of magnitude more comprehensible than Joyce's next book, Finnegans Wake.

I'd also expect it to give a skewed response on a test of this kind because it is composed of a number of different sections, which vary considerably in their style. But maybe that's the point of including it.

prashnts, almost 8 years ago
I think their teacher was referring to the Zipfian distribution [0]. I've seen this distribution hold on the Wikipedia corpus as well. Of course, it's empirical.

[0]: https://en.wikipedia.org/wiki/Zipf%27s_law
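To see how a Zipfian distribution relates to coverage, here is a small sketch under an idealised assumption: word frequency is exactly proportional to 1 / rank^s over a fixed vocabulary. Real texts only follow this approximately, and the function name and the vocabulary sizes are illustrative choices, not figures from the post.

```python
def zipf_coverage(top_k, vocab_size, exponent=1.0):
    """Share of running words covered by the `top_k` most frequent word types
    when frequency is proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Illustrative vocabulary of 10,000 word types.
for top_k in (100, 1000, 5000):
    print(f"top {top_k} types: {zipf_coverage(top_k, 10_000):.1%} of running words")
```

This measures coverage by the most frequent word types, which is related to, but not the same as, asking how much of a book's vocabulary shows up in its first 20 pages.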

jaclaz, almost 8 years ago
A nice, interesting idea and experiment, thanks.

Not so casually, the blue lines remind me of the one in the graph for the birthday problem:

https://en.wikipedia.org/wiki/Birthday_problem

bryanrasmussen, almost 8 years ago
The use of Eve's Diary doesn't make any sense here; of course the distribution of words in a short story is going to be longer than in a book.

Ulysses is fair, but I would expect it and works of a similar caliber to be outliers.

kazinator, almost 8 years ago
As little as one character of almost any document will usually give you 100% of the binary symbols 0 and 1. Usually, the first character will do this, after which the rest of it is just mindless repetition.

nl, almost 8 years ago
This is good, interesting work. I wonder what the difference between stemming and lemmatization shows?

Edit: I see you are doing lemmatization now. Did you try just stemming?

Finch2192, almost 8 years ago
This doesn't seem all that groundbreaking; it's just an instance of Zipf's law in action, is it not?

ihaveajob, almost 8 years ago
I bet this is not true for the Encyclopedia Britannica, by design.

js8, almost 8 years ago
I think this is a very useful idea. It could be used to "rate" books for English learners to see how difficult they are.
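A rating along these lines could be sketched as follows. This is one possible reading of the post's question, not the author's method: it assumes 300-word pages, does no lemmatization, and scores a book by the share of its running words that already appear in the first few pages. The function name is hypothetical.

```python
import re


def coverage_after_pages(text, pages, words_per_page=300):
    """Share of all running words in `text` whose word form already occurs
    within the first `pages` pages."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    seen = set(tokens[: pages * words_per_page])
    return sum(1 for tok in tokens if tok in seen) / len(tokens)
```

Comparing `coverage_after_pages(book_text, 20)` across books would give a rough ordering, where `book_text` is whatever plain-text edition is at hand. A lower score suggests that more new vocabulary keeps arriving after page 20.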

al452, almost 8 years ago
"incomprehensibility"

zeep, almost 8 years ago
90% of the words is not 90% of the meaning... but I get your point.

flavio81, almost 8 years ago
Yes, if the book is 22 pages long!

oconnor0, almost 8 years ago
Not if it's a dictionary!

rfrank, almost 8 years ago
I wonder how Pale Fire by Nabokov would look after this sort of analysis. For the unfamiliar, per Wikipedia:

"Starting with the table of contents, Pale Fire looks like the publication of a 999-line poem in four cantos ("Pale Fire") by the fictional John Shade with a Foreword, extensive Commentary, and Index by his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the form of notes to various numbered lines of the poem. Here and in the rest of his critical apparatus, Kinbote explicates the poem surprisingly little. Focusing instead on his own concerns, he divulges what proves to be the plot piece by piece, some of which can be connected by following the many cross-references. Espen Aarseth noted that Pale Fire "can be read either unicursally, straight through, or multicursally, jumping between the comments and the poem."[4] Thus although the narration is non-linear and multidimensional, the reader can still choose to read the novel in a linear manner without risking misinterpretation."