Do 20 pages of a book give you 90% of its words?

113 points by kiechu almost 8 years ago

19 comments

imron almost 8 years ago

Although it sounds high, recognising 90% of words makes for a pretty horrible reading experience.

That's 1 word in 10 that you don't know (1-2 words per sentence), or assuming as you did in that post a page length of 300 words, then it's 30 new words a page.

I actually recently wrote an article discussing the same phenomenon in Chinese [0], where to get a reasonable level of new characters (e.g. no more than 1 a page) you'd need to know 99.8% of the text on any page.

And the level of recognition required to be able to recognise and learn new words completely from context is about 98%. [1]

[0]: https://www.chinesethehardway.com/article/hsk-6-gets-you-halfway/

[1]: https://www.youtube.com/watch?v=JbYMZZISPrU
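
The per-page arithmetic can be checked with a short sketch. It assumes the 300-word page mentioned in the comment; the function name `unknown_words_per_page` is made up for illustration.

```python
def unknown_words_per_page(coverage: float, words_per_page: int = 300) -> float:
    """Expected number of unrecognised words on one page at a given coverage."""
    return (1.0 - coverage) * words_per_page


# Coverage levels mentioned in the comment: 90%, 98% and 99.8%.
for coverage in (0.90, 0.98, 0.998):
    unknown = unknown_words_per_page(coverage)
    print(f"{coverage:.1%} coverage -> {unknown:.1f} unknown words per page")
```

Under that 300-word assumption, 90% coverage works out to about 30 unknown words a page, while 99.8% brings it below one.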

pealco almost 8 years ago

This doesn't really address your teacher's claim about having to look words up, though. What you want to look at is the distribution of low frequency words across the book. What do the plots look like when you remove proper nouns, functional words (e.g., "the", "and", prepositions) and, say, the top 1000 most frequent words in English?

twoodfin almost 8 years ago

FWIW, Ulysses isn't particularly incomprehensible. To the extent that it's difficult to read, it's much more the shifting narrative perspective, widely ranging references, and stream-of-consciousness rather than the vocabulary.

Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the origins of the wares in a tea shop:

> So warm. His right hand once more more slowly went over again: choice blend, made of the finest Ceylon brands. The far east. Lovely spot it must be: the garden of the world, big lazy leaves to float about on, cactuses, flowery meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese lobbing around in the sun, in dolce far niente. Not doing a hand's turn all day. Sleep six months out of twelve. Too hot to quarrel. Influence of the climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to. Sleeping sickness in the air.

Hard to be too confused by the imagery and mood in this passage.

Now, Finnegans Wake...

kabdib almost 8 years ago

My mom, an English teacher, once went through my library of science fiction and analyzed it for reading level. I had the usual collection: lots of Heinlein, Asimov, Niven, Andre Norton, etc.

Her assessment: most of the material was about 8th grade level, based on word count.

From time to time I re-read one of those books and run across pages where she had penciled in notations and underlined words.

loeg almost 8 years ago

> we turned words to their basic forms (went to go, cars to car, jumps to jump etc.)

FYI, this is called stemming. https://en.wikipedia.org/wiki/Stemming
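
For readers unfamiliar with the terminology, here is a toy sketch of the two common ways to reduce words to a basic form. Suffix stripping (stemming) handles "cars" and "jumps", but an irregular form like "went" needs a dictionary lookup, which is usually called lemmatization. Both functions and the `LEMMAS` table are hypothetical illustrations, not the article's code and not a real stemmer such as Porter's.

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper: an illustration only, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require a few leftover characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma table; a real lemmatizer uses a full dictionary.
LEMMAS = {"went": "go", "cars": "car", "jumps": "jump"}


def naive_lemma(word: str) -> str:
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("went", "cars", "jumps"):
    print(f"{word}: stem={naive_stem(word)} lemma={naive_lemma(word)}")
```

The toy stemmer leaves "went" unchanged, which is why the choice between the two approaches can change the distinct-word counts.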

dri_ft almost 8 years ago

For the record, Ulysses is at least a full order of magnitude more comprehensible than Joyce's next book, Finnegans Wake.

I'd also expect it to give a skewed response on a test of this kind because it is composed of a number of different sections, which vary considerably in their style. But maybe that's the point of including it.

prashnts almost 8 years ago

I think their teacher was referring to the Zipfian distribution [0]. I've seen this distribution hold on the Wikipedia corpus as well. Of course, it's empirical.

[0]: https://en.wikipedia.org/wiki/Zipf%27s_law
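
One way to see what a Zipfian distribution implies for coverage is the sketch below. It assumes an idealised Zipf law in which a word's frequency is proportional to 1/rank^s; the function name `ranks_needed`, the default exponent and the vocabulary size are illustrative assumptions, not measurements from any corpus.

```python
from itertools import accumulate


def ranks_needed(vocab_size: int, target: float = 0.90, exponent: float = 1.0) -> int:
    """Smallest number of top-ranked words whose combined share of tokens
    reaches `target`, under an idealised Zipf law (frequency ~ 1 / rank**exponent)."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    for rank, running in enumerate(accumulate(weights), start=1):
        if running / total >= target:
            return rank
    return vocab_size


# Hypothetical vocabulary of 10,000 distinct words.
print(ranks_needed(10_000))
```

Real texts only follow this law approximately, so the result is a rough guide rather than a prediction for any particular book.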

jaclaz almost 8 years ago

A nice, interesting idea and experiment, thanks.

Not coincidentally, the blue lines remind me of the curve in the graph for the birthday problem:

https://en.wikipedia.org/wiki/Birthday_problem

bryanrasmussen almost 8 years ago

The use of Eve's Diary doesn't make any sense here; of course the distribution of words in a short story is going to be longer than in a book.

Ulysses is fair, but I would expect it and works of a similar caliber to be outliers.

kazinator almost 8 years ago

As little as one character of almost any document will usually give you 100% of the binary symbols 0 and 1. Usually, the first character will do this, after which the rest of it is just mindless repetition.

nl almost 8 years ago

This is good, interesting work. I wonder what the difference between stemming and lemmatization shows?

Edit: I see you are doing lemmatization now. Did you try just stemming?

Finch2192 almost 8 years ago

This doesn't seem all that groundbreaking; it's just an instance of Zipf's law in action, is it not?

ihaveajob almost 8 years ago

I bet this is not true for the Encyclopedia Britannica, by design.

js8 almost 8 years ago

I think this is a very useful idea - it could be used to "rate" the books for English learners to see how difficult they are.
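
A rating along these lines could be sketched as the share of a text's running words that a learner already knows. This is a minimal sketch under stated assumptions: the function name `coverage`, the crude regex tokenizer and the tiny known-word set are all hypothetical, and the sample sentence is borrowed from the Ulysses passage quoted elsewhere in this thread.

```python
import re


def coverage(text: str, known_words: set[str]) -> float:
    """Fraction of running words in `text` that appear in `known_words`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


sample = "Too hot to quarrel. Influence of the climate. Lethargy."
known = {"too", "hot", "to", "of", "the", "climate"}
print(f"{coverage(sample, known):.0%} of the words are known")
```

A real rating would also need a lemmatizer and a realistic learner vocabulary list, so the number here only illustrates the idea.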

al452 almost 8 years ago

"incomprehensibility"

zeep almost 8 years ago

90% of the words is not 90% of the meaning... but I get your point.

flavio81 almost 8 years ago

Yes, if the book is 22 pages long!

oconnor0 almost 8 years ago

Not if it's a dictionary!

rfrank almost 8 years ago

I wonder how Pale Fire by Nabokov would look after this sort of analysis. For the unfamiliar, per Wikipedia, "Starting with the table of contents, Pale Fire looks like the publication of a 999-line poem in four cantos ("Pale Fire") by the fictional John Shade with a Foreword, extensive Commentary, and Index by his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the form of notes to various numbered lines of the poem. Here and in the rest of his critical apparatus, Kinbote explicates the poem surprisingly little. Focusing instead on his own concerns, he divulges what proves to be the plot piece by piece, some of which can be connected by following the many cross-references. Espen Aarseth noted that Pale Fire "can be read either unicursally, straight through, or multicursally, jumping between the comments and the poem." Thus although the narration is non-linear and multidimensional, the reader can still choose to read the novel in a linear manner without risking misinterpretation."
