Although it sounds high, recognising 90% of words makes for a pretty horrible reading experience.<p>That's 1 word in 10 that you don't know (1-2 words per sentence), or, assuming a page length of 300 words as you did in that post, 30 new words a page.<p>I recently wrote an article discussing the same phenomenon in Chinese [0].<p>There, to get a reasonable level of new characters (e.g. no more than 1 a page), you'd need to know 99.8% of the text on any page.<p>And the level of recognition required to be able to recognise and learn new words completely from context is about 98%. [1]<p>0: <a href="https://www.chinesethehardway.com/article/hsk-6-gets-you-halfway/" rel="nofollow">https://www.chinesethehardway.com/article/hsk-6-gets-you-hal...</a><p>1: <a href="https://www.youtube.com/watch?v=JbYMZZISPrU" rel="nofollow">https://www.youtube.com/watch?v=JbYMZZISPrU</a>
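The arithmetic in the comment above can be checked with a minimal sketch. The 300-word page length and the 90%, 98% and 99.8% figures come from the comment; the function name `unknown_per_page` is made up for illustration.

```python
def unknown_per_page(coverage: float, page_length: int = 300) -> float:
    """Expected number of unknown words on a page, given the fraction known."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return (1.0 - coverage) * page_length


# Coverage levels mentioned in the comment above.
for coverage in (0.90, 0.98, 0.998):
    per_page = unknown_per_page(coverage)
    print(f"{coverage:.1%} known: about {per_page:.1f} unknown per page")
```

At 90% coverage this gives roughly 30 unknown words per 300-word page, at 98% roughly 6, and only at 99.8% does it drop below one per page.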
This doesn't really address your teacher's claim about having to look words up, though. What you want to look at is the distribution of low-frequency words across the book. What do the plots look like when you remove proper nouns, function words (e.g., "the", "and", prepositions) and, say, the top 1000 most frequent words in English?
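One way to look at what this comment asks for, as a rough sketch only: count the rare words in each page-sized chunk of the text after dropping a list of common words. The function name `rare_word_counts`, the 300-token chunk size, and the "occurs at most once" threshold are all assumptions, and `common_words` stands in for a real function-word list plus a top-1000 list. Proper nouns are not handled here; that would need a tagger or a capitalisation heuristic.

```python
import re
from collections import Counter


def rare_word_counts(text, common_words, chunk_size=300, max_count=1):
    """Count rare words in each consecutive chunk of `chunk_size` tokens.

    A token counts as rare if it is not in `common_words` and occurs at most
    `max_count` times in the whole text. Both thresholds are assumptions.
    """
    # Crude tokeniser: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    totals = Counter(tokens)
    rare = {w for w, n in totals.items() if n <= max_count and w not in common_words}
    return [
        sum(1 for t in tokens[i:i + chunk_size] if t in rare)
        for i in range(0, len(tokens), chunk_size)
    ]


sample = "the cat sat on the mat and the cat saw a quokka"
common = {"the", "on", "and", "a"}  # placeholder for a real common-word list
print(rare_word_counts(sample, common, chunk_size=6))
```

Plotting the returned list against chunk index would show whether rare words are spread evenly through the book or cluster in particular sections.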
FWIW, <i>Ulysses</i> isn't particularly incomprehensible. To the extent that it's difficult to read, it's much more the shifting narrative perspective, widely ranging references, and stream-of-consciousness rather than the vocabulary.<p>Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the origins of the wares in a tea shop:<p><i>So warm. His right hand once more more slowly went over again: choice blend, made of the finest Ceylon brands. The far east. Lovely spot it must be: the garden of the world, big lazy leaves to float about on, cactuses, flowery meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese lobbing around in the sun, in dolce far niente. Not doing a hand's turn all day. Sleep six months out of twelve. Too hot to quarrel. Influence of the climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to. Sleeping sickness in the air.</i><p>Hard to be too confused by the imagery and mood in this passage.<p>Now, <i>Finnegans Wake</i>...
My mom, an English teacher, once went through my library of science fiction and analyzed it for reading level. I had the usual collection: Lots of Heinlein, Asimov, Niven, Andre Norton, etc.<p>Her assessment: Most of the material was about 8th grade level, based on word count.<p>From time to time I re-read one of those books, and run across pages where she had penciled in notations and underlined words.
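> we turned words to their basic forms (went to go, cars to car, jumps to jump etc.)<p>FYI, reducing words to a base form by stripping suffixes is called stemming; mapping irregular forms such as went to go is, strictly speaking, lemmatization. <a href="https://en.wikipedia.org/wiki/Stemming" rel="nofollow">https://en.wikipedia.org/wiki/Stemming</a>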
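To make the two terms concrete, here is a toy sketch using the three examples quoted above. Both the suffix list and the lookup table are made up for illustration; this is neither the Porter stemmer nor any real library's lemmatizer.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, not the Porter algorithm

# Hypothetical lookup table; a real lemmatizer uses a full dictionary.
LEMMAS = {"went": "go", "cars": "car", "jumps": "jump"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ("went", "cars", "jumps"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

Suffix stripping handles cars and jumps but leaves went unchanged, while the dictionary lookup maps it to go.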
For the record, Ulysses is at least a full order of magnitude more comprehensible than Joyce's next book, Finnegans Wake.<p>I'd also expect it to give a skewed response on a test of this kind because it is composed of a number of different sections, which vary considerably in their style. But maybe that's the point of including it.
I think their teacher was referring to a Zipfian distribution [0]. I've seen this distribution hold on the Wikipedia corpus as well. Of course, it's empirical.<p>[0]: <a href="https://en.wikipedia.org/wiki/Zipf%27s_law" rel="nofollow">https://en.wikipedia.org/wiki/Zipf%27s_law</a>
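In its idealised form, Zipf's law says a word's frequency is roughly proportional to 1 / rank, so rank times count should stay roughly constant across the most frequent words. A minimal sketch for checking that on any text follows; the function name `rank_frequency` and the crude tokeniser are assumptions, and real corpora only fit the law approximately.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    # Crude tokeniser: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Tiny illustrative input; a real check needs a full book or corpus.
for row in rank_frequency("a a a b b c"):
    print(row)
```

On a full-length book, the last column staying within the same order of magnitude over the first few hundred ranks is the pattern the comment describes.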
A nice, interesting idea and experiment, thanks.<p>Not by chance, the blue lines remind me of the one in the graph for the birthday problem:<p><a href="https://en.wikipedia.org/wiki/Birthday_problem" rel="nofollow">https://en.wikipedia.org/wiki/Birthday_problem</a>
The use of Eve's Diary doesn't make any sense here; of course the distribution of words in a short story is going to be longer than in a book.<p>Ulysses is fair, but I would expect it and works of a similar caliber to be outliers.
As little as one character of almost any document will usually give you 100% of the binary symbols 0 and 1. Usually, the first character will do this, after which the rest of it is just mindless repetition.
This is good, interesting work. I wonder what the difference between stemming and lemmatization shows?<p>Edit: I see you are doing lemmatization now. Did you try just stemming?
I wonder how Pale Fire by Nabokov would look after this sort of analysis. For the unfamiliar, per Wikipedia, "Starting with the table of contents, Pale Fire looks like the publication of a 999-line poem in four cantos ("Pale Fire") by the fictional John Shade with a Foreword, extensive Commentary, and Index by his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the form of notes to various numbered lines of the poem. Here and in the rest of his critical apparatus, Kinbote explicates the poem surprisingly little. Focusing instead on his own concerns, he divulges what proves to be the plot piece by piece, some of which can be connected by following the many cross-references. Espen Aarseth noted that Pale Fire "can be read either unicursally, straight through, or multicursally, jumping between the comments and the poem." Thus although the narration is non-linear and multidimensional, the reader can still choose to read the novel in a linear manner without risking misinterpretation."