After wrestling a bit with the web and the terminology, here are some quickly put together stats ( errors and omissions included ).
Code:
VoynichTT
Total words: 37759
Vocabulary : 8078
Hapax : 5571
68.9% of vocab is hapax 6.7 words per hapax Totalwords/Vocab ratio 4.67:1
-----------------------------------------------------------------------------------------------
la divina commedia di dante alighieri
Total words: 97344
Vocabulary : 19893
Hapax : 13750
69.1% of vocab is hapax 7.0 words per hapax Totalwords/Vocab ratio 4.89:1
--------------------------------------------------------------------------------------------------
Naturalis Historia books 1-4 pliny the elder (Thayer)
Total words: 35562
Vocabulary : 12596
Hapax : 8898
70.6% of vocab is hapax 3.9 words per hapax Totalwords/Vocab ratio 2.82:1
--------------------------------------------------------------------------------------------------
The Adventures of Tom Sawyer M.Twain
Total words: 71748
Vocabulary : 7578
Hapax : 3739
49.3% of vocab is hapax 19.1 words per hapax Totalwords/Vocab ratio 9.46:1
--------------------------------------------------------------------------------------------------
Tale of 2 cities C.Dickens
Total words: 136561
Vocabulary : 10137
Hapax : 4590
45.2% of vocab is hapax 29.7 words per hapax Totalwords/Vocab ratio 13.47:1
Here you can see that Pliny's Natural History, an encyclopedic work, has lots of hapax, and almost every 3rd word is a new addition to the vocabulary ( ratio 2.82:1 ).
Dickens' 'A Tale of Two Cities' is at the other end of the scale ( perhaps explaining some of his appeal ),
with a new word introduced only about every 13.5 words.
'A Tale of Two Cities' is a popular book with stats comparable to 'The Adventures of Tom Sawyer'.
Dickens also has the lowest percentage of hapax, but nearly half of this book's vocabulary still consists of words that occur only once ( hapax legomena ).
Genre could be an influence on these numbers as noted by Alin_J and bi3mw.
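For anyone who wants to check figures like the ones in the listing above, here is a minimal Python sketch of the three derived numbers. It is not the tool used for this post. The function name hapax_stats and the tokenization (lowercase, split on whitespace, no punctuation stripping) are my own assumptions, so counts for a real text will depend on how the transcription is cleaned up first.

```python
from collections import Counter


def hapax_stats(text):
    """Return word, vocabulary and hapax counts plus the three derived ratios.

    Assumption: a "word" is any whitespace-separated token, case-insensitive.
    Punctuation is NOT stripped here; clean the text beforehand if needed.
    """
    words = text.lower().split()
    counts = Counter(words)

    total = len(words)                                   # Total words
    vocab = len(counts)                                  # Distinct word types
    hapax = sum(1 for c in counts.values() if c == 1)    # Types seen exactly once

    return {
        "total": total,
        "vocab": vocab,
        "hapax": hapax,
        # Share of the vocabulary that occurs only once, in percent
        "hapax_pct_of_vocab": 100.0 * hapax / vocab if vocab else 0.0,
        # Running words per hapax
        "words_per_hapax": total / hapax if hapax else 0.0,
        # Totalwords/Vocab ratio
        "total_vocab_ratio": total / vocab if vocab else 0.0,
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat the end"
    stats = hapax_stats(sample)
    print("Total words:", stats["total"])
    print("Vocabulary :", stats["vocab"])
    print("Hapax      :", stats["hapax"])
    print("%.1f%% of vocab is hapax" % stats["hapax_pct_of_vocab"])
    print("%.1f words per hapax" % stats["words_per_hapax"])
    print("Totalwords/Vocab ratio %.2f:1" % stats["total_vocab_ratio"])
```

To run it on a whole file, read the file into a string and pass that string to hapax_stats. Different choices for case folding and punctuation will shift the vocabulary and hapax counts, which is one reason stats from different tools rarely match exactly.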
Interestingly, we can see that Dante has the closest numbers to the Voynich MS. The divisions of The Divine Comedy may affect the
statistics of that text in a similar way to how the 6 sections of the VMS possibly shape its own stats.
Further investigation is required, as noted by Emma May Smith.
Hapax legomena are the other side of the coin to the concept of 'word types that occur more than once' as noted by bi3mw.
Ref: https://en.wikipedia.org/wiki/Hapax_legomenon
And this looks quite interesting as well.
Edit 24/06/20: bi3mw pointed out that the Dante stats are wrong; see thread page 4.