The Voynich Ninja

Full Version: [split] Percentage of word types that occur more than once
Pages: 1 2 3 4
Thread split. 

I think Emma's remarks must certainly be taken into account. Word count must be the same for each text (this is exactly the same as with TTR research) and ideally should be limited to one VM section / dialect.
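A minimal sketch of the equal-length requirement, assuming whitespace-separated tokens and simple truncation to the first N tokens. The function name `ttr_at_equal_length` and the toy strings are made up for illustration and are not anyone's actual method from this thread.

```python
def ttr_at_equal_length(texts, n_tokens=None):
    """Type-token ratio of each text, measured on its first n_tokens tokens."""
    tokenized = {name: text.split() for name, text in texts.items()}
    # Default: cut every text down to the length of the shortest one,
    # so that all ratios are computed on the same number of tokens.
    if n_tokens is None:
        n_tokens = min(len(toks) for toks in tokenized.values())
    result = {}
    for name, toks in tokenized.items():
        sample = toks[:n_tokens]
        result[name] = len(set(sample)) / len(sample)
    return result


# Toy strings only, not real transliterations.
texts = {
    "a": "daiin ol daiin chedy ol qokeedy",
    "b": "in the beginning was the word and the word was",
}
print(ttr_at_equal_length(texts))
```

Restricting the comparison to one VM section or dialect would just mean passing only that section's text in `texts`.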
(23-06-2020, 01:12 PM)Stephen Carlson Wrote: 5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?



It comes from total word types - non unique word types: 8078 - 2507 = 5571
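The subtraction can be reproduced on any token list. This is only a sketch: the helper `type_counts` is a made-up name, and the toy tokens are not taken from a real transliteration.

```python
from collections import Counter


def type_counts(tokens):
    """Return (total word types, types occurring more than once, hapax)."""
    freq = Counter(tokens)
    total_types = len(freq)
    repeated = sum(1 for c in freq.values() if c > 1)
    hapax = sum(1 for c in freq.values() if c == 1)
    return total_types, repeated, hapax


tokens = "daiin ol daiin chedy ol qokeedy shedy".split()
total_types, repeated, hapax = type_counts(tokens)
# Every type is either repeated or a hapax, so the identity always holds.
assert hapax == total_types - repeated
print(total_types, repeated, hapax)

# The figures from the post above:
print(8078 - 2507)          # hapax count
print(100 * 2507 / 8078)    # percentage of word types occurring more than once
```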





The VMS text file is discussed elsewhere (link not shown).
(23-06-2020, 02:55 PM)bi3mw Wrote:
(23-06-2020, 01:12 PM)Stephen Carlson Wrote: 5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?
It comes from total word types - non unique word types: 8078 - 2507 = 5571
Oh, you're referring to hapax legomena. I'm not talking about those, but that number is also surprisingly high in comparison with natural language texts.
(21-06-2020, 08:33 PM)bi3mw Wrote: During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

31.034909631% of all Word Types in the VMS occur more than once.
44.928611163% of all Word Types in the comparison text ( Regimen Sanitatis ) occur more than once.



If using the 101 transliteration (Glen Claston), the percentage of unique words (word types) that occur more than once in the Voynich manuscript is about 28%. So, 72% of the word types in the VMS are hapax legomena.
(23-06-2020, 03:57 PM)Stephen Carlson Wrote: Oh, you're referring to hapax legomena. I'm not talking about those, but that number is also surprisingly high in comparison with natural language texts.

Yeah, that's kind of the flip side of my observation in the opening post.


Thanks @Alin_J, I would not have thought that the value is even lower.
(23-06-2020, 04:16 PM)Alin_J Wrote:
(21-06-2020, 08:33 PM)bi3mw Wrote: During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

31.034909631% of all Word Types in the VMS occur more than once.
44.928611163% of all Word Types in the comparison text ( Regimen Sanitatis ) occur more than once.




If using the 101 transliteration (Glen Claston), the percentage of unique words (word types) that occur more than once in the Voynich manuscript is about 28%. So, 72% of the word types in the VMS are hapax legomena.


But then again, this is IMO nothing unusual for natural language texts. For example, the corresponding number for the Finnish translation of Hamlet (total word count: 23448 tokens) is 26%, i.e. 74% hapax legomena.
(23-06-2020, 11:16 AM)Stephen Carlson Wrote:
(22-06-2020, 04:07 PM)ReneZ Wrote: For these two points, I can recommend looking at Table 3 on the linked page (link not shown), which shows a great spread in the number of unique words (word types).
From the linked page:
Quote:A representative number of word types may be 9,000 - 10,000.
This strikes me on the high side, at least for certain languages and genres.

It's a kind of obvious test, but has anyone compared the unique word count of the VM to that of other works in various languages? This page here (link not shown) puts the number of words per unique word for 7 different English-language novels at between 9 and 16.5. If I understand the VM stats right, it comes in between 3.6 and 4.3, depending on the transcription. It seems that the number of unique words in the denominator is about three or four times too high, but I'm curious about non-English works.


This is also not unusual, at least not for non-English works... I found that the Swedish novel Inferno by August Strindberg has a number of 4.38 (total number of words about 46 000).
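For anyone wanting to check such a ratio on their own text file, here is a rough sketch. The tokenization rule (runs of letters, optionally lowercased) is an assumption of this example and not necessarily how the figures above were obtained. A different rule will shift the result, much as the VM figure shifts with the transcription.

```python
import re


def tokens_per_type(text, lowercase=True):
    """Number of running words divided by number of distinct words."""
    # Assumption: a "word" is any run of letters (Unicode-aware), so
    # punctuation and digits are dropped and apostrophes split words.
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[^\W\d_]+", text)
    return len(tokens) / len(set(tokens))


sample = "The cat saw the other cat. The END."
print(tokens_per_type(sample))                   # case folded
print(tokens_per_type(sample, lowercase=False))  # case kept, more types
```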
(23-06-2020, 04:30 PM)Alin_J Wrote: But then again, this is IMO nothing unusual for natural language texts.
Hmm, I would have rather thought that a ratio of 45% / 55% is the "normal case" in longer texts, but surely it depends strongly on the text genre and language.
(23-06-2020, 05:02 PM)bi3mw Wrote:
(23-06-2020, 04:30 PM)Alin_J Wrote: But then again, this is IMO nothing unusual for natural language texts.
Hmm, I would have rather thought that a ratio of 45% / 55% is the "normal case" in longer texts, but surely it depends strongly on the text genre and language.

Yeah, it does seem to vary a lot depending on both language and on type of work, e.g. fiction, encyclopedic, poetic etc.
After wrestling a bit with the web and terminology, here are some quickly put together stats (errors and omissions included).

Code:
VoynichTT
Total words:   37759
Vocabulary :    8078
Hapax      :    5571
               69.0% of vocab is hapax   6.7 words per hapax   Totalwords/Vocab ratio 4.67:1
--------------------------------------------------------------------------------------------------
La Divina Commedia, Dante Alighieri
Total words:   97344
Vocabulary :   19893
Hapax      :   13750
               69.1% of vocab is hapax   7.0 words per hapax   Totalwords/Vocab ratio 4.89:1
--------------------------------------------------------------------------------------------------
Naturalis Historia, books 1-4, Pliny the Elder (Thayer)
Total words:   35562
Vocabulary :   12596
Hapax      :    8898
               70.6% of vocab is hapax   3.9 words per hapax   Totalwords/Vocab ratio 2.82:1
--------------------------------------------------------------------------------------------------
The Adventures of Tom Sawyer, M. Twain
Total words:   71748
Vocabulary :    7578
Hapax      :    3739
               49.3% of vocab is hapax  19.1 words per hapax   Totalwords/Vocab ratio 9.46:1
--------------------------------------------------------------------------------------------------
A Tale of Two Cities, C. Dickens
Total words:  136561
Vocabulary :   10137
Hapax      :    4590
               45.2% of vocab is hapax  29.7 words per hapax   Totalwords/Vocab ratio 13.47:1

Here you can see that Pliny's Natural History, an encyclopedic work, has lots of hapax, and almost every 3rd word is a new addition to the vocabulary.
Dickens' 'A Tale of Two Cities' is at the other end of the scale (perhaps explaining some of his appeal):
a new word is introduced only about every 13.5 words.
'A Tale of Two Cities' is a popular book with stats comparable to 'The Adventures of Tom Sawyer'.
Dickens also has the lowest percentage of hapax, but nearly half of this book's vocabulary still consists of words that occur only once (hapax legomena).
Genre could be an influence on these numbers as noted by Alin_J and bi3mw.
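Since the three derived columns follow directly from the raw counts, they are cheap to cross-check. The sketch below recomputes them from the counts posted above. The function name `derived_stats` and the dictionary keys are made up for this example, Dante is left out because those counts are flagged as wrong in the edit note, and the output uses ordinary rounding, so a last digit may differ from a truncated figure.

```python
def derived_stats(total_words, vocabulary, hapax):
    """Derive the three ratios from total words, vocabulary size and hapax count."""
    return {
        "hapax_pct_of_vocab": 100 * hapax / vocabulary,
        "words_per_hapax": total_words / hapax,
        "words_per_type": total_words / vocabulary,
    }


# (total words, vocabulary, hapax) as posted in the stats block.
counts = {
    "VoynichTT": (37759, 8078, 5571),
    "Naturalis Historia 1-4": (35562, 12596, 8898),
    "Tom Sawyer": (71748, 7578, 3739),
    "Tale of Two Cities": (136561, 10137, 4590),
}

for name, c in counts.items():
    s = derived_stats(*c)
    print(
        f"{name:24s} {s['hapax_pct_of_vocab']:5.1f}% of vocab is hapax  "
        f"{s['words_per_hapax']:5.1f} words per hapax  "
        f"{s['words_per_type']:5.2f}:1"
    )
```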

Interestingly, Dante has the closest numbers to the Voynich MS. The divisions of The Divine Comedy may affect
that text's statistics in much the same way that the 6 sections of the VMS may be responsible for its stats.
Further investigation is required, as noted by Emma May Smith.

Hapax legomena are the other side of the coin to the concept of 'word types that occur more than once' as noted by bi3mw.

Ref: https://en.wikipedia.org/wiki/Hapax_legomenon

And this looks quite interesting as well (link not shown).

Edit: 24/06/20 bi3mw pointed out that the Dante stats are wrong; see thread page 4.