The Voynich Ninja

Full Version: [split] Percentage of word types that occur more than once
During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

31.034909631% of all Word Types in the VMS occur more than once.
44.928611163% of all Word Types in the comparison text ( Regimen Sanitatis ) occur more than once.


From voynichese.com, which uses the T. Takahashi transliteration, though I'm not sure which version.

Vocabulary: 8078
Hapax legomena: 5571
More than once: ( 8078 - 5571 ) = 2507
( 100 / 8078 ) * 2507 = 31.034909631096806% of word types occur more than once.


From regimem_sanitatis_corpus (not sure where I got this from), using my parser:
Vocabulary: 5399
Hapax legomena: 2978
More than once: 2421
( 100 / 5399 ) * 2421 = 44.84163734024819% of word types occur more than once.
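For anyone who wants to reproduce these counts, here is a minimal sketch of the calculation. It assumes a plain-text transliteration with whitespace-separated words and lowercases everything first. The function name `type_stats` and the sample string are only illustrations, not the parser used above.

```python
from collections import Counter

def type_stats(text):
    """Return (vocabulary size, hapax count, percent of types seen more than once)."""
    tokens = text.lower().split()            # assumption: words are whitespace-separated
    counts = Counter(tokens)                 # word type -> number of occurrences
    vocabulary = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    more_than_once = vocabulary - hapax
    percent = 100 * more_than_once / vocabulary if vocabulary else 0.0
    return vocabulary, hapax, percent

# Made-up sample, not real manuscript text
sample = "daiin chedy daiin qokeedy shedy chedy okaiin"
print(type_stats(sample))
```

Differences in tokenisation (punctuation, uncertain spaces, case) will shift the result, which may explain small gaps between two people's numbers for the same text.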

Close ;)
I imagine that I've somehow lost a few words in processing the Regimen text file. :(
I haven't installed OpenOffice yet, so I can't check your files.
Addendum: I used Torsten Timm's generator to create a text.

46.51611658% of all Word Types in the generated text occur more than once.


The size of the vocabulary and the number of word types with a single token (hapax legomena) might be out of proportion to the apparent length of the text. The language of each scribe/section varies, so we're effectively looking at multiple texts, each with only a fraction of the words of the manuscript as a whole. Given that single-token word types are more common in shorter texts, we need to compare each section with a text of equal length.
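One simple way to do such an equal-length comparison is to truncate both token lists to the length of the shorter one before counting. This is only a sketch of that idea; the function names and the toy token lists are made up for illustration, and truncation is just one possible choice (random sampling of equal-sized windows would be another).

```python
from collections import Counter

def percent_repeated(tokens):
    """Percent of word types that occur more than once in a token list."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return 100 * repeated / len(counts)

def compare_at_equal_length(tokens_a, tokens_b):
    """Truncate both texts to the shorter length, then compare."""
    n = min(len(tokens_a), len(tokens_b))
    return percent_repeated(tokens_a[:n]), percent_repeated(tokens_b[:n])

# Toy data standing in for one manuscript section and a comparison text
section = "a b a c".split()
reference = "x y x y x z w w".split()
print(compare_at_equal_length(section, reference))
```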
(21-06-2020, 09:55 PM)RobGea Wrote: From voynichese.com
You read my mind, speaking of voynichese. I use it regularly, but the only thing I can do is display a word in color and see its frequency. Can we search for two neighboring words at the same time?
There are several reasons why the results for the Voynich text may be skewed.
We don't know, of course, how large the impact of each of these possible reasons is, though one could try to simulate this if one wanted.

1. There are errors in the transliteration. These are likely to create (some) additional unique words.
2. Specifically, word spaces are likely to be misinterpreted. This has an impact on the word count.

For these two points, I can recommend looking at Table 3 on [link], which shows a wide spread in the number of unique words (word types).

3. The comparison texts are from printed sources, and do not contain errors. Any errors in the handwritten Voynich text would increase the number of unique words (even when transliterated correctly).
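A rough sketch of how such a simulation could look: take a clean comparison text, randomly replace one character in a fraction of the words, and see how the number of word types changes. Everything here is an assumption for illustration: the function name, the alphabet, the error rate, and the toy word list. It only models single-character substitutions, not misplaced word spaces.

```python
import random

def add_substitution_errors(tokens, error_rate, alphabet="acdehiklnoqrsty", seed=0):
    """Replace one character in roughly `error_rate` of the tokens (hypothetical error model)."""
    rng = random.Random(seed)                # fixed seed so runs are repeatable
    noisy = []
    for tok in tokens:
        if tok and rng.random() < error_rate:
            i = rng.randrange(len(tok))      # position of the simulated misreading
            choices = [ch for ch in alphabet if ch != tok[i]]
            tok = tok[:i] + rng.choice(choices) + tok[i + 1:]
        noisy.append(tok)
    return noisy

# Toy "clean" text: three word types, each repeated 100 times
clean = ["daiin", "chedy", "qokeedy"] * 100
noisy = add_substitution_errors(clean, error_rate=0.05)
print(len(set(clean)), len(set(noisy)))      # word types before and after
```

Running this over a real comparison text at several error rates would give a feel for how quickly errors inflate the count of unique words.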
(22-06-2020, 04:07 PM)ReneZ Wrote: I can recommend to look at Table 3 on [link]
With the number of unique words between 8400 and 9800 and my reading speed of 1-2 words per week, it will take me between 90 and 190 years. I'd better go on a diet if I want to get through it. :'(
(22-06-2020, 04:07 PM)ReneZ Wrote: For these two points, I can recommend to look at Table 3 on [link], that shows a great spread in the number of unique words (word types).
From the linked page:
Quote: A representative number of word types may be 9,000 - 10,000.
This strikes me as being on the high side, at least for certain languages and genres.

It's a kind of obvious test, but has anyone compared the unique word count of the VM to that of other works in various languages? This page [link] puts the number of words per unique word for 7 different English-language novels at between 9 and 16.5. If I understand the VM stats right, it comes in between 3.6 and 4.3, depending on the transcription. It seems that the number of unique words in the denominator is about three or four times too high, but I'm curious about non-English works.
Maybe I have an error in thinking somewhere, but I get a value of 6.8 for the VMS:

Unique words VMS: 5571

VMS words total: 37886

Words per unique word 37886 / 5571 = 6.8006


P.S: I probably only come up with 8078 Word Types because all uppercase letters have been converted to lowercase.
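For comparison, here is the same ratio worked out with both candidate denominators quoted in this thread. This is just arithmetic on the posted numbers; it assumes the 37886 tokens and the 8078 word types come from the same lowercased transliteration. Note that 5571 appears earlier in the thread as the hapax count.

```python
tokens_total = 37886   # total VMS words, as posted above
word_types = 8078      # vocabulary size quoted in this thread
hapax = 5571           # quoted earlier in the thread as the hapax count

# Words per unique word, using each figure as the denominator
per_type = tokens_total / word_types
per_hapax = tokens_total / hapax

print(round(per_type, 2), round(per_hapax, 2))
```

With 8078 in the denominator the value is roughly 4.7 rather than 6.8.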
(23-06-2020, 12:24 PM)bi3mw Wrote: Unique words VMS: 5571
5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?