The Voynich Ninja

Full Version: Vocabulary size by Illustration Type
Vocabulary size by Illustration Type    (Using slightly modified ZL2a transcription, uncertain spaces as spaces)
Any and all errors are mine. The following description sounds more complicated than it is :)

In the EVA format there is a variable $I for Illustration type.
The Herbal type was further split into 2 types, Herbal_a and Herbal_b, following Lisa Fagin Davis's allocation of folios by scribe.
Herbal_a is defined as having EVA $I = H and its folio is ascribed to Scribe_1.
Herbal_b is defined as having EVA $I = H and its folio is ascribed to any Scribe except Scribe_1.

Here the words in the folios of the same Illustration type were collected giving a total word count for each of the 9 types.

Within each type, repeated words were removed, creating a set of words in which each word is counted once. This is the vocabulary of that Illustration type, the 'type_vocab'.

Then, for each word in the 'type_vocab', if that word appeared in any of the other 8 type_vocabs, the word was removed, creating an 'unshared_vocab'.

The  'type_vocab'  contains the words that appear once or more in folios that have the same Illustration type.
The  'unshared_vocab'  contains words that appear once or more ONLY in folios that have the same Illustration type.

Any word that appears in more than one  'type_vocab'  is removed completely.
For instance the word  'daiin'  appears in several  'type_vocab's  and because of that it does not appear in any of the  'unshared_vocab's.
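
The steps above can be sketched in a few lines of Python. This is only a minimal illustration of the set logic, not the script used for the table below: the function name `build_vocabs` and the toy word lists are assumptions, and the assignment of words to types in the toy data is made up.

```python
from collections import Counter

def build_vocabs(words_by_type):
    """words_by_type: dict mapping Illustration type -> list of word tokens."""
    # type_vocab: each word counted once per Illustration type
    type_vocab = {t: set(ws) for t, ws in words_by_type.items()}
    # spread[w]: number of type_vocabs that contain word w
    spread = Counter(w for vocab in type_vocab.values() for w in vocab)
    # unshared_vocab: words found in exactly one type_vocab
    unshared_vocab = {
        t: {w for w in vocab if spread[w] == 1}
        for t, vocab in type_vocab.items()
    }
    return type_vocab, unshared_vocab

# Toy data (hypothetical assignment of words to types)
sample = {
    "Ha": ["daiin", "chol", "ctho", "daiin"],
    "S": ["daiin", "oteeey", "chol"],
    "B": ["qolchedy", "daiin"],
}
type_vocab, unshared_vocab = build_vocabs(sample)
```

In the toy data 'daiin' sits in all three type_vocabs, so it drops out of every unshared_vocab, matching the rule described above.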

Key: Herbal_a ( Ha );  Herbal_b ( Hb );  Stars ( S );  Balneo ( B );  Pharma ( P );  Astro ( A );  Zodiac ( Z );  Text ( T  );  Cosmo ( C ).
Code:
Type,  total_words,  type_vocab,  unshared_vocab,    unshared_vocab as % of type_vocab,      Rank
Ha,     8054,          2516,           1460,            % 58.028                              R1
Hb,     3522,          1353,            474,            % 35.033                              R8
S,     10851,          3072,           1662,            % 54.101                              R2
B,      6376,          1471,            618,            % 42.012                              R4
P,      2555,          1132,            472,            % 41.696                              R5
A,       876,           611,            238,            % 38.952                              R7
Z,      1291,           767,            343,            % 44.719                              R3
T,      3108,          1279,            448,            % 35.027                              R9
C,      2213,          1101,            436,            % 39.600                              R6

Observations:
-Herbal_a has the highest share of unshared words (R1), as expected because it is Currier A.
-Pharma is also Currier A, so its position at R5 is unexpected.
-Stars at R2 is about 9 percentage points above the next rank (Zodiac at R3, 54.1% vs 44.7%), an anomaly with no obvious explanation.
Speculations:
One possibility is that the Stars section discusses something outside the range of the rest of the text.
(03-07-2022, 03:30 PM)RobGea Wrote: -Stars at R2
R3 ?
(03-07-2022, 03:30 PM)RobGea Wrote: The Herbal Type was further split into 2 types
Shouldn't the text pages also be split between A and B?
(03-07-2022, 03:30 PM)RobGea Wrote: Observations:
-HerbalA has the most unshared words, as expected because it is CurrierA.
-Pharma is also CurrierA so its position at R5 is unexpected.
-Stars at R2 is 10% higher than the next rank, an anomaly with no obvious explanation.

Be aware that you are comparing dictionary sizes. If you take the text size into account, the numbers tell a slightly different story:
Code:
Type,  total_words,  type_vocab,  unshared_vocab,  type_vocab as % of total_words, unshared_vocab as % of total_words
Herbal (A),     8054,       2516,           1460,            % 31.2                      % 18.1
Pharma (A),     2555,       1132,            472,            % 44.3                      % 18.5
Astro,           876,        611,            238,            % 69.7                      % 27.2
Zodiac,         1291,        767,            343,            % 59.4                      % 26.6
Cosmo,          2213,       1101,            436,            % 49.8                      % 19.7
Text,           3108,       1279,            448,            % 41.2                      % 14.4
Herbal (B),     3522,       1353,            474,            % 38.4                      % 13.5
Stars (B),     10851,       3072,           1662,            % 28.3                      % 15.3
Bio (B),        6376,       1471,            618,            % 23.1                      %  9.7
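
For anyone who wants to recompute both views, here is a small sketch that derives the three percentages from the counts posted in this thread and ranks the types each way. The counts are copied from the tables above; the names `rows`, `ratios` and the two ranking lists are assumptions made for the example.

```python
# (total_words, type_vocab, unshared_vocab) as posted in the thread
rows = {
    "Ha": (8054, 2516, 1460),
    "Hb": (3522, 1353, 474),
    "S": (10851, 3072, 1662),
    "B": (6376, 1471, 618),
    "P": (2555, 1132, 472),
    "A": (876, 611, 238),
    "Z": (1291, 767, 343),
    "T": (3108, 1279, 448),
    "C": (2213, 1101, 436),
}

def ratios(total, vocab, unshared):
    """Return (unshared % of vocab, vocab % of total, unshared % of total)."""
    return (100 * unshared / vocab, 100 * vocab / total, 100 * unshared / total)

table = {t: ratios(*counts) for t, counts in rows.items()}

# View 1: ignore text length (unshared_vocab as % of type_vocab)
rank_by_vocab_share = sorted(rows, key=lambda t: table[t][0], reverse=True)
# View 2: divide by text length (unshared_vocab as % of total_words)
rank_by_total_share = sorted(rows, key=lambda t: table[t][2], reverse=True)

for t in rows:
    print(f"{t:>2}  {table[t][0]:6.1f}  {table[t][1]:6.1f}  {table[t][2]:6.1f}")
```

The two rankings order the sections quite differently, which is the point of the comparison.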
Thanks Torsten,
doing it that way does indeed show a different story.
That's pretty interesting, I need to think this over.
Of the two views:
- ignoring sample text length
- dividing by sample text length
both are imperfect. It is not clear to me which of the two is more indicative.

Certainly, dictionary size increases very non-linearly with sample text size.

It is only a problem when text lengths are significantly different. That is the case here, of course.
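
One common workaround for unequal text lengths, not something done in this thread, is to compare vocabulary sizes at the same number of tokens by repeatedly drawing equal-sized samples from each section. The sketch below assumes a dict of token lists per type; the function name `vocab_at_equal_length`, its parameters and the toy data are all hypothetical.

```python
import random

def vocab_at_equal_length(words_by_type, n_tokens=None, trials=100, seed=0):
    """Average number of distinct words in random samples of n_tokens tokens."""
    rng = random.Random(seed)
    if n_tokens is None:
        # Default: the length of the shortest section
        n_tokens = min(len(ws) for ws in words_by_type.values())
    result = {}
    for t, ws in words_by_type.items():
        # Sample without replacement, count distinct words, average over trials
        sizes = [len(set(rng.sample(ws, n_tokens))) for _ in range(trials)]
        result[t] = sum(sizes) / trials
    return result

# Toy data: a long repetitive text and a short varied one
toy = {"X": ["a"] * 50 + ["b"] * 50, "Y": ["c", "d", "e", "f"]}
equalised = vocab_at_equal_length(toy)
```

Random sampling discards word order; taking contiguous windows of equal length instead would be another option, with its own trade-offs.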
Hmmmm.....
Never ask a brother if he is a professor of mathematics and computer science.

Educate yourself and ask again later.


Can you list the unique vords of Herbal A and determine a frequency of use for those that were used multiple times within that section? In other words, is there a specific set of vords that uniquely define the "topics" of Herbal A.

Can the herbals be combined, and then combined with Pharma, to look for unique common terms not found in other parts of the VMs? If the whole botany and Pharma bit were all about leaves, then that would be a shared term probably not used in the cosmic and zodiac parts.

If terms unique to Herbal A are used multiple times, is each use similar or unique? And also in the combinations?
(04-07-2022, 10:26 PM)R. Sale Wrote: Can you list the unique vords of Herbal A and determine a frequency of use for those that were used multiple times within that section? In other words, is there a specific set of vords that uniquely define the "topics" of Herbal A.

Usually such vords are rarely used.

Herbal A (1426 unshared word types; note: the text samples were taken from Takahashi's transliteration)
Code:
1262 vords only occur once
 107 x two times
  28 x three times
  12 x four times
   8 x five times
   1 x six times
   1 x eight times
   2 x nine times ('dsho', 'cthom')
   3 x 13 times ('qotchol', 'cthaiin', 'choiin')
   1 x 14 times ('qotchor')
   1 x 15 times ('ctho')

Quire 13 Bio (634 unshared word types)
Code:
581 vords only occur once
 38 x two times
  6 x three times
  4 x four times
  1 x five times
  1 x six times
  2 x seven times ('qoly', 'rshedy')
  1 x ten times ('qolchedy')

Quire 20 Stars (1663 unshared word types)
Code:
1496 vords only occur once
 120 x two times
  27 x three times
  12 x four times
   4 x five times
   2 x 6 times ('chedam', 'oteal')
   2 x 8 times ('lkeeey', 'oteeey')
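
Lists like the ones above are a frequency-of-frequencies count over the unshared words of one section. A minimal sketch, assuming the section's tokens and its unshared_vocab are already available; `frequency_spectrum` and the toy tokens are hypothetical names and data, not the code behind the numbers above.

```python
from collections import Counter

def frequency_spectrum(tokens, unshared_vocab):
    """Count how often each unshared word occurs, then how many words share each count."""
    counts = Counter(w for w in tokens if w in unshared_vocab)
    spectrum = Counter(counts.values())  # occurrences -> number of word types
    return counts, dict(sorted(spectrum.items()))

# Toy section: 'daiin' is shared with other sections, so it is ignored
toy_tokens = ["ctho", "ctho", "ctho", "dsho", "daiin", "daiin", "cthom"]
toy_unshared = {"ctho", "dsho", "cthom"}
counts, spectrum = frequency_spectrum(toy_tokens, toy_unshared)

for occurrences, n_types in spectrum.items():
    print(f"{n_types:5d} x {occurrences} times")
```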
(03-07-2022, 03:30 PM)RobGea Wrote: Vocabulary size by Illustration Type
As I don't usually do statistical calculations, I find it hard to follow: what precise point should this calculation of unshared words clarify? I couldn't find an explanation before the presentation of the results.
(04-07-2022, 10:26 PM)R. Sale Wrote: Can the herbals be combined and then combined with pharma to look for unique common terms not found in other parts of the VMs? If the whole botany and Pharma bit were all about leaves, then that would be a shared term probably not used in cosmic and zodiac parts.

If you combine Herbal A with Herbal B, or Herbal A with Pharma, no word stands out. If illustrations do indicate topics, some common terms specific to a particular type of illustration or topic should exist. However, such terms do not exist.

Pharma A (459 unshared word types)
Code:
438 x occurs only once
 19 x two times
  2 x three times ('olchor','shockhey')

Herbal (A) + Pharma (A) (1940 unshared word types)
Code:
1700 x occurs only once
 154 x two times
  47 x three times
  16 x four times
  11 x five times
   2 x six times (unique for Herbal A + Pharma: 'ctheody')
   1 x seven times (unique for Herbal A + Pharma: 'dom')
   1 x eight times
   2 x nine times
   3 x 13 times
   1 x 14 times
   1 x 15 times

Herbal B (419 unshared word types)
Code:
408 x occurs only once
 10 x two times
  1 x four times ('chekedy')

Herbal A + Herbal B (1887 unshared word types)
Code:
1670 x occurs only once
 146 x two times
  35 x three times
  17 x four times
   9 x five times
   1 x six times
   2 x eight times (unique for Herbal A + B: 'tchody')
   2 x nine times
   3 x 13 times
   1 x 14 times
   1 x 15 times
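
The combined counts can be reproduced in outline by merging two sections into one vocabulary before removing the words shared with the rest. The sketch below is an assumption about how such a merge could be coded, with a hypothetical function `unshared_after_merge` and made-up toy data; it is not the script used for the lists above.

```python
def unshared_after_merge(words_by_type, merge, merged_name):
    """Merge the types in `merge`, then keep words found in no other type."""
    merged = {}
    for t, ws in words_by_type.items():
        key = merged_name if t in merge else t
        merged.setdefault(key, set()).update(ws)
    # Union of every vocabulary outside the merged group
    others = set().union(*(v for k, v in merged.items() if k != merged_name))
    return merged[merged_name] - others

# Toy data (hypothetical assignment of words to types)
toy = {
    "Ha": ["dom", "daiin", "ctheody"],
    "P": ["dom", "chol"],
    "S": ["chol", "daiin", "oteeey"],
}
combined_unshared = unshared_after_merge(toy, {"Ha", "P"}, "Ha+P")
```

In the toy data 'dom' occurs in both Ha and P, so it is unshared only once the two are merged. This is how a word can become "unique for Herbal A + Pharma" without being unique to either section alone.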