03-07-2022, 03:30 PM
Vocabulary size by Illustration Type (Using slightly modified ZL2a transcription, uncertain spaces as spaces)
Any and all errors are mine. The following description sounds more complicated than it is.

In the EVA format there is a variable $I for Illustration type.
The Herbal type was further split into 2 types, Herbal_a and Herbal_b, following Lisa Fagin Davis's allocation of folios by Scribe.
Herbal_a is defined as having EVA $I = H and its folio is ascribed to Scribe_1.
Herbal_b is defined as having EVA $I = H and its folio is ascribed to any Scribe except Scribe_1.
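The Herbal split can be sketched as a small function. This is only an illustration of the rule stated above, not the code used for the counts; the name `section_label` and the integer encoding of the scribe are assumptions made for the example.

```python
def section_label(illustration_type: str, scribe: int) -> str:
    """Return the label used to group a folio.

    illustration_type: value of the $I variable for the folio ("H" = Herbal).
    scribe: hypothetical integer scribe number for the folio (1 = Scribe_1).
    """
    if illustration_type == "H":
        # Herbal folios are split by scribe: Scribe_1 -> Ha, anyone else -> Hb.
        return "Ha" if scribe == 1 else "Hb"
    # All other illustration types keep their $I value as the label.
    return illustration_type
```

For example, `section_label("H", 1)` gives `"Ha"`, while `section_label("H", 2)` gives `"Hb"`.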
Here the words in the folios of the same Illustration type were collected, giving a total word count for each of the 9 types.
Within each type, repeated words were removed, creating a set in which each word is counted once. This is the vocabulary of that Illustration type, the 'type_vocab'.
Then, for each word in a 'type_vocab', if that word appeared in any of the other 8 'type_vocab's, the word was removed, creating an 'unshared_vocab'.
The 'type_vocab' contains the words that appear once or more in folios that have the same Illustration type.
The 'unshared_vocab' contains words that appear once or more ONLY in folios that have the same Illustration type.
Any word that appears in more than one 'type_vocab' is removed completely.
For instance the word 'daiin' appears in several 'type_vocab's and because of that it does not appear in any of the 'unshared_vocab's.
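The procedure described above can be sketched in a few lines of Python. This is a minimal sketch, not the code that produced the counts. The function name `vocab_stats` is an assumption, and the word lists in `toy` are made up to show the behaviour; they are not taken from the transcription.

```python
from collections import Counter


def vocab_stats(words_by_type):
    """Build 'type_vocab' and 'unshared_vocab' for each Illustration type.

    words_by_type: dict mapping a type label to the list of word tokens
    collected from all folios of that type.
    """
    # type_vocab: each word counted once per type.
    type_vocab = {t: set(words) for t, words in words_by_type.items()}

    # Count in how many type_vocabs each word occurs.
    n_types = Counter(w for vocab in type_vocab.values() for w in vocab)

    # unshared_vocab: words that occur in exactly one type_vocab.
    unshared_vocab = {
        t: {w for w in vocab if n_types[w] == 1}
        for t, vocab in type_vocab.items()
    }
    return type_vocab, unshared_vocab


# Toy data, invented for illustration only.
toy = {
    "Ha": ["daiin", "chol", "chol", "shol"],
    "S": ["daiin", "qokeedy", "otaiin"],
    "B": ["qokeedy", "shedy"],
}
type_vocab, unshared_vocab = vocab_stats(toy)
```

In the toy data 'daiin' occurs in both Ha and S, so it stays in both 'type_vocab's but drops out of every 'unshared_vocab'.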
Key: Herbal_a ( Ha ); Herbal_b ( Hb ); Stars ( S ); Balneo ( B ); Pharma ( P ); Astro ( A ); Zodiac ( Z ); Text ( T ); Cosmo ( C ).
Code:
Type, total_words, type_vocab, unshared_vocab, unshared_vocab as % of type_vocab, Rank
Ha, 8054, 2516, 1460, 58.028, R1
Hb, 3522, 1353, 474, 35.033, R8
S, 10851, 3072, 1662, 54.101, R2
B, 6376, 1471, 618, 42.012, R4
P, 2555, 1132, 472, 41.696, R5
A, 876, 611, 238, 38.952, R7
Z, 1291, 767, 343, 44.719, R3
T, 3108, 1279, 448, 35.027, R9
C, 2213, 1101, 436, 39.600, R6
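The percentage and rank columns can be recomputed from the two vocabulary counts. The sketch below uses the type_vocab and unshared_vocab figures from the table; the variable names are assumptions made for the example.

```python
# (type_vocab, unshared_vocab) per type, copied from the table.
counts = {
    "Ha": (2516, 1460),
    "Hb": (1353, 474),
    "S": (3072, 1662),
    "B": (1471, 618),
    "P": (1132, 472),
    "A": (611, 238),
    "Z": (767, 343),
    "T": (1279, 448),
    "C": (1101, 436),
}

# unshared_vocab as a percentage of type_vocab.
pct = {t: 100 * unshared / vocab for t, (vocab, unshared) in counts.items()}

# Rank the types from highest to lowest percentage.
ranked = sorted(pct, key=pct.get, reverse=True)

for rank, t in enumerate(ranked, start=1):
    print(f"{t}: {pct[t]:.3f}% R{rank}")

# Gap in percentage points between rank 2 and rank 3.
gap_r2_r3 = pct[ranked[1]] - pct[ranked[2]]
```

The printed values can differ from the table in the last digit, which would be consistent with the table figures having been truncated rather than rounded.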
Observations:
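-Herbal_a has the highest proportion of unshared words (58.028%), as expected because it is Currier A. In absolute terms Stars has more (1662 against 1460).
-Pharma is also Currier A, so its position at R5 is unexpected.
-Stars at R2 is about 9.4 percentage points above the next rank (Zodiac at R3, 44.719%), an anomaly with no obvious explanation.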
Speculations:
One possibility is that the Stars section discusses subject matter that is outside the range of the rest of the text.