The Voynich Ninja

Full Version: TF-IDF values of categories
Most Common Words by Category (Top 20):
Text (T): [('shedy', 36), ('chedy', 34), ('or', 31), ('daiin', 27), ('dar', 26), ('aiin', 26), ('qokar', 24), ('shey', 23), ('ar', 22), ('chey', 22), ('ol', 21), ('qokaiin', 20), ('dain', 17), ('chol', 15), ('chdy', 14), ('okar', 14), ('qokal', 13), ('qokeey', 13), ('otar', 13), ('otal', 13)]
Herbal (H): [('daiin', 432), ('chol', 196), ('chor', 141), ('s', 119), ('dar', 106), ('shol', 105), ('or', 102), ('dy', 94), ('chy', 89), ('cthy', 88), ('dain', 82), ('sho', 78), ('aiin', 71), ('chey', 67), ('ol', 65), ('shor', 61), ('okaiin', 57), ('chedy', 56), ('chdy', 55), ('shy', 54)]
Cosmo (C): [('ol', 31), ('ar', 26), ('daiin', 25), ('or', 25), ('dar', 22), ('aiin', 19), ('o', 18), ('y', 16), ('dal', 16), ('r', 15), ('oteey', 15), ('otedy', 14), ('al', 13), ('l', 12), ('chedy', 12), ('qokal', 12), ('okar', 11), ('otody', 11), ('shedy', 11), ('d', 10)]
Stars/Recipes (S): [('chedy', 179), ('qokeey', 162), ('qokeedy', 132), ('aiin', 129), ('ar', 125), ('qokaiin', 123), ('daiin', 119), ('shedy', 108), ('chey', 107), ('qokain', 102), ('okaiin', 96), ('okeey', 93), ('al', 89), ('otaiin', 79), ('ol', 78), ('okain', 70), ('shey', 67), ('cheey', 65), ('qokedy', 60), ('oteey', 57)]
Astro (A): [('s', 13), ('aiin', 12), ('ar', 11), ('daiin', 11), ('dair', 9), ('dar', 9), ('dy', 8), ('oteey', 7), ('shey', 6), ('okeey', 6), ('okol', 6), ('dal', 5), ('chol', 5), ('okey', 5), ('shes', 5), ('sar', 4), ('air', 4), ('chy', 4), ('cho', 4), ('or', 4)]
Zodiac (Z): [('oteey', 20), ('al', 17), ('aiin', 15), ('oteody', 15), ('otaiin', 13), ('ar', 12), ('otey', 11), ('oteos', 11), ('otar', 10), ('daiin', 8), ('okeey', 8), ('oty', 7), ('otal', 7), ('dar', 6), ('otaly', 6), ('am', 6), ('oteedy', 6), ('okal', 6), ('okaly', 6), ('o', 5)]
Bio/Balneo (B): [('shedy', 186), ('chedy', 159), ('qokain', 157), ('qokedy', 154), ('ol', 152), ('qokeedy', 146), ('qokal', 98), ('qokaiin', 81), ('qol', 78), ('qokeey', 77), ('daiin', 75), ('shey', 72), ('chey', 64), ('qoky', 53), ('lchedy', 50), ('dal', 50), ('dar', 49), ('otedy', 48), ('qotedy', 46), ('qokar', 43)]
Pharma (P): [('daiin', 87), ('chol', 37), ('cheol', 33), ('okeol', 30), ('ol', 28), ('or', 27), ('qokeol', 24), ('chor', 23), ('aiin', 22), ('qokol', 22), ('okeey', 22), ('dal', 21), ('qokeey', 21), ('s', 21), ('okol', 20), ('cheor', 20), ('dar', 19), ('chey', 19), ('sheol', 18), ('shey', 18)]

TF-IDF Top Words by Category (Top 20):
Text (T): ['shedy', 'ol', 'or', 'chedy', 'aiin', 'ar', 'dar', 'daiin', 'chey', 'qokar', 'shey', 'qokaiin', 'chol', 'dal', 'dain', 'chdy', 'al', 'ykaiin', 'qokal', 'okar']
Herbal (H): ['daiin', 'chol', 'chor', 'dy', 'or', 'shol', 'dar', 'aiin', 'chy', 'cthy', 'dain', 'ol', 'sho', 'ar', 'shor', 'chey', 'qotchy', 'cho', 'chaiin', 'cthor']
Cosmo (C): ['ar', 'ol', 'or', 'aiin', 'dar', 'daiin', 'dy', 'al', 'ch', 'dal', 'otedy', 'qokal', 'otody', 'oteey', 'chol', 'okal', 'okody', 'otodar', '171', '170']
Stars/Recipes (S): ['aiin', 'chedy', 'qokeedy', 'ar', 'qokeey', 'qokaiin', 'al', 'shedy', 'daiin', 'qokain', 'chey', 'ol', 'okaiin', 'okeey', 'otaiin', 'shey', 'lchedy', 'okain', 'cheey', 'or']
Astro (A): ['ar', 'aiin', 'daiin', 'dar', 'dy', 'dair', 'okeo', 'oteey', 'shey', 'chol', 'dal', 'okeey', 'or', 'okol', 'cho', 'shes', 'chocfhy', 'chy', 'okey', 'chos']
Zodiac (Z): ['aiin', 'ar', 'al', 'oteey', 'oteody', 'oteos', 'otar', 'otaiin', 'otey', 'daiin', 'ch', 'otaly', 'dy', 'okaly', 'am', 'otal', 'oteeos', 'oteotey', 'okeey', 'air']
Bio/Balneo (B): ['shedy', 'ol', 'qokeedy', 'qokedy', 'qokain', 'chedy', 'qol', 'qokal', 'qokaiin', 'shey', 'daiin', 'lchedy', 'qokeey', 'qoky', 'chey', 'or', 'qotedy', 'dal', 'otedy', 'dar']
Pharma (P): ['daiin', 'ol', 'chol', 'cheol', 'or', 'aiin', 'okeol', 'chor', 'qokeol', 'qokol', 'dal', 'dar', 'chey', 'okeey', 'cheody', 'okol', 'qokeey', 'cheor', 'dol', 'qokeody']


Please give your comments :) I've been using AI a lot, so I wouldn't necessarily trust this result too much.
Hi Addsamuels,
why are the two lists different?
Text (T): [('shedy', 36), ('chedy', 34), ('or', 31),
Text (T): ['shedy', 'ol', 'or',

The Herbal pages are often split into HA and HB on the basis of Currier languages (or scribe 1 vs other scribes).

For some validation, you can compare your results with the table here:
[link] (it's in the "hidden text"). Unfortunately, I don't remember the details (transliteration used, uncertain spaces, ...)

EDIT: what are the words '171' and '170' in Cosmo?
(17-02-2025, 09:11 AM)MarcoP Wrote:
Hi Addsamuels,
why are the two lists different?
Text (T): [('shedy', 36), ('chedy', 34), ('or', 31),
Text (T): ['shedy', 'ol', 'or',
This is because the first list is simply the raw count of each word. The second is ranked by TF-IDF score (term frequency, inverse document frequency). I calculated it through. This is "a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general."
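To illustrate why the two rankings can differ, here is a minimal sketch of one common TF-IDF variant, treating each category as a single document. The function name `tfidf_by_category`, the smoothed IDF formula and the toy word lists are assumptions made for the example; the thread does not say which tool or formula produced the lists above.

```python
import math
from collections import Counter

def tfidf_by_category(docs):
    """docs maps a category label to its list of word tokens."""
    n_docs = len(docs)
    counts = {label: Counter(tokens) for label, tokens in docs.items()}
    # Document frequency: in how many categories each word occurs at all.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = {
            # Term frequency times a smoothed IDF (one variant among several).
            word: (n / total) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, n in c.items()
        }
    return scores

# Toy data, not a real transliteration.
docs = {
    "H": ["daiin", "daiin", "daiin", "chol", "chol"],
    "B": ["daiin", "daiin", "daiin", "shedy", "shedy"],
    "P": ["daiin", "daiin", "daiin", "cheol", "cheol"],
}
scores = tfidf_by_category(docs)
top_h = sorted(scores["H"], key=scores["H"].get, reverse=True)
print(top_h)
```

In the toy data 'daiin' has the highest raw count in every category, but because it occurs in all of them its IDF factor is the minimum, so the rarer 'chol' ends up ranked first for "H".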
(17-02-2025, 09:11 AM)MarcoP Wrote:
The Herbal pages are often split into HA and HB on the basis of Currier languages (or scribe 1 vs other scribes).
Indeed, see:
Number of Texts Per (X, Y) Category:
T A: 31 texts
H A: 1255 texts
H B: 378 texts
C C: 172 texts
S A: 80 texts
H C: 7 texts
T B: 257 texts
A C: 240 texts
Z C: 335 texts
B B: 861 texts
C B: 229 texts
P A: 459 texts
S B: 1084 texts
T C: 1 texts
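A tally like the one above can be sketched with counters keyed on the (section, language) pair. The `records` layout and its sample lines are hypothetical; the thread does not describe how the input data was structured.

```python
from collections import Counter, defaultdict

# Hypothetical records: (section, Currier language, transliterated text).
records = [
    ("H", "A", "daiin chol chor"),
    ("H", "A", "chol shol daiin"),
    ("H", "B", "chedy daiin qokedy"),
    ("B", "B", "shedy qokain shedy"),
]

texts_per_pair = Counter()
words_per_pair = defaultdict(Counter)
for section, language, text in records:
    key = (section, language)
    texts_per_pair[key] += 1                  # one more text in this pair
    words_per_pair[key].update(text.split())  # add its words to the pair's counts

for (section, language), n in texts_per_pair.items():
    top = words_per_pair[(section, language)].most_common(3)
    print(f"{section} {language}: {n} texts, top words {top}")
```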

Most Common Words by (X, Y) Category (Top 20):
T A: [('chol', 7), ('dain', 6), ('daiin', 6), ('shol', 3), ('or', 3), ('shody', 3), ('sho', 3), ('d', 3), ('cthy', 3), ('shey', 3), ('kor', 2), ('cthar', 2), ('okan', 2), ('chor', 2), ('ckhey', 2), ('r', 2), ('chey', 2), ('dar', 2), ('kaiin', 2), ('fachys', 1)]
H A: [('daiin', 361), ('chol', 188), ('chor', 135), ('s', 100), ('shol', 94), ('cthy', 85), ('chy', 81), ('sho', 77), ('dain', 72), ('dy', 67), ('dar', 60), ('shor', 55), ('chey', 50), ('or', 47), ('cthol', 45), ('shy', 44), ('qotchy', 43), ('ol', 39), ('cthor', 37), ('qokchy', 37)]
H B: [('daiin', 70), ('chedy', 55), ('or', 55), ('aiin', 49), ('chdy', 49), ('dar', 46), ('qokedy', 40), ('ar', 39), ('chckhy', 35), ('shedy', 33), ('okaiin', 30), ('okar', 30), ('dy', 26), ('ol', 26), ('qokar', 26), ('okedy', 25), ('otedy', 23), ('okal', 22), ('cheky', 20), ('s', 19)]
C C: [('o', 15), ('y', 14), ('r', 14), ('dal', 14), ('dar', 13), ('l', 12), ('daiin', 12), ('ar', 11), ('s', 10), ('dy', 10), ('al', 10), ('d', 9), ('okeey', 9), ('ol', 8), ('k', 7), ('x', 7), ('chey', 7), ('chol', 7), ('v', 6), ('okar', 6)]
S A: [('ar', 15), ('okal', 12), ('dal', 10), ('daiin', 9), ('qokal', 9), ('otal', 7), ('aiin', 7), ('qokar', 7), ('chal', 6), ('qokaly', 6), ('qokeey', 6), ('al', 5), ('qokaiin', 5), ('chol', 5), ('okaly', 5), ('s', 4), ('otar', 4), ('sheey', 4), ('cheal', 4), ('qoky', 4)]
H C: [('otaim,dam', 1), ('alam', 1), ('cphy', 1), ('fchecfhy', 1), ('dy', 1), ('dchepain', 1), ('shety', 1), ('qopy', 1), ('fol', 1), ('chpdy', 1), ('daiin', 1), ('sheek', 1), ('l,ody', 1), ('yteo', 1), ('qop[s:r]', 1), ('air', 1), ('cheot[ee:ch]y', 1), ('dal[o:a]m', 1), ('ytal', 1), ('cheot', 1)]
T B: [('shedy', 36), ('chedy', 34), ('or', 28), ('aiin', 26), ('dar', 24), ('qokar', 24), ('daiin', 21), ('ol', 21), ('ar', 21), ('chey', 20), ('shey', 20), ('qokaiin', 20), ('okar', 14), ('qokal', 13), ('chdy', 13), ('qokeey', 13), ('otar', 13), ('otal', 13), ('okedy', 11), ('dal', 11)]
A C: [('s', 13), ('aiin', 12), ('ar', 11), ('daiin', 11), ('dair', 9), ('dar', 9), ('dy', 8), ('oteey', 7), ('shey', 6), ('okeey', 6), ('okol', 6), ('dal', 5), ('chol', 5), ('okey', 5), ('shes', 5), ('sar', 4), ('air', 4), ('chy', 4), ('cho', 4), ('or', 4)]
Z C: [('oteey', 20), ('al', 17), ('aiin', 15), ('oteody', 15), ('otaiin', 13), ('ar', 12), ('otey', 11), ('oteos', 11), ('otar', 10), ('daiin', 8), ('okeey', 8), ('oty', 7), ('otal', 7), ('dar', 6), ('otaly', 6), ('am', 6), ('oteedy', 6), ('okal', 6), ('okaly', 6), ('o', 5)]
B B: [('shedy', 186), ('chedy', 159), ('qokain', 157), ('qokedy', 154), ('ol', 152), ('qokeedy', 146), ('qokal', 98), ('qokaiin', 81), ('qol', 78), ('qokeey', 77), ('daiin', 75), ('shey', 72), ('chey', 64), ('qoky', 53), ('lchedy', 50), ('dal', 50), ('dar', 49), ('otedy', 48), ('qotedy', 46), ('qokar', 43)]
C B: [('ol', 23), ('or', 23), ('ar', 15), ('aiin', 15), ('otedy', 13), ('daiin', 13), ('shedy', 10), ('otody', 9), ('oteedy', 9), ('dar', 9), ('oteey', 9), ('chedy', 9), ('otar', 7), ('otodar', 6), ('shedaiin', 6), ('qokal', 6), ('chdy', 6), ('okedy', 6), ('odaiin', 6), ('yteey', 6)]
P A: [('daiin', 87), ('chol', 37), ('cheol', 33), ('okeol', 30), ('ol', 28), ('or', 27), ('qokeol', 24), ('chor', 23), ('aiin', 22), ('qokol', 22), ('okeey', 22), ('dal', 21), ('qokeey', 21), ('s', 21), ('okol', 20), ('cheor', 20), ('dar', 19), ('chey', 19), ('sheol', 18), ('shey', 18)]
S B: [('chedy', 179), ('qokeey', 156), ('qokeedy', 132), ('aiin', 122), ('qokaiin', 118), ('ar', 110), ('daiin', 110), ('shedy', 107), ('chey', 105), ('qokain', 100), ('okaiin', 95), ('okeey', 92), ('al', 84), ('otaiin', 77), ('ol', 77), ('okain', 67), ('shey', 66), ('cheey', 64), ('qokedy', 60), ('oteey', 57)]
T C: [('oror', 1), ('sheey', 1)]

TF-IDF Top Words by (X, Y) Category (Top 20):
T A: ['chol', 'dain', 'daiin', 'cthar', 'shody', 'or', 'cthy', 'eo', 'okan', 'shol', 'shey', 'ckhey', 'chy', 'sho', 'dar', 'chey', 'kor', 'dlo', 'ase', 'okchoy']
H A: ['daiin', 'chol', 'chor', 'dy', 'cthy', 'shol', 'chy', 'dain', 'sho', 'shor', 'cthor', 'cho', 'qotchy', 'cthol', 'shy', 'dar', 'ol', 'or', 'chaiin', 'chey']
H B: ['aiin', 'chedy', 'or', 'daiin', 'ar', 'qokedy', 'chdy', 'dy', 'dar', 'chckhy', 'shedy', 'ol', 'okar', 'okaiin', 'okedy', 'otedy', 'qokar', 'kedy', 'qokaiin', 'kar']
C C: ['ar', 'dy', 'dal', 'al', 'dar', '171', '170', 'okeey', 'ol', 'daiin', '172', '169', 'chol', 'chey', 'dair', 'aiir', 'qokal', 'air', 'aiin', 'oteody']
S A: ['ar', 'okal', 'aiin', 'qokal', 'dal', 'qokaly', 'qokair', 'otal', 'al', 'daiin', 'okaly', 'chaly', 'alam', 'dalar', 'qokar', 'chol', 'chal', 'tal', 'or', 'qokeey']
H C: ['cheot', 'cphy', 'fchecfhy', 'qokd', 'dchepain', 'chpdy', 'yteo', 'toeedy', 'sheek', 'otaim', 'chokeody', 'alam', 'shckhdy', 'fol', 'yfchey', 'dy', 'shety', 'qopy', 'qop', 'ykchedy']
T B: ['shedy', 'ol', 'chedy', 'aiin', 'or', 'ar', 'dar', 'qokar', 'qokaiin', 'shey', 'chey', 'daiin', 'al', 'okar', 'qokedy', 'dal', 'qokeedy', 'chdy', 'qokeey', 'otal']
A C: ['aiin', 'ar', 'daiin', 'dy', 'dar', 'dair', 'okeo', 'oteey', 'shes', 'shey', 'okeey', 'okol', 'cho', 'chol', 'or', 'dal', 'okey', 'chocfhy', 'chy', 'daiir']
Z C: ['aiin', 'al', 'ar', 'oteey', 'oteody', 'oteos', 'otar', 'otaiin', 'otey', 'otaly', 'daiin', 'ch', 'okaly', 'dy', 'oteo', 'am', 'oteeos', 'otal', 'oteotey', 'air']
B B: ['shedy', 'qokeedy', 'qokedy', 'ol', 'chedy', 'qokain', 'qol', 'qokal', 'qokaiin', 'shey', 'qokeey', 'lchedy', 'qoky', 'qotedy', 'daiin', 'chey', 'otedy', 'qoteedy', 'or', 'dal']
C B: ['or', 'ol', 'aiin', 'ar', 'otedy', 'ch', 'daiin', 'otodar', 'dar', 'otody', 'shedy', 'oteedy', 'dy', 'chedy', 'oteey', 'al', 'am', 'shedaiin', 'odaiin', 'otchedy']
P A: ['daiin', 'ol', 'chol', 'cheol', 'okeol', 'aiin', 'or', 'qokeol', 'chor', 'qokol', 'okeey', 'cheor', 'qokeey', 'dar', 'chey', 'dol', 'sheol', 'okol', 'dal', 'shey']
S B: ['chedy', 'qokeedy', 'aiin', 'qokeey', 'qokaiin', 'ar', 'al', 'shedy', 'qokain', 'chey', 'ol', 'daiin', 'okaiin', 'okeey', 'qokedy', 'otaiin', 'shey', 'lchedy', 'otedy', 'okain']
T C: ['oror', 'sheey', 'zepchy', 'keeeyd', 'keeo', 'keeoal', 'keeochy', 'keeod', 'keeoda', 'keeodaiin', 'keeodal', 'keeodar', 'keeodchey', 'keeodol', 'keeody', 'keeokechy', 'keeol', 'keeols', 'keeolshey', 'keeor']
(17-02-2025, 09:11 AM)MarcoP Wrote:
For some validation, you can compare your results with the table here:
[link] (it's in the "hidden text"). Unfortunately, I don't remember the details (transliteration used, uncertain spaces, ...)
I can't really find the table.
(17-02-2025, 09:11 AM)MarcoP Wrote:
EDIT: what are the words '171' and '170' in Cosmo?
I'm afraid I don't know where these numbers come from. They possibly come from extended Voynich alphabet characters.
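One possible explanation, offered as a guess: some transliteration files write rare glyphs as numeric codes such as @171; (the exact notation is an assumption here), and a tokenizer that keeps only runs of two or more word characters, as scikit-learn's default token pattern does, would reduce such a code to the bare token '171'. The same pattern would drop one-letter words like 's' and split 'otaim,dam' at the comma, which would match the differences between the count lists and the TF-IDF lists. Whether this tokenizer was actually used is not stated in the thread.

```python
import re

# Same regular expression as scikit-learn's default token_pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Invented sample line; "@171;" stands in for a numeric rare-glyph code
# (assumed notation).
line = "dar @171;ol s qop[s:r] otaim,dam"

whitespace_tokens = line.split()
pattern_tokens = TOKEN_PATTERN.findall(line)
print(whitespace_tokens)  # keeps '@171;ol', 's', 'qop[s:r]', 'otaim,dam' intact
print(pattern_tokens)     # yields '171' as a word and loses the single 's'
```

If this is the cause, passing the already split word lists to the TF-IDF step instead of raw text would keep both sets of lists on the same vocabulary.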
Thanks, and hopefully the added table gives more insight. I also think my [link] table may be of interest.
Unfortunately many people have made similar lists of word frequencies and have all been left baffled by the results.

Here are some more facts to baffle you. In quire 13, 20% of the text consists of just nine words. Five of these are 4o- words. Three are -c89 words. In quire 20, 20% of the text is made up of sixteen words. Even if you examine separately the pages that form the two language clusters of quire 20 you get much the same list.
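A figure like "20% of the text consists of just nine words" can be reproduced by walking down the frequency list until the running total reaches the target share. This is a minimal sketch; the function name `words_needed` and the toy token list are made up for the example.

```python
from collections import Counter

def words_needed(tokens, share=0.20):
    """Return the most frequent words that together reach `share` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    picked, running = [], 0
    for word, n in counts.most_common():
        if running >= share * total:
            break  # target share already covered
        picked.append(word)
        running += n
    return picked, running / total

# Toy data, not a real quire.
tokens = ["shedy"] * 5 + ["qokain"] * 3 + ["ol"] * 2
half, half_cover = words_needed(tokens, share=0.5)
more, more_cover = words_needed(tokens, share=0.6)
print(half, half_cover)
print(more, more_cover)
```

Running the same function on each quire's word list gives the counts to compare (nine words, sixteen words and so on), provided the same transliteration and word splitting are used throughout.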

Yet the pages in both quires are classed as being in ‘B’ language. If there really was some sort of textual link (same regional language, same cypher, same code) between the two quires it would be expected that there would be more commonality in the top words.
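One way to put a number on "commonality in the top words" is the overlap between two top-20 lists. The sketch below uses the Stars/Recipes (S) and Bio/Balneo (B) lists from the "Most Common Words by Category" table earlier in this thread. Those lists are per section, not per quire, so treating them as stand-ins for quires 20 and 13 is an assumption.

```python
# Top-20 words copied from the per-category count table in this thread.
stars = ["chedy", "qokeey", "qokeedy", "aiin", "ar", "qokaiin", "daiin",
         "shedy", "chey", "qokain", "okaiin", "okeey", "al", "otaiin",
         "ol", "okain", "shey", "cheey", "qokedy", "oteey"]
bio = ["shedy", "chedy", "qokain", "qokedy", "ol", "qokeedy", "qokal",
       "qokaiin", "qol", "qokeey", "daiin", "shey", "chey", "qoky",
       "lchedy", "dal", "dar", "otedy", "qotedy", "qokar"]

bio_set = set(bio)
shared = [w for w in stars if w in bio_set]          # words in both lists
jaccard = len(shared) / len(set(stars) | bio_set)    # shared / union

print(len(shared), shared)
print(round(jaccard, 3))
```

On these two lists, 11 of the 20 words are shared. The measure ignores rank and relative frequency, so it only says which words appear in both lists, not whether they carry the same weight in each section.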


Cumulative word frequencies - quire 13
[link]

Cumulative word frequencies - quire 20
[link]
(18-02-2025, 10:42 AM)dashstofsk Wrote:
Here are some more facts to baffle you. In quire 13, 20% of the text consists of just nine words. Five of these are 4o- words. Three are -c89 words. In quire 20, 20% of the text is made up of sixteen words. Even if you examine separately the pages that form the two language clusters of quire 20 you get much the same list.

While the many similar words are a peculiar feature of the Voynich ms, I don’t think that the 9 most frequent words making up 20% of the text is unusual.
According to [link], the two most common English words ‘the’ and ‘of’ make up 10.5% of the corpus examined. Unless I miscounted something, in the King James Bible the top 4 words cover 20.7% of the text.

the 8.09
and 6.53
of 4.38
to 1.72
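As a quick check of the 20.7% figure, the four percentages listed above can be added up as a running total. The numbers are the ones given in the post; nothing else is assumed.

```python
from itertools import accumulate

# Percentages as listed in the post for the King James Bible.
kjv_top = [("the", 8.09), ("and", 6.53), ("of", 4.38), ("to", 1.72)]

running = list(accumulate(pct for _, pct in kjv_top))
for (word, pct), cum in zip(kjv_top, running):
    print(f"{word:<4}{pct:>6.2f}{cum:>7.2f}")
```

The last running total comes to 20.72, in line with the 20.7% quoted.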

I also agree that one would expect the most frequent words to be reasonably consistent across different texts. This puzzling feature makes it impossible to [link] as the frequent words that occur in all sections, which would be a very convenient attack avenue, under the assumption that Voynichese words correspond to words in a natural language.
(18-02-2025, 09:19 PM)MarcoP Wrote:
the two most common English words ‘the’ and ‘of’ make up 10.5%

I have some different frequencies for the English language, from the Sonnets of one William Shakespeare.

In the English language, yes, the top words are frequent. This is because that language is analytical, not inflected, and has to rely on prepositions to qualify meaning. Words do not change their endings to reflect changes in case. The Sonnets list has mostly prepositions, and few nouns. Generally in languages top words are top because they have special necessity and meaning.

Now have a look at the top words for quire 2. They are slightly different to those for quires 13 and 20. 20% of the quire 2 text consists of eight words. So is the text in the VMS in an analytical structure? Three natural languages? Three cyphers or codes? This is the baffling part.


Word frequencies - Sonnets
[link]

Cumulative word frequencies - quire 2
[link]
I agree! Statistics for poetry can indeed be very different from those of ordinary language.
The motive must be: we must know and we will know
Whatever language the scribes themselves spoke was most likely to be a gendered language, which can add a lot of similarities between nouns. If they cut out functional words for efficiency (but kept gender for some reason), I could see nouns like “herb”, “water”, “star” etc dominating and many sharing similarities through gender tags.
(04-03-2025, 12:12 AM)zachary.kaelan Wrote:
Whatever language the scribes themselves spoke was most likely to be a gendered language, which can add a lot of similarities between nouns. If they cut out functional words for efficiency (but kept gender for some reason), I could see nouns like “herb”, “water”, “star” etc dominating and many sharing similarities through gender tags.
Latin, French, German, Italian and the Slavic languages all have gender for nouns. These are the likeliest candidates. I do think words like chor mean root or plant. It's used in two labels by plants and it's very common in Herbal A (and B).