The Voynich Ninja

quimqu

Can you explain the communities a bit more? It sounds a bit drastic for 38k tokens compared to the other corpora, I'm guessing. If the communities are the core logic of the language, is that abnormal for a text of that size?

(20-03-2026, 08:43 PM)oeesordy Wrote: quimqu

Can you explain the communities a bit more? It sounds a bit drastic for 38k tokens compared to the other corpora, I'm guessing. If the communities are the core logic of the language, is that abnormal for a text of that size?

Hi,

I first build a graph in which two words are connected if their Levenshtein distance is ≤ 2 (that is, at most two character changes). That gives a large connected component. Then I run a community detection algorithm on that graph.

So a community is just a dense cluster inside the Levenshtein ≤2 network. Words inside a community are very similar to each other, but not necessarily all within distance 1. Some can be at distance 2, or even further apart if they are connected through intermediate words. (Note that distance here is not Levenshtein distance; it is the number of steps needed to go from one token to another.)
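As a rough illustration of the description above, here is a minimal sketch in plain Python. It is not the code actually used for the analysis. The post does not name the community detection algorithm, so a simple deterministic label propagation is swapped in as an assumption, and the word list is a tiny hand-picked sample rather than the real vocabulary. The function names (`build_graph`, `hops`, `label_propagation`) are made up for this example.

```python
from collections import Counter, deque
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_graph(words, max_dist=2):
    """Connect two word types when their Levenshtein distance is <= max_dist."""
    graph = {w: set() for w in words}
    for a, b in combinations(words, 2):
        if levenshtein(a, b) <= max_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hops(graph, start, goal):
    """Number of steps (edges) from start to goal, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def label_propagation(graph, max_rounds=50):
    """ASSUMPTION: stand-in community detection, not the algorithm from the post.

    Each word repeatedly adopts the most common label among its neighbours;
    ties are broken alphabetically so the result is reproducible.
    """
    labels = {w: w for w in graph}
    for _ in range(max_rounds):
        changed = False
        for w in sorted(graph):
            if not graph[w]:
                continue  # isolated word keeps its own label
            counts = Counter(labels[n] for n in graph[w])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[w] not in candidates:
                labels[w] = candidates[0]
                changed = True
        if not changed:
            break
    communities = {}
    for w, label in labels.items():
        communities.setdefault(label, set()).add(w)
    return list(communities.values())


# Hand-picked sample of EVA word types (illustration only).
words = ["daiin", "dain", "aiin", "dair", "chedy", "shedy", "chey", "qokeedy"]
graph = build_graph(words)
print(levenshtein("aiin", "dair"))   # 3 edits, so no direct edge
print(hops(graph, "aiin", "dair"))   # 2 steps through a shared neighbour
print(label_propagation(graph))
```

The pair aiin/dair shows the distinction between the two distances: the words are three edits apart, so they are not directly connected, yet they are only two steps apart in the graph and can still end up in the same cluster.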

I wouldn’t say communities are the “core logic” of the language. They are just clusters of words that are very similar in form. In natural languages, those clusters usually come from things like inflection, derivation, or spelling variation.

What is interesting in the Voynich is not that communities exist, but how large and dominant a few of them are. A big part of the vocabulary falls into just a few very dense clusters. So communities are more like a reflection of the system’s structure, not the underlying logic itself.

(19-03-2026, 03:40 PM)Jorge_Stolfi Wrote:
(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, this question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- this is just the most frequent word so it appears most frequently

I believe that it is indeed the most common word by itself, in all Herbal pages (A and B) as well as in the Starred Parags section. In fact it is one of the few words that has about the same frequency of occurrence in both Herbal-A and Herbal-B:

How are you counting it in Stars? I have /chedy/ occurring with more instances than /daiin/ there.

(20-03-2026, 10:54 PM)tavie Wrote: How are you counting [daiin] in Stars? I have /chedy/ occurring with more instances than /daiin/ there.

You are right, sorry. There, daiin is 7th, with only ~63% of the occurrences of chedy.

(But at least I am consistent -- I made that same mistake before... :D)

All the best, --stolfi

(19-03-2026, 03:40 PM)Jorge_Stolfi Wrote: I believe that it is indeed the most common word by itself, in all Herbal pages (A and B) as well as in the Starred Parags section. In fact it is one of the few words that has about the same frequency of occurrence in both Herbal-A and Herbal-B:

This image is interesting with regard to "aiin": for the most part, we see that the most frequent words are shorter, except for words containing "aiin".
Obviously language isn't just statistics, and I'm sure languages across the world vary and depend on topic (and a million other factors). However, I can't think of any English 1- or 2-letter words that are less common than a 5-letter word of any sort, or probably a 4-letter one either; with 3 letters there is probably "the", etc.

.. but anyway, the point I was (badly) arriving at is that I suppose this would support the idea that "aiin" (and maybe "ain") are meant to be fewer letters than EVA maps them to. It's not a new suggestion, and I think many people believe it, but I thought this image shows quite well why that might be the case.

quimqu

Regarding what you said about communities, I cannot help but notice that a large portion of the manuscript is dedicated to the herbal section. Could the communities be connected to the constantly repeated language of remedies for medical woes, since the number of communities is so low? A few large, dominant communities is what you would expect if the text were about consistent remedies with little difference between them: just the measures in the concoctions and the plants involved, more than likely one plant per picture. This constant theme could cause it. I'm speculating, of course, but it might be worth looking into.

Something of research value, as an attempt at a like-for-like comparison: if this thought is correct, one could run the community test on a manuscript written in old Latin. The comparison text would be a fairly long herbal devoted solely to medical remedies, to see whether you get a somewhat similar community result. Maybe the text of the Voynich is invented, compressed Latin?

(20-03-2026, 11:56 PM)Bluetoes101 Wrote: however I can't think of any English 1 or 2 letter words that are less common than a 5 letter word of any sort, probably 4 too, 3 probably has "the" etc

Again, never assume anything about a language, not even your own native tongue. Always check. You will often be surprised.

From The English Physitian (aka "Culpeper's Herbal"), 1652:

2-letter     3-letter     4-letter     5-letter
 4691 of     10598 the     1318 with     661 which
 3230 in      7019 and     1068 that     618 being
 2432 to      1066 are      840 them     564 leavs
 2364 it      1034 for      683 also     463 other
 2334 or       650 but      622 they     405 green
 2116 is       552 all      570 this     353 water
 1100 as       442 not      477 good     331 juyce
  779 be       421 use      465 herb     327 place
  649 by       392 any      460 seed     306 taken
  511 at       387 you      438 into     299 those
  403 if       252 one      413 many     297 their
  369 so       224 two      406 from     290 pains
  356 an       219 may      385 very     254 white
  337 on       190 hot      377 like     242 about
  309 up       170 set      368 root     236 roots
  200 do       152 put      361 time     229 drunk
  153 no       145 old      338 smal     227 round
  141 al       137 out      338 some     220 every
  108 he       133 oyl      333 long     211 after
  102 my       128 dry      314 have     197 stone
   42 me       120 his      314 much     189 parts
   29 we       115 yet      313 made     187 forth
   27 us       108 red      313 more     173 liver
   26 ad        97 man      286 wine     170 sores
   24 am        84 end      268 used     160 blood
   10 go        83 cut      267 upon     160 three
    9 ly        83 way      264 hath     149 there
    3 de        71 top      253 than     142 belly
    3 eg        71 was      216 doth     138 under
    2 ar        70 its      214 both     134 sides
    2 oh        63 let      213 when     129 great
    2 ox        62 sun      210 away     129 stalk
    1 dr        62 too      209 same     125 mouth
    1 et        61 our      208 head     120 small
    1 il        55 wel      202 body     119 these
    1 nd        50 own      187 such     115 urine
    1 ne        48 say      181 help     112 flegm
    1 od        48 wil      179 will     110 honey
    1 ye        47 can      173 most     109 edges
... ... ...

(Some of the 2-letter words, like 'al' and 'ly', must be hyphenation artifacts; others, like 'ad', may be from Latin quotations.)

All the best, --stolfi

This still shows that shorter words are much more prevalent than longer words.
"Which", for example, would be 17th overall, though 1st among 5-letter words. Having "daiin" 1st overall would then be strange.
Obviously other languages have different properties; I'd pick German, as I know a little. On average, short words in German are (from what I remember) longer than English ones, but even so, whatever length the short words are, they are probably much more frequent than longer words. So I still think the question of why a 5-letter word is most frequent is interesting, and it probably also suggests that the word is not 5 letters long.
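The ranking claim can be checked directly against the Culpeper counts quoted above. The sketch below copies only the top rows of that table (every listed word with at least 618 occurrences), so the ranks are ranks among the listed 2- to 5-letter words only; 1-letter words and words of 6 or more letters are not in the table and are therefore not counted. The helper names are made up for this example.

```python
# Counts copied from the table quoted above (The English Physitian, 1652).
# Only listed words with >= 618 occurrences are included here.
counts = {
    "the": 10598, "and": 7019, "of": 4691, "in": 3230, "to": 2432,
    "it": 2364, "or": 2334, "is": 2116, "with": 1318, "as": 1100,
    "that": 1068, "are": 1066, "for": 1034, "them": 840, "be": 779,
    "also": 683, "which": 661, "but": 650, "by": 649, "they": 622,
    "being": 618,
}


def overall_ranks(counts):
    """Rank words by descending count (1 = most frequent)."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, 1)}


def top_word_by_length(counts):
    """Most frequent word for each word length."""
    best = {}
    for word, count in counts.items():
        n = len(word)
        if n not in best or count > counts[best[n]]:
            best[n] = word
    return best


ranks = overall_ranks(counts)
for length, word in sorted(top_word_by_length(counts).items()):
    print(length, word, counts[word], "overall rank", ranks[word])
```

On these rows, the top word of each length sits at rank 3 ("of"), 1 ("the"), 9 ("with"), and 17 ("which"), which is the pattern the argument relies on.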

(22-03-2026, 01:20 AM)Bluetoes101 Wrote: suggests it is not 5 letters in length

The GC transliteration has daiin as being only three characters. It is because iin is so frequent and almost always as a suffix that GC considered it to be one element. Likewise iir. ii almost never appears unless it is followed by n or r.
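To make the counting concrete, here is a small sketch that splits an EVA word into elements while treating iin and iir as single units, as described above. It is not the GC transliteration itself: only the two merges mentioned in this post are applied, every other EVA letter is counted as one element (which is an assumption), and the name `elements` is made up for this example.

```python
import re

# Only the two merged sequences named in the post above.
# ASSUMPTION: every other EVA letter is treated as a single element,
# which is not how the full GC alphabet works.
MERGED = ("iin", "iir")
_PATTERN = re.compile("|".join(MERGED) + "|.")


def elements(eva_word):
    """Split an EVA word into elements, keeping iin and iir whole."""
    return _PATTERN.findall(eva_word)


for word in ("daiin", "aiin", "dain"):
    print(word, elements(word), len(elements(word)))
```

Under this counting, daiin comes out as three elements (d, a, iin) instead of five letters.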

( Why is it that people don't seem to be making more use of this transliteration? )

(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, this question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- this is just the most frequent word so it appears most frequently
- this is the word that most frequently results from a small (or zero) change from the recent words.

The frequency of daiin does not seem to be consistent across the language B sections. In fact most of the top words are not uniformly top. I tried to show this earlier.
