The Voynich Ninja

Full Version: About the generation of similar words
Pages: 1 2 3 4 5 6 7 8
At quimqu

Can you explain the communities a bit more? It sounds a bit drastic for 38k tokens compared to the other corpora, I'm guessing. If the communities are the core logic of the language, is that abnormal for a text of that size?
(20-03-2026, 08:43 PM)oeesordy Wrote: At quimqu

Can you explain the communities a bit more? It sounds a bit drastic for 38k tokens compared to the other corpora, I'm guessing. If the communities are the core logic of the language, is that abnormal for a text of that size?

Hi,

I first build a graph where words are connected if their Levenshtein distance is ≤2 (that is, at most two character edits). That gives a large connected component. Then I run a community detection algorithm on that graph.

So a community is just a dense cluster inside the Levenshtein ≤2 network. Words inside a community are very similar to each other, but not necessarily all within distance 1. Some can be at distance 2, or even further apart if they are connected through intermediate words. (Note that this distance is not Levenshtein distance; it is the number of steps needed to go from one token to another.)
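
A minimal sketch of the graph construction described above, using a toy word list (illustrative only, not real counts from the manuscript). The helper names `levenshtein`, `build_graph`, and `hops` are made up for this example, and the community detection step itself is left out because the algorithm is not named here. The sketch only shows how two words can be more than two edits apart and still sit in the same connected part of the network.

```python
from collections import deque
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def build_graph(words, max_dist=2):
    """Connect every pair of distinct word types whose edit distance is <= max_dist."""
    vocab = sorted(set(words))
    graph = {w: set() for w in vocab}
    for u, v in combinations(vocab, 2):
        if levenshtein(u, v) <= max_dist:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def hops(graph, start, goal):
    """Number of edges on the shortest path from start to goal, or None if unreachable."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return seen[node]
        for nxt in graph[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return None


# Toy vocabulary (hypothetical sample, chosen only to illustrate the idea).
words = ["dain", "daiin", "odaiin", "qodaiin", "chedy"]
graph = build_graph(words)

print(levenshtein("dain", "qodaiin"))  # 3, so there is no direct edge
print(hops(graph, "dain", "qodaiin"))  # 2, reached through an intermediate word
print(hops(graph, "dain", "chedy"))    # None, chedy is in a different component
```

The all-pairs comparison is quadratic in the vocabulary size, which is fine for a sketch but would likely need a smarter neighbour search on a full vocabulary.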

I wouldn’t say communities are the “core logic” of the language. They are just clusters of words that are very similar in form. In natural languages, those clusters usually come from things like inflection, derivation, or spelling variation.

What is interesting in the Voynich is not that communities exist, but how large and dominant a few of them are. A big part of the vocabulary falls into just a few very dense clusters. So communities are more like a reflection of the system’s structure, not the underlying logic itself.
(19-03-2026, 03:40 PM)Jorge_Stolfi Wrote:
(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, this question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- this is just the most frequent word so it appears most frequently

I believe that it is indeed the most common word by itself, in all Herbal pages (A and B) as well as in the Starred Parags section.  In fact it is one of the few words that has about the same frequency of occurrence in both Herbal-A and Herbal-B:


How are you counting it in Stars?  I have /chedy/ occurring with more instances than /daiin/ there.
(20-03-2026, 10:54 PM)tavie Wrote: How are you counting [daiin] in Stars?  I have /chedy/ occurring with more instances than /daiin/ there.

You are right, sorry.  There, daiin is 7th, with only ~63% of the occurrences of chedy.

(But at least I am consistent -- I made that same mistake before... :D )

All the best, --stolfi
(19-03-2026, 03:40 PM)Jorge_Stolfi Wrote: I believe that it is indeed the most common word by itself, in all Herbal pages (A and B) as well as in the Starred Parags section.  In fact it is one of the few words that has about the same frequency of occurrence in both Herbal-A and Herbal-B:

This image is interesting with regard to "aiin": for the most part, the most frequent words are the shorter ones, except for words containing "aiin".
Obviously language isn't just statistics, and I'm sure languages across the world vary and depend on topic (and a million other factors), however I can't think of any English 1 or 2 letter words that are less common than a 5 letter word of any sort, probably 4 too, 3 probably has "the" etc

.. but anyway, the point I was (badly) arriving at is that I suppose this would support the idea that "aiin" (and maybe "ain") are meant to be fewer letters than EVA maps them to. It's not a new suggestion and I think many people think that, but I thought this image shows quite well why that might be the case.
At quimqu

Regarding what you said about communities, I cannot help but notice that a large portion of the manuscript is dedicated to the herbal side. Could the communities be connected to the constantly repeated language of remedies for medical woes, since the number of communities is so low? A few large, dominant communities are what you would expect if the text were about consistent remedies with little difference between them: just the measures in the concoctions and the plants involved, more than likely one plant per picture. Such a constant theme could cause this. I'm speculating, of course, but it might be worth looking into.

Something of research value, as an attempt at a like-for-like comparison: if this thought is correct, run a community test on a manuscript written in old Latin. The text to test would be a fairly long herbal devoted to medical remedies alone, to see if you get a somewhat similar community reading. Maybe the text of the Voynich is invented compressed Latin?
(20-03-2026, 11:56 PM)Bluetoes101 Wrote: however I can't think of any English 1 or 2 letter words that are less common than a 5 letter word of any sort, probably 4 too, 3 probably has "the" etc

Again, never assume anything about a language, not even your own native tongue.  Always check. You will often be surprised.

From The English Physitian (aka "Culpeper's Herbal"), 1652:

   4691 of  10598 the 1318 with  661 which
   3230 in   7019 and 1068 that  618 being
   2432 to   1066 are  840 them  564 leavs
   2364 it   1034 for  683 also  463 other
   2334 or    650 but  622 they  405 green
   2116 is    552 all  570 this  353 water
   1100 as    442 not  477 good  331 juyce
    779 be    421 use  465 herb  327 place
    649 by    392 any  460 seed  306 taken
    511 at    387 you  438 into  299 those
    403 if    252 one  413 many  297 their
    369 so    224 two  406 from  290 pains
    356 an    219 may  385 very  254 white
    337 on    190 hot  377 like  242 about
    309 up    170 set  368 root  236 roots
    200 do    152 put  361 time  229 drunk
    153 no    145 old  338 smal  227 round
    141 al    137 out  338 some  220 every
    108 he    133 oyl  333 long  211 after
    102 my    128 dry  314 have  197 stone
     42 me    120 his  314 much  189 parts
     29 we    115 yet  313 made  187 forth
     27 us    108 red  313 more  173 liver
     26 ad     97 man  286 wine  170 sores
     24 am     84 end  268 used  160 blood
     10 go     83 cut  267 upon  160 three
      9 ly     83 way  264 hath  149 there
      3 de     71 top  253 than  142 belly
      3 eg     71 was  216 doth  138 under
      2 ar     70 its  214 both  134 sides
      2 oh     63 let  213 when  129 great
      2 ox     62 sun  210 away  129 stalk
      1 dr     62 too  209 same  125 mouth
      1 et     61 our  208 head  120 small
      1 il     55 wel  202 body  119 these
      1 nd     50 own  187 such  115 urine
      1 ne     48 say  181 help  112 flegm
      1 od     48 wil  179 will  110 honey
      1 ye     47 can  173 most  109 edges
                      ...            ...              ...

(Some of the 2-letter words, like 'al' and 'ly', must be hyphenation artifacts; others, like 'ad', may be from Latin quotations.)

All the best, --stolfi
This still shows shorter words are much more prevalent than longer words.
"Which" for example would be 17th overall though 1st in 5 letter words. Having "daiin" 1st overall then would be strange. 
Obviously other languages, I'd pick German as I know a little, have different properties. On average short words in German (from what I remember) are longer than English words, but even so, whatever length the short words are they are probably much more frequent than longer words. So I still think the question of why a 5 letter word is most frequent is interesting, and probably also suggests it is not 5 letters in length.
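
The "17th overall" figure can be checked against the Culpeper table quoted above. This is only a rough sketch: the counts are copied from that table (words with at least 600 occurrences), and since the table covers only 2 to 5 letter words, the rank is among those listed words, not among every word in the book.

```python
# Counts copied from the Culpeper table above (entries with >= 600 occurrences).
# The table lists only 2- to 5-letter words, so 1-letter words such as "a"
# and anything longer than 5 letters are not part of this ranking.
counts = {
    "of": 4691, "in": 3230, "to": 2432, "it": 2364, "or": 2334,
    "is": 2116, "as": 1100, "be": 779, "by": 649,
    "the": 10598, "and": 7019, "are": 1066, "for": 1034, "but": 650,
    "with": 1318, "that": 1068, "them": 840, "also": 683, "they": 622,
    "which": 661, "being": 618,
}

# Sort words by descending count and assign 1-based ranks.
ranked = sorted(counts, key=counts.get, reverse=True)
rank = {word: i for i, word in enumerate(ranked, 1)}

# Most frequent 5-letter word among the listed ones.
top_five_letter = next(w for w in ranked if len(w) == 5)

print(top_five_letter, rank[top_five_letter])  # which 17
```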
(22-03-2026, 01:20 AM)Bluetoes101 Wrote: suggests it is not 5 letters in length

The GC transliteration has daiin as being only three characters. It is because iin is so frequent, and almost always a suffix, that GC considered it to be one element. Likewise iirii almost never appears unless it is followed by n or r.
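
As a rough illustration of the counting (this is not the actual GC character table, just an assumed rule that merges "iin" into one unit; `split_units` is a made-up helper name):

```python
import re


def split_units(word, merged=("iin",)):
    """Split a word into units, treating each string in `merged` as one unit."""
    # Longer alternatives come first so they win over single characters.
    alts = sorted(merged, key=len, reverse=True)
    pattern = "|".join([re.escape(a) for a in alts] + ["."])
    return re.findall(pattern, word)


print(split_units("daiin"))       # ['d', 'a', 'iin']
print(len(split_units("daiin")))  # 3
print(len(split_units("chedy")))  # 5, nothing is merged under this toy rule
```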

( Why is it that people don't seem to be making more use of this transliteration? )
(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, this question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- this is just the most frequent word so it appears most frequently
- this is the word that most frequently results from a small (or zero) change from the recent words.


The frequency of daiin does not seem to be consistent across the language B sections. In fact most of the top words are not uniformly top. I tried to show this earlier.