The Voynich Ninja

Inside the Voynich Network: graph analysis

(01-11-2025, 11:08 AM)Rafal Wrote: Are these groups something more subtle and abstract? How do they work?

The apparent contradiction comes down to which level of regularity is being measured.

When I say that Voynich words form tightly bound groups, I don't mean literally repeated sentences or phrases, as you wrote. Instead, the regularity appears in the network structure of word co-occurrence (remember, in this thread I am always talking about networks, i.e. links between words).


I built a graph where each node is a word (token) and edges link words that often appear near each other (within a 5-word window). The Voynich graph splits into small, densely connected clusters. Inside each cluster, the same few words tend to appear near each other over and over, but not necessarily in identical sequences. Between clusters, links are rare. That gives the graph high modularity, meaning strong internal cohesion but little cross-talk between groups.
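
For concreteness, here is a minimal sketch of that kind of pipeline in Python, assuming networkx and a dot-separated token file; the file name, the edge-weighting scheme and the greedy modularity algorithm are illustrative choices on my part, not necessarily the exact ones used:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def build_cooccurrence_graph(tokens, window=5):
    # link every pair of distinct words that appear within a 5-word window
    G = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if w == v:
                continue
            if G.has_edge(w, v):
                G[w][v]["weight"] += 1
            else:
                G.add_edge(w, v, weight=1)
    return G

tokens = open("voynich_eva.txt").read().split(".")  # hypothetical dot-separated file
G = build_cooccurrence_graph(tokens)
communities = greedy_modularity_communities(G, weight="weight")
print("modularity:", modularity(G, communities, weight="weight"))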


So the “groups” are not obvious to the naked eye, because they’re combinatorial patterns, not repeated sentences. For example, a cluster might contain words like {chol, shol, qokedy, qokeedy, shedy}, which often occur in similar local contexts, even if never in the exact same order.

(01-11-2025, 11:08 AM)Rafal Wrote: Does your method allow you to find and show some concrete examples of such regularity?

Yes, let me work with it and come back with examples.
OK, so here are the results with Rohonc, Old Portuguese, Portuguese and Chinese. I also added a part of the English version of the Materia Medica of Dioscorides.

(Jorge: I think the Portuguese texts are too short, and that's why they look a bit like outliers (high modularity, low H1 entropy...): they have only about 4000 words, and I usually work with between 20k and 50k. Even if I normalize the KPIs, graphs that differ so much in size will, I think, generate these differences.)

In the following plots I removed the Voynich versions (A, B, shuffled, CUVA) and the generated text, as we have those results in my previous post. So you can compare the results for natural-language texts with Voynich EVA.

[attachments: six comparison plots]
(01-11-2025, 11:08 AM)Rafal Wrote: Does your method allow you to find and show some concrete examples of such regularity?

Hi Rafal,

here are the highest-weighted subgraphs (or frequent token sets), just showing the top 15 members each (a sketch of one way to extract them follows the list):

Community 0: ['chdaiin', 'chdal', 'chdam', 'qokam', 'okody', 'opar', 'okair', 'okeeol', 'olkey', 'chs', 'qokeeey', 'qokshedy', 'pchedy', 'qotchedy', 'qokair']
Community 1: ['dcheey', 'keeey', 'opaiin', 'qopchey', 'keeody', 'qoteody', 'chotaiin', 'shedaiin', 'ched', 'qopchedy', 'opchdy', 'kal', 'keey', 'otey', 'pcheol']
Community 2: ['oram', 'shod', 'olkam', 'yshey', 'aral', 'oaiin', 'okshy', 'daim', 'otair', 'araiin', 'qoeedy', 'olkaiin', 'chekaiin', 'oraiin', 'kain']
Community 3: ['okary', 'opal', 'chain', 'chkain', 'chedal', 'okees', 'otody', 'ykar', 'ytchey', 'otchedy', 'tshedy', 'ytar', 'ytor', 'odar', 'cthey']
Community 4: ['do', 'ykol', 'chockhy', 'teody', 'ctheol', 'okeol', 'aldy', 'dary', 'okeos', 'ykeody', 'dl', 'ral', 'chky', 'sh', 'sheos']
Community 5: ['chom', 'tchor', 'cphol', 'teol', 'cthody', 'oees', 'kchor', 'sheod', 'ky', 'ty', 'dan', 'shocthy', 'qoteol', 'shee', 'sy']
Community 6: ['oteeos', 'yteody', 'oain', 'sam', 'shes', 'oteody', 'oteos', 'shodaiin', 'kair', 'qokeeo', 'shar', 'shedal', 'tal', 'rol', 'cheal']
Community 7: ['cheeody', 'qokeeody', 'qokeody', 'laiin', 'okeeody', 'qody', 'shdar', 'qockhey', 'doiin', 'oldy', 'qockhy', 'pchedar', 'sheody', 'qokchey', 'ockhy']
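
(A possible way to reproduce such lists, assuming the graph G and communities from the sketch earlier in the thread: rank each community's members by weighted degree inside the cluster. The ranking criterion is an illustrative assumption, not necessarily the one used.)

for i, comm in enumerate(communities):
    sub = G.subgraph(comm)
    # strongest members = highest weighted degree within the community
    top = sorted(sub, key=lambda v: sub.degree(v, weight="weight"), reverse=True)[:15]
    print(f"Community {i}: {top}")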


The following are hubs with very strong neighbors (a sketch of one way to extract them follows the list):

otchol → ['chom', 'cphol', 'ctho', 'cthol', 'dchey', 'qopchy']
shodaiin → ['ctho', 'qokcho', 'qokor', 'rol', 'sam', 'yteody']
kchy → ['dchol', 'dchy', 'otchy', 'qotchol', 'ykchy', 'ytol']
chaiin → ['chan', 'kchey', 'qotchol', 'shaiin', 'ykain']
sam → ['oain', 'shes', 'shodaiin', 'yteody']
do → ['chky', 'qotor', 'sh', 'ykol']
chdaiin → ['chdal', 'ckhy', 'qokchdy', 'ykar']
cthody → ['ctho', 'oees', 'qokchol', 'teol']
araiin → ['lkaiin', 'olkaiin', 'opchedy', 'qoeedy']
chan → ['chaiin', 'char', 'cthor', 'tchor']
oteody → ['oar', 'oteos', 'ykeeody', 'ykeody']
shaiin → ['chaiin', 'cthar', 'cthy', 'soiin']
kchol → ['choky', 'ctheey', 'otchor', 'ytchy']
otody → ['okees', 'okody', 'otchdy', 'qokair']
tchey → ['shes', 'shky', 'tal', 'yty']
otchedy → ['odar', 'sheos', 'tshedy', 'ytchedy']
odar → ['cheodaiin', 'okees', 'olkaiin', 'otchedy']
ary → ['cham', 'okees', 'otair', 'sal']
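
(A sketch of how such hub lists might be extracted, again assuming the graph G from the earlier sketch; the degree threshold and the "top edges by weight" criterion are illustrative assumptions.)

def strong_neighbors(G, min_degree=20, top_k=6):
    hubs = {}
    for node in G:
        if G.degree(node) < min_degree:  # only well-connected hubs
            continue
        by_weight = sorted(G[node], key=lambda v: G[node][v]["weight"], reverse=True)
        hubs[node] = sorted(by_weight[:top_k])  # strongest co-occurrence partners
    return hubs

for hub, nbrs in strong_neighbors(G).items():
    print(hub, "->", nbrs)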


There are very few triads (triangles in the graph) and no cliques of more than 3 tokens. The triads that do exist are strong and cohesive: NPMI values of 1.244, 1.000 and 0.976 indicate frequent, above-chance co-occurrence (a note on the NPMI normalization follows the list).

1.244 ('sam', 'shodaiin', 'yteody')
1.000 ('chom', 'cthol', 'otchol')
0.976 ('ldy', 'qopchdy', 'shear')
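
(For reference: pairwise NPMI is log(p(x,y) / (p(x)p(y))) divided by -log p(x,y), which is bounded above by 1, so a triad score of 1.244 presumably comes from a multi-way generalization over the whole triangle. A minimal pairwise sketch, assuming window-based counts:)

import math

def npmi(pair_count, count_x, count_y, total_windows):
    # probabilities estimated from co-occurrence counts over all windows
    p_xy = pair_count / total_windows
    p_x, p_y = count_x / total_windows, count_y / total_windows
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)  # normalized to (-1, 1] for pairs
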
(01-11-2025, 11:54 AM)quimqu Wrote: Hello Jorge, I work with words separated by a dot.

That's okay; the encoding does not use the dot as a letter.

Are you sure that the other special characters [µß^~+`'°] are not deleted, remapped, or interpreted as separators at some point?

Note that chaos ensues if a program reads an ISO-Latin-1 file (like my samples) containing those non-ASCII characters while expecting it to be Unicode in UTF-8 (which is practically the norm nowadays)...
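
(To illustrate the failure mode in Python: strict UTF-8 decoding of ISO-Latin-1 bytes fails outright, and lenient decoding silently garbles exactly those characters.)

raw = "µß°".encode("latin-1")  # bytes as they would sit in the sample files
try:
    raw.decode("utf-8")  # strict UTF-8 decoding raises immediately
except UnicodeDecodeError as e:
    print("decode error:", e)
print(raw.decode("utf-8", errors="replace"))  # lenient decoding yields U+FFFD mojibake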

All the best, --stolfi
(01-11-2025, 02:32 PM)quimqu Wrote: Jorge: I think the Portuguese texts are too short and that's why they look a bit like outliers

OK, I can provide longer samples.  

The three previous samples were from the same novel in three languages/encodings, but taking alternating chapters, so that there was no overlap between the three samples.  Shall I do the same, but with longer samples? Or should I just send you the three versions of the whole novel?

All the best, --stolfi
Quimqu, if it is possible, could you comment on the results for the Rohonc Codex? That would be particularly interesting for me.

It seems quite extreme on some dimensions but I'm not sure how to interpret these results.
(Yesterday, 01:59 PM)Rafal Wrote: Quimqu, if it is possible, could you comment on the results for the Rohonc Codex? That would be particularly interesting for me.

It seems quite extreme on some dimensions but I'm not sure how to interpret these results.

I assume that each number separated by a space is a "word". With these results, the Rohonc text behaves very differently from real writing. When you study how its words appear next to each other, the pattern looks too regular. In normal languages, words mix in many ways. Some are common, some are rare, and together they make a large and uneven web of connections. Rohonc's web is much smaller and tighter. The same words keep returning in the same short sequences.

If you read it as language, it feels like a few phrases are being repeated over and over with small changes. Word pairs often appear in both directions, which almost never happens in natural texts. A few very common words link almost everything, while the rest depend on them. That gives the whole structure a rigid, formula-like shape instead of the messy balance seen in human speech or writing.

This kind of pattern might come from a cipher, a ritual formula, or simply an invented system that imitates language. Even compared with the Voynich manuscript, which already looks unnatural, Rohonc's structure is too neat and too repetitive. It has the shape of writing but not the behaviour of real words.

Here are the main KPI results, commented (a computation sketch follows the list):

Vocabulary size (nodes): Rohonc has far fewer distinct words than normal texts. Its network is small, meaning the same words are used again and again instead of a wide vocabulary.

Clustering coefficient (C): It is much higher than usual. Words that appear together tend to form tight closed groups, showing fixed repeated phrases instead of flexible grammar.

Reciprocity: It is about three times higher than in normal languages. That means word pairs appear very often in both directions (“A B” and “B A”), which is rare in natural texts.

Type–Token Ratio (TTR): Very low. Few unique words compared to total word count. It confirms strong repetition.

Assortativity (assort_degree): Negative and stronger than average. Rare words link mostly to very common ones, giving the network a hub-and-spoke structure rather than the balanced pattern of natural language.

Betweenness centrality and eigenvector concentration: Both are high. A few central words control most connections, showing a rigid core instead of a spread of shared links.

Path length (L): Shorter than normal. Words are close to each other through those central hubs, another sign of repetitive structure.

Entropy measures (H_mean, H1): They show the text has moderate unpredictability but still less variety than human writing.
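
(A sketch of how such KPIs can be computed with networkx, assuming the graph G and token list from the earlier sketch; the exact entropy and normalization used in the thread are not spelled out, so these are stand-in definitions.)

import math
import networkx as nx
from collections import Counter

D = nx.DiGraph()  # directed bigram graph, needed for reciprocity
for a, b in zip(tokens, tokens[1:]):
    D.add_edge(a, b)

counts = Counter(tokens)
total = sum(counts.values())
giant = G.subgraph(max(nx.connected_components(G), key=len))

kpis = {
    "nodes": G.number_of_nodes(),  # vocabulary size
    "C": nx.average_clustering(G),  # clustering coefficient
    "reciprocity": nx.reciprocity(D),  # share of "A B" pairs also seen as "B A"
    "TTR": len(counts) / total,  # type-token ratio
    "assort_degree": nx.degree_assortativity_coefficient(G),
    "L": nx.average_shortest_path_length(giant),  # on the largest component
    "max_betweenness": max(nx.betweenness_centrality(giant).values()),
    "H1": -sum(c / total * math.log2(c / total) for c in counts.values()),
}
print(kpis)
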
Thanks!

From my experience with the Rohonc Codex I can confirm your observations. Yet, I am totally convinced that it is a meaningful text.

Some comments:

Rohonc has far fewer distinct words than normal texts
True, it is a constructed language without synonyms and with a limited vocabulary. It also has no declension at all.

Words that appear together tend to form tight closed groups, showing fixed repeated phrases instead of flexible grammar.
Yes, it is repetitive. Some sentences/patterns like "Jesus said to the disciples" or "Written by NN in the Nth chapter" are repeated many times, and some longer fragments are also repeated.

That means word pairs appear very often in both directions (“A B” and “B A”), which is rare in natural texts.
Yes, from my observation it has a loose word order. But some natural languages (like the Slavic ones) also have loose word order, so I guess it really depends on the language.

A few central words control most connections, showing a rigid core instead of a spread of shared links.
Yes, over 50% of sentences begin with the word "and".

My conclusion is that something may not look like a real text at the statistics level, yet still be a real text in some exotic language with specific content :)
I also noticed that on many diagrams the Rohonc Codex behaves similarly to some Chinese texts.
Hello quimqu!
Though I unfortunately do not understand much of your work, I admire your enthusiasm!

I was surprised by the differences between Voynich A and B. Is it possible to run a test on a more fine-grained distinction of Voynichese, like Rene's RZ languages? Or is the sample size too small? At least a distinction between A, B, and C would be interesting. A has 12273 words, B has 19043, and C has 6030 words.

Do you think labels should be excluded? The labels in drawings probably behave differently and/or are not related.

Does the transliteration have an effect? Have you tested different VM transliterations?

And have you used similar text length in all samples? Does that have an impact?
(7 hours ago)Bernd Wrote: Hello quimqu!
Though I unfortunately do not understand much of your work, I admire your enthusiasm!

I was surprised by the differences between Voynich A and B. Is it possible to run a test on a more fine-grained distinction of Voynichese, like Rene's RZ languages? Or is the sample size too small? At least a distinction between A, B, and C would be interesting. A has 12273 words, B has 19043, and C has 6030 words.

Do you think labels should be excluded? The labels in drawings probably behave differently and/or are not related.

Does the transliteration have an effect? Have you tested different VM transliterations?

And have you used similar text length in all samples? Does that have an impact?

Hello Bernd,

thank you for your support. I truly believe that data science can give us some help with the Voynich language.

Regarding your questions: 

Rene's A and B could fit, but C has a very low number of words. The problem is that even if I normalize, small corpora like C give overly simple graphs, which then tend to be plotted as outliers (and maybe they aren't). I am afraid that if I test it, I may get results that are not comparable with the longer texts. So I think I shouldn't, in order not to give false results.

I have not excluded labels, even though I find it logical to exclude them. I can test with and without the labels and see how different the graphs are.

In my earlier plots you will see EVA and CUVA. They are not so far from each other. You can also see Currier's A and B, which are quite different.

And yes, text length has an impact. As I said before, I normalize the KPIs to try to be independent of text length, but, as said, smaller corpora give lighter graphs, and even normalized they are not that comparable. So I tried to keep to graphs built from corpora of between 20,000 and 50,000 words.
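
(One pragmatic complement to KPI normalization, sketched under the assumption that corpora are plain token lists: cut every corpus to the same token count before building the graph. The 20,000-token target and the random contiguous slice are illustrative choices.)

import random

def equal_length_sample(tokens, n=20_000, seed=0):
    if len(tokens) < n:
        raise ValueError("corpus too short for a comparable graph")
    random.seed(seed)
    start = random.randrange(len(tokens) - n + 1)
    return tokens[start : start + n]  # contiguous slice, not always the opening pages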

Thanks again!