The Voynich Ninja

Inside the Voynich Network: graph analysis
(31-10-2025, 03:33 PM)Jorge_Stolfi Wrote:
(30-10-2025, 11:52 PM)Kaybo Wrote: Could you try Portuguese for me?

Quimqu already tested [link]. Would those do?

The longer, the better. I can try with the files you gave me, Jorge, and let you know, even if they are quite short.
Quimqu, would you be interested in including the Rohonc Codex in your analysis?

If you don't know it, it's an old manuscript written in an unknown language, often compared to the Voynich Manuscript. In my work I try to show that it is meaningful and written in some crude, constructed language:

[link]

I can provide the transcription.
(31-10-2025, 05:49 PM)Rafal Wrote: Quimqu, would you be interested in including the Rohonc Codex in your analysis? [...]

Of course, please send me the transcription.
Quote: Of course, please send me the transcription.

Thanks a lot! I am really curious what your results will be.


I included my "transcription" in the attachment. Strictly speaking, it is not a proper transcription, since I assign word IDs (my own) instead of representing each original symbol with Latin letters. The problem is that the Rohonc Codex script is somewhat like Chinese, with about 800 different symbols, so writing it down in Latin letters is not easy. But I think what we have should work well with your algorithms.

When I worked with the text, it seemed a bit repetitive. I am curious whether your methods will spot that.
(31-10-2025, 09:05 PM)Rafal Wrote: Strictly speaking, it is not a proper transcription, since I assign word IDs (my own) instead of representing each original symbol with Latin letters.

Since you mention this point: the correct term for this is transliteration. And then it does not matter whether it is converted to Latin characters or code points.
Again, quite a dense post. So I will first summarize the findings; then, if you are interested, you can dig into the dense part of the post.

I have calculated normalized KPIs for Voynich texts, natural-language texts and generated texts. Across all metrics, the Voynich Manuscript behaves very differently from real languages. Its words tend to repeat in tight, predictable patterns instead of spreading naturally through the text. Unlike normal writing, where some words act as connectors or carry more weight, all Voynich words play almost the same role. The result feels organized and rule-based, but not like any language used for communication.

Now to the dense part.

To compare the Voynich Manuscript with ordinary texts, I first built a graph for each text. Nodes represent tokens and edges represent co-occurrences. However, many graph metrics depend on graph size. Larger graphs naturally have more edges, higher degree, and different path lengths, so direct comparison would be misleading.
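For concreteness, here is a minimal sketch of this kind of graph construction in Python with networkx (the window size and edge weighting are assumptions, not necessarily what was used for the plots):

Code:
import networkx as nx

def cooccurrence_graph(tokens, window=2):
    # Nodes are word types; an edge links two words that occur within
    # `window` positions of each other, weighted by how often they do.
    G = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u, v = w, tokens[j]
            if u == v:
                continue
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

tokens = "daiin chol shey daiin chol qokeedy".split()
G = cooccurrence_graph(tokens)
print(G.number_of_nodes(), G.number_of_edges())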

To fix this, I normalized all key performance indicators (KPIs). Each metric was divided by the value expected from a random or degree-preserving null model with the same number of nodes and links. This way, clustering coefficients, path lengths, modularity, and other values become comparable across texts of different sizes.
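A hedged sketch of that normalization step, using degree-preserving edge swaps as the null model (the exact null model and number of samples behind the plots may differ):

Code:
import networkx as nx

def null_expectation(G, metric, samples=10):
    # Average a metric over degree-preserving randomizations of G.
    vals = []
    for seed in range(samples):
        R = G.copy()
        nx.double_edge_swap(R, nswap=5 * R.number_of_edges(),
                            max_tries=500 * R.number_of_edges(), seed=seed)
        vals.append(metric(R))
    return sum(vals) / len(vals)

G = nx.karate_club_graph()  # stand-in for a word co-occurrence graph
C = nx.average_clustering(G)
C_rand = null_expectation(G, nx.average_clustering)
print("normalized clustering C/C_rand:", C / C_rand)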

After normalization, the indicators reflect the internal structure of each text rather than its length, allowing a fair comparison between the Voynich graph and the other texts.

So, here are the plots, which I will try to update in the coming days with the new texts that other Voynich Ninja members have asked me to analyze.

[attachment=11920]

This plot compares modularity with h1, the average next-word unpredictability per token. While modularity shows how strongly the text divides into clusters of words that frequently co-occur, h1 captures how much variation each word allows in what follows it. Natural languages occupy the center, combining moderate modularity with a healthy range of continuations per word, structured yet flexible. Artificial and shuffled texts show lower modularity and different h1, indicating looser or more random connections. The Voynich variants stand out, especially EVA A and EVA, with very high modularity and moderate h1. This means each Voynich word tends to appear in fixed, repetitive contexts rather than freely combining with others. In linguistic terms, the text behaves like a system of rigid word sequences, highly organized but locally predictable, suggesting a rule-based construction rather than natural language syntax.
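Reading h1 as the token-weighted conditional entropy of the next word (an interpretation of the description above, not necessarily the exact definition used), a small sketch:

Code:
import math
from collections import Counter, defaultdict

def h1(tokens):
    # Entropy of each word's next-word distribution, averaged over tokens.
    following = defaultdict(Counter)
    for w, nxt in zip(tokens, tokens[1:]):
        following[w][nxt] += 1
    total = weight = 0.0
    for counts in following.values():
        m = sum(counts.values())
        total += -sum((c / m) * math.log2(c / m) for c in counts.values()) * m
        weight += m
    return total / weight

print(h1("daiin chol daiin shey chol daiin chol".split()))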

[attachment=11919]

This plot compares modularity with H1, the global unigram entropy that reflects the diversity of the vocabulary and how evenly words are distributed. Modularity is the same as in the first plot, and we clearly see three outliers in terms of H1 entropy: both of Timm's texts (unshuffled and shuffled) have extreme H1 values, while Voynich EVA A has surprisingly low H1 entropy. This suggests that the Voynich A section is unusually repetitive at the token level, using a very restricted set of symbols or words compared to both natural and artificial texts. Its high modularity combined with such low entropy implies that its word co-occurrence network is extremely clustered, with strong internal repetition and limited cross-linking between word groups. In other words, Voynich A behaves like a tightly organized subsystem with recurrent local patterns rather than a flexible linguistic system.
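H1, as described here, is the plain Shannon entropy of the unigram distribution, e.g.:

Code:
import math
from collections import Counter

def H1(tokens):
    # Shannon entropy of the word-frequency distribution, in bits.
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

print(H1("daiin chol daiin shey".split()))  # 1.5 bits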

[attachment=11918]

This plot compares σ (the small-world index), which measures how efficiently a network balances local clustering and global reach, with C₍rand₎, the average clustering expected from a random network preserving the same degree distribution. Natural languages form two groups, one around a C₍rand₎ of 0.22 and the other around 2.8. Artificial or shuffled texts mostly stay near that range (especially Timm's text) or drop lower, showing disrupted structure. The Voynich variants, however, especially EVA A and EVA (total), reach the highest σ values (>4) while maintaining the lowest C₍rand₎, indicating extreme small-world organization: highly clustered local patterns far stronger than any random expectation. (Note that the Latin In Psalmum Expositio has values very similar to the Voynich.) This reinforces the idea that the Voynich text is internally coherent and self-reinforcing, producing a network that is both tightly knit and unusually segregated, unlike any natural or mechanically generated language in the comparison.
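For reference, σ is conventionally defined as (C/C_rand)/(L/L_rand), and networkx ships a direct implementation (its internal null model may differ in detail from the one behind these plots):

Code:
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a word co-occurrence graph
# sigma = (C / C_rand) / (L / L_rand); values > 1 suggest small-world structure
print(nx.sigma(G, niter=5, nrand=5, seed=1))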

[attachment=11917]

This plot compares the small-world index (σ) with L₍rand₎, the average path length expected from a random degree-preserving network. In linguistic terms, L₍rand₎ represents how easily any two words would connect by chance, while σ measures how much more efficiently the real text achieves both clustering and connectivity. Natural languages cluster around σ ≈ 3 and L₍rand₎ ≈ 2.5, showing a stable balance between local cohesion and global reach. Artificial texts like the Markov or Timm simulations vary but remain below that stability line. The Voynich variants, especially EVA A and both full Voynich texts (EVA and CUVA), again occupy the upper-right corner (together with the Latin In Psalmum Expositio), showing the highest σ values even at similar path lengths. This means the Voynich word network connects globally as efficiently as natural language, yet maintains far stronger local clustering. The result reinforces a consistent pattern: Voynich behaves like a hyper-small-world system, structurally optimized and internally repetitive, distinct from both human language and random models.

[attachment=11916]

This plot compares the resilience fraction to half, which measures how much of the network must be removed before its main component loses half of its nodes, with the Gini degree, which reflects inequality in how connections are distributed among nodes. The Voynich variants stand at the bottom of the resilience scale: their networks are more uniform but collapse faster when nodes are removed. This means the Voynich lacks dominant hubs yet depends heavily on many small, tightly interlinked groups (e.g. around daii, chol, etc.). In other words, its structure is homogeneous but fragile, reinforcing the picture of a highly ordered system with repetitive local patterns rather than a robust, hierarchically organized linguistic network.
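One way such a "fraction to half" score can be computed is sketched below, with a random removal order; the actual plots may use a different attack order (e.g. by degree):

Code:
import random
import networkx as nx

def fraction_to_half(G, seed=0):
    # Remove nodes at random until the giant component drops below half
    # its original size; return the fraction of nodes removed.
    rng = random.Random(seed)
    H = G.copy()
    half = len(max(nx.connected_components(H), key=len)) / 2
    order = list(H.nodes())
    rng.shuffle(order)
    for removed, node in enumerate(order, start=1):
        H.remove_node(node)
        if H.number_of_nodes() == 0:
            break
        if len(max(nx.connected_components(H), key=len)) < half:
            return removed / G.number_of_nodes()
    return 1.0

print(fraction_to_half(nx.karate_club_graph()))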

[attachment=11915]

This plot compares eigenvector concentration, which measures how much network influence is dominated by a few central nodes, with the Gini degree, which shows how unevenly connections are distributed. Spanish and Catalan occupy the upper area, with relatively high eigenvector concentration, meaning a few highly connected words (like articles or prepositions) anchor the text's structure. Artificial and shuffled texts are more dispersed, reflecting weaker central hubs. The Voynich variants (and the Latin In Psalmum Expositio) cluster at the bottom, with low eigenvector concentration, indicating that influence is spread evenly and no nodes dominate. This uniformity suggests that the Voynich word network lacks linguistic hierarchy: for example, there are no functional equivalents to "the" or "and." Instead, its connectivity is flat and homogeneous, consistent with a self-similar system where all tokens play structurally similar roles rather than forming the grammatical or semantic hierarchies typical of real language.
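A sketch of both axes follows; taking "eigenvector concentration" as the share of total eigenvector centrality held by the top ten nodes is an assumption about the exact definition:

Code:
import networkx as nx

def eigenvector_concentration(G, top=10):
    # Share of total eigenvector centrality held by the `top` nodes.
    vals = sorted(nx.eigenvector_centrality(G, max_iter=1000).values(),
                  reverse=True)
    return sum(vals[:top]) / sum(vals)

def gini(values):
    # Standard Gini coefficient of a list of nonnegative numbers.
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cum / (n * sum(xs)) - (n + 1) / n

G = nx.karate_club_graph()
print(eigenvector_concentration(G), gini([d for _, d in G.degree()]))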

From a linguistic perspective, the Voynich Manuscript behaves unlike any known language. Its words form tightly bound groups that repeat with remarkable regularity, showing internal consistency but little grammatical flexibility. Unlike natural languages, which rely on a few common function words to link diverse terms, the Voynich text distributes its words evenly, with no clear hierarchy or connectors. Its structure is highly organized and self-contained: words co-occur in predictable clusters, yet the text maintains coherence across the whole manuscript. This gives the impression of a system driven by formal rules rather than meaning: a closed combinatorial code where repetition and pattern replace syntax and semantics. In fact, this is what our intuition tells us when we see the same words repeating throughout the MS (qokeedy, daiin, etc.).

As always, any thoughts are welcome.
(01-11-2025, 12:16 AM)quimqu Wrote: I have calculated normalized KPIs for Voynich texts, natural-language texts and generated texts.

I am intrigued by how differently Vietnamese comes out in all those plots. It may be correct, but maybe it is something banal in the file format that is messing up the counting.

Just to clarify, the file I gave you uses VIQR, a pre-Unicode encoding that uses many funky special characters to indicate the many diacritics of Vietnamese spelling; except that some accents were remapped to avoid confusion with punctuation: "?"->"ß", "("->"µ", "."->"°". Here is a sample:

sao ddu+'c ye^su no'i vo+'i o^ng giaß nhu+ ta muo^'n no' o+ß la°i cho

to+'i khi ta dde^'n thi` vie^°c gi` dde^'n ngu+o+i pha^`n ngu+o+i ha~y
cu+' theo ta va^°y co' lo+`i ddo^`n dda°i giu+~a anh em la` mo^n ddo^`
a^'y kho^ng phaßi che^'t nhu+ng ddu+'c ye^su dda~ kho^ng no'i vo+'i

Everything in the sample above other than space is to be treated as a letter.
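In other words, tokenizing these files needs nothing more than a whitespace split:

Code:
line = "sao ddu+'c ye^su no'i vo+'i o^ng giaß nhu+ ta muo^'n no' o+ß la°i cho"
words = line.split()  # every non-space character is part of a word
print(len(words), words[:5])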

Here is a Vietnamese sample again (maybe the same, not sure). It is a translation of the New Testament, probably from English. This version is in the Latin-1 encoding, with upper case mapped to lower case and all (actual) punctuation removed. It is in 3 files because the whole is ~490'000 bytes and the forum limits each attachment to 200'000 bytes.

[attachment=11923]
[attachment=11924]
[attachment=11925]

If you would rather have it in some other format (UTF-8, one word per line, whatever) just let me know.

All the best, --stolfi
In case you are interested, here is also a text in Mandarin Chinese, transcribed in pinyin with numeric tones.  Sample:

ge4 wei4 ting1 zhong4 mei3 guo2 zheng4 fu3 jue2 ding4 jin4 yi1 bu4 dong4
jie2 mei3 guo2 jin4 chu1 kou3 yin2 hang2 xiang4 can1 yu4 zhong1 guo2
xiang4 mu4 de5 mei3 guo2 gong1 si1 ti2 gong1 de5 dai4 kuan3 zhong1 guo2
biao3 shi4 zhei4 xiang4 jue2 ding4 jiang1 you3 sun3 liang3 guo2 de5 mao4


It is a set of transcripts of Voice of America broadcasts from the late 1990s. Word characters are lowercase Latin letters plus "ü" and the digits 1-5, in the iso-latin-1 encoding. Again, all punctuation has been removed. Sentence and paragraph boundaries are not marked (that also holds for the previous Vietnamese samples). The total file is ~300'000 bytes and 59476 words, split in two because of forum limitations.

[attachment=11926]
[attachment=11927]
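A quick sanity check of the stated alphabet (lowercase Latin letters or "ü" followed by one tone digit; that every syllable ends in a tone digit is an assumption consistent with the sample):

Code:
import re

token_re = re.compile(r"^[a-zü]+[1-5]$")
sample = "ge4 wei4 ting1 zhong4 mei3 guo2 zheng4 fu3 de5".split()
print(all(token_re.match(t) for t in sample))  # True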

The language here is probably closer to current Mandarin Chinese, written by fluent reporters (rather than "Biblical" and translated from English, like the Vietnamese sample).  On the other hand it may be rather formulaic. For example, the names "China" (zhong1 guo2) and "United States" (mei3 guo2) may be way too common. 

All the best, --stolfi
Quote: Its words form tightly bound groups that repeat with remarkable regularity

I wonder how that can be, as the VM doesn't have repeated sentences and doesn't seem to show any regularity to the naked eye.

I can imagine such groups for example in prayers:
Quote:Mother of good counsel, pray for us.
Mother of our Creator, pray for us.
Mother of our Savior, pray for us.
Virgin most prudent, pray for us.
Virgin most venerable, pray for us.
Virgin most renowned, pray for us.

But the VM doesn't have such a thing.

Are these groups something more subtle and abstract? How do they work?
Does your method make it possible to find and show some concrete examples of such regularity?
(01-11-2025, 01:37 AM)Jorge_Stolfi Wrote: I am intrigued by how differently Vietnamese comes out in all those plots. It may be correct, but maybe it is something banal in the file format that is messing up the counting.

Hello Jorge, I work with words separated by dots. The Vietnamese file starts like this:

gia.phaß.ddu+'c.chu'a.ye^su.kito^.con.ddavi't.con.abraham.abraham.sinh.ysaac.ysaac.sinh.yaco^be^.yaco^be^.sinh.yudda.va`.ca'c.anh.em.o^ng.yudda.sinh.phare^.va`o.zara.bo+ßi.thamar.phare^.sinh.esro^m.esro^m.sinh.aram.aram.sinh.aminaddab.aminaddab.sinh.naasso^n

I limited it to 50,000 tokens. Please tell me if this looks correct or if I missed something.
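Roughly, that loading step might look like this (the file name is a placeholder, and Latin-1 encoding is assumed per Jorge's description):

Code:
def load_tokens(path, limit=50_000):
    # Words are separated by dots; keep at most `limit` tokens.
    with open(path, encoding="latin-1") as f:
        text = f.read()
    return [t for t in text.replace("\n", ".").split(".") if t][:limit]

# tokens = load_tokens("vietnamese_sample.txt")  # hypothetical file name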

(01-11-2025, 01:58 AM)Jorge_Stolfi Wrote: In case you are interested, here is also a text in Mandarin Chinese, transcribed in pinyin with numeric tones.

I will definitely put Chinese on the plots.