The Voynich Ninja

Full Version: Inside the Voynich Network: graph analysis
(26-10-2025, 12:09 PM)Mauro Wrote: Out of curiosity, can you run the same analysis on this text: [link]

It's been generated with an order-0 Markov chain (you may wish to remove the initial header, the four lines prefixed by %c%)

Hello Mauro,

Here are the results:

[attachment=11825][attachment=11826]
This weekend, I’ve been analysing and comparing the Voynich manuscript with other texts (natural languages and pseudo-Voynich texts like Torsten Timm's or Mauro's) with graphs, using the same network-analysis pipeline. The Voynich shows a clear internal structure: not random, but also not quite like any known language. Its patterns are organised and repetitive, with many words that appear in almost fixed combinations.
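For anyone who wants to try this, the starting point is simply a word co-occurrence graph built from the token stream. Here is a minimal sketch (simplified from the actual pipeline, with a co-occurrence window of up to 5 tokens; the function name is just for illustration):

Code:
import networkx as nx

def cooccurrence_graph(tokens, window=5):
    """Undirected graph: link every pair of words that appear
    within `window` tokens of each other."""
    G = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if w != v:
                G.add_edge(w, v)
    return G

All the metrics below are computed on graphs of this kind (and, later in the thread, on a directed adjacent-token variant).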

This post is a bit dense, so here is the short summary of the findings:

Using the same network-analysis method on the Voynich manuscript and several comparison texts (English, Latin, Catalan, and the pseudo-Voynich texts by Torsten Timm and Mauro), I calculated a range of graph metrics that describe how words connect to each other.

The first group of metrics measures how "language-like" the structure is: clustering, assortativity, modularity, and small-world behavior (I explain the metrics later in this post). These values place the Voynich close to real languages, showing that it is not random and that it follows consistent structural rules.

The second group of metrics goes deeper into how those structures behave. Here the Voynich stands apart: it is too repetitive, too stable, and too dependent on a few central word stems. These properties make it difficult to see it as a normal natural language. Instead, it looks like a highly organized but constrained system: something that imitates language patterns while being much more rigid than any known human tongue.

One possible explanation is that the Voynich text comes from a natural language that was transformed by some mechanical process. That would explain why its structure looks linguistic (it keeps the same kind of connections and clusters as real languages) but its variation is unusually low. The process could have been something like syllable substitution, compression, or a cipher that preserved the general patterns while hiding the original meaning.

Another option is that the text was generated mechanically from the start, following fixed templates or rules, without any underlying meaning. In that case it imitates the surface of a language but does not actually carry content.

The data do not tell us which of these two ideas is right, but they make one thing clear: the Voynich is not random. It is highly structured, yet too rigid to be a normal human language in its current form. Overall, the results are not new, but they reach the same conclusion by different means: the Voynich text is structured and consistent, yet far more rigid and repetitive than any known natural language.

--------------------

So, let's get to the dense part of the post. The first metrics are the following:

KPI | Voynich (full corpus) | Natural languages | Artificial (Timm / Pseudo) | Interpretation
Clustering coefficient C | 0.68 | 0.65-0.75 | 0.60-0.66 | High clustering is typical of structured, rule-based systems like natural languages.
Small-world sigma σ | 190 | 10-100 | 30-50 | Voynich network is extremely “small-world”: words are tightly interconnected.
Degree assortativity | -0.23 | -0.20 to -0.30 | ≈ 0 | Negative assortativity means frequent “hub” words link to many rarer ones, as in real languages.
Louvain modularity | 0.22 | 0.10-0.25 | 0.10-0.15 | Moderate modularity: clear internal communities but not extreme compartmentalisation.
Entropy-degree corr. (log1p) | 0.99 | 0.95-0.99 | 0.90-0.98 | High correlation shows that more connected words are also more contextually variable: a linguistic trait.
Mean next-word entropy (H_mean) | 0.65 bits | 0.9-1.2 bits | 0.3-0.7 bits | Voynich transitions are more predictable than in natural languages: more fixed patterns.
% of tokens with H = 0 | ≈ 70% | ≈ 45-55% | ≈ 70-85% | High proportion of deterministic transitions: a sign of templated or repetitive structure.
Type-Token Ratio (TTR) | 0.186 | 0.10-0.20 | 0.16-0.19 | Lexical diversity in the Voynich is within the normal linguistic range.
Max node degree (hub dominance) | Very high | High | Very high | A few words act as extreme hubs (like “daiin”, “ol”), stronger than in typical natural languages.

OK, let's take a look at the KPIs and what they mean:

- Clustering coefficient C: This tells us how much the neighbours of a word are also connected to each other. In normal language, related words tend to appear together in small clusters (like “herbal remedy” or “city walls”). High clustering means the text has local structure, not random word order.
- Small-world σ: This compares the graph to a random network. If σ is high, it means the network has short paths between words but still a lot of local clustering. Real languages usually have σ around 10-100. Random noise gives σ near 1. The Voynich text has σ around 190, which means it is very tightly connected.
- Degree assortativity: This measures whether highly connected words tend to link to other highly connected words. In real languages, frequent "function words" like "the", "and", "of" connect to many rare "content words", so the value is usually negative (around -0.2 to -0.3). If the value is close to zero or positive, it suggests a random or artificial system.
- Modularity: This shows how strongly the network splits into separate groups (called communities). In a language, these groups often correspond to topics, morpheme families, or stylistic units. Values around 0.1-0.25 are common in real texts. If the value is too low, there is no structure; too high, and the text is fragmented.
- Entropy-degree correlation: Entropy measures how predictable the next word is after a given word. When this correlates strongly with the number of connections (degree), it means words with more possible continuations are also less predictable, which is what happens in real language. A high correlation (close to 1) is a good sign of linguistic organization.
- Mean next-word entropy (H_mean): This is the average unpredictability of the next word. Higher entropy means more variation and freedom; lower entropy means repetitive or formulaic sequences. Normal languages have values around 1 bit. The Voynich text is closer to 0.6 bits, meaning it repeats many short patterns.
- Percent of tokens with H = 0: This is the percentage of words that always appear in the same exact context. In human language about half of the words are flexible. In the Voynich text, around 70% of words have fixed positions or neighbours.
- Type-Token Ratio (TTR): This measures lexical diversity: how many unique words there are compared to total words. If TTR is too low, the text repeats the same few words; if it is too high, it may be random. The Voynich value (≈0.18) is within the normal range for real languages.
- Max node degree (hub dominance): This tells us how many connections the most frequent word has. In normal texts, function words are hubs but not overwhelmingly dominant. In the Voynich, some words like daiin or ol are super-hubs that connect almost everywhere, suggesting a more constrained or mechanical pattern.
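For reference, these first-table KPIs can be computed roughly like this (a simplified sketch, not my exact code: it assumes the graph is connected, and uses analytic Erdős–Rényi baselines for σ instead of sampling random graphs):

Code:
import math
import networkx as nx

def language_likeness_kpis(G):
    """First-table KPIs on an undirected word co-occurrence graph."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    k = 2 * m / n                                # mean degree

    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)       # needs a connected graph
    C_rand = k / n                               # Erdos-Renyi expectations
    L_rand = math.log(n) / math.log(k)
    sigma = (C / C_rand) / (L / L_rand)          # small-world index

    r = nx.degree_assortativity_coefficient(G)
    comms = nx.community.louvain_communities(G, seed=42)
    Q = nx.community.modularity(G, comms)
    return {"C": C, "sigma": sigma, "assortativity": r,
            "modularity": Q, "communities": len(comms)}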

The first table listed the main indicators used to decide whether the Voynich behaves like a real language or a random system. But graphs can reveal much more. The next table shows additional network features that help refine the picture (things like how words form clusters, how stable those clusters are, and how dependent the text is on a few central tokens). These metrics confirm the same pattern: the Voynich text is structured, but also rigid and repetitive, unlike any known natural language.

Additional KPI | Voynich (full corpus) | Natural languages | Artificial (Timm / Pseudo) | Interpretation
Degree distribution shape | Steep (few very high hubs) | Zipf-like, smoother | Irregular or flatter | Voynich shows extreme hubs and fewer mid-frequency words → repetitive templates.
Betweenness centrality (bridges) | Few, dominated by same stems | Many bridges | Few, repetitive | Limited connectors → reduced syntactic flexibility.
Average path length | ≈ 3.5–4 | ≈ 2.5–3 | ≈ 3–4 | Slightly longer paths → weaker global connectivity.
Community size distribution | 1–2 large, many tiny | Balanced | Uneven | Dominated by few structural families → strong repetition.
Reciprocity (A↔B pairs) | Low | Moderate to high | Low | Mostly one-directional links → fixed word order.
Stability across sections | Very stable | Moderately variable | Stable | Constant KPIs → same mechanism throughout the text.
Motif frequency (loops / triads) | High number of closed loops | Mostly open triads | High loops | Repetitive short cycles typical of patterned generation.
Eigenvector centrality | Concentrated on few stems | Spread across function words | Concentrated | Lack of differentiated “functional vocabulary.”
Assortativity by entropy | Low (predictable words cluster) | Higher, mixed | Low | Predictable words link together → rigid local structure.
Network resilience | Low (breaks after few removals) | High (robust) | Low | Strong dependency on few central hubs → mechanical organisation.

These are the explanations for the KPIs:

- Degree distribution shape: This describes how many words have few connections versus many. Natural languages follow a Zipf-like pattern: most words are rare, and a few are very common. The Voynich follows a similar overall trend, but the drop-off is steeper: the most frequent words are too dominant, and middle-frequency words are fewer than expected. This points to a system with strong templates or repetitive affixes.
- Betweenness centrality: This measures how often a word acts as a "bridge" between otherwise separate clusters. In normal language, conjunctions or pronouns often fill this role (e.g., and, that, which). In the Voynich, only a handful of tokens have high betweenness, and most are the same recurring stems (chedy, daiin, ol). That suggests less syntactic flexibility and fewer multi-purpose connectors.
- Average path length: This is the average number of steps needed to go from one word to another through their connections. Real languages tend to have short path lengths (because of function words). The Voynich’s path length is slightly longer, meaning that its “grammar bridges” are weaker and the network is less efficiently connected at the global scale.
- Community size distribution: This shows whether the text’s thematic or morphological clusters are evenly sized or dominated by a few. Natural languages have many medium-sized communities. The Voynich network has one or two very large communities and many tiny ones: a pattern of strong repetition within a few structural groups rather than balanced thematic diversity.
- Reciprocity: This checks how often two words co-occur in both directions (A→B and B→A). High reciprocity suggests a flexible, bidirectional syntax (as in natural language). The Voynich has low reciprocity: connections mostly go one way, reinforcing the idea of fixed or formulaic order.
- Temporal or positional variation: If you divide the text into sections, you can see whether the network KPIs stay stable. In real language, KPIs fluctuate slightly with topic. In the Voynich, they are remarkably constant, which hints that the same mechanism or rule set produced the whole manuscript.
- Motif frequency (triads and loops): Counting recurring small subgraphs (like word triplets) shows internal patterns. Natural languages have many asymmetric triplets (A→B→C, but not the reverse). The Voynich graph has more closed and repeated loops, another sign of formulaic or patterned text generation.
- Eigenvector centrality: This tells how much a word is connected to other well-connected words. In normal language, function words like the or of dominate. In the Voynich, the top eigenvector words are the same repetitive stems that dominate all metrics, which reinforces the impression of limited functional differentiation.
- Assortativity by entropy: Instead of linking by degree, this looks at whether predictable words connect with other predictable words. In the Voynich, low-entropy words cluster together, again suggesting rigid sequences rather than flexible syntax.
- Network resilience: If you remove the most connected words, does the network fall apart? In natural language, the network remains mostly connected (because redundancy is high). In the Voynich, removing a few central tokens breaks the network quickly, which shows dependency on a small set of structural hubs.
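As an example of the last item, the resilience test can be sketched roughly like this (a simplified version of the idea, not my exact code; the number of removed hubs is arbitrary here):

Code:
import networkx as nx

def resilience(G, n_remove=10):
    """Fraction of nodes still in the giant component after
    removing the n_remove highest-degree hub words."""
    H = G.copy()
    hubs = sorted(H.degree, key=lambda kv: kv[1], reverse=True)[:n_remove]
    H.remove_nodes_from(w for w, _ in hubs)
    if H.number_of_nodes() == 0:
        return 0.0
    giant = max(nx.connected_components(H), key=len)
    return len(giant) / G.number_of_nodes()

A high value after removal means redundancy (natural-language behaviour); in the Voynich graph this value drops quickly.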
Your reciprocity is at odds with Marke Fincher's (2008) findings on word reversibility, where he found that Voynichese has weaker word order (i.e. higher reciprocity) than several natural languages.
What happens if you change the window to only directly adjacent tokens? (Or pairs of tokens)? What if you make the edges directed? Do you see anything new?
(27-10-2025, 01:08 AM)RadioFM Wrote: Your reciprocity is at odds with Marke Fincher's (2008) findings on word reversibility, where he found that Voynichese has weaker word order (i.e. higher reciprocity) than several natural languages.

Fincher’s word reversibility looks at whether a word pair appears in both orders in the text (A B and B A). It measures flexibility in word order and found that Voynichese allows more reversals than normal languages.

My reciprocity is different: it works on the co-occurrence network, not on word order. It checks whether two words are linked in both directions within the network of words that tend to appear together. Low reciprocity means that most associations go one way only, suggesting fixed or mechanical patterns rather than flexible syntax.
(27-10-2025, 07:15 AM)stopsquark Wrote: What happens if you change the window to only directly adjacent tokens? (Or pairs of tokens)? What if you make the edges directed? Do you see anything new?

When we limit the window to directly adjacent tokens and make the graph directed, we measure something closer to real word order instead of general co-occurrence:

Text | Reciprocity | Clustering C_obs | Small-world σ | Modularity | Communities | H_mean (bits) | Frac H = 0
Voynich MS | 0.102 | 0.0766 | 114.18 | 0.332 | 38 | 0.64 | 0.72
Tirant lo Blanch (Cat.) | 0.107 | 0.0384 | 23.72 | 0.289 | 13 | 1.01 | 0.53
Culpeper (Eng.) | 0.142 | 0.0429 | 26.77 | 0.281 | 18 | 1.19 | 0.46
Alchemical Herbal (Lat.) | 0.056 | 0.0229 | 8.54 | 0.494 | 17 | 0.57 | 0.66
Timm generated | 0.119 | 0.0762 | 25.32 | 0.303 | 14 | 0.97 | 0.54
Mauro pseudo-Voynich | 0.111 | 0.0881 | 10.67 | 0.245 | 27 | 0.87 | 0.60

The main differences are:
- Reciprocity becomes clearer. Natural languages have slightly higher reciprocity (around 0.10–0.14) because word order is flexible: A → B sometimes also appears as B → A. In the Voynich, reciprocity stays low (0.10), meaning word pairs almost never reverse. But note that the lowest is the Alchemical Herbal in Latin. *
- Entropy drops a lot. More than 70% of Voynich tokens have only one possible follower (H = 0), while in normal texts it is about half. This suggests mechanical repetition rather than grammatical structure. Again, the Alchemical Herbal is not so far away from the Voynich, and has an even lower entropy.
- The small-world property remains but becomes extreme. Voynich has σ around 114, while natural languages range between 8 and 30. This means there are many short, repetitive loops, typical of a templated system.
- Community structure changes too. Natural texts form a bit more than a dozen broad clusters, but the Voynich splits into many small, tight ones. It still shows order, but it is too fragmented.

When using directed, adjacent links, the Voynich still looks organized, but its structure is too rigid and repetitive compared to real languages. This supports the idea that it is not a normal language but a constrained, procedural system that imitates word sequences without real syntax.
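Concretely, the directed adjacent-token graph and the entropy figures above can be computed along these lines (a simplified sketch, not the exact pipeline; here the H = 0 fraction is counted over word types with a single observed follower):

Code:
import math
from collections import Counter, defaultdict
import networkx as nx

def directed_bigram_metrics(tokens):
    """Directed graph of adjacent tokens, plus next-word entropy,
    reciprocity, and the fraction of types with H = 0."""
    D = nx.DiGraph()
    followers = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        D.add_edge(a, b)
        followers[a][b] += 1

    def H(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    ent = {w: H(c) for w, c in followers.items()}
    frac_h0 = sum(1 for h in ent.values() if h == 0.0) / len(ent)
    h_mean = sum(ent.values()) / len(ent)
    return D, nx.reciprocity(D), h_mean, frac_h0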

* To make it clear with regard to RadioFM's comment above and my reply: my measure of reciprocity is not the same as Fincher’s "word reversibility". Fincher looked at how often a word pair A → B also appears as B → A somewhere in the text. That is a frequency-based metric, tied to the linear sequence of the text. In my case, reciprocity in this post is calculated on a directed graph of bigrams. Each word is a node, and there is a link A → B if B ever follows A at least once. Reciprocity here measures the topological symmetry of that network (the percentage of links that are bidirectional), not how frequently reversals occur. So Fincher’s measure reflects how often reversals happen, while mine reflects whether such connections exist at all. They are related but not equivalent, and they can move in opposite directions in repetitive or highly structured texts like the Voynich.
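To make the difference concrete, the two measures can be sketched side by side (my reconstruction of Fincher's metric is approximate, based only on the description above):

Code:
from collections import Counter
import networkx as nx

def reversibility_frequency(tokens):
    """Fincher-style (approximate): share of bigram occurrences
    whose reversed pair also occurs somewhere in the text."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hits = sum(n for (a, b), n in bigrams.items() if (b, a) in bigrams)
    return hits / sum(bigrams.values())

def reciprocity_topological(tokens):
    """This post's measure: share of directed edges A -> B whose
    reverse edge B -> A exists, regardless of frequency."""
    D = nx.DiGraph()
    D.add_edges_from(zip(tokens, tokens[1:]))
    return nx.reciprocity(D)

If one frequent pair reverses very often, the first measure rises while the second barely moves; that is how the two can point in opposite directions on a repetitive text.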
When you analyse the Voynich Manuscript as a network, you can measure things that ordinary cryptographic tests don't normally touch: reciprocity, clustering, modularity, assortativity, small-world index... These values describe how words connect to one another, not how often they appear. If the text were a simple cipher (monoalphabetic, polyalphabetic, or transposition), those structures would be almost flat. In encrypted writing, the relationships between words are destroyed: clustering drops to the level of random noise, modularity disappears, the network stops being small-world, and reciprocity goes close to zero. None of that happens in the Voynich data.

The Voynich graph shows a moderate reciprocity (about 0.10), strong clustering (0.076 compared to 0.0009 for a random graph), a very high small-world index (around 114), negative assortativity, and modularity near 0.32 with 30-plus lexical communities. These are all signatures of a structured language system. The entropy is moderate: not random, not rigid. The Voynich text behaves like human language in network space.

From these results, three interpretations remain possible:

First, it could be a real but unknown language with unusually strong morphological regularity. The network structure looks natural: high modularity and clustering suggest syntax and phrase boundaries, while the negative assortativity shows the presence of functional “hub” words linking content words. The repetitive affixes could belong to a genuine inflectional system, perhaps from a lost or artificial dialect that never developed full orthographic variation. In my opinion, strange, but feasible.

Second, it might be an invented but rule-driven system that behaves like a proto-language. The graph shows the kind of pattern expected in early or simplified linguistic systems: many unique word forms built from a small set of stems, frequent local repetition, and short connection paths between tokens. The result is a network that mimics the topology of a real language but with less semantic depth, consistent with an experiment in structured speech or a semi-linguistic code. A sort of proto-Esperanto, a deliberately basic but functional invented language.

Third, it could represent a mnemonic code: a designed scheme to store or recall knowledge through structured word forms. In that case the recurring prefixes and suffixes act as markers for categories or concepts. The strong clustering and modularity indicate grouped ideas, while the small-world property would make the network easy to navigate in memory. The entropy and redundancy match what you would expect from a system intended for memorisation, not encryption. This would also fit well with cabalistic or alchemical contexts, and with the thematic organisation of the Voynich, which notably lacks theological content.

Each of these possibilities fits the observed network properties. What the graph makes very unlikely is that the Voynich Manuscript is a random or straightforward cipher. The structure is too human, too internally consistent, and too optimised for association to be pure disguise.
I hope I may ask a question that strays somewhat from the topic, but which is currently on my mind: 4. Could it also be that the text lost a lot of ‘information’ during a ‘clumsy’ copying process (from an older text)? For example: the text originally had a much higher number of different glyphs, which were massively reduced to a small standard selection during copying. For various reasons, I would estimate that this standardisation reduced the number of different glyph types to approximately 30-40 per cent of the original.
It’s an interesting idea, but there are problems: the structure stays uniform across all sections, and a massive reduction of symbols would normally destroy the regular grammar we still see. So while not impossible, it seems more likely to me that the Voynich was designed with its limited glyph set from the start.
Well, my compliments to quimqu.

His is the clearest demonstration I've seen that long-range ('long' being up to 5 words apart in this case, if I understood correctly) interactions exist and are important for the VMS. As he says, this has been proposed and tested before, but what finally convinced me 100% is this graph, which he kindly sent me and which includes a data point for a randomly shuffled VMS, where it falls quite far from the real VMS:

[attachment=11835]
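As I understand it, the shuffled control amounts to the following (my own sketch, not quimqu's actual code): keep the exact word frequencies, destroy all order, and re-run the same pipeline.

Code:
import random

def shuffled_control(tokens, seed=0):
    """Same vocabulary and word frequencies as the original text,
    but with all word order destroyed (close to an order-0 Markov
    sample without replacement)."""
    out = list(tokens)
    random.Random(seed).shuffle(out)
    return out

Running the graph pipeline on the shuffled tokens should drive clustering toward the random baseline and σ toward 1, which is where the shuffled-VMS point falls in the plot.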


After this work, the possibility that the VMS is just mechanical random gibberish (equivalent to an order-0 or a low-order Markov chain) becomes, in my mind, (much) more improbable.


The same graph also brings me to re-evaluate Torsten Timm's work, because it undeniably plots 'near' the VMS, while its shuffled version is far away. This does not mean I fully endorse Timm's theory, because there are texts in natural languages which are near the VMS too and, imho, because it does not convincingly explain the other VMS peculiarity, the word structure; but it surely rises in plausibility.

Many other hypotheses remain in place (quimqu listed some); I won't annoy you with the n-th rehashing of a long list, but at least I see some progress.

PS: it might be interesting to analyze some more texts in natural languages. I find it strange to see texts in the same language split into two different groups (with the Voynich in the middle). Analyzing more texts/languages would clarify whether this is a real feature (which would be an interesting discovery by itself, I guess) or not, and in any case better determine where the VMS stands vs. natural-language texts.