This weekend, I’ve been analysing the Voynich manuscript and comparing it with other texts (natural languages and pseudo-Voynich texts like Torsten Timm's or Mauro's) as graphs, running everything through the same network-analysis pipeline. The Voynich
shows a clear internal structure: not random, but also not quite like any known language. Its patterns are organised and repetitive, with many words that appear in almost fixed combinations.
This post is a bit dense, so here is the short summary of the findings:
Using the same network-analysis method on the Voynich manuscript and several comparison texts (English, Latin, Catalan, and the pseudo-Voynich texts by Torsten Timm and Mauro), I calculated a range of graph metrics that describe how words connect to each other.
The first group of metrics measures how "language-like" the structure is: clustering, assortativity, modularity, and small-world behavior (I explain the metrics later in this post). These values place the Voynich close to real languages, showing that it is not random and that it follows consistent structural rules.
The second group of metrics goes deeper into how those structures behave. Here the Voynich stands apart: it is too repetitive, too stable, and too dependent on a few central word stems. These properties make it difficult to see it as a normal natural language. Instead, it looks like a highly organized but constrained system: something that imitates language patterns while being much more rigid than any known human tongue.
One possible explanation is that the Voynich text comes from a natural language that was transformed by some mechanical process. That would explain why its structure looks linguistic (it keeps the same kind of connections and clusters as real languages) but its variation is unusually low. The process could have been something like syllable substitution, compression, or a cipher that preserved the general patterns while hiding the original meaning.
Another option is that the text was generated mechanically from the start, following fixed templates or rules, without any underlying meaning. In that case it imitates the surface of a language but does not actually carry content.
The data do not tell us which of these two ideas is right, but they make one thing clear: the Voynich is not random. These results are not new, but they reach a familiar conclusion by different means: the Voynich text is structured and consistent, yet far more rigid and repetitive than any known natural language.
--------------------
So, let's get to the dense part of the post. The first set of metrics is the following:
| KPI | Voynich (full corpus) | Natural languages | Artificial (Timm / Pseudo) | Interpretation |
|---|---|---|---|---|
| Clustering coefficient C | 0.68 | 0.65-0.75 | 0.60-0.66 | High clustering is typical of structured, rule-based systems like natural languages. |
| Small-world sigma σ | 190 | 10-100 | 30-50 | Voynich network is extremely “small-world”: words are tightly interconnected. |
| Degree assortativity | -0.23 | -0.20 to -0.30 | ≈ 0 | Negative assortativity means frequent “hub” words link to many rarer ones, as in real languages. |
| Louvain modularity | 0.22 | 0.10-0.25 | 0.10-0.15 | Moderate modularity: clear internal communities but not extreme compartmentalisation. |
| Entropy-degree corr. (log1p) | 0.99 | 0.95-0.99 | 0.9-0.98 | High correlation shows that more connected words are also more contextually variable: a linguistic trait. |
| Mean next-word entropy (H_mean) | 0.65 bits | 0.9-1.2 bits | 0.3-0.7 bits | Voynich transitions are more predictable than in natural languages: more fixed patterns. |
| % of tokens with H = 0 | ≈ 70% | ≈ 45-55% | ≈ 70-85% | High proportion of deterministic transitions: a sign of templated or repetitive structure. |
| Type-Token Ratio (TTR) | 0.186 | 0.10-0.20 | 0.16-0.19 | Lexical diversity in the Voynich is within the normal linguistic range. |
| Max node degree (hub dominance) | Very high | High | Very high | A few words act as extreme hubs (like “daiin”, “ol”), stronger than in typical natural languages. |
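Before going through the KPIs one by one: all of them are computed on a word-adjacency graph. Here is a minimal sketch of how such a bigram graph can be built with networkx. It assumes whitespace-tokenised input; the function name and file name are illustrative, and a real pipeline needs more cleaning of the transliteration.

```python
from collections import Counter
import networkx as nx

def build_bigram_graph(text: str) -> nx.DiGraph:
    """Directed word-adjacency graph: nodes are word types,
    edges are weighted by how often word A precedes word B."""
    tokens = text.split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    G = nx.DiGraph()
    for (a, b), weight in bigrams.items():
        G.add_edge(a, b, weight=weight)
    return G

# Illustrative usage; "voynich_eva.txt" is a placeholder file name.
# text = open("voynich_eva.txt", encoding="utf-8").read()
# G = build_bigram_graph(text)
```

A directed graph keeps word order, which matters later for reciprocity and next-word entropy; most clustering-style metrics are then computed on its undirected projection.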
OK, let's take a look at the KPIs and what they mean (code sketches that reproduce them follow the list):
- Clustering coefficient C: This tells us how much the neighbours of a word are also connected to each other. In normal language, related words tend to appear together in small clusters (like “herbal remedy” or “city walls”). High clustering means the text has local structure, not random word order.
- Small-world σ: This compares the graph to a random network. If σ is high, it means the network has short paths between words but still a lot of local clustering. Real languages usually have σ around 10-100. Random noise gives σ near 1. The Voynich text has σ around 190, which means it is very tightly connected.
- Degree assortativity: This measures whether highly connected words tend to link to other highly connected words. In real languages, frequent "function words" like "the", "and", "of" connect to many rare "content words", so the value is usually negative (around -0.2 to -0.3). If the value is close to zero or positive, it suggests a random or artificial system.
- Modularity: This shows how strongly the network splits into separate groups (called communities). In a language, these groups often correspond to topics, morpheme families, or stylistic units. Values around 0.1-0.25 are common in real texts. If the value is too low, there is no structure; too high, and the text is fragmented.
- Entropy-degree correlation: Entropy measures how predictable the next word is after a given word. When this correlates strongly with the number of connections (degree), it means words with more possible continuations are also less predictable, which is what happens in real language. A high correlation (close to 1) is a good sign of linguistic organization.
- Mean next-word entropy (H_mean): This is the average unpredictability of the next word. Higher entropy means more variation and freedom; lower entropy means repetitive or formulaic sequences. Normal languages have values around 1 bit. The Voynich text is closer to 0.6 bits, meaning it repeats many short patterns.
- Percent of tokens with H = 0: This is the percentage of words that always appear in the same exact context. In human language about half of the words are flexible. In the Voynich text, around 70% of words have fixed positions or neighbours.
- Type-Token Ratio (TTR): This measures lexical diversity: how many unique words there are compared to total words. If TTR is too low, the text repeats the same few words; if it is too high, it may be random. The Voynich value (≈0.18) is within the normal range for real languages.
- Max node degree (hub dominance): This tells us how many connections the most frequent word has. In normal texts, function words are hubs but not overwhelmingly dominant. In the Voynich, some words like *daiin* or *ol* are super-hubs that connect almost everywhere, suggesting a more constrained or mechanical pattern.
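The first four graph KPIs (clustering, small-world σ, assortativity, modularity) can be reproduced from that bigram graph with standard networkx calls. A minimal sketch, assuming networkx 3.x for the Louvain functions; the function name and seed are mine, and σ is expensive to compute, so the iteration counts here are deliberately low:

```python
import networkx as nx

def structure_metrics(G: nx.DiGraph) -> dict:
    # Clustering, assortativity and modularity are computed on the
    # undirected projection of the bigram graph.
    U = G.to_undirected()
    communities = nx.community.louvain_communities(U, seed=42)
    # Small-world sigma needs a connected graph, so restrict to the
    # giant component; niter/nrand are kept small because sigma is slow.
    giant = U.subgraph(max(nx.connected_components(U), key=len)).copy()
    return {
        "clustering_C": nx.average_clustering(U),
        "assortativity": nx.degree_assortativity_coefficient(U),
        "modularity": nx.community.modularity(U, communities),
        "sigma": nx.sigma(giant, niter=5, nrand=5, seed=42),
    }
```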
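The remaining KPIs in the table come from the token sequence itself rather than the graph. Another sketch, assuming Python 3.10+ for statistics.correlation, and assuming that the “(log1p)” in the table refers to a log1p transform of the degree:

```python
import math
import statistics
from collections import Counter, defaultdict

def entropy_metrics(tokens: list[str]) -> dict:
    # Next-word frequency table for every token type
    # (only types with at least one successor are counted).
    nxt = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        nxt[a][b] += 1
    # Shannon entropy (in bits) of each word's next-word distribution.
    H = {}
    for w, counts in nxt.items():
        total = sum(counts.values())
        H[w] = -sum((c / total) * math.log2(c / total)
                    for c in counts.values())
    words = list(H)
    degrees = [sum(nxt[w].values()) for w in words]
    return {
        "H_mean": statistics.mean(H.values()),
        "pct_H_zero": 100 * sum(h == 0 for h in H.values()) / len(H),
        # Assumption: the "(log1p)" in the table applies to the degree axis.
        "entropy_degree_corr": statistics.correlation(
            [math.log1p(d) for d in degrees], [H[w] for w in words]),
        "TTR": len(set(tokens)) / len(tokens),
    }
```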
The first table listed the main indicators used to decide whether the Voynich behaves like a real language or a random system. But graphs can reveal much more. The next table shows additional network features that help refine the picture (things like how words form clusters, how stable those clusters are, and how dependent the text is on a few central tokens). These metrics confirm the same pattern: the Voynich text is structured, but also rigid and repetitive, unlike any known natural language.
| Additional KPI | Voynich (full corpus) | Natural languages | Artificial (Timm / Pseudo) | Interpretation |
|---|---|---|---|---|
| Degree distribution shape | Steep (few very high hubs) | Zipf-like, smoother | Irregular or flatter | Voynich shows extreme hubs and fewer mid-frequency words → repetitive templates. |
| Betweenness centrality (bridges) | Few, dominated by same stems | Many bridges | Few, repetitive | Limited connectors → reduced syntactic flexibility. |
| Average path length | ≈ 3.5–4 | ≈ 2.5–3 | ≈ 3–4 | Slightly longer paths → weaker global connectivity. |
| Community size distribution | 1–2 large, many tiny | Balanced | Uneven | Dominated by few structural families → strong repetition. |
| Reciprocity (A↔B pairs) | Low | Moderate to high | Low | Mostly one-directional links → fixed word order. |
| Stability across sections | Very stable | Moderately variable | Stable | Constant KPIs → same mechanism throughout the text. |
| Motif frequency (loops / triads) | High number of closed loops | Mostly open triads | High loops | Repetitive short cycles typical of patterned generation. |
| Eigenvector centrality | Concentrated on few stems | Spread across function words | Concentrated | Lack of differentiated “functional vocabulary.” |
| Assortativity by entropy | Low (predictable words cluster) | Higher, mixed | Low | Predictable words link together → rigid local structure. |
| Network resilience | Low (breaks after few removals) | High (robust) | Low | Strong dependency on few central hubs → mechanical organisation. |
Here are the explanations of these KPIs (again, code sketches for reproducing some of them follow the list):
- Degree distribution shape: This describes how many words have few connections versus many. Natural languages follow a *Zipf-like* pattern: most words are rare, and a few are very common. The Voynich follows a similar overall trend, but the drop-off is steeper: the most frequent words are *too dominant*, and middle-frequency words are fewer than expected. This points to a system with strong templates or repetitive affixes.
- Betweenness centrality: This measures how often a word acts as a "bridge" between otherwise separate clusters. In normal language, conjunctions or pronouns often fill this role (e.g., *and*, *that*, *which*). In the Voynich, only a handful of tokens have high betweenness, and most are the same recurring stems (*chedy*, *daiin*, *ol*). That suggests less syntactic flexibility and fewer multi-purpose connectors.
- Average path length: This is the average number of steps needed to go from one word to another through their connections. Real languages tend to have short path lengths (because of function words). The Voynich’s path length is slightly longer, meaning that its “grammar bridges” are weaker and the network is less efficiently connected at the global scale.
- Community size distribution: This shows whether the text’s thematic or morphological clusters are evenly sized or dominated by a few. Natural languages have many medium-sized communities. The Voynich network has one or two very large communities and many tiny ones: a pattern of strong repetition within a few structural groups rather than balanced thematic diversity.
- Reciprocity: This checks how often two words co-occur in *both directions* (A→B and B→A). High reciprocity suggests a flexible, bidirectional syntax (as in natural language). The Voynich has low reciprocity: connections mostly go one way, reinforcing the idea of fixed or formulaic order.
- Stability across sections (temporal or positional variation): If you divide the text into sections, you can see whether the network KPIs stay stable. In real language, KPIs fluctuate slightly with topic. In the Voynich, they are remarkably constant, which hints that the same mechanism or rule set produced the whole manuscript.
- Motif frequency (triads and loops): Counting recurring small subgraphs (like word triplets) shows internal patterns. Natural languages have many asymmetric triplets (A→B→C, but not the reverse). The Voynich graph has more closed and repeated loops, another sign of formulaic or patterned text generation.
- Eigenvector centrality: This tells how much a word is connected to other well-connected words. In normal language, function words like *the* or *of* dominate. In the Voynich, the top eigenvector words are the same repetitive stems that dominate all metrics, which reinforces the impression of limited functional differentiation.
- Assortativity by entropy: Instead of linking by degree, this looks at whether predictable words connect with other predictable words. In the Voynich, low-entropy words cluster together, again suggesting rigid sequences rather than flexible syntax.
- Network resilience: If you remove the most connected words, does the network fall apart? In natural language, the network remains mostly connected (because redundancy is high). In the Voynich, removing a few central tokens breaks the network quickly, which shows dependency on a small set of structural hubs.
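For completeness, here is a sketch of how the bridge, path-length, reciprocity, motif, and eigenvector KPIs above could be reproduced with networkx. The function name, seed, and sampling size are illustrative choices, not the exact parameters of my pipeline; betweenness is estimated from sampled pivot nodes to keep the runtime reasonable.

```python
import networkx as nx

def connectivity_metrics(G: nx.DiGraph) -> dict:
    U = G.to_undirected()
    giant = U.subgraph(max(nx.connected_components(U), key=len)).copy()
    # Betweenness estimated from k sampled pivot nodes.
    btw = nx.betweenness_centrality(giant, k=min(500, len(giant)), seed=42)
    eig = nx.eigenvector_centrality(giant, max_iter=1000)
    # The triad census needs a self-loop-free directed graph
    # (bigram graphs contain self-loops when a word repeats itself).
    D = G.copy()
    D.remove_edges_from(nx.selfloop_edges(D))
    return {
        "avg_path_length": nx.average_shortest_path_length(giant),
        "reciprocity": nx.reciprocity(G),      # share of A<->B links
        "triad_census": nx.triadic_census(D),  # 16 triad classes
        "top_bridges": sorted(btw, key=btw.get, reverse=True)[:5],
        "top_eigenvector": sorted(eig, key=eig.get, reverse=True)[:5],
    }
```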
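Finally, the resilience test is just a hub-removal experiment: delete the highest-degree words one at a time and track how fast the giant component shrinks. A minimal sketch (again, the function name and step count are mine):

```python
import networkx as nx

def resilience_curve(G: nx.DiGraph, steps: int = 20) -> list[float]:
    """Fraction of all nodes still in the giant component after
    removing the top-degree hubs one by one."""
    U = G.to_undirected()
    n = U.number_of_nodes()
    hubs = [w for w, _ in sorted(U.degree, key=lambda kv: kv[1],
                                 reverse=True)[:steps]]
    curve = []
    for word in hubs:
        U.remove_node(word)
        giant = max(nx.connected_components(U), key=len)
        curve.append(len(giant) / n)
    return curve
```

In a robust natural-language network this curve decays gradually; in a hub-dependent network like the Voynich's, it drops sharply after the first few removals.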