The Voynich Ninja

Full Version: Inside the Voynich Network: graph analysis
Hello all,

You may know I’ve been working on the Voynich Manuscript using machine learning and data science. Lately I’ve started a new line of research: the study of the Voynich through graphs.

Graph analysis is still quite new in language studies, but it’s becoming a powerful tool in machine learning. It lets us see how words connect and interact. The idea is simple: each word (token) becomes a node, and we draw edges between them whenever they occur near each other in the text. Those edges can have weights (how often the pair appears), directions (which one comes first), and other information. When you do this, the text becomes a network, and that network has a structure we can measure.

In my case, I built co-occurrence graphs using a sliding window of 5 tokens. Two words are connected if they appear within the same window (a minimal sketch of the construction follows the list below). I also repeated the same process for several other texts:

- Latin (De Docta Ignorantia, Platonis Apologia, Alchemical Herbal),
- French (La Reine Margot and old medical texts),
- Catalan (Tirant lo Blanch),
- Spanish (Lazarillo de Tormes),
- English (Culpepper),
- and a sample of the synthetic text generated by Torsten Timm.
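
For anyone who wants to reproduce the construction, here is a minimal sketch of how such a co-occurrence graph can be built with networkx (just one way to do it). The window size is the one described above; the toy token list at the end is only a placeholder, since a real run uses the full transcription as a list of tokens:

Code:
import networkx as nx

def cooccurrence_graph(tokens, window=5):
    """Weighted, undirected co-occurrence graph: each distinct token is a node,
    and an edge links two tokens whenever they appear within `window` positions
    of each other; the edge weight counts how often that happens."""
    G = nx.Graph()
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if u == v:
                continue  # ignore self-loops
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

# Toy example only; the real graphs are built from the transcription tokens.
tokens = "daiin chedy qokeey daiin shedy qokaiin chedy daiin".split()
G = cooccurrence_graph(tokens, window=5)
print(G.number_of_nodes(), G.number_of_edges())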

Once you have the graph, you can study things like community structure, modularity, and assortativity: basically, how tightly the vocabulary groups together and how predictable those connections are. And yes, it looks as crazy as this...

[attachment=11797]

The Voynich graph has thousands of nodes and over a hundred thousand weighted connections. It’s not random: it shows clear clusters of related tokens, similar to topic domains in real languages.
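
For reference, once the graph exists these metrics can be read off with networkx. This is only a sketch: it uses greedy modularity maximization as an example community-detection step, not as a claim that this is the exact algorithm behind the numbers below.

Code:
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def graph_metrics(G):
    # Partition the nodes into communities, then score how modular that partition is.
    communities = greedy_modularity_communities(G, weight="weight")
    mod = modularity(G, communities, weight="weight")
    # Degree assortativity: do well-connected nodes tend to link to other well-connected nodes?
    assort = nx.degree_assortativity_coefficient(G, weight="weight")
    return mod, assort

mod, assort = graph_metrics(G)  # G built as in the sketch above
print(f"modularity={mod:.3f}  assortativity={assort:.3f}")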

When comparing modularity (how strongly the graph is divided into communities), the Voynich comes in at around 0.25. For context:

- Scholastic and alchemical Latin texts are around 0.28–0.33.
- Narrative texts like Tirant lo Blanch or La Reine Margot are only 0.14–0.19.

So the Voynich has a structured, technical-style network, closer to medieval treatises than to prose literature.

[attachment=11798]

Nothing new here: I looked at conditional entropy, which measures how predictable the next word is given the previous ones. Lower entropy means the text is more repetitive or rule-bound.

In this comparison, the Voynich again behaves more like Latin scholastic or medical texts, highly structured and formulaic, than like natural flowing prose.
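
To be explicit about what I mean by entropy here, this is a minimal sketch of a word-level conditional entropy estimated from bigram counts (the character-level version is the same calculation applied to characters instead of tokens):

Code:
import math
from collections import Counter

def conditional_entropy(tokens):
    """H(next | current) in bits, estimated from bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (u, v), n in bigrams.items():
        p_uv = n / total              # P(u, v)
        p_v_given_u = n / firsts[u]   # P(v | u)
        h -= p_uv * math.log2(p_v_given_u)
    return h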

[attachment=11799]

To visualize everything together, I used a radar plot combining graph properties (modularity, assortativity) with token and character-level entropies. Each text forms its own fingerprint.
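
In case it is useful, the radar plot itself is just a polar plot in matplotlib. The metric names and values below are placeholders to show the mechanics; the real figure uses the measured values for each text, rescaled to a common range:

Code:
import numpy as np
import matplotlib.pyplot as plt

metrics = ["modularity", "assortativity", "token entropy", "char entropy"]
profiles = {                      # placeholder values in [0, 1], not the real ones
    "Text A": [0.5, 0.4, 0.6, 0.3],
    "Text B": [0.7, 0.5, 0.4, 0.5],
}

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]              # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, values in profiles.items():
    vals = values + values[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.legend(loc="upper right", bbox_to_anchor=(1.35, 1.1))
plt.show()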

[attachment=11800]

When we plot entropy (a proxy for syntactic freedom) against modularity (a proxy for lexical structure), each text takes its own position in what you could call a complexity space:

[attachment=11802][attachment=11801]

The Voynich lands right in the middle, between the tightly structured Latin technical and medical compilations, and the more free-flowing narrative works like Tirant lo Blanch or La Reine Margot. It’s not as rigid as scholastic Latin, but not as loose as prose either.

The same pattern appears when comparing character entropy (morphological freedom) to modularity: again, the Voynich sits halfway between those two worlds. This suggests the manuscript has an intermediate level of organization: structured enough to follow internal rules, but not fully regular like formal Latin treatises.

It might reflect a controlled or encoded version of natural language, or simply a writing system with its own conventions. It’s also interesting that the same language can produce very different results depending on the type of text: a Latin medical recipe and a Latin philosophical dialogue, for example, can have completely distinct structural profiles. That gives a sense of how much “style” and “purpose” shape the internal geometry of a text.

It’s also worth noting that the Torsten Timm generated text, which is algorithmic, shows a very similar position in this “complexity space.” That means internal consistency and structured co-occurrence can emerge from both linguistic and mechanical systems. So, these results don’t demonstrate that the Voynich encodes a real language, only that it behaves like a text with rules, not pure randomness.

As always, any thoughts are welcome!
It'd be interesting to plot the probability distribution of the degree of each node. I think it could paint a bigger picture than cond. entropy alone. I expect Voynichese to have a more multimodal-like distribution than the other corpora, but it's just a wild guess.
Great job by the way  Smile
(24-10-2025, 11:35 PM)quimqu Wrote: I’ve been working on the Voynich Manuscript using machine learning and data science.

Ugh!  But you did get some interesting results in spite of these handicaps... Big Grin

Quote:each word (token) becomes a node, and we draw edges between them whenever they occur near each other in the text. [...] using a sliding window of 5 tokens. [...] Once you have the graph, you can study things like community structure, modularity, and assortativity

I suppose you consider how often the pair appears, not just yes/no?

The VMS has ~39'000 tokens (word occurrences) and ~7'000 lexemes ("word types", distinct words).  With a 5-token window you will collect about 4 undirected pairs per token (or 8 directed ones).  That is ~156'000 undirected pairs -- far fewer than the ~24 million edges of the complete graph -- giving an average node degree of ~45 instead of ~7000.
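
(Spelling out that arithmetic, in case anyone wants to check it:)

Code:
tokens = 39_000      # word occurrences (approx.)
lexemes = 7_000      # distinct word types (approx.)
window = 5

pairs = tokens * (window - 1)                  # ~156'000 undirected pairs
avg_degree = 2 * pairs / lexemes               # ~45
complete_edges = lexemes * (lexemes - 1) // 2  # ~24.5 million possible edges
print(pairs, round(avg_degree), complete_edges)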

Therefore much of the structure of the graph will be an artifact of random sampling.  That is, even if chedal could occur next to 1000 different lexemes with equal frequency, the graph could show only a dozen of them, at random.

The structure of truly random graphs has been studied a lot, but since the tokens have very different frequencies, it is not clear how to apply those results to this situation.

For that reason, I would consider only pairs u,v where the frequency of u-near-v is significantly higher than what one would expect if the tokens were chosen independently with the observed lexeme frequencies (that is, generated by a word Markov model of order 0).  With that cutoff criterion, text generated by such an order-0 process should give an almost empty graph.  No?
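
In code, a crude version of that cutoff might look like the sketch below: compare each observed edge count with the count expected under the order-0 model, and keep an edge only if the observed count is well above it. (The factor-of-2 threshold is arbitrary; a proper version would use an actual significance test.)

Code:
from collections import Counter

def filter_edges(G, tokens, window=5, ratio=2.0):
    """Keep only edges whose observed count clearly exceeds the count expected
    if tokens were drawn independently with the observed frequencies."""
    n = len(tokens)
    freq = Counter(tokens)
    total_pairs = n * (window - 1)               # approximate number of window pairs
    H = G.copy()
    for u, v, data in G.edges(data=True):
        p_u = freq[u] / n
        p_v = freq[v] / n
        expected = 2 * p_u * p_v * total_pairs   # either order within the window
        if data["weight"] < ratio * expected:
            H.remove_edge(u, v)
    return H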

And also there is the problem that Voynichese seems to have a smaller vocabulary than other "control" texts of the same size.  Thus the graph of Voynichese will have fewer nodes, hence perhaps a higher average degree.  This difference alone could create differences in the community structure etc.

Quote:The same pattern appears when comparing character entropy (morphological freedom) [...] It might reflect a controlled or encoded version of natural language, or simply a writing system with its own conventions. It’s also interesting that the same language can produce very different results depending on the type of text

Indeed.  Character-based statistics are extremely sensitive to the dialect and spelling/encoding system used.  If one were to write German with "x" instead of "sch", and with "q" or "j" for the two sounds of "ch", the character statistics would be different from those of official German.

Moreover, the frequency of characters and character tuples is dominated by their occurrence in the most common words.  The popularity of "t", "h", and "e" in English is due in great part to "the" being the most common word in English prose.  But "the" may be nearly absent in some "technical" books like herbals or pharmaceutical manuals.  And if one were to spell that word (only) as "ye" or "d'", the frequencies of "t", "e", "th" etc. would measurably drop.

Once a friend watched a bit of Japanese TV and wondered why people were saying "ta" all the time.  "Ta" is the past tense marker, and it was a news program, where almost all sentences were in the past tense...

Quote:English (Culpepper)

The name is "peper" not "pepper". I know because I made the same mistake...

All the best, --stolfi
Well...

I don't get one thing...

You say (like most people) that the Voynich has a non-random structure. But on both graphs it appears really close to the Torsten Timm text.
Actually the Torsten Timm point is the nearest to the Voynich point on both graphs.

And the Torsten Timm point is, in many dimensions, somewhere in the middle of the meaningful texts. It's the leftmost only on one dimension. It doesn't stand out from the crowd in your tests.

So my question is: are these actually good measures of whether a text is meaningful or gibberish? What do they actually measure?
(25-10-2025, 06:43 PM)Rafal Wrote: So my question is: are these actually good measures of whether a text is meaningful or gibberish?
Not really, because meaningful text can display a whole gamut of values on these metrics depending on the language, encoding, content and writing style, it seems.

(25-10-2025, 06:43 PM)Rafal Wrote: You say (like most people) that the Voynich has a non-random structure. But on both graphs it appears really close to the Torsten Timm text.
I think they meant 'random' as in 'unpredictable'. TT's text is gibberish but not fully random in that sense; it's more mechanical.
(25-10-2025, 02:53 PM)RadioFM Wrote: It'd be interesting to plot the probability distribution of the degree of each node. I think it could paint a bigger picture than cond. entropy alone. I expect Voynichese to have a more multimodal-like distribution than the other corpora, but it's just a wild guess.

Great job by the way  Smile

Thank you RadioFM,

it’s an interesting point of view that you bring up. Luckily, this is easy to calculate and plot. Below are the degree distributions for several corpora, on a log scale.
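
(For reference, the calculation is just this, modulo binning choices; G is the co-occurrence graph built as in the first post:)

Code:
import matplotlib.pyplot as plt

degrees = [d for _, d in G.degree()]
plt.hist(degrees, bins=50, log=True)   # log-scale counts to show the heavy tail
plt.xlabel("node degree")
plt.ylabel("number of nodes")
plt.show()
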
[attachment=11810]
The result is quite revealing: all texts follow a similar heavy-tailed distribution, meaning that most tokens have few connections while a small number of words act as hubs.
The Voynich and Torsten Timm texts show almost identical shapes and very close statistics (mean and median degree around 20–90). They both sit between the compact structure of the French medical and scholastic Latin texts (around 60 mean degree) and the richer prose texts like Culpeper or Tirant lo Blanch (mean degree 300–500).
In other words, the Voynich doesn’t look random at all (nor does Torsten’s, because it follows generation rules). Its structure follows the same kind of scale-free pattern that natural and technical texts do, just with a bit less connectivity than full prose.
(25-10-2025, 04:24 PM)Jorge_Stolfi Wrote: The name is "peper" not "pepper". I know because I made the same mistake...

Corrected, thank you.

Regarding the rest of the comments, please note that I do count how often the pair appears, not just yes/no. You’re right, the graph can pick up some noise from random co-occurrences, especially with a small window. I already drop all single links to clean it up a bit, but yeah, it’s still kind of raw.

Your idea about checking whether pairs appear more often than expected from the raw word frequencies (like an order-0 Markov baseline) makes total sense. That would give a cutoff to keep only the meaningful edges. I’ll probably try that next (stay tuned Smile )

And true, the Voynich has fewer different words, so the average degree goes up a bit by definition. That’s why I focus more on the shape of the distributions and on modularity, not the absolute numbers.

Anyway, what’s cool is that both the Voynich and Timm graphs end up with a similar heavy-tailed shape to real texts: not random at all, just a bit less connected. So yeah, your idea is spot on, I’ll test that. Thanks!
(25-10-2025, 06:43 PM)Rafal Wrote: So my question is: are these actually good measures of whether a text is meaningful or gibberish? What do they actually measure?

Exactly as RadioFM answered you, Torsten Timm’s text is not random. It is generated by rules, so it’s mechanical but structured. That’s why it behaves similarly to the Voynich: both are internally consistent systems with regular patterns, not chaos.
To check this, I also added a shuffled version of the Culpeper text (same words, totally random order). You can see it on the plot (bottom right corner) with the lowest modularity and the highest entropy. That’s what true randomness looks like: no community structure and full unpredictability.
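
(The shuffled baseline is trivial to build; culpeper_tokens here is just a stand-in name for the token list of the Culpeper text, and cooccurrence_graph is the construction sketched in the first post:)

Code:
import random

shuffled = list(culpeper_tokens)   # same words...
random.shuffle(shuffled)           # ...totally random order
G_shuffled = cooccurrence_graph(shuffled, window=5)
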
[attachment=11811]
So, what these measures capture is structure vs. freedom, not meaning. They tell us how organized the text is, not whether it says something understandable. Meaningful human language can appear anywhere along that scale, depending on style, genre, and even spelling conventions.

The interesting part is that Voynich and Timm both land in the middle zone, far from random, but not as rigid as scholastic Latin or as free as prose. That’s what makes it so tricky: it behaves like language, but not quite like any known one.
(25-10-2025, 06:43 PM)Rafal Wrote: But on both graphs it appears really close to the Torsten Timm text. Actually the Torsten Timm point is the nearest to the Voynich point on both graphs

Of course. Because their generator was tuned to produce text with statistics as close as possible to those of the VMS.  For one thing, it requires a seed text with the desired properties.

Pseudo-English generators using word-based Markov chains are old and easy programming exercises.  Their output seems very similar to real English -- until you try to understand what it is saying.  I bet that such pseudo-English would score very similarly to real English in Quimqu's analysis.
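
(The classic exercise, sketched in a few lines: an order-1 word chain trained on whatever English sample one has at hand. The name english_tokens below is just a stand-in for such a sample.)

Code:
import random
from collections import defaultdict

def train_chain(tokens):
    # Map each word to the list of words that follow it in the sample.
    chain = defaultdict(list)
    for u, v in zip(tokens, tokens[1:]):
        chain[u].append(v)
    return chain

def generate(chain, start, length=50):
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# chain = train_chain(english_tokens)
# print(generate(chain, random.choice(english_tokens)))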

All the best, --stolfi
Out of curiosity, can you run the same analysis on this text: [link]

It's been generated with an order-0 Markov chain (you may wish to remove the initial header, the four lines prefixed by %c%)