24-10-2025, 11:35 PM
Hello all,
You may know I’ve been working on the Voynich Manuscript using machine learning and data science. Lately I’ve started a new line of research: the study of the Voynich through graphs.
Graph analysis is still quite new in language studies, but it’s becoming a powerful tool in machine learning. It lets us see how words connect and interact. The idea is simple: each word (token) becomes a node, and we draw edges between them whenever they occur near each other in the text. Those edges can have weights (how often the pair appears), directions (which one comes first), and other information. When you do this, the text becomes a network, and that network has a structure we can measure.
In my case, I built co-occurrence graphs using a sliding window of 5 tokens. Two words are connected if they appear within the same window. I also repeated the same process for several other texts:
- Latin (De Docta Ignorantia, Platonis Apologia, Alchemical Herbal),
- French (La Reine Margot and old medical texts),
- Catalan (Tirant lo Blanch),
- Spanish (Lazarillo de Tormes),
- English (Culpepper),
- and a sample of the synthetic text generated by Torsten Timm.
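As a rough sketch of the construction step, here is a minimal pure-Python version of the sliding-window co-occurrence counting (toy tokens for illustration; the actual pipeline of course runs on the full transliterations):

```python
from collections import Counter

def cooccurrence_edges(tokens, window=5):
    """Count weighted co-occurrence edges: each unordered pair of
    distinct tokens gets +1 every time the two fall within `window`
    positions of each other."""
    edges = Counter()
    for i, w1 in enumerate(tokens):
        # look ahead at the next (window - 1) tokens
        for w2 in tokens[i + 1 : i + window]:
            if w1 != w2:
                edges[frozenset((w1, w2))] += 1
    return edges

# Toy example with a few Voynich-style tokens
tokens = "daiin shedy qokedy daiin chol shedy daiin".split()
edges = cooccurrence_edges(tokens, window=5)
# The pair {'daiin', 'shedy'} ends up with weight 4
```

The resulting counts are exactly the weighted edge list of the graph; feeding them into any graph library gives the network analyzed below.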
Once you have the graph, you can study things like community structure, modularity, and assortativity: basically, how tightly the vocabulary groups together and how predictable those connections are. Well, yes, it looks this crazy...
[attachment=11797]
The Voynich graph has thousands of nodes and over a hundred thousand weighted connections. It’s not random: it shows clear clusters of related tokens, similar to topic domains in real languages.
When comparing modularity (how strongly the graph is divided into communities), the Voynich scores around 0.25. For context:
- Scholastic and alchemical Latin texts are around 0.28–0.33.
- Narrative texts like Tirant lo Blanch or La Reine Margot are only 0.14–0.19.
So the Voynich has a structured, technical-style network, closer to medieval treatises than to prose literature.
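For readers who want to see what the modularity number actually measures, here is a simplified pure-Python version of Newman’s Q for an unweighted graph with hand-picked communities (the real analysis uses weighted graphs and automatic community detection, so this is only a toy illustration):

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman modularity Q for an undirected, unweighted graph.
    `edges` is a list of (u, v) pairs; `communities` maps node -> id.
    Q = sum over communities of (L_c / m) - (D_c / 2m)^2, where L_c is
    the number of edges inside community c and D_c its degree sum."""
    m = len(edges)
    inside = defaultdict(int)   # edges fully inside a community
    degree = defaultdict(int)   # node degrees
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if communities[u] == communities[v]:
            inside[communities[u]] += 1
    deg_sum = defaultdict(int)  # degree mass per community
    for node, d in degree.items():
        deg_sum[communities[node]] += d
    return sum(inside[c] / m - (deg_sum[c] / (2 * m)) ** 2
               for c in deg_sum)

# Two triangles joined by a single bridge edge: clearly two communities
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
comm = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
q = modularity(edges, comm)  # 5/14, about 0.357
```

A well-separated toy graph like this lands near the scholastic-Latin end of the scale, while scattering the same nodes across mixed communities drives Q toward zero.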
[attachment=11798]
Nothing new here: I looked at entropy, which measures how predictable the next word is from the previous ones. Lower entropy means the text is more repetitive or rule-bound.
In this comparison, the Voynich again behaves more like Latin scholastic or medical texts, highly structured and formulaic, than like natural flowing prose.
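Concretely, the word-level entropy I mean here can be estimated from bigram counts as the conditional entropy H(next | previous), in bits (a simplified bigram version, shown on toy data):

```python
import math
from collections import Counter

def conditional_entropy(tokens):
    """Bigram conditional entropy H(next | prev) in bits: 0 means the
    next token is fully determined by the previous one; higher values
    mean freer, less predictable sequences."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    n = len(tokens) - 1
    h = 0.0
    for (w1, w2), count in bigrams.items():
        p_pair = count / n           # P(w1, w2)
        p_cond = count / unigrams[w1]  # P(w2 | w1)
        h -= p_pair * math.log2(p_cond)
    return h

# A perfectly alternating sequence is fully predictable: H = 0
conditional_entropy("a b a b a b".split())   # 0.0
```

The same function applied to characters instead of words, `conditional_entropy(list(text))`, gives a character-level analogue of the morphological measure used further down.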
[attachment=11799]
To visualize everything together, I used a radar plot combining graph properties (modularity, assortativity) with token and character-level entropies. Each text forms its own fingerprint.
[attachment=11800]
When we plot entropy (a proxy for syntactic freedom) against modularity (a proxy for lexical structure), each text takes its own position in what you could call a complexity space.
[attachment=11802][attachment=11801]
The Voynich lands right in the middle, between the tightly structured Latin technical and medical compilations, and the more free-flowing narrative works like Tirant lo Blanch or La Reine Margot. It’s not as rigid as scholastic Latin, but not as loose as prose either.
The same pattern appears when comparing character entropy (morphological freedom) to modularity: again, the Voynich sits halfway between those two worlds. This suggests the manuscript has an intermediate level of organization: structured enough to follow internal rules, but not fully regular like formal Latin treatises.
It might reflect a controlled or encoded version of natural language, or simply a writing system with its own conventions. It’s also interesting that the same language can produce very different results depending on the type of text: a Latin medical recipe and a Latin philosophical dialogue, for example, can have completely distinct structural profiles. That gives a sense of how much “style” and “purpose” shape the internal geometry of a text.
It’s also worth noting that the Torsten Timm generated text, which is algorithmic, shows a very similar position in this “complexity space.” That means internal consistency and structured co-occurrence can emerge from both linguistic and mechanical systems. So, these results don’t demonstrate that the Voynich encodes a real language, only that it behaves like a text with rules, not pure randomness.
As always, any thoughts are welcome!
