The Voynich Ninja

Full Version: Inside the Voynich Network: graph analysis
(28-10-2025, 05:15 PM)Mauro Wrote: The same graph also brings me to re-evaluate Torsten Timm's work, because it's undeniable it's plotted 'near' the VMS, while its shuffled version is far away.

The T&T generator is vaguely similar to a high-order word-based Markov model, so it naturally mimics the long-range correlations (up to 4 words apart and more) of the target language. An English-based version of the T&T output (or of an order-3 Markov model) should plot close to real English too, while their shuffled versions would plot far from it.
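To make the analogy concrete, here is a minimal sketch of an order-3 word-based Markov generator. This only illustrates the analogy; Timm & Timm's actual self-citation algorithm works differently:

Code:
import random
from collections import defaultdict

def train_markov(tokens, order=3):
    # Map each `order`-word context to the list of observed successors.
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, length=200, order=3):
    # Random walk over the learned contexts; restart on dead ends.
    out = list(random.choice(list(model.keys())))
    while len(out) < length:
        successors = model.get(tuple(out[-order:]))
        if not successors:
            out.extend(random.choice(list(model.keys())))
        else:
            out.append(random.choice(successors))
    return out[:length]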
Mauro, thank you for the compliments.

You got it right: the co-occurrence graph was created with a window of 5 tokens.
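For anyone who wants to reproduce this, here is a minimal sketch of how such a graph can be built with networkx. This is my reconstruction, not necessarily quimqu's code; in particular, whether "window of 5" means a maximum distance of 5 or a 5-token span including the focal word is an assumption:

Code:
import networkx as nx

def cooccurrence_graph(tokens, window=5):
    # Link every pair of distinct tokens at most `window` positions apart.
    G = nx.Graph()
    for i, a in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            b = tokens[j]
            if a == b:          # skip self-loops (another assumption)
                continue
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G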

Now I have some more tests to carry out:

- see how the results differ with different window sizes
- create the graph taking paragraphs into account. I think this is important because, in my opinion, the window of 5 should not pick up words from the next paragraph. This will reduce the number of edges, and I am curious to see how different the graph KPIs will be
- post the graph KPIs for languages A and B as well

Regarding the differences between texts in the same language, I think the graphs can help separate the styles or themes of the texts. I will continue testing more texts.

If anyone has any suggestions, I am open to trying them out.
I’ve run the co-occurrence graphs for window sizes from K = 1 up to 10 on the same four corpora: Alchemical Herbal Latin, Tirant lo Blanch, Torsten Timm’s generated text, and the Voynich Manuscript.
The global behaviour is smooth across all K, and most indicators move monotonically: clustering grows, path length shortens, modularity declines. Yet the Voynich keeps its peculiar position: extremely small-world at every scale, moderately modular, and lower in entropy.

Corpus                   K    C       σ        Modularity  H_mean
AlchemicalHerbalLatin    1    0.550    86.67   0.424       2.41
AlchemicalHerbalLatin    2    0.550    86.67   0.427       2.41
AlchemicalHerbalLatin    3    0.620    60.86   0.373       2.94
AlchemicalHerbalLatin    5    0.680    39.45   0.303       3.59
AlchemicalHerbalLatin    7    0.707    29.47   0.259       4.01
AlchemicalHerbalLatin   10    0.729    21.36   0.223       4.44
Tirant_lo_Blanch         1    0.637   169.75   0.235       2.79
Tirant_lo_Blanch         2    0.637   169.75   0.239       2.79
Tirant_lo_Blanch         3    0.680   119.96   0.196       3.30
Tirant_lo_Blanch         5    0.716    78.99   0.155       3.93
Tirant_lo_Blanch         7    0.729    59.49   0.136       4.33
Tirant_lo_Blanch        10    0.739    43.82   0.121       4.74
TorstenTimm              1    0.610   144.11   0.142       2.99
TorstenTimm              2    0.610   144.11   0.142       2.99
TorstenTimm              3    0.664   109.02   0.130       3.48
TorstenTimm              5    0.706    76.33   0.113       4.09
TorstenTimm              7    0.725    60.25   0.107       4.47
TorstenTimm             10    0.739    46.54   0.103       4.87
Voynich                  1    0.535   410.18   0.281       2.50
Voynich                  2    0.535   410.18   0.284       2.50
Voynich                  3    0.615   307.83   0.267       3.02
Voynich                  5    0.674   201.41   0.237       3.68
Voynich                  7    0.698   152.95   0.232       4.11
Voynich                 10    0.717   115.74   0.223       4.56

Across all corpora, increasing K makes the graph more tightly connected and less modular, but the Voynich keeps its distinct profile. Its small-world index is several times larger than any other text at every scale, and even at K = 10 it remains roughly twice that of the closest competitor. Clustering converges around 0.7 for all, yet Voynich begins lower and catches up only slowly. Modularity drops from 0.28 to 0.22, remaining midway between the natural languages and the Timm generator. Entropy grows with K as expected, but the Voynich still shows lower H_mean and a higher proportion of deterministic transitions.

The K-sweep reinforces the pattern seen before: the Voynich network is not random, but its combination of high connectivity, modest modularity, and low entropy makes it stand apart from both natural language and algorithmic texts.
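For reference, here is a sketch of the sweep loop for most of the tabled KPIs (a σ sketch appears further down the thread). It assumes the hypothetical cooccurrence_graph builder sketched above and a tokens list; reading H_mean as the mean Shannon entropy of each node's edge-weight distribution is my guess at the statistic behind the table:

Code:
import math
import networkx as nx

def kpis(G):
    # Clustering, greedy-modularity Q, and mean node-level entropy.
    C = nx.average_clustering(G)
    communities = nx.community.greedy_modularity_communities(G)
    Q = nx.community.modularity(G, communities)
    hs = []
    for n in G:
        w = [d["weight"] for _, _, d in G.edges(n, data=True)]
        t = sum(w)
        hs.append(-sum((x / t) * math.log2(x / t) for x in w))
    return C, Q, sum(hs) / len(hs)

for K in (1, 2, 3, 5, 7, 10):
    print(K, kpis(cooccurrence_graph(tokens, window=K)))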
I also repeated the Voynich run for K = 5, this time restricting co-occurrence windows to stay within paragraph boundaries, so that tokens at the end of one paragraph don’t link to those at the beginning of the next (note that in the previous results the window could catch words from the next paragraph; in this run it does not). The difference is subtle but measurable across several KPIs.

Setting                           Nodes   Edges    C      σ       Modularity  avg_degree  kcore_max  degree_gini  H_mean
Voynich (K=5)                     8744    130797   0.674  201.41  0.237       29.91       85         0.621        3.68
Voynich (K=5, paragraph-limited)  8324    115394   0.712  219.29  0.228       27.73       82         0.651        4.28

Confining the windows within paragraphs slightly reduces the network's size and density, but the structure becomes more tightly clustered (C rises from 0.67 to 0.71) and even more small-world (σ ≈ 219 vs. 201). Restricting co-occurrences to within paragraphs increases clustering while keeping path lengths comparable, which numerically raises σ. In practice, this means the network becomes more locally cohesive rather than globally more efficient. The degree distribution becomes slightly more unequal (higher Gini), and modularity remains stable. Mean entropy increases, which is expected: limiting the context removes the smoothing effect of cross-paragraph transitions on each local neighbourhood.

Limiting co-occurrences to within paragraphs makes the Voynich network smaller but more internally cohesive: words cluster more tightly inside paragraphs, paths stay short, and entropy rises as each section becomes more self-contained and less connected to the rest of the text.

The overall picture suggests that paragraph boundaries act as real topological separators in the Voynich text. When preserved, they sharpen the internal cohesion of each component while still keeping the global small-world pattern intact, reinforcing the impression of a text with strong local repetition and high global connectivity.
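For completeness, the paragraph-limited variant only needs a thin wrapper around the builder sketched earlier, assuming the text arrives as a list of per-paragraph token lists (again my reconstruction, not the code actually used):

Code:
def cooccurrence_graph_paragraphs(paragraphs, window=5):
    # Windows never cross a paragraph boundary: each paragraph
    # (a list of tokens) is processed on its own and the edge
    # weights are merged into a single graph.
    G = nx.Graph()
    for para in paragraphs:
        H = cooccurrence_graph(para, window=window)
        for a, b, d in H.edges(data=True):
            if G.has_edge(a, b):
                G[a][b]["weight"] += d["weight"]
            else:
                G.add_edge(a, b, weight=d["weight"])
    return G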

I am now re-running more text comparisons, including the language A and language B corpora (the Currier ones, not my topic-modelling ones). I will try to post the results tomorrow.
To compare the Voynich Manuscript with ordinary texts and with Voynich languages A and B, I built word co-occurrence graphs (window of 5 tokens) for each corpus (≈20,000 tokens each, except Voynich A at ≈11,000, since no more text is available). I noticed that the metrics depend too strongly on graph size (number of nodes), so I built this comparison from texts of similar length.

For every graph I computed key network metrics: clustering (C), path length (L), small-worldness (σ), degree inequality (Gini), eigenvector concentration, and resilience under targeted attacks. As said, all texts were normalized in size to make the metrics comparable.
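Two of these metrics have no single standard definition, so for reproducibility here is one plausible reading of "degree Gini" and "eigenvector concentration". Both definitions are my assumptions, reusing the hypothetical builder and tokens list from the earlier sketches:

Code:
import networkx as nx

def gini(values):
    # Gini coefficient of a list of non-negative values.
    xs = sorted(values)
    n = len(xs)
    return 2 * sum((i + 1) * x for i, x in enumerate(xs)) / (n * sum(xs)) - (n + 1) / n

G = cooccurrence_graph(tokens, window=5)
degree_gini = gini([d for _, d in G.degree()])

# One plausible "eigenvector concentration": the Herfindahl index of
# the normalized eigenvector-centrality mass (definition assumed).
ev = nx.eigenvector_centrality(G, max_iter=1000)
total = sum(ev.values())
eigen_conc = sum((v / total) ** 2 for v in ev.values())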

Here are the results:

Corpus                    C     L     σ     Gini deg  Eigen conc.  Resilience ½
LA In Psalmum Exposito    0.63  2.48  4.40  0.55      0.0031       0.22
FR La reine Margot        0.69  2.26  3.03  0.62      0.0056       0.21
VOY Marco's Markov        0.65  2.46  3.19  0.62      0.0031       0.22
EN Romeo and Juliet       0.70  2.26  2.84  0.63      0.0053       0.23
DE Simpl. Simpl.          0.67  2.35  3.43  0.60      0.0039       0.21
Voynich EVA               0.66  2.60  3.95  0.58      0.0037       0.18
Voynich EVA A             0.65  2.56  4.18  0.53      0.0042       0.18
Voynich EVA B             0.67  2.58  3.60  0.59      0.0044       0.17
T.Timm                    0.68  2.24  2.78  0.63      0.0069       0.24
CA Tirant lo Blanch       0.70  2.12  3.00  0.62      0.0085       0.22
Voynich CUVA              0.65  2.65  4.15  0.57      0.0033       0.18
LA De docta ignorantia    0.64  2.31  3.28  0.60      0.0063       0.26
ES Lazarillo              0.71  2.19  3.47  0.58      0.0079       0.19
Naibe Cipher              0.62  2.33  3.23  0.58      0.0040       0.29
VI Vietnamese Stolfi      0.67  2.16  2.20  0.62      0.0063       0.31

What we can see:

- All Voynich sections have higher σ (small-worldness) than ordinary texts, meaning stronger local clustering and weaker global connectivity. (Note: here the small-world coefficient σ was computed against degree-preserving random graphs, not simple Erdős–Rényi graphs. Each random control keeps the same degree distribution as the original, which makes σ more reliable for linguistic networks and explains why the values are lower than in previous posts. A sketch of this computation follows after this list.)
- Average path length L is slightly longer: information (or adjacency) travels less efficiently across the graph, suggesting that the network is more fragmented into local clusters with fewer bridging links connecting them.
- Degree Gini is lower: no strong “hub” tokens comparable to function words in real languages, indicating a more uniform connectivity pattern where all nodes share similar importance instead of a few dominating the network.
- Eigenvector concentration and resilience are both lower: influence and robustness are evenly spread rather than hierarchical, implying that the network lacks dominant central nodes and would fragment quickly if connections were removed.
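For reference, here is a minimal sketch of computing σ against degree-preserving controls with networkx; this is my reconstruction of the procedure described in the first bullet, not necessarily the code used for the table:

Code:
import networkx as nx

def sigma_degree_preserving(G, n_rand=5, swaps_per_edge=10):
    # sigma = (C/Cr) / (L/Lr), with Cr and Lr averaged over
    # degree-preserving rewirings (connected double edge swaps).
    # Exact shortest paths are slow on large graphs; this is a
    # sketch, not an optimized implementation.
    giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()
    C = nx.average_clustering(giant)
    L = nx.average_shortest_path_length(giant)
    Cr = Lr = 0.0
    for _ in range(n_rand):
        R = giant.copy()
        nx.connected_double_edge_swap(R, nswap=swaps_per_edge * R.number_of_edges())
        Cr += nx.average_clustering(R) / n_rand
        Lr += nx.average_shortest_path_length(R) / n_rand
    return (C / Cr) / (L / Lr)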

I show here a couple of plots:

[attachment=11854][attachment=11853][attachment=11852]

Natural languages form networks with a few highly connected function words and many peripheral ones.
The Voynich graphs are more uniform and modular: dense local clusters, few bridges, and a fragile large-scale structure.

As deduced throughout the thread, this pattern fits a system built from repetitive combinational rules rather than a natural linguistic process, with structural dependencies extending across at least a five-token window rather than just immediate bigram links.
Can you please also update the two graphs you posted before with the new data, i.e. the ones in your earlier post (labelled Modularity vs. Character entropy h1 and Modularity vs. conditional entropy H1)? Possibly with shuffled Voynich too, for reference. Thanks in advance!
(30-10-2025, 01:17 PM)Mauro Wrote: Can you please also update the two graphs you posted before with the new data, i.e. the ones in your earlier post (labelled Modularity vs. Character entropy h1 and Modularity vs. conditional entropy H1)? Possibly with shuffled Voynich too, for reference. Thanks in advance!

Hello Mauro, I am working on it. I am trying to find a normalization for the KPIs, so we can compare texts with different numbers of tokens, as some KPIs are node-count dependent. That's why some texts in the same language plotted so far apart. I am trying to get normalized data so we can compare as many different texts as possible.
Hi, very interesting work. Could you try Portuguese for me? Maybe even a special text in Old Portuguese?
(30-10-2025, 11:52 PM)Kaybo Wrote: Hi, very interesting work. Could you try Portuguese for me? Maybe even a special text in Old Portuguese?

Hi Kaybo,

please provide the texts that you want me to analyze.

Thank you :)
(30-10-2025, 11:52 PM)Kaybo Wrote: Could you try Portuguese for me?

Quimqu already tested some Portuguese texts earlier in the thread. Would those do?