Inside the Voynich Network: graph analysis

Inside the Voynich Network: graph analysis - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Inside the Voynich Network: graph analysis (/thread-4998.html)

Pages: 1 2 3 4 5 6 7 8 9 10 11

RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 07-11-2025

(07-11-2025, 12:46 AM)ReneZ Wrote: You are not allowed to view links. Register or Login to view.I have also played with my own:
You are not allowed to view links. Register or Login to view.
The 'fun' results are in Annex A.

Wow, thanks Rene, that is a very nice explanation of Markov generators!

I also wrote my own to generate the above samples, but using words as units rather than characters. That automatically makes all output tokens be valid words, so the nonsense is evident only when one tries to parse the sentences.

It is frogmonkey.py in You are not allowed to view links. Register or Login to view.. It imports the file error_funcs.py in the same folder.

All the best, --stolfi

RE: Inside the Voynich Network: graph analysis - MarcoP - 07-11-2025

About You are not allowed to view links. Register or Login to view.: a few years ago, I experimented with You are not allowed to view links. Register or Login to view.. As always, it didn't go well, but there was a marginally interesting result: results for Q13 and Q20 are comparable with each other. My results also partially overlap with the core loop, in particular:

Code:
ol(green)->shedy/chedy(purple)->qokedy(blue)->qokedy(blue,loop)

                                       |

                                       V

                                 shedy/chedy(purple)

I am not sure I read correctly the "core loop" and qokaiin doesn't appear in the table on the left, but it is possible that the two-ways connection between shedy(purple) and qokaiin(gray) is also in agreement.

Filename: res1.jpg Size: 136.07 KB 07-11-2025, 07:09 AM

RE: Inside the Voynich Network: graph analysis - magnesium - 07-11-2025

(01-11-2025, 12:16 AM)quimqu Wrote: You are not allowed to view links. Register or Login to view.Again, a quite dense post. So, I summarize here first the findings, and then, if you are interested, you can deepen into the dense part of the post.

This is fascinating work! Thank you for including Naibbe ciphertext in your analysis. While I knew that Naibbe ciphertext wouldn't be perfect at replicating the VMS, I'm so pleased to see people testing its additional properties.

RE: Inside the Voynich Network: graph analysis - quimqu - 07-11-2025

Let's continue with the analysis of the Voynich MS though graphs.

This first part of the analysis is based on directed word-to-word graphs, where each edge connects a token A → B if word B follows A in the text. This approach keeps the natural direction of information flow, unlike undirected co-occurrence graphs that only record proximity. Each corpus was converted into a directed graph and measured using a set of topological and information-theoretic metrics. These metrics describe structure, hierarchy, and randomness in how words connect.

The comparison includes three groups of texts:

Natural languages: medieval, classical, and modern works in Latin, English, French, Spanish, and others.
Voynich texts: the full Voynich corpus and its internal subsets (A, B).
Artificial ciphers: the Naibe cipher and Torsten Timm’s generated text.

The table below summarizes the main indicators for all groups, showing the range of values for each metric and how Voynich differs from natural and artificial systems.

KPI	Natural texts (min–max)	Voynich (min–max)	Artificial (Naibe, Timm)	Description	Voynich vs Natural / Artificial
H0_mean_lifetime	0.79–0.97	0.81–0.85	0.79–0.89	Mean persistence of connected components. Measures graph cohesion at 0-dim topology.	Voynich is slightly lower than natural texts, close to artificial, suggesting weaker component stability.
H1_frac_inf	≈0.00 (none)	≈0.00 (none)	≈0.00 (none)	Fraction of infinite 1-dim holes. Reflects whether loops persist indefinitely.	No difference; all show finite loops only.
avg_clustering	0.38–0.53	0.50–0.57	0.45–0.82	Local density of triangles. Measures how often neighbors are connected.	Voynich has moderate clustering, slightly higher than natural texts but lower than some artificial graphs.
spectral_gap	1.6–5.9	2.9–4.5	5.9	Second Laplacian eigenvalue gap. Indicates overall graph connectivity and mixing speed.	Voynich falls mid-range; less connected than natural, less random than artificial.
kcore_max	4–17	5–12	8–17	Maximal k-core index. Captures how strongly nodes are mutually connected in dense cores.	Voynich has mid-core density, lower than typical artificial graphs.
flow_hierarchy	0.76–0.91	0.58–0.65	0.51–0.64	Fraction of edges in acyclic subgraphs. Quantifies directionality and information flow.	Voynich shows weaker directionality, closer to artificial graphs → more isotropic structure.
edge_reciprocity	0.08–0.24	0.34–0.42	0.36–0.48	Proportion of mutual (bidirectional) edges. Measures how often relations are symmetric.	Voynich has high reciprocity, like artificial ciphers, unlike natural directional flow.
entropy_rate	0.53–1.59	0.89–1.87	1.74–2.01	Average information rate of random walk transitions. Higher = less predictable.	Voynich entropy sits between natural and artificial → partially random transition dynamics.
degree_assortativity	-0.10–0.15	≈0.00	≈0.00	Correlation between node degrees. Positive = hubs connect to hubs.	Voynich similar to artificial: nearly neutral assortativity (no social-like structure).
avg_shortest_path	2.0–2.9	2.1–2.2	2.0–2.1	Mean minimal steps between nodes. Reflects overall navigability.	Voynich is very close to artificial; both slightly shorter paths than natural languages.

To make it visual, I have reduced all the KPI's to a two dimension vector with PCA method and plotted. Ypou can see that Torsten Timm's generated text is fairly away from any other text, while Voynich (full, A or B) are even closer to natural languages than naibe cipher. Here are the results:

Filename: directed.jpg Size: 41.59 KB 07-11-2025, 11:00 PM

I thought that reducing the KPI's to two dimensions with PCA method helps positioning each text. So I did the same experiment on the co-ocurrence graphs generated at the first part of this study.

This plot shows the position of each corpus based on ten structural metrics of its word co-occurrence graph. Each point represents a text, and the axes (PCA1 and PCA2) combine multiple graph properties into two main dimensions that explain most of the variation. The analysis includes natural languages, the Voynich text (EVA and CUVA) and its internal subsets (A and B), and thre artificial or generated corpora (Naibe Cipher, Timm and Mauro's Markov). By reducing all metrics to a two-dimensional map, we can see how similar or different the internal network organization of each text is:

Filename: Coocurrence.jpg Size: 51.85 KB 07-11-2025, 11:30 PM

The Voynich samples form a compact cluster close to natural languages, but slightly below them along the first axis, suggesting similar small-world and modular properties but lower diversity and reciprocity. The artificial texts are more scattered and show larger deviations, with the Markov and Naibe models standing apart in the upper area (with non european text or the Rohon), indicating higher uniformity and weaker hierarchical patterns. Overall, the Voynich graphs behave more like "European" natural languages than synthetic ones or asiatic ones, though they remain slightly detached, showing a distinct but structured internal topology.

RE: Inside the Voynich Network: graph analysis - Rafal - 08-11-2025

Are you able to interpret pca1 and pca2, the new 2 dimensions that emerged from your analysis? What features of text do they describe?

RE: Inside the Voynich Network: graph analysis - quimqu - 08-11-2025

(08-11-2025, 12:21 AM)Rafal Wrote: You are not allowed to view links. Register or Login to view.Are you able to interpret pca1 and pca2, the new 2 dimensions that emerged from your analysis? What features of text do they describe?

Yes, I won't bother you with numbers, but if we check the components we can say something like this:

For Directed graphs (A to B) (I attach again the plots for better understanding):

Filename: directed.jpg Size: 41.59 KB 08-11-2025, 09:14 AM

High PCA1: networks where words link back and forth easily, short cycles, high clustering and entropy. More symmetric flow of information.
Low PCA1: networks with strong directional chains, longer paths, less reciprocity. More syntactically constrained, hierarchical language flow.

PCA1 measures the degree of reciprocity vs. hierarchy in the word graph: how balanced the directional connections are.

High PCA2: corresponds to more fragmented, less cohesive structures
Low PCA2: means dense, compact graphs with a strong central core and tightly connected components.

PCA2 describes how compact or fragmented the word network is: low values mean a dense, cohesive core, while high values indicate a loose, fragmented structure.

For co-ocurrence graphs (window of 5 tokens connections):

Filename: Coocurrence.jpg Size: 51.85 KB 08-11-2025, 09:15 AM

High PCA1: high reciprocity, high clustering, steeper Zipf slope. Words tend to co-occur in repeated local patterns (dense and regular).
Low PCA1: high type–token ratio, strong modularity, high degree assortativity. More lexical variety and topic segmentation.

In this case, PCA1 captures lexical diversity vs. repetition and regularity in the co-occurrence network.It measures whether a text’s structure is broad and modular (many distinct word groups) or tight and repetitive (dense clusters of recurring pairs).

High PCA2: more evenly distributed connections, higher resilience, and longer paths. The network tolerates removal of nodes without breaking apart.
Low PCA2: high inequality of degree (large gini_deg), meaning a few dominant hubs control most co-occurrences.

PCA2 describes core centralization vs. distributed connectivity: how much the network depends on key hubs.

RE: Inside the Voynich Network: graph analysis - Rafal - 17-11-2025

Quimqu, iy was suggested that you compare the same text in different languages with your methods:

You are not allowed to view links. Register or Login to view.

Now as we are comparing different texts in different languages we don't know the source of differences.
The good candidate would be Bible as we can easily get it in most languages.

The interesting thing would be to compare languages that really differ. So no Spanish to Portugese but typical European languages to exotic ones (Chinese,Japanese, Swahili) and"weird" European (Hungarian, Finnish).

We could see how language nature like being agglutinative or not impacts your statistics with the same content.
Would you be interested?

RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 17-11-2025

I have the following versions of the Pentateuch (first 5 book of the Old Testament)
translated to the follwoing languages:

You are not allowed to view links. Register or Login to view. Mandarin (Union)
You are not allowed to view links. Register or Login to view. Mandarin (another transl.)
You are not allowed to view links. Register or Login to view. Hebrew
You are not allowed to view links. Register or Login to view. Latin (Vulgate)
You are not allowed to view links. Register or Login to view. Russian (Synodal)
You are not allowed to view links. Register or Login to view. Vietnamese (Cadaman)

Those texts have been cleaned, and those files have one token per line,
with each token clearly marked as word, punctuation, or symbol. However
the character set and spelling in each file is specific to each
language, so it may need some re-coding before it can be fed to an
analysis program.

The Chinese texts, in particular, use an old encoding with two bytes per
Chinese character. There is a Linux program "autogb -i gb -o utf8" that
converts them to the modern Unicode/UTF8.

For character and word structure analysys the Chinese characters would
then have to be converted to the phonetic pinyin Romanization; but for
word-level analysis (like those graph measures) it would be enough to
map each Chinese Unicode character to some distinct ascii string, like
'U+FCBA' or '{40567}''

Hope it helps, --stolfi

RE: Inside the Voynich Network: graph analysis - quimqu - 17-11-2025

(17-11-2025, 12:29 PM)Rafal Wrote: You are not allowed to view links. Register or Login to view.Now as we are comparing different texts in different languages we don't know the source of differences.
The good candidate would be Bible as we can easily get it in most languages.

Thanks Rafal, yes, this was one of the items on my list: search for the same text in different laguages (as much different, better).

(17-11-2025, 02:50 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view.The Chinese texts, in particular, use an old encoding with two bytes per
Chinese character. There is a Linux program "autogb -i gb -o utf8" that
converts them to the modern Unicode/UTF8.

Hello Jorge,

I have taken a look at the files. I think I can use them directly. I understand "a" is character and "p" is punctuaton, so I just need to put all "a" and "p" in a row to get the words and punctuation.

I will try to get also the Pentateuch in different european languages, so we can compare all of them in a single run.

Thank you both.

RE: Inside the Voynich Network: graph analysis - nablator - 17-11-2025

(17-11-2025, 02:50 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view.The Chinese texts, in particular, use an old encoding with two bytes per
Chinese character. There is a Linux program "autogb -i gb -o utf8" that
converts them to the modern Unicode/UTF8.

Notepad++ auto-detects GB2312 and converts to UTF-8 or whatever encoding you want.