Inside the Voynich Network: graph analysis - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Inside the Voynich Network: graph analysis (/thread-4998.html)
RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 18-11-2025

(18-11-2025, 04:49 PM)quimqu Wrote: But this needs to be tested and proofed... I really don't know if this can be so easy... (Guess not).

I don't think it is worth wasting time on that. Adjusting the files to somehow compensate for the grammatical differences would be a horribly complicated task. For one thing, a word in one language may become two or more words in the other, separated by unrelated words: Italian "Luigi canterà?" = English "Will Luigi sing?"

I don't know any Hebrew, but, just for fun and edification, I got from the net the first verse of the Bible (GEN 1:1) in Hebrew, transcribed as it would be written in antiquity (without vowel/pronunciation marks), together with what seems to be a sort of modern scholarly translation of the same:

BREŠYT BRE ELHYM ET HŠMYM WET HERṢ

BREŠYT = "in the beginning of"
BRE = "the creation"
ELHM = "by the gods"
ET = [direct object marker]
HŠMYM = "of the Heavens"
WET = "and"
HERṢ = "the Earth"

The "E" is aleph (א), which Wikipedia says is used only as a syllable marker, or whatever. The ELHM is morphologically plural ("gods") but seems to be grammatically singular, and is now assumed to be a conventional/honorific way to refer to (the single) God.

Again, I don't know any Hebrew, so the above example must be full of errors. The point was only to show how far apart the local structure of two translations of the same text can be. The sense of the translation above is very different from that of the traditional one, "In the beginning God created Heavens and Earth". And that is another problem to keep in mind: every language is ambiguous, and the ambiguities generally do not match. Thus the job of translation constantly requires choosing one possible interpretation of the input words, and writing something that could be interpreted in the same way...
All the best, --stolfi

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

Well, I finally got the results for the Pentateuch in multiple languages. The run was quite long: obtaining the KPIs for each language took about an hour and a half, except for Hebrew, which took about 10 hours. I first attach the results in this table and some plots, and then some analysis (summarized with the help of AI, with my supervision afterwards; it would be good if some language expert (Jorge Stolfi? volunteers?) could go through the text to see if it really makes sense; I have checked what I could, but I am not a linguist). I have other plots, but this post is quite long, so if you are interested in any of them, just ask and I will post them. Here are the main KPIs by language. Remember that it is the same text, the first 5 books of the Bible, translated into multiple languages:
Following, I attach the PCA (dimension reduction) plot, where we can see, summarized, how each language is positioned:

PCA1 captures how morphologically rich, structurally diverse, and modular a language's co-occurrence graph is.
- High PCA1 means many distinct forms, many communities, and a dispersed graph.
- Low PCA1 means a compact lexicon, more predictable structure, and stronger reliance on a small set of repeated forms.

PCA2 captures how robust or centralized the graph is.
- High PCA2 means the graph does not depend on a few hubs, so it stays connected even if central nodes are removed.
- Low PCA2 means strong hub dependency and a more fragile structure.

A summary of each language:

Hebrew: Very morphologically rich, with high structural diversity. Also quite robust. This is one of the most extreme languages in the dataset (the one that took 10 hours to analyse).
Latin: Very morphologically rich, with a large, diverse graph. Slightly centralized and less robust than Hebrew.
Russian: Very morphologically rich, but with a more centralized structure than Latin or Hebrew.
Mandarin Other: Very low morphological diversity, but extremely high robustness. The graph is even and not dominated by hubs.
Mandarin Union: Same pattern as the other Mandarin version, and even more robust.
Vietnamese: Low morphological diversity. Slightly centralized, but not as much as the European languages.
English: Low morphological diversity and strongly centralized. The graph depends heavily on a few very frequent words.
German: More diverse than English, but still centralized. The fact that German joins words into compounds increases variety, but hub words play a strong role.
French: Low morphological diversity and the most centralized structure in the dataset.
Spanish: A bit more diverse than French, but similarly centralized and fragile.

Here is an in-depth analysis of each language (Warning! This is AI generated; even though I have checked the output, I am not a linguist).
English Profile: A compact lexicon, strongly centralized network, extremely high degree inequality, and strong dependence on a small set of ultra frequent function words. Graph characteristics:
English translations rely heavily on a closed class of very frequent function words that shape the graph. This produces a hub dominated network with fast navigation but low robustness. Comparisons:
French Profile: Similar to English but slightly more function word dependent and even less resilient. Graph characteristics:
French forms a tightly controlled network dominated by a few grammatical words. Removing them contracts the graph substantially. Comparisons:
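To make this "removing hubs contracts the graph" idea concrete, here is a minimal, hypothetical sketch (not the actual analysis code from this thread, and assuming simple adjacent-word co-occurrence): it builds an undirected word graph from a function-word-heavy sentence and counts connected components before and after deleting the biggest hub.

```python
from collections import defaultdict

def bigram_graph(text):
    """Undirected co-occurrence graph over adjacent words."""
    adj = defaultdict(set)
    tokens = text.lower().replace(".", "").split()
    for a, b in zip(tokens, tokens[1:]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

def components(adj, removed=frozenset()):
    """Count connected components, ignoring the `removed` nodes."""
    seen, count = set(), 0
    for start in adj:
        if start in seen or start in removed:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen or node in removed:
                continue
            seen.add(node)
            stack.extend(adj[node])
    return count

fragile = ("the man and the woman went to the city "
           "the child and the dog walked in the city")
adj = bigram_graph(fragile)
hub = max(adj, key=lambda w: len(adj[w]))   # the most-connected word
print(hub, components(adj), components(adj, {hub}))
```

On this toy text the hub is "the"; the intact graph is one component, and removing that single word shatters it into four pieces, which is the fragility the profile above describes.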
Latin Profile: The largest lexicon, very long tail, and highly modular structure. Strong small world behavior combined with low reliance on hubs. Graph characteristics:
Latin forms a sprawling network where many inflected forms increase surface variety. Local clusters gather around semantic or syntactic roots, but global navigation does not rely on a tiny connector set. Comparisons:
Hebrew Profile: A distributed network shaped by root morphology. High clustering, moderate path lengths, and less dependence on hubs than Indo-European languages. Graph characteristics:
Hebrew combines strong local structure (root based cohesion) with relatively balanced connectivity. It avoids the extreme hub dependence of English or French while also avoiding the extreme expansion seen in Latin. Comparisons:
Mandarin Profile: A compact lexicon with highly even degree distribution. High resilience and lower small world behavior because clustering is less extreme relative to random graphs. Graph characteristics:
Mandarin appears as the most uniformly connected graph. Each node contributes modestly to connectivity, and no single node dominates. The network is robust and flatter than Indo-European languages. Comparisons:
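One simple way to quantify how "flat" such a degree distribution is — purely illustrative, not necessarily a metric used in this study — is the Gini coefficient of the node degrees: 0 means every node has the same degree (the uniform, robust case), while values closer to 1 mean connectivity is concentrated in a few hubs.

```python
def degree_gini(degrees):
    """Gini coefficient of a degree sequence.
    0 = perfectly even; near 1 = a few hubs dominate."""
    xs = sorted(degrees)
    n = len(xs)
    total = sum(xs)
    # G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cum / (n * total) - (n + 1) / n

flat = [3, 3, 3, 3, 3, 3]    # every node similar: robust graph
hubby = [1, 1, 1, 1, 1, 13]  # one dominant hub: fragile graph
print(degree_gini(flat), degree_gini(hubby))
```

The flat sequence gives 0 and the hub-dominated one a markedly higher value, matching the Mandarin-vs-English contrast described in these profiles.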
German Profile: A moderately large lexicon, strong compounding, and relatively balanced function word usage. German sits structurally between English/French and Latin/Hebrew, combining moderate hub centrality with richer lexical variety. Graph characteristics:
German forms a network with more lexical spread than other Western European languages. Compounding inflates the tail of the frequency distribution, increasing the number of low frequency nodes. This reduces extreme hub dominance and distributes connectivity across more words. The structure remains efficient but less tightly centralized. Comparisons:
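For readers who want to reproduce the simpler lexical KPIs behind profiles like these, here is a minimal sketch (an illustration with naive whitespace tokenization, not quimqu's actual pipeline) computing token/type counts, type-token ratio, and the number of distinct A → B transitions:

```python
def text_kpis(text):
    """A few simple lexical KPIs from a raw text.
    Whitespace tokenization only; a real analysis would normalize more."""
    tokens = text.lower().split()
    types = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))      # distinct A -> B transitions
    return {
        "tokens": len(tokens),
        "types": len(types),
        "ttr": len(types) / len(tokens),        # type-token ratio
        "edges": len(bigrams),
        "mean_out_degree": len(bigrams) / len(types),
    }

kpis = text_kpis("in the beginning god created the heavens and the earth")
print(kpis)
```

On this one-verse example the TTR is already 0.8; over a whole book the ratio drops sharply for analytic languages like English and stays higher for morphologically rich ones like Latin or Hebrew, which is the contrast the table above captures.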
RE: Inside the Voynich Network: graph analysis - Rafal - 19-11-2025

Quote: The ELHM is morphologically plural ("gods") but seems to be grammatically singular and now assumed to be a conventional/honorific way to refer to (the single) God.

Yes. It is standardly pronounced as Elohim, and is indeed a respectful plural for a single entity. Something like "We, king of England and Scotland...". Such a form is sometimes called "pluralis maiestatis".

As for the analysis:

Quote: PCA1 captures how morphologically rich, structurally diverse, and modular a language's co-occurrence graph is.

The graph makes sense to me. You basically get the difference between languages with declension and without it. Notice that Asian languages don't have declension, just like English. Hebrew seemed weird to me, but I learnt a few things from ChatGPT: it really matters whether you write Hebrew without vowels (as in ancient Biblical times) or with them, as currently. Adding vowels makes the number of unique words grow several times over. You seem to use a vowel-included transcription here. And Hebrew has declension (I was wrong earlier). So the data may be correct after all.

Quote: PCA2 captures how robust or centralized the graph is.

I have problems imagining it. Quimqu, could you give us an example: a sample of text which is robust and one which isn't? Or does such a feature emerge only for big texts?

And going back to the Voynich Manuscript... How would you say, what is the VM's position on PCA1 and PCA2?

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

(19-11-2025, 01:31 PM)Rafal Wrote: PCA2 captures how robust or centralized the graph is.

Hi Rafal, a centralized (fragile-graph) text would be: "The man and the woman went to the city. The child and the dog walked in the city. The workers and the farmers live in the city." Why fragile?
The words "the", "and" and "city" appear in every sentence, so they act as hubs in the graph. If we remove them, all the other words stop being connected to one another. The graph is centralized and not robust (PCA2 low). Languages like English, French, German and Spanish are like this.

A robust (uniform-graph) text would be: "Mountains surround the valley. Rivers carve paths through hills. Forests shelter animals and rivers. Hills border mountains that form valleys." There is no single connector that appears everywhere. Many words link to multiple clusters. Removing words doesn't collapse the graph, as most words remain connected. This is what happens in Mandarin (logical, as a Chinese character can stand for multiple words) and, to a lesser extent, Hebrew.

(19-11-2025, 01:31 PM)Rafal Wrote: And going back to the Voynich Manuscript... How would you say, what is the VM's position on PCA1 and PCA2?

Well, graphs depend on text length, theme, style, language, etc. I'd rather not put the Voynich into the Pentateuch language comparison. But you can see how the MS behaves compared with similar corpora (in length) and different languages in this [link].

RE: Inside the Voynich Network: graph analysis - Rafal - 19-11-2025

Quote: But you can see how the MS behaves compared with similar corpora (in length) and different languages in this [link].

Do I think correctly that the dimensions and their interpretation in Principal Component Analysis emerge from the data used, and so are different each time? So is the meaning of PCA1 and PCA2 the same on both graphs?

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

(19-11-2025, 02:20 PM)Rafal Wrote:
Do I think correctly that the dimensions and their interpretation in Principal Component Analysis emerge from the data used, and so are different each time?

Yes. PCA dimensions always depend on the data you use. This means the patterns that define PCA1 and PCA2 change whenever the values or graph metrics change. So the interpretation of the components is not fixed: PCA1 and PCA2 in one analysis will not necessarily mean the same as PCA1 and PCA2 in another. It is a good way to summarize a whole bunch of results in an easy-to-read plot, but it is quite difficult to interpret the meaning of the PCA components. That's why I said I have many other plots to check the graph characteristics separately.

RE: Inside the Voynich Network: graph analysis - quimqu - 20-11-2025

Yesterday Rafal asked me to position the Voynich. I told him that I thought it was not a good idea to put the Voynich into the Pentateuch study, but after rethinking it... well, we don't know the text of the Voynich, so comparing it with the Pentateuch is no different from comparing it with other books... And so I put the Voynich data into the PCA plot:

We can see that the Voynich (as a whole) comes out close to Latin in the PCA plot, but Voynich A and Voynich B behave quite differently from each other. The thing is that PCA can be a bit tricky here, because what it shows is mainly the directions in which the data varies the most. It is a plot of variation, not a plot of real similarity. So in those high-variation dimensions, the Voynich ends up near Latin. But that doesn't mean they are actually close when you look at the underlying metrics. And when we focus on the KPIs where the Voynich has smaller real distances... the results are actually quite surprising: if we calculate the Euclidean distance over the KPIs, for the whole Voynich corpus (EVA, CUVA, A and B) the nearest language is Mandarin, followed by English.
But how near is it to Mandarin? In fact, not near at all. Graph metrics show that, although the Voynich may be closer to Mandarin than to other languages, it really doesn't resemble it: the differences are very large. The word occurrence patterns of the Voynich text do not resemble the word occurrence patterns of those translations of the Pentateuch.

Path length: Voynich graphs have longer paths, so less internal connectivity and more fragmentation. Natural languages, such as Mandarin, show more compact and well-connected networks. In natural languages, contexts tend to cluster into communities and repeated patterns, which is not the case in the Voynich.

Degree preservation (sigma_degpres): This indicator is much higher in the MS than in any natural language in the Pentateuch corpus. This means the distribution of bigrams and co-occurrences is much more rigid and less compatible with the variability of any real language. In the Voynich, nodes with similar patterns have a greater tendency to connect to each other, while in natural languages this tendency is very weak or non-existent.

Type-token ratio: The Voynich shows a much higher diversity of forms than the human languages, especially in the EVA A variant. This high TTR is incompatible with natural texts, which have much more marked repetitions and structures. The MS uses lots of different words but repeats them much less than any natural language.

Zipf distribution: Although it follows an approximately Zipfian behavior, the slope is somewhat steeper than in natural languages: a more regular, more controlled frequency distribution, which again does not fit the variability of any real language. This means that the most frequent words are much more frequent than in a natural language, and the less frequent words are much less frequent as well.
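The Zipf slope mentioned above can be estimated with a simple least-squares fit of log(frequency) against log(rank). A minimal sketch (an illustration, not the thread's actual code — real studies usually fit only part of the rank range):

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs log(rank).
    Natural-language texts typically give a slope near -1; a steeper
    (more negative) slope means a more rigid, controlled distribution."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# toy corpus whose rank-frequency profile is roughly f = 100 / rank
toy = [w for i, w in enumerate("abcdefghij") for _ in range(100 // (i + 1))]
print(zipf_slope(toy))
```

On this synthetic near-Zipfian corpus the fitted slope comes out close to -1, the benchmark against which a steeper Voynich slope would be judged.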
A final note: In this study we are comparing graph-based KPIs extracted from the Pentateuch in several languages with the same KPIs extracted from the Voynich text. Since the content of the Voynich manuscript is unknown, part of the differences observed in the graph metrics could be influenced by the underlying subject matter or internal structure of the text. In other words, some patterns may reflect the nature of the content itself rather than a linguistic property. For this reason, these results should be interpreted with caution. The comparison shows structural differences, but it does not imply that the Voynich behaves like or unlike any specific language in terms of meaning or topic.

RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 20-11-2025

(20-11-2025, 09:39 AM)quimqu Wrote: the differences are very large and indicate a completely different behavior from any natural language.

Thanks for all this data, but let me emphasize again that those metrics are not properties of the language, but of the text. You observed it yourself in the previous post, where you explained what the fragility measure means. Moreover, as the Pentateuch test shows, they are not properties of the semantic contents of the text either, but only of the patterns in which the words occur. Even two translations of the same text into the same language (see the two Chinese versions) will give different metrics. So the correct statement would be "the word occurrence patterns of the Voynich text do not resemble the word occurrence patterns of those translations of the Pentateuch." Which is not unexpected, since the structure of the text of the typical Herbal is very different from that of a narrative like the Pentateuch.
The next interesting test would be to compare the Voynich Herbal texts with the texts of Marco's Alchemical Herbal, in Latin and English.

All the best, --stolfi

RE: Inside the Voynich Network: graph analysis - Philipp Harland - 20-11-2025

(18-11-2025, 12:48 AM)Philipp Harland Wrote: Seems like a very interesting method. I don't know if it's in the literature or not, but it's fascinating nonetheless. It does seem like a convenient framework for generating insights about a piece of text.

I'm just wondering if the mileage is worth the time and resources it takes, is all. The graph quimqu posted looks pseudo-parabolic, that's a start. But we'd need a lot more data to truly come to a meaningful conclusion.

RE: Inside the Voynich Network: graph analysis - nablator - 20-11-2025

(07-11-2025, 11:24 PM)quimqu Wrote: This first part of the analysis is based on directed word-to-word graphs, where each edge connects a token A → B if word B follows A in the text. This approach keeps the natural direction of information flow, unlike undirected co-occurrence graphs that only record proximity.

What are your exact criteria of co-occurrence? Is there a distance limit? With bifolia possibly reordered (wrong page order) and non-sequential components (paragraphs, circular texts, radial texts, labels, etc.), the VMS cannot be processed like other texts.
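As a point of reference for that question, the plain "word B follows word A" criterion quoted above can be sketched as follows, with a hypothetical `window` parameter as one way a distance limit could be added (the thread does not state the actual criteria used):

```python
from collections import defaultdict

def directed_graph(tokens, window=1):
    """Directed word graph: add a weighted edge A -> B whenever B occurs
    within `window` tokens after A. window=1 is the plain bigram
    criterion ("B follows A"); larger windows are one possible way to
    define looser co-occurrence with an explicit distance limit."""
    edges = defaultdict(int)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            edges[(a, b)] += 1
    return dict(edges)

tokens = "the man and the woman went to the city".split()
graph = directed_graph(tokens, window=1)
print(len(graph), graph[("the", "man")])
```

Note that any such sliding window silently assumes the tokens form one continuous sequence, which is exactly what nablator questions for reordered bifolia, labels, and circular or radial texts.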