Inside the Voynich Network: graph analysis - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Inside the Voynich Network: graph analysis (/thread-4998.html)
RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 18-11-2025

(18-11-2025, 04:49 PM)quimqu Wrote: But this needs to be tested and proofed... I really don't know if this can be so easy... (Guess not).

I don't think it is worth wasting time on that. Adjusting the files to somehow compensate for the grammatical differences would be a horribly complicated task. For one thing, a word in one language may become two or more words in the other, separated by unrelated words: Italian "Luigi canterà?" = English "Will Luigi sing?"

I don't know any Hebrew, but, just for fun and edification, I got from the net the first verse of the Bible (GEN 1:1) in Hebrew, transcribed as it would be written in antiquity (without vowel/pronunciation marks), together with what seems to be a sort of modern scholarly translation of the same:

BREŠYT BRE ELHYM ET HŠMYM WET HERṢ

BREŠYT = "in the beginning of"
BRE = "the creation"
ELHM = "by the gods"
ET = [direct object marker]
HŠMYM = "of the Heavens"
WET = "and"
HERṢ = "the Earth"

The "E" is aleph (א), which Wikipedia says is used only as a syllable marker, or whatever. The ELHM is morphologically plural ("gods") but seems to be grammatically singular, and is now assumed to be a conventional/honorific way to refer to (the single) God.

Again, I don't know any Hebrew, so the above example must be full of errors. The point was only to show how far apart the local structure of two translations of the same text can be. The sense of the translation above is very different from that of the traditional one, "In the beginning God created Heavens and Earth". And that is another problem to keep in mind: every language is ambiguous, and the ambiguities generally do not match. Thus the job of translation constantly requires choosing one possible interpretation of the input words, and writing something that could be interpreted in the same way...
All the best, --stolfi

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

Well, I finally got the results for the Pentateuch in multiple languages. The run was quite long: obtaining the KPIs for each language took about an hour and a half, except for Hebrew, which took about 10 hours. I first attach the results in this table and some plots, and then some analysis (summarized with the help of AI, with my supervision afterwards; it would be good if some language expert (Jorge Stolfi? volunteers?) could go through the text to see if it really makes sense; I have checked what I could, but I am not a linguist). I have other plots, but this post is quite long, so if you are interested in any of them, just ask and I will post them. Here are the main KPIs by language. Remember that it is the same text, the first 5 books of the Bible, translated into multiple languages:
Following, I attach the PCA (dimension reduction) plot, where we can see, summarized, how each language is positioned:

PCA1 captures how morphologically rich, structurally diverse, and modular a language's co-occurrence graph is.
- High PCA1 means many distinct forms, many communities, and a dispersed graph.
- Low PCA1 means a compact lexicon, more predictable structure, and stronger reliance on a small set of repeated forms.

PCA2 captures how robust or centralized the graph is.
- High PCA2 means the graph does not depend on a few hubs, so it stays connected even if central nodes are removed.
- Low PCA2 means strong hub dependency and a more fragile structure.

A summary of each language:

Hebrew: Very morphologically rich, with high structural diversity. Also quite robust. This is one of the most extreme languages in the dataset (the one that took 10 hours to analyse).
Latin: Very morphologically rich, with a large, diverse graph. Slightly centralized and less robust than Hebrew.
Russian: Very morphologically rich, but with a more centralized structure than Latin or Hebrew.
Mandarin Other: Very low morphological diversity, but extremely high robustness. The graph is even and not dominated by hubs.
Mandarin Union: Same pattern as the other Mandarin version, and even more robust.
Vietnamese: Low morphological diversity. Slightly centralized, but not as much as the European languages.
English: Low morphological diversity and strongly centralized. The graph depends heavily on a few very frequent words.
German: More diverse than English, but still centralized. The fact that German joins words into compounds increases variety, but hub words play a strong role.
French: Low morphological diversity and the most centralized structure in the dataset.
Spanish: A bit more diverse than French, but similarly centralized and fragile.

Here is an in-depth analysis of each language (Warning! This is AI generated; even though I have checked the output, I am not a linguist).
English Profile: A compact lexicon, strongly centralized network, extremely high degree inequality, and strong dependence on a small set of ultra frequent function words. Graph characteristics:
English translations rely heavily on a closed class of very frequent function words that shape the graph. This produces a hub dominated network with fast navigation but low robustness. Comparisons:
French Profile: Similar to English but slightly more function word dependent and even less resilient. Graph characteristics:
French forms a tightly controlled network dominated by a few grammatical words. Removing them contracts the graph substantially. Comparisons:
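To make this "removing hubs contracts the graph" idea concrete, here is a minimal, hypothetical sketch (not the actual analysis code from this thread, and assuming simple adjacent-word co-occurrence): it builds an undirected word graph from a function-word-heavy sentence and counts connected components before and after deleting the biggest hub.

```python
from collections import defaultdict

def bigram_graph(text):
    """Undirected co-occurrence graph over adjacent words."""
    adj = defaultdict(set)
    tokens = text.lower().replace(".", "").split()
    for a, b in zip(tokens, tokens[1:]):
        adj[a].add(b)
        adj[b].add(a)
    return adj

def components(adj, removed=frozenset()):
    """Count connected components, ignoring the `removed` nodes."""
    seen, count = set(), 0
    for start in adj:
        if start in seen or start in removed:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen or node in removed:
                continue
            seen.add(node)
            stack.extend(adj[node])
    return count

fragile = ("the man and the woman went to the city "
           "the child and the dog walked in the city")
adj = bigram_graph(fragile)
hub = max(adj, key=lambda w: len(adj[w]))   # the most-connected word
print(hub, components(adj), components(adj, {hub}))
```

On this toy text the hub is "the"; the intact graph is one component, and removing that single word shatters it into four pieces, which is the fragility the profile above describes.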
Latin Profile: The largest lexicon, very long tail, and highly modular structure. Strong small world behavior combined with low reliance on hubs. Graph characteristics:
Latin forms a sprawling network where many inflected forms increase surface variety. Local clusters gather around semantic or syntactic roots, but global navigation does not rely on a tiny connector set. Comparisons:
Hebrew Profile: A distributed network shaped by root morphology. High clustering, moderate path lengths, and less dependence on hubs than Indo-European languages. Graph characteristics:
Hebrew combines strong local structure (root based cohesion) with relatively balanced connectivity. It avoids the extreme hub dependence of English or French while also avoiding the extreme expansion seen in Latin. Comparisons:
Mandarin Profile: A compact lexicon with highly even degree distribution. High resilience and lower small world behavior because clustering is less extreme relative to random graphs. Graph characteristics:
Mandarin appears as the most uniformly connected graph. Each node contributes modestly to connectivity, and no single node dominates. The network is robust and flatter than Indo-European languages. Comparisons:
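One simple way to quantify how "flat" such a degree distribution is — purely illustrative, not necessarily a metric used in this study — is the Gini coefficient of the node degrees: 0 means every node has the same degree (the uniform, robust case), while values closer to 1 mean connectivity is concentrated in a few hubs.

```python
def degree_gini(degrees):
    """Gini coefficient of a degree sequence.
    0 = perfectly even; near 1 = a few hubs dominate."""
    xs = sorted(degrees)
    n = len(xs)
    total = sum(xs)
    # G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cum / (n * total) - (n + 1) / n

flat = [3, 3, 3, 3, 3, 3]    # every node similar: robust graph
hubby = [1, 1, 1, 1, 1, 13]  # one dominant hub: fragile graph
print(degree_gini(flat), degree_gini(hubby))
```

The flat sequence gives 0 and the hub-dominated one a markedly higher value, matching the Mandarin-vs-English contrast described in these profiles.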
German Profile: A moderately large lexicon, strong compounding, and relatively balanced function word usage. German sits structurally between English/French and Latin/Hebrew, combining moderate hub centrality with richer lexical variety. Graph characteristics:
German forms a network with more lexical spread than other Western European languages. Compounding inflates the tail of the frequency distribution, increasing the number of low frequency nodes. This reduces extreme hub dominance and distributes connectivity across more words. The structure remains efficient but less tightly centralized. Comparisons:
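For readers who want to reproduce the simpler lexical KPIs behind profiles like these, here is a minimal sketch (an illustration with naive whitespace tokenization, not quimqu's actual pipeline) computing token/type counts, type-token ratio, and the number of distinct A → B transitions:

```python
def text_kpis(text):
    """A few simple lexical KPIs from a raw text.
    Whitespace tokenization only; a real analysis would normalize more."""
    tokens = text.lower().split()
    types = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))      # distinct A -> B transitions
    return {
        "tokens": len(tokens),
        "types": len(types),
        "ttr": len(types) / len(tokens),        # type-token ratio
        "edges": len(bigrams),
        "mean_out_degree": len(bigrams) / len(types),
    }

kpis = text_kpis("in the beginning god created the heavens and the earth")
print(kpis)
```

On this one-verse example the TTR is already 0.8; over a whole book the ratio drops sharply for analytic languages like English and stays higher for morphologically rich ones like Latin or Hebrew, which is the contrast the table above captures.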
RE: Inside the Voynich Network: graph analysis - Rafal - 19-11-2025

Quote: The ELHM is morphologically plural ("gods") but seems to be grammatically singular and now assumed to be a conventional/honorific way to refer to (the single) God.

Yes. It is standardly pronounced as Elohim, and is indeed a respectful plural for a single entity. Something like "We, king of England and Scotland...". Such a form is sometimes called "pluralis maiestatis".

As for the analysis:

Quote: PCA1 captures how morphologically rich, structurally diverse, and modular a language's co-occurrence graph is.

The graph makes sense to me. You basically get the difference between languages with declension and without it. Notice that Asian languages don't have declension, just like English. Hebrew seemed weird to me, but I learnt a few things from ChatGPT: it really matters whether you write Hebrew without vowels (as in ancient Biblical times) or with them, as currently. Adding vowels makes the number of unique words grow several times over. You seem to use a vowel-included transcription here. And Hebrew has declension (I was wrong earlier). So the data may be correct after all.

Quote: PCA2 captures how robust or centralized the graph is.

I have problems imagining it. Quimqu, could you give us an example: a sample of text which is robust and one which isn't? Or does such a feature emerge only for big texts?

And going back to the Voynich Manuscript... How would you say, what is the VM's position on PCA1 and PCA2?

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

(19-11-2025, 01:31 PM)Rafal Wrote: PCA2 captures how robust or centralized the graph is.

Hi Rafal, a centralized (fragile-graph) text would be: "The man and the woman went to the city. The child and the dog walked in the city. The workers and the farmers live in the city." Why fragile?
The words "the", "and" and "city" appear in every sentence, so they act as hubs in the graph. If we remove them, all the other words stop being connected to one another. The graph is centralized and not robust (PCA2 low). Languages like English, French, German and Spanish are like this.

A robust (uniform-graph) text would be: "Mountains surround the valley. Rivers carve paths through hills. Forests shelter animals and rivers. Hills border mountains that form valleys." There is no single connector that appears everywhere. Many words link to multiple clusters. Removing words doesn't collapse the graph, as most words remain connected. This is what happens in Mandarin (logical, as a Chinese character can stand for multiple words) and, to a lesser extent, Hebrew.

(19-11-2025, 01:31 PM)Rafal Wrote: And going back to the Voynich Manuscript... How would you say, what is the VM's position on PCA1 and PCA2?

Well, graphs depend on text length, theme, style, language, etc. I'd rather not put the Voynich into the Pentateuch language comparison. But you can see how the MS behaves compared with similar corpora (in length) and different languages in this [link].

RE: Inside the Voynich Network: graph analysis - Rafal - 19-11-2025

Quote: But you can see how the MS behaves compared with similar corpora (in length) and different languages in this [link].

Do I think correctly that the dimensions and their interpretation in Principal Component Analysis emerge from the data used, and so are different each time? So is the meaning of PCA1 and PCA2 the same on both graphs?

RE: Inside the Voynich Network: graph analysis - quimqu - 19-11-2025

(19-11-2025, 02:20 PM)Rafal Wrote:
Do I think correctly that the dimensions and their interpretation in Principal Component Analysis emerge from the data used, and so are different each time?

Yes. PCA dimensions always depend on the data you use. This means the patterns that define PCA1 and PCA2 change whenever the values or graph metrics change. So the interpretation of the components is not fixed: PCA1 and PCA2 in one analysis will not necessarily mean the same as PCA1 and PCA2 in another. It is a good way to summarize a whole bunch of results in an easy-to-read plot, but it is quite difficult to interpret the meaning of the PCA components. That's why I said I have many other plots to check the graph characteristics separately.

RE: Inside the Voynich Network: graph analysis - quimqu - 20-11-2025

Yesterday Rafal asked me to position the Voynich. I told him that I thought it was not a good idea to put the Voynich into the Pentateuch study, but after rethinking it... well, we don't know the text of the Voynich, so comparing it with the Pentateuch is no different from comparing it with other books... And so I put the Voynich data into the PCA plot:

We can see that the Voynich (as a whole) comes out close to Latin in the PCA plot, but Voynich A and Voynich B behave quite differently from each other. The thing is that PCA can be a bit tricky here, because what it shows is mainly the directions in which the data varies the most. It is a plot of variation, not a plot of real similarity. So in those high-variation dimensions, the Voynich ends up near Latin. But that doesn't mean they are actually close when you look at the underlying metrics. And when we focus on the KPIs where the Voynich has smaller real distances... the results are actually quite surprising: if we calculate the Euclidean distance over the KPIs, for the whole Voynich corpus (EVA, CUVA, A and B) the nearest language is Mandarin, followed by English.
But how near is it to Mandarin? In fact, not near at all. Graph metrics show that, although the Voynich may be closer to Mandarin than to other languages, it really doesn't resemble it: the differences are very large. The word occurrence patterns of the Voynich text do not resemble the word occurrence patterns of those translations of the Pentateuch.

Path length: Voynich graphs have longer paths, so less internal connectivity and more fragmentation. Natural languages, such as Mandarin, show more compact and well-connected networks. In natural languages, contexts tend to cluster into communities and repeated patterns, which is not the case in the Voynich.

Degree preservation (sigma_degpres): This indicator is much higher in the MS than in any natural language in the Pentateuch corpus. This means the distribution of bigrams and co-occurrences is much more rigid and less compatible with the variability of any real language. In the Voynich, nodes with similar patterns have a greater tendency to connect to each other, while in natural languages this tendency is very weak or non-existent.

Type-token ratio: The Voynich shows a much higher diversity of forms than the human languages, especially in the EVA A variant. This high TTR is incompatible with natural texts, which have much more marked repetitions and structures. The MS uses lots of different words but repeats them much less than any natural language.

Zipf distribution: Although it follows an approximately Zipfian behavior, the slope is somewhat steeper than in natural languages: a more regular, more controlled frequency distribution, which again does not fit the variability of any real language. This means that the most frequent words are much more frequent than in a natural language, and the less frequent words are much less frequent as well.
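The Zipf slope mentioned above can be estimated with a simple least-squares fit of log(frequency) against log(rank). A minimal sketch (an illustration, not the thread's actual code — real studies usually fit only part of the rank range):

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs log(rank).
    Natural-language texts typically give a slope near -1; a steeper
    (more negative) slope means a more rigid, controlled distribution."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# toy corpus whose rank-frequency profile is roughly f = 100 / rank
toy = [w for i, w in enumerate("abcdefghij") for _ in range(100 // (i + 1))]
print(zipf_slope(toy))
```

On this synthetic near-Zipfian corpus the fitted slope comes out close to -1, the benchmark against which a steeper Voynich slope would be judged.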
A final note: In this study we are comparing graph-based KPIs extracted from the Pentateuch in several languages with the same KPIs extracted from the Voynich text. Since the content of the Voynich manuscript is unknown, part of the differences observed in the graph metrics could be influenced by the underlying subject matter or internal structure of the text. In other words, some patterns may reflect the nature of the content itself rather than a linguistic property. For this reason, these results should be interpreted with caution. The comparison shows structural differences, but it does not imply that the Voynich behaves like or unlike any specific language in terms of meaning or topic.

RE: Inside the Voynich Network: graph analysis - Jorge_Stolfi - 20-11-2025

(20-11-2025, 09:39 AM)quimqu Wrote: the differences are very large and indicate a completely different behavior from any natural language.

Thanks for all this data, but let me emphasize again that those metrics are not properties of the language, but of the text. You observed it yourself in the previous post, where you explained what the fragility measure means. Moreover, as the Pentateuch test shows, they are not properties of the semantic contents of the text either, but only of the patterns in which the words occur. Even two translations of the same text into the same language (see the two Chinese versions) will give different metrics. So the correct statement would be "the word occurrence patterns of the Voynich text do not resemble the word occurrence patterns of those translations of the Pentateuch." Which is not unexpected, since the structure of the text of the typical Herbal is very different from that of a narrative like the Pentateuch.
The next interesting test would be to compare the Voynich Herbal texts with the texts of Marco's Alchemical Herbal, in Latin and English.

All the best, --stolfi

RE: Inside the Voynich Network: graph analysis - Philipp Harland - 20-11-2025

(18-11-2025, 12:48 AM)Philipp Harland Wrote: Seems like a very interesting method. I don't know if it's in the literature or not, but it's fascinating nonetheless. It does seem like a convenient framework for generating insights about a piece of text.

I'm just wondering if the mileage is worth the time and resources it takes, is all. The graph quimqu posted looks pseudo-parabolic, that's a start. But we'd need a lot more data to truly come to a meaningful conclusion.

RE: Inside the Voynich Network: graph analysis - nablator - 20-11-2025

(07-11-2025, 11:24 PM)quimqu Wrote: This first part of the analysis is based on directed word-to-word graphs, where each edge connects a token A → B if word B follows A in the text. This approach keeps the natural direction of information flow, unlike undirected co-occurrence graphs that only record proximity.

What are your exact criteria of co-occurrence? Is there a distance limit? With bifolia possibly reordered (wrong page order) and non-sequential components (paragraphs, circular texts, radial texts, labels, etc.), the VMS cannot be processed like other texts.
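As a point of reference for that question, the plain "word B follows word A" criterion quoted above can be sketched as follows, with a hypothetical `window` parameter as one way a distance limit could be added (the thread does not state the actual criteria used):

```python
from collections import defaultdict

def directed_graph(tokens, window=1):
    """Directed word graph: add a weighted edge A -> B whenever B occurs
    within `window` tokens after A. window=1 is the plain bigram
    criterion ("B follows A"); larger windows are one possible way to
    define looser co-occurrence with an explicit distance limit."""
    edges = defaultdict(int)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            edges[(a, b)] += 1
    return dict(edges)

tokens = "the man and the woman went to the city".split()
graph = directed_graph(tokens, window=1)
print(len(graph), graph[("the", "man")])
```

Note that any such sliding window silently assumes the tokens form one continuous sequence, which is exactly what nablator questions for reordered bifolia, labels, and circular or radial texts.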