Well, I finally got the results for the Pentateuch in multiple languages. The run was quite long: computing the KPIs took about an hour and a half per language, except for Hebrew, which took about 10 hours. I attach the results table first, then some plots, and then some analysis (summarized with the help of AI and reviewed by me afterwards). It would be good if a language expert (Jorge Stolfi, volunteers?) could go through the text to see whether it really makes sense; I have checked what I could, but I am not a linguist. I have other plots, but this post is already quite long, so if you are interested in any of them, just ask and I will post them.
Here are the main KPIs by language. Remember that it is the same text in every case, the first five books of the Bible, translated into multiple languages:
| corpus | C | L | modularity | assort_degree | reciprocity | TTR | gini_deg | zipf_slope | betweenness_avg | resilience_frac_to_half |
|---|---|---|---|---|---|---|---|---|---|---|
| English | 0.7666 | 2.0371 | 0.7124 | 0.0198 | 0.1373 | 0.0298 | 0.6996 | -1.1380 | 0.0002 | 0.2199 |
| French | 0.7739 | 2.1227 | 0.7270 | 0.0349 | 0.1119 | 0.0256 | 0.6927 | -1.1320 | 0.0001 | 0.1908 |
| Latin | 0.6570 | 2.2485 | 0.8165 | 0.0159 | 0.0801 | 0.0089 | 0.6299 | -1.2245 | 0.0001 | 0.2200 |
| Mandarin Other | 0.7207 | 1.9919 | 0.1300 | 0.4068 | 0.1650 | 0.0115 | 0.6303 | -1.0823 | 0.0004 | 0.3652 |
| Mandarin Union | 0.7135 | 1.9913 | 0.1380 | 0.4069 | 0.1757 | 0.0107 | 0.6189 | -1.0910 | 0.0004 | 0.3737 |
| Russian | 0.6752 | 2.2441 | 0.7944 | 0.0270 | 0.0835 | 0.0104 | 0.6744 | -1.2164 | 0.0001 | 0.2222 |
| Spanish | 0.7691 | 2.0930 | 0.7474 | 0.0216 | 0.1242 | 0.0119 | 0.6941 | -1.1710 | 0.0001 | 0.1753 |
| Vietnamese | 0.7649 | 2.1641 | 0.7484 | -0.3592 | 0.1712 | 0.0275 | 0.7068 | -1.1176 | 0.0001 | 0.2500 |
| German | 0.7590 | 2.0996 | 0.7915 | 0.0163 | 0.1262 | 0.0129 | 0.6992 | -1.1534 | 0.0001 | 0.1987 |
| Hebrew | 0.6145 | 2.8124 | 0.3923 | -0.0871 | 0.0195 | 0.0045 | 0.5409 | -1.3817 | 0.0001 | 0.2203 |
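For readers who want to see what two of these KPIs measure, here is a minimal sketch of TTR and reciprocity on a toy sentence. It rests on assumptions that are not confirmed by the post: the graph is taken to be directed, with one edge from each word to the word that follows it, and self-loops are dropped. The function names are hypothetical, and the real pipeline may use a different window or tokenization.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def directed_bigram_edges(tokens):
    """ASSUMED graph construction: one directed edge from each token to the next."""
    return {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}

def reciprocity(edges):
    """Fraction of directed edges whose reverse edge is also present."""
    if not edges:
        return 0.0
    return sum((b, a) in edges for a, b in edges) / len(edges)

tokens = "in the beginning god created the heaven and the earth".split()
edges = directed_bigram_edges(tokens)
print("TTR:", type_token_ratio(tokens))
print("reciprocity:", reciprocity(edges))
```

On a full Pentateuch the TTR drops to a few percent, as in the table, because the number of tokens grows much faster than the number of distinct forms.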
Next, I attach the PCA (dimensionality reduction) plot, which summarizes how each language is positioned:
[attachment=12490]
PCA1 captures how morphologically rich, structurally diverse, and modular a language’s co-occurrence graph is.
- High PCA1 means many distinct forms, many communities, and a dispersed graph.
- Low PCA1 means a compact lexicon, more predictable structure, and stronger reliance on a small set of repeated forms.
PCA2 captures how robust or centralized the graph is.
- High PCA2 means the graph does not depend on a few hubs, so it stays connected even if central nodes are removed.
- Low PCA2 means strong hub dependency and a more fragile structure.
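As a rough illustration of how such a two-component projection can be obtained from the KPI table, here is a minimal pure-Python sketch. It is not the code behind the attached plot: the feature set (all ten table columns), the z-score standardization, and the power-iteration solver are assumptions, so the coordinates and even the sign of each axis may differ from the plot.

```python
import math

# Rows copied from the KPI table; column order follows the table header.
KPI = {
    "English":        [0.7666, 2.0371, 0.7124, 0.0198, 0.1373, 0.0298, 0.6996, -1.1380, 0.0002, 0.2199],
    "French":         [0.7739, 2.1227, 0.7270, 0.0349, 0.1119, 0.0256, 0.6927, -1.1320, 0.0001, 0.1908],
    "Latin":          [0.6570, 2.2485, 0.8165, 0.0159, 0.0801, 0.0089, 0.6299, -1.2245, 0.0001, 0.2200],
    "Mandarin Other": [0.7207, 1.9919, 0.1300, 0.4068, 0.1650, 0.0115, 0.6303, -1.0823, 0.0004, 0.3652],
    "Mandarin Union": [0.7135, 1.9913, 0.1380, 0.4069, 0.1757, 0.0107, 0.6189, -1.0910, 0.0004, 0.3737],
    "Russian":        [0.6752, 2.2441, 0.7944, 0.0270, 0.0835, 0.0104, 0.6744, -1.2164, 0.0001, 0.2222],
    "Spanish":        [0.7691, 2.0930, 0.7474, 0.0216, 0.1242, 0.0119, 0.6941, -1.1710, 0.0001, 0.1753],
    "Vietnamese":     [0.7649, 2.1641, 0.7484, -0.3592, 0.1712, 0.0275, 0.7068, -1.1176, 0.0001, 0.2500],
    "German":         [0.7590, 2.0996, 0.7915, 0.0163, 0.1262, 0.0129, 0.6992, -1.1534, 0.0001, 0.1987],
    "Hebrew":         [0.6145, 2.8124, 0.3923, -0.0871, 0.0195, 0.0045, 0.5409, -1.3817, 0.0001, 0.2203],
}

def standardize(rows):
    """Z-score every column so that no KPI dominates because of its scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance(z):
    """Covariance matrix of already-centered data (rows are observations)."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]

def top_eigenpair(a, iters=2000):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    p = len(a)
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * a[i][j] * v[j] for i in range(p) for j in range(p))
    return lam, v

def pca_2d(rows):
    """Return per-row scores on the first two components, the components, and variance ratios."""
    z = standardize(rows)
    cov = covariance(z)
    total_var = sum(cov[i][i] for i in range(len(cov)))
    components, ratios = [], []
    for _ in range(2):
        lam, v = top_eigenpair(cov)
        components.append(v)
        ratios.append(lam / total_var)
        # Deflation: remove the component just found before searching for the next one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))] for i in range(len(v))]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in z]
    return scores, components, ratios

scores, components, ratios = pca_2d(list(KPI.values()))
for name, (pc1, pc2) in zip(KPI, scores):
    print(f"{name:15s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
print("explained variance ratios:", [round(r, 3) for r in ratios])
```

Inspecting the entries of `components` shows which KPIs load most heavily on each axis, which is one way to check the verbal reading of PCA1 and PCA2 given above.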
Here is a short summary of each language:
Hebrew: Very morphologically rich, with high structural diversity. Also quite robust. This is one of the most extreme languages in the dataset (the one that took 10 hours to analyse).
Latin: Very morphologically rich, with a large, diverse graph. Slightly centralized and less robust than Hebrew.
Russian: Very morphologically rich but with a more centralized structure than Latin or Hebrew.
Mandarin Other: Very low morphological diversity, but extremely high robustness. The graph is even and not dominated by hubs.
Mandarin Union: Same pattern as the other Mandarin version, and even more robust.
Vietnamese: Low morphological diversity. Slightly centralized but not as much as European languages.
English: Low morphological diversity and strongly centralized. The graph depends heavily on a few very frequent words.
German: More diverse than English but still centralized. The fact that German joins words to create compounds increases variety, but hub words still play a strong role.
French: Low morphological diversity and the most centralized structure in the dataset.
Spanish: A bit more diverse than French but similarly centralized and fragile.
Here is an in-depth analysis of each language (Warning! This is AI-generated; even though I have checked the output, I am not a linguist).
English
Profile:
A compact lexicon, strongly centralized network, extremely high degree inequality, and strong dependence on a small set of ultra frequent function words.
Graph characteristics:
- High clustering.
- Short path lengths.
- Very high maximum degree and high gini: a few words act as massive hubs.
- High eigenvector concentration and higher betweenness: central nodes control connectivity.
- Moderate small world index.
- Moderate resilience: removal of hubs causes significant degradation.
Interpretation:
English translations rely heavily on a closed class of very frequent function words that shape the graph. This produces a hub dominated network with fast navigation but low robustness.
Comparisons:
- Relative to French: structurally similar but English is slightly more centralized.
- Relative to Latin and Hebrew: much more hub dependent.
- Relative to Mandarin: much higher inequality and lower resilience.
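The "degree inequality" referred to in these profiles is presumably what the gini_deg column measures. A minimal sketch of a Gini coefficient over node degrees follows; the function name is hypothetical and the exact estimator used for the table is an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly even, values near 1 mean
    that a few items hold almost everything."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form based on the rank-weighted sum of the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# Toy degree sequences: a star (one hub, nine leaves) versus a ring (every node has degree 2).
star_degrees = [9] + [1] * 9
ring_degrees = [2] * 10
print("star:", gini(star_degrees))
print("ring:", gini(ring_degrees))
```

A graph dominated by a handful of function-word hubs behaves like the star here, while an even graph behaves like the ring.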
French
Profile:
Similar to English but slightly more function word dependent and even less resilient.
Graph characteristics:
- High clustering.
- Very high hub dominance, higher than English for some metrics.
- Short path lengths due to connectors like “de”, “et”, “la”.
- High eigenvector concentration.
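- Lowest resilience among the languages profiled in this section (in the full table only Spanish has a lower resilience_frac_to_half, 0.1753 against 0.1908 for French).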
Interpretation:
French forms a tightly controlled network dominated by a few grammatical words. Removing them contracts the graph substantially.
Comparisons:
- Relative to English: even more skewed toward top connectors.
- Relative to Hebrew and Latin: far less distributed and much more fragile.
- Relative to Mandarin: less even, with far more extreme hubs.
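To make the resilience comparisons concrete, here is a minimal sketch of one plausible reading of resilience_frac_to_half: the fraction of nodes that must be removed, highest degree first, before the largest connected component falls below half of its original size. This definition, the fixed removal order based on initial degree, and the undirected adjacency-dict representation are all assumptions, not the confirmed method behind the table.

```python
from collections import deque

def largest_component_size(adj, removed):
    """Size of the largest connected component, ignoring the nodes in `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

def resilience_frac_to_half(adj):
    """ASSUMED definition: fraction of nodes removed (hubs first) until the giant
    component is smaller than half of its original size."""
    n = len(adj)
    baseline = largest_component_size(adj, set())
    order = sorted(adj, key=lambda node: len(adj[node]), reverse=True)
    removed = set()
    for count, node in enumerate(order, start=1):
        removed.add(node)
        if largest_component_size(adj, removed) < baseline / 2:
            return count / n
    return 1.0

# Toy graphs: a star collapses when its single hub goes, a ring degrades slowly.
leaves = [f"leaf{i}" for i in range(9)]
star = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
print("star:", resilience_frac_to_half(star))
print("ring:", resilience_frac_to_half(ring))
```

Under this reading, a higher value means more nodes have to be removed before the graph falls apart, which matches the Mandarin rows having the highest values in the table.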
Latin
Profile:
The largest lexicon, very long tail, and highly modular structure. Strong small world behavior combined with low reliance on hubs.
Graph characteristics:
- Lower clustering than English/French but still very high.
- Longest average path length among the major languages.
- Very large number of nodes because of inflectional expansion.
- Steep Zipf slope (-1.2245 in the table above), second only to Hebrew (-1.3817).
- Low eigenvector concentration and low betweenness centralization: many routes exist.
- High small world index.
- Resilience similar to English but for different reasons.
Interpretation:
Latin forms a sprawling network where many inflected forms increase surface variety. Local clusters gather around semantic or syntactic roots, but global navigation does not rely on a tiny connector set.
Comparisons:
- Relative to English/French: less centralized, less fragile, and more morphologically diverse.
- Relative to Hebrew: similar in distributed structure but more extreme in node proliferation.
- Relative to Mandarin: far more modular and with stronger clustering relative to random.
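The Zipf slope used in these profiles (zipf_slope in the table) can be estimated in several ways. A minimal sketch follows, assuming an ordinary least-squares fit of log frequency against log rank over all word types; the fitting method actually used for the table is not stated, and the function name is hypothetical.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) versus log(rank).
    Needs at least two distinct word types."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic corpus in which the word of rank r occurs exactly 120 / r times.
tokens = [f"w{r}" for r in range(1, 6) for _ in range(120 // r)]
print("slope:", zipf_slope(tokens))
```

A more negative slope means frequencies fall off faster with rank, so a long tail of rare inflected forms pushes the value further below -1.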
Hebrew
Profile:
A distributed network shaped by root morphology. High clustering, moderate path lengths, and less dependence on hubs than Indo-European languages.
Graph characteristics:
- High clustering, near English/French but with different internal structure.
- Larger lexicon than English and French, though smaller than Latin’s.
- Moderate degree inequality: hubs exist but not overwhelmingly.
- Lower eigenvector concentration and betweenness than English/French.
- Resilience better than English/French, worse than Mandarin.
- Zipf slope steeper than both Latin and English (-1.3817 in the table above, against -1.2245 for Latin and -1.1380 for English).
Interpretation:
Hebrew combines strong local structure (root based cohesion) with relatively balanced connectivity. It avoids the extreme hub dependence of English or French while also avoiding the extreme expansion seen in Latin.
Comparisons:
- Relative to English/French: less centralized and more resilient.
- Relative to Latin: less extreme in lexicon size and modularity.
- Relative to Mandarin: less evenly distributed, with stronger clustering.
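The clustering comparisons in these profiles presumably rest on the average clustering coefficient (most likely the C column of the table, though that is an assumption). A minimal sketch for an undirected graph stored as an adjacency dict, with a hypothetical function name:

```python
from itertools import combinations

def average_clustering(adj):
    """Mean over nodes of: links among a node's neighbours / possible links among them.
    Nodes with fewer than two neighbours contribute 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

# Toy graphs: a closed triangle, and the same triangle with one pendant node attached.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
triangle_plus_tail = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print("triangle:", average_clustering(triangle))
print("triangle + tail:", average_clustering(triangle_plus_tail))
```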
Mandarin
Profile:
A compact lexicon with highly even degree distribution. High resilience and lower small world behavior because clustering is less extreme relative to random graphs.
Graph characteristics:
- Smallest node set due to segmentation style.
- Very low inequality: degree distribution is smooth.
- Short path lengths.
- Low small world index: clustering does not exceed random by a large factor.
- Highest resilience by far: removal of hubs barely affects the network.
- Zipf slope shallower than all other languages.
Interpretation:
Mandarin appears as the most uniformly connected graph. Each node contributes modestly to connectivity, and no single node dominates. The network is robust and flatter than Indo-European languages.
Comparisons:
- Relative to English/French: much more even, far more resilient, lower centralization.
- Relative to Hebrew: flatter distribution and lower clustering.
- Relative to Latin: dramatically smaller and less modular.
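The "small world index" in these profiles is not defined in the post. One common choice is sigma = (C / C_rand) / (L / L_rand), where C_rand and L_rand come from a random graph with the same number of nodes and mean degree. The sketch below assumes that definition and the usual random-graph approximations C_rand ≈ k / n and L_rand ≈ ln(n) / ln(k); the node count and mean degree in the example are made-up placeholders, not values from the run.

```python
import math

def small_world_sigma(c, l, n, mean_degree):
    """ASSUMED index: clustering relative to random, divided by path length relative to random."""
    c_rand = mean_degree / n                       # expected clustering of a random graph
    l_rand = math.log(n) / math.log(mean_degree)   # expected average path length
    return (c / c_rand) / (l / l_rand)

# C and L taken from the English row of the table; n and mean_degree are hypothetical.
sigma = small_world_sigma(c=0.7666, l=2.0371, n=10_000, mean_degree=20)
print("sigma:", round(sigma, 1))
```

Values well above 1 indicate small-world structure. With this definition a graph can have a low index either because its clustering is close to random or because its paths are much longer than random.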
German
Profile:
A moderately large lexicon, strong compounding, and relatively balanced function word usage. German sits structurally between English/French and Latin/Hebrew, combining moderate hub centrality with richer lexical variety.
Graph characteristics:
- Clustering moderately high, but usually below English and French.
- Average path length slightly longer than English/French due to broader vocabulary and compounds.
- Degree inequality lower than English/French but higher than Hebrew and Mandarin.
- Central nodes exist (“und”, “der”, “die”, “zu”), but they do not dominate the graph as strongly as “the” or “de” do in English/French.
- Eigenvector concentration moderate: influence is distributed but not flat.
- Zipf slope steeper than English/French (reflecting many low frequency compounds), but not as steep as Latin.
- Resilience higher than English/French, lower than Mandarin, close to Hebrew.
Interpretation:
German forms a network with more lexical spread than other Western European languages. Compounding inflates the tail of the frequency distribution, increasing the number of low frequency nodes. This reduces extreme hub dominance and distributes connectivity across more words. The structure remains efficient but less tightly centralized.
Comparisons:
- Relative to English and French: less hub dominated, lower inequality, longer paths, greater resilience.
- Relative to Latin: less morphologically explosive, fewer nodes, weaker modularity, and lower small world index.
- Relative to Hebrew: similar in reduced hub dominance, but German has more extreme tail heaviness due to compounding.
- Relative to Mandarin: more centralized, less resilient, and far higher clustering relative to random.