After a while, here is an update of my experiments with the UDHR corpus.
I have added a fourth property: the percentage of words starting with the most frequent symbol. E.g. in the string:
aa aba caa addd
the value is 75% (3 of 4 words starting with the most frequent symbol "a")
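A minimal sketch of how this percentage could be computed, not necessarily the script used for the plots. It assumes words are separated by whitespace and that the most frequent symbol is counted over all non-space characters. The function name and the tie-breaking behaviour (first symbol encountered wins) are illustrative choices.

```python
from collections import Counter

def pct_initial_most_frequent(text: str) -> float:
    """Percentage of words that start with the most frequent symbol."""
    words = text.split()
    if not words:
        return 0.0
    # Count every symbol inside the words (whitespace is not counted).
    counts = Counter(ch for word in words for ch in word)
    # Assumption: on a tie, the symbol seen first is taken.
    most_frequent = counts.most_common(1)[0][0]
    starting = sum(1 for word in words if word.startswith(most_frequent))
    return 100.0 * starting / len(words)

print(pct_initial_most_frequent("aa aba caa addd"))  # 75.0
```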
I have also added another Voynich dataset: the first 13,000 characters of the Currier-D'Imperio transcription (I extracted the data from one of Rene's IVTT files); the length of the sample is comparable with the average length of UDHR files.
The Voynich data-sets I now consider are:
_C-D_13K the Currier-D'Imperio file described above
_EVA_ZL_ALL Zandbergen-Landini whole transcription of the manuscript
_NEAL_A Currier A subset of Takahashi's transcription, modified as suggested by Philip Neal
_NEAL_B same as above for Currier B
These correspond to the blue diamonds in the plots.
I have matched each UDHR file against four measures:
- ENT1: 1st order entropy
- COND: conditional entropy
- REP1000: number of word repetitions per 1000 words. I have updated this count to include repetitions in which the two occurrences are separated by a dash (in addition to repetitions separated by a space). Reduplications with no separation (like "purpur") are also included.
- %_INIT_MOST_FREQ: percentage of words starting with the most frequent character
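For reference, the two entropy measures can be sketched as follows. This assumes both are character-level entropies in bits, with the conditional entropy taken as the entropy of character pairs minus the entropy of the first character of each pair. Whether spaces count as characters is a choice the list above does not settle, so the sketch simply uses whatever string it is given. The function names are illustrative.

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def ent1(text: str) -> float:
    """First-order (single character) entropy."""
    return entropy(Counter(text))

def cond_entropy(text: str) -> float:
    """H(next char | previous char) = H(pairs) - H(previous char)."""
    pairs = Counter(zip(text, text[1:]))
    previous = Counter(text[:-1])
    return entropy(pairs) - entropy(previous)

print(ent1("abab"))  # 1.0
```

A text in which each character is fully determined by the preceding one (e.g. "abababab") gives a conditional entropy of zero under this definition.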
Entropy graphs (the second one "zooms" into the VMS area). As is well known, most languages have higher entropy values than Voynichese.
REP1000 / %_INIT_MOST_FREQ graphs. Most languages cluster near the origin, with zero repetition and less than 10% of words starting with the most frequent character.
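The REP1000 count described in the list of measures could be approximated like this. It is only a sketch under several assumptions: text is lower-cased, words are delimited by whitespace or dashes, punctuation is not stripped, and a reduplication is a word made of two identical halves of at least `min_half` characters (a hypothetical parameter, set to 2 so that "aa" is not counted).

```python
import re

def rep1000(text: str, min_half: int = 2) -> float:
    """Word repetitions per 1000 words (illustrative definition)."""
    # Splitting on spaces and dashes treats "bye-bye" like "bye bye".
    words = [w for w in re.split(r"[\s\-]+", text.lower()) if w]
    if not words:
        return 0.0
    # Adjacent identical words, separated by a space or a dash.
    reps = sum(1 for a, b in zip(words, words[1:]) if a == b)
    # Reduplications with no separator, e.g. "purpur" = "pur" + "pur".
    for w in words:
        half, odd = divmod(len(w), 2)
        if not odd and half >= min_half and w[:half] == w[half:]:
            reps += 1
    return 1000.0 * reps / len(words)

print(rep1000("the cat cat sat"))  # 250.0
```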
The five best matches are (by increasing distance from the VMS samples):
1.16 hms Hmong, Southern Qiandong (China)
1.53 auc Waorani (Ecuador)
1.66 pam Pampangan (Philippines)
1.89 cot Caquinte (Peru)
1.99 rar Rarotongan (Polynesia)
These correspond to the green circles in the plots.
Both the numeric distance and the plots should make clear that hms (Hmong) is a better match than all the other UDHR samples, according to these specific measures.
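The post does not spell out how the numeric distance is defined. One plausible way to get such a ranking, shown purely as an assumption, is to standardise each of the four measures across the UDHR samples (z-scores) and take the Euclidean distance from each language to the centroid of the VMS samples. All names below are hypothetical.

```python
import math
from statistics import mean, pstdev

def rank_by_distance(samples: dict, targets: dict) -> list:
    """Rank `samples` by distance to the centroid of `targets`.

    Both arguments map a name to a tuple of measures, e.g.
    (ENT1, COND, REP1000, %_INIT_MOST_FREQ).
    """
    names = list(samples)
    dims = len(samples[names[0]])
    # Assumption: each measure is z-scored using the comparison corpus.
    mus = [mean(samples[n][i] for n in names) for i in range(dims)]
    sds = [pstdev([samples[n][i] for n in names]) or 1.0 for i in range(dims)]

    def z(values):
        return [(x - m) / s for x, m, s in zip(values, mus, sds)]

    # Centroid of the target (VMS) samples in standardised space.
    centroid = [mean(col) for col in zip(*(z(v) for v in targets.values()))]
    dist = {n: math.dist(z(samples[n]), centroid) for n in names}
    return sorted(dist.items(), key=lambda item: item[1])
```

With real input, `samples` would hold one entry per UDHR language code and `targets` the four Voynich data-sets; other choices (unscaled measures, nearest VMS sample instead of centroid) would give different rankings.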
These languages are from very different places. Geographically, none of these languages looks likely for the VMS, but I am considering looking into some of them, in order to understand the kind of phenomena that can result in Voynich-like texts.
I have manually selected these six languages as the closest ones that are also geographically European or close to Europe:
3.13 gla Gaelic, Scottish
3.48 gle Gaelic, Irish
3.52 hye Armenian
3.65 ydd Yiddish, Eastern
3.74 nld Dutch
3.77 eus Basque Euskara
These correspond to the orange circles in the plots.
The best one (Scottish Gaelic) only ranks 77th among the 378 languages I considered. These European candidates perform quite poorly in the entropy plot and also in the repetition measure. The Basque UDHR text has three instances of exactly repeated words: I am fairly sure that word repetition is an actual linguistic phenomenon in Basque, but the evidence suggests that the phenomenon is considerably more frequent in Voynichese.
On the other hand, the two variants of Gaelic are good matches in terms of words beginning with the most frequent symbol: both have several short function words starting with a-. Armenian and Eastern Yiddish also perform rather well according to this measure. But these four languages do not seem to make use of word repetition. The result is that, also in the second plot, all European languages appear to be rather distant from the VMS samples.
I also manually selected three languages (yellow circles) that are somewhat intermediate between the green and orange samples: they fit slightly better than the orange circles and they are slightly closer to Europe than the green circles.
2.30 plt Malagasy, Plateau (Madagascar)
2.40 flm Chin, Falam (Southeast Asia)
2.69 njo Naga, Ao (North-East India)
Even if the results are still negative, I believe this research is promising: whatever system was used to create the VMS, if it has a linguistic meaning, it is very likely to be expressed in a language that has a close relative in the UDHR corpus. Assuming that this is the case, why are the results negative? There are several explanations, not mutually exclusive, for instance:
- We have not found the right set of statistical properties yet.
- Our geographical assumptions are wrong (e.g. Voynichese is a close relative of Hmong).
- The encoding used in the manuscript is not directly comparable with the writing system used in the UDHR (e.g. the ms is written in a complex cipher that scrambles the statistics; or the ms is written alphabetically while the corresponding language in the UDHR is written syllabically).