The Voynich Ninja

Full Version: Experiments with language corpora
This is a simple experiment, partly inspired by the paper by Hauer and Kondrak [link] and by the [link] recently mentioned by Donald Fisk.

It is something I have put together quickly: the files have been processed without attempting to remove punctuation or to apply any other normalization (partly because not all samples are in the Latin alphabet). As always, I might have made errors, so double-checking would be welcome.

I have used "the dataset created by Emerson et al. (2014) from the text of the Universal Declaration of Human Rights (UDHR) in 380 languages", which Hauer & Kondrak used. It is available on You are not allowed to view links. Register or Login to view..

For each language, I have computed conditional entropy and a custom repetition measure. I count as a repetition any exact repetition of three or more consecutive characters, optionally separated by a single space.
These count as repetitions:
barbarian
magis magisque
This does not (because the repetition is separated by a letter):
pellmell
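
For reference, here is a rough Python sketch of how the two measures could be computed. This is only my own approximation: the exact repetition rule and the entropy details used here are not spelled out in the post, so the regex and the bigram-based conditional entropy (H2 - H1, as discussed later in this thread) are assumptions.

Code:
import math
import re
from collections import Counter

def count_repetitions(text):
    # Exact repetition of 3+ consecutive characters, optionally separated
    # by a single space: "barbarian" and "magis magisque" count,
    # "pellmell" does not (the repeat is split by a letter).
    return len(re.findall(r'(\S{3,}) ?\1', text))

def conditional_entropy(text):
    # H2 - H1 over characters (bits): the average information carried by
    # a character given the character before it.
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    n1, n2 = sum(uni.values()), sum(bi.values())
    h1 = -sum(c / n1 * math.log2(c / n1) for c in uni.values())
    h2 = -sum(c / n2 * math.log2(c / n2) for c in bi.values())
    return h2 - h1

sample = "magis magisque"
print(count_repetitions(sample), round(conditional_entropy(sample), 3))

In a real run one would of course apply this to the whole UDHR files rather than to a toy string.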

I compared with three Voynich datasets:
1. the complete Zandbergen-Landini EVA transcription
2. Currier A in Takahashi's transcription and modified a-la-Neal (treating benches, benched-gallows, ee, in, iin as single characters)
3. the same as above for Currier B

The Voynich samples are plotted in green.
The purple circle corresponds to Latin: entropy 2.98, 2 repetitions in about 10,000 characters. Actually, one of the two repetitions is coincidental (per personam).

[attachment=2112]

If one only considers high-repetition (>1 per 1,000 characters), low-entropy (<2.4) languages, only 10 are selected. All 10 texts are written in the Latin alphabet. Geographically, none of them seems like a plausible candidate, even if some might conceivably be possible:
Code:
rar    Rarotongan    Oceania    Cook Islands
qud    Quechua (Unified Quichua, old Hispanic orthography)    South-America    Peru
hms    Hmong, Southern Qiandong    Asia    China
cbs    Cashinahua    South-America    Peru
prq    Ashéninka Perené    South-America    Peru
mri    Maori    Oceania    New Zealand
qug    Quichua, Chimborazo Highland    South-America    Ecuador
fon    Fon    Africa    Niger
miq    Mískito    Central-America    Nicaragua
kmb    Mbundu    Africa    Angola

(the prq and cbs files are identical: this must be an error in the corpus)
[attachment=2111]

I think it could be interesting to perform more structured experiments along these lines, adding more quantitative indexes that could help measure distance between languages. Also, there are other corpora that one could try with this or similar approaches.
Neither of the two lowest-entropy languages (Vai and Korean) is written in the Latin alphabet; both scripts appear to be syllabic.
Very interesting, Marco!

Have you done any pre-processing on the text files to 'clean them up' (as far as that is needed)?
Thank you, Marco, excellent work again. There's not much VM research being shared at the moment so it's nice to read something like this.

Given examples like Korean, does this mean that low entropy can be caused - in part or entirely - by the writing system, independent of the language? 

Also, might the choice of text influence the results? The Declaration, a series of statements about what is or "shall be", excludes certain language forms from occurring. Past tenses, for example. 

I've been looking around a bit for a readily available corpus but haven't found anything useful yet.
On this site they have the Bible in hundreds of languages: [link]
Perhaps with some planning and forum collaboration we could compile a corpus from something like this?
(09-05-2018, 04:45 PM)ReneZ Wrote: Very interesting, Marco!

Have you done any pre-processing on the text files to 'clean them up' (as far as that is needed)?

Thank you, Rene!
No, I haven't done any pre-processing. I didn't convert upper-case / lower-case, nor remove punctuation. A difficulty is that several of the files are not written in the Latin alphabet, so a generic pre-processing system is not easy to define. I only removed maybe half a dozen files that seemed wrong (too short to be an encoding of the text).
There certainly is some noise, and the samples are short (about 10,000 characters). I checked these rough entropy values against those on your site, and they do look comparable, but of course they should be taken as purely indicative (values from your site on the left):

Code:
Mattioli Latin   3.234  
Pliny    Latin   3.266  | 2.983 lat
Dante    Italian 3.126  | 3.000 ita
Tristan  German  3.039  | 2.825 deu_1901
(09-05-2018, 05:17 PM)Koen Gh. Wrote: Thank you, Marco, excellent work again. There's not much VM research being shared at the moment so it's nice to read something like this.

Given examples like Korean, does this mean that low entropy can be caused - in part or entirely - by the writing system, independent of the language? 

Of course the writing system can have an immense impact on every statistical study. For instance, the very common medieval practice of using different symbols for a few letters depending on whether they are word-initial, word-final or word-internal affects as basic a quantity as the apparent size of the alphabet.

(09-05-2018, 05:17 PM)Koen Gh. Wrote: Also, might the choice of text influence the results? The Declaration, a series of statements about what is or "shall be", excludes certain language forms from occurring. Past tenses, for example.

These statistics are so "macroscopic" that they are not likely to be affected by the subject of the text. I am fairly sure this is true for entropy. In principle, repetition could be more common in some texts than in others, but when the phenomenon is as frequent as in the VMS (about one occurrence every 100 words) it is likely a basic feature of the language, in my opinion.

(09-05-2018, 05:17 PM)Koen Gh. Wrote: I've been looking around a bit for a readily available corpus but haven't found anything useful yet.
On this site they have the Bible in hundreds of languages: [link]
Perhaps with some planning and forum collaboration we could compile a corpus from something like this?

This seems feasible. I would be surprised if nobody had already done this.
As an example, I have added to the plot the complete King James Bible, with pre-processing (punctuation removed and converted to lower case).
These are the numbers (repetition measure first, then conditional entropy) compared with the "eng" file from the Human Rights corpus:
Code:
KING_JAMES 0.053 2.992
eng        0.094 2.981

The yellow square in the plot is the complete, pre-processed King James. The blue circle is the un-pre-processed Human Rights "eng" file.
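
For what it's worth, the pre-processing described above (lower-casing and stripping punctuation) could be done with something like the snippet below; the file name kjv.txt and the exact character classes are my own choices, not necessarily what was actually used here.

Code:
import re

def preprocess(raw):
    # Lower-case, replace anything that is not a letter or whitespace
    # with a space, then collapse whitespace runs to single spaces.
    text = raw.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

with open('kjv.txt', encoding='utf-8') as f:
    clean = preprocess(f.read())
# "clean" can then be fed to the repetition / entropy functions above.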
I wonder if the very small entropy value for Korean isn't a consequence of processing of the individual bytes of UTF-8 encoded text.
(09-05-2018, 06:45 PM)ReneZ Wrote: I wonder if the very small entropy value for Korean isn't a consequence of processing of the individual bytes of UTF-8 encoded text.


I am aware of the problem and it is certainly possible I got that wrong. I hope someone will double-check this :)
Marco P wrote >> I would be surprised if nobody had already done this.

I've done this for almost all language groups, using Genesis, or the Gospel of Mark where no Genesis translation from the period around 1500 exists.
Sharing that information (here) is another thing. :D
But in linguistic studies such things have been done before. 

Notes:
The resulting information, as well as the underlying corpora, is sold commercially to scholars and institutes.
If you have a text clean-up routine, it should be the same for all texts; otherwise the differences are badly polluted.
On closer inspection, my suspicion about the Korean text turned out to be wrong.

The problem is that the conditional entropy (H2 - H1) is only a small part of the picture.

The UDHR text in Korean is indeed expressed in syllables, using UTF-8 encoding. The characters of all Hangul syllables are in the range (hex) AC00 to D7A3, and use 3 bytes in UTF-8.

In the experiments I removed all digits, replaced punctuation by spaces and collapsed consecutive spaces to single spaces.

If this text is processed on a byte basis, the statistics (without counting spaces) are:
Nr of different characters (=bytes): 63
H1 = 5.03
H2 = 7.84
Difference: 2.81  (conditional entropy)

If it is processed on a character (syllable) basis, the statistics (again without spaces) are:
Total characters: 3344
Nr of different characters: 303
H1 = 7.09
H2 = 9.01
Difference: 1.92 (conditional entropy)

This last value is very similar to the one reported by Marco.
While this is low (and close to Voynichese), the absolute entropy values are extremely high.

What is happening here is that a very large number of syllables occur exactly once in this relatively short text.
That means that each of these is followed by exactly one other syllable, meaning that this second syllable (in this text) is completely predictable. The average information carried by these second characters is low, and this is precisely the meaning of the conditional entropy.

Effectively, character pairs are under-sampled, or in other words, the text is too short to give a good measure of the conditional entropy.
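
To illustrate the byte-level versus syllable-level difference, a sketch along these lines (my own reconstruction, not the code actually used; the file name kor.txt is a placeholder) computes H1, H2 and their difference on both the raw UTF-8 bytes and the decoded characters:

Code:
import math
from collections import Counter

def shannon(counts):
    # Shannon entropy (bits) of a frequency distribution.
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def profile(symbols):
    # Return (H1, H2, H2 - H1) for a sequence of symbols.
    h1 = shannon(Counter(symbols))
    h2 = shannon(Counter(zip(symbols, symbols[1:])))
    return h1, h2, h2 - h1

korean = open('kor.txt', encoding='utf-8').read().replace(' ', '')
print('bytes     :', profile(list(korean.encode('utf-8'))))
print('syllables :', profile(list(korean)))

On a text as short as the UDHR, the character-level bigrams are badly under-sampled, which is exactly the effect described above.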
Thank you, Rene!

I understand your conclusion about the sample being too short to properly evaluate Korean. It should be possible to find a longer text encoded in the same way and see what happens. But, after your discussion of the issue, I think this is interesting but not terribly relevant. Voynich-wise, your observation that H1 and H2 are extremely high in Korean suggests that looking at conditional entropy alone is not enough. One can say that a sample text has an entropy behaviour similar to another sample only if H1 in sample A is close to H1 in sample B, and the same holds for H2 (and conditional entropy would be close too, since it is defined as H2-H1). 

I guess that the H1 value for Korean is reasonably accurate, even if the text is short. Voynichese has an H1 value lower than 4, totally incompatible with the 7.09 of Korean.

In the next few days, I will try comparing the UDHR samples with the VMS, taking into account the different entropies.
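
One naive way of "taking the different entropies into account" would be to compare samples as points in the (H1, H2) plane, for example with a plain Euclidean distance. This is only a sketch of one possible approach, not something done in this thread:

Code:
import math

def entropy_distance(a, b):
    # Euclidean distance between two samples described by (H1, H2):
    # small only when both H1 and H2 are close.
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Korean UDHR sample, byte basis vs. character basis (values from above).
print(entropy_distance((5.03, 7.84), (7.09, 9.01)))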