09-05-2018, 03:15 PM
This is a simple experiment partly inspired by [link] by Hauer and Kondrak and by the [link] recently mentioned by Donald Fisk.
I put this together quickly: the files were processed without removing punctuation or applying any other normalization (partly because not all samples are in the Latin alphabet). As always, I might have made errors, so double-checking would be welcome.
I have used "the dataset created by Emerson et al. (2014) from the text of the Universal Declaration of Human Rights (UDHR) in 380 languages", which Hauer & Kondrak also used. It is available at [link].
For each language, I have computed the conditional entropy and a custom repetition measure. I count as a repetition any sequence of three or more consecutive characters that is immediately repeated exactly, with the two copies optionally separated by a space.
These count as repetitions:
barbarian
magis magisque
This does not (because the repetition is separated by a letter):
pellmell
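The rule above can be approximated with a regular expression. This is only a sketch, not necessarily the script used for the plots: the names `count_repetitions` and `repetitions_per_1000` are made up for illustration, and it assumes that the repeated unit contains no whitespace and that the separator is at most one space character.

```python
import re

# Assumption of this sketch: a "unit" is 3 or more non-whitespace characters,
# followed by an optional single space and then an exact copy of itself.
_REPEAT = re.compile(r"(\S{3,}?) ?\1")


def count_repetitions(text: str) -> int:
    """Count non-overlapping immediate repetitions found scanning left to right."""
    return sum(1 for _ in _REPEAT.finditer(text))


def repetitions_per_1000(text: str) -> float:
    """Repetitions normalised per 1000 characters of raw text."""
    if not text:
        return 0.0
    return 1000.0 * count_repetitions(text) / len(text)


if __name__ == "__main__":
    for sample in ("barbarian", "magis magisque", "pellmell", "per personam"):
        print(sample, count_repetitions(sample))
```

With this pattern, `barbarian`, `magis magisque` and `per personam` each count once, while `pellmell` does not, because the two copies of `ell` are separated by a letter.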
I compared with three Voynich datasets:
1. the complete Zandbergen-Landini EVA transcription
2. Currier A in Takahashi's transcription, modified à la Neal (treating benches, benched gallows, ee, in, iin as single characters)
3. the same as above for Currier B
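The merging step used for datasets 2 and 3 could be sketched as follows. The EVA spellings in `MULTI_GLYPHS` (`ch`, `sh` for benches; `cth`, `ckh`, `cph`, `cfh` for benched gallows) and the use of Private Use Area code points as stand-in single characters are assumptions of this sketch, as is the name `merge_glyphs`; the actual preprocessing may differ in detail.

```python
import re

# Assumed EVA sequences to be treated as one character each.
MULTI_GLYPHS = ["cth", "ckh", "cph", "cfh", "iin", "ch", "sh", "ee", "in"]

# Map every sequence to a unique single character outside the EVA alphabet
# (Unicode Private Use Area), so later per-character statistics see one symbol.
_SUBSTITUTE = {g: chr(0xE000 + i) for i, g in enumerate(MULTI_GLYPHS)}

# Longest alternatives first, so "iin" wins over "in" and "cth" over "ch".
_PATTERN = re.compile("|".join(sorted(MULTI_GLYPHS, key=len, reverse=True)))


def merge_glyphs(eva_text: str) -> str:
    """Replace each multi-letter EVA sequence with its single-character stand-in."""
    return _PATTERN.sub(lambda m: _SUBSTITUTE[m.group(0)], eva_text)


if __name__ == "__main__":
    for word in ("daiin", "chedy", "qokeedy"):
        print(word, len(word), "->", len(merge_glyphs(word)))
```

Entropy and repetition counts computed on the merged text then treat, for example, `daiin` as three symbols rather than five.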
The Voynich samples are plotted in green.
The purple circle corresponds to Latin: entropy 2.98, 2 repetitions in about 10,000 characters. Of those two repetitions, one is coincidental (per personam).
[attachment=2112]
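For anyone who wants to double-check values such as the 2.98 quoted for Latin, here is a minimal sketch of a conditional entropy calculation. It assumes the entropy of a character given the character immediately before it, measured in bits, over the raw text with spaces and punctuation kept; that choice of order is an assumption of this sketch, and `conditional_entropy` is an illustrative name.

```python
from collections import Counter
from math import log2


def conditional_entropy(text: str) -> float:
    """H(next char | previous char) in bits, estimated from bigram counts."""
    pairs = Counter(zip(text, text[1:]))   # counts of (previous, next)
    firsts = Counter(text[:-1])            # counts of previous characters
    total = sum(pairs.values())
    h = 0.0
    for (prev, _nxt), n in pairs.items():
        # joint probability p(prev, next) times log of p(next | prev)
        h -= (n / total) * log2(n / firsts[prev])
    return h


if __name__ == "__main__":
    print(conditional_entropy("abababab"))  # fully predictable sequence
    print(conditional_entropy("aabb"))
```

A perfectly alternating string gives 0, since each character is determined by the previous one; less predictable text gives higher values.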
Restricting the selection to languages with high repetition (more than 1 per 1,000 characters) and low entropy (below 2.4) leaves only 10. All 10 of these texts are written in the Latin alphabet. Geographically, none of them seems like a plausible candidate, even if some might conceivably be possible:
Code:
rar  Rarotongan                                           Oceania          Cook Islands
qud  Quechua (Unified Quichua, old Hispanic orthography)  South-America    Peru
hms  Hmong, Southern Qiandong                             Asia             China
cbs  Cashinahua                                           South-America    Peru
prq  Ashéninka Perené                                     South-America    Peru
mri  Maori                                                Oceania          New Zealand
qug  Quichua, Chimborazo Highland                         South-America    Ecuador
fon  Fon                                                  Africa           Niger
miq  Mískito                                              Central-America  Nicaragua
kmb  Mbundu                                               Africa           Angola
(the prq and cbs files are identical: this must be an error in the corpus)
[attachment=2111]
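The selection described above is a simple two-threshold filter. The sketch below shows one way to express it; the `Sample` record and `select_candidates` are hypothetical names, and of the two example rows only the Latin figures come from this post (the `xxx` row is an invented placeholder, not a real measurement).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    code: str          # language code of the sample
    entropy: float     # conditional entropy of the text
    repetitions: int   # number of repetitions found
    n_chars: int       # length of the text in characters


def select_candidates(samples: Iterable[Sample],
                      max_entropy: float = 2.4,
                      min_rate: float = 1.0) -> List[Sample]:
    """Keep samples with entropy below max_entropy and more than
    min_rate repetitions per 1000 characters."""
    return [s for s in samples
            if s.entropy < max_entropy
            and 1000.0 * s.repetitions / s.n_chars > min_rate]


if __name__ == "__main__":
    samples = [
        Sample("lat", 2.98, 2, 10_000),   # Latin, figures quoted in the post
        Sample("xxx", 2.10, 25, 10_000),  # invented placeholder values
    ]
    print([s.code for s in select_candidates(samples)])
```

Latin fails both thresholds (entropy 2.98, 0.2 repetitions per 1,000 characters), which is why it does not appear in the list.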
I think it could be interesting to perform more structured experiments along these lines, adding more quantitative indexes that could help measure distance between languages. Also, there are other corpora that one could try with this or similar approaches.
Neither of the two lowest-entropy languages (Vai and Korean) is written in the Latin alphabet. Both scripts appear to be syllabic.