RE: Bigram = phoneme theory (language agnostic)
geoffreycaveney > 10-03-2019, 02:45 PM
The statistical analysis of entropy, conditional character entropy, and bigram distribution (or character pair frequency distribution) "heat maps" is very interesting.
Perhaps relevant to this is the detailed global pan-linguistic study of consonant frequency in 50 languages from widely varying families and areas, which was performed and reported in the paper "On Consonant Frequency in Egyptian and Other Languages" by Carsten Peust in Lingua Aegyptia 16 (2008), pp. 105-134:
https://archiv.ub.uni-heidelberg.de/propylaeumdok/2676/1/Peust_On_consonant_frequency_in_Egyptian_2008.pdf
I was particularly struck by the statistics for Modern Greek (15th c. late medieval Byzantine Greek was much closer to Modern Greek than to Ancient Greek). It uses a small number of consonants very frequently, much more so than other European languages:
/s/ 18.2%
This Modern Greek /s/ is the highest single-consonant frequency of any European language in this study. Globally, it is exceeded only by Maori /t/ (26.3%), Bambara /n/ (20.2%), Tagalog /n/ (19.9%), Japanese /n/ (18.8%), and Maori /k/ (18.5%). Maori is likely to have entropy and distribution stats very similar to those of Hawaiian, which is among the languages most similar to Voynichese, and Tagalog also comes up among the most similar entropies and distributions to Voynichese in Rene's studies, as I understand them.
By contrast Latin's most frequent consonant /t/ is 15.7%, English /t/ is only 13.2%, and French /r/ is 13.7%.
Modern Greek is also the only language in this study with /s/ as its most frequent consonant, apart from Ancient Georgian (where, however, /s/ reaches only 12.5%).
Greek /s/ is most frequent non-initially, 19.8%, but also occurs with 13.4% frequency in initial position. Large numbers of Greek nouns and adjectives have final /s/.
/t/ 15.3%
Modern Greek /t/ occurs with a striking 24.5% frequency in word-initial position, exceeded only by Maori /t/ (30.7%) and, slightly, by Modern Hebrew /h/ (24.9%). In all three cases, this consonant begins the definite article, although the Modern Greek definite article also has the forms /o/ and /i/.
Note that Modern Greek has two consonants with higher frequency than any consonant in English or French.
In fact, I found it quite revealing to examine all 50 languages in this global study, ranking them by the combined frequency percentage of their two most frequent consonants. The median combined frequency of the two most frequent consonants is 27.6%, the mean is 27.5%, and the sample standard deviation is 4.86% (using Bessel's correction, since the 50-language study is a sample). 40 of the 50 languages in the study fall within one standard deviation of the mean; only the top 5 and the bottom 5 do not. Modern Greek is the 4th of the top 5, along with Maori, Japanese, Bambara, and Tagalog. The bottom 5 are Ingush, Czech, Ossetic, Manchu, and Yoruba. (Maori at 44.8%, and Ingush and Czech at 15.9% and 16.9% respectively, are the extreme outliers in this study sample.) Thus Modern Greek in this respect patterns with Asian and particularly Austronesian languages rather than with other European and Indo-European languages, almost all of which have a more typical top-two consonant frequency.
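For anyone who wants to check the arithmetic, here is a minimal Python sketch. It uses only figures quoted in this post (the mean of 27.5, the standard deviation of 4.86, and the top-two totals for four languages); the full 50-language table is not reproduced here. The helper names `sample_sd` and `z_score` are my own, and the `toy` list is placeholder data, not values from the study.

```python
import math
import statistics

# Summary figures quoted in the post, in percentage points.
MEAN_TOP_TWO = 27.5
SD_TOP_TWO = 4.86

# Top-two consonant totals stated in the post.
top_two = {
    "Maori": 26.3 + 18.5,         # /t/ + /k/ = 44.8
    "Modern Greek": 18.2 + 15.3,  # /s/ + /t/ = 33.5
    "Czech": 16.9,
    "Ingush": 15.9,
}

def sample_sd(values):
    """Standard deviation with Bessel's correction (divide by n - 1)."""
    mean = sum(values) / len(values)
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (len(values) - 1))

def z_score(value, mean=MEAN_TOP_TWO, sd=SD_TOP_TWO):
    """Distance from the mean, measured in standard deviations."""
    return (value - mean) / sd

z_scores = {lang: round(z_score(v), 2) for lang, v in top_two.items()}
for lang, z in z_scores.items():
    print(f"{lang:13s} top two = {top_two[lang]:5.1f}%  z = {z:+.2f}")

# Placeholder numbers (NOT from the study), only to show that the
# n - 1 version matches statistics.stdev and exceeds statistics.pstdev.
toy = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_sd(toy), statistics.stdev(toy), statistics.pstdev(toy))
```

On these figures Modern Greek sits a little over one standard deviation above the mean, while Maori is more than three above it and Ingush and Czech are more than two below it.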
/n/ 11.9%
Not frequent in initial position, very frequent in non-initial position. /s/ and /n/ are the only consonants that can occur word-finally in Modern Greek. (Word-final /r/ also occurred in Ancient Greek.)
/r/ 9.2%
Almost non-existent in initial position (1.0%), very frequent elsewhere.
/k/ 8.7%
/p/ 8.2%
/m/ 6.2%
/l/ 5.1%
/dh/ 3.6% (This is the Modern Greek d, delta, pronounced as a voiced /th/, as in English "the".)
/kh/ 2.6%
/ph/ 2.3%
/th/ 2.2%
I write the fricatives /x/ and /f/ as /kh/ and /ph/ to emphasize their relationship to the Greek stop series /p/, /t/, /k/.
/v/ 2.0%
/gh/ 1.6%
/d/ 1.3% (Written as "nt" in Modern Greek; voiced stops are rare and occur mainly in borrowings.)
/z/ 0.9%
/b/ 0.4% (Written as "mp")
/g/ 0.4% (Written as "gk")
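Initial versus non-initial percentages like the ones quoted above could be tallied from a word list along the following lines. This is only a rough Python sketch under my own assumptions: each consonant is a single character in the transliteration, and each percentage is taken relative to all consonants in the same position. I have not checked that this matches the exact counting procedure in Peust's paper, and the function name `positional_freq` and the sample words are purely illustrative.

```python
from collections import Counter

def positional_freq(words, consonants):
    """Return (initial, non_initial) dicts of consonant percentages.

    Assumption: one character = one consonant phoneme, and each
    percentage is relative to all consonants found in that position.
    """
    initial, non_initial = Counter(), Counter()
    for word in words:
        for i, ch in enumerate(word):
            if ch in consonants:
                # Word-initial consonants are tallied separately.
                (initial if i == 0 else non_initial)[ch] += 1

    def as_percent(counter):
        total = sum(counter.values())
        if total == 0:
            return {}
        return {c: 100 * n / total for c, n in counter.items()}

    return as_percent(initial), as_percent(non_initial)

# Made-up toy word list, not real corpus data.
init_pct, non_init_pct = positional_freq(["tas", "ston", "sos"], set("tsn"))
print(init_pct)
print(non_init_pct)
```

Digraph phonemes such as /dh/ or /kh/ would first need to be mapped to single symbols for this to work.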
Now as is well known, Greek has always been a language that is relatively heavy on vowels and light on consonants. The same is true of Polynesian languages such as Hawaiian and Maori, and to a lesser extent other Austronesian languages such as Tagalog. In terms of entropy and bigram distribution studies, etc., it might be of interest to also examine the old Baybayin abugida that was used to write Tagalog prior to the 16th century, which contained 3 vowels and 14 consonants.
Finally, one form of medieval Greek was actually written in an almost vowelless abjad: the Judaeo-Greek (also called Yevanic) language spoken and written by the Romaniote Greek Jewish and Constantinopolitan Karaite Greek Jewish communities, which was written in the Hebrew script, and marked most vowels with the standard Hebrew vowel diacritic dots rather than with letters. These communities had slowly declined in the modern era, and then the Nazi occupation of Greece in World War II virtually wiped them out, but in medieval times the Greek Jewish community thrived in many areas of the Mediterranean. There was even a substantial Judaeo-Greek speaking community in southern Italy, some of whom migrated to northern Italy and other areas in the medieval period. (I also wish to thank D.N. O'Donovan for bringing to my attention a late 15th century reference to a "Karaite script" of Hebrew claimed to be written without the use of the letters aleph, ayin, he, chet, bet, and tsadi. I guess this may also have referred to a form of Greek, since it is one of the few languages which might reasonably be written without these letters, although standard Judaeo-Greek does use them.)
I think it would also be quite interesting to do entropy and bigram distribution studies of this medieval Judaeo-Greek language written in the Hebrew script. In addition to the lack of most vowels in the script, it was also notable for being based on the colloquial vernacular of Byzantine Greek, without the "Atticisms" that Byzantine and Modern Greek have often employed in writing to make the language look more like Ancient Greek than it actually is. For all of these reasons, I would very much like to see the entropy and bigram distribution statistics of Judaeo-Greek in the Hebrew script, to see how they compare with those of Hawaiian, Tagalog, and Voynichese.
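As a starting point for such a comparison, the three statistics mentioned at the top of this post could be computed roughly as follows. This is a minimal Python sketch, not the method used in any of the existing Voynich studies: it estimates probabilities by simple counting, treats every character in the input string as a symbol (so the caller decides whether spaces are kept), and the function names are my own.

```python
import math
from collections import Counter

def char_entropy(text):
    """First-order character entropy in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def bigram_counts(text):
    """Counts of adjacent character pairs (the raw data for a heat map)."""
    return Counter(zip(text, text[1:]))

def conditional_entropy(text):
    """Entropy of a character given the preceding character, in bits."""
    pairs = bigram_counts(text)
    firsts = Counter(text[:-1])  # how often each character starts a pair
    total = sum(pairs.values())
    h = 0.0
    for (a, _b), n in pairs.items():
        p_pair = n / total         # estimate of P(a, b)
        p_cond = n / firsts[a]     # estimate of P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

sample = "aabb"  # toy string, only to show the calls
print(char_entropy(sample), conditional_entropy(sample))
print(bigram_counts(sample))
```

Running the same functions over a transliterated Judaeo-Greek text and over Hawaiian, Tagalog, and Voynichese transcriptions would give directly comparable numbers, provided the texts are of similar length, since small samples bias these estimates downward.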