Updated PCA analysis of various languages

RE: Updated PCA analysis of various languages

ReneZ > 01-06-2018, 01:58 PM

I had a look at the UDHR text in Abkhaz.
This is written in the Cyrillic script, with some additional characters. After converting to lower case and removing numbers and punctuation, I am left with 44 characters, one of which is the space.
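
The clean-up step described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function names `normalise` and `alphabet` are hypothetical, and it assumes that every non-letter (digits, punctuation, line breaks) is turned into a space and that runs of spaces are collapsed into one.

```python
import unicodedata

def normalise(text):
    """Lower-case the text and keep only letters and single spaces."""
    kept = []
    for ch in text.lower():
        # Unicode general categories starting with "L" are letters,
        # which includes the extended Cyrillic letters used for Abkhaz.
        if unicodedata.category(ch).startswith("L"):
            kept.append(ch)
        else:
            # Assumption: digits and punctuation act as word breaks.
            kept.append(" ")
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(kept).split())

def alphabet(text):
    """Sorted list of distinct characters left after normalisation."""
    return sorted(set(normalise(text)))
```

Calling `len(alphabet(raw_text))` on the raw UDHR file then gives the character count, space included.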

I recognise at least 5 vowels.
When I ran an HMM analysis on it, it found 16 vowels, some of which clearly are consonants. This is not a very good result.

The character entropy of this text is completely standard. I compare it with other UDHR versions written in Cyrillic:

Language      H1       H2 (cond)
---------------------------------
Abkhaz        4.263    3.110
Belarusian    4.531    3.083
Bulgarian     4.221    3.155
Mongolian     4.425    3.201
Macedonian    4.162    3.080
Russian       4.439    3.136
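
For anyone who wants to reproduce figures of this kind, here is a minimal sketch. It assumes that H1 is the single-character entropy in bits and that "H2 (cond)" is the entropy of a character given the one before it, computed as the pair entropy minus the entropy of the first character of each pair. The function names are hypothetical, and the exact definitions behind the table may differ in detail.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a Counter of observed frequencies."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def h1_h2(text):
    """Return (H1, H2) for a normalised text of at least two characters."""
    h1 = entropy(Counter(text))
    pairs = Counter(zip(text, text[1:]))   # adjacent character pairs
    firsts = Counter(text[:-1])            # first character of each pair
    # Conditional entropy H(next | current) = H(pair) - H(current).
    h2 = entropy(pairs) - entropy(firsts)
    return h1, h2
```

A strictly alternating text such as "abab" gives H1 = 1 bit and H2 = 0, since each character fully determines the next.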

The only unusual thing I see is that the frequency plot of the character pairs is rather asymmetric.
The unusual feature seen in the PCA plot does not show up in these statistics.
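
One way to put a number on the asymmetry of the pair frequencies is sketched here. It is one possible reading of "asymmetric", not necessarily what the plot shows: it assumes the question is how often the pair ab occurs compared with the reversed pair ba. The name `pair_asymmetry` is hypothetical.

```python
from collections import Counter

def pair_asymmetry(text):
    """Score in [0, 1]: 0 if every pair is as frequent as its reverse,
    1 if no pair ever occurs in both orders."""
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    # For each unordered pair {a, b}, only the more frequent direction
    # contributes, so the sum equals |n(ab) - n(ba)| per unordered pair.
    excess = sum(max(n - pairs.get((b, a), 0), 0)
                 for (a, b), n in pairs.items())
    return excess / total
```

Comparing this score across the six Cyrillic texts would show whether Abkhaz really stands out.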
RE: Updated PCA analysis of various languages

DonaldFisk > 01-06-2018, 06:02 PM

Abkhaz text appears to be much more sensitive to file size than text in other languages (probably because of the size of its phoneme inventory), and the UDHR is significantly smaller than the files I've been using. When I plot it, it does look quite different from Figures 1 and 2, and consequently from the VMS plot (Figure 3), but it also has a truncated vowel branch. I've been treating digraphs (e.g. кь and кә) as single glyphs. When I don't, the plot looks more like the VMS plot.
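
Treating digraphs as single glyphs can be sketched as a greedy longest-match split. This is an illustration only, not the code behind the plots: `tokenise` is a hypothetical name, and the digraph set in the usage note contains just the two examples mentioned above, not a full Abkhaz inventory.

```python
def tokenise(text, digraphs):
    """Split text into glyphs, keeping each listed digraph as one unit."""
    # Try longer units first so a longer unit wins over its own prefix.
    units = sorted((u for u in digraphs if u), key=len, reverse=True)
    glyphs = []
    i = 0
    while i < len(text):
        for u in units:
            if text.startswith(u, i):
                glyphs.append(u)
                i += len(u)
                break
        else:
            # No multi-character unit starts here: emit one character.
            glyphs.append(text[i])
            i += 1
    return glyphs
```

For example, `tokenise("акьа", {"кь", "кә"})` returns `["а", "кь", "а"]`, while passing an empty set gives the plain character-by-character split.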

PCA places phonetically interchangeable (and often similar) glyphs close together, e.g. different vowels, or the same vowel with different tones. What I'd also like, and might work on next, is a way of visualizing what follows what.
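
As a possible starting point for "what follows what", here is a sketch that tabulates, for each glyph, the relative frequency of the glyphs that come directly after it. This is just one simple option under my own assumptions, not a method anyone in this thread has settled on, and `follow_table` is a hypothetical name.

```python
from collections import Counter, defaultdict

def follow_table(glyphs):
    """Map each glyph to {next glyph: probability of following it}."""
    counts = defaultdict(Counter)
    for current, following in zip(glyphs, glyphs[1:]):
        counts[current][following] += 1
    table = {}
    for current, nexts in counts.items():
        total = sum(nexts.values())
        table[current] = {g: n / total for g, n in nexts.items()}
    return table
```

The input can be a plain string or a list of glyphs in which digraphs have already been merged. Each row of the table could then be drawn as a bar chart or as weighted arrows between glyphs.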

According to the Wikipedia pages on Abkhaz language, alphabet, and phonology, there are only two phonemic vowels (а and ы) which can sound different when flanked by different consonants. Cyrillic е, и, о, and у are allophones of а and ы.
RE: Updated PCA analysis of various languages

ReneZ > 01-06-2018, 09:39 PM

Using the various digraphs as single units is certainly interesting and more 'correct'. Of course for the Voynich MS we don't know how to do this, although it is a reasonable guess that combinations like cTh are candidates.

In the HMM of the Abkhaz UDHR (which should separate vowels from consonants), the two groups largely coincide with two different halves of the PCA plot.

While the UDHR text is short, it is not *too* short, but it does suffer from what I'd call unusual vocabulary.
I have compared Latin texts of Pliny and Mattioli (16 centuries apart) and their properties almost coincide.
The Latin version of the UDHR is quite different.