29-12-2019, 11:04 AM
(28-12-2019, 03:13 AM)Stephen Carlson Wrote: On tools that distinguish between consonants and vowels, here is a recently published algorithm: [link]
I have read (most of) Mayer's paper and run a few experiments with his code.
The process he applies can be described in four steps:
1. Generate a vector for each symbol X. The vector is based on the frequencies of patterns:
abX aXb bXa
for each pair of symbols a, b drawn from the alphabet plus space. For an alphabet with N symbols, the number of dimensions is:
3*(N+1)*(N+1) [+1 is due to the addition of space]
For instance, an alphabet with 19 symbols will result in 3*20*20 = 1200-dimensional vectors.
2. The vectors are normalized using Positive Pointwise Mutual Information (PPMI). This "reflects how frequently a sound occurs in a context compared to what we would expect if sound and context were independent", which sounds very reasonable. But additionally "PPMI converts all negative values of the inner term to 0": if I understand correctly, this means that all combinations that are rarer than expected are "normalized" to a flat 0. The huge impact of normalization is illustrated on page 36 of the paper.
3. The normalized high-dimensional vectors are processed with Principal Component Analysis (PCA) in order to reduce the number of dimensions.
4. PC1 and PC2 are used to plot the symbols and cluster them using a variant of the K-means algorithm.
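For anyone who wants to play with the idea, here is a minimal sketch of the four steps in Python with NumPy. It is not Mayer's code, and all function names are my own. Two assumptions to be aware of: I read the three patterns as "two symbols before X", "one symbol on each side of X" and "two symbols after X", and I use a plain 2-means loop as a stand-in for the K-means variant used in the paper. Windows that contain a character outside the alphabet are simply skipped.

```python
import numpy as np

def context_counts(text, symbols):
    """Step 1: one count vector per symbol, 3*(N+1)*(N+1) dimensions."""
    ctx = list(symbols) + [" "]              # context symbols include space
    n = len(ctx)
    idx = {c: i for i, c in enumerate(ctx)}
    row = {s: i for i, s in enumerate(symbols)}
    counts = np.zeros((len(symbols), 3 * n * n))
    padded = "  " + " ".join(text.split()) + "  "
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        if any(c not in idx for c in tri):
            continue                         # skip unknown characters
        # Position of X in the window: 2 = before-context, 1 = both sides,
        # 0 = after-context (assumed reading of the third pattern).
        for block, pos in enumerate((2, 1, 0)):
            x = tri[pos]
            if x == " ":
                continue                     # space gets no vector of its own
            a, b = (c for k, c in enumerate(tri) if k != pos)
            counts[row[x], block * n * n + idx[a] * n + idx[b]] += 1
    return counts

def ppmi(counts):
    """Step 2: PPMI weighting; negative or undefined PMI becomes 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        p_xy = counts / counts.sum()
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        pmi = np.log2(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def pca_2d(matrix):
    """Step 3: project the rows onto the first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

def two_means(points, iters=50):
    """Step 4 (stand-in): plain 2-means, seeded with the PC1 extremes."""
    centers = points[[points[:, 0].argmin(), points[:, 0].argmax()]]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():          # keep old center if cluster is empty
                centers[k] = points[labels == k].mean(axis=0)
    return labels

if __name__ == "__main__":
    # Toy input only; a real run needs a long text.
    sample = "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
    symbols = sorted(set(sample) - {" "})
    labels = two_means(pca_2d(ppmi(context_counts(sample, symbols))))
    print(dict(zip(symbols, labels.tolist())))
```

With a one-line sample like the one above the clusters are not meaningful; the point is only to show the shape of the computation. The cluster labels 0 and 1 are arbitrary, so which one corresponds to "vowel" has to be decided by inspection.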
I executed the process on the first 38K words of Dante's Commedia and got very reasonable results: 'a', 'e', 'i', 'o', 'u' are classified as vowels. Note that 'x' only occurs in the Roman numbering of the Cantos, so it is nice that it ends up far from both vowels and consonants. One can also observe that the vowel/consonant separation is due to PC1, while PC2 mainly separates 'x' from the regular consonants.
[attachment=3809]
I then ran the process on Voynich data: EVA and CUVA, processing Currier A and B separately.
The results are all similar and all poor. For some reason, PC2 is mirrored between A and B: I don't think this is significant. The problem is that, in all cases, 'a' is the only symbol that is clearly different from all others. In EVA, 'o' is close enough to 'a' to also be classified as a vowel; in CUVA, 'a' is the only detected vowel.
Personally, I am not sure that including space in the creation of the vectors is a good idea with Voynichese: the fact that several characters almost exclusively appear at word-start or word-end makes Voynichese different from normal written languages. This could explain why some methods that work with normal written languages do not seem to work with Voynichese. Anyway, it seems that Mayer's method points to Voynichese being different from pronounceable languages.