The Voynich Ninja


(28-12-2019, 03:13 AM) Stephen Carlson Wrote: On tools that distinguish between consonants and vowels, here is a recently published algorithm: [link not available]

I have read (most of) Mayer's paper and ran a few experiments with his code.
The process he applies can be described in four steps:

1. Generate a vector for each symbol X. The vector is based on the frequencies of patterns:
abX aXb bXa
for each pair of symbols a, b drawn from the alphabet plus space. For an alphabet with N symbols, the number of dimensions is:
3*(N+1)*(N+1) [+1 is due to the addition of space]
For instance, an alphabet with 19 symbols will result in 1200-dimensional vectors.

2. The vectors are normalized using Positive Pointwise Mutual Information (PPMI). This "reflects how frequently a sound occurs in a context compared to what we would expect if sound and context were independent", which sounds very reasonable. But additionally "PPMI converts all negative values of the inner term to 0". If I understand correctly, this means that every symbol/context combination that is rarer than expected is "normalized" to a flat 0. The huge impact of normalization is illustrated on page 36 of the paper.

3. The normalized high-dimensional vectors are processed via Principal Component Analysis (PCA), in order to reduce the number of dimensions.

4. PC1 and PC2 are used to plot the symbols and cluster them using a variant of the K-means algorithm.
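
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Mayer's code: the function names are made up, the three patterns are read as "X preceded by two symbols", "X between two symbols" and "X followed by two symbols", space is used as a context symbol but gets no vector of its own, and the logarithm is taken in base 2.

```python
from collections import Counter
from math import log2

def context_counts(text):
    """Count (symbol, context) pairs over every three-symbol window."""
    # Assumption: words are whitespace-separated; space counts as a context
    # symbol but does not get a vector of its own.
    s = " " + " ".join(text.split()) + " "
    counts = Counter()
    for a, b, c in zip(s, s[1:], s[2:]):
        if c != " ":
            counts[(c, ("before", a, b))] += 1  # X preceded by a, b
        if b != " ":
            counts[(b, ("around", a, c))] += 1  # X between a and c
        if a != " ":
            counts[(a, ("after", b, c))] += 1   # X followed by b, c (assumed reading)
    return counts

def n_dimensions(n_symbols):
    """Vector length for an alphabet of n_symbols plus space."""
    return 3 * (n_symbols + 1) ** 2

def ppmi(counts):
    """Replace raw counts with PPMI weights; negative PMI becomes 0."""
    total = sum(counts.values())
    symbol_total, context_total = Counter(), Counter()
    for (x, ctx), n in counts.items():
        symbol_total[x] += n
        context_total[ctx] += n
    weights = {}
    for (x, ctx), n in counts.items():
        # PMI = log2( P(x, ctx) / (P(x) * P(ctx)) )
        pmi = log2(n * total / (symbol_total[x] * context_total[ctx]))
        weights[(x, ctx)] = max(0.0, pmi)
    return weights
```

Calling `ppmi(context_counts(text))` returns a sparse dictionary. Pairs that never occur are simply absent, which matches their PPMI value of 0.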

I ran the process on the first 38K words of Dante's Commedia and got very reasonable results: 'a', 'e', 'i', 'o', 'u' are classified as vowels. Note that 'x' only occurs in the Roman numerals that number the Cantos, so it is nice that it lands far from both vowels and consonants. One can also observe that the vowel/consonant separation comes from PC1, while PC2 separates 'x' from the regular consonants.

[attachment=3809]
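
For readers who want to reproduce a plot like this, here is a rough sketch of the projection and clustering steps in plain Python. It rests on several assumptions of mine: the principal components are found by power iteration with deflation (a library PCA would do the same job), the clustering is ordinary 2-means rather than the K-means variant used in the paper, and all function names are hypothetical.

```python
from math import sqrt

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def center(rows):
    """Subtract the column mean from every column."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, means)] for r in rows]

def top_component(rows, iters=200):
    """Leading principal direction of centred rows via power iteration."""
    dims = len(rows[0])
    v = [float(j + 1) for j in range(dims)]  # arbitrary non-constant start
    norm = sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dims)]
        norm = sqrt(dot(w, w))
        if norm == 0:
            break  # no variance left
        v = [x / norm for x in w]
    return v

def pca_scores(rows, k=2):
    """Project each row onto the first k principal components."""
    rows = center(rows)
    out = [[] for _ in rows]
    for _ in range(k):
        v = top_component(rows)
        scores = [dot(r, v) for r in rows]
        for o, s in zip(out, scores):
            o.append(s)
        # Deflate: remove the component just found before looking for the next.
        rows = [[x - s * vj for x, vj in zip(r, v)] for r, s in zip(rows, scores)]
    return out

def two_means(points, iters=50):
    """Plain 2-means, seeded with the points at the two extremes of PC1."""
    centroids = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            0 if sum((a - b) ** 2 for a, b in zip(p, centroids[0]))
            <= sum((a - b) ** 2 for a, b in zip(p, centroids[1])) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With one row of PPMI values per symbol, `two_means(pca_scores(rows))` gives a 0/1 label per symbol. Which label ends up meaning "vowel" is not fixed by the method, because the sign of a principal component is arbitrary.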

I then ran the process on Voynich data: EVA and CUVA, processing Currier A and B separately.
The results are all similar and all poor. For some reason, PC2 is mirrored between A and B; I don't think this is significant, since the sign of a principal component is arbitrary. The problem is that, in all cases, 'a' is the only symbol that is clearly different from all the others. In EVA, 'o' is close enough to 'a' to also be classified as a vowel; in CUVA, 'a' is the only detected vowel.

Personally, I am not sure that including space in the creation of the vectors is a good idea with Voynichese: the fact that several characters almost exclusively appear at word-start or word-end makes Voynichese different from normal written languages. This could explain why some methods that work with normal written languages do not seem to work with Voynichese. Anyway, it seems that Mayer's method points to Voynichese being different from pronounceable languages.

Thanks for your very clear explanation, Marco, and for making the effort to run these experiments. I agree that the results for Italian are very clear, and that Voynichese appears to behave differently.

I wonder what would happen if you ran the test with spaces removed from the texts? Sojustonelongwordlikethis? I can't even predict whether this would impact the Italian results or not. You'd get some unusual sequences where word-final vowels meet word-initial vowels for example. For Voynichese it should shake things up somewhat, but there would still be predictable patterns.

Thank you, Koen, interesting suggestion!
I removed the spaces, turning each whole line into a single word. It does not seem to make much difference. Italian has significant word-boundary effects: it tends to avoid vowel-vowel sequences across word boundaries. Word-boundary effects are also frequent in Currier B.
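
The preprocessing amounts to something like the following sketch. It assumes the input is plain text with whitespace between words; the function name is made up, and a transliteration file that marks word breaks with another character would need that character removed instead.

```python
def strip_word_breaks(lines):
    """Join all words on a line into one long 'word'; skip empty lines."""
    return ["".join(line.split()) for line in lines if line.strip()]
```

For example, the line "nel mezzo del cammin" becomes "nelmezzodelcammin", so the only remaining boundaries are the line starts and line ends.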

These are the vowels detected on the "no spaces" files:
Dante: a e i o u
CUVA_A: A I N
CUVA_B: A
EVA_A: a o
EVA_B: a o

So it seems that my guess that the problems with Voynichese are due to word-end / word-start preferences was unfounded.

PS: in the plots for Dante, deleting spaces moves 'u' from the extreme left to the right of the PC1 vowel range. This could indeed suggest that the special position of 'u' in the original file is due to the fact that it rarely appears at word end (while all other vowels are common in that position). But, as this experiment shows, the method is complex enough to make "intuition" very misleading...


MarcoP

Koen G

MarcoP