The Voynich Ninja

Full Version: Principal component analysis of characters in the Voynich manuscript
There is a new paper about the VMS: "Principal component analysis of characters in the Voynich manuscript and their classifications based on comparative analysis of writings in known languages"

The paper by Jonas Alin can be found [link].

The author concludes: 
Quote:the Voynich language, unlike the known languages, shows proportionally larger separation also along PC2 in its character transition frequencies, showing that the Voynich characters probably abide by more complex co-occurrence rules than characters of typical known languages.


An additional finding is: 
Quote:The Voynich character’s observation points were generally close together in the plot if the characters looked similar. That is, the more similar the characters look, the more similar would be their pattern of transition probabilities to other characters.
I've just read the abstract. How similar is it to the analysis I did in March 2018, here: [link], and followed up here: [link]?
Hi Donald,
a similar approach (described in a paper by Connor Mayer) was recently pointed out by Stephen Carlson [link]. From the few experiments I discuss in that thread, Mayer's software generates graphs that are very different from yours. I guess this depends on his normalization method: I should try different normalization methods and see if I can duplicate your results with Mayer's code.

Your plots do indeed seem compatible with the abstract of Alin's paper. Here I have colour-edited one of your charts to make it more readable. I also manually added two lines according to what I see as visual clusters, also keeping [link] in mind.

[attachment=3875]

The symbols below the red line correspond to the characters that Guy identifies as most likely vowels: EVA:o, a, y.
The symbols between the blue and red lines are somehow "intermediate". The set includes 'i', 'e', and the "benches", comparable to what Guy identifies as possible vowels, though with much less confidence than the three "circle-glyphs". However, this area includes other glyphs too, and Guy's transcription system handles 'h' as an individual character, so there are also significant differences here.

Consistent with Alin's finding, clusters in your chart appear to be diagonally oriented: i.e. they are based on both PC1 and PC2.
From your plot it is also clear that (as Alin says) similar characters behave similarly: the three circles are in the same area (though with considerable space separating them). In the "intermediate" area, the two benches and EVA:e are close to each other, as are 'r' and 's'. At the top of the chart, gallows and bench-gallows also form distinct diagonally-separated clusters. Of course, it would be interesting to read the whole paper, but my impression is that Alin's results are very close to yours.
If you use frequencies instead of log frequencies, you get plots like this: [link]. This loses information, though, and vowels no longer cleanly separate from consonants. So if he did get a clean separation, I suppose he must have used log frequencies. Similarly shaped VMS glyphs do indeed cluster together, but shapes in European alphabets don't particularly match pronunciation or other phonetic features (unlike, say, those in Hangeul). This itself is an interesting observation, suggesting that whoever devised the script might have taken phonetics into account. I'd like to read Jonas Alin's paper but I don't have access to Cryptologia.
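
For anyone who wants to try the frequency versus log-frequency comparison, here is a minimal Python sketch of the idea. It is not the code behind any of the plots in this thread. It counts character-to-next-character transitions, then runs PCA (via NumPy's SVD) once on row-normalised frequencies and once on log counts. The sample sentence, the add-one smoothing before the logarithm, and the helper names transition_counts and pca_scores are all illustrative assumptions.

```python
import numpy as np

def transition_counts(text, alphabet):
    """Count how often each character is immediately followed by each other one."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(text, text[1:]):
        if a in index and b in index:
            counts[index[a], index[b]] += 1
    return counts

def pca_scores(rows, n_components=2):
    """Centre the rows and project them onto the first principal components."""
    centred = rows - rows.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Illustrative English sample; space is kept as a character here.
sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
alphabet = sorted(set(sample))
counts = transition_counts(sample, alphabet)

# Plain relative frequencies per row (guarding against empty rows).
row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
freq = counts / row_sums

# Assumption: add-one smoothing so that zero counts have a finite logarithm.
log_freq = np.log(counts + 1.0)

scores_raw = pca_scores(freq)      # one (PC1, PC2) point per character
scores_log = pca_scores(log_freq)  # same, but from log-transformed counts

for ch, (pc1, pc2) in zip(alphabet, scores_log):
    print(repr(ch), round(pc1, 2), round(pc2, 2))
```

Plotting scores_raw and scores_log side by side is one way to see how much the choice of transform changes the clusters.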
Hello, I wrote the Cryptologia paper "Principal component analysis of characters in the Voynich manuscript and their classifications based on comparative analysis of writings in known languages".

You can read the full article here: 
[link]

It's an e-print link that contains about 30 copies for now. If they have run out, contact me and I can send you a pdf by email.

Best regards, Jonas Alin
We have systematically misspelled "Alin" as "Alan": my apologies to Jonas! I wonder if an administrator could correct at least the first post in this thread?
(09-02-2020, 11:46 AM)MarcoP Wrote: We have systematically misspelled "Alin" as "Alan": my apologies to Jonas! I wonder if an administrator could correct at least the first post in this thread?

Thanks for noticing this :)
I have finally found the time to read Jonas' paper.

The described approach is indeed very similar to the one discussed by Donald Fisk (see above). If I understand correctly, the main difference is that Donald treats space as a character while Jonas doesn't: personally, in the context of a glyph/sound correspondence, I favour Jonas' approach, since space does not represent a sound. Of course, if glyphs do not represent sounds, considering the relationship with space could make more sense, but both Donald and Jonas appear to be exploring the "phonetic" line, discussing glyphs in terms of vowels and consonants. Though this assumption may be wrong, it is still the idea I favour the most: I am happy to see it explored under different approaches!

Both authors apparently only consider transitions from one character to the following one. The alternative would be to also consider transitions from each preceding character to the one being examined, so that for an alphabet with C glyphs you get C*2 dimensions for each glyph X (aX Xa bX Xb cX Xc etc). I think that also considering the transition from the preceding character could give more robust results: for instance, in English 'c' is often followed by the consonant 'k'; by also considering the preceding character, words like 'black' or 'quick' will help to classify 'c' correctly as a consonant.
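
As a rough illustration of the C*2-dimensional vectors described above, here is a small Python sketch. It assumes transitions are counted only within words (space is not a character); the tiny English word list and the function name context_vectors are made up for the example.

```python
from collections import Counter

def context_vectors(words, alphabet):
    """For each glyph X, build 2*C counts: [aX, bX, ...] followed by [Xa, Xb, ...]."""
    before = {x: Counter() for x in alphabet}  # counts of glyphs preceding X
    after = {x: Counter() for x in alphabet}   # counts of glyphs following X
    for word in words:
        for prev, cur in zip(word, word[1:]):
            if prev in before and cur in before:
                before[cur][prev] += 1
                after[prev][cur] += 1
    return {
        x: [before[x][c] for c in alphabet] + [after[x][c] for c in alphabet]
        for x in alphabet
    }

words = ["black", "quick", "clock", "cat"]
alphabet = sorted(set("".join(words)))
vectors = context_vectors(words, alphabet)

c = len(alphabet)
preceding, following = vectors["c"][:c], vectors["c"][c:]
print(dict(zip(alphabet, preceding)))  # which glyphs come before 'c'
print(dict(zip(alphabet, following)))  # which glyphs come after 'c'
```

In this toy list 'c' is followed mostly by 'k', but it is preceded by 'a', 'i' and 'o'; the preceding half of the vector carries that extra information into whatever analysis comes next.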

Since there is so much uncertainty about the VMS, it is always informative to see what happens with different set-ups. Once you have a computational tool in place, it is relatively easy to feed it multiple data-sets. E.g.:
  • different transliteration systems (EVA, CUVA, 101, etc)
  • different subsections of the VMS (Currier A/B, image-defined sections)
  • different systems for computing glyph vectors (with/without spaces, only following glyphs, only preceding glyphs, both following and preceding glyphs)

I attach Fig.7 from Jonas' paper (you can ignore the boxed characters and focus on the non-boxed labels).
Coming back to comparing Donald's and Jonas' work, their results are similar in that the "circles" EVA:o,a,y are somehow separated from the other glyphs: this is something on which most analyses tend to agree (e.g. [link] Sukhotin experiments; it is also what Emma and I found in [link], fig.4.1, based on very different data).

An important exception is [link], which points out that 'y' does not seem to belong to the same class as 'o' 'a'.

Jonas' results about the "gallows" largely differ from other similar work: typically, the four gallows EVA:k,t,p,f are reported as behaving similarly, and the four benched gallows EVA:ckh,cth,cph,cfh are also reported as being close to each other. Here 'p' 'f' are opposite to 'k' 't' both along PC1 and PC2.

Jonas' conclusion is that this method does not allow a clear separation into vowels and consonants. This is also what Rene found with HMM. Of course, these results suggest that the phonetic idea I favour is wrong.
Quote:Both authors apparently only consider transitions from one character to the following one. The alternative would be to also consider transitions from each character to the one being examined, so that for an alphabet with C glyphs you get C*2 dimensions for each glyph X (aX Xa bX Xb cX Xc etc). I think that also considering the transition from the preceding character could result in more robust results: for instance, in English 'c' is often followed by the consonant 'k', by also considering the preceding character, words like 'black' or 'quick' will help correctly classifying 'c' as a consonant.

I would just like to add that strict vowel-consonant separation was not the goal of the article. I agree that considering trigraph frequencies (both preceding and succeeding characters) could strengthen vowel-consonant separation/identification (this is, for example, what Sukhotin's algorithm does; see the references below). But at the same time, you may miss out on some important digraph relationships.
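
For readers who have not seen it, here is a minimal Python sketch of Sukhotin's algorithm as it is usually described: it counts adjacent pairs in either order (so both neighbours of each character contribute), then repeatedly promotes the character with the largest remaining adjacency sum to "vowel". This is an illustration written for this thread, not code from the Guy (1991) paper; the function name, the tie-breaking rule and the sample words are assumptions.

```python
from collections import defaultdict

def sukhotin_vowels(words):
    """Return the characters classified as vowels, in the order they are found."""
    adj = defaultdict(lambda: defaultdict(int))  # symmetric adjacency counts
    letters = set()
    for word in words:
        letters.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal is kept at zero
                adj[a][b] += 1
                adj[b][a] += 1

    sums = {ch: sum(adj[ch].values()) for ch in letters}
    vowels = []
    while True:
        consonants = [ch for ch in letters if ch not in vowels]
        if not consonants:
            break
        # Assumption: ties are broken alphabetically (by the larger character).
        best = max(consonants, key=lambda ch: (sums[ch], ch))
        if sums[best] <= 0:
            break
        vowels.append(best)
        # Contacts with the new vowel now count against being a vowel.
        for ch in consonants:
            if ch != best:
                sums[ch] -= 2 * adj[ch][best]
    return vowels

print(sukhotin_vowels(["banana", "panama", "canada"]))
```

The method relies on vowels and consonants tending to alternate, so it can misfire on scripts or ciphers where that assumption does not hold.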

Also, Acedo's HMM algorithm both treats space as a character and treats the whole text as one long running line of characters. My main reservation about treating the text this way is that characters in the most frequently occurring words get more statistical weight.
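
The weighting effect can be seen in a small Python sketch. It compares digraph counts taken over running text (every occurrence of a word counts) with counts taken over distinct word types (each word counts once). The second option is only one possible alternative, shown for contrast; the token sequence and the function name bigram_counts are made up for the example.

```python
from collections import Counter

def bigram_counts(words):
    """Count within-word character pairs over the given collection of words."""
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    return counts

# Made-up running text in EVA-style transliteration.
tokens = "daiin daiin daiin chol daiin shedy chol".split()

by_token = bigram_counts(tokens)       # frequent words dominate the counts
by_type = bigram_counts(set(tokens))   # each distinct word contributes once

for pair in [("a", "i"), ("c", "h"), ("e", "d")]:
    print(pair, by_token[pair], by_type[pair])
```

Here the pair 'a','i' is four times as heavy in the running-text counts simply because one word is repeated, while a pair from a word that occurs once has the same weight in both.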

A second comment (edit): even though trigraph statistics were not considered, the characters that precede a given character can still be read off the original variable vectors (the lines labelled with the boxed Voynich glyphs): the most common preceding characters are those whose observation points lie in a similar direction to the variable axis of that boxed character.

References

Guy, Jacques B M. 1991. “Vowel identification: an old (but good) algorithm.” Cryptologia 15 (3): 258–62.

Acedo, Luis. 2019. “A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript.” Mathematical and Computational Applications 24 (1): 14.
(29-02-2020, 10:59 AM)MarcoP Wrote: different subsections of the VMS (Currier A/B, image-defined sections)

As far as I can tell at present, separate analyses of the A and B sections give basically the same clustering and placements around the original variable axes, only with different rotations relative to the principal components. The only significant difference between A and B appears to be the placement of the q character and original variable axis, with the qo transition lying more along the direction of the e axis in the A language and in the direction of the y axis in the B language.