The Voynich Ninja
[Article] A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: News (https://www.voynich.ninja/forum-25.html)
+--- Thread: [Article] A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript (/thread-2972.html)



A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - Torsten - 16-10-2019

New paper about the VMS: "A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript"

The paper by Luis Acedo is available online.

The author concludes:

Quote:The most interesting results are, however, those obtained with the observation probability matrix, which clearly separate two kinds of characters to be associated with vowel and consonant phonemes ... On the other hand, this correspondence is not as strong as in the case of the English text of Section 3.1 because there are symbols with noticeable probability that appear in both figures (in particular, the EVA symbols 'e', 'i', 's' and 'y').



RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - davidjackson - 16-10-2019

I'm afraid I'm becoming very jaded about this sort of research. Great for uni credits no doubt, but not really adding anything to the world of Voynich.

It also illustrates what I was on about in the AI thread: garbage in, garbage out. The author explains:

Quote:We started with the same initial conditions as those given in Equations (11) and (12) for the transition matrix and the distribution of the state t=0. The probability matrix for the observation states (the Voynich characters) was randomized in the usual way explained in Section 3.1.

In other words, he initialised the HMM using results found from his previous analysis of an English-language translation of Don Quijote (Equations (11) and (12) being biased towards the English text), meaning the HMM would be biased towards English. Had he used the German version of Shakespeare, no doubt the HMM would have been biased towards German. Put simply, he expects to find vowels, and so his model does.

Secondly, the author has made a basic mistake in the transcription he used. He says (P.19):
Quote:At this point, it is also necessary to explain why we chose this particular alphabet [EVA] instead of the other alternatives. The main reasons are its popularity and the fact that many transcriptions are available for it. Otherwise, some specialists would argue that some symbol combinations in this alphabet, such as “ch” and “sh” (corresponding to the so-called “pedestals”), should be considered as one single character each. On the other hand, the combinations “in” and “iin” are also candidates for representing letters, although, in some other cases, “i” could be a single character. This is another problem that computational analyses could help to solve [...]

This way, a transcription of the whole Voynich manuscript has been performed in such a way that it can be used in computational analysis. [...] In particular, we used Takahashi’s transcription developed in 1999. Of course, some pre-processing was required before applying the HMM algorithm because this file includes some information about each line, including the folium number (recto or verso) and the number of the line within each page of the manuscript. After removing this information, we were left with a set of EVA characters separated by dots. These dots correspond to the spaces between words in the original manuscript.


Takahashi's transcription still uses combinations within EVA, noted by capitals; for example, Sh is a variant form of sh. At no point does the author mention that he changed these combinations into unique characters; what is more, he dismisses such cases and assumes they do not exist in EVA (this is incorrect). Therefore, his HMM will not correctly take these character distinctions into account, and hence his alphabet is incorrect. If his processing is case-insensitive, it will split Sh into s + h (when in the original manuscript this is a different glyph); if it is case-sensitive, it will introduce a non-existent character (S) into the model. Either way, the result does not correspond to reality and the algorithm is working with an incorrect transcription of the text.
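For what it's worth, handling these combinations before fitting anything is not hard. Below is a minimal tokenisation sketch in Python; the glyph list is illustrative only (my assumption, not anything from the paper), and a real run would need the full inventory of capitalised and benched forms used in Takahashi's file.

Code:
# Sketch: tokenise an EVA line so multi-character glyphs survive as
# single symbols. The glyph inventory below is illustrative only.
MULTI_CHAR_GLYPHS = ["Sh", "ch", "sh", "cth", "ckh", "cph", "cfh", "iin", "in"]

def tokenise_eva_word(word):
    """Greedily split an EVA word into glyph tokens, longest match first."""
    glyphs = sorted(MULTI_CHAR_GLYPHS, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for g in glyphs:
            if word.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Dots in the transcription mark word breaks:
line = "Shol.chor.qokeey"
print([tokenise_eva_word(w) for w in line.split(".")])
# [['Sh', 'o', 'l'], ['ch', 'o', 'r'], ['q', 'o', 'k', 'e', 'e', 'y']]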


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - ReneZ - 16-10-2019

(16-10-2019, 07:49 PM)davidjackson Wrote: The author explains:

Quote:We started with the same initial conditions as those given in Equations (11) and (12) for the transition matrix and the distribution of the state t=0. The probability matrix for the observation states (the Voynich characters) was randomized in the usual way explained in Section 3.1.

In other words, he initialised the HMM using results found from his previous analysis of an English-language translation of Don Quijote (Equations (11) and (12) being biased towards the English text).

No, that's not correct. These are completely arbitrary initial conditions that are slightly off from 'all probabilities equal', to prevent the process from getting stuck at the very beginning.
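To make that concrete: a perfectly uniform start is a symmetric fixed point of the re-estimation (both states look identical, so nothing ever changes), which is why one nudges it. A tiny numpy sketch of such an initialisation (my illustration, not the paper's code):

Code:
import numpy as np

# Rows slightly off 0.5/0.5 break the symmetry so Baum-Welch can move.
rng = np.random.default_rng(0)
A0 = np.full((2, 2), 0.5) + 0.01 * rng.standard_normal((2, 2))
A0 /= A0.sum(axis=1, keepdims=True)  # rows must remain probability vectors
print(A0)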

The point about use of Eva is of course correct.

The publication isn't a breakthrough in any sense of the word, but there is a dearth of published studies applying HMM analysis to the text.


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - davidjackson - 16-10-2019

(16-10-2019, 08:13 PM)ReneZ Wrote: No, that's not correct. These are completely arbitrary initial conditions that are slightly off from 'all probabilities equal', to prevent the process from getting stuck at the very beginning.
Really? Obviously you have far more experience than I do in these things, but I read it as: he initialised the algorithm with random values as explained in Section 2.3 (Equations (6)-(10)), then re-estimated and used the same values to carry out both runs (the initial values for Equations (11) and (12)).

From the introduction to section 3:
Quote:First, we consider the case of a text in English and we implement the model optimization algorithm to classify the letters of the alphabet (after removing all the punctuation signs) into two classes corresponding to the inner states of the HMM. It is shown that these classes are clearly associated with the vowels and the consonants in English and this provides the basic phonemic structure of the language. Testing the algorithm with a known language gives us the necessary confidence to apply it to the Voynich manuscript.



RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - ReneZ - 17-10-2019

Hi David, yes, indeed. The process consists of iteratively estimating the A and B matrices that best fit the properties of the text. One starts off with initial values for these matrices.
As he clearly says, Eq. 11 gives the initial estimate for the A matrix for the English text.
This is just meant to be close to:

Code:
0.5  0.5
0.5  0.5

The final value of the matrix, after convergence, for English, is given in Eq. 14.
Following that, he experiments with what happens for various initial A matrices and comes up with a statistical output (for English) in Eq. 18.

For the Voynich text, he again starts with the version in Eq. 11 and ends up with the output in Eq. 19.
He also does the 'varying bit' and has a statistical output in Eq. 21.
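For anyone who wants to see that re-estimation loop spelled out, here is a bare-bones two-state Baum-Welch sketch in Python/numpy, with per-step scaling to avoid numerical underflow. It is only an illustration of the procedure described above, not Acedo's code:

Code:
import numpy as np

def baum_welch_2state(obs, n_symbols, n_iter=100, seed=0):
    """Bare-bones Baum-Welch for a 2-state discrete HMM.
    obs is a 1-D sequence of integer symbol ids."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    A = np.full((2, 2), 0.5) + 0.01 * rng.random((2, 2))  # near-uniform start
    A /= A.sum(1, keepdims=True)
    B = rng.random((2, n_symbols))                        # randomised emissions
    B /= B.sum(1, keepdims=True)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: scaled forward pass
        alpha = np.zeros((T, 2)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        # scaled backward pass
        beta = np.zeros((T, 2)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
        gamma = alpha * beta                 # per-time state posteriors
        xi = np.zeros((2, 2))                # expected transition counts
        for t in range(T - 1):
            xi += (alpha[t][:, None] * A *
                   (B[:, obs[t+1]] * beta[t+1])[None, :]) / c[t+1]
        # M-step: re-estimate A and B from the expected counts
        A = xi / gamma[:-1].sum(0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(0)
        B /= gamma.sum(0)[:, None]
        pi = gamma[0]
    return A, B, pi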


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - davidjackson - 17-10-2019

Thank you for clearing that up. I withdraw the first part of my above comment.


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - Torsten - 21-10-2019

The outcome of a Hidden Markov Model (HMM) experiment is interesting since the algorithm doesn't know anything about vowels and consonants. 

The result of Acedo's experiment was that there is "no clear separation among vowels and consonants". Acedo tries to explain this as follows: "Perhaps, the most simple explanation for the absence of a clear separation among vowels and consonants in these four cases is that we are confronted with another example of letters that can function as both vowels and consonants as in the case of 'y' in English. However, for the Voynich manuscript, we have four letters with this capacity and this is a peculiarity whose meaning we cannot unravel for the moment. Another possibility is that the Voynich alphabet is some kind of abjad ..." (Acedo 2019, p. 11f).

Reddy and Knight also found a similar result: "Another method is to use a two-state bigram HMM over letters, and induce two clusters of letters with EM. In alphabetic languages like English, the clusters correspond almost perfectly to vowels and consonants. We find that a curious phenomenon occurs with the VMS – the last character of every word is generated by one of the HMM states, and all other characters by another; i.e., the word grammar is a*b" (Reddy and Knight 2011, p. 80).
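That two-state experiment is easy to try with an off-the-shelf library. A sketch, assuming a recent hmmlearn (0.2.8 or later, where CategoricalHMM is the discrete-emission model); this is my illustration, not the code used in either paper:

Code:
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

def two_state_clusters(text):
    """Fit a 2-state discrete HMM and report which state each letter favours."""
    alphabet = sorted(set(text))
    sym2id = {s: i for i, s in enumerate(alphabet)}
    X = np.array([[sym2id[c]] for c in text])
    model = hmm.CategoricalHMM(n_components=2, n_iter=200, random_state=1)
    model.fit(X)
    B = model.emissionprob_  # rows: hidden states, columns: symbols
    for k, s in enumerate(alphabet):
        print(f"{s!r}: mostly emitted by state {int(np.argmax(B[:, k]))}")

# On a long enough English text the two states line up with vowels/consonants:
two_state_clusters("thequickbrownfoxjumpsoverthelazydog")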

It is noteworthy that the two experiments confirm the observations described by Currier: "There seem to be very strong constraints in combinations of symbols; only a very limited number of letters occur with each other in certain positions of a 'word'" (Currier). Note: typical in word-final position are 'y' (15409 out of 37919 words, or 40.6%), 'n' (6064 times, or 16.0%), 'l' (5909 times, or 15.6%) and 'r' (5689 times, or 15.0%). Typical in word-initial position are 'o' (8530 out of 37919 words, or 22.5%), 'c?' (6921 times, or 18.3%) and 'q' (5389 times, or 14.2%).
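Those positional counts are easy to reproduce from a transliteration file. A sketch (the file name and comment convention are placeholders; real transliteration files have their own locus formats that need stripping first):

Code:
from collections import Counter

# Tally word-initial and word-final characters; "takahashi.txt" is a
# placeholder for any EVA transliteration with dots as word separators.
words = []
with open("takahashi.txt") as fh:
    for line in fh:
        if line.startswith("#"):  # skip comment/locus lines
            continue
        words.extend(w for w in line.strip().split(".") if w)

total = len(words)
for label, counts in (("initial", Counter(w[0] for w in words)),
                      ("final", Counter(w[-1] for w in words))):
    print(f"word-{label}:")
    for ch, n in counts.most_common(4):
        print(f"  {ch!r}: {n} ({100 * n / total:.1f}%)")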


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - ReneZ - 22-10-2019

It has not been possible for anyone (at least as far as I know) to repeat the result that was given in the paper by Reddy and Knight. I am not even certain whose analysis it is that is being summarised.

The result of Acedo is confirmed (and much more) at a page with visual evidence about the lack of separation between vowels and consonants for different transliteration alphabets.


RE: A Hidden Markov Model for the Linguistic Analysis of the Voynich Manuscript - MarcoP - 22-10-2019

I find the HMM approach difficult to fully understand, but obviously I lack the necessary mathematical background. Sukhotin's algorithm seems much clearer to me: divide the set of all symbols into two classes (1 and 0), so that the number of consecutive symbols alternating between the two classes is maximized. This also results in what seems to me an intuitive quality measure: how many of the pairs of consecutive symbols correspond to an alternation (you want to have many 10 and 01 pairs and as few 11 and 00 pairs as possible).

Could someone please explain in simple words why an HMM approach is better?

Also, in Acedo's approach, I am not convinced of the meaningfulness of treating space as a symbol, together with characters. We want to classify symbols as vowels or consonants, and obviously "space" has nothing to do with sounds. Only considering consecutive characters "inside" words (as Hulden and Sukhotin do) seems to me the most sensible thing.
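To illustrate both points, here is a small Python sketch of that quality measure restricted to pairs inside words, with a naive hill-climbing split on top of it. This is not Sukhotin's exact procedure (which works with running row sums), just the alternation idea:

Code:
from itertools import chain

def alternation_score(words, class_one):
    """Fraction of within-word adjacent pairs that alternate between the
    two classes (many 10/01 pairs, few 11/00 pairs). Spaces never enter,
    because pairs are only formed inside each word."""
    pairs = list(chain.from_iterable(zip(w, w[1:]) for w in words))
    alt = sum((a in class_one) != (b in class_one) for a, b in pairs)
    return alt / len(pairs)

def greedy_split(words):
    """Toggle one symbol at a time while the score improves. The measure is
    symmetric, so the returned class may be the vowels or the consonants."""
    symbols = set(chain.from_iterable(words))
    class_one = set()
    improved = True
    while improved:
        improved = False
        for s in symbols:
            candidate = class_one ^ {s}  # move s to the other class
            if alternation_score(words, candidate) > alternation_score(words, class_one):
                class_one = candidate
                improved = True
    return class_one

words = "the quick brown fox jumps over the lazy dog".split()
print(greedy_split(words))  # one of the two classes; its complement is the other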