The Voynich Ninja
HMM automatic vowel detection - Printable Version




RE: HMM automatic vowel detection - MarcoP - 28-09-2017

I have run some tests on Voynichese, using the Currier B text in the Currier transcription as in the paper by Reddy and Knight. I extracted Currier's transcription using Rene's ivtt tool.

The Python scripts only consider a limited number of characters (I increased the limit from 1000 to 2000). In the portion I analyzed, the character i only appears in sequences that Currier maps to specific characters. In fact, this transcription groups all the most likely digraphs and trigraphs, with the exception of sequences of C (EVA:e).

3 EVA: iiin
4 EVA: q
6 EVA: g
8 EVA: d
9 EVA: y
A EVA: a
B EVA: p
C EVA: e
E EVA: l
F EVA: k
I EVA: i
J EVA: m
M EVA: iin
N EVA: in
O EVA: o
P EVA: t
Q EVA: cth
R EVA: r
S EVA: ch
T EVA: ir
U EVA: iir
V EVA: f
W EVA: cph
X EVA: ckh
Y EVA: cfh
Z EVA: sh
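
The table above can be turned into a simple lookup for converting Currier strings to EVA. This is only a minimal sketch: the names `CURRIER_TO_EVA` and `currier_to_eva` are made up for illustration and are not part of ivtt or of the scripts used in this thread, and passing unlisted symbols through unchanged is an assumption.

```python
# Currier symbol -> EVA string, copied from the table in this post.
CURRIER_TO_EVA = {
    "3": "iiin", "4": "q", "6": "g", "8": "d", "9": "y",
    "A": "a", "B": "p", "C": "e", "E": "l", "F": "k",
    "I": "i", "J": "m", "M": "iin", "N": "in", "O": "o",
    "P": "t", "Q": "cth", "R": "r", "S": "ch", "T": "ir",
    "U": "iir", "V": "f", "W": "cph", "X": "ckh", "Y": "cfh",
    "Z": "sh",
}


def currier_to_eva(text):
    """Convert a Currier-encoded string to EVA, symbol by symbol.

    Symbols that are not in the table (spaces, rare characters) are
    copied unchanged; that choice is an assumption of this sketch.
    """
    return "".join(CURRIER_TO_EVA.get(ch, ch) for ch in text)


print(currier_to_eva("8AM 4OFCC89"))
```

With this table, `8AM` becomes `daiin` and `4OFCC89` becomes `qokeedy`.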


I have been unable to replicate Reddy and Knight's results. They wrote:

Quote: Another method is to use a two-state bigram HMM (Knight et al., 2006; Goldsmith and Xanthos, 2009) over letters, and induce two clusters of letters with EM. In alphabetic languages like English, the clusters correspond almost perfectly to vowels and consonants. We find that a curious phenomenon occurs with the VMS – the last character of every word is generated by one of the HMM states, and all other characters by another; i.e., the word grammar is a*b.

They clearly say that one of the two states only generates the very last character at the end of words. The other state should have a high probability loop allowing it to generate most of the word. A low probability link should lead to the other state that generates EVA:y EVA:r and not much more.
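
For readers who want to see the method itself, here is a minimal sketch of a two-state HMM over characters fitted with EM (Baum-Welch). It is not the script used in this thread and not Reddy and Knight's code; the function name `fit_two_state_hmm`, the random initialisation and the toy input are all assumptions made for illustration.

```python
import math
import random


def fit_two_state_hmm(text, n_iter=30, seed=0):
    """Fit a 2-state HMM to a character string with EM (Baum-Welch).

    Returns (symbols, start, trans, emit, log_likelihoods).
    """
    rng = random.Random(seed)
    symbols = sorted(set(text))
    obs = [symbols.index(c) for c in text]
    K, V, T = 2, len(symbols), len(obs)

    def random_dist(n):
        weights = [rng.random() + 0.5 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    def normalize(row):
        total = sum(row) or 1.0  # guard against a state that is never used
        return [x / total for x in row]

    start = random_dist(K)
    trans = [random_dist(K) for _ in range(K)]
    emit = [random_dist(V) for _ in range(K)]
    log_likelihoods = []

    for _ in range(n_iter):
        # Forward pass, rescaled at every step to avoid underflow.
        alpha, scale = [], []
        for t in range(T):
            if t == 0:
                row = [start[i] * emit[i][obs[0]] for i in range(K)]
            else:
                row = [sum(alpha[t - 1][i] * trans[i][j] for i in range(K))
                       * emit[j][obs[t]] for j in range(K)]
            c = sum(row)
            scale.append(c)
            alpha.append([x / c for x in row])
        # Log-likelihood of the data under the current parameters.
        log_likelihoods.append(sum(math.log(c) for c in scale))

        # Backward pass, using the same scaling factors.
        beta = [[1.0] * K for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(K):
                beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in range(K)) / scale[t + 1]

        # E-step: expected transition and emission counts.
        trans_counts = [[0.0] * K for _ in range(K)]
        emit_counts = [[0.0] * V for _ in range(K)]
        for t in range(T):
            for i in range(K):
                emit_counts[i][obs[t]] += alpha[t][i] * beta[t][i]
                if t < T - 1:
                    for j in range(K):
                        trans_counts[i][j] += (alpha[t][i] * trans[i][j]
                                               * emit[j][obs[t + 1]]
                                               * beta[t + 1][j] / scale[t + 1])

        # M-step: re-estimate the parameters from the expected counts.
        start = normalize([alpha[0][i] * beta[0][i] for i in range(K)])
        trans = [normalize(r) for r in trans_counts]
        emit = [normalize(r) for r in emit_counts]

    return symbols, start, trans, emit, log_likelihoods


# Toy input (not Voynichese): show which symbols each state favours.
demo = "banana bandana cabana panama " * 20
symbols, start, trans, emit, ll = fit_two_state_hmm(demo)
for state in range(2):
    favoured = [s for k, s in enumerate(symbols)
                if emit[state][k] > emit[1 - state][k]]
    print(state, favoured, [round(p, 2) for p in trans[state]])
```

The "number the software is trying to optimize" corresponds here to the log-likelihood, which EM should never decrease from one iteration to the next. Different seeds can converge to different local optima, so it is worth comparing the final value across several runs.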

The result I obtain is very different. The two states alternate, as we have seen in other languages. One of the two states generates Currier symbols 9, A, C, O (EVA: y a e o) and the space character. This state has a 42% probability of looping on itself (likely to account for C9 -ey- and the various CC and CCC -eee- sequences).
   

This example shows how characters generated by the two states tend to alternate.
   

Other runs produce a lower-quality fit in which the two states tend to loop on themselves (the software outputs the number it is trying to optimize, and these runs don't reach as good a value as those described above). This model is different from the examples we have seen. One of the states includes C, O, S, Z (EVA: e o ch sh) and all the gallows and benched gallows. This state typically generates the first half of a word, with the frequent exception of the prefix 4 (EVA:q). The other state generates the other half of the word (suffixes and the prefix 4) and the space character. Words appear to be split into two separate parts, with a first half (or "core") generated by state 1 (in red in the example below).
   
   

Maybe it's useful to write once more that the main reason for these experiments is to understand more of what Reddy and Knight wrote. I wasn't very successful (but maybe there are variants of these tests that could come closer to what they observed). Anyway, there often is something to learn from failure.


RE: HMM automatic vowel detection - Emma May Smith - 28-09-2017

Marco, would it be possible to alter the input text to test some hypotheses?

Specifically, I'm interested in what difference it would make if [a] and [y] were transcribed with the same character, and further if a [y] was placed between [e] and [d] when they appeared in the string [ed].

If they lead to a stronger split between the two states, with less looping, would that be a positive sign?


RE: HMM automatic vowel detection - davidjackson - 28-09-2017

K-R say they are using characters A-Z, *, 1-9. You seem to be missing some of those in your explanation above. Are they included?
The source website transcription included in the paper is still live; you might want to use that one instead of creating a new transcription file.


RE: HMM automatic vowel detection - MarcoP - 28-09-2017

(28-09-2017, 09:01 PM)davidjackson Wrote: K-R say they are using characters A-Z, *, 1-9. You seem to be missing some of those in your explanation above. Are they included?
The source website transcription included in the paper is still live; you might want to use that one instead of creating a new transcription file.

Hi David,
I think that the ivtt file is consistent with the online version, but I will check in the next few days. As I wrote above, the Python script I am using only considers 2000 characters, and some of Currier's symbols are so rare that they are likely missing from that sample. I am afraid that these really are details and they cannot explain why I don't get anything similar to Reddy and Knight's results. But you are certainly right that differences should be minimized when trying to replicate an experiment. I will see if I can fix both these issues.
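
Whether rare symbols drop out of a truncated sample is easy to check. The sketch below assumes the alphabet David quotes (A-Z, *, 1-9); the function name `missing_symbols` and the toy input are illustrative only.

```python
import string

# Alphabet as quoted from the paper in this thread: A-Z, *, 1-9.
CURRIER_ALPHABET = set(string.ascii_uppercase + "*" + "123456789")


def missing_symbols(text, limit, alphabet=CURRIER_ALPHABET):
    """Return the alphabet symbols that never occur in the first `limit` characters."""
    return sorted(alphabet - set(text[:limit]))


# Toy input; a real check would read the transcription file instead.
sample = "8AM 4OFCC89"
print(missing_symbols(sample, 2000))
```

Running it with limits of 2000 and 30000 on the same file would show directly which symbols the smaller sample never sees.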


RE: HMM automatic vowel detection - MarcoP - 28-09-2017

(28-09-2017, 08:56 PM)Emma May Smith Wrote: Marco, would it be possible to alter the input text to test some hypotheses?

Specifically, I'm interested in what difference it would make if [a] and [y] were transcribed with the same character, and further if a [y] was placed between [e] and [d] when they appeared in the string [ed].

If they lead to a stronger split between the two states, with less looping, would that be a positive sign?

Hi Emma,
Of course it's easy to modify the input file. What is difficult is comparing results with different inputs. I think that the first result I discussed above already provided a strong split between states.

With the edits you suggest, the results change. The two states still alternate. Now one of the states generates 9, O, S, Z and Q (but this last one is ambiguous). Compared with the unedited model, space and C (EVA:e) are excluded in favor of EVA:ch and sh.
Since A and 9 were already generated by the same state, I don't think that making them identical can change things significantly. The changes must be due to C8 (EVA: ed) becoming C98 (EVA: eyd).
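
For anyone who wants to repeat this test, the two edits can be sketched on Currier-encoded text as below. The function name `apply_suggested_edits` is illustrative, and writing the merged a/y character as 9 (rather than A) is an assumption; either choice gives the model the same information.

```python
def apply_suggested_edits(currier_text):
    """Apply the two edits to a Currier-encoded string.

    1. Transcribe A (EVA:a) and 9 (EVA:y) with the same character (here: 9).
    2. Insert 9 (EVA:y) between C (EVA:e) and 8 (EVA:d) wherever C8 occurs.
    """
    merged = currier_text.replace("A", "9")
    return merged.replace("C8", "C98")


print(apply_suggested_edits("8AM 4OFCC89"))
```

The order of the two steps does not matter here, because the first only touches A and the second only touches the pair C8.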


RE: HMM automatic vowel detection - Emma May Smith - 28-09-2017

Thanks Marco. I don't think my suggestions improve the split at all! It seems they create more problems.

Something to think about.


RE: HMM automatic vowel detection - ReneZ - 29-09-2017

<deleted>
Need to understand this better first.

Marco, when you mention a limit of 1000 characters, is this the length of the text?
From what I understand of the algorithm, the text should not have to be stored in memory, and should only have to be read once, so such a low limit would not make much sense.

The overall VMS has about 160,000 characters (not counting spaces).


RE: HMM automatic vowel detection - MarcoP - 29-09-2017

(29-09-2017, 05:45 AM)ReneZ Wrote: Marco, when you mention a limit of 1000 characters, is this the length of the text?
From what I understand of the algorithm, the text should not have to be stored in memory, and should only have to be read once, so such a low limit would not make much sense.

The overall VMS has about 160,000 characters (not counting spaces).

Hello Rene,
I am really using the software as a black box, but I have the impression that the size has been limited to reduce processing time. I see empirically that the optimization algorithm takes longer when I increase the number of characters to consider. I guess the optimization must somehow match the output of the model against the actual data at each step: the bigger the data set, the slower the optimization process.

___________________________


I have run some new experiments on the basis of [link] (thank you, David!).
Rene pointed me to an explanatory [link] (thank you, Rene!).

I have downloaded the [link] mentioned by Reddy and Knight.

Unfortunately, the downloaded file is not identical to what is published in the paper. For instance, this is the first line of [link] as published in the paper (top) and as it appears in the online file (bottom):
Code:
        BAR ZC9 FCC89 ZCFAE 8AE 8AR OE BSC89 ZCF 8AN OVAE ZCF9
<f81v.1> BAR.ZC9.PCC89.ZCFAE.8AE.8AR.OE.BSC89.ZCF.8AN.OVAEZCF9-

As you can see, the first letter of the third word differs and the last word is split into two words in the paper.
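
A comparison like this can be automated. The sketch below parses the two lines quoted above and reports the positions where they differ; the function names are illustrative, and treating `.` as a word separator and a trailing `-` as an end-of-line marker is an assumption based only on this one line.

```python
import re


def parse_line(line):
    """Split a transcription line into words, dropping a leading <locus> tag."""
    body = re.sub(r"^<[^>]*>\s*", "", line.strip())
    body = body.rstrip("-")  # assumed end-of-line marker
    return [w for w in re.split(r"[.\s]+", body) if w]


def word_differences(words_a, words_b):
    """Return (index, word_a, word_b) for each position where the words differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(words_a, words_b)) if a != b]


paper = parse_line("BAR ZC9 FCC89 ZCFAE 8AE 8AR OE BSC89 ZCF 8AN OVAE ZCF9")
online = parse_line("<f81v.1> BAR.ZC9.PCC89.ZCFAE.8AE.8AR.OE.BSC89.ZCF.8AN.OVAEZCF9-")

print(len(paper), len(online))
print(word_differences(paper, online))
```

The positional comparison is deliberately naive: once a word is split or merged, every later word is shifted, so for whole files an alignment tool such as Python's `difflib` would be a better fit.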

Another problem is that Currier character 1 (corresponding to EVA:iiil) doesn't occur in the voynich.now file (the sequence is encoded as IIIE). But these doubtful points are really marginal and are unlikely to have an impact on the results (iiil occurs 2 times in the whole manuscript, according to voynichese.com).
I also listed the occurrences of each Currier character in the whole voynich.now file (the listing is not reproduced here).
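
A per-character tally of this kind can be reproduced with a few lines of Python. This is a sketch only: `currier_counts` is an illustrative name, the input is assumed to have its locus tags already removed, and the set of ignored separator characters is an assumption.

```python
from collections import Counter


def currier_counts(text, ignore=" .-\n"):
    """Count each character in the text, skipping separators."""
    return Counter(ch for ch in text if ch not in ignore)


# Toy input: the single line quoted earlier, without its locus tag.
line = "BAR.ZC9.PCC89.ZCFAE.8AE.8AR.OE.BSC89.ZCF.8AN.OVAEZCF9-"
for symbol, count in currier_counts(line).most_common():
    print(symbol, count)
```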

I have considerably increased the number of characters processed by the Python script (to 30000), and the results I obtain are once again different. The transition matrix again results in the two states alternating. Characters 9, O, A, S, Z (EVA: y o a ch sh) are generated by one of the two states. The other state has a higher-probability loop (43%), which I still think depends on the presence of Currier C (EVA:e), which tends to occur several times consecutively. The different results could depend both on the new information in the much larger data set and on differences between the voynich.now file and the ivtt transcription I previously used.
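
To put the loop probabilities in perspective: in an HMM, a state with self-loop probability p emits runs whose length is geometrically distributed, with mean 1/(1-p). The small sketch below (the function name is illustrative) applies this to the 42% and 43% values reported in this thread.

```python
def expected_run_length(p_loop):
    """Mean number of consecutive symbols emitted by a state before leaving it."""
    if not 0.0 <= p_loop < 1.0:
        raise ValueError("p_loop must be in [0, 1)")
    return 1.0 / (1.0 - p_loop)


for p in (0.42, 0.43):
    print(p, round(expected_run_length(p), 2))
```

So a loop of about 43% corresponds to runs of roughly 1.75 symbols on average, which seems compatible with a mostly alternating model that occasionally has to absorb doubled or tripled characters.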

While 9, O and A are consistently grouped together, it is not clear which other characters belong with them. C (EVA:e) or S/Z (EVA:ch/sh)?

Sukhotin's algorithm, applied to a different transcription, pointed to EVA: o y a e as the characters most likely to correspond to vowels. But the issue is certainly difficult and, while I find the results of these algorithms interesting and instructive, I am sure they cannot provide any convincing result without further linguistic analysis.
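
For comparison with the HMM approach, Sukhotin's algorithm is simple enough to sketch in a few lines. This is a generic version written for illustration, not the implementation behind the result mentioned above; the function name and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Classify characters as vowels with Sukhotin's algorithm.

    Counts how often each pair of different characters is adjacent, then
    repeatedly promotes the remaining character with the largest positive
    sum to vowel and penalises the characters adjacent to it.
    """
    adj = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal of the adjacency matrix stays zero
                adj[a][b] += 1
                adj[b][a] += 1

    sums = {c: sum(neigh.values()) for c, neigh in adj.items()}
    vowels = set()
    while True:
        candidates = [c for c in sums if c not in vowels]
        if not candidates:
            break
        # Ties are broken by character order (an arbitrary choice).
        best = max(candidates, key=lambda c: (sums[c], c))
        if sums[best] <= 0:
            break
        vowels.add(best)
        for c in candidates:
            if c != best:
                sums[c] -= 2 * adj[c].get(best, 0)
    return vowels


print(sorted(sukhotin_vowels("banana bandana cabana panama".split())))
```

The algorithm rests on the assumption that vowels and consonants tend to alternate, which is also what an alternating two-state HMM ends up modelling.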


RE: HMM automatic vowel detection - Davidsch - 29-09-2017

If you have a question, you can simply e-mail Knight. If your question is brief and the paper is fairly recent, he will reply, in my experience.


RE: HMM automatic vowel detection - Koen G - 30-09-2017

Marco, I took text from some pages and transcribed them taking into account some possible digraphs on the one hand and developing the benches on the other - a scenario I deem possible. It's obviously not a proposed solution, just a test to see what happens when the text is written this way. If possible, could you check which results your program gives for it?

(I kept EVA q as q, though it's clear that in this transcription it would take on a vowel value, likely one already represented differently elsewhere)