28-09-2017, 08:32 PM
I have run some tests on Voynichese, using the Currier B text in the Currier transcription as in the paper by Reddy and Knight. I extracted Currier's transcription using Rene's ivtt tool.
The Python scripts only consider a limited number of input characters (I raised the limit from 1000 to 2000). In the portion I analyzed, EVA:i only appears inside sequences that Currier maps to single symbols. In fact, this transcription groups all the most likely digraphs and trigraphs into single symbols, with the exception of C (EVA:e). The full mapping is listed below (and repeated as a Python dict after the table).
3 EVA: iiin
4 EVA: q
6 EVA: g
8 EVA: d
9 EVA: y
A EVA: a
B EVA: p
C EVA: e
E EVA: l
F EVA: k
I EVA: i
J EVA: m
M EVA: iin
N EVA: in
O EVA: o
P EVA: t
Q EVA: cth
R EVA: r
S EVA: ch
T EVA: ir
U EVA: iir
V EVA: f
W EVA: cph
X EVA: ckh
Y EVA: cfh
Z EVA: sh
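For convenience, here is the same mapping as a Python dict (my own rendering of the table above, handy for converting between the two alphabets):

Code:
# Currier-to-EVA mapping, as listed in the table above.
CURRIER_TO_EVA = {
    "3": "iiin", "4": "q",   "6": "g",   "8": "d",   "9": "y",
    "A": "a",    "B": "p",   "C": "e",   "E": "l",   "F": "k",
    "I": "i",    "J": "m",   "M": "iin", "N": "in",  "O": "o",
    "P": "t",    "Q": "cth", "R": "r",   "S": "ch",  "T": "ir",
    "U": "iir",  "V": "f",   "W": "cph", "X": "ckh", "Y": "cfh",
    "Z": "sh",
}

def currier_to_eva(s):
    # Expand a Currier-alphabet string into EVA; unknown characters
    # (e.g. the space) pass through unchanged.
    return "".join(CURRIER_TO_EVA.get(c, c) for c in s)

For example, currier_to_eva("4OFAN") gives "qokain".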
I have been unable to replicate Reddy and Knight's results. They wrote:
Quote:Another method is to use a two-state bigram HMM (Knight et al., 2006; Goldsmith and Xanthos, 2009) over letters, and induce two clusters of letters with EM. In alphabetic languages like English, the clusters correspond almost perfectly to vowels and consonants. We find that a curious phenomenon occurs with the VMS – the last character of every word is generated by one of the HMM states, and all other characters by another; i.e., the word grammar is a ∗ b.
They clearly say that one of the two states only generates the very last character of each word. The other state should then have a high-probability self-loop that allows it to generate most of the word, with a low-probability link leading to the state that generates EVA:y, EVA:r, and not much else.
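For anyone who wants to reproduce the experiment, here is a minimal sketch of the setup. It assumes the hmmlearn library (CategoricalHMM in recent versions; older releases call essentially the same model MultinomialHMM); the input file name and the choice to treat the space as an ordinary symbol are my assumptions, not necessarily what Reddy and Knight did:

Code:
import numpy as np
from hmmlearn import hmm

# Load the Currier-alphabet transcription (hypothetical file name)
# and normalize runs of whitespace to single spaces.
text = " ".join(open("currier_B.txt").read().split())

# Encode each character (including the space) as an integer symbol.
alphabet = sorted(set(text))
index = {c: i for i, c in enumerate(alphabet)}
X = np.array([[index[c]] for c in text])

# Two-state HMM over characters, trained with EM (Baum-Welch).
model = hmm.CategoricalHMM(n_components=2, n_iter=200, random_state=0)
model.fit(X)

# Which characters does each state prefer to emit?
states = model.predict(X)
for s in range(2):
    emitted = sorted({c for c, st in zip(text, states) if st == s})
    print("state", s, "emits:", "".join(emitted))
print("transition matrix:")
print(model.transmat_)

Different random initializations converge to different local optima, which presumably explains why separate runs (described below) settle on different models.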
The result I obtain is very different. The two states alternate, as we have seen in other languages. One of the two states generates Currier symbols 9, A, C, O (EVA: y a e o) and the space character. This state has a 42% probability of looping on itself (likely needed to account for C9, EVA:ey, and for the various CC and CCC, EVA:ee and eee, sequences).
[attachment=1729]
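Continuing the sketch above, the self-loop probability and the emission distribution of the "vowel-like" state can be read directly from the fitted model (the state index here is an assumption; in practice, inspect emissionprob_ to see which state emits 9, A, C, O):

Code:
# State 0 is assumed to be the one emitting 9/A/C/O and the space;
# check model.emissionprob_ to identify it in a given run.
vowel_state = 0
print("self-loop probability:", model.transmat_[vowel_state, vowel_state])
for c, p in zip(alphabet, model.emissionprob_[vowel_state]):
    if p > 0.01:
        print(repr(c), round(float(p), 3))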
This example shows how characters generated by the two states tend to alternate.
[attachment=1731]
Other runs produce a lower-quality fit in which the two states tend to loop on themselves (the software prints the quantity it is trying to optimize, i.e. the likelihood of the data, and these runs don't reach as good a value as those described above). This model is different from the examples we have seen. One of the states includes C, O, S, Z (EVA: e o ch sh) and all the gallows and benched gallows. This state typically generates the first half of a word, with the frequent exception of the prefix 4 (EVA:q). The other state generates the rest of the word (the suffixes and the prefix 4) and the space character. Words appear to be split into two separate parts, with the first half (or "core") generated by state 1 (in red in the example below).
[attachment=1732]
[attachment=1733]
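The coloring in the attached examples can be roughly reproduced in a terminal by marking the characters assigned to one state, again continuing the sketch above (ANSI escape codes for red; the 200-character window is arbitrary):

Code:
# Print a slice of the text with state-1 characters in ANSI red,
# mimicking the coloring of the attached examples.
RED, RESET = "\033[31m", "\033[0m"
window = 200  # arbitrary window size
print("".join(RED + c + RESET if st == 1 else c
              for c, st in zip(text[:window], states[:window])))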
Maybe it's useful to repeat that the main reason for these experiments is to understand more of what Reddy and Knight wrote. I wasn't very successful (though maybe there are variants of these tests that would come closer to what they observed). Anyway, there is often something to learn from failure.