03-10-2017, 08:14 AM
While doing some experimentation in this area, I ran into the following complication.
When working with character pair statistics, which are relevant both for entropy calculations and HMM analyses, there are three different ways of treating word spaces.
1) One counts word spaces among the characters to be analysed.
2) One deletes word spaces completely.
3) One treats word spaces as breaks in the string.
Option 2 seems the least interesting, for obvious reasons.
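For anyone who wants to try this themselves, the three options amount to roughly the following. This is just an illustrative Python sketch, with made-up function names, not the exact code I used:

def bigrams_with_spaces(text):
    # Option 1: the word space is just another character.
    return [(text[i], text[i + 1]) for i in range(len(text) - 1)]

def bigrams_no_spaces(text):
    # Option 2: delete the word spaces, then pair up whatever remains.
    s = text.replace(" ", "")
    return [(s[i], s[i + 1]) for i in range(len(s) - 1)]

def bigrams_within_words(text):
    # Option 3: word spaces break the string; only pairs inside a word count.
    pairs = []
    for word in text.split():
        pairs.extend((word[i], word[i + 1]) for i in range(len(word) - 1))
    return pairs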
If one chooses option 3, one is analysing only the internal structure of the words. The number of character pairs is limited to pairs inside words. The problem is that the character frequencies for:
- all occurrences of character X (i.e. any character)
- the occurrences of character X as the first of the pair
- the occurrences of character X as the second of the pair
are no longer the same, as the small example below shows.
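To make the mismatch concrete, the three counts under option 3 can be tallied as follows (again only a sketch):

from collections import Counter

def option3_counts(text):
    # Per-character totals under option 3 (pairs inside words only).
    all_chars = Counter()
    first_of_pair = Counter()
    second_of_pair = Counter()
    for word in text.split():
        all_chars.update(word)
        first_of_pair.update(word[:-1])   # the last letter of a word never starts a pair
        second_of_pair.update(word[1:])   # the first letter of a word never ends a pair
    return all_chars, first_of_pair, second_of_pair

For example, in "ave maria" the letter 'a' occurs 3 times in total, but only twice as the first of a pair and twice as the second of a pair.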
This is of course expected, but what surprised me is that even for a "well-behaved" language like Latin, the differences are significant.
For Voynichese, with its strong position-dependent behaviour, the differences completely overwhelm the results.
The clean way out is to treat spaces as characters throughout all calculations.
By 'circularising' the text, i.e. adding one space at the end and pretending that the very first character of the text follows it, one ensures that there are exactly as many character pairs as single characters, and also that all frequencies of character X:
- as single character
- as first of a pair
- as second of a pair
are the same (see the sketch below).
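In code, the circularisation comes down to something like this (same illustrative style as above):

def circular_bigrams(text):
    # 'Circularise': append one space so the text ends in a word break,
    # then let the last character pair with the first one.  This gives
    # exactly one pair per character, so the single-character counts,
    # the first-of-pair counts and the second-of-pair counts all coincide.
    s = text + " "
    return [(s[i], s[(i + 1) % len(s)]) for i in range(len(s))]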
While this solves the problem, the issue remains that calculations based on different treatments of spaces will yield significantly different results.
PS: the text should first have been 'cleaned up', by replacing all punctuation with word spaces and making sure that there is exactly one space between each pair of words.
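That clean-up step can be done with something like the following (a sketch only; what counts as punctuation of course depends on the transliteration used):

import re

def clean_text(raw):
    # Replace every punctuation mark by a word space, then collapse runs
    # of whitespace into a single space and trim the ends.
    no_punct = re.sub(r"[^\w\s]", " ", raw)
    return re.sub(r"\s+", " ", no_punct).strip()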