HMM automatic vowel detection - Printable Version

RE: HMM automatic vowel detection - MarcoP - 28-09-2017

I have run the Python script on an English text (A Christmas Carol, by Charles Dickens).

Vowels and consonants are still nicely separated in the runs where the optimization performs well. In some runs, they aren't.
The additional difficulty is of course the presence of several consonant digraphs. Also, the results illustrated in the attachments are almost “confused” about 'c', which is generated with 2.2% probability by the consonant state and 1.8% probability by the vowel state.

[Attachment: en_histo.jpg]

For instance, in these cases 'c' appears between two consonants, a position typical of vowels:

watched
kitchens
exclaimed
included
church
scrooge
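
A quick way to collect such cases from a word list is sketched below. It is only an illustration: the vowel set (with 'y' counted as a consonant), the function name and the two extra test words are my own assumptions, not part of the script used above.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant here

def c_between_consonants(word):
    """Return True if some 'c' in `word` has a consonant on both sides."""
    w = word.lower()
    for i in range(1, len(w) - 1):
        if w[i] != "c":
            continue
        left, right = w[i - 1], w[i + 1]
        if (left.isalpha() and right.isalpha()
                and left not in VOWELS and right not in VOWELS):
            return True
    return False

# The six words listed above, plus two counter-examples added for contrast
words = ["watched", "kitchens", "exclaimed", "included",
         "church", "scrooge", "ice", "carol"]
hits = [w for w in words if c_between_consonants(w)]
print(hits)
```

Only the six words from the list are reported; "ice" and "carol" are not, since their 'c' is next to a vowel or at the word edge.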

What this algorithm does is try to predict which character will occur given the class of the preceding character. Setting the model to only two states basically implies that there are only two classes of characters. These tend to map to consonants and vowels because consonants and vowels tend to alternate. The state transition probabilities make this clear: the loops mapping state 0 to itself and state 1 to itself have much smaller probabilities than the moves from 0 to 1 and from 1 to 0. The loops actually represent consonant-consonant and vowel-vowel sequences.
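
For readers who want to experiment, here is a minimal sketch of a two-state discrete HMM trained with Baum-Welch (scaled forward-backward) in NumPy. It is not the script used for the results above; the function name `baum_welch`, the random initialisation, the iteration count and the tiny sample sentence are all assumptions made for illustration.

```python
import numpy as np

def baum_welch(obs, n_states=2, n_iter=30, seed=0):
    """Fit a discrete HMM by Baum-Welch (EM) with per-step scaling."""
    obs = np.asarray(obs)
    T, M = len(obs), int(obs.max()) + 1
    rng = np.random.default_rng(seed)
    # Random row-stochastic starting point (an arbitrary choice)
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, M))
    B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    loglik = []
    for _ in range(n_iter):
        # Forward pass; c[t] are the scaling factors
        alpha = np.zeros((T, n_states))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        # Backward pass, scaled with the same factors
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
        gamma = alpha * beta  # state posteriors, each row sums to 1
        # Expected transition counts, summed over time
        w = B[:, obs[1:]].T * beta[1:] / c[1:, None]
        xi = A * (alpha[:-1].T @ w)
        # Re-estimate the parameters
        pi = gamma[0]
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(M)], axis=1)
        B /= B.sum(axis=1, keepdims=True)
        loglik.append(np.log(c).sum())  # log-likelihood before this update
    return A, B, pi, loglik

text = "marley was dead to begin with there is no doubt whatever about that"
symbols = sorted(set(text))
obs = [symbols.index(ch) for ch in text]
A, B, pi, loglik = baum_welch(obs)
print(np.round(A, 2))  # transition matrix; off-diagonal entries are the 0->1 and 1->0 moves
for state in range(2):
    top = np.argsort(B[state])[::-1][:6]
    print(state, [symbols[k] for k in top])
```

A single sentence is far too short for a reliable split; a whole book is a more realistic input. As noted above, the outcome depends on the starting point, so it is worth trying several values of `seed` and keeping the run with the highest final log-likelihood.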

RE: HMM automatic vowel detection - Davidsch - 28-09-2017

The colored matrix does seem wrong, or strange.
Why is the left block not associated with a letter (see also the image in your first post)? Is that the space?

RE: HMM automatic vowel detection - MarcoP - 28-09-2017

(28-09-2017, 12:07 PM)Davidsch Wrote: The colored matrix does seem wrong, or strange.
Why is the left block not associated with a letter (see also the image in your first post)? Is that the space?

Yes, the first block corresponds to the space character. I labeled it with an underscore in the histogram for English.
For both Latin and English, space is mainly associated with the "vowel" state, because words tend to end with a consonant (and the two states alternate).

The Italian matrix is quite similar, but space is associated with the "consonant" state, because words tend to end with a vowel.
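
The word-ending tendency itself is easy to check directly. The sketch below counts the share of words that end in a vowel; the two sample lines (the opening of A Christmas Carol and the first lines of Dante's Commedia), the ASCII vowel set and the function name are assumptions chosen for illustration, and such short samples only hint at the trend.

```python
VOWELS = set("aeiou")  # assumption: plain ASCII vowels, 'y' counted as a consonant

def final_vowel_share(text):
    """Fraction of words whose last character is a vowel."""
    words = [w for w in text.lower().split() if w[-1].isalpha()]
    ends_in_vowel = sum(w[-1] in VOWELS for w in words)
    return ends_in_vowel / len(words)

english = "marley was dead to begin with there is no doubt whatever about that"
italian = "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"

print(round(final_vowel_share(english), 2))  # 3 of 13 words end in a vowel
print(round(final_vowel_share(italian), 2))  # 9 of 13 words end in a vowel
```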

RE: HMM automatic vowel detection - davidjackson - 28-09-2017

There is a fundamental error in using HMM models to try to detect vowels / consonants in the Voynich.
In layman's terms: garbage in, garbage out.

Since we don't know which is which, we can't train the algorithm and it is never anything better than a wild guess.

In brief, an HMM depends upon an observation curve. It outputs a prediction which can then be compared to reality within controlled parameters. The comptroller then adjusts the original HMM algorithm and sees if the new prediction is better.
Once the HMM is outputting data similar to the observation curve, it's let loose upon the unknown.

If you don't have an observation curve, you have no idea whether the information being returned is of any value.

RE: HMM automatic vowel detection - Emma May Smith - 28-09-2017

But it's not strictly predicting vowels and consonants, rather states. It doesn't have any linguistic understanding other than what we bring.

The transcription is the main problem that could occur, as some ways of transcribing different characters (such as [iii] and [ch]) might give different results. However, you can run the program with the different transcriptions and see how they match and how they don't.
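
One way to prepare such alternative inputs is to re-tokenize each word so that chosen character groups count as a single symbol before the text goes into the model. The following is a rough sketch using greedy longest-match; the function name, the sample word and the list of groups are assumptions for illustration only.

```python
def tokenize(word, units):
    """Greedy longest-match split of `word` into the given multi-character
    units, falling back to single characters where no unit matches."""
    units = sorted((u for u in units if u), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for u in units:
            if word.startswith(u, i):
                tokens.append(u)
                i += len(u)
                break
        else:
            # No unit matched at position i: emit one character
            tokens.append(word[i])
            i += 1
    return tokens

word = "chedaiin"  # illustrative EVA-style word
print(tokenize(word, []))                          # every character on its own
print(tokenize(word, ["ch", "sh", "ii", "iii"]))   # [ch] and [ii] as single symbols
```

Feeding both token sequences to the same model then shows which state assignments are stable across transcriptions and which are not.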

RE: HMM automatic vowel detection - davidjackson - 28-09-2017

Quote:But it's not strictly predicting vowels and consonants, rather states. It doesn't have any linguistic understanding other than what we bring.

You put it much better than I did :D

RE: HMM automatic vowel detection - Emma May Smith - 28-09-2017

Well, sure. But if we run the same model, as Marco is doing, over multiple languages where we know the consonants and vowels, then we have something to compare it against. A lot does depend on the text being linguistic in nature and upon the way in which it is written, of course.
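
For a known language, that comparison can be reduced to a single number. The sketch below scores how well a symbol-to-state assignment agrees with the true vowel/consonant split, taking the better of the two possible labelings, since state 0 and state 1 are arbitrary names. The assignment shown is made up for illustration (with 'c' placed on the vowel side) and is not an actual result from the runs in this thread.

```python
VOWELS = set("aeiou")  # assumption: reference vowel set for the known language

def vowel_agreement(symbol_state):
    """Fraction of letters whose state matches the vowel/consonant split,
    using whichever of the two state labelings agrees better."""
    letters = [s for s in symbol_state if s.isalpha()]
    match = sum((symbol_state[s] == 1) == (s in VOWELS) for s in letters)
    return max(match, len(letters) - match) / len(letters)

# Hypothetical output of a two-state run: symbol -> most likely state
assignment = {"a": 1, "e": 1, "i": 1, "o": 1, "u": 1,
              "c": 1, "t": 0, "n": 0, "s": 0, "r": 0}
print(vowel_agreement(assignment))  # 0.9
```

A score near 1.0 means the states line up with vowels and consonants; a score near 0.5 means the two states are capturing something else.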

RE: HMM automatic vowel detection - davidjackson - 28-09-2017

Quote:then we have something to compare it against

No, we have statistics for languages that we understand. I very much doubt it will bring up any insights into the Voynich.

Imagine we run university entrance exam results against high school education paths for public school students in the USA, France, Russia and Japan. We'll have a set of figures, each individual to the independent educational system.
We then try to match those figures against home-schooled students from the UK. We'll get back figures within the same range - but they will be coincidental and not suitable for making predictions.

RE: HMM automatic vowel detection - Emma May Smith - 28-09-2017

I'm pretty sure that all languages have consonants and vowels. Even Ubykh...

RE: HMM automatic vowel detection - ReneZ - 28-09-2017

The Hidden Markov Model is a general method for analysing sequences (to put it simply).

One of many applications is language expressed as a sequence of characters.
For a language that has reasonably clear vowel-consonant alternation, a two-state HMM will tend to split the characters into a class of vowels and a class of consonants.

It's therefore worth applying a two-state HMM to the Voynich text.
Since the algorithm doesn't really know about vowels and consonants, it is unbiased.
In a way, it isn't going to force the identification of vowels and consonants if there aren't any (and even if the symbols o, a, i "look like" vowels).

The fact that it returns position-dependent states tells us that the Voynich text is "different".

That's the first step. A negative result, but one that at least fits with the observation that all people who have tried to turn the VMS text into Latin by simple substitution (and are still trying) have failed.

The next step can go in many directions:
- try more states
- play around with the text to see if there is something that makes the HMM return vowels and consonants
- both of the above
- see if there is other text that behaves like the VMS