HMM automatic vowel detection - Printable Version

HMM automatic vowel detection - MarcoP - 26-09-2017

After reading [link unavailable], I was curious to understand more about Hidden Markov Models and language analysis.

I found a reference to a 1980 paper by Cave and Neuwirth. Apparently, they experimented with several HMM configurations, relating the symbols most likely to be emitted by each state to specific phonetic properties.

A Python implementation of their experiment is available online: [link unavailable]

Most of the theory escapes me, but I have run some simple experiments with the Python software.

In order to reproduce something vaguely similar to what Reddy and Knight did, I set the number of states (nodes) to 2. I ran tests with Latin, Italian and English, and the algorithm is rather consistent in assigning vowels and consonants to two different states. The parameters of the HMM (the transition probabilities between the two states and the probability of each state emitting each symbol) are initialized randomly, so the optimization phase (Baum-Welch) can produce different results in different runs.
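For readers who want to see what the optimization does, here is a minimal sketch of Baum-Welch training for a discrete HMM. It is not the script used in this post: the function names `encode` and `baum_welch`, the uniform initial state distribution and the default iteration count are all assumptions made for illustration. It uses scaled forward and backward passes so that long texts do not underflow.

```python
import numpy as np

def encode(text):
    """Map each distinct character of the text to an integer id."""
    symbols = sorted(set(text))
    index = {ch: k for k, ch in enumerate(symbols)}
    return symbols, np.array([index[ch] for ch in text])

def baum_welch(obs, n_symbols, n_states=2, n_iter=50, seed=0):
    """Fit transition matrix A and emission matrix B to a symbol sequence."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic start: different seeds may converge differently.
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)  # assumption: uniform start
    T = len(obs)
    for _ in range(n_iter):
        # Forward pass, rescaled at every step (c holds the scale factors).
        alpha = np.zeros((T, n_states))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        # Backward pass, using the same scale factors.
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        # gamma[t, i]: probability of being in state i at position t.
        gamma = alpha * beta
        # xi[i, j]: expected number of i -> j transitions over the text.
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / c[t + 1]
        # Re-estimate the parameters from the expected counts.
        pi = gamma[0]
        A = xi / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((n_states, n_symbols))
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
    return A, B
```

A call such as `symbols, obs = encode(text)` followed by `A, B = baum_welch(obs, len(symbols))` returns the two matrices discussed in the rest of this post. Which state ends up holding the vowels depends on the random seed.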

Here is an example of what I get for Latin (the XVI Century Matthioli herbal in [link unavailable]).

These are the results for what the Python implementation calls "matrix A": the probabilities with which the model passes from the current state (row) to the next state (column):

Code:
     0      1
0  0.203  0.797
1  0.785  0.215

The matrix is almost symmetrical. Both states have a higher probability of passing to the other state than remaining in the current state. The optimization has configured state 0 to emit consonants and state 1 to emit vowels: in Latin consonants and vowels tend to alternate, so the model alternates between the two states.
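The alternation can be quantified directly from matrix A. The sketch below (the helper names `stationary` and `mean_run_length` are made up for illustration) computes the long-run share of time spent in each state and the average number of consecutive symbols emitted before the model switches state.

```python
import numpy as np

# Matrix A as reported above (rows: current state, columns: next state).
A = np.array([[0.203, 0.797],
              [0.785, 0.215]])

def stationary(A):
    """Long-run fraction of positions spent in each state."""
    eigvals, eigvecs = np.linalg.eig(A.T)
    # The stationary distribution is the left eigenvector for eigenvalue 1.
    p = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return p / p.sum()

def mean_run_length(A):
    """Expected length of a run in each state: 1 / (1 - P(stay))."""
    return 1.0 / (1.0 - np.diag(A))

print(stationary(A))
print(mean_run_length(A))
```

With these values both run lengths come out between 1.25 and 1.28 symbols, which is another way of saying that consonant-consonant and vowel-vowel pairs are the exception.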

[Attachment: BMAT.JPG]

The color diagram above is the visual representation of matrix B (the probability of emission of each symbol for each state) as produced by the Python script. The top row corresponds to state 0, the bottom row to state 1. It should be clear that the two are complementary: consonants are only emitted in state 0, while vowels are only emitted in state 1. Space is the only symbol that is likely to be emitted by both states, because Latin words can end in either a consonant or a vowel. Since the model usually switches state after each symbol, a space that follows a final consonant (state 0) tends to fall in state 1, and a space that follows a final vowel tends to fall in state 0. Consonant endings are more frequent, so state 1 is also more likely to emit space. A typical pattern might be:

Code:
State1  State0     State1  State0     State1  State0     State1
SPACE   Consonant  Vowel   Consonant  Vowel   Consonant  SPACE
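A state sequence like the one above can be recovered for any string with the Viterbi algorithm. The sketch below uses matrix A from this post, but the emission matrix is a toy stand-in, not the fitted matrix B shown in the diagram: `toy_emissions`, the letter lists and the weights for space are assumptions chosen only to mimic the pattern described (consonants in state 0, vowels in state 1, space in both but favouring state 1).

```python
import numpy as np

A = np.array([[0.203, 0.797],
              [0.785, 0.215]])  # transition matrix from the post

VOWELS = "aeiou"
CONSONANTS = "bcdfghlmnpqrstvx"
SYMBOLS = " " + VOWELS + CONSONANTS

def toy_emissions(eps=1e-3):
    """Illustrative emission matrix (NOT the fitted one from the post)."""
    B = np.full((2, len(SYMBOLS)), eps)
    for k, ch in enumerate(SYMBOLS):
        if ch in CONSONANTS:
            B[0, k] = 1.0
        elif ch in VOWELS:
            B[1, k] = 1.0
        else:  # space: possible in both states, likelier in state 1
            B[0, k], B[1, k] = 1.0, 3.0
    return B / B.sum(axis=1, keepdims=True)

def viterbi(text, A, B, pi=(0.5, 0.5)):
    """Most likely state sequence for the text, computed in log space."""
    obs = [SYMBOLS.index(ch) for ch in text]
    log_a, log_b = np.log(A), np.log(B)
    delta = np.log(pi) + log_b[:, obs[0]]
    back = []
    for o in obs[1:]:
        # scores[i, j]: best path ending in state i, followed by i -> j.
        scores = delta[:, None] + log_a
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_b[:, o]
    # Trace the best path backwards from the most likely final state.
    path = [int(delta.argmax())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]

print(viterbi(" radix ", A, toy_emissions()))
```

For a word that alternates consonants and vowels and ends in a consonant, such as "radix", the decoded states alternate as in the pattern above, with both surrounding spaces assigned to state 1.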

I drew the black and white diagram by hand, in order to summarize the configuration produced by the optimization algorithm. Symbols are sorted by decreasing emission probability. The 0-0 and 1-1 loops correspond to the generation of consonant-consonant and vowel-vowel digraphs.

RE: HMM automatic vowel detection - Koen G - 26-09-2017

This is like computer magic to me, but I understand the results which is the most important. So if you do this to Voynichese it picks out the vowels as well? It would be interesting to play around with this and various different transcriptions. For example, what would happen if we assign a new glyph to "cc"?

RE: HMM automatic vowel detection - MarcoP - 26-09-2017

(26-09-2017, 09:45 AM)Koen Gh. Wrote: This is like computer magic to me, but I understand the results which is the most important. So if you do this to Voynichese it picks out the vowels as well? It would be interesting to play around with this and various different transcriptions. For example, what would happen if we assign a new glyph to "cc"?

Thank you for your comments, Koen!
This stuff is mostly a black box to me too. Before applying the method to Voynichese, I would like to understand it better, or at least see whether it can be used in a way that detects vowels more consistently across different languages. Latin is certainly easier than English in this respect, but the papers I mentioned suggest that the approach should work for English too.
Of course, we don't know which Voynichese symbols represent vowels so, whatever the output of the method, we can only hope that there is a correlation with vowels. The results of Reddy and Knight were not promising: they say the two states split the symbols into terminal and non-terminal. I must say that, at this stage, I would consider reproducing their results a success :)

As I wrote, I am not at all sure of how similar this experiment is to what they did.

You are certainly right about the problem of transcription. This was also discussed by Stephen [link unavailable]. I also agree that it will be interesting to experiment with different hypothetical digraphs.

RE: HMM automatic vowel detection - Koen G - 26-09-2017

Thanks, do keep us informed on your progress.

The thing about terminal vs. non-terminal is rather discouraging indeed. As I understand it, the system will look for sets of glyphs that behave similarly. This is the problem with Voynichese: its behaviour is so well structured that sets of glyphs can be discerned with relative ease. I guess this is similar to the low entropy problem.

RE: HMM automatic vowel detection - Davidsch - 26-09-2017

@Marco,
As far as I know, no better vowel detection tool has appeared since these methods were first examined, which resulted in the Sukhotin method. Of course I did my own analysis of it, and it's a long but unfruitful story.
If you're interested, see for example: [link unavailable]
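For comparison with the HMM approach, here is a minimal sketch of Sukhotin's algorithm as it is usually described: every letter starts as a consonant, and the letter with the highest count of adjacencies to other letters is repeatedly promoted to vowel until no positive count remains. The function name and the word-list input are assumptions for illustration, and this is not the analysis linked above.

```python
from collections import defaultdict

def sukhotin(words):
    """Return the letters classified as vowels, in order of selection."""
    letters = sorted({ch for w in words for ch in w})
    # Symmetric adjacency counts; a letter next to itself is ignored.
    adj = defaultdict(int)
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:
                adj[a, b] += 1
                adj[b, a] += 1
    sums = {c: sum(adj[c, d] for d in letters) for c in letters}
    consonants = set(letters)
    vowels = []
    while consonants:
        # Candidate: the remaining consonant with the largest row sum.
        best = max(sorted(consonants), key=lambda c: sums[c])
        if sums[best] <= 0:
            break
        vowels.append(best)
        consonants.remove(best)
        # Adjacency to a vowel now counts against being a vowel.
        for c in consonants:
            sums[c] -= 2 * adj[c, best]
    return vowels

print(sukhotin(["mama", "papa"]))
```

On very short inputs the method can easily pick the wrong letters, so it needs a reasonably long text to be meaningful.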

Prof. Kevin Knight is very interested in the Voynich; in fact, he even uses it as material in his classes, see [link unavailable]

But at the level of computational linguistics at which he operates, text is treated as a vector. The principles you need to understand are rather more complex and will require extensive study.

More papers & info on his research: [link unavailable]

RE: HMM automatic vowel detection - ReneZ - 26-09-2017

Marco, it is great that you have been able to get a working version of this algorithm on such a short time scale, and make it produce realistic results for known plain texts.

The interpretation of results is the most difficult part of all statistical analyses.

Whenever I read comments on the entropy analyses of the Voynich MS text (and that is not only in fora), I see that there is a great risk of misinterpretation.

These statistics work on a string of symbols. They provide information about this string of symbols, not about the language.
Of course, there is some relationship, since the string of symbols is one representation of a piece of text in a language.
For the Voynich MS we can't yet be sure that the string of symbols is actually a language.
Also, there are many different possible ways to represent the string of 'glyphs' in the MS by a string of symbols.

The analyses are there to provide comparisons.
Until now, they have not led to a clear answer about the nature of the Voynich MS text, but they have been able to show in many ways what cannot work, and which assumptions are necessarily incorrect.

While that may be disappointing, and the lack of a positive identification has occasionally been used to argue that statistics are useless, all this is still highly informative.

It is only the synthesis of all information that can lead to the answer how the text in the MS could have been composed. Many of these analyses are still to be done.

To come back to the Reddy/Knight 'result', this is just one output, and they draw no great conclusions from it (which is the correct thing to do).

RE: HMM automatic vowel detection - Emma May Smith - 26-09-2017

Very interesting Marco. I look forward to you testing it out on the Voynich text.

RE: HMM automatic vowel detection - Davidsch - 27-09-2017

@Marco, a question, does the software use the space as delimiter, or does it also have the option to analyze the text without?
If such info is not given: you could quickly parse a text twice, once without spaces and once with and compare results.

RE: HMM automatic vowel detection - MarcoP - 27-09-2017

(27-09-2017, 12:43 PM)Davidsch Wrote: @Marco, a question, does the software use the space as delimiter, or does it also have the option to analyze the text without?
If such info is not given: you could quickly parse a text twice, once without spaces and once with and compare results.

Hi David,
good question. I don't think Reddy and Knight had space as one of the characters, but this Python implementation does consider it as such (see the color matrix in the first post).

Of course it's possible to remove all spaces and train the model with a single string of characters, but this is not realistic (it implies that the first character of a word can phonetically follow the last character of the previous word, which is not necessarily true). After running a few experiments, I believe that presenting one word per line (with newlines but no spaces) is the most sensible approach. [Edit: after further checking the code, I now think that newlines are simply ignored. So there really are only two options: include space as a distinct character or concatenate all words with no spaces]
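The two options can be prepared with a small preprocessing step before training. This is only a sketch: `to_symbols` is a hypothetical helper, not part of the script discussed here, and unlike that script it treats newlines and punctuation as word breaks.

```python
import re

def to_symbols(text, keep_spaces=True):
    """Lowercase the text and keep only letters, optionally with spaces."""
    # Runs of letters; digits, punctuation and line breaks separate words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    separator = " " if keep_spaces else ""
    return separator.join(words)

sample = "Radix  longa,\nfolia lata."
print(to_symbols(sample, keep_spaces=True))
print(to_symbols(sample, keep_spaces=False))
```

Training once on each output and comparing the resulting matrices is the comparison suggested in the quoted question.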

The results for Matthioli are very similar (but for the absence of the space character in the B matrix).

RE: HMM automatic vowel detection - -JKP- - 27-09-2017

If someone were creating ciphered text in which the vowels were hidden, then adding vowels as nulls, by hand, would create text that somewhat approximates the vowel-consonant balance of normal text.

Example:

The quick brown fox jumps over the lazy dog.

Simple 1-to-1 substitution that obscures the vowels:

ñs9 µqht∫ πckbr wkd ∂qnjf kv9c ñs9 8mzx lkg

Add fake vowels (doesn't even have to be a full set of vowels):

ñoßo9 µqhato∫ πackbor owcckd ∂aqnojfo kov9ca oñoß9 8amozax lokga

If you want to obscure it even more, without making it too hard to read back, shuffle the spaces slightly:

ñoso9 µqhato ∫πackbo rowcck d∂aq nojfo kov9ca oñas9 8 amozax lokga

Read it out loud. It almost makes sense in some old-language kind of a way.

Notice that the word "the" can be ñoso9, oñas9 and many other variations, which means if you try to follow a "word" through the whole document, it will never make any sense because other words will coincidentally look the same and the same word will often look different. The usual sentence patterns will not emerge.

If you want the letters to look more similar to one another, then use slight variations on a made-up shape (like a glyph that alternately has one loop or two).
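The first two steps of the scheme in this post (substitution, then null vowels) can be sketched in a few lines. The substitution table below is read off the pangram example above; the function names, the null-insertion probability and the choice to insert nulls only before a letter are assumptions, and the space-shuffling step is left out.

```python
import random

# Substitution read off the example: "the quick brown fox ..." -> "ñs9 µqht∫ ..."
PLAIN = "thequickbrownfxjmpsvlazydg"
CIPHER = "ñs9µqht∫πckbrwd∂njfv8mzxlg"
TABLE = dict(zip(PLAIN, CIPHER))

def encipher(plaintext, nulls="ao", p_null=0.4, seed=None):
    """Substitute each letter and randomly insert null vowels before it."""
    rng = random.Random(seed)
    out = []
    for ch in plaintext.lower():
        if ch == " ":
            out.append(" ")
        elif ch in TABLE:
            if rng.random() < p_null:
                out.append(rng.choice(nulls))  # fake vowel, carries no meaning
            out.append(TABLE[ch])
        # Anything else (punctuation, digits) is dropped.
    return "".join(out)

def strip_nulls(ciphertext, nulls="ao"):
    """Reader's side: discard the nulls to recover the plain substitution."""
    return "".join(ch for ch in ciphertext if ch not in nulls)

print(encipher("The quick brown fox jumps over the lazy dog.", seed=1))
```

This only works because the null letters never appear in the cipher alphabet, so the reader can drop them unambiguously. Since the nulls are placed at random, the same plaintext word comes out differently each time, which is the effect described above.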