After reading [link], I was curious to understand more about Hidden Markov Models and language analysis.
I found a reference to a 1980 paper by Cave and Neuwirth. Apparently, they experimented with several HMM configurations, mapping the symbols most likely to be emitted by each state to specific phonetic properties.
A Python implementation of their experiment is available online: [link].
Most of the theory escapes me, but I have run some simple experiments with the Python software.
In order to reproduce something vaguely similar to what Reddy and Knight did, I set the number of states to 2. I ran tests with Latin, Italian and English, and the algorithm is rather consistent in assigning vowels and consonants to two different states. Since the initial parameters of the HMM (the transition probabilities between the two states and the probability of each state generating each symbol) are randomly set, the optimization phase (Baum–Welch) can produce different results in different runs.
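For anyone who wants to try something similar, here is a minimal sketch of the same setup using the hmmlearn library; this is an assumption of mine, not the implementation linked above, and the training text is only a placeholder. It fits a two-state discrete HMM with Baum–Welch and prints the two matrices discussed below.
Code:
import numpy as np
from hmmlearn import hmm

# Placeholder training text: in a real run this would be a whole book,
# e.g. the Latin herbal used below
text = "in principio erat verbum et verbum erat apud deum"
alphabet = sorted(set(text))
sym2idx = {c: i for i, c in enumerate(alphabet)}

# hmmlearn expects a column vector of integer-coded symbols
X = np.array([[sym2idx[c]] for c in text])

# Two hidden states; Baum-Welch starts from a random initialization,
# so different seeds may converge to different local optima
model = hmm.CategoricalHMM(n_components=2, n_iter=200, random_state=0)
model.fit(X)

print("Matrix A (transition probabilities):")
print(np.round(model.transmat_, 3))
print("Matrix B (emission probabilities, one row per state):")
print(np.round(model.emissionprob_, 3))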
Here is an example of what I get for Latin (the XVI century Matthioli herbal, [link]).
These are the results for what the Python implementation calls “matrix A”: the probabilities of moving from the current state to the next state:
Code:
       0      1
0  0.203  0.797
1  0.785  0.215
The matrix is almost symmetrical. Both states have a higher probability of passing to the other state than remaining in the current state. The optimization has configured state 0 to emit consonants and state 1 to emit vowels: in Latin consonants and vowels tend to alternate, so the model alternates between the two states.
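Continuing the hmmlearn sketch above (still an illustration, not the original script), sampling from the fitted model makes this alternation visible: with a matrix A like the one here, runs of the same state are rare.
Code:
# Draw 20 symbols from the fitted model and show which state
# emitted each one; the state sequence should mostly alternate
symbols, states = model.sample(20)
print("states :", states)
print("symbols:", "".join(alphabet[i] for i in symbols.ravel()))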
The color diagram above is the visual representation of matrix B (the probability of emission of each symbol for each state) as produced by the Python script. The top row corresponds to state 0, the bottom row to state 1. It should be clear that the two are complementary: consonants are only emitted in state 0, while vowels are only emitted in state 1. It is also interesting to observe that space is the only symbol likely to be emitted by both states. This is because Latin words end both with consonants and with vowels: the space after a word-final consonant (state 0) is emitted by state 1, while the space after a word-final vowel (state 1) is emitted by state 0. Since consonant endings are more frequent, the probability of state 1 emitting space is higher. A typical pattern might be:
Code:
State1  State0     State1  State0     State1  State0     State1
SPACE   Consonant  Vowel   Consonant  Vowel   Consonant  SPACE
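In the sketch above, the same split can be read off matrix B column by column: for each symbol, the state with the higher emission probability is its class, and space should be the one symbol with appreciable probability in both rows (again an illustration, not the actual output of the script).
Code:
# Classify each symbol by the state most likely to emit it
for i, c in enumerate(alphabet):
    probs = model.emissionprob_[:, i]
    name = "SPACE" if c == " " else c
    print(name, np.round(probs, 3), "-> state", probs.argmax())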
I drew the black and white diagram by hand, in order to summarize the configuration produced by the optimization algorithm. Symbols are sorted by decreasing emission probability. The 0-0 and 1-1 loops correspond to the generation of consonant-consonant and vowel-vowel digraphs.
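That ordering can be reproduced from the sketch above by sorting each state's symbols by decreasing emission probability (once more a hedged illustration built on the hmmlearn model, not the data behind the hand-drawn figure):
Code:
# For each state, list symbols from most to least probable
for s in range(model.n_components):
    order = np.argsort(model.emissionprob_[s])[::-1]
    ranked = ", ".join(
        "%s %.3f" % ("SPACE" if alphabet[i] == " " else alphabet[i],
                     model.emissionprob_[s, i])
        for i in order)
    print("state", s, ":", ranked)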