ReneZ > 03-10-2017, 08:14 AM
MarcoP > 03-10-2017, 07:39 PM
(03-10-2017, 08:14 AM)ReneZ Wrote: While doing some experimentation in this area, I ran into the following complication.
When working with character pair statistics, which are relevant both for entropy calculations and HMM analyses, there are three different ways of treating word spaces.
1) One counts word spaces among the characters to be analysed.
2) One deletes word spaces completely.
3) One treats word spaces as breaks in the string.
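To make the difference between the three options concrete, here is a minimal Python sketch that counts adjacent character pairs under each treatment. The function name `pair_counts`, the mode labels and the toy string are made up for illustration; they do not come from any of the tools discussed in this thread.

```python
from collections import Counter

def pair_counts(text, mode):
    """Count adjacent character pairs under one of three word-space treatments."""
    if mode == "keep":        # 1) the space is counted as an ordinary character
        segments = [text]
    elif mode == "delete":    # 2) spaces are removed, so pairs span word boundaries
        segments = [text.replace(" ", "")]
    elif mode == "break":     # 3) each word is its own string; no pair crosses a space
        segments = text.split()
    else:
        raise ValueError(f"unknown mode: {mode}")
    counts = Counter()
    for seg in segments:
        counts.update(zip(seg, seg[1:]))  # consecutive characters within one segment
    return counts

sample = "ab ba"  # hypothetical toy input
for mode in ("keep", "delete", "break"):
    print(mode, dict(pair_counts(sample, mode)))
```

On the toy string, option 1 produces pairs involving the space, option 2 produces a "bb" pair that never occurs inside a word, and option 3 produces neither, so the three pair tables (and any entropy or HMM estimate built on them) differ.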
ChenZheChina > 28-11-2018, 12:04 PM
MarcoP > 28-11-2018, 06:07 PM
(28-11-2018, 12:04 PM)ChenZheChina Wrote: Hi Marco.
I tried this tool, but got only the first image (B Matrix).
How did you get the second image (two sets, pointing to each other, with letters on the left)?
I see you drew them by hand, but how do you determine which letters belong to which set, and their probabilities?
ChenZheChina > 29-11-2018, 04:24 AM
(28-11-2018, 06:07 PM)MarcoP Wrote: I also had to update the script to assign each letter to one of the states. I lost this script in a computer crash about one year ago, so I don't remember the details, but of course this was based on the B matrix. The simplest thing would be to assign each letter to whichever of the two states gives that letter the highest weight.
======== rdict ========
{0: ' ', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
======== newA ========
[[0.209 0.791]
[0.647 0.353]]
======== newB ========
[[0. 0. 0.02 0.057 0.072 0.014 0.036 0.04 0. 0. 0. 0.008 0.109 0.043 0.172 0. 0.023 0. 0.103 0.13 0.125 0. 0.017 0.023 0.005 0. 0.001]
[0.287 0.152 0.001 0. 0. 0.163 0.003 0.036 0.071 0.116 0. 0. 0. 0. 0. 0.105 0.003 0. 0. 0. 0.004 0.043 0. 0. 0. 0.018 0. ]]
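import numpy as np

# chars is the ordered alphabet used for training (' ' followed by 'a'..'z', as rdict above shows)
cdict = dict([(x, i) for i, x in enumerate(chars)])  # character -> column index
rdict = dict([(i, x) for i, x in enumerate(chars)])  # column index -> character
# For each character (column of newB), pick the state (row) with the highest weight
highest_indices = np.argmax(newB, axis=0)
sets = dict([(i, []) for i, _ in enumerate(newB)])
for i, s in enumerate(highest_indices):
    sets[s].append(rdict[i])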
======== sets ========
{0: ['b', 'c', 'd', 'f', 'g', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'], 1: [' ', 'a', 'e', 'h', 'i', 'o', 'u', 'y']}
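For the "and their probabilities" part of the question, one possible sketch is to list each state's letters together with their weights, sorted from highest to lowest. The matrix below is copied from the rounded newB output above; the variable names `state_of` and `unseen` are made up for this example. Note that `np.argmax` returns the first index on a tie, so a letter with zero weight in both states lands in state 0 by default rather than because the model put it there.

```python
import numpy as np

chars = " abcdefghijklmnopqrstuvwxyz"  # same order as rdict above
# Rounded values copied from the newB printout (rows = states, columns = characters)
newB = np.array([
    [0., 0., 0.02, 0.057, 0.072, 0.014, 0.036, 0.04, 0., 0., 0., 0.008, 0.109,
     0.043, 0.172, 0., 0.023, 0., 0.103, 0.13, 0.125, 0., 0.017, 0.023, 0.005,
     0., 0.001],
    [0.287, 0.152, 0.001, 0., 0., 0.163, 0.003, 0.036, 0.071, 0.116, 0., 0., 0.,
     0., 0., 0.105, 0.003, 0., 0., 0., 0.004, 0.043, 0., 0., 0., 0.018, 0.],
])

state_of = np.argmax(newB, axis=0)  # winning state per character; ties go to state 0
# Characters whose (rounded) weight is zero in every state
unseen = [c for c, col in zip(chars, newB.T) if not col.any()]

for state in range(newB.shape[0]):
    members = [(chars[i], newB[state, i]) for i in np.flatnonzero(state_of == state)]
    members.sort(key=lambda cw: -cw[1])  # highest weight first
    print(state, " ".join(f"{c!r}:{w:.3f}" for c, w in members))
print("no weight in either state:", unseen)
```

With these rounded values, 'j' and 'q' have weight 0 in both states, so their place in set 0 is only a tie-breaking artifact. 'g' is also close to a tie (0.04 against 0.036), so its assignment is weaker than the set listing suggests.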
MarcoP > 30-11-2018, 11:38 AM
(29-11-2018, 04:24 AM)ChenZheChina Wrote: Finally, we print sets, and get:
Code:
======== sets ========
{0: ['b', 'c', 'd', 'f', 'g', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'], 1: [' ', 'a', 'e', 'h', 'i', 'o', 'u', 'y']}
If I am not mistaken, this is how you get the “sets”, right?