I have run two similar clustering algorithms, PAM (Partitioning Around Medoids) and EM (Expectation Maximization), on 24-dimensional vectors: for each character, I considered the percentage of words in which that character appears immediately before or after each of the 12 most common characters (12 "before" values plus 12 "after" values).
The 12 characters used are:
a C d E e k l o r S t y
(in EVA: a ch d ee e k l o r sh t y)
where E stands for EVA:ee (the other uppercase characters are as in the previous post).
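As an illustration, the 24-dimensional vectors described above could be computed roughly as follows. This is only a sketch, not the code actually used for this post: it assumes that the words have already been transliterated so that each symbol (including the uppercase ones) is a single character, and that the percentages are taken over the words that contain the character in question. The function names top_characters and context_vector are made up for the example.

```python
from collections import Counter


def top_characters(words, n=12):
    """Return the n most frequent characters in the word list."""
    counts = Counter(ch for word in words for ch in word)
    return [ch for ch, _ in counts.most_common(n)]


def context_vector(ch, words, refs):
    """Build the 2 * len(refs) vector for character `ch`.

    First half: percentage of words in which `ch` appears immediately
    before each reference character. Second half: percentage of words in
    which it appears immediately after each reference character.
    Assumption: the denominator is the number of words containing `ch`.
    """
    containing = [word for word in words if ch in word]
    if not containing:
        return [0.0] * (2 * len(refs))
    before = [sum((ch + ref) in word for word in containing) for ref in refs]
    after = [sum((ref + ch) in word for word in containing) for ref in refs]
    return [100.0 * count / len(containing) for count in before + after]
```

With 12 reference characters this gives one 24-dimensional vector per character, which is the kind of input that PAM or EM can then cluster.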
I have rearranged the resulting clusters so that they can be compared. The percentage is meant to provide a “weight” for the relevance of each character: it is the percentage of unique words that contain that character.
PAM             | EM
# Cluster 0     | # Cluster 6
l               | l  30.90%
r               | r  21.46%
# Cluster 1     | # Cluster 2
E               | E  17.35%
e               | e  34.70%
h               | # Cluster 8
                | h   1.95%
# Cluster 8     | # Cluster 1
f               | f   3.83%
p               | p   9.60%
# Cluster 3     | t  19.42%
k               |
t               |
# Cluster 7     | # Cluster 4
C               | C  36.58%
S               | F   0.73%
# Cluster 5     | k  27.74%
F               | K   3.31%
K               | P   1.47%
P               | S  15.79%
T               | T   2.85%
s               | s  13.17%
# Cluster 6     | # Cluster 5
a               | c   1.31%
c               | o  64.04%
o               | # Cluster 0
y               | a  36.77%
                | y  44.97%
# Cluster 9     | # Cluster 9
M               | M   7.66%
N               | N   1.92%
i               | i   8.44%
m               | m   3.56%
n               | n   1.00%
# Cluster 2     | # Cluster 7
q               | q  12.00%
# Cluster 4     | # Cluster 3
d               | d  35.20%
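The “weight” column (the percentage of unique words that contain a character) can be reproduced along these lines. Again this is a sketch under the same assumption as before, namely that every symbol is a single character in the transliteration; the name character_weights is hypothetical.

```python
def character_weights(words):
    """Map each character to the percentage of unique words containing it."""
    unique_words = set(words)  # count each word type once
    if not unique_words:
        return {}
    chars = sorted({ch for word in unique_words for ch in word})
    return {
        ch: 100.0 * sum(ch in word for word in unique_words) / len(unique_words)
        for ch in chars
    }
```

Since a word can contain several different characters, the weights are not expected to add up to 100%.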
In summary:
- EVA:l and r are assigned to the same cluster by both algorithms. I found this result interesting, since in other cases the characters that are grouped together are graphically similar (while l and r aren't);
- E and e are clustered together: it seems that e behaves similarly when it is repeated and when it isn't. According to the table, PAM also clusters EVA:h together with E and e, while EM puts it in a cluster of its own; in any case this character is basically irrelevant, since most occurrences are part of "bench" sequences;
- the simple “benchless” gallows cluster together, with some ambiguity about “k”, which the EM algorithm classifies together with the “benched” gallows;
- all “bench” sequences are grouped together in EM, with the addition of EVA:s and possibly k (see above); the results of PAM are similar, but C (ch) and S (sh) are assigned to a separate cluster;
- EVA: a c o y are grouped together. EM identifies two subgroups: a y and o c; the number of occurrences of c is very limited, since it is usually part of one of the sequences I represented with uppercase characters (ch and the "benched" gallows);
- the characters that usually appear at the end of words are grouped together: M N i m n. While M and N are made up of sequences of i and n, EVA:m is graphically a distinct character;
- q and d are different from each other and from every other character.
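One way to put a number on how much the two algorithms agree is to count the pairs of characters that end up in the same cluster under both. The sketch below does this for the cluster memberships transcribed from the Voynich table above; the pair-counting approach and the Jaccard index are my choice for the illustration, not something used to produce the results in this post.

```python
from itertools import combinations

# Cluster memberships transcribed from the table above
# (one string per cluster, one symbol per character).
pam = ["lr", "Eeh", "fp", "kt", "CS", "FKPTs", "acoy", "MNimn", "q", "d"]
em = ["lr", "Ee", "h", "fpt", "CFkKPSTs", "co", "ay", "MNimn", "q", "d"]


def co_clustered_pairs(clusters):
    """Return the set of unordered character pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs


pam_pairs = co_clustered_pairs(pam)
em_pairs = co_clustered_pairs(em)
both = pam_pairs & em_pairs

# Jaccard index: pairs joined by both algorithms / pairs joined by either.
jaccard = len(both) / len(pam_pairs | em_pairs)

print(len(pam_pairs), len(em_pairs), len(both), jaccard)
```

Pairs such as l/r fall in the intersection, while k/t (joined only by PAM) does not.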
For comparison, here are the results for the King James Bible. The 12 most common characters are:
a c d e h i l n o r s t
PAM             | EM
# Cluster 3     | # Cluster 0
b               | b  11.87%
c               | c  21.31%
f               | f  12.04%
j               | j   1.01%
l               | p  17.41%
m               | # Cluster 3
p               | l  28.48%
w               | m  14.45%
# Cluster 4     | r  46.20%
v               | v   9.06%
                | w  10.85%
# Cluster 1     | # Cluster 4
d               | d  30.87%
q               | q   0.92%
r               | s  47.14%
s               | t  45.29%
t               |
# Cluster 0     | # Cluster 5
a               | a  40.90%
e               | e  70.88%
i               | i  41.46%
o               | o  34.70%
u               | u  19.74%
# Cluster 9     | # Cluster 6
g               | g  18.67%
# Cluster 6     | # Cluster 7
n               | n  40.04%
# Cluster 7     | # Cluster 8
h               | h  25.44%
# Cluster 8     | # Cluster 9
k               | k   6.48%
# Cluster 5     | # Cluster 1
y               | y   9.01%
# Cluster 2     | # Cluster 2
x               | x   1.35%
Notes:
- two groups with different composition (made up of two distinct clusters for each algorithm) contain most consonants: b c f j l m p v w. The position of 'r' is unclear, since PAM assigns it to the next group
- a third consonant group is made up of d, q, s, t and possibly r (see above)
- the vowels a e i o u are recognized as distinct by both algorithms
- a number of characters form singleton clusters: g n h k y x
The King James Bible results are rather opaque to me. It seems significant that the vowels are grouped together, but I have no idea what the other letters have in common.