The Voynich Ninja

Full Version: Character Classes
Marco, you could, I'm sure, but for what it's worth, I know it will not provide new information.
Sean Palmer produced a study along similar lines a few years ago, showing glyph affinity in general.
(06-10-2016, 07:49 PM)davidjackson Wrote: Sean Palmer produced a study along similar lines a few years ago, showing glyph affinity in general.

Thank you, David! The colored matrix is very interesting.
As the author wrote:

Quote: "Comes before" and "comes after" are measured in a binary way, meaning that distance and number of occurrences are ignored. So for example "ab" ranks as "b-after-a" just as much as "acccbbb".

So this is all about the position of characters inside words: prefix characters produce blue horizontal lines; suffix characters produce red horizontal lines.
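
One possible reading of the quoted description, as a minimal Python sketch: each word contributes at most once to a given ordered pair, however far apart the two characters are and however often they occur. The function names after_pairs and count_after are made up for illustration, and this is not Sean Palmer's actual code.

```python
from collections import Counter

def after_pairs(word):
    """Return the set of (a, b) pairs such that some b occurs after some a in word."""
    pairs = set()
    seen = set()
    for ch in word:
        # ch comes after every character already seen in this word
        for prev in seen:
            pairs.add((prev, ch))
        seen.add(ch)
    return pairs

def count_after(words):
    """Count, over all words, how many words exhibit each 'b-after-a' pair."""
    counts = Counter()
    for w in words:
        counts.update(after_pairs(w))  # a set, so each pair counts once per word
    return counts

# "ab" and "acccbbb" each add exactly 1 to the pair ("a", "b")
counts = count_after(["ab", "acccbbb"])
```

With this reading, a character that mostly starts words will have many "after" partners and few "before" partners, which would be consistent with the horizontal bands described above.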

This analysis seems to me to be mostly relevant as a morphological investigation, while I think that Emma's character classes are most likely due to phonetics. But of course there could also be a significant overlap between morphology and phonetics: I guess that morphology is constrained by what is phonetically possible.
Some interesting things posted in this thread. Thanks especially to Marco. I will think about them and maybe comment.
I have run two similar clustering algorithms, PAM (Partitioning Around Medoids) and EM (Expectation Maximization), using 24-dimensional vectors: for each character, I considered the percentage of words in which that character appears immediately before or after each of the 12 most common characters.
The 12 characters used are:
a  C  d  E  e  k  l  o  r  S  t  y
a  ch d  ee e  k  l  o  r  sh t  y
where E stands for EVA:ee
(the other uppercase characters are as in the previous post).
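
A rough sketch of how such vectors could be built in Python. Several things here are assumptions and not taken from the post: the text is already recoded so that each unit (C for ch, E for ee, and so on) is a single symbol, the percentages are taken over the distinct words that contain the character, and the function name neighbour_features is invented.

```python
def neighbour_features(words, refs):
    """Map each character to 2 * len(refs) percentages.

    First half: share of words containing the character in which it stands
    immediately before each reference character.
    Second half: the same for immediately after.
    """
    vocab = set(words)        # assumption: count distinct words, not tokens
    containing = {}           # char -> number of words containing it
    before = {}               # (char, ref) -> words where char directly precedes ref
    after = {}                # (char, ref) -> words where char directly follows ref
    for w in vocab:
        for c in set(w):
            containing[c] = containing.get(c, 0) + 1
        for x, y in set(zip(w, w[1:])):   # each adjacent pair once per word
            if y in refs:
                before[(x, y)] = before.get((x, y), 0) + 1
            if x in refs:
                after[(y, x)] = after.get((y, x), 0) + 1
    feats = {}
    for c, n in containing.items():
        feats[c] = ([100.0 * before.get((c, r), 0) / n for r in refs] +
                    [100.0 * after.get((c, r), 0) / n for r in refs])
    return feats

REFS = "aCdEeklorSty"   # the 12 reference characters listed above

# Toy input only, to show the shape of the result
toy = neighbour_features(["ab", "ab", "ba", "ac"], "ab")
```

With the 12 reference characters this gives 12 "before" values followed by 12 "after" values per character, i.e. 24 dimensions.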

I have rearranged the resulting clusters so that they can be compared. The percentage number is meant to provide a “weight” for the relevance of the character: it's the percentage of unique words that contain that character.
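
The weight described here could be computed along these lines (a sketch; char_weights is a made-up name, and the input is assumed to be a list of words already recoded to one symbol per unit):

```python
def char_weights(words):
    """Percentage of unique words that contain each character."""
    vocab = set(words)                      # unique words only
    chars = set("".join(vocab))
    return {c: 100.0 * sum(c in w for w in vocab) / len(vocab) for c in chars}

# Toy input: "ab" is repeated but counted once
weights = char_weights(["ab", "ab", "bc", "d"])
```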

PAM                        EM

# Cluster 0                # Cluster 6
l                          l        30.90%
r                          r        21.46%

# Cluster 1                # Cluster 2
E                          E        17.35%
e                          e        34.70%
h                          # Cluster 8
                           h         1.95%

# Cluster 8                # Cluster 1        
f                          f         3.83%
p                          p         9.60%
# Cluster 3                t        19.42%
k                        
t                        
                        
# Cluster 7                # Cluster 4        
C                          C        36.58%
S                          F         0.73%
# Cluster 5                k        27.74%
F                          K         3.31%
K                          P         1.47%
P                          S        15.79%
T                          T         2.85%
s                          s        13.17%
                        
# Cluster 6                # Cluster 5        
a                          c         1.31%
c                          o        64.04%
o                          # Cluster 0        
y                          a        36.77%
                           y        44.97%
                        
# Cluster 9                # Cluster 9        
M                          M         7.66%
N                          N         1.92%
i                          i         8.44%
m                          m         3.56%
n                          n         1.00%
                        
# Cluster 2                # Cluster 7        
q                          q        12.00%
                        
# Cluster 4                # Cluster 3        
d                          d        35.20%
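
For readers unfamiliar with PAM, here is a minimal Python sketch of the swap idea behind it. It is not the implementation used for the table above. Assumptions: Manhattan distance, a naive start from the first k points instead of PAM's usual BUILD step, and accepting the first swap that lowers the total cost.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pam(points, k, dist=manhattan):
    """Swap-based k-medoids; returns (medoid indices, medoid index per point)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    d = [[dist(p, q) for q in points] for p in points]   # pairwise distances

    def cost(medoids):
        # total distance of every point to its nearest medoid
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    medoids = list(range(k))          # naive initialisation (assumption)
    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # try replacing one medoid with a non-medoid point
                trial = medoids[:slot] + [cand] + medoids[slot + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    labels = [min(medoids, key=lambda m: d[i][m]) for i in range(n)]
    return medoids, labels

# Toy data: two well separated groups of three points
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(pts, 2)
```

Because every medoid is an actual data point, a PAM cluster can always be summarised by one representative character, which is handy when the "points" are glyphs.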

In summary:
  • EVA:l and r are assigned to the same cluster by both algorithms. I found this result interesting, since in other cases the characters that are grouped together are graphically similar (while l and r aren't);
  • E and e are clustered together: it seems that e behaves similarly when it is repeated and when it isn't; PAM also clusters h together with E and e, while EM puts it in a cluster of its own, but this character is basically irrelevant, since most of its occurrences are part of "bench" sequences;
  • the simple "benchless" gallows cluster together, with some ambiguity about k, which the EM algorithm classifies together with the "benched" gallows;
  • all "bench" sequences are grouped together in EM, with the addition of EVA:s and possibly k (see above); the results of PAM are similar, but C (ch) and S (sh) are assigned to a separate cluster;
  • EVA:a c o y are grouped together. EM identifies two subgroups: a y and o c; the number of occurrences of c is very limited, since it is usually part of one of the sequences I represented with uppercase characters (ch and the "benched" gallows);
  • a group of characters that usually appear at the end of words are grouped together: M N i m n. While M and N are made up of sequences of i and n, EVA:m is graphically a distinct character;
  • q and d are different from each other and from every other character.

For comparison, here are the results for the King James Bible. The 12 most common characters are:
a c d e h i l n o r s t

PAM                        EM        
# Cluster 3                # Cluster 0        
b                          b        11.87%
c                          c        21.31%
f                          f        12.04%
j                          j         1.01%
l                          p        17.41%
m                          # Cluster 3        
p                          l        28.48%
w                          m        14.45%
# Cluster 4                r        46.20%
v                          v         9.06%
                           w        10.85%
                        
# Cluster 1                # Cluster 4        
d                          d        30.87%
q                          q         0.92%
r                          s        47.14%
s                          t        45.29%
t                        
                        
# Cluster 0                # Cluster 5        
a                          a        40.90%
e                          e        70.88%
i                          i        41.46%
o                          o        34.70%
u                          u        19.74%
                        
# Cluster 9                # Cluster 6        
g                          g        18.67%
# Cluster 6                # Cluster 7        
n                          n        40.04%
# Cluster 7                # Cluster 8        
h                          h        25.44%
# Cluster 8                # Cluster 9        
k                          k         6.48%
# Cluster 5                # Cluster 1        
y                          y         9.01%
# Cluster 2                # Cluster 2        
x                          x         1.35%

Notes:
  • two groups with different composition (made up of two distinct clusters for each algorithm) contain most consonants: b c f j l m p v w. The position of 'r' is unclear, since PAM assigns it to the next group
  • a third consonant group is made up of d, q, s, t and possibly r (see above)
  • the vowels a e i o u are recognized as distinct by both algorithms
  • a number of characters are clustered as singletons: g n h k y x
The King James Bible results are rather opaque to me. It seems significant that the vowels are grouped together, but I have no idea what the other letters have in common.
Thank you once again, Marco. You're providing an objective measurement of similarity which is incredibly helpful.

It doesn't surprise me that l and r are classed together. Even though they may not be considered similar in appearance, I've always thought they worked in similar ways.

I am surprised, however, that a and y are classed together. They tend not to occur in the same places and are not functionally that similar. Do you know why the two algorithms might have done this?
Hi Emma,
I am glad you find these results interesting!

I will have to study the data more carefully in order to understand the results better and answer your question. I am looking forward to this, but I will not have time for at least a few days.
In the meantime, I attach the raw data on which the clustering was based (it's a csv file, but I had to add a txt extension in order to attach it).
Thanks for the data, that's wonderful of you, Marco.

It seems as though they match badly on the "before" scores but pretty well on the "after" scores. Considering that (I believe) the following character conditions the appearance of a or y, it makes a lot of sense.
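
A small sketch of how the "before" and "after" halves of two vectors could be compared separately. It assumes each vector lists all the "before" percentages first and all the "after" percentages second, which may not match the column order of the attached file, and the name half_distances is invented. The numbers in the example are toy values, not taken from the data.

```python
def half_distances(u, v):
    """Return (distance on 'before' half, distance on 'after' half), Manhattan."""
    if len(u) != len(v) or len(u) % 2:
        raise ValueError("vectors must have the same even length")
    h = len(u) // 2

    def l1(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    return l1(u[:h], v[:h]), l1(u[h:], v[h:])

# Toy vectors: very different before-halves, nearly equal after-halves
before_dist, after_dist = half_distances([10, 0, 5, 5], [0, 10, 5, 4])
```

A large first value with a small second value would correspond to the pattern described above: a poor match on what precedes, a good match on what follows.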
(14-10-2016, 06:18 PM)MarcoP Wrote: Hi Emma,
I am glad you find these results interesting!

I will have to study the data more carefully in order to understand the results better and answer your question. I am looking forward to this, but I will not have time for at least a few days.
In the meantime, I attach the raw data on which the clustering was based (it's a csv file, but I had to add a txt extension in order to attach it).

Marco, was your analysis made with the text from the whole manuscript, or a subset of the folios? I wonder about conflating Language A and B ...
Thank you for your wonderful work Marco. The fact that <ee> and <e> behave similarly is very helpful. I'm starting to suspect that <e> can be doubled / tripled without any change in meaning.

I'm also not surprised that <l> and <r> are in the same group - both of them appear after <o> or <a> 80% of the time, and they are the only two signs with this distribution.
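
A figure of this kind could be checked with something like the following sketch. The name share_preceded_by is invented, the function counts every occurrence in a list of word tokens (whether the 80% was computed over tokens or over unique words is not stated), and the example words are a toy list.

```python
def share_preceded_by(words, glyph, preceders):
    """Percentage of occurrences of glyph that directly follow one of preceders."""
    total = hits = 0
    for w in words:
        for i, ch in enumerate(w):
            if ch == glyph:
                total += 1
                # word-initial occurrences have no preceding character
                if i > 0 and w[i - 1] in preceders:
                    hits += 1
    return 100.0 * hits / total if total else 0.0

# Toy list: 3 of the 5 occurrences of "l" follow "o" or "a"
share = share_preceded_by(["ol", "al", "l", "dal", "ly"], "l", "oa")
```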

(14-10-2016, 06:28 PM)Emma May Smith Wrote: I tried adding the stats for a and y together, and they look (subjectively) a strong match for o.

Hi Emma - recently I've also been wondering about the relation between o and y - the most common vords beginning <ok> often have <yk> parallels (same with <ot> probably). And in Marco's work, they are in the same group...