The Voynich Ninja

Full Version: Character Classes
Diane, thank you. Neal's regex is similar to the kind of thing I would like to build myself. I've managed a very wordy version of how words are built, but nothing quite so simple.

(20-09-2016, 08:34 AM)MarcoP Wrote: In general, I think the problem of defining word structure (similarly to Philip Neal's regex) is contiguous but not identical to the identification of "character classes". For instance, in Latin the phonetically similar "n" and "m" have very different positional statistics. The two letters have roughly the same number of occurrences, but "n" appears as the last letter in about 1% of the words that contain it, while "m" appears as the last letter in about 50% of the words with at least an "m".

That's true, presumably due to case endings. I suppose we need multiple tests to show whether two characters are alike or not.
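The positional statistic in the quote (how often a letter is word-final among the words that contain it) can be sketched as follows. This is only an illustration: the function name and the tiny Latin-looking sample are made up, and real figures would need a proper corpus.

```python
def final_position_share(words, letter):
    """Share of the words containing `letter` in which it is the last letter."""
    containing = [w for w in words if letter in w]
    if not containing:
        return 0.0
    return sum(w.endswith(letter) for w in containing) / len(containing)

# Hypothetical toy sample, not a real Latin corpus.
sample = ["rosam", "bonum", "manus", "nomen", "annus", "unde"]
for letter in "nm":
    print(letter, final_position_share(sample, letter))
```

Running the same function on a Voynich transcription would give one of the "multiple tests" for comparing two characters.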
The main thing to me is that you have one set of consonants written with a straight stroke, and then a corresponding set of consonants written with a curved stroke.

n b
j d
l y
r s
m g

The column on the left, with the straight stroke, follows a or i but not e.  The column on the right, with the curved stroke, follows e but not a or i.  o is neutral with respect to this system and may precede letters from either column.

Also, b and j are rare, and it's not clear what to make of these letters.  We could remove them and pair d with n, but that would be slightly less symmetrical.
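A rough way to test the straight/curved rule above is to count which character immediately precedes each letter of the two columns. The sketch below assumes the letters are EVA-style transliteration characters; the word list is a hypothetical placeholder and should be replaced by a real transcription.

```python
from collections import Counter

STRAIGHT = set("njlrm")   # left column in the table above
CURVED = set("bdysg")     # right column in the table above

def predecessor_counts(words):
    """Count, per column, which character immediately precedes its letters."""
    counts = {"straight": Counter(), "curved": Counter()}
    for word in words:
        for prev, cur in zip(word, word[1:]):
            if cur in STRAIGHT:
                counts["straight"][prev] += 1
            elif cur in CURVED:
                counts["curved"][prev] += 1
    return counts

# Hypothetical word list; replace with words from a real transcription.
words = ["daiin", "chedy", "ol", "shey", "dar", "qokeey", "aiin", "okal"]
counts = predecessor_counts(words)
for column in ("straight", "curved"):
    print(column, {p: counts[column][p] for p in "aieo"})
```

If the rule holds, the "straight" row should show almost nothing under e and the "curved" row almost nothing under a and i, with o appearing in both.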
I've been working on something that may (or may not) be useful to this thread. There may be something odd involving [ch/sh] and [o/e] when they are around the gallows / Stolfi's "dealers". In some cases, it almost looks like [ch] and [sh] may substitute for each other, and [o] and [e] might variably appear once or twice, or not at all:


she--dyk-ain-qok
che--dyk-ain-qok

ch---dyk-ain
chee-dyk-ain
sho--dyk-ain
sho--dyk-aiin
cho--dyk-an

she-dyk-air
sh--dyk-air
che-dyk-air
ch--dyk-air

-che-dyk-ar
kche-dyk-ar
kch--dyk-ar
pch--dyk-ar

kee--dyk-ar
tee--dyk-ar

ch--dyok-ar
cho-dyok-ar
che-dyok-ar
she-dyok-ar
ch--dyok-aiin
cho-dyok-aiin

ke--dy-ch---dyk-al-sh
kch-dy-chee-dyk-al-ch


(the dashes are my invention; they do not represent spaces in the manuscript)
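The pattern in the first few groups above (optional [e], [ee] or [o] between [ch]/[sh] and "dyk") can be written as a small regex. This is a sketch covering only the rows listed in `rows`, not a claim about word structure in general; the pattern itself is my own guess at a compact description.

```python
import re

# Optional [e], [ee] or [o] between [ch]/[sh] and "dyk"; endings taken from the rows above.
PATTERN = re.compile(r"(ch|sh)(ee?|o)?dyk(aiin|ain|air|an|ar)")

rows = ["ch---dyk-ain", "chee-dyk-ain", "sho--dyk-ain", "sho--dyk-aiin",
        "cho--dyk-an", "she-dyk-air", "sh--dyk-air", "che-dyk-air", "ch--dyk-air"]

for row in rows:
    text = row.replace("-", "")          # the dashes are alignment padding only
    match = PATTERN.fullmatch(text)
    print(f"{text:12} {'match' if match else 'no match'}")
```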

This actually happens all over the place, when similar strings are compared:

qo-ko--dyqotedy
qo-ke--dyqotedy
qo-kee-dyqotedy
qo-te--dyqotedy
qo-tee-dyqotedy

op-she-dyqotedy
op-she-dyqoteedy
op-ch--dyqotedy
op-che-dyqotedy
op-che-dyqotody
[quote pid='6290' dateline='1474399582']
Sam G

The main thing to me is that you have one set of consonants written with a straight stroke, and then a corresponding set of consonants written with a curved stroke.

[...]
j d
l y
r s
m g

The column on the left, with the straight stroke, follows a or i but not e.  The column on the right, with the curved stroke, follows e but not a or i.  o is neutral with respect to this system and may precede letters from either column.

Also, b and j are rare, and it's not clear what to make of these letters.  We could remove them and pair d with n, but that would be slightly less symmetrical.

[/quote]


These are normal Latin abbreviations and this is how they happen to be written in Latin. The "r" with a tail is straight because it's based on Latin "r" with a tail. The curved-j is based on Latin abbreviation "-cis" and it's curved because a "c" in Latin is curved. The straight-j is Latin "-ris", so it follows that it will be straight.

I don't know what they mean in Voynichese, but the shapes were not invented by the VMS scribe; they were borrowed, and thus the shapes already had these forms. I don't even think the VMS writer specifically chose letters that were straight or curved, because the abbreviations appear to be selected based on the most common ones, not on specific categories of shapes.
I have made a “character classification” experiment based on the position of characters with respect to each other. In this case, I did not consider the position of characters inside words (e.g. if one character tends to occur as a prefix or a suffix).

I introduced upper-case characters standing for these Voynichese sequences:
K = ckh
T = cth
P = cph
F = cfh
C = ch
S = sh
M = iin
N = in

For each character, I have computed a set of 10 frequencies (each in the range 0..1, expressed as a fraction of the total number of occurrences of that character):
  1. occurrences before a
  2. occurrences after a
  3. occurrences before d
  4. occurrences after d
  5. occurrences before k
  6. occurrences after k
  7. occurrences before l
  8. occurrences after l
  9. occurrences before o
  10. occurrences after o
So, for each character, a 10-dimensional vector was generated. The vectors were fed to the K-Means clustering algorithm, which groups vectors on the basis of their distance from each other.
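As a rough illustration of how such 10-dimensional vectors could be built (not the actual script used for the results below), here is a Python sketch. The function names, the replacement order in `SEQUENCES` and the three sample words are assumptions made for the example.

```python
from collections import Counter

REFERENCE = "adklo"  # the five reference characters from the list above
# Multi-character sequences and their upper-case stand-ins (longer ones first).
SEQUENCES = [("ckh", "K"), ("cth", "T"), ("cph", "P"), ("cfh", "F"),
             ("ch", "C"), ("sh", "S"), ("iin", "M"), ("in", "N")]

def tokenize(word):
    """Replace the multi-character sequences with single upper-case symbols."""
    for sequence, symbol in SEQUENCES:
        word = word.replace(sequence, symbol)
    return word

def context_vectors(words):
    """Map each character to 10 fractions: before/after each reference character."""
    totals = Counter("".join(words))
    pairs = Counter(p for w in words for p in zip(w, w[1:]))
    vectors = {}
    for ch, total in totals.items():
        vec = []
        for ref in REFERENCE:
            vec.append(pairs[(ch, ref)] / total)   # ch occurs before ref
            vec.append(pairs[(ref, ch)] / total)   # ch occurs after ref
        vectors[ch] = vec
    return vectors

# Hypothetical three-word sample, only to show the shape of the output.
vectors = context_vectors([tokenize(w) for w in ["daiin", "chol", "okal"]])
print(vectors["k"])
```

Each vector lists the values in the order of the numbered list: before a, after a, before d, after d, and so on.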
I used the ELKI Java software with this command:
 java -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in input.txt -algorithm clustering.kmeans.KMedoidsEM -kmeans.k K

I tried several values for K (the number of output classes). The clustering result is, of course, strongly affected by the value of this parameter.
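For readers without ELKI, the basic idea of grouping vectors by distance can be sketched in plain Python. Note that this is a simple Lloyd-style k-means written for illustration; the ELKI command above names KMedoidsEM, so this is not a reproduction of that run, and the four sample points are made up.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style k-means on {label: vector}; returns {label: cluster index}."""
    rng = random.Random(seed)
    labels = sorted(points)
    centroids = [list(points[l]) for l in rng.sample(labels, k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            l: min(range(k), key=lambda c: math.dist(points[l], centroids[c]))
            for l in labels
        }
        if new_assignment == assignment:   # nothing moved: converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [points[l] for l in labels if assignment[l] == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical 2-D points: two obvious groups.
points = {"a": [0.0, 0.0], "b": [0.0, 0.1], "c": [1.0, 1.0], "d": [1.0, 1.1]}
result = kmeans(points, k=2)
print(result)
```

The cluster numbers themselves are arbitrary; only which labels share a number matters. Different seeds or values of `k` can give different groupings on real data, which is why several values of K are worth trying.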

These are the results for 4 and 10 clusters:

4 clusters:
# Cluster: Cluster 0
i N M n m 
# Cluster: Cluster 1
t s r p l k f d T P K F
# Cluster: Cluster 2
y o h g e c a S C
# Cluster: Cluster 3
q


10 clusters:
# Cluster: Cluster 0
r l
# Cluster: Cluster 1
g c
# Cluster: Cluster 2
C S h e
# Cluster: Cluster 3
a y
# Cluster: Cluster 4
m n
# Cluster: Cluster 5
T s P p K F
# Cluster: Cluster 6
k t f d
# Cluster: Cluster 7
q
# Cluster: Cluster 8
o
# Cluster: Cluster 9
N i M

I attach a 2D plot based on two of the 10 numeric values I used (X axis “AFTER O %”, Y axis “AFTER L %”).
For instance, k occurs after o 39% of the time and after l 14% of the time.
Since the plot only represents 1/5 of the numeric values used, the distribution of the clusters might not be completely compatible with what we see in the plot, but it should give an idea of how clustering works: points that are close in the 10-dimensional space are assigned to the same cluster.
[Image: attachment.php?aid=699]


For comparison, these are the results I get by applying the same method to the English of the King James Bible:

4 clusters:
# Cluster: Cluster 0
y
# Cluster: Cluster 1
z x v t r n m k h
# Cluster: Cluster 2
u s q o i g e d a
# Cluster: Cluster 3
p w l j f c b

10 clusters:
# Cluster: Cluster 0
z
# Cluster: Cluster 1
p l f c b
# Cluster: Cluster 2
e
# Cluster: Cluster 3
u
# Cluster: Cluster 4
w
# Cluster: Cluster 5
k r n m v
# Cluster: Cluster 6
t s q h g d x
# Cluster: Cluster 7
j
# Cluster: Cluster 8
i o a
# Cluster: Cluster 9
y
Thanks, Marco! I wish I was more literate in statistics, so I could understand more - could you tell me what you found in a more plaintext format? :D
(04-10-2016, 08:36 AM)MarcoP Wrote: For each character, I have computed a set of 10 frequencies (each in the range 0..1, expressed as a fraction of the total number of occurrences of that character):
  1. occurrences before a
  2. occurrences after a
  3. occurrences before d
  4. occurrences after d
  5. occurrences before k
  6. occurrences after k
  7. occurrences before l
  8. occurrences after l
  9. occurrences before o
  10. occurrences after o
Interesting... can I ask why you chose these five letters?

As for the results, I found it surprising that g is more common after l than it is after o.
(04-10-2016, 02:55 PM)ThomasCoon Wrote: Thanks, Marco! I wish I was more literate in statistics, so I could understand more - could you tell me what you found in a more plaintext format? :D

Hi Thomas, I wouldn't say I have found anything, but maybe this approach could provide something helpful, if applied in a less random way. The basic concept is simple. Each character is mapped to a vector of 10 numbers. Each vector can be imagined as a point in a 10-dimensional space (a vector of 2 numbers corresponds to a point in 2D, 3 numbers to a point in space...).
The K-Means algorithm splits the set of 10-dimensional points into K groups (called "clusters") made up of points that are close to each other.

So, characters that are assigned to the same cluster are in some way "similar" to each other.
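"Close to each other" here means a small distance between the vectors. A minimal sketch of that idea, ranking the nearest neighbours of one character by Euclidean distance, is given below; the three short vectors are hypothetical and only show the mechanics.

```python
import math

def nearest(vectors, target, n=3):
    """Return the n labels whose vectors lie closest (Euclidean) to `target`."""
    others = [label for label in vectors if label != target]
    return sorted(others, key=lambda label: math.dist(vectors[label], vectors[target]))[:n]

# Hypothetical 3-dimensional vectors, for illustration only.
vectors = {"p": (0.1, 0.0, 0.3), "q": (0.1, 0.1, 0.3), "r": (0.9, 0.5, 0.0)}
print(nearest(vectors, "p", n=2))
```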

For instance, considering the King James Bible example:

# Cluster: Cluster 1
p l f c b

groups a subset of the consonants, including 3 labials.

# Cluster: Cluster 8
i o a

groups three vowels

Other clusters from the King James Bible are more difficult to analyze. At best, we can say that the 10-cluster example seems to correctly separate vowels from consonants, but the grouping of consonants is often unclear.
If one could define a more meaningful way of assigning numeric vectors to Voynichese characters, we would probably get more reliable results. The groupings produced by this experiment in its current form must be taken with care :)

(04-10-2016, 03:10 PM)Sam G Wrote: Interesting... can I ask why you chose these five letters?

Hi Sam,
the choice of these letters was rather random:
I picked o and l because ol is the most common sequence of two characters.
d because it often combines with ol, as both old and dol.
k because it's the most common gallows character.
Finally I added a because it's another common character.

(04-10-2016, 03:10 PM)Sam G Wrote: As for the results, I found it surprising that g is more common after l than it is after o.

I should have mentioned that the numbers are based on the "dictionary" of words, not occurrences in the MS.
lg occurs in 4 words, og in 3 words. These numbers are so small that they are likely irrelevant.
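The difference between counting over the "dictionary" (distinct words) and over the running text can be shown with a short sketch. The function name and the sample words are hypothetical; they are not taken from the manuscript.

```python
def words_containing(words, bigram):
    """Return (running-text count, dictionary count) of words containing `bigram`."""
    tokens = sum(bigram in word for word in words)        # every occurrence of a word counts
    types = sum(bigram in word for word in set(words))    # each distinct word counts once
    return tokens, types

# Hypothetical running text: one word repeated three times.
text = ["olgy", "olgy", "olgy", "dalg", "chog"]
print("lg", words_containing(text, "lg"))
print("og", words_containing(text, "og"))
```

With counts as small as the 4 and 3 quoted above, a single extra dictionary word changes the ranking, which is why they should not be over-interpreted.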
Nice work Marco, please take into account that more character positions are possible.

Your view shows only the positions that you chose;
this means that the distribution is distorted, because they are not relative to all other letters and positions.
(04-10-2016, 03:40 PM)Davidsch Wrote: Nice work Marco, please take into account that more character positions are possible.

Your view shows only the positions that you chose;
this means that the distribution is distorted, because they are not relative to all other letters and positions.

Thank you, David!
I thought of the possibility of considering all the combinations at once. This would produce a number of dimensions higher than the number of vectors. I don't know if this is compatible with this simple clustering approach, but I can give it a try and see what happens. It's good that we can always check the results with known languages :)
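To make the dimension count concrete: with before/after frequencies against every character, n characters give n vectors of 2 * n values each. A minimal sketch, with a hypothetical function name and a toy three-word sample:

```python
from collections import Counter

def full_context_vectors(words):
    """Before/after fractions against every character: 2 * n dimensions for n characters."""
    alphabet = sorted(set("".join(words)))
    totals = Counter("".join(words))
    pairs = Counter(p for w in words for p in zip(w, w[1:]))
    vectors = {}
    for ch in alphabet:
        before = [pairs[(ch, ref)] / totals[ch] for ref in alphabet]
        after = [pairs[(ref, ch)] / totals[ch] for ref in alphabet]
        vectors[ch] = before + after
    return vectors

# Hypothetical toy sample with three distinct characters.
vectors = full_context_vectors(["ol", "dol", "old"])
print(len(vectors), "vectors of", len(vectors["o"]), "dimensions")
```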