The Voynich Ninja

Full Version: Similarity of Voynichese glyphs according to their immediate statistical environment
Pages: 1 2
I posted [link] about the 'similarity' of Voynich glyphs with each other, and I'd like to elaborate further on that. Not sure if it's useful and/or a new thing (I doubt it), but anyway...

For each glyph I calculate the frequency distribution of the preceding glyph: this gives a vector of numbers (adding up to 1) for each glyph. Then I calculate the Euclidean distance between each pair of vectors (the square root of the sum of the squares of the differences). I do the same for the distributions of the following glyph. By construction, each distance is at least zero and at most sqrt(2) ≈ 1.414.
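A minimal Python sketch of the calculation described above, not the tool actually used for the tables. The function names, the use of a space as the word-boundary marker and the toy word list are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

def neighbor_distributions(words, offset):
    """Map each glyph to the relative frequencies of its neighbour.

    offset=-1 looks at the preceding glyph, offset=+1 at the following one.
    A space stands for the word boundary (an assumption of this sketch).
    Each word may be a string or a list of glyph tokens.
    """
    counts = defaultdict(Counter)
    for word in words:
        padded = [" "] + list(word) + [" "]
        for i in range(1, len(padded) - 1):
            counts[padded[i]][padded[i + offset]] += 1
    return {
        glyph: {n: c / sum(ctr.values()) for n, c in ctr.items()}
        for glyph, ctr in counts.items()
    }

def euclidean(p, q):
    """Square root of the sum of squared differences between two distributions."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

# Toy word list, invented for illustration only.
words = ["oka", "ota", "oky", "oty"]
prev = neighbor_distributions(words, -1)
foll = neighbor_distributions(words, +1)
print(euclidean(prev["k"], prev["t"]), euclidean(foll["k"], foll["t"]))
print(euclidean(prev["o"], prev["a"]))
```

In the toy list 'k' and 't' always share the same neighbours, so both of their distances are zero, while 'o' and 'a' never share a preceding glyph.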

What does this mean in practice? If we find that EVA 'k' and EVA 't' have small distances (both previous and following) and, say, we find the sequence 'oka' in the text, then we will probably also find 'ota', and moreover the ratio between 'oka's and 'ota's will probably be similar to the ratio between 'k's and 't's.
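This expectation can be checked by direct counting. A sketch, with a hypothetical helper `sequence_count` and a made-up word list (not taken from any transcription):

```python
def sequence_count(words, sequence):
    """Count non-overlapping occurrences of `sequence` over all words."""
    return sum(word.count(sequence) for word in words)

# Toy word list, invented for illustration only.
words = ["okal", "okar", "otal", "kor", "tol", "okaiin", "key", "otar"]

ratio_in_context = sequence_count(words, "oka") / sequence_count(words, "ota")
ratio_overall = sequence_count(words, "k") / sequence_count(words, "t")
print(ratio_in_context, ratio_overall)
```

Counting on raw strings also counts the 'k' inside 'ckh', so if compound glyphs are treated as units the words should be tokenised first.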

These are the most similar glyph pairs, as measured by the average of distance_previous and distance_following, considering the whole RF1a-n transcription and excluding rare glyphs (defined as all glyphs with a frequency lower than or equal to that of EVA 'g'):

[attachment=15970]
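How such a ranking could be produced from the two sets of distributions is sketched below. The function `rank_pairs`, its arguments and all the numbers are hypothetical; the real tables were partly assembled by hand.

```python
from itertools import combinations
from math import sqrt

def euclidean(p, q):
    keys = set(p) | set(q)
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def rank_pairs(prev, foll, freq, rare_count):
    """Rank glyph pairs by average distance, most similar first.

    Glyphs whose count is lower than or equal to `rare_count` are skipped.
    Returns tuples (average, distance_previous, distance_following, a, b).
    """
    kept = sorted(g for g in freq if freq[g] > rare_count)
    rows = []
    for a, b in combinations(kept, 2):
        d_prev = euclidean(prev[a], prev[b])
        d_foll = euclidean(foll[a], foll[b])
        rows.append(((d_prev + d_foll) / 2, d_prev, d_foll, a, b))
    return sorted(rows)

# Made-up distributions and counts, for illustration only.
prev = {"k": {"o": 1.0}, "t": {"o": 1.0}, "q": {" ": 1.0}, "g": {"a": 1.0}}
foll = {"k": {"a": 0.5, "e": 0.5}, "t": {"a": 0.5, "e": 0.5},
        "q": {"o": 1.0}, "g": {" ": 1.0}}
freq = {"k": 100, "t": 80, "q": 50, "g": 2}

rows = rank_pairs(prev, foll, freq, rare_count=freq["g"])
for row in rows:
    print(row)
```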

If you also want to consider the rare glyphs, add the following pairs:

[attachment=15971]

For reference (excluding rare characters), the two most dissimilar glyphs are, unsurprisingly, 'q' and 'n'. Average distance = 1.39, previous = 1.39, following = 1.38. Almost maximally orthogonal.
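For readers wondering where the sqrt(2) ceiling comes from, here is a short derivation. The symbols p and q (the two frequency vectors) are notation introduced here, not in the original post.

```latex
\[
d(p,q)^2 = \sum_i (p_i - q_i)^2
         = \sum_i p_i^2 + \sum_i q_i^2 - 2\sum_i p_i q_i
         \le 1 + 1 - 0 = 2 ,
\]
% because 0 \le p_i \le 1 implies p_i^2 \le p_i, hence \sum_i p_i^2 \le \sum_i p_i = 1
% (likewise for q), and every product p_i q_i is non-negative.
\[
d(p,q) \le \sqrt{2} \approx 1.414 .
\]
```

Equality requires each glyph to have one single neighbour, with the two neighbours being different, so a value of 1.39 means the two glyphs share almost no context.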

Note: the above analysis treats 'ch', 'sh', 'ckh', 'cth', 'cph' and 'cfh' as stand-alone glyphs. This is arbitrary, of course (but I think there are good reasons for it). Also, there was some manual work involved in creating the results tables, so please excuse any errors or omissions.
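One way to treat these sequences as single glyphs is to tokenise each word before counting. A sketch using a regular expression; the name `tokenize` and the example words are assumptions, and the actual tool may do this differently.

```python
import re

# Compound glyphs first, so they win over the single-character fallback.
COMPOUND_GLYPHS = ["ckh", "cth", "cph", "cfh", "ch", "sh"]
_TOKEN = re.compile("|".join(COMPOUND_GLYPHS) + "|.")

def tokenize(word):
    """Split an EVA word into glyph tokens, keeping compound glyphs whole."""
    return _TOKEN.findall(word)

print(tokenize("chockhy"))
print(tokenize("shedy"))
```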

When I can, I'll try to get the same data for each section of the VMS.
[quote="Mauro" pid='85467' dateline='1781012206']

I have long suspected that ee is a single glyph in the same class as Ch and Sh (or maybe just an error for Ch); while an e alone is a modifier for the previous k, t, Ch, Sh, CKh, or CTh; and eee is an ee modified by e.

I also believe that the I in Ih, IKh, ITh is an error, and should be C; that CTHh and CKHh should be CThe and CKhe; that ir should be iin; that m is an abbreviation for iin; and that b, u, g are just badly shaped versions of other glyphs.

And finally I suspect that p and f are fancy forms of te and ke, respectively.

Would your analysis be compatible with some or all of these hunches?

All the best, --stolfi
Some more data, and an answer to Stolfi.

I redid the same distance analysis separately on the Balneological and the Herbal A sections. Besides treating 'ch', 'sh', 'ckh', 'cth', 'cph' and 'cfh' as stand-alone glyphs, this time I also collapsed every run of multiple 'i' or 'e' into a single 'i' or 'e' (I don't think this changed anything in the results).
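The collapsing step can be sketched with a back-reference in a regular expression. The function name `collapse_runs` and the example words are chosen here for illustration.

```python
import re

def collapse_runs(word):
    """Reduce every run of repeated 'i' or repeated 'e' to a single glyph."""
    return re.sub(r"(i|e)\1+", r"\1", word)

print(collapse_runs("daiin"))
print(collapse_runs("cheeey"))
```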

These are all the most similar pairs: in green those with a distance < sqrt(2)/8 (12.5% of the maximum possible distance), in yellow those with a distance < sqrt(2)/4 (25% of the maximum). Only the average distances are shown, sorry:

[attachment=15975]
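The two colour thresholds work out to roughly 0.177 and 0.354. A small sketch of the classification; the function name `colour` is made up here, and the tables themselves were coloured by hand or in a spreadsheet.

```python
from math import sqrt

MAX_DISTANCE = sqrt(2)            # upper bound of the Euclidean distance
GREEN_LIMIT = MAX_DISTANCE / 8    # about 0.177
YELLOW_LIMIT = MAX_DISTANCE / 4   # about 0.354

def colour(average_distance):
    """Return the table colour for a pair, or None if it is not listed."""
    if average_distance < GREEN_LIMIT:
        return "green"
    if average_distance < YELLOW_LIMIT:
        return "yellow"
    return None

print(colour(0.10), colour(0.26), colour(0.39))
```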

Now answering Stolfi:
Quote: I have long suspected that ee is a single glyph in the same class as Ch and Sh (or maybe just an error for Ch); while an e alone is a modifier for the previous k, t, Ch, Sh, CKh, or CTh; and eee is an ee modified by e

I also believe that the I in Ih, IKh, ITh is an error, and should be C; that CTHh and CKHh should be CThe and CKhe; that ir should be iin; that m is an abbreviation for iin; and that b, u, g are just badly shaped versions of other glyphs.

And finally I suspect that p and f are fancy forms of te and ke, respectively.

Would your analysis be compatible with some or all of these hunches?

From the data above, I would say there's some (weak) support for 'm' being a variant of 'r', but I did not test 'm' vs. 'iin', nor the other cases you raise. In general I can 'easily' apply any kind of substitution, for instance 'iin' = 'X', and then calculate the distance from 'm'. It's just not fully automated: I need to copy two big tables into Excel and then get the final results manually from there, so it takes time and I cannot do it now (and I surely will not code anything for the foreseeable future). So I'm sorry, but you'll need to be patient; I'll check whether 'ee' is close to 'ch/sh', 'r' to 'iin' and 'm' to 'iin'.

Or you can download my software tool from GitHub [link] and do it yourself (ask for directions if needed, but it's easier to do than to explain).
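For anyone who prefers to script the substitution test rather than use the tool, a rough sketch follows. It is not the GitHub tool's code: the helper names, the placeholder 'X' (assumed not to occur in the transcription) and the toy words are all assumptions, and only distance_previous is computed.

```python
from collections import Counter
from math import sqrt

def substitute(words, sequence, placeholder):
    """Replace every occurrence of `sequence` with a one-character placeholder."""
    return [word.replace(sequence, placeholder) for word in words]

def prev_distribution(words, glyph):
    """Relative frequencies of whatever precedes `glyph` (space = word start)."""
    counts = Counter()
    for word in words:
        padded = " " + word
        for i in range(1, len(padded)):
            if padded[i] == glyph:
                counts[padded[i - 1]] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}

def euclidean(p, q):
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in set(p) | set(q)))

# Toy word list, invented for illustration only.
words = ["daiin", "dam", "okaiin", "okam", "qokal"]
merged = substitute(words, "iin", "X")
distance_previous = euclidean(prev_distribution(merged, "X"),
                              prev_distribution(merged, "m"))
print(merged, distance_previous)
```

The same pattern covers the other cases, e.g. replacing 'te' or 'ke' with a placeholder before comparing with 'p' or 'f'.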
I tested whether 'te' is similar to 'p' and 'ke' to 'f' ('te' and 'ke' were the only substitutions made), on Herbal A. They don't look similar: the average distances are rather high (distance_following and distance_previous are high too). 'te' and 'ke', instead, strongly resemble... themselves, and are rather distant from any other character (in Herbal A, at least).

[attachment=15978]
Emma started a thread about the subject 10 years ago:
[link]

Here is an experiment I ran that gave similar results (PAM column):
[link]

Uppercase characters [link]
(09-06-2026, 07:29 PM)MarcoP Wrote: Emma started a thread about the subject 10 years ago:
[link]

Here is an experiment I ran that gave similar results (PAM column):
[link]

Uppercase characters [link]

Thank you, I did not know that!
I will add the even earlier paper by Timm [link] (2014-2015). Of course, the observation that similar-looking glyphs behave similarly is extremely interesting.

Timm Wrote:Based on the observation that it is possible to generate other words, which exist in the VMS, by replacing similar shaped glyphs, it is possible to list the following rules:

"in", "iin" and "iiin" can replace each other (in - iin - iiin) 
"e", "ee" and "eee" can replace each other (e - ee - eee) 
"ee" and "ch" can replace each other (ee - ch) 
"ch" and "sh" can replace each other (ch - Sh) 
"k", "t", "p" and "f" can replace each other (k - t - p - f) 
"chk", "ckh" and "eke" can replace each other (chk - cKh - eke) 
"o" and "a" can replace each other (o - a) 
"o" and "y" can replace each other (o - y) 
"n", "r" and "m" can replace each other (n - R - m) 
"l", "r" and "m" can replace each other (l - R - m) 
"r" and "s" can replace each other (R - s) 
"s" and "d" can replace each other (s - d)
I recently checked the glyphs preceding “e”, “ee” and “eee” respectively. I no longer have the test, though. From memory: before a single “e”, “ch” and “sh” appear about 63 percent of the time; before “ee”, the gallows symbols also appear about 63 percent of the time. With three “e”s (“eee”), the variation is only slight. In Currier B, this effect is slightly stronger than in A.

In that respect, one shouldn’t merge these structures ;)
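Since the original test is no longer available, here is a sketch of how such a tally could be redone. The function `preceding_by_run`, the tokenising pattern and the toy words are assumptions, and the toy output says nothing about the real percentages.

```python
import re
from collections import Counter, defaultdict

# Compound glyphs and whole runs of 'e' are kept as single tokens.
_TOKEN = re.compile(r"ckh|cth|cph|cfh|ch|sh|e+|.")

def preceding_by_run(words):
    """For each run 'e', 'ee', 'eee', ... count the tokens that precede it."""
    table = defaultdict(Counter)
    for word in words:
        tokens = [" "] + _TOKEN.findall(word)   # space marks the word start
        for before, token in zip(tokens, tokens[1:]):
            if token[0] == "e":
                table[token][before] += 1
    return table

# Toy word list, invented for illustration only.
words = ["chey", "shey", "okeey", "oteey", "cheeey"]
table = preceding_by_run(words)
for run in sorted(table):
    print(run, dict(table[run]))
```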
(09-06-2026, 07:45 PM)MarcoP Wrote: I will add the even earlier paper by Timm [link] (2014-2015). Of course the observation that similarly looking glyphs behave similarly is extremely interesting.

Thank you.

Timm is probably using a different definition than mine, because my results are quite different. For instance, I too find that 'r' and 'm' have a moderately low average distance (0.26 in Herbal A), but 'n' is pretty far both from 'r' (av. distance = 0.63) and from 'm' (av. distance = 0.55). This is because the distance of 'n' from 'm' and 'r' according to the following character is quite small (they are all mostly followed by a space), but the distance according to the preceding character is high (1.1 and 1.08 respectively).
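The size of the 'following' distances is not stated, but it can be backed out from the figures above, assuming the average is the plain arithmetic mean of distance_previous and distance_following and ignoring rounding. The symbols below are shorthand introduced here.

```latex
\[
\bar d = \tfrac12\,(d_{\mathrm{prev}} + d_{\mathrm{foll}})
\quad\Longrightarrow\quad
d_{\mathrm{foll}} = 2\,\bar d - d_{\mathrm{prev}} .
\]
% With \bar d \in \{0.63,\ 0.55\} and d_prev \in \{1.10,\ 1.08\}:
\[
2(0.63) - 1.10 = 0.16, \qquad 2(0.63) - 1.08 = 0.18, \qquad
2(0.55) - 1.10 = 0.00, \qquad 2(0.55) - 1.08 = 0.02 .
\]
```

Whichever way "respectively" is read, the implied distance_following stays below about 0.2, so nearly all of the separation between 'n' and 'r'/'m' comes from the preceding glyph.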

I find 'o' and 'a' to be 0.55 apart, moderately high, and the same for 'o' and 'y' (0.50): they are not grouped together. Same for 'r' and 's' (0.52), while 's' and 'd' are moderately similar (0.39; they barely missed the tables of my previous post, where the threshold was 0.35).

I will check 'ee' vs. 'ch/sh', but I very much doubt they are similar. 'in', 'iin', 'iiin' might well be similar, but I need to test it before being sure.

One caveat is that the results depend on the section of the manuscript examined, i.e. 'k', 't', 'p' and 'f' could indeed be seen as a single group in Herbal A (and 'ch', 'sh', 'cth', 'ckh', 'cph' and 'cfh' can be added to the mix, see the table [link]), but in Balneological 'k/t', 'p/f' and 'ch/sh' are well separated.
(09-06-2026, 10:13 PM)Mauro Wrote: Timm is probably using a different definition than mine because my results are quite different. For instance I find too that 'r' and 'm' have a moderately low average distance (0.26 in Herbal A), but 'n' is pretty far both from 'r' (av. distance = 0.63) and from 'm' (av. distance = 0.55). This because the distance according to the following character of 'n' from 'm' and 'r' is quite small (they are all mostly followed by space), but the distance according to the preceding character is high (1.1 and 1.08 respectively).

Thank you for this analysis. The discrepancy between your distributional distances and my substitution rules is informative rather than contradictory, because I describe context-dependent rules:

"These rules for similar glyphs only apply with some restrictions. For instance 'o' and 'y' can replace each other only as the first or as the last sign. Another example is that 'o' is interchangeable with 'a' before 'l' and 'r' but not after 'q' or before 'k'." ([link], p. 5). A word grid documenting the most frequent substitution relationships across the VMS vocabulary is available at [link] (see also Timm 2014, pp. 66-82).

So "o" and "a" substitute in specific positions — as prefix elements before "l" and "r" — but not in all contexts. Their global distributional distance (0.55 in your measurement) is high because "o" after "q" has no "a" equivalent, pulling the global distances apart. But in the specific positions where they do substitute — "ol"/"al", "or"/"ar", "chol"/"chal" — they are interchangeable.

The same applies to "n" and "r" (0.63 in your measurement). Both appear word-finally, and in that position they substitute — dain/dair, sain/sair, okain/okair etc. But "r" also appears in other positions where "n" doesn't, making their global distributions different.

Your core pairs — ch/sh, k/t, p/f, r/l — show low distances because they substitute freely across many contexts. The pairs with higher distances in your analysis — o/a, o/y, n/r — substitute only in restricted positions, which dilutes their global similarity.
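A position-restricted rule of this kind can be tested against a vocabulary directly, which is a different measurement from the global distances. A sketch: the function `substitution_pairs` and the small vocabulary are illustrative assumptions (only ol/al, or/ar and chol/chal come from the examples above).

```python
def substitution_pairs(vocabulary, old, new, allowed_next):
    """Find attested word pairs that differ by replacing `old` with `new`
    at a position where the next glyph is in `allowed_next`
    (a space stands for the end of the word)."""
    vocab = set(vocabulary)
    pairs = set()
    for word in vocab:
        for i, glyph in enumerate(word):
            following = word[i + 1] if i + 1 < len(word) else " "
            if glyph == old and following in allowed_next:
                candidate = word[:i] + new + word[i + 1:]
                if candidate in vocab:
                    pairs.add((word, candidate))
    return sorted(pairs)

# Small illustrative vocabulary, not a real word list.
vocabulary = ["ol", "al", "or", "ar", "chol", "chal", "qokal", "okar"]
print(substitution_pairs(vocabulary, "o", "a", {"l", "r"}))
```

Passing `allowed_next={" "}` would test word-final rules such as n/r in the same way.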

Note: Currier already noted in 1976 that the Voynich glyphs are constructed from shared base strokes: 'you can make up almost any of the other letters out of these two symbols i and e' (see The Nature of the Symbols in [link]). Your distributional distances quantify this observation: glyph pairs with low distance are the pairs that share stroke structure.