The Voynich Ninja

Full Version: Similarity of Voynichese glyphs according to their immediate statistical environment
Pages: 1 2
(09-06-2026, 11:26 PM)Torsten Wrote: Thank you for this analysis. The discrepancy between your distributional distances and my substitution rules is informative rather than contradictory, because the rules I describe are context-dependent:

"These rules for similar glyphs only apply with some restrictions. For instance 'o' and 'y' can replace each other only as the first or as the last sign. Another example is that 'o' is interchangeable with 'a' before 'l' and 'r' but not after 'q' or before 'k'." ([link hidden], p. 5). A word grid documenting the most frequent substitution relationships across the VMS vocabulary is available at [link hidden] (see also Timm 2014, pp. 66-82).

So "o" and "a" substitute in specific positions — as prefix elements before "l" and "r" — but not in all contexts. Their global distributional distance (0.55 in your measurement) is high because "o" after "q" has no "a" equivalent, pulling the global distances apart. But in the specific positions where they do substitute — "ol"/"al", "or"/"ar", "chol"/"chal" — they are interchangeable.

The same applies to "n" and "r" (0.63 in your measurement). Both appear word-finally, and in that position they substitute — dain/dair, sain/sair, okain/okair etc. But "r" also appears in other positions where "n" doesn't, making their global distributions different.

Your core pairs — ch/sh, k/t, p/f, r/l — show low distances because they substitute freely across many contexts. The pairs with higher distances in your analysis — o/a, o/y, n/r — substitute only in restricted positions, which dilutes their global similarity.

Note: Currier already observed in 1976 that the Voynich glyphs are constructed from shared base strokes: 'you can make up almost any of the other letters out of these two symbols i and e' (see The Nature of the Symbols in [link hidden]). Your distributional distances quantify this observation: glyph pairs with low distance are the pairs that share stroke structure.

Thank you for your explanations and remarks. I broadly agree with you, perhaps with a slightly different nuance. For example, I agree there seem to be two different kinds of 'y', a common one at the end of many words and a rarer one at the beginning of some, so it's quite possible that the 'similarity rules' differ between the two cases. In the case of 'n' vs. 'r/m', however, I suspect the difference is driven not by words such as 'raiin', with a non-final 'r', but by common words such as 'ar' and 'dar', where 'r' is final but preceded by 'a', while most final 'n' are preceded by 'i' ('an' and 'dan' being quite rare). These behaviours might be interesting to explore (I'll look into it when I get the time).
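One way to check the 'n' vs. 'r' point is to tally which glyph immediately precedes each of them in word-final position. The following is only a minimal sketch: the function name `final_glyph_predecessors` and the token list are made up for illustration, the tokens are NOT real counts from the manuscript, and words are treated as plain strings of EVA characters.

```python
from collections import Counter

def final_glyph_predecessors(words, finals=("n", "r")):
    """For each word-final glyph in `finals`, count the glyph that precedes it."""
    table = {g: Counter() for g in finals}
    for w in words:
        # Only words of length >= 2 have a predecessor for the final glyph.
        if len(w) >= 2 and w[-1] in table:
            table[w[-1]][w[-2]] += 1
    return table

# Hypothetical toy tokens (illustration only, not manuscript data).
sample = ["daiin", "aiin", "dain", "ar", "dar", "or", "dair", "okaiin", "dan"]

table = final_glyph_predecessors(sample)
for glyph, counts in table.items():
    total = sum(counts.values())
    shares = {prev: round(n / total, 2) for prev, n in counts.most_common()}
    print(glyph, shares)
```

Run on a full transliteration, a table of this kind would show directly whether final 'r' is dominated by a preceding 'a' while final 'n' is dominated by a preceding 'i'.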

About the 'shared base strokes': it's true of course that VMS symbols are constructed with just a few of them, but I've never been convinced of the significance of this fact, because I would expect any writing system to re-use the same basic strokes. For example, in the normal block-letter Latin alphabet 'o', 'p', 'q', 'b', 'd', 'l' are all made with a circle (or none) + a vertical bar (or none), but this does not mean they are related to each other.
(10-06-2026, 07:41 AM)Mauro Wrote: About the 'shared base strokes': it's true of course that VMS symbols are constructed with just a few of them, but I've never been convinced of the significance of this fact, because I would expect any writing system to re-use the same basic strokes. For example, in the normal block-letter Latin alphabet 'o', 'p', 'q', 'b', 'd', 'l' are all made with a circle (or none) + a vertical bar (or none), but this does not mean they are related to each other.

That other writing systems also reuse the same basic strokes is actually the point: the VMS behaves differently. For instance, the shape of a glyph must be compatible with the shape of the previous one, and is also influenced by its position within a word or a line. Schwerdtfeger described four design rules in 2008: (1) line-glyphs can follow line-glyphs or 'a'; (2) curve-glyphs and 'a' can follow curve-glyphs; (3) the 'l'-glyph can be used as a curve-glyph or as a line-glyph; and (4) gallows glyphs count as curve-glyphs (see Timm & Schinner 2020, p. 10). See also the description of the [link hidden] by Brian Cham from 2014.

In the Latin alphabet, "p" and "b" share a circle + bar but have completely different distributional contexts. In the VMS, your measurements show that glyph pairs sharing stroke structure — ch/sh (0.13), k/t (0.09), p/f (0.11) — have nearly identical distributional contexts. Visual similarity predicts distributional interchangeability. That relationship doesn't exist in writing systems encoding natural language. However, it does exist in the Voynich text.
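To make the notion of "distributional context" concrete, here is a minimal sketch of one possible distance between two glyphs. The measure behind the figures quoted above (0.13, 0.09, 0.11) is not specified in this thread, so the Jensen-Shannon distance over (previous, next) neighbour pairs is purely an assumption chosen for illustration. The names `context_profiles` and `js_distance` and the toy words are likewise hypothetical, and EVA digraphs such as 'ch' and 'sh' would need to be merged into single symbols before this is run on real data.

```python
import math
from collections import Counter, defaultdict

def context_profiles(words):
    """Map each character to a Counter of (previous, next) neighbours.
    '#' marks a word edge."""
    profiles = defaultdict(Counter)
    for w in words:
        padded = "#" + w + "#"
        for i in range(1, len(padded) - 1):
            profiles[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    return profiles

def js_distance(c1, c2):
    """Jensen-Shannon distance (base 2) between two non-empty Counters.
    0 means identical distributions, 1 means no shared contexts."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    div = 0.0
    for key in set(c1) | set(c2):
        p, q = c1[key] / n1, c2[key] / n2
        m = (p + q) / 2
        if p:
            div += 0.5 * p * math.log2(p / m)
        if q:
            div += 0.5 * q * math.log2(q / m)
    return math.sqrt(max(div, 0.0))

# Hypothetical toy words (illustration only, not manuscript data).
toy = ["okal", "otal", "okar", "otar", "qokal", "dal", "dar"]
profiles = context_profiles(toy)
d_kt = js_distance(profiles["k"], profiles["t"])  # same contexts in this toy set
d_oa = js_distance(profiles["o"], profiles["a"])  # no shared contexts in this toy set
print(round(d_kt, 2), round(d_oa, 2))
```

Restricting `words` to a particular positional subset (for example, only the contexts before 'l' and 'r') before building the profiles would give a position-conditioned distance to compare with the global one.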
I think my recent token-level results may help reconcile Mauro's and Timm's positions. If I am not wrong, Mauro measures global glyph distributions, while Timm argues that many substitutions are local and context-dependent. Instead of looking at glyphs, I looked at whole-word variants.

I built a dataset of token pairs at Levenshtein distance 1 and compared them against controls matched for frequency, length and line-position behaviour. Even after controlling for these factors, Lev-1 pairs still show significantly higher:
  • section similarity
  • page similarity
  • paragraph similarity
  • local context similarity

Metric                 Control   Lev-1   Increase
Section similarity     0.520     0.645   +24%
Page similarity        0.081     0.158   +95%
Paragraph similarity   0.031     0.069   +123%
Context similarity     0.038     0.084   +121%
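For readers who want to build a comparable pair list, the sketch below finds all vocabulary pairs at Levenshtein distance 1 without comparing every word against every other. It is an illustration under assumptions, not the code used for the table: the function name `lev1_pairs` is hypothetical, words are compared as plain EVA character strings, the character '*' is assumed not to occur in any word, and the matching of controls and the similarity metrics themselves are not covered.

```python
from collections import defaultdict
from itertools import combinations

def lev1_pairs(vocab):
    """Return sorted pairs of distinct words at Levenshtein distance 1."""
    vocab = set(vocab)
    pairs = set()
    buckets = defaultdict(set)
    for w in vocab:
        for i in range(len(w)):
            # Words sharing a wildcard key differ by one substitution at position i.
            buckets[w[:i] + "*" + w[i + 1:]].add(w)
            # Deleting one character: if the result is a word, the two words
            # differ by one insertion/deletion.
            shorter = w[:i] + w[i + 1:]
            if shorter in vocab:
                pairs.add((shorter, w))
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return sorted(pairs)

# Small vocabulary taken from word types mentioned in this thread.
vocab = ["ain", "aiin", "aiiin", "qokey", "qokeey", "shedy", "chedy", "daiin"]
pairs = lev1_pairs(vocab)
for a, b in pairs:
    print(a, b)
```

Insertion/deletion pairs are stored as (shorter, longer), substitution pairs in alphabetical order, which makes it easy to separate expansions from substitutions later.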

This suggests that similar words are not only visually related. They also tend to behave similarly within the manuscript. Interestingly, I also tested whether preserving prefixes is more important than preserving suffixes (or vice versa), and found very little difference. Both seem to preserve behaviour to a similar extent.

One detail I found particularly interesting is that some of the strongest relationships involve expansions rather than substitutions:

ain → aiin → aiiin
qokey → qokeey
shedy → sheedy

These variants tend to occur in very similar contexts. I don't think this proves a generative process, but it does suggest that these transformations are not random. They seem to preserve some aspect of token behaviour across the manuscript. The strongest signal I found is not generic visual similarity, but specifically the expansion series involving repeated i and e.
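The expansion series can be collected mechanically by collapsing every run of repeated 'i' or 'e' to a single character and grouping words that share the collapsed form. This is a minimal sketch with a hypothetical function name (`expansion_series`), again working on plain EVA strings. Note that a word containing both 'ii' and 'ee' has both runs collapsed at once, so members of one family can differ by more than a single stroke.

```python
import re
from collections import defaultdict

def expansion_series(vocab, glyphs="ie"):
    """Group words that become identical once runs of the given glyphs
    are collapsed to one character; keep only groups with 2+ members."""
    pattern = re.compile(r"([%s])\1+" % re.escape(glyphs))
    families = defaultdict(list)
    for w in set(vocab):
        families[pattern.sub(r"\1", w)].append(w)
    # Order each family from shortest to longest form.
    return {k: sorted(v, key=len) for k, v in families.items() if len(v) > 1}

# Word types mentioned in this thread, plus one word without a partner.
words = ["ain", "aiin", "aiiin", "qokey", "qokeey", "shedy", "sheedy", "daiin"]
series = expansion_series(words)
for base, members in sorted(series.items()):
    print(base, "->", " / ".join(members))
```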

So my current impression is:
  • Mauro is right that some substitution classes are not globally equivalent.
  • Timm is right that many relationships are local rather than global.
  • But there is also measurable evidence that small orthographic changes tend to preserve the functional behaviour of tokens (if they have any).
The remaining question is whether this reflects genuine morphology or a word-generation process of the kind proposed by Timm.
(10-06-2026, 09:37 PM)quimqu Wrote: ...

A note on your observation that the "strongest signal" involves the "expansion" series — ain/aiin/aiiin, qokey/qokeey, shedy/sheedy. I would describe these as similar rather than as expansions, because at the stroke level the modification is the same as in other substitutions:

- "ch" → "sh": one stroke added to the curve. Transcribed as a character substitution.
- "in" → "iin": one stroke added (another minim). Transcribed as a character expansion.
- "edy" → "eedy": one stroke added (another "e" curve). Transcribed as a character expansion.

All three are the same physical operation — adding one stroke. The transcription system represents the first as substitution and the other two as expansion. But on the parchment, the scribe is doing the same thing in each case. The distinction between "substitution" and "expansion" is an artifact of the transcription, not a property of the writing.

This is why in Timm & Schinner 2020 (p. 9) both operations are described under the same rule — 'Replace one or more glyphs by similar ones.' The ligature 'ch' consists of two 'e'-glyphs connected by a dash, so adding an additional 'e'-glyph leading to 'cheol' from 'chol' is the same kind of modification as changing 'ch' to 'sh'.

You can see the relations between similar word types in the global word frequencies:
"For instance, in most cases, words with 'sh' are less frequent than the corresponding variant using 'ch'. Also words using 'p' or 'f' instead of 'k' and 't' are generally less frequent. Similar relations can be described for words using a 'a' instead of 'o', 'ee' instead of 'e' etc. Furthermore, it seems that if a word is spelled similarly to 'daiin', 'ol' or 'chedy', it is more frequent than a word, which is spelled in a less similar way. To quantify this effect, the edit distance, defined as the number of steps required to transform two words into each other, can be used. For instance, it is possible to transform 'daiin' into 'dain' by deleting one i-glyph and into 'dair' by deleting an 'i' and by replacing 'n' with 'r'. ...

These observations make it possible to predict the occurrence and the frequency of similarly spelled words. For instance, if it is known that 'chedy' is frequent, it is possible to predict that 'shedy' is also frequently used although less frequently than 'chedy'. And if we know that 'ychy' only occurs four times it is possible to predict that a glyph group 'ochy' should also exist and that the groups 'ychdy', 'yshy' and 'osheedy' probably occur less than four times.
The grid reveals that the words of the VMS are connected to each other. It is possible to generate another word from the word pool by replacing a glyph by a similar one, or by adding or deleting a glyph. How was it possible to construct a language with "generated" words and to write a text containing over 37,000 words with determinable word frequencies? Was the scribe counting the words he was writing? This seems very unlikely. A better explanation would be the assumption that it is an unintended side effect of the manufacturing or encoding process that similarly spelled words occur with predictable frequencies." (Timm 2014, p. 6f).

In other words, similar words co-occur in the Voynich text and therefore have similar word frequencies. The global "glyph distributions" are caused by the local context-dependent behavior.
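The edit distance used in the quoted passage can be reproduced with a standard Levenshtein computation. The sketch below is a generic textbook implementation, not code from Timm 2014. It counts EVA characters rather than glyphs (so 'ch' and 'sh' each count as two characters), and the helper `distance_to_anchors` is a hypothetical convenience built around the three words named in the quote.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

ANCHORS = ("daiin", "ol", "chedy")  # the three words named in the quote

def distance_to_anchors(word, anchors=ANCHORS):
    """Smallest edit distance from `word` to any anchor word."""
    return min(edit_distance(word, a) for a in anchors)

print(edit_distance("daiin", "dain"))   # one 'i' deleted
print(edit_distance("daiin", "dair"))   # one 'i' deleted, 'n' replaced by 'r'
print(distance_to_anchors("shedy"))     # one step away from 'chedy'
```

Plotting word frequency against `distance_to_anchors` for a full word list would be one way to test the claim that words spelled closer to 'daiin', 'ol' or 'chedy' tend to be more frequent.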
(10-06-2026, 08:42 PM)Torsten Wrote: In the Latin alphabet, "p" and "b" share a circle + bar but have completely different distributional contexts. In the VMS, your measurements show that glyph pairs sharing stroke structure, ch/sh (0.13), k/t (0.09), p/f (0.11), have nearly identical distributional contexts. Visual similarity predicts distributional interchangeability. That relationship doesn't exist in writing systems encoding natural language. However, it does exist in the Voynich text.

Letters with similar shapes in "classical" alphabets -- Phoenician, Demotic, Greek, Latin, etc. -- have wildly different origins; that is why they have very different statistics.  The same "feature" holds for later alphabets inspired by them.

But alphabets constructed from scratch -- like Shavian, Inuktitut, and Hangul -- often intentionally used similar shapes for similar sounds, according to logical rules. For example, a certain graphical detail may be used systematically to distinguish voiced from unvoiced consonants (like Latin "b" from "p", "d" from "t", "v" from "f", etc.), or "front" from "middle" from "back" articulation (like Mandarin pinyin "z" from "zh" from "j", or "c" from "ch" from "q", etc.), or "lax", "aspirated", and "fortis" (like in Korean "p" from "ph" from "pp" etc.).

And that is often the case for shorthand systems, which are usually phonetic but designed from scratch so that they are faster to write than any Latin-based script. They often use similar "logical" assignments of shapes to sounds to make the system easier to learn.

Voynichese was obviously designed from scratch, with no connection to previous alphabets; and may well have been more of a shorthand than an accurate phonetic alphabet.  So it is not surprising that glyphs with similar shapes have similar context statistics.

All the best, --stolfi