The Voynich Ninja

About the generation of similar words

This weekend I have been exploring the internal structure of word variants in the Voynich manuscript with a fairly simple approach. The goal was to answer a question about how words from the same family are generated: when very similar words appear close to each other, are they generated in successive chains of transformation, or do they form radial families around a base form?

I think this distinction is important because some models of generation proposed for the Voynich (such as Torsten Timm's model) assume that a word is often generated by modifying the immediately preceding word, that is, a successive or sequential generation.

Below I briefly explain the method used and the main results.


Basic idea: When reading the Voynich text it is common to find groups of very similar words within a short space. For example: qokedy, qokeedy, qokeey, qokeeydy.

This type of sequence suggests that they could be variants of the same base form. To study this systematically, the text is simply treated as a linear sequence of words. For each word:

- It is considered a potential core.
- The next 40 or so words are examined.
- If any word has a small edit distance (Levenshtein ≤ 2), it is added to the same group. I call these local groups of variants "bursts".
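
As a rough illustration of the three steps above (not the code actually used for the numbers in this thread), burst detection could be sketched like this. The function names, the choice to compare every candidate only against the core, and the choice to skip exact repeats of the core are assumptions of this sketch.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def find_bursts(words, window=40, max_dist=2):
    """Treat every word as a potential core and collect close words after it.

    Assumptions of this sketch: candidates are compared with the core only,
    and exact repeats of the core are not counted as variants.
    """
    bursts = []
    for i, core in enumerate(words):
        following = words[i + 1:i + 1 + window]
        variants = [(j, w) for j, w in enumerate(following, start=i + 1)
                    if w != core and levenshtein(core, w) <= max_dist]
        if variants:  # keep only non-isolated cores
            bursts.append({"core": (i, core), "variants": variants})
    return bursts


text = ["qokedy", "daiin", "qokeedy", "qokeey", "chol"]
for burst in find_bursts(text):
    print(burst["core"], [w for _, w in burst["variants"]])
```

Because every position is tried as a core, the same words can show up in several overlapping bursts; how overlaps are merged or counted is left open here.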

Reconstructing relationships: Once a burst has been identified, the next step is to look at how the variants relate to each other. For each new variant, we look at whether it is more similar to the core or to a previous variant within the same burst. From this, we can reconstruct a small dependency structure. Three main patterns emerge.

- Radial (star) structure:

            variant
                 |
variant — core — variant
                 |
            variant

Most variants are derived directly from the core.

- Linear (chain) structure:

core → variant → variant → variant

Each form is derived from the previous variant.

- Mixed structure: a combination of the two.
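
A minimal sketch of how the dependency structure and the three labels could be computed is given below. It is only an illustration under stated assumptions: each variant is attached to the closest earlier form, ties go to the earliest form (so the core wins a tie), and a burst with a single variant is counted as a star. The names `build_parents`, `classify` and `depth` are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_parents(core, variants):
    """parents[k] is the index, in [core, *variants], of the parent of variants[k]."""
    forms = [core]
    parents = []
    for v in variants:
        dists = [levenshtein(v, f) for f in forms]
        parents.append(dists.index(min(dists)))  # earliest form wins ties (assumption)
        forms.append(v)
    return parents


def classify(parents):
    """Label a burst as star, chain or mixed from its parent indices."""
    if all(p == 0 for p in parents):
        return "star"   # every variant hangs directly off the core
    if all(p == k for k, p in enumerate(parents)):
        return "chain"  # every variant hangs off the form just before it
    return "mixed"


def depth(parents):
    """Longest path from the core to any variant."""
    d = [0]
    for p in parents:
        d.append(d[p] + 1)
    return max(d)


parents = build_parents("qokedy", ["qokeedy", "qokeey", "qokeeydy"])
print(parents, classify(parents), depth(parents))
```

For the example words from the first post this gives two variants attached to the core and one attached to another variant, that is, a mixed burst of depth 2.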

Control texts: To find out what is normal in real texts, the same procedure was applied to three control texts: De Docta Ignorantia (Latin), Alchemical herbal (Latin), and Culpeper's herbal (English). In these texts the bursts tend to be small and mostly radial, with very few linear chains. Typical values:

- star structures: ~0.68–0.80
- linear chains: ~0.003–0.007
- average burst size: ~2–3 variants

Results in the Voynich: Applying exactly the same procedure to the Voynich manuscript, we obtain approximately these global values:

- non-isolated bursts: ~4512
- average burst size: ~3.45
- average depth: ~1.56

Structural distribution:

- star: ~0.63
- mixed: ~0.36
- chain: ~0.01

This means that bursts are larger than in the control texts, linear chains are still very rare and there are more mixed structures. Mixed structures usually have this form:

core → variant
core → variant
variant → variant

That is, many variants derive directly from the core, but some also derive from other variants.

To test whether this structure depended only on the Voynich vocabulary or also on the actual order of the text, I repeated exactly the same analysis on a permuted version of the manuscript, keeping the same words within each page but shuffling their order. The result is that the bursts do not disappear, but they do become slightly smaller, shallower, and more radial. In the actual text there is a higher proportion of mixed structures and slightly more internal links between variants. This suggests that the Voynich vocabulary already favors families of similar forms, but that the actual order of the text adds additional local organization.
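
The permutation control described above can be sketched as follows. The data layout (a dict from page id to a list of words) and the function name are assumptions for illustration only.

```python
import random


def shuffle_within_pages(pages, seed=0):
    """Return a copy of the text with word order shuffled inside each page.

    `pages` maps a page id to its list of words (hypothetical layout).
    Each page keeps exactly the same words; only their order changes.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    shuffled = {}
    for page_id, words in pages.items():
        words = list(words)    # copy, so the original text is left untouched
        rng.shuffle(words)
        shuffled[page_id] = words
    return shuffled


pages = {"f1r": ["qokedy", "daiin", "qokeedy", "chol"],
         "f1v": ["shedy", "qokeey", "ar"]}
control = shuffle_within_pages(pages, seed=42)
print(control)
```

Running the same burst analysis on several such shuffles (different seeds) gives a spread of values to compare against the real text, rather than a single number.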

The analysis also shows that the phenomenon is not homogeneous throughout the manuscript. The astronomical and zodiacal sections tend to have more radial bursts, while the biological-balneological section, and partly also Herbal and Text-only, show more mixed and deeper structures. This suggests that the local behavior of variant families changes according to the section.

Interpretation: The results suggest that local families of variants in the Voynich are not primarily organized as long chains of sequential transformation, that is, the dominant pattern does not appear to be simply a word repeatedly modified step by step. Instead, they are more like local fields of related forms, in which several variants form around a base form, and sometimes some of these variants also give rise to further modifications.

Schematically, the pattern would look more like this:

core
├ variant
├ variant
│ └ variant
└ variant

than a simple chain of the type:

core → variant → variant → variant

What mechanism might generate this? One possible interpretation is that the scribe was working with local families of forms rather than a simple linear succession of modified copies. In this scenario, a word appears and, within the same immediate context, variants are generated by:

- small orthographic modifications
- reuse of similar forms
- occasional modifications of other recent variants

Such a mechanism would naturally produce:

- local groupings of similar words
- larger bursts than in normal control texts
- partially interwoven, not strictly linear structures

Furthermore, the permuted text control suggests that this pattern does not depend solely on the manuscript's word repertoire. The Voynich vocabulary already favors families of similar forms, but the actual order of the text seems to add a somewhat deeper and more mixed local organization.
Quote:I think this distinction is important because some models of generation proposed for the Voynich (such as Torsten Timm's model) assume that a word is often generated by modifying the immediately preceding word, that is, a successive or sequential generation.

There is no preference for the immediately preceding word IIRC. Maybe some option in Torsten Timm's generator?

I was thinking about a possible practical way to prevent long chains of transformations with gallows, as a generation counter or "mutated" mark. Not sure how...
This is a good observation.
Actually it is not a novel idea; rather, it confirms what many people (including me) have noticed.

Voynichese words form visual clusters on the pages, and the relation is not only along the chain direction but two-dimensional. So a word may be "inspired" not only by the previous word but also, for example, by the word above it.

But human "impressions" are one thing, and proper statistical proof with Levenshtein distances is another.

You could even write an article about it: a proof that Voynichese has much "stronger" clusters of similar words than natural languages.
We could then always quote it when someone doubts it ;)
@quimqu

This is an interesting analysis. I would like to clarify one point about the self-citation model, since I think there may be a misunderstanding about how it works.

The self-citation model does not assume that a word is generated by modifying the immediately preceding word. It describes a scribe who copies from what is already visible on the page. The source word could be the previous word, but it could equally be a word three lines up or at the beginning of the same paragraph — any word within the scribe's visual field.

This means the model naturally predicts the radial/mixed pattern you found rather than strictly linear chains. A scribe looking at a page with "qokedy" visible in multiple places would produce variants that branch from that core form — not a sequential chain where each modification derives only from the word immediately before it. The "core" in your burst analysis corresponds to a frequently visible word that the scribe uses as a source multiple times. It is also handy to copy a word from the same position some lines above, which would further contribute to radial structures around a common source. In Timm (2014) I found that similar glyph groups appear above each other (in consecutive lines at the same position) twice as often as they appear side by side. This vertical copying pattern would naturally produce the radial bursts you observe.

Additionally, the model includes the possibility of combining parts of two visible source words to create a new word. In Timm & Schinner (2020) we describe this as: "Combine two source words to create a new word. As an example, the two words <chol> and <daiin> combine to <choldaiin> or <cholaiin>." This mechanism would also contribute to the mixed structures you observe, since a variant derived from two sources wouldn't fit neatly into either a purely radial or a purely linear pattern.

Your finding that bursts are larger in the Voynich than in control texts (3.45 vs 2-3 variants) and that the actual word order adds local organization beyond what the vocabulary alone produces is consistent with this mechanism. The vocabulary is built through self-citation across the text, and the local order reflects the scribe's immediate visual context.

Your permutation test is a nice way to separate these two effects.
@quimqu,

did you check whether the core words tend to be more frequent than the variants?
I can think of some reasons why that would be somewhat expected, and it would perhaps be surprising if not.
Also 'no tendency' could be an interesting situation.
(15-03-2026, 08:17 PM)quimqu Wrote: Basic idea: When reading the Voynich text it is common to find groups of very similar words within a short space. For example: qokedy, qokeedy, qokeey, qokeeydy.
[...]
Reconstructing relationships: Once a burst has been identified, the next step is to look at how the variants relate to each other. For each new variant, we look at whether it is more similar to the core or to a previous variant within the same burst. From this, we can reconstruct a small dependency structure.

The problem with "bursts" detection by edit distance ≤ 2 is that it adds a lot of false positives to the group, unrelated words that happen to be "similar", especially short words like "or" and "dy" that are as dissimilar as they can be.

With such an unreliable definition I'm afraid that the dependency reconstruction is mostly illusory.
Quote:The problem with "bursts" detection by edit distance ≤ 2 is that it adds a lot of false positives to the group

That is a valid point. On the other hand Voynich doesn't have so many 2 letter words, right?

I would expect that the observed effect will survive even if we totally eliminate 2 letter words or even 3 letter words from the analysis.

@Quimqu, maybe you could perform such analysis?
(16-03-2026, 12:04 PM)Rafal Wrote: @Quimqu, maybe you could perform such analysis?

It's in the oven ;)
(16-03-2026, 08:41 AM)nablator Wrote: The problem with "bursts" detection by edit distance ≤ 2 is that it adds a lot of false positives to the group, unrelated words that happen to be "similar", especially short words like "or" and "dy" that are as dissimilar as they can be.

With such an unreliable definition I'm afraid that the dependency reconstruction is mostly illusory.

The criticism about false positives with raw Levenshtein distance ≤ 2 is valid, especially for short words. In my own implementation I addressed this by using a weighted edit distance that takes into account which glyphs the scribe would plausibly substitute for each other. Substituting similar glyphs (e.g. o/a/y, d/l/s, k/t/p/f) costs 1, while substituting dissimilar glyphs costs 2. This reduces false positives because two short words like "or" and "dy" score a high distance even though their raw Levenshtein distance is low.

The implementation is available at:
You are not allowed to view links. Register or Login to view.
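
For readers who cannot follow the link, the weighted distance described in this post can be sketched roughly as below. This is not the linked implementation. The glyph groups are the examples given in the post; treating every EVA character as one glyph and charging 1 for an insertion or deletion are assumptions of this sketch.

```python
# Groups of glyphs treated as similar (taken from the examples in the post).
SIMILAR_GROUPS = ["oay", "dls", "ktpf"]
GROUP_OF = {g: i for i, group in enumerate(SIMILAR_GROUPS) for g in group}


def sub_cost(x, y):
    """0 for identical glyphs, 1 for glyphs in the same group, 2 otherwise."""
    if x == y:
        return 0
    gx, gy = GROUP_OF.get(x), GROUP_OF.get(y)
    return 1 if gx is not None and gx == gy else 2


def weighted_distance(a, b, indel=1):
    """Edit distance with glyph-dependent substitution costs.

    The insertion/deletion cost `indel` is an assumption; the post only
    specifies the substitution costs.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + indel,                 # delete ca
                            curr[j - 1] + indel,             # insert cb
                            prev[j - 1] + sub_cost(ca, cb))) # substitute
        prev = curr
    return prev[-1]


print(weighted_distance("qokedy", "qotedy"))  # k/t are in the same group
print(weighted_distance("or", "dy"))          # above a threshold of 2
```

With these costs "or" and "dy" come out at 3, so they no longer pass a threshold of 2, while a k/t swap in a longer word still costs only 1.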
Hi everyone,

First of all, thank you for your comments.

I have continued some experiments on the proximity between Voynich-like words and wanted to share some results that I think can contribute to the discussion.

Following Nablator's observation, I have repeated the experiments excluding all tokens of less than three characters and using a stricter similarity rule for short words. Specifically, for three-character words I only consider Levenshtein distance = 1, while for longer words I allow up to Levenshtein = 2.
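
In code, this stricter rule could look roughly like the sketch below. One point is not specified above and is an assumption here: when a three-character word is compared with a longer one, the shorter word decides the threshold. The function name `is_similar` is hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_similar(a, b, min_len=3):
    """Length-dependent similarity rule.

    - words shorter than `min_len` are never linked
    - if the shorter word has exactly 3 characters, only distance 1 counts
      (using the shorter word for mixed-length pairs is an assumption)
    - otherwise distances 1 and 2 count
    """
    if a == b or len(a) < min_len or len(b) < min_len:
        return False
    limit = 1 if min(len(a), len(b)) == 3 else 2
    return levenshtein(a, b) <= limit


print(is_similar("or", "dy"))      # False: both words are too short
print(is_similar("dal", "dar"))    # True: three characters, distance 1
print(is_similar("dal", "dor"))    # False: three characters, distance 2
print(is_similar("chol", "shor"))  # True: longer words, distance 2
```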


With this filtering all connections between short words disappear. As expected, the bursts become noticeably smaller (the average size drops to about 2.4, from around 5–6 before). However, several thousand small local clusters of similar words still appear. Therefore, the phenomenon does not seem to be simply an artifact caused by very short words.

Note: I worked with the CUVA transliteration, as it seems to slightly accentuate the results. As you know, CUVA is essentially a visual transformation of EVA that groups together some graphically very similar characters, which seems more appropriate to me for an analysis based on formal similarity. But as said, results in EVA are very similar.

Experiments:

1. Spatial proximity of similar words: The first experiment measures the spatial distance between similar words within the page. For each word I search for the closest similar word and measure the distance in two dimensions.

To check whether the result could simply be an effect of the overall word frequencies, I compared the real text with permuted versions of the text that maintain the same word distribution and line structure. The result is that in the real text similar words appear systematically closer than in the permutations.

For example, the mean Manhattan distance to the nearest similar word is:

- real text: ≈ 3.11
- permuted text: ≈ 3.39

The median also drops from 3 in the permuted text to 2 in the real text. In addition, there are more neighbors at very small distances (1 to 2 positions).

2. Direction of this proximity: I also looked at the direction of the nearest similar neighbor. If the main mechanism were copying words from the previous line, we would expect to see many exact column matches, that is, variants located exactly under the original word.

What appears is a little different. Exact column matches do not increase much with respect to permutations. However, the probability that the similar word appears in a very close column, for example a column to the right or left, does increase.

In other words, variants tend to appear in the same vertical area of the text, but not necessarily exactly under the previous word. The pattern seems more like local proximity than strict vertical copying.
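
The measurements in experiments 1 and 2 can be sketched as a nearest-neighbor search on a (line, word position) grid. This is only an illustration, not the code behind the numbers above. It assumes that a page is a list of lines, that the "column" is the word's index within its line, that similar means a plain Levenshtein distance of 1 or 2, and that identical words are skipped.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def nearest_similar_offsets(page, max_dist=2):
    """For each word, find the closest similar word on the same page.

    Returns {(line, column): (manhattan, line_offset, column_offset)}.
    Words without any similar neighbor are left out.
    """
    tokens = [(r, c, w) for r, line in enumerate(page) for c, w in enumerate(line)]
    offsets = {}
    for r, c, w in tokens:
        best = None
        for r2, c2, w2 in tokens:
            if w2 == w or levenshtein(w, w2) > max_dist:
                continue  # skips the word itself and exact repeats (assumption)
            d = abs(r2 - r) + abs(c2 - c)
            if best is None or d < best[0]:
                best = (d, r2 - r, c2 - c)
        if best is not None:
            offsets[(r, c)] = best
    return offsets


page = [["qokedy", "daiin", "chol"],
        ["shedy", "qokeedy", "ar"]]
result = nearest_similar_offsets(page)
mean_manhattan = sum(d for d, _, _ in result.values()) / len(result)
print(result, mean_manhattan)
```

The column offset is what experiment 2 looks at: an offset of 0 is an exact column match, while -1 or +1 is a neighbor one column to the left or right. Comparing the histogram of these offsets between the real text and permuted versions gives the kind of contrast described above.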

3. Proximity in the flow of the text: If we look at the distance between variants in the linear flow of the text (in number of tokens), a similar pattern also appears. Variants tend to appear within relatively short windows of words.

This is compatible with a process with short textual memory, where one word can influence those written shortly afterwards.


4. Regarding René's question: René was asking if the “core word” tends to be more frequent than the variants.

In this study, I defined the core of the burst as the first word that appears in the sequence, because the goal is to see if the variants are generated from previous words in the text flow. Defining the core as the medoid (the word with the smallest total distance to the set), for example, completely eliminates directionality and is therefore not suitable for this type of question.
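
To make the difference between the two definitions concrete, here is a small sketch. The function names are hypothetical, and ties for the medoid are broken by taking the earliest word, which is an assumption.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def first_word_core(burst):
    """Core used in this study: the first word of the burst in text order."""
    return burst[0]


def medoid_core(burst):
    """Alternative core: the word with the smallest total distance to the set.

    The result does not depend on where the word sits in the text flow,
    apart from tie-breaking (earliest word wins, an assumption).
    """
    return min(burst, key=lambda w: sum(levenshtein(w, other) for other in burst))


burst = ["qokeeydy", "qokedy", "qokeedy", "qokeey"]  # words in text order
print(first_word_core(burst), medoid_core(burst))
```

In this toy burst the first word is "qokeeydy" while the medoid is "qokeedy", which shows how the medoid picks the most central form regardless of which word was written first.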

With this definition the results are:

Number of bursts analyzed: 3774
average frequency of the core ≈ 6
average frequency of the variants ≈ 70

Only about 5% to 10% of the bursts have a core that is more frequent than its members.
That is, on average the variants are much more frequent than the initial word of the burst.
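
A comparison of this kind could be computed along the following lines. This is a sketch under assumptions: frequencies are counted over the whole text, and "more frequent than its members" is read as "more frequent than the mean frequency of its variants". The data layout and function name are hypothetical.

```python
from collections import Counter


def core_more_frequent_share(bursts, words):
    """Share of bursts whose core is more frequent than its variants on average.

    `bursts` is a list of (core, variants) pairs and `words` is the full text
    as a list of tokens (both layouts are assumptions of this sketch).
    """
    freq = Counter(words)
    hits = 0
    for core, variants in bursts:
        mean_variant_freq = sum(freq[v] for v in variants) / len(variants)
        if freq[core] > mean_variant_freq:
            hits += 1
    return hits / len(bursts)


words = ["qokey"] + ["qokedy"] * 3 + ["daiin"] * 4 + ["dain"]
bursts = [("qokey", ["qokedy"]), ("daiin", ["dain"])]
print(core_more_frequent_share(bursts, words))
```

Other readings are possible, for example comparing the core with the most frequent variant instead of the mean, and they could give a different percentage.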

5. Provisional conclusion: The experiments seem to show three fairly consistent properties:

- variants appear spatially clustered
- this clustering is mostly local
- the first word of a burst is rarely the most frequent form of the set

This suggests some kind of local variant generation mechanism, but does not clearly point to strict vertical copying.

My (repetitive) intuition says that I have captured in numbers what other researchers had already detected in a more visual and less computational way. Nothing new with the analysis of the Voynich text…

If anyone can think of other controls or tests that could help to better discriminate between possible models, they would be most welcome.