About the generation of similar words
quimqu > 2 hours ago
This weekend I explored the internal structure of word variants in the Voynich manuscript with a fairly simple approach. The question is how words from the same family are generated: when very similar words appear close to each other, are they produced in successive chains of transformation, or do they form radial families around a base form?
I think this distinction is important because some models of generation proposed for the Voynich (such as Torsten Timm's model) assume that a word is often generated by modifying the immediately preceding word, that is, a successive or sequential generation.
Below I briefly explain the method used and the main results.
Basic idea: When reading the Voynich text it is common to find groups of very similar words within a short space. For example: qokedy, qokeedy, qokeey, qokeeydy.
This type of sequence suggests that the words could be variants of the same base form. To study this systematically, the text is treated as a linear sequence of words. For each word:
- It is considered a potential core.
- The next 40 or so words are examined.
- If any word has a small edit distance (Levenshtein ≤ 2), it is added to the same group.
I call these local groups of variants "bursts".
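The burst search described above can be sketched in Python roughly as follows. This is only a minimal sketch, not the code behind the numbers in this post. The function names `levenshtein` and `find_bursts`, the fixed window of exactly 40 words, measuring the distance to the core only, and skipping exact repeats of the core are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_bursts(words, window=40, max_dist=2):
    """Return (core_index, [variant_indices]) for every non-isolated burst.

    Assumptions: each word is tried as a core, only the next `window`
    words are examined, and exact repeats of the core are not counted.
    """
    bursts = []
    for i, core in enumerate(words):
        stop = min(i + 1 + window, len(words))
        members = [
            j for j in range(i + 1, stop)
            if words[j] != core and levenshtein(core, words[j]) <= max_dist
        ]
        if members:  # a core with no variants is an isolated word
            bursts.append((i, members))
    return bursts


words = ["qokedy", "daiin", "qokeedy", "qokeey", "qokeeydy", "chol"]
print(find_bursts(words)[0])  # burst whose core is the first word
```

Because every word is tried as a core, bursts found this way overlap: in the toy example, "qokeedy" is a variant in the first burst and also the core of a burst of its own.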
Reconstructing relationships: Once a burst has been identified, the next step is to look at how the variants relate to each other. For each new variant, we look at whether it is more similar to the core or to a previous variant within the same burst. From this, we can reconstruct a small dependency structure. Three main patterns emerge.
- Radial (star) structure:
         variant
            |
variant -- core -- variant
            |
         variant
Most variants are derived directly from the core.
- Linear (chain) structure:
core → variant → variant → variant
Each form is derived from the previous variant.
- Mixed structure: a combination of the two.
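One way to make the three patterns concrete is sketched below. Each variant is attached to whichever earlier form (the core or a previous variant) is closest to it, and the resulting tree is then labelled. The names `build_parents`, `classify` and `depth` are hypothetical. Sending ties to the core and counting a burst with a single variant as a star are assumptions, since the text above does not fix these details.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_parents(core, variants):
    """Map each variant index (1..n) to its parent index; 0 is the core.

    Assumption: a variant attaches to an earlier variant only if that
    variant is strictly closer than the core, so ties go to the core.
    """
    forms = [core] + list(variants)
    parents = {}
    for k in range(1, len(forms)):
        best, best_d = 0, levenshtein(forms[k], forms[0])
        for p in range(1, k):
            d = levenshtein(forms[k], forms[p])
            if d < best_d:
                best, best_d = p, d
        parents[k] = best
    return parents


def classify(parents):
    """Label a burst as 'star', 'chain' or 'mixed' ('isolated' if empty)."""
    if not parents:
        return "isolated"
    if all(p == 0 for p in parents.values()):
        return "star"   # every variant hangs directly off the core
    if all(parents[k] == k - 1 for k in parents):
        return "chain"  # every form derives from the one just before it
    return "mixed"


def depth(parents):
    """Longest path, counted in links, from the core to any variant."""
    deepest = 0
    for k in parents:
        steps = 0
        while k != 0:
            k = parents[k]
            steps += 1
        deepest = max(deepest, steps)
    return deepest


parents = build_parents("qokedy", ["qokeedy", "qokeey", "qokeeydy"])
print(parents, classify(parents), depth(parents))
# → {1: 0, 2: 0, 3: 1} mixed 2
```

In this toy burst, "qokeedy" and "qokeey" attach to the core, while "qokeeydy" is closer to "qokeedy" than to the core, which gives a mixed structure of depth 2.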
Control texts: To find out what is typical of ordinary texts, the same procedure was applied to three control texts: De Docta Ignorantia (Latin), Alchemical herbal (Latin), and Culpeper's herbal (English). In these texts the bursts tend to be small and mostly radial, with very few linear chains. Typical values (structure types given as a fraction of bursts):
- star structures: ~0.68–0.80
- linear chains: ~0.003–0.007
- average burst size: ~2–3 variants
Results in the Voynich: Applying exactly the same procedure to the Voynich manuscript, we obtain approximately these global values:
- non-isolated bursts: ~4512
- average burst size: ~3.45
- average depth: ~1.56
Structural distribution:
- star: ~0.63
- mixed: ~0.36
- chain: ~0.01
This means that bursts are larger than in the control texts, linear chains are still very rare and there are more mixed structures. Mixed structures usually have this form:
core → variant
core → variant
variant → variant
That is, many variants derive directly from the core, but some also derive from other variants.
To test whether this structure depended only on the Voynich vocabulary or also on the actual order of the text, I repeated exactly the same analysis on a permuted version of the manuscript, keeping the same words within each page but shuffling their order. The result is that the bursts do not disappear, but they do become slightly smaller, shallower, and more radial. In the actual text there is a higher proportion of mixed structures and slightly more internal links between variants. This suggests that the Voynich vocabulary already favors families of similar forms, but that the actual order of the text adds additional local organization.
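The permutation control can be sketched like this. It is a minimal sketch that assumes the transcription is already available as a mapping from page identifier to its list of words. The function name `shuffle_within_pages`, the `seed` parameter and the page labels in the example are hypothetical.

```python
import random


def shuffle_within_pages(pages, seed=0):
    """Return a copy of `pages` with the word order shuffled on each page.

    `pages` maps a page identifier to its list of words. Each page keeps
    exactly the same words; only their order changes. The input is left
    untouched, and a fixed seed makes the control reproducible.
    """
    rng = random.Random(seed)
    shuffled = {}
    for page_id, words in pages.items():
        permuted = list(words)  # copy so the original order is preserved
        rng.shuffle(permuted)
        shuffled[page_id] = permuted
    return shuffled


pages = {
    "page_a": ["qokedy", "daiin", "qokeedy", "chol"],
    "page_b": ["shedy", "qokeey", "ol"],
}
control = shuffle_within_pages(pages, seed=1)
print(control)
```

Running the same burst analysis on several differently seeded shuffles, rather than a single one, would show how much of the difference from the real text is just shuffle-to-shuffle noise.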
The analysis also shows that the phenomenon is not homogeneous throughout the manuscript. The astronomical and zodiacal sections tend to have more radial bursts, while the biological-balneological section, and partly also Herbal and Text-only, show more mixed and deeper structures. This suggests that the local behavior of variant families changes according to the section.
Interpretation: The results suggest that local families of variants in the Voynich are not primarily organized as long chains of sequential transformation, that is, the dominant pattern does not appear to be simply a word repeatedly modified step by step. Instead, they are more like local fields of related forms, in which several variants form around a base form, and sometimes some of these variants also give rise to further modifications.
Schematically, the pattern would look more like this:
core
├ variant
├ variant
│ └ variant
└ variant
than a simple chain of the type:
core → variant → variant → variant
What mechanism might generate this? One possible interpretation is that the scribe was working with local families of forms rather than a simple linear succession of modified copies. In this scenario, a word appears and, within the same immediate context, variants are generated by:
- small orthographic modifications
- reuse of similar forms
- occasional modifications of other recent variants
Such a mechanism would naturally produce:
- local groupings of similar words
- larger bursts than in normal control texts
- partially interwoven, not strictly linear structures
Furthermore, the permuted-text control suggests that this pattern does not depend solely on the manuscript's word repertoire: the vocabulary alone already produces families of similar forms, while the actual order of the text seems to add a somewhat deeper and more mixed local organization.