The Voynich Ninja
The structure of the Voynich text and how it may be generated - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: The structure of the Voynich text and how it may be generated (/thread-5500.html)



The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

Most discussions about the Voynich already agree on a few basic points: the text is not random, similar words tend to appear near each other, and position within the line matters. What is less clear is how this structure is actually generated.

I have been working on a pipeline that lets me analyze the structure of the MS. The goal of this analysis was not merely to describe patterns, but to test concrete generative hypotheses against the data. The question was always the same: if this were the real mechanism, would it reproduce what we observe? Running different models through the same pipeline makes it possible to discard entire classes of explanations, not just speculate about them.

Hypothesis               | Expected behavior                                 | Observed failure
Random / weak structure  | No stable local similarity or positional effects  | Strong clustering and positional patterns persist
Sequential (Markov-like) | Next token predictable from previous ones         | Bigram/HMM models add little or collapse
Copy-modify (parent-based) | Clear local derivations, strong nearest neighbor | Generative models produce too much similarity
Single dominant parent   | One best local candidate per token                | Multiple candidates with similar scores, no clear winner

The important point is not just that these models fail, but how they fail. Copy-and-modify mechanisms generate too much similarity, producing tight chains of derived forms that are not observed in the real text. Sequential models fail in the opposite direction, missing most of the structure entirely. The idea of a single dominant parent breaks down because the local neighborhood is too ambiguous: for most tokens, several nearby forms are equally plausible, with no clear winner. These are structural mismatches, not minor errors, and they rule out a large class of simple generative explanations.

At the same time, some effects are very robust. Local similarity is real and strong: words share substrings and cluster in form space. Position within the line has a clear impact on length, prefixes, and suffixes. But these signals do not translate into a simple mechanism where one word determines the next. Token-level models struggle precisely because the system is not organized primarily as a chain of local decisions.

The structure becomes clearer when moving to the level of the full line. If lines are represented as whole objects, using their internal properties (number of tokens, length distributions, entropy, positional patterns), they fall into a small number of latent types. These types are not imposed manually, but learned directly from the text. They correspond broadly to different functional roles, but also reveal variation within them. These data-driven line types also show persistence across consecutive lines, suggesting that the manuscript is organized as sequences of line-level states, not just as a stream of loosely connected tokens.
  • The text is not governed by simple sequential rules. Token-to-token models fail to capture the structure, even when extended beyond basic Markov assumptions.
  • It is not generated by copy-and-modify or parent-based derivation. These mechanisms overproduce similarity and impose chains that are not present in the data.
  • There is no single dominant local source for most tokens. The local neighborhood is too ambiguous, with multiple equally plausible candidates.
  • The strongest and most stable structure appears at the level of the full line. Lines form a small number of latent types with distinct formal profiles and non-random sequencing.
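
As a toy illustration of this line-level representation (the features below are a small subset of those listed above, and the clustering is a minimal hand-rolled k-means, not the actual pipeline):

```python
import math
import random
from collections import Counter

def line_features(line):
    """Represent a line as a whole object: token count, mean token length,
    and character entropy (a small subset of the features described above)."""
    chars = "".join(line)
    counts = Counter(chars)
    n = len(chars)
    ent = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    lens = [len(t) for t in line]
    return [len(line), sum(lens) / len(lens), ent]

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means, enough to expose latent line types."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
            groups[j].append(p)
        cents = [[sum(col) / len(g) for col in zip(*g)] if g else cents[i]
                 for i, g in enumerate(groups)]
    labels = [min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
              for p in points]
    return cents, labels

lines = [["daiin", "chedy", "qokedy"], ["dain", "shedy", "qokeedy"],
         ["ol", "or"], ["al", "ar"]]
feats = [line_features(l) for l in lines]
cents, labels = kmeans(feats, k=2)
print(labels)
```

On this toy data the two long lines and the two short lines fall into different latent types; the real analysis of course uses many more features and lines.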

A useful way to think about it is the following: 

  1. The line type defines a space of possible forms. 
  2. The local context restricts this space further by favoring forms that are compatible with nearby words. 
  3. But within that constrained space, the final choice is weakly determined. Many candidates are acceptable, and no single one is strongly preferred. 

This explains why local similarity is strong but does not translate into clear parent-child relationships, and why token-level models struggle while line-level structure is much more stable.
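
A minimal sketch of this three-step view, with a hypothetical set of line types and a crude substring-based compatibility test (toy illustration only, not the actual pipeline):

```python
import random

# Hypothetical line types, each defining a space of admissible forms.
LINE_TYPES = {
    "A": ["daiin", "dain", "saiin", "aiin"],
    "B": ["chedy", "shedy", "chdy", "qokedy"],
}

def compatible(form, context, min_overlap=2):
    """Crude compatibility: share a substring of length >= min_overlap
    with at least one nearby word (stand-in for a real similarity field)."""
    for w in context:
        for i in range(len(w) - min_overlap + 1):
            if w[i:i + min_overlap] in form:
                return True
    return not context  # an empty context restricts nothing

def next_token(line_type, context, rng=random):
    # 1. The line type defines the candidate space.
    space = LINE_TYPES[line_type]
    # 2. The local context restricts it to compatible forms.
    candidates = [f for f in space if compatible(f, context)] or space
    # 3. Weak selection: near-uniform choice among the survivors.
    return rng.choice(candidates)

rng = random.Random(0)
line = []
for _ in range(5):
    line.append(next_token("A", line[-3:], rng))
print(" ".join(line))
```

The point of the sketch is only the shape of the process: the hard constraints sit in steps 1 and 2, while step 3 is deliberately close to uniform.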

With this analysis I am not trying to show that the Voynich is structured, which was already widely suspected, but to narrow down the class of mechanisms that could plausibly generate it. Simple sequential models and naive copy-and-modify processes do not fit. Models that operate at the level of line-level states, combined with a local compatibility field and weak selection within that field, are much more consistent with the data.


RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

I know my posts are quite dense and hard to read, so I will try to summarize here:

What remains consistent with the data is not a chain of transformations, but a constrained selection process. The line defines a space of valid forms, the local context restricts this space further, and the final choice is made among multiple equally compatible candidates. This shifts the problem from “how is the next token generated” to “how is the space of valid tokens defined”.

One can then formulate concrete, testable assumptions about the mechanism (I will work on these). Instead of trying to predict a single next token, the model should first approximate a candidate set, and only then model selection within that set.

Assumption                                 | Operational test
Candidate set constrained by line position | Predict top-k tokens from line features and measure recall
Local context restricts form, not identity | Build candidates by similarity and check if the real token is included
Selection within the set is weak           | Train a ranking model and measure score flatness
Substructures drive compatibility          | Use prefixes/suffixes only and test predictive power
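
The first of these tests could be operationalized roughly as follows. As a deliberately naive stand-in for "line features", this sketch uses only the position in the line and ranks candidates by how often each token occurs at that position elsewhere:

```python
from collections import Counter, defaultdict

def topk_recall(lines, k=10):
    """For each position, rank tokens by frequency at that position and
    measure how often the true token lands in the top k (leave-one-out)."""
    by_pos = defaultdict(Counter)
    for line in lines:
        for i, tok in enumerate(line):
            by_pos[i][tok] += 1
    hits = total = 0
    for line in lines:
        for i, tok in enumerate(line):
            # leave-one-out: remove this occurrence before ranking
            c = by_pos[i].copy()
            c[tok] -= 1
            top = [t for t, n in c.most_common(k) if n > 0]
            hits += tok in top
            total += 1
    return hits / total

lines = [["daiin", "chedy", "qokedy"],
         ["daiin", "shedy", "qokedy"],
         ["saiin", "chedy", "qokeedy"]]
print(topk_recall(lines, k=2))
```

A real version would condition on richer line features than raw position, but the validation target is the same: recall of the true token within a small candidate set.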

With this, a generative model would not sample tokens directly from a sequence model, but proceed in two steps: first generate or retrieve a set of compatible forms given the line and its context, then sample from this set with relatively low discrimination. The key quantity to model is no longer the probability of a token given the previous one, but the probability that a token belongs to the current feasible set.

This leads to a different type of validation. A good model is not one that predicts the exact next token, but one that consistently places the real token inside a small, well-defined candidate set.

I hope I have explained myself better. Rolleyes


RE: The Structure of the Voynich Text and How It May Be Generated - nablator - 01-04-2026

(01-04-2026, 12:16 PM)quimqu Wrote: Simple sequential models and naive copy-and-modify processes do not fit.

I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher such as [link], but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.


RE: The Structure of the Voynich Text and How It May Be Generated - quimqu - 01-04-2026

(01-04-2026, 12:50 PM)nablator Wrote: I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher such as [link], but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.

I think your proposal is interesting, especially because it goes beyond naïve copy–modify and introduces more realistic selection patterns.

The key issue, in my view, is whether it can reproduce two constraints at the same time: strong local similarity and weak determinism. In the data, many nearby forms are compatible, but there is usually no clear source or dominant parent. Most direct self-citation models tend to produce too much chaining and identifiable source–target links.

I would suggest testing two versions of your idea: a strong one where tokens are explicitly derived from source words, and a weaker one where nearby words define a candidate set, but the final choice is not tied to a specific source. The question is whether the strong version can avoid producing detectable parent-child structure.

If you agree, I will try to model them and put them in my pipeline. Will let you know!


RE: The structure of the Voynich text and how it may be generated - nablator - 01-04-2026

(01-04-2026, 01:12 PM)quimqu Wrote: Most direct self-citation models tend to produce too much chaining and identifiable source–target links.

I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).

I should try something else. For example the length of the Longest Common Subsequence (LCS) may be better suited than the edit distance to the detection of the sequential transfer pattern between lines, with similarity(line1, line2) = length of LCS(line1, line2) / average(length of line1, length of line2).
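
For what it's worth, the proposed LCS similarity is straightforward to implement. A sketch over token sequences (a character-level version would work the same way):

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence between two sequences,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def line_similarity(line1, line2):
    """similarity = len(LCS) / average line length, as proposed above."""
    avg = (len(line1) + len(line2)) / 2
    return lcs_len(line1, line2) / avg if avg else 0.0

l1 = "daiin chedy qokedy daiin".split()
l2 = "daiin shedy qokedy dain".split()
print(line_similarity(l1, l2))  # LCS is [daiin, qokedy], so 2 / 4 = 0.5
```

Unlike edit distance, the LCS is insensitive to insertions between transferred words, which is what makes it a candidate for detecting sequential transfer between lines.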

Quote:If you agree, I will try to model them and put them in my pipeline. Will let you know!

Sure. Thank you for taking the time to conduct an in-depth analysis!


RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

(01-04-2026, 03:32 PM)nablator Wrote: I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).

In a typical self-citation model, each token tends to have a dominant source, even if it is not always easy to detect.

What I observe instead is that no single source dominates: multiple nearby candidates contribute weakly, and the structure appears distributed (cloud-like) rather than chain-based.


RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

I have been doing some testing with different models.

The experiment explores how far simple generative rules can go in reproducing measurable properties of the Voynich text. Instead of focusing on a single metric, several constraints were evaluated together, including repetition rate, local similarity within a fixed window, entropy, hapax proportion, vocabulary size, and positional differences within lines.
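
For concreteness, here are sketch definitions of some of these constraints (my own simplified formulations; the pipeline's exact definitions may differ):

```python
import math
from collections import Counter

def metrics(tokens, window=5):
    """Sketch versions of several constraints evaluated together."""
    counts = Counter(tokens)
    n = len(tokens)
    # exact repetition rate: adjacent identical tokens
    repetition = sum(a == b for a, b in zip(tokens, tokens[1:])) / max(n - 1, 1)
    # local similarity: share of tokens already seen within a fixed window
    local = sum(t in tokens[max(0, i - window):i] for i, t in enumerate(tokens)) / n
    # unigram entropy in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # hapax proportion (types occurring once) and vocabulary size
    hapax = sum(c == 1 for c in counts.values()) / len(counts)
    return {"repetition": repetition, "local_similarity": local,
            "entropy": entropy, "hapax": hapax, "vocab": len(counts)}

print(metrics("daiin chedy daiin shedy qokedy chedy ol".split()))
```

Evaluating a generator means computing this whole vector on its output and on the real text, not tuning any one number in isolation.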

Several model families were tested. The simplest ones rely on global sampling, with or without small random variations. More structured versions introduce a local pool of tokens, dependence on the previous word, and eventually a bias depending on the position within the line. The most complex variant also allows a small amount of controlled novelty, generating slightly new forms rather than only reusing existing ones.

Model                     | Core idea                                                        | What it gets roughly right                          | Main failures
base                      | Samples tokens from global vocabulary with weak repetition penalty | Basic entropy levels, approximate repetition      | Too little structure, wrong similarity, poor hapax and vocabulary control, no positional behavior
base_variant              | Global sampling + occasional independent small edits             | Repetition and some local similarity                | Variation not tied to context, wrong hapax rate, no positional asymmetry, weak vocabulary structure
prev_morph                | New token often derived from previous token (local edit process) | Short-range similarity and chaining effects         | Over-relies on sequential edits, fails on repetition balance, hapax and vocabulary distribution
positional_prev_morph     | Local pool + dependence on previous token + bias by position in line | Repetition, similarity, partial positional effects | Still poor hapax rate, vocabulary size, and incorrect start/end-of-line behavior
positional_prev_morph_new | Same as above + small probability of generating new forms (controlled novelty) | Best overall balance across repetition, similarity and entropy | Large errors remain in hapax rate, vocabulary, word-length asymmetry and positional distributions

The results show that basic properties such as repetition and short-range similarity are relatively easy to reproduce. Many different models can be tuned to reach similar values. The situation changes when additional constraints are included. Differences between models become clearer, and none of them manages to satisfy all constraints at once.

The models that perform best are those combining local reuse, dependence on the previous token, and some positional sensitivity. Adding a small amount of new word formation improves the fit slightly. However, even these models still deviate significantly in several key aspects, especially in hapax rate, vocabulary size, and how words behave at the beginning and end of lines.

The main conclusion is limited but fairly clear: simple local mechanisms are sufficient to reproduce part of the observable structure, but they fail when multiple constraints are considered simultaneously. The difficulty lies not in matching individual statistics, but in matching all of them at once.


RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

Since the last experiment, I moved away from purely sequential models and tried a different approach based on local compatibility rather than direct copying. Instead of generating each token from the previous one, the model builds a local pool of compatible forms and selects from it, with a weak bias from recent context.
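
A minimal sketch of this selection-from-a-local-pool idea, using difflib's similarity ratio as a stand-in for the real compatibility measure (toy vocabulary; the threshold and bias values are arbitrary):

```python
import difflib
import random

def pool(context, vocab, threshold=0.5):
    """Local pool: vocabulary forms sufficiently similar to a recent token.
    SequenceMatcher.ratio() stands in for the pipeline's similarity measure."""
    out = set()
    for w in context:
        for v in vocab:
            if difflib.SequenceMatcher(None, w, v).ratio() >= threshold:
                out.add(v)
    return sorted(out)

def generate(vocab, length, window=4, recency_bias=0.2, seed=0):
    """Generate a line by selecting from the local compatible pool,
    with only a weak bias from the most recent token."""
    rng = random.Random(seed)
    line = [rng.choice(vocab)]
    while len(line) < length:
        candidates = pool(line[-window:], vocab) or vocab
        if rng.random() < recency_bias:
            # weak bias: occasionally narrow to forms closest to the last token
            last = line[-1]
            candidates = sorted(
                candidates,
                key=lambda v: -difflib.SequenceMatcher(None, last, v).ratio())[:3]
        line.append(rng.choice(candidates))
    return line

vocab = ["daiin", "dain", "saiin", "chedy", "shedy", "qokedy", "qokeedy", "ol"]
print(" ".join(generate(vocab, 8)))
```

The key design choice is that no token has a designated source: the whole recent window feeds one pool, and selection within it is mostly uniform.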

Aspect                    | Earlier models                | New local compatibility model        | Remaining gap
Generation mechanism      | Sequential or global sampling | Selection from local compatible pool | Internal structure of the pool still unclear
Exact repetition          | Too high or unstable          | Very close to real text              | Largely solved
Local similarity          | Either too weak or too forced | Close to real behavior               | Slightly too diffuse
Vocabulary control        | Collapse or explosion         | Balanced regime                      | Still somewhat smaller than real
Family persistence        | Too strong (chaining)         | Close to real low persistence        | Mostly solved
Number of active families | Very low                      | Moderate increase                    | Still far below real text
Family distribution       | Highly concentrated           | Less dominated by a single family    | Entropy still too low
Within-family coherence   | Not well modeled              | Partially captured                   | Families too loose internally
Overall conclusion        | Sequential rules insufficient | Local selection works better         | Need better structure of overlapping families


This changes the behavior quite a lot. The model no longer gets stuck repeating the same forms, and exact repetition is now very close to the real text. It also avoids the opposite problem of exploding vocabulary. In that sense, the global balance is much better than in the earlier models.

More importantly, it reproduces the idea that tokens are chosen from a local set of similar options rather than derived step by step from a single source. This matches earlier observations that unordered local context works as well as, or better than, strictly sequential context.

However, new limitations appear. The model still uses fewer active families than the real text and distributes probability too unevenly across them. At the same time, tokens within a family are not as tightly related as in the manuscript, and local similarity is slightly too diffuse.

So, it is possible to reproduce repetition, local similarity and entropy at the same time, but only if generation is treated as selection from a constrained local space. What is still missing is how that space is structured internally, in particular how many overlapping groups remain active and how tightly they are defined.


RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

From my point of view, these experiments would suggest two different things, and it is important to keep them separate.

On the one hand, some patterns seem relatively robust. The Voynich text would appear to combine very low exact repetition with clear local similarity. Words would tend to resemble nearby words, but not in a strictly sequential or chain-like way. In that sense, models based on global sampling or simple copying from the previous token would not be sufficient, while a model based on selecting from a small local set of compatible forms would seem to reproduce these properties more naturally.

On the other hand, I am aware that part of the structure we observe may come from the assumptions we impose. What counts as a “word” depends on the transliteration and the use of spaces. Similarity is measured here using edit distance, which is only one possible definition. The notion of “families” is also introduced by the model and may not correspond to real discrete units in the manuscript. Different choices in these aspects could lead to different results.

So the model would mainly be useful in a limited sense. It would help rule out simpler mechanisms and suggest that local constraints are important. But it would not justify stronger claims about the exact generative process or the existence of well-defined families. Some of that apparent structure could still be an artifact of how the text is represented and analyzed.


RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

I ran some more tests and models today:

Category                 | Finding                                                                 | What it suggests
Robust                   | Local similarity is higher than in global and positional shuffles (across multiple similarity measures) | There is a real local constraint structure, not an artefact of one metric
Robust                   | Best match comes from the local window, not from the previous token (2.19 vs 3.89) | The system behaves like a local pool, not a Markov chain
Robust                   | Past vs future is symmetric (difference ≈ 0)                            | No strong directional generation signal
Robust                   | Within-line shuffle preserves most of the signal                        | Order inside the line is less important than the set of nearby tokens
Representation-dependent | Character windows (char4, char5) still show structure                   | The effect is not purely due to spaces or tokenization
Representation-dependent | Token bigrams destroy the signal                                        | Not every segmentation preserves the structure
Weak effects             | Middle of line slightly more similar and dense than start/end           | There is positional influence, but it is limited
Weak effects             | Candidate pool size is stable (~19) across positions                    | No strong evidence that position changes the size of the allowed set
Model limitation         | Local model overshoots similarity (too low Lev, too high density)       | Simple local reuse is too strong and too compact

The text behaves like it is generated from a local set of compatible forms. What matters is the neighbourhood, not the exact order and not just the previous word. This is why bag-of-context works much better than sequential dependence.
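
A sketch of the window-vs-previous-token comparison from the table, on toy data (the 2.19 vs 3.89 figures come from the real analysis, not from this example):

```python
import random

def lev(a, b):
    """Plain Levenshtein edit distance, row-by-row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def mean_best_match(tokens, window=5):
    """Mean distance from each token to its closest neighbour in the window."""
    dists = []
    for i, t in enumerate(tokens):
        neigh = tokens[max(0, i - window):i]
        if neigh:
            dists.append(min(lev(t, w) for w in neigh))
    return sum(dists) / len(dists)

def mean_prev_match(tokens):
    """Mean distance from each token to just the previous token."""
    d = [lev(a, b) for a, b in zip(tokens, tokens[1:])]
    return sum(d) / len(d)

toks = "daiin chedy dain shedy qokedy saiin chey qokeedy".split()
print(mean_best_match(toks), mean_prev_match(toks))

# Shuffle baseline: the same tokens in random order
rng = random.Random(0)
shuffled = toks[:]
rng.shuffle(shuffled)
print(mean_best_match(shuffled))
```

By construction the best window match is never worse than the previous-token match; the interesting question, as in the table, is how large the gap is relative to shuffled baselines.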

At the same time, the system is not trivial. The signal survives different similarity definitions and even different segmentations, but not all of them. So it is not just noise or a simple reshuffling effect.

The main limitation of the current models is that they are too “tight”. They create clusters that are more coherent than in the real text. The Voynich seems to operate in a looser space, where many forms are allowed but still constrained.

So the useful takeaway is not that I have a good generator yet, but that the tests are narrowing the space of plausible mechanisms.