Most discussions about the Voynich already agree on a few basic points: the text is not random, similar words tend to appear near each other, and position within the line matters. What is less clear is how this structure is actually generated.
I have been working on a pipeline for analyzing the structure of the manuscript. The goal of this analysis was not to describe patterns, but to test concrete generative hypotheses against the data. The question was always the same: if this were the real mechanism, would it reproduce what we observe? Running different models through the same pipeline makes it possible to discard entire classes of explanations, not just speculate about them.
| Hypothesis | Expected behavior | Observed failure |
| --- | --- | --- |
| Random / weak structure | No stable local similarity or positional effects | Strong clustering and positional patterns persist |
| Sequential (Markov-like) | Next token predictable from previous ones | Bigram/HMM models add little or collapse |
| Copy–modify (parent-based) | Clear local derivations, strong nearest neighbor | Generative models produce too much similarity |
| Single dominant parent | One best local candidate per token | Multiple candidates with similar scores, no clear winner |
The important point is not just that these models fail, but how they fail. Copy-and-modify mechanisms generate too much similarity, producing tight chains of derived forms that are not observed in the real text. Sequential models fail in the opposite direction, missing most of the structure entirely. The idea of a single dominant parent breaks down because the local neighborhood is too ambiguous: for most tokens, several nearby forms are equally plausible, with no clear winner. These are structural mismatches, not minor errors, and they rule out a large class of simple generative explanations.
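The sequential failure can be made concrete by measuring how much knowing the previous token actually reduces uncertainty about the next one. This is a toy sketch on invented EVA-style tokens; the estimators are the standard empirical entropies, not the pipeline's actual models.

```python
import math
from collections import Counter

def entropy(tokens):
    """Empirical Shannon entropy of a token sequence, in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(tokens):
    """H(next | previous): residual uncertainty about a token after
    seeing the token immediately before it."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    prev_counts = Counter(p for p, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (prev, nxt), c in pair_counts.items():
        h -= (c / n) * math.log2(c / prev_counts[prev])
    return h

tokens = "daiin chedy daiin shedy daiin chedy okaiin chedy".split()
h_next = entropy(tokens[1:])          # uncertainty about a token on its own
h_cond = conditional_entropy(tokens)  # uncertainty given the previous token
gain = h_next - h_cond                # small gain = bigram model adds little
```

A Markov-like mechanism predicts a large `gain`; a small one means the previous token carries little information about the next, which is the collapse described above.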
At the same time, some effects are very robust. Local similarity is real and strong: words share substrings and cluster in form space. Position within the line has a clear impact on length, prefixes, and suffixes. But these signals do not translate into a simple mechanism where one word determines the next. Token-level models struggle precisely because the system is not organized primarily as a chain of local decisions.
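The positional effect on length is easy to measure directly. The lines below are short illustrative samples standing in for a real transcription corpus.

```python
from collections import defaultdict

lines = [
    "fachys ykal ar ataiin shol".split(),
    "shory ckhy or y kair".split(),
    "daiin shckhey ckhor chol cthy".split(),
]

def length_by_position(lines):
    """Mean token length at each position in the line (line-initial = 0)."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        for pos, tok in enumerate(line):
            sums[pos] += len(tok)
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

profile = length_by_position(lines)
```

The same pattern of tabulation extends to prefix and suffix frequencies by position; a flat profile would argue against positional structure, while a consistent slope supports it.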
The structure becomes clearer when moving to the level of the full line. If lines are represented as whole objects, using their internal properties (number of tokens, length distributions, entropy, positional patterns), they fall into a small number of latent types. These types are not imposed manually, but learned directly from the text. They correspond broadly to different functional roles, but also reveal variation within them. These data-driven line types also show persistence across consecutive lines, suggesting that the manuscript is organized as sequences of line-level states, not just as a stream of loosely connected tokens.
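A minimal version of this line-level representation can be sketched as follows. The three features and the tiny k-means are stand-ins for the richer feature set and clustering used in the analysis, and the lines are toy data.

```python
import math
import random
from collections import Counter

def line_features(line):
    """Represent a line by internal properties only: token count,
    mean token length, and character entropy."""
    chars = "".join(line)
    counts = Counter(chars)
    n = len(chars)
    char_entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    mean_len = sum(len(t) for t in line) / len(line)
    return (float(len(line)), mean_len, char_entropy)

def kmeans(points, k, iters=25, seed=0):
    """Minimal k-means over feature tuples (illustrative, not optimized)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster empties.
        centers = [tuple(sum(vals) / len(vals) for vals in zip(*cl))
                   if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [min(range(k), key=lambda i: sum(
        (a - b) ** 2 for a, b in zip(p, centers[i]))) for p in points]

lines = [
    "fachys ykal ar ataiin shol shory".split(),
    "daiin shckhey ckhor chol".split(),
    "otol daiin daiin ckhey".split(),
    "sor ckhar o r y kair chtaiin".split(),
    "qokeedy qokedy chedy shedy".split(),
    "ycheey kcheey chey".split(),
]
feats = [line_features(l) for l in lines]
labels = kmeans(feats, k=2)
```

The key point the sketch preserves is that nothing about the types is imposed manually: the labels come entirely from the lines' internal properties.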
- The text is not governed by simple sequential rules. Token-to-token models fail to capture the structure, even when extended beyond basic Markov assumptions.
- It is not generated by copy-and-modify or parent-based derivation. These mechanisms overproduce similarity and impose chains that are not present in the data.
- There is no single dominant local source for most tokens. The local neighborhood is too ambiguous, with multiple equally plausible candidates.
- The strongest and most stable structure appears at the level of the full line. Lines form a small number of latent types with distinct formal profiles and non-random sequencing.
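The non-random sequencing of line types can be quantified with a simple comparison: the observed rate of same-type adjacent lines versus the rate expected if types were drawn independently. The label sequence below is invented for illustration.

```python
from collections import Counter

def persistence(labels):
    """Compare the observed rate of same-type consecutive lines with
    the rate expected under independent draws from the type frequencies."""
    same = sum(a == b for a, b in zip(labels, labels[1:]))
    observed = same / (len(labels) - 1)
    counts = Counter(labels)
    n = len(labels)
    expected = sum((c / n) ** 2 for c in counts.values())
    return observed, expected

labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
obs, exp = persistence(labels)
```

When `obs` clearly exceeds `exp`, consecutive lines share a type more often than chance allows, which is what "sequences of line-level states" means operationally.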
A useful way to think about it is the following:
- The line type defines a space of possible forms.
- The local context restricts this space further by favoring forms that are compatible with nearby words.
- But within that constrained space, the final choice is weakly determined. Many candidates are acceptable, and no single one is strongly preferred.
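This three-step picture can be made concrete: score candidates by compatibility with nearby words, then sample among them instead of taking a hard argmax. The trigram-overlap score and the temperature parameter are assumptions for illustration, not the actual mechanism.

```python
import math
import random

def ngrams(word, n=3):
    """Set of character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def compatibility(cand, context):
    """Toy compatibility score: total trigram overlap with nearby words."""
    return sum(len(ngrams(cand) & ngrams(w)) for w in context)

def weak_select(candidates, context, temperature=2.0, seed=0):
    """Sample among compatible forms rather than taking a hard argmax;
    a higher temperature means weaker selection."""
    scores = [compatibility(c, context) for c in candidates]
    weights = [math.exp(s / temperature) for s in scores]
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights)[0]

context = ["qokeedy", "chedy"]              # nearby words (toy data)
candidates = ["shedy", "qokedy", "okaiin"]  # forms allowed by the line type
scores = [compatibility(c, context) for c in candidates]
choice = weak_select(candidates, context)
```

Note the flat score distribution among the compatible candidates: several forms are acceptable, none is strongly preferred, and the sampled choice reflects that ambiguity.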
This explains why local similarity is strong but does not translate into clear parent-child relationships, and why token-level models struggle while line-level structure is much more stable.
The aim of this analysis is not to show that the Voynich is structured, which was already suspected, but to narrow down the class of mechanisms that could plausibly generate it. Simple sequential models and naive copy-and-modify processes do not fit. Models that operate on line-level states, combined with a local compatibility field and weak selection within that field, are much more consistent with the data.