The Voynich Ninja
The structure of the Voynich text and how it may be generated - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: The structure of the Voynich text and how it may be generated (/thread-5500.html)



The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

Most discussions about the Voynich already agree on a few basic points: the text is not random, similar words tend to appear near each other, and position within the line matters. What is less clear is how this structure is actually generated.

I have been working on a pipeline that lets me analyze the structure of the MS. The goal of this analysis was not merely to describe patterns, but to test concrete generative hypotheses against the data. The question was always the same: if this were the real mechanism, would it reproduce what we observe? Running different models through the same pipeline makes it possible to discard entire classes of explanations, not just speculate about them.

Hypothesis               | Expected behavior                                 | Observed failure
Random / weak structure  | No stable local similarity or positional effects  | Strong clustering and positional patterns persist
Sequential (Markov-like) | Next token predictable from previous ones         | Bigram/HMM models add little or collapse
Copy-modify (parent-based) | Clear local derivations, strong nearest neighbor | Generative models produce too much similarity
Single dominant parent   | One best local candidate per token                | Multiple candidates with similar scores, no clear winner

The important point is not just that these models fail, but how they fail. Copy-and-modify mechanisms generate too much similarity, producing tight chains of derived forms that are not observed in the real text. Sequential models fail in the opposite direction, missing most of the structure entirely. The idea of a single dominant parent breaks down because the local neighborhood is too ambiguous: for most tokens, several nearby forms are equally plausible, with no clear winner. These are structural mismatches, not minor errors, and they rule out a large class of simple generative explanations.

At the same time, some effects are very robust. Local similarity is real and strong: words share substrings and cluster in form space. Position within the line has a clear impact on length, prefixes, and suffixes. But these signals do not translate into a simple mechanism where one word determines the next. Token-level models struggle precisely because the system is not organized primarily as a chain of local decisions.

The structure becomes clearer when moving to the level of the full line. If lines are represented as whole objects, using their internal properties (number of tokens, length distributions, entropy, positional patterns), they fall into a small number of latent types. These types are not imposed manually, but learned directly from the text. They correspond broadly to different functional roles, but also reveal variation within them. These data-driven line types also show persistence across consecutive lines, suggesting that the manuscript is organized as sequences of line-level states, not just as a stream of loosely connected tokens.
  • The text is not governed by simple sequential rules. Token-to-token models fail to capture the structure, even when extended beyond basic Markov assumptions.
  • It is not generated by copy-and-modify or parent-based derivation. These mechanisms overproduce similarity and impose chains that are not present in the data.
  • There is no single dominant local source for most tokens. The local neighborhood is too ambiguous, with multiple equally plausible candidates.
  • The strongest and most stable structure appears at the level of the full line. Lines form a small number of latent types with distinct formal profiles and non-random sequencing.
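
As a toy illustration of this line-level representation (the features below are a small subset of those listed above, and the clustering is a minimal hand-rolled k-means, not the actual pipeline):

```python
import math
import random
from collections import Counter

def line_features(line):
    """Represent a line as a whole object: token count, mean token length,
    and character entropy (a small subset of the features described above)."""
    chars = "".join(line)
    counts = Counter(chars)
    n = len(chars)
    ent = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    lens = [len(t) for t in line]
    return [len(line), sum(lens) / len(lens), ent]

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means, enough to expose latent line types."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
            groups[j].append(p)
        cents = [[sum(col) / len(g) for col in zip(*g)] if g else cents[i]
                 for i, g in enumerate(groups)]
    labels = [min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
              for p in points]
    return cents, labels

lines = [["daiin", "chedy", "qokedy"], ["dain", "shedy", "qokeedy"],
         ["ol", "or"], ["al", "ar"]]
feats = [line_features(l) for l in lines]
cents, labels = kmeans(feats, k=2)
print(labels)
```

On this toy data the two long lines and the two short lines fall into different latent types; the real analysis of course uses many more features and lines.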

A useful way to think about it is the following: 

  1. The line type defines a space of possible forms. 
  2. The local context restricts this space further by favoring forms that are compatible with nearby words. 
  3. But within that constrained space, the final choice is weakly determined. Many candidates are acceptable, and no single one is strongly preferred. 

This explains why local similarity is strong but does not translate into clear parent-child relationships, and why token-level models struggle while line-level structure is much more stable.
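
A minimal sketch of this three-step view, with a hypothetical set of line types and a crude substring-based compatibility test (toy illustration only, not the actual pipeline):

```python
import random

# Hypothetical line types, each defining a space of admissible forms.
LINE_TYPES = {
    "A": ["daiin", "dain", "saiin", "aiin"],
    "B": ["chedy", "shedy", "chdy", "qokedy"],
}

def compatible(form, context, min_overlap=2):
    """Crude compatibility: share a substring of length >= min_overlap
    with at least one nearby word (stand-in for a real similarity field)."""
    for w in context:
        for i in range(len(w) - min_overlap + 1):
            if w[i:i + min_overlap] in form:
                return True
    return not context  # an empty context restricts nothing

def next_token(line_type, context, rng=random):
    # 1. The line type defines the candidate space.
    space = LINE_TYPES[line_type]
    # 2. The local context restricts it to compatible forms.
    candidates = [f for f in space if compatible(f, context)] or space
    # 3. Weak selection: near-uniform choice among the survivors.
    return rng.choice(candidates)

rng = random.Random(0)
line = []
for _ in range(5):
    line.append(next_token("A", line[-3:], rng))
print(" ".join(line))
```

The point of the sketch is only the shape of the process: the hard constraints sit in steps 1 and 2, while step 3 is deliberately close to uniform.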

With this analysis I am not trying to show that the Voynich is structured, which was already widely suspected, but to narrow down the class of mechanisms that could plausibly generate it. Simple sequential models and naive copy-and-modify processes do not fit. Models that operate at the level of line-level states, combined with a local compatibility field and weak selection within that field, are much more consistent with the data.


RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

I know my posts are quite dense and hard to read, so I will try to summarize here:

What remains consistent with the data is not a chain of transformations, but a constrained selection process. The line defines a space of valid forms, the local context restricts this space further, and the final choice is made among multiple equally compatible candidates. This shifts the problem from “how is the next token generated” to “how is the space of valid tokens defined”.

One can then formulate concrete, testable assumptions about the mechanism (I will work on these). Instead of trying to predict a single next token, the model should first approximate a candidate set, and only then model selection within that set.

Assumption                                 | Operational test
Candidate set constrained by line position | Predict top-k tokens from line features and measure recall
Local context restricts form, not identity | Build candidates by similarity and check if the real token is included
Selection within the set is weak           | Train a ranking model and measure score flatness
Substructures drive compatibility          | Use prefixes/suffixes only and test predictive power
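
The first of these tests could be operationalized roughly as follows. As a deliberately naive stand-in for "line features", this sketch uses only the position in the line and ranks candidates by how often each token occurs at that position elsewhere:

```python
from collections import Counter, defaultdict

def topk_recall(lines, k=10):
    """For each position, rank tokens by frequency at that position and
    measure how often the true token lands in the top k (leave-one-out)."""
    by_pos = defaultdict(Counter)
    for line in lines:
        for i, tok in enumerate(line):
            by_pos[i][tok] += 1
    hits = total = 0
    for line in lines:
        for i, tok in enumerate(line):
            # leave-one-out: remove this occurrence before ranking
            c = by_pos[i].copy()
            c[tok] -= 1
            top = [t for t, n in c.most_common(k) if n > 0]
            hits += tok in top
            total += 1
    return hits / total

lines = [["daiin", "chedy", "qokedy"],
         ["daiin", "shedy", "qokedy"],
         ["saiin", "chedy", "qokeedy"]]
print(topk_recall(lines, k=2))
```

A real version would condition on richer line features than raw position, but the validation target is the same: recall of the true token within a small candidate set.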

With this, a generative model would not sample tokens directly from a sequence model, but proceed in two steps: first generate or retrieve a set of compatible forms given the line and its context, then sample from this set with relatively low discrimination. The key quantity to model is no longer the probability of a token given the previous one, but the probability that a token belongs to the current feasible set.

This leads to a different type of validation. A good model is not one that predicts the exact next token, but one that consistently places the real token inside a small, well-defined candidate set.

I hope I have explained myself better. Rolleyes


RE: The Structure of the Voynich Text and How It May Be Generated - nablator - 01-04-2026

(01-04-2026, 12:16 PM)quimqu Wrote: Simple sequential models and naive copy-and-modify processes do not fit.

I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher such as [link], but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.


RE: The Structure of the Voynich Text and How It May Be Generated - quimqu - 01-04-2026

(01-04-2026, 12:50 PM)nablator Wrote: I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher such as [link], but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.

I think your proposal is interesting, especially because it goes beyond naïve copy–modify and introduces more realistic selection patterns.

The key issue, in my view, is whether it can reproduce two constraints at the same time: strong local similarity and weak determinism. In the data, many nearby forms are compatible, but there is usually no clear source or dominant parent. Most direct self-citation models tend to produce too much chaining and identifiable source–target links.

I would suggest testing two versions of your idea: a strong one where tokens are explicitly derived from source words, and a weaker one where nearby words define a candidate set, but the final choice is not tied to a specific source. The question is whether the strong version can avoid producing detectable parent-child structure.

If you agree, I will try to model them and put them in my pipeline. Will let you know!


RE: The structure of the Voynich text and how it may be generated - nablator - 01-04-2026

(01-04-2026, 01:12 PM)quimqu Wrote: Most direct self-citation models tend to produce too much chaining and identifiable source–target links.

I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).

I should try something else. For example the length of the Longest Common Subsequence (LCS) may be better suited than the edit distance to the detection of the sequential transfer pattern between lines, with similarity(line1, line2) = length of LCS(line1, line2) / average(length of line1, length of line2).
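
For what it's worth, the proposed LCS similarity is straightforward to implement. A sketch over token sequences (a character-level version would work the same way):

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence between two sequences,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def line_similarity(line1, line2):
    """similarity = len(LCS) / average line length, as proposed above."""
    avg = (len(line1) + len(line2)) / 2
    return lcs_len(line1, line2) / avg if avg else 0.0

l1 = "daiin chedy qokedy daiin".split()
l2 = "daiin shedy qokedy dain".split()
print(line_similarity(l1, l2))  # LCS is [daiin, qokedy], so 2 / 4 = 0.5
```

Unlike edit distance, the LCS is insensitive to insertions between transferred words, which is what makes it a candidate for detecting sequential transfer between lines.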

Quote:If you agree, I will try to model them and put them in my pipeline. Will let you know!

Sure. Thank you for taking the time to conduct an in-depth analysis!


RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

(01-04-2026, 03:32 PM)nablator Wrote: I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).

In a typical self-citation model, each token tends to have a dominant source, even if it is not always easy to detect.

What I observe instead is that no single source dominates: multiple nearby candidates contribute weakly, and the structure appears distributed (cloud-like) rather than chain-based.


RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

I have been doing some testing with different models.

The experiment explores how far simple generative rules can go in reproducing measurable properties of the Voynich text. Instead of focusing on a single metric, several constraints were evaluated together, including repetition rate, local similarity within a fixed window, entropy, hapax proportion, vocabulary size, and positional differences within lines.
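
For concreteness, here are sketch definitions of some of these constraints (my own simplified formulations; the pipeline's exact definitions may differ):

```python
import math
from collections import Counter

def metrics(tokens, window=5):
    """Sketch versions of several constraints evaluated together."""
    counts = Counter(tokens)
    n = len(tokens)
    # exact repetition rate: adjacent identical tokens
    repetition = sum(a == b for a, b in zip(tokens, tokens[1:])) / max(n - 1, 1)
    # local similarity: share of tokens already seen within a fixed window
    local = sum(t in tokens[max(0, i - window):i] for i, t in enumerate(tokens)) / n
    # unigram entropy in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # hapax proportion (types occurring once) and vocabulary size
    hapax = sum(c == 1 for c in counts.values()) / len(counts)
    return {"repetition": repetition, "local_similarity": local,
            "entropy": entropy, "hapax": hapax, "vocab": len(counts)}

print(metrics("daiin chedy daiin shedy qokedy chedy ol".split()))
```

Evaluating a generator means computing this whole vector on its output and on the real text, not tuning any one number in isolation.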

Several model families were tested. The simplest ones rely on global sampling, with or without small random variations. More structured versions introduce a local pool of tokens, dependence on the previous word, and eventually a bias depending on the position within the line. The most complex variant also allows a small amount of controlled novelty, generating slightly new forms rather than only reusing existing ones.

Model                     | Core idea                                                        | What it gets roughly right                          | Main failures
base                      | Samples tokens from global vocabulary with weak repetition penalty | Basic entropy levels, approximate repetition      | Too little structure, wrong similarity, poor hapax and vocabulary control, no positional behavior
base_variant              | Global sampling + occasional independent small edits             | Repetition and some local similarity                | Variation not tied to context, wrong hapax rate, no positional asymmetry, weak vocabulary structure
prev_morph                | New token often derived from previous token (local edit process) | Short-range similarity and chaining effects         | Over-relies on sequential edits, fails on repetition balance, hapax and vocabulary distribution
positional_prev_morph     | Local pool + dependence on previous token + bias by position in line | Repetition, similarity, partial positional effects | Still poor hapax rate, vocabulary size, and incorrect start/end-of-line behavior
positional_prev_morph_new | Same as above + small probability of generating new forms (controlled novelty) | Best overall balance across repetition, similarity and entropy | Large errors remain in hapax rate, vocabulary, word-length asymmetry and positional distributions

The results show that basic properties such as repetition and short-range similarity are relatively easy to reproduce. Many different models can be tuned to reach similar values. The situation changes when additional constraints are included. Differences between models become clearer, and none of them manages to satisfy all constraints at once.

The models that perform best are those combining local reuse, dependence on the previous token, and some positional sensitivity. Adding a small amount of new word formation improves the fit slightly. However, even these models still deviate significantly in several key aspects, especially in hapax rate, vocabulary size, and how words behave at the beginning and end of lines.

The main conclusion is limited but fairly clear: simple local mechanisms are sufficient to reproduce part of the observable structure, but they fail when multiple constraints are considered simultaneously. The difficulty lies not in matching individual statistics, but in matching all of them at once.


RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

Since the last experiment, I moved away from purely sequential models and tried a different approach based on local compatibility rather than direct copying. Instead of generating each token from the previous one, the model builds a local pool of compatible forms and selects from it, with a weak bias from recent context.
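
A minimal sketch of this selection-from-a-local-pool idea, using difflib's similarity ratio as a stand-in for the real compatibility measure (toy vocabulary; the threshold and bias values are arbitrary):

```python
import difflib
import random

def pool(context, vocab, threshold=0.5):
    """Local pool: vocabulary forms sufficiently similar to a recent token.
    SequenceMatcher.ratio() stands in for the pipeline's similarity measure."""
    out = set()
    for w in context:
        for v in vocab:
            if difflib.SequenceMatcher(None, w, v).ratio() >= threshold:
                out.add(v)
    return sorted(out)

def generate(vocab, length, window=4, recency_bias=0.2, seed=0):
    """Generate a line by selecting from the local compatible pool,
    with only a weak bias from the most recent token."""
    rng = random.Random(seed)
    line = [rng.choice(vocab)]
    while len(line) < length:
        candidates = pool(line[-window:], vocab) or vocab
        if rng.random() < recency_bias:
            # weak bias: occasionally narrow to forms closest to the last token
            last = line[-1]
            candidates = sorted(
                candidates,
                key=lambda v: -difflib.SequenceMatcher(None, last, v).ratio())[:3]
        line.append(rng.choice(candidates))
    return line

vocab = ["daiin", "dain", "saiin", "chedy", "shedy", "qokedy", "qokeedy", "ol"]
print(" ".join(generate(vocab, 8)))
```

The key design choice is that no token has a designated source: the whole recent window feeds one pool, and selection within it is mostly uniform.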

Aspect                    | Earlier models                | New local compatibility model        | Remaining gap
Generation mechanism      | Sequential or global sampling | Selection from local compatible pool | Internal structure of the pool still unclear
Exact repetition          | Too high or unstable          | Very close to real text              | Largely solved
Local similarity          | Either too weak or too forced | Close to real behavior               | Slightly too diffuse
Vocabulary control        | Collapse or explosion         | Balanced regime                      | Still somewhat smaller than real
Family persistence        | Too strong (chaining)         | Close to real low persistence        | Mostly solved
Number of active families | Very low                      | Moderate increase                    | Still far below real text
Family distribution       | Highly concentrated           | Less dominated by a single family    | Entropy still too low
Within-family coherence   | Not well modeled              | Partially captured                   | Families too loose internally
Overall conclusion        | Sequential rules insufficient | Local selection works better         | Need better structure of overlapping families


This changes the behavior quite a lot. The model no longer gets stuck repeating the same forms, and exact repetition is now very close to the real text. It also avoids the opposite problem of exploding vocabulary. In that sense, the global balance is much better than in the earlier models.

More importantly, it reproduces the idea that tokens are chosen from a local set of similar options rather than derived step by step from a single source. This matches earlier observations that unordered local context works as well as, or better than, strictly sequential context.

However, new limitations appear. The model still uses fewer active families than the real text and distributes probability too unevenly across them. At the same time, tokens within a family are not as tightly related as in the manuscript, and local similarity is slightly too diffuse.

So, it is possible to reproduce repetition, local similarity and entropy at the same time, but only if generation is treated as selection from a constrained local space. What is still missing is how that space is structured internally, in particular how many overlapping groups remain active and how tightly they are defined.


RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

From my point of view, these experiments would suggest two different things, and it is important to keep them separate.

On the one hand, some patterns seem relatively robust. The Voynich text would appear to combine very low exact repetition with clear local similarity. Words would tend to resemble nearby words, but not in a strictly sequential or chain-like way. In that sense, models based on global sampling or simple copying from the previous token would not be sufficient, while a model based on selecting from a small local set of compatible forms would seem to reproduce these properties more naturally.

On the other hand, I am aware that part of the structure we observe may come from the assumptions we impose. What counts as a “word” depends on the transliteration and the use of spaces. Similarity is measured here using edit distance, which is only one possible definition. The notion of “families” is also introduced by the model and may not correspond to real discrete units in the manuscript. Different choices in these aspects could lead to different results.

So the model would mainly be useful in a limited sense. It would help rule out simpler mechanisms and suggest that local constraints are important. But it would not justify stronger claims about the exact generative process or the existence of well-defined families. Some of that apparent structure could still be an artifact of how the text is represented and analyzed.


RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

I ran some more tests and models today:

Category                 | Finding                                                                 | What it suggests
Robust                   | Local similarity is higher than in global and positional shuffles (across multiple similarity measures) | There is a real local constraint structure, not an artefact of one metric
Robust                   | Best match comes from the local window, not from the previous token (2.19 vs 3.89) | The system behaves like a local pool, not a Markov chain
Robust                   | Past vs future is symmetric (difference ≈ 0)                            | No strong directional generation signal
Robust                   | Within-line shuffle preserves most of the signal                        | Order inside the line is less important than the set of nearby tokens
Representation-dependent | Character windows (char4, char5) still show structure                   | The effect is not purely due to spaces or tokenization
Representation-dependent | Token bigrams destroy the signal                                        | Not every segmentation preserves the structure
Weak effects             | Middle of line slightly more similar and dense than start/end           | There is positional influence, but it is limited
Weak effects             | Candidate pool size is stable (~19) across positions                    | No strong evidence that position changes the size of the allowed set
Model limitation         | Local model overshoots similarity (too low Lev, too high density)       | Simple local reuse is too strong and too compact

The text behaves like it is generated from a local set of compatible forms. What matters is the neighbourhood, not the exact order and not just the previous word. This is why bag-of-context works much better than sequential dependence.
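
A sketch of the window-vs-previous-token comparison from the table, on toy data (the 2.19 vs 3.89 figures come from the real analysis, not from this example):

```python
import random

def lev(a, b):
    """Plain Levenshtein edit distance, row-by-row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def mean_best_match(tokens, window=5):
    """Mean distance from each token to its closest neighbour in the window."""
    dists = []
    for i, t in enumerate(tokens):
        neigh = tokens[max(0, i - window):i]
        if neigh:
            dists.append(min(lev(t, w) for w in neigh))
    return sum(dists) / len(dists)

def mean_prev_match(tokens):
    """Mean distance from each token to just the previous token."""
    d = [lev(a, b) for a, b in zip(tokens, tokens[1:])]
    return sum(d) / len(d)

toks = "daiin chedy dain shedy qokedy saiin chey qokeedy".split()
print(mean_best_match(toks), mean_prev_match(toks))

# Shuffle baseline: the same tokens in random order
rng = random.Random(0)
shuffled = toks[:]
rng.shuffle(shuffled)
print(mean_best_match(shuffled))
```

By construction the best window match is never worse than the previous-token match; the interesting question, as in the table, is how large the gap is relative to shuffled baselines.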

At the same time, the system is not trivial. The signal survives different similarity definitions and even different segmentations, but not all of them. So it is not just noise or a simple reshuffling effect.

The main limitation of the current models is that they are too “tight”. They create clusters that are more coherent than in the real text. The Voynich seems to operate in a looser space, where many forms are allowed but still constrained.

So the useful takeaway is not that I have a good generator yet, but that the tests are narrowing the space of plausible mechanisms.