The structure of the Voynich text and how it may be generated - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: The structure of the Voynich text and how it may be generated (/thread-5500.html)

Pages: 1 2
The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

Most discussions about the Voynich already agree on a few basic points: the text is not random, similar words tend to appear near each other, and position within the line matters. What is less clear is how this structure is actually generated.

I have been working on a pipeline that lets me analyze the structure of the MS. The goal of this analysis was not to describe patterns, but to test concrete generative hypotheses against the data. The question was always the same: if this were the real mechanism, would it reproduce what we observe? Running different models through the same pipeline makes it possible to discard entire classes of explanations, not just speculate about them.
The important point is not just that these models fail, but how they fail. Copy-and-modify mechanisms generate too much similarity, producing tight chains of derived forms that are not observed in the real text. Sequential models fail in the opposite direction, missing most of the structure entirely. The idea of a single dominant parent breaks down because the local neighborhood is too ambiguous: for most tokens, several nearby forms are equally plausible, with no clear winner. These are structural mismatches, not minor errors, and they rule out a large class of simple generative explanations.

At the same time, some effects are very robust. Local similarity is real and strong: words share substrings and cluster in form space. Position within the line has a clear impact on length, prefixes, and suffixes. But these signals do not translate into a simple mechanism where one word determines the next. Token-level models struggle precisely because the system is not organized primarily as a chain of local decisions.

The structure becomes clearer when moving to the level of the full line. If lines are represented as whole objects, using their internal properties (number of tokens, length distributions, entropy, positional patterns), they fall into a small number of latent types. These types are not imposed manually, but learned directly from the text. They correspond broadly to different functional roles, but also reveal variation within them. These data-driven line types also show persistence across consecutive lines, suggesting that the manuscript is organized as sequences of line-level states, not just as a stream of loosely connected tokens.
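As an illustration, the line-level clustering described above can be sketched roughly as follows. This is a toy version, not the actual pipeline: the feature set (token count, mean and spread of token length, character entropy), the tiny k-means, and the sample lines are all my own assumptions.

```python
import math
import random
from collections import Counter

def line_features(tokens):
    """Represent a line as a small numeric feature vector."""
    lengths = [len(t) for t in tokens]
    n = len(tokens)
    mean_len = sum(lengths) / n
    # character-level Shannon entropy of the line
    counts = Counter("".join(tokens))
    total = sum(counts.values())
    ent = -sum(c / total * math.log2(c / total) for c in counts.values())
    return [n, mean_len, max(lengths) - min(lengths), ent]

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means over feature vectors; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical EVA-like sample lines, just to make the sketch runnable.
lines = [
    "daiin shedy qokeedy daiin".split(),
    "chol chor cthy daiin shol".split(),
    "qokaiin okaiin qokeedy shedy okeedy".split(),
    "otedy qokedy shedy".split(),
]
feats = [line_features(t) for t in lines]
labels = kmeans(feats, k=2)
print(labels)
```

In this framing, the "latent line types" are simply the clusters; persistence across consecutive lines could then be checked by counting how often adjacent lines share a label.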
A useful way to think about it is this: the line defines a space of valid forms, the local context narrows that space further, and the final choice is a weak selection among the remaining, roughly equally compatible candidates.
This explains why local similarity is strong but does not translate into clear parent-child relationships, and why token-level models struggle while line-level structure is much more stable. With this analysis I am not trying to show that the Voynich is structured, which was already suspected, but to narrow down the class of mechanisms that can plausibly generate it. Simple sequential models and naive copy-and-modify processes do not fit. Models that operate at the level of line-level states, combined with a local compatibility field and weak selection within that field, are much more consistent with the data.

RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

I know that my posts are quite dense and "hard" to read, so I will try to summarize. What remains consistent with the data is not a chain of transformations, but a constrained selection process. The line defines a space of valid forms, the local context restricts this space further, and the final choice is made among multiple equally compatible candidates. This shifts the problem from "how is the next token generated" to "how is the space of valid tokens defined". One can then think of concrete, testable assumptions about the mechanism, which I will work on. Instead of trying to predict a single next token, the model should first approximate a candidate set, and only then model selection within that set.
With this, a generative model would not sample tokens directly from a sequence model, but proceed in two steps: first generate or retrieve a set of compatible forms given the line and its context, then sample from this set with relatively low discrimination. The key quantity to model is no longer the probability of a token given the previous one, but the probability that a token belongs to the current feasible set.

This leads to a different type of validation. A good model is not one that predicts the exact next token, but one that consistently places the real token inside a small, well-defined candidate set. I hope to have explained myself better.
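A toy version of this two-step scheme might look like the following. Everything concrete here is an illustrative assumption, not the author's actual model: the candidate set is defined by edit distance to a recent context window, selection inside the set is near-uniform, and validation measures how often the real token falls inside its candidate set.

```python
import random

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidate_set(context, vocab, max_dist=2):
    """Step 1: every vocabulary form within max_dist of some context token."""
    return {w for w in vocab
            if any(edit_distance(w, c) <= max_dist for c in context)}

def sample_next(context, vocab, rng):
    """Step 2: weak selection -- near-uniform choice inside the candidate set."""
    cands = candidate_set(context, vocab)
    return rng.choice(sorted(cands)) if cands else rng.choice(sorted(vocab))

def set_recall(tokens, vocab, window=5):
    """Validation: how often the real token falls inside its candidate set."""
    hits = 0
    for i in range(window, len(tokens)):
        cands = candidate_set(tokens[i - window:i], vocab)
        hits += tokens[i] in cands
    return hits / max(1, len(tokens) - window)

# Hypothetical EVA-like data, only to make the sketch runnable.
vocab = ["daiin", "aiin", "dain", "shedy", "shey", "qokeedy", "okeedy", "chol"]
tokens = ["daiin", "aiin", "daiin", "dain", "shedy",
          "shey", "shedy", "qokeedy", "okeedy"]
rng = random.Random(0)
nxt = sample_next(tokens[-5:], vocab, rng)
print(nxt)
print(round(set_recall(tokens, set(vocab)), 2))  # set-membership recall
```

Under this validation, a model scores well when the recall is high while the candidate sets stay small; predicting the exact next token is never required.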
RE: The Structure of the Voynich Text and How It May Be Generated - nablator - 01-04-2026

(01-04-2026, 12:16 PM)quimqu Wrote: Simple sequential models and naive copy-and-modify processes do not fit.

I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

Such a method, if it can be shown to replicate all known properties of Voynichese, does not exclude the possibility of a cipher, but unnecessary complications can usually be discarded by Occam's razor when there is no evidence specifically pointing at them.

RE: The Structure of the Voynich Text and How It May Be Generated - quimqu - 01-04-2026

(01-04-2026, 12:50 PM)nablator Wrote: I tried to propose a less naïve self-citation generation method including sparse initialization causing bottlenecks that would explain local inhomogeneities, lazy source-word selection patterns, non-sequential writing, and generation rules optimized to fit the transliteration data. It seems that no one knew how to test the hypothesis. What do you think about it?

I think your proposal is interesting, especially because it goes beyond naïve copy-modify and introduces more realistic selection patterns. The key issue, in my view, is whether it can reproduce two constraints at the same time: strong local similarity and weak determinism.
In the data, many nearby forms are compatible, but there is usually no clear source or dominant parent. Most direct self-citation models tend to produce too much chaining and identifiable source-target links. I would suggest testing two versions of your idea: a strong one, where tokens are explicitly derived from source words, and a weaker one, where nearby words define a candidate set but the final choice is not tied to a specific source. The question is whether the strong version can avoid producing detectable parent-child structure. If you agree, I will try to model them and put them in my pipeline. I will let you know!

RE: The structure of the Voynich text and how it may be generated - nablator - 01-04-2026

(01-04-2026, 01:12 PM)quimqu Wrote: Most direct self-citation models tend to produce too much chaining and identifiable source-target links.

I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links, but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread). I should try something else. For example, the length of the Longest Common Subsequence (LCS) may be better suited than the edit distance for detecting the sequential transfer pattern between lines, with similarity(line1, line2) = length of LCS(line1, line2) / average(length of line1, length of line2).

Quote: If you agree, I will try to model them and put them in my pipeline. I will let you know!

Sure. Thank you for taking the time to conduct an in-depth analysis!

RE: The structure of the Voynich text and how it may be generated - quimqu - 01-04-2026

(01-04-2026, 03:32 PM)nablator Wrote: I don't know what you mean by "identifiable". I believed that the frequency of some very unlikely sequential transfer patterns between two lines would almost certainly identify source-target links, but then I was disappointed to find a similar frequency in Torsten Timm's generated_text file (see my last post in the other thread).

In a typical self-citation model, each token tends to have a dominant source, even if it is not always easy to detect. What I observe instead is that no single source dominates: multiple nearby candidates contribute weakly, and the structure appears distributed (cloud-like) rather than chain-based.

RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

I have been doing some testing with different models. The experiment explores how far simple generative rules can go in reproducing measurable properties of the Voynich text. Instead of focusing on a single metric, several constraints were evaluated together, including repetition rate, local similarity within a fixed window, entropy, hapax proportion, vocabulary size, and positional differences within lines.

Several model families were tested. The simplest ones rely on global sampling, with or without small random variations. More structured versions introduce a local pool of tokens, dependence on the previous word, and eventually a bias depending on the position within the line. The most complex variant also allows a small amount of controlled novelty, generating slightly new forms rather than only reusing existing ones.
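For illustration, the constraints listed above could be computed along these lines. This is a sketch, not the actual pipeline: the repetition window and the exact metric definitions are my own choices.

```python
import math
from collections import Counter

def corpus_metrics(lines, window=10):
    """Joint constraint set for a tokenized corpus (list of token lists)."""
    tokens = [t for line in lines for t in line]
    counts = Counter(tokens)
    total = len(tokens)
    # repetition rate: share of tokens already seen in the preceding window
    rep = sum(1 for i in range(1, total)
              if tokens[i] in tokens[max(0, i - window):i]) / (total - 1)
    # unigram entropy in bits
    ent = -sum(c / total * math.log2(c / total) for c in counts.values())
    # hapax proportion: word types occurring exactly once
    hapax = sum(1 for c in counts.values() if c == 1) / len(counts)
    # positional difference: mean token length at line end minus line start
    first = [len(l[0]) for l in lines if l]
    last = [len(l[-1]) for l in lines if l]
    pos_diff = sum(last) / len(last) - sum(first) / len(first)
    return {"repetition": rep, "entropy": ent, "hapax": hapax,
            "vocab": len(counts), "pos_len_diff": pos_diff}

# Hypothetical EVA-like sample, only to make the sketch runnable.
lines = [["daiin", "shedy", "qokeedy"],
         ["chol", "daiin", "shedy"],
         ["qokaiin", "okaiin", "ol"]]
m = corpus_metrics(lines)
print(m["vocab"], round(m["hapax"], 2), m["repetition"])
```

A generator "fits" only if the same function returns values close to the real text's on all keys at once, which is the joint-constraint test described above.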
The results show that basic properties such as repetition and short-range similarity are relatively easy to reproduce: many different models can be tuned to reach similar values. The situation changes when additional constraints are included. Differences between models become clearer, and none of them manages to satisfy all constraints at once.

The models that perform best are those combining local reuse, dependence on the previous token, and some positional sensitivity. Adding a small amount of new word formation improves the fit slightly. However, even these models still deviate significantly in several key aspects, especially in hapax rate, vocabulary size, and how words behave at the beginning and end of lines.

The main conclusion is limited but clear: simple local mechanisms are sufficient to reproduce part of the observable structure, but they fail when multiple constraints are considered simultaneously. The difficulty lies not in matching individual statistics, but in matching all of them at once.

RE: The structure of the Voynich text and how it may be generated - quimqu - 05-04-2026

Since the last experiment, I have moved away from purely sequential models and tried a different approach based on local compatibility rather than direct copying. Instead of generating each token from the previous one, the model builds a local pool of compatible forms and selects from it, with a weak bias from recent context.
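A minimal sketch of such a pool-based generator might look as follows. The compatibility test, pool window, and bias value are illustrative assumptions, not the model actually used.

```python
import random

def similar(a, b, max_dist=2):
    """Cheap compatibility test: close in length and sharing a prefix or suffix."""
    return abs(len(a) - len(b)) <= max_dist and (a[:2] == b[:2] or a[-2:] == b[-2:])

def generate(vocab, n_tokens, pool_window=15, bias=0.3, seed=0):
    """Pool-based generation: choose among forms compatible with recent output,
    with only a weak bias toward neighbours of the previous token."""
    rng = random.Random(seed)
    out = [rng.choice(vocab)]
    while len(out) < n_tokens:
        recent = out[-pool_window:]
        # Step 1: local pool of compatible forms
        pool = [w for w in vocab if any(similar(w, r) for r in recent)]
        if not pool:
            pool = vocab
        # Step 2: weak sequential bias, applied only some of the time
        if rng.random() < bias:
            near_prev = [w for w in pool if similar(w, out[-1])]
            pool = near_prev or pool
        out.append(rng.choice(pool))
    return out

# Hypothetical EVA-like vocabulary, only to make the sketch runnable.
vocab = ["daiin", "aiin", "dain", "shedy", "shey", "chedy",
         "qokeedy", "okeedy", "qokedy", "chol", "chor", "ol"]
text = generate(vocab, 50)
print(len(set(text)), "distinct forms in 50 tokens")
```

Because selection happens inside a shifting pool rather than from a single parent, the output shows local clustering of similar forms without a detectable source-target chain, which is the behavior described above.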
This changes the behavior quite a lot. The model no longer gets stuck repeating the same forms, and exact repetition is now very close to the real text. It also avoids the opposite problem of an exploding vocabulary. In that sense, the global balance is much better than in the earlier models. More importantly, it reproduces the idea that tokens are chosen from a local set of similar options rather than derived step by step from a single source. This matches earlier observations that unordered local context works as well as, or better than, strictly sequential context.

However, new limitations appear. The model still uses fewer active families than the real text and distributes probability too unevenly across them. At the same time, tokens within a family are not as tightly related as in the manuscript, and local similarity is slightly too diffuse. So it is possible to reproduce repetition, local similarity, and entropy at the same time, but only if generation is treated as selection from a constrained local space. What is still missing is how that space is structured internally, in particular how many overlapping groups remain active and how tightly they are defined.

RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

From my point of view, these experiments seem to suggest two different things, and it is important to keep them separate. On the one hand, some patterns look relatively robust. The Voynich text appears to combine very low exact repetition with clear local similarity. Words tend to resemble nearby words, but not in a strictly sequential or chain-like way. In that sense, models based on global sampling or simple copying from the previous token would not be sufficient, while a model based on selecting from a small local set of compatible forms seems to reproduce these properties more naturally. On the other hand, I am aware that part of the structure we observe may come from the assumptions we impose.
What counts as a "word" depends on the transliteration and the use of spaces. Similarity is measured here using edit distance, which is only one possible definition. The notion of "families" is also introduced by the model and may not correspond to real discrete units in the manuscript. Different choices in these aspects could lead to different results.

So the model is mainly useful in a limited sense. It helps rule out simpler mechanisms and suggests that local constraints are important. But it does not justify stronger claims about the exact generative process or the existence of well-defined families. Some of that apparent structure could still be an artifact of how the text is represented and analyzed.

RE: The structure of the Voynich text and how it may be generated - quimqu - 06-04-2026

I ran some more tests and models today:
The text behaves as if it were generated from a local set of compatible forms. What matters is the neighbourhood, not the exact order and not just the previous word. This is why bag-of-context works much better than sequential dependence. At the same time, the system is not trivial: the signal survives different similarity definitions and even different segmentations, but not all of them, so it is not just noise or a simple reshuffling effect.

The main limitation of the current models is that they are too "tight". They create clusters that are more coherent than in the real text. The Voynich seems to operate in a looser space, where many forms are allowed but still constrained. So the useful takeaway is not that I have a good generator yet, but that the tests are narrowing the space of plausible mechanisms.
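As one concrete alternative similarity definition, nablator's LCS-based line similarity proposed earlier in the thread can be implemented directly. This is a character-level sketch; whether lines should be compared as character strings or token sequences is left open in the thread, and characters are assumed here.

```python
def lcs_length(a, b):
    """Length of the Longest Common Subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def line_similarity(line1, line2):
    """similarity = len(LCS) / average line length, as proposed in the thread."""
    if not line1 and not line2:
        return 1.0
    return lcs_length(line1, line2) / ((len(line1) + len(line2)) / 2)

# Hypothetical EVA-like lines, only to make the sketch runnable.
s = line_similarity("daiin.shedy.qokeedy", "daiin.shey.qokedy")
print(round(s, 2))  # -> 0.94
```

Unlike edit distance, this score rewards a shared subsequence even when extra material is interleaved, which is exactly the sequential transfer pattern the measure is meant to detect.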