The Voynich Ninja

Full Version: About the generation of similar words
(22-03-2026, 08:25 AM)dashstofsk Wrote: The GC transliteration [...]

( Why is it that people don't seem to be making more use of this transliteration? )

I am pretty sure this is due to the alphabet, which has numerous other disadvantages, the first (but not the only one) being that it is hard to remember.

The transliteration itself is valuable, as it is nearly complete and very consistent.
(22-03-2026, 08:25 AM)dashstofsk Wrote: ( Why is it that people don't seem to be making more use of the [GC] transliteration? )

As for me: I believe that all proposed 'Voynichese alphabets' are 'wrong' (including [link omitted]), so in my analyses I try to use formulas and algorithms that do not depend on the alphabet.

All the best, --stolfi
I’ve been trying to answer a simple question: are Voynich bursts just the same thing we see in normal language, or is something structurally different going on inside them?

First thing. Bursts are not unique to the Voynich. If you run the same detection on normal texts, you also get bursts, and they also show some local structure. So just saying “the Voynich has bursts” is not enough.

The difference shows up when you remove short words and keep only tokens of length ≥ 3. In normal texts, the burst signal drops a lot. In the Voynich, it barely changes. That already tells you something important: in natural language, a big part of the burst effect comes from short, frequent words. In the Voynich, the structure sits in the main vocabulary itself.
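As a rough illustration of what "burst detection" can mean here (the exact definition used in this thread is not spelled out, so the window size, the distance threshold and the token file name below are assumptions, not the settings behind the numbers above):

Code:
from Levenshtein import distance as lev  # pip install Levenshtein

WINDOW, MAX_DIST, MIN_LEN = 10, 2, 3  # illustrative values only

# tokens of length >= 3 only; file name is a placeholder
tokens = [t for t in open("voynich_tokens.txt").read().split() if len(t) >= MIN_LEN]

def has_near_neighbour(i):
    lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
    return any(j != i and lev(tokens[i], tokens[j]) <= MAX_DIST for j in range(lo, hi))

in_burst = [has_near_neighbour(i) for i in range(len(tokens))]

# burst share = fraction of tokens inside runs of at least 2 "bursty" tokens
covered, i = 0, 0
while i < len(in_burst):
    j = i
    while j < len(in_burst) and in_burst[j]:
        j += 1
    if j > i:
        covered += (j - i) if (j - i) >= 2 else 0
        i = j
    else:
        i += 1
print("burst share (len >= 3):", covered / len(tokens))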

Then I looked inside the bursts.

In the Voynich, the real order of tokens does three things compared to a shuffled version of the same bursts. It creates more local parent-like matches, those matches are closer in sequence, and the words are slightly more similar to each other. In natural texts, you only see the first two effects, not the third. The real order does not make words more similar than a shuffle does.

That’s a key difference. Voynich bursts are not just clusters of similar words. They behave more like a local walk through nearby variants.
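A sketch of the shuffle comparison behind that claim, assuming `bursts` is already a list of token sequences from whatever detector is used; the neighbourhood width and the number of shuffles are arbitrary here:

Code:
import random
from statistics import mean
from Levenshtein import distance as lev

def local_dissimilarity(seq, k=3):
    # mean edit distance from each token to its best match within +/- k positions
    vals = []
    for i, t in enumerate(seq):
        near = seq[max(0, i - k):i] + seq[i + 1:i + 1 + k]
        if near:
            vals.append(min(lev(t, u) for u in near))
    return mean(vals)

def shuffle_test(bursts, n_shuffles=200, seed=0):
    rng = random.Random(seed)
    bursts = [b for b in bursts if len(b) > 2]
    real = mean(local_dissimilarity(b) for b in bursts)
    null = [mean(local_dissimilarity(rng.sample(b, len(b))) for b in bursts)
            for _ in range(n_shuffles)]
    # real below the null mean => the real order keeps similar words closer than chance
    return real, mean(null)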
I also checked directionality. If this were a simple chain where each word generates the next one, you would expect the best match to be more often in the past than in the future. That does not happen. Not in the Voynich, not in normal texts. So this is not a simple left-to-right generation process. It looks more like a dense local network.

The most interesting signal comes from looking at triples A → B → C inside bursts.
In the Voynich, you more often see small step + small step = bigger step patterns, and you also see a stronger tendency for length to grow or shrink consistently across two steps. In natural texts, what you mostly see are compact triangular families, which is exactly what you expect from normal morphology.
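One way to operationalise the triple test; the step threshold of 2 edits is an assumption, not necessarily the one used here:

Code:
from Levenshtein import distance as lev

def classify_triples(burst, step_max=2):
    # walk consecutive tokens A -> B -> C and classify the triangle they form
    radial = compact = monotone = total = 0
    for a, b, c in zip(burst, burst[1:], burst[2:]):
        if lev(a, b) <= step_max and lev(b, c) <= step_max:
            total += 1
            if lev(a, c) > step_max:
                radial += 1      # small step + small step = bigger step
            else:
                compact += 1     # tight triangular family
            if len(a) < len(b) < len(c) or len(a) > len(b) > len(c):
                monotone += 1    # length grows or shrinks consistently across both steps
    return radial, compact, monotone, total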

So the picture that emerges is quite different.

Corpus | Burst share (len ≥ 3) | Key observation
Voynich | 0.6877 | Most of the text is inside bursts, even without short words
Culpepper English | 0.2707 | Bursts exist but depend more on common short words
De docta ignorantia Latin | 0.1613 | Weaker burst structure overall
Alchemical Herbal Latin | 0.0675 | Very limited burst behavior

Corpus | Real vs shuffle (similarity effect) | Interpretation
Voynich | Real order makes words more similar | Local sequence pushes toward nearby variants
Natural texts | No similarity gain vs shuffle | Clustering exists, but not sequential transformation

Corpus | Radial chains (A→B→C) | Compact families | Length monotonicity
Voynich | Higher | Lower | Much higher
Natural texts | Lower | Higher | Low

So I wouldn’t say “bursts prove the Voynich is special”, because they don’t. But once you control for short words and look at what happens inside bursts, the behavior is clearly different.

Natural texts look like clusters of related forms. The Voynich looks more like something moving step by step through a space of very similar variants.

That doesn’t prove a specific mechanism yet, but it does narrow things down quite a lot.
What I wanted to know next was whether Voynich bursts behave like ordinary local repetition, or whether they are better understood as a system of local reuse with variation.

The Voynich still stands out immediately at the structural level. Nearly 69% of its tokens of length ≥ 3 fall inside bursts, compared with 27% in Culpepper, 16% in De docta ignorantia, and only 7% in the Alchemical Herbal text. Its bursts are also much larger, with a mean size of 15.6 tokens and a maximum of 162. In the control texts, bursts are much smaller.

But the most important result is not just that Voynich bursts are bigger. It is how they behave internally.

If you look at prediction inside bursts, the local neighborhood matters a lot, but the past does not beat the future. In other words, the current token is strongly constrained by nearby tokens, yet there is no clear left-to-right generation signal. The best predictor is not the previous context alone, but the unordered local pool. That pushes against a simple chain model where one token directly produces the next one.
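A minimal sketch of the three "exact" predictors, assuming the same 10-token span as the reuse numbers below; the real evaluation may differ in detail:

Code:
W = 10  # assumed window, matching the "last 10" span used below

def predictor_scores(seq):
    n = len(seq)
    past   = sum(seq[i] in seq[max(0, i - W):i] for i in range(n)) / n
    future = sum(seq[i] in seq[i + 1:i + 1 + W] for i in range(n)) / n
    # unordered local pool: past and future together, order ignored
    bag    = sum(seq[i] in seq[max(0, i - W):i] + seq[i + 1:i + 1 + W] for i in range(n)) / n
    return bag, past, future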

The memory tests point in the same direction. Exact reuse in the Voynich is relatively low, but reuse by similarity is very high. Within the previous 10 tokens, exact reuse is only about 0.12, yet reuse through Levenshtein ≤ 2 rises to about 0.84, and reuse at the family level to about 0.64. So the system is not mainly repeating the same word. It is reusing the same local variant space.
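The reuse measurements are roughly of this shape; `family` stands for a token-to-cluster mapping produced separately (one possible clustering is sketched a little further down), and the window of 10 matches the span quoted above:

Code:
from Levenshtein import distance as lev

def reuse_rates(tokens, family, window=10):
    exact = lev1 = lev2 = fam = 0
    n = len(tokens)
    for i, t in enumerate(tokens):
        recent = tokens[max(0, i - window):i]
        exact += t in recent
        lev1  += any(lev(t, u) <= 1 for u in recent)
        lev2  += any(lev(t, u) <= 2 for u in recent)
        fam   += any(family.get(t) is not None and family.get(t) == family.get(u)
                     for u in recent)
    return exact / n, lev1 / n, lev2 / n, fam / n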

That is where the contrast with normal texts becomes interesting. Culpepper and De docta ignorantia show more exact reuse than the Voynich, but they do not show the same combination of very large bursts, long paths, and extremely strong variant-based reuse. The Voynich seems less repetitive in the literal sense, but more repetitive in terms of families and near-neighbors.

Family clustering also helps. Once tokens are grouped into tight local families, the Voynich does not collapse into noise. Instead, it reveals a handful of large lexical centers such as chedy, daiin, qokeey, chol, and okal, each with many nearby variants. That suggests that a substantial part of the text may be generated not from isolated word forms, but from active local families that remain available for reuse over long spans.
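The clustering itself could be done in several ways; one simple version (an assumption, not necessarily the method used here) treats families as connected components of the Levenshtein ≤ 1 graph over token types:

Code:
from collections import Counter
from Levenshtein import distance as lev

def build_families(tokens, max_dist=1):
    types = sorted(set(tokens))
    parent = {t: t for t in types}
    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, a in enumerate(types):
        for b in types[i + 1:]:
            if abs(len(a) - len(b)) <= max_dist and lev(a, b) <= max_dist:
                parent[find(a)] = find(b)
    family = {t: find(t) for t in types}
    # largest families by running-text frequency (the chedy / daiin / qokeey kind of centre)
    sizes = Counter(family[t] for t in tokens)
    return family, sizes.most_common(5)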

So the emerging picture is this. Natural texts certainly have bursts, and they also have local similarity. But in those texts, much of the effect is tied to ordinary lexical repetition and familiar morphological clustering. In the Voynich, the effect is much more dominated by persistent local neighborhoods of similar forms. It looks less like ordinary repetition, and more like controlled movement inside a dense local variant pool.

Corpus | Burst share (len ≥ 3) | Mean burst size | Max burst size
Voynich | 0.6877 | 15.60 | 162
Culpepper English | 0.2707 | 6.41 | 73
De docta ignorantia Latin | 0.1613 | 4.19 | 26
Alchemical Herbal Latin | 0.0675 | 3.67 | 9

Corpus | Mean path length | Long paths ≥ 5 | Long paths ≥ 8
Voynich | 5.34 | 0.4426 | 0.2096
Culpepper English | 4.05 | 0.2877 | 0.0877
De docta ignorantia Latin | 3.00 | 0.1003 | 0.0104
Alchemical Herbal Latin | 2.67 | 0.0280 | 0.0000

Corpus | Exact reuse, last 10 | Reuse by Lev ≤ 1, last 10 | Reuse by Lev ≤ 2, last 10 | Family reuse, last 10
Voynich | 0.1171 | 0.4313 | 0.8358 | 0.6433
Culpepper English | 0.4222 | 0.4943 | 0.7970 | 0.7287
De docta ignorantia Latin | 0.3408 | 0.4104 | 0.7234 | 0.5962
Alchemical Herbal Latin | 0.1818 | 0.2375 | 0.6804 | 0.5249

Corpus | Bag predictor exact | Past predictor exact | Future predictor exact | Interpretation
Voynich | 0.1998 | 0.1171 | 0.1171 | Strong local constraint, but no temporal directionality
Culpepper English | 0.5896 | 0.4222 | 0.4222 | Local pool matters more than ordered sequence
De docta ignorantia Latin | 0.5448 | 0.3408 | 0.3408 | Same pattern, weaker local density than Voynich
Alchemical Herbal Latin | 0.3226 | 0.1818 | 0.1818 | Small local pools, limited burst structure

So after this second pass, I would say that the Voynich does not look like a text that simply repeats words, and it does not look like a straightforward left-to-right rewrite chain either. It looks more like a system that keeps a dense local pool of related forms active and keeps taking from it with small variations. That still does not tell us the exact mechanism. The core behavior seems to be local reuse with controlled variation, not ordinary repetition and not simple sequential derivation.
I need to rethink a bit about what I just posted... sorry for the inconvenience.
(31-03-2026, 10:16 AM)quimqu Wrote: I need to rethink a bit about what I just posted... sorry for the inconvenience.

Yes, too much for a single post. :)

You might want to explain what you mean exactly by form, shape, cluster...
(31-03-2026, 10:28 AM)nablator Wrote:
(31-03-2026, 10:16 AM)quimqu Wrote: I need to rethink a bit about what I just posted... sorry for the inconvenience.

Yes, too much for a single post. :)

You might want to explain what you mean exactly by form, shape, cluster...

The problem is that I think I started with a wrong assumption. Bursts are something I detect, but they might not be real, or might not be the key to the generation process...
The results I shared earlier were not reliable, so I removed them (sorry) and restarted the analysis from scratch with a stricter and more agnostic approach.

The main issue was methodological. I was implicitly assuming that similar tokens have a "parent" in the recent past, and then analyzing the data under that assumption. That can easily create patterns that look real but are actually imposed by the pipeline.

This time I avoided that completely. No parents, no bursts, no direction assumed. What I did instead was simple (a rough sketch follows the list below):
  • For each token, I measured how many similar tokens (Levenshtein ≤1 and ≤2) appear nearby.
  • I compared past vs future symmetrically.
  • I checked whether similarity is concentrated around one specific neighbor or spread across many.
  • I validated everything against multiple null models.
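
A rough sketch of those measurements, kept deliberately parent-free; the window size, distance threshold and the 1/(1+distance) weighting are illustrative choices, not necessarily the ones behind the plots below:

Code:
from Levenshtein import distance as lev

def neighbourhood_stats(tokens, window=10, max_dist=2):
    rows = []
    for i, t in enumerate(tokens):
        past   = [u for u in tokens[max(0, i - window):i] if lev(t, u) <= max_dist]
        future = [u for u in tokens[i + 1:i + 1 + window] if lev(t, u) <= max_dist]
        sims = past + future
        if sims:
            w = [1.0 / (1 + lev(t, u)) for u in sims]   # similarity weight per neighbour
            rows.append({
                "n_past": len(past),                    # past vs future counts: symmetry check
                "n_future": len(future),
                "density": len(sims),                   # size of the local "cloud"
                "best_share": max(w) / sum(w),          # share explained by the single best match
            })
    return rows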

Two things are clearly true:
  • There is strong local clustering of similar forms. Tokens are not placed randomly. They tend to appear in regions where many similar tokens exist.
  • This clustering is symmetric in time. Similar tokens are just as likely to appear after a token as before it.

I then looked at how that local similarity is structured.

[attachment=14969]

This plot shows how much of the local similarity is explained by the single best match. If a “parent-like” mechanism dominated, we would expect values close to 1. Instead, most of the mass is well below that. The best individual match typically explains only a small fraction of the local similarity. In other words, tokens do not seem to be anchored to a single dominant neighbor.

[attachment=14968]

This makes it even clearer. As the local field becomes denser (more similar tokens nearby), the importance of the best individual match systematically decreases. This is a very strong pattern:
  • when there is only one similar token, it trivially dominates,
  • but as soon as multiple similar tokens appear, the structure becomes distributed,
  • the system behaves more like a “cloud” than a chain.
First conclusion: The data does not support a simple "copy from a previous token and modify it" mechanism. This is solid because:
  • it does not rely on defining parents or bursts,
  • it holds across different similarity thresholds and window sizes,
  • it survives null model comparisons,
  • and it is confirmed by multiple independent signals: symmetry in time, absence of directional bias, and lack of a dominant local match.

What this does NOT rule out: the absence of a simple directional signal does not mean there is no structure or no generation process. The data is still compatible with:
  • reversible or cyclic transformations,
  • selection within a local repertoire of similar forms,
  • systems with latent states activating families of tokens,
  • or other non-linear, non-directional mechanisms.

So, the structure looks more like a local space of related forms than a linear chain of derivations.

I also compared two explicit models on held-out pages: a local cloud model, where a token is supported by the density of similar forms around it, and a copy-modify model, where it is supported by specific similar tokens in the past. Both outperform a unigram baseline, but the local cloud model wins consistently across the entire parameter grid. This does not prove that the text is generated non-directionally, but it does show that local distributed similarity explains the data better than a simple retrospective copy-modify mechanism.
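Very roughly, the comparison can be pictured like this. Both "models" here are simplified stand-ins for whatever was actually fitted (density of similar forms in a symmetric window vs the best single match in the past window, each interpolated with a unigram baseline), and the window, threshold and interpolation weight are assumptions:

Code:
import math
from collections import Counter
from Levenshtein import distance as lev

def mean_logscore(test_pages, train_tokens, window=10, lam=0.5, model="cloud"):
    uni = Counter(train_tokens)
    N, V = sum(uni.values()), len(uni)
    ll = n = 0
    for page in test_pages:                    # held-out pages, each a list of tokens
        for i, t in enumerate(page):
            past = page[max(0, i - window):i]
            if model == "cloud":               # symmetric density of similar forms
                ctx = past + page[i + 1:i + 1 + window]
                local = sum(lev(t, u) <= 2 for u in ctx) / (len(ctx) or 1)
            else:                              # copy-modify: best single match in the past
                local = max((1.0 / (1 + lev(t, u)) for u in past), default=0.0)
            p = lam * local + (1 - lam) * (uni[t] + 1) / (N + V)
            ll += math.log(p)
            n += 1
    return ll / n                              # higher mean log-score = better fit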

I will keep posting the next steps as I go, focusing on what is actually supported by the data and separating it from hypotheses. Sorry for my previous deleted post.
(31-03-2026, 11:51 AM)quimqu Wrote: So, the structure looks more like a local space of related forms than a linear chain of derivations.

Your observation that "tokens do not seem to be anchored to a single dominant neighbor" and that "the system behaves more like a cloud than a chain" matches the findings in my 2016 paper "Co-Occurrence Patterns in the Voynich Manuscript" [link omitted].

The paper contains two-dimensional heat maps showing the probability that a word at position {n,m} is identical or similar (edit distance 1 and 2) to words at every surrounding position. The pattern is symmetric — similarity increases toward the current position from all directions, not just from the left. The strongest similarity is at {n-1,m} (same position, previous line) and {n,m-1} (previous position, same line), but the entire neighborhood is elevated.
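For readers who want to reproduce a map of that kind, one straightforward computation over a list of transliterated lines looks like this (a sketch only, not necessarily the procedure used in the paper; the offset range and edit-distance threshold are assumptions):

Code:
from Levenshtein import distance as lev

def similarity_heatmap(lines, max_off=3, max_dist=1):
    # lines: list of token lists, one per manuscript line
    counts, hits = {}, {}
    for li, line in enumerate(lines):
        for pi, w in enumerate(line):
            for dl in range(-max_off, max_off + 1):
                for dp in range(-max_off, max_off + 1):
                    if dl == 0 and dp == 0:
                        continue
                    lj, pj = li + dl, pi + dp
                    if 0 <= lj < len(lines) and 0 <= pj < len(lines[lj]):
                        counts[(dl, dp)] = counts.get((dl, dp), 0) + 1
                        hits[(dl, dp)] = hits.get((dl, dp), 0) + (lev(w, lines[lj][pj]) <= max_dist)
    # probability of a similar word at each (line offset, position offset)
    return {k: hits[k] / counts[k] for k in counts}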
I extended the analysis a bit further, still keeping the same constraints: no parents, no direction, no predefined structure.

The question I wanted to answer next was simple: if tokens are locally similar, do they also behave similarly in context?

To test this, I stopped grouping tokens into “families” (that was too unstable) and instead looked only at direct pairs of similar tokens (Levenshtein = 1). Then I compared their contextual profiles. The result is quite strong: tokens that are formally similar have much more similar contexts than expected. Not slightly more. A lot more.

Group | Mean cosine | Median | Q25 | Q75
Formal pairs (lev = 1) | 0.6917 | 0.7003 | 0.5975 | 0.7716
Matched random controls | 0.4664 | 0.4688 | 0.3489 | 0.5577

The average similarity between real similar pairs is around 0.69, while matched random pairs (same length and frequency constraints) are around 0.46. And this survives a permutation test very easily.
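The contextual comparison is roughly of this shape: a bag-of-neighbours profile per token type, cosine similarity between profiles, and real Levenshtein-1 pairs set against length/frequency-matched random pairs. The window size and matching criteria here are illustrative:

Code:
import math
from collections import Counter, defaultdict

def context_profiles(tokens, window=2):
    # bag-of-neighbours profile per token type
    prof = defaultdict(Counter)
    for i, t in enumerate(tokens):
        for u in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            prof[t][u] += 1
    return prof

def cosine(a, b):
    num = sum(v * b[k] for k, v in a.items() if k in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# real pairs: all type pairs at Levenshtein distance 1; controls: random type pairs
# matched on token length and (roughly) corpus frequency, then compare the mean
# cosine of the two groups and permute labels for the significance test.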

So this adds something important to the previous post: local similarity is not just geometric. It is also functional. Form and context are clearly coupled.

At the same time, the “cloud vs chain” picture still holds. Even though similar tokens share context, that similarity is not organized around a single dominant neighbor. It is still distributed across the local neighborhood.

I also revisited the sequential side more carefully (using interpolated bigrams and proper train/test splits). There is a small sequential signal, but it is weak, and it is not stronger than what you get after shuffling within pages. So sequence alone does not seem to explain much of the structure.
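The sequential check can be as simple as an interpolated bigram log-score with a train/test split, scored on real pages and on the same pages shuffled internally; the lambda and the add-one smoothing below are arbitrary:

Code:
import math
from collections import Counter

def bigram_logscore(train, test, lam=0.7):
    uni = Counter(train)
    bi = Counter(zip(train, train[1:]))
    N, V = len(train), len(uni)
    ll = 0.0
    for prev, cur in zip(test, test[1:]):
        p_uni = (uni[cur] + 1) / (N + V)
        p_bi = (bi[(prev, cur)] + 1) / (uni[prev] + V)
        ll += math.log(lam * p_bi + (1 - lam) * p_uni)
    return ll / (len(test) - 1)
# run it on the real test pages and on the same pages shuffled within-page:
# if the two scores are close, the strict left-to-right order adds little.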

I also tried a simple HMM to see if a few latent states could explain the patterns. That did not work well at all. The model collapses and does not generalize to test data (it assigns zero probability in several cases), and the inferred states are not clean or interpretable. So at least in this form, a small-state discrete model does not seem to capture the structure.
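For completeness, a latent-state test of this general form can be run with the hmmlearn package (an assumed tool here; the number of states and iterations are placeholders, and out-of-vocabulary test tokens are simply dropped):

Code:
import numpy as np
from hmmlearn.hmm import CategoricalHMM   # pip install hmmlearn

def fit_small_hmm(train_tokens, test_tokens, n_states=8, seed=0):
    vocab = {t: i for i, t in enumerate(sorted(set(train_tokens)))}
    X_train = np.array([[vocab[t]] for t in train_tokens])
    model = CategoricalHMM(n_components=n_states, n_iter=50, random_state=seed)
    model.fit(X_train)
    # tokens unseen in training are dropped here; a real run would need smoothing,
    # which is one reason a small discrete model can end up assigning zero probability
    X_test = np.array([[vocab[t]] for t in test_tokens if t in vocab])
    return model.score(X_test) / max(len(X_test), 1)   # mean log-likelihood per token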

Summarizing, we could say:
  • There is strong local clustering of similar forms.
  • This clustering is symmetric in time (reading direction).
  • Similar forms share similar contexts (with a strong signal).
  • The structure is not dominated by nearest-neighbor copying.
  • Simple sequential models are weak.
  • Simple latent-state models fail.
So the picture that emerges is still the same, but now more constrained: it really looks like a local space of related forms, where tokens are selected from a structured neighborhood, rather than generated through a linear derivation process.

This is also where your comment fits in well, @Torsten. Your heatmaps already showed that similarity increases symmetrically around a token, not just from the left. What I think these additional results add is that this symmetric field is not only spatial, but also contextual, and that it does not reduce to a single dominant predecessor even locally.

The open question, at least for me, is now this: what kind of process generates something that behaves like a structured local cloud of related forms, rather than a sequence where each token clearly comes from the previous one?