The Voynich Ninja

Full Version: About the construction of lines in the MS
I fully share the notion that parsing that relies on word spaces may be wrong.

This is a very complicated topic. There are arguments suggesting that word spaces are significant, for example the fact that label words also appear as stand-alone words in the text. These arguments may not be valid; they may have several alternative explanations.

More importantly, even if word spaces are largely significant, I am not at all convinced that the bits contained between the spaces are actually words, even in the scenario that there is a meaningful text.
@nablator, @ReneZ,

I think the concern about spaces is completely valid. If the spacing is unreliable, then anything based on "words" becomes questionable. So, I tried to remove spaces entirely and see what happens.

The test is simply this: I take only paragraph lines, strip all spaces, and learn what line endings look like and what line beginnings look like, but purely at the character level. For this, I used a simple character-level n-gram model trained on line endings and line beginnings, and scored each possible cut by how well its surrounding characters match those distributions. Then I join many lines into one long continuous string and ask: can we still detect where the real line breaks were? The answer is yes, very clearly.
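
For concreteness, the scoring works roughly like this (a simplified Python sketch; the add-one smoothing, the window size K and the toy lines are illustrative assumptions, not necessarily the exact code I ran):

Code:
from collections import Counter
import math

K = 3  # assumed n-gram window on each side of a candidate cut

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def train_boundary_model(lines, k=K):
    """Count character k-grams seen at line ends, at line starts,
    and anywhere in the text (background)."""
    end_counts, start_counts, bg_counts = Counter(), Counter(), Counter()
    for line in lines:
        s = line.replace(" ", "")          # character level only: strip spaces
        if len(s) < 2 * k:
            continue
        end_counts[s[-k:]] += 1            # k-gram just before a real break
        start_counts[s[:k]] += 1           # k-gram just after a real break
        bg_counts.update(ngrams(s, k))     # all k-grams, as background
    return end_counts, start_counts, bg_counts

def logp(counts, gram, vocab):
    """Add-one smoothed log-probability of a k-gram under a count table."""
    return math.log((counts[gram] + 1) / (sum(counts.values()) + vocab))

def cut_score(text, i, model, k=K):
    """How much more 'break-like' position i looks compared to background."""
    end_counts, start_counts, bg_counts = model
    left, right = text[i - k:i], text[i:i + k]
    v = len(bg_counts) + 1
    return (logp(end_counts, left, v) + logp(start_counts, right, v)
            - logp(bg_counts, left, v) - logp(bg_counts, right, v))

# Toy usage: train on a few lines, join them without spaces, then compare
# the score of the real cut with the scores of nearby positions.
lines = ["daiin chedy qokeedy dal", "shedy qokain chey dar", "otedy qokedy chedy dy"]
model = train_boundary_model(lines)
joined = "".join(l.replace(" ", "") for l in lines)
true_cut = len(lines[0].replace(" ", ""))
for pos in (true_cut - 2, true_cut, true_cut + 2):
    print(pos, round(cut_score(joined, pos, model), 2))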

Here are the main numbers:

Dataset | Mean real score | Mean nearby | Mean random | Diff vs nearby | AUC
Same sections (Herbal/Biological) | 2.62 | -3.07 | -3.24 | +5.69 | ~0.93
Other sections | 2.95 | -2.71 | -3.32 | +5.65 | ~0.92

Real line breaks score about 5–6 points higher than nearby non-break positions, and the separation is strong (AUC around 0.9+). Also, in about 99% of cases, the real cut looks more like a line break than the surrounding positions.

So even after removing all spaces, the signal is still there, and it is strong.

This matters for the point you both raise. Maybe spaces are not fully reliable. Maybe what we call "words" are not real words. Maybe something like daram should indeed be split as dar am in some cases. But whatever the correct segmentation is, the line structure does not disappear when spaces are removed.

What the model is actually picking up is not “words” but character patterns. Certain endings keep appearing at line ends, like …dy, …dain, …dal, …dar, …ol. And certain patterns prefer the start of a line, like sol-, sor-, qot-, qok-, dch-, dsh-. In real examples you see things like:

…daroly | salche
…hdyoly | dcheyl
…eyqoly | dsheyq

These are not clean lexical units. They are just recurring fragments that behave differently at boundaries.

Even if spacing is imperfect, and even if tokens are not true words, there is still a real constraint at the level of characters that marks where lines end and begin. That constraint is strong enough to recover line breaks from a continuous no-space string.

-----------

As an extra check, I ran exactly the same character-level model again, but this time keeping spaces as ordinary characters instead of removing them. The result is also clear: the signal is already strong without spaces, so the line-break effect does not depend on tokenization. But if spaces are kept as characters, detection becomes even stronger. I am not saying that all spaces are correct, or that the spaced units are necessarily true words. It only means that the transmitted spacing is not random noise. It carries additional structural information that helps identify real line breaks.

Version | Dataset | Mean real score | Mean nearby score | Diff real-nearby | AUC vs nearby
No spaces | Held-out Herbal + Biological | 2.62 | -3.07 | +5.69 | ~0.93
Spaces kept as characters | Held-out Herbal + Biological | 3.46 | -3.31 | +6.77 | ~0.94-0.95
No spaces | Other sections | 2.95 | -2.71 | +5.65 | ~0.92
Spaces kept as characters | Other sections | 3.93 | -3.01 | +6.94 | ~0.94-0.95

Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.
(17-04-2026, 09:44 AM)quimqu Wrote: Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.

Spaces are mostly "correct" relative to the obvious word structure, but sometimes, when several choices seem plausible for space placement (for example, gallows and y can be either the first or the last character), the scribes seem to have been ignorant or careless: appending the character to the word on the left, prefixing it to the word on the right, leaving it floating between words as if they had no idea where it belonged, putting a small space to signal their indecision, or skipping both spaces entirely. In these cases transliterations are all over the place, randomly ignoring some visible spaces and adding some invisible spaces. So it seems to me that at the heart of the word-structure mystery is a hidden "correct" tokenization that the scribes were only approximating and didn't know how (or care) to enforce strictly.

Note: the Zattera 12-slot sequence is ambiguous for the purpose of re-spacing space-less Voynichese unless greedy pattern matching is done, and even then the result differs depending on whether the matching is done left-to-right or right-to-left; this is exactly what I mean by "ambiguous". Which unambiguous sequence is the "correct" one, and why (if such a thing is even useful for defining the actual building blocks of Voynichese), is a difficult question; some sequences might be optimal for some metric. Perhaps some compression algorithm could suggest an optimal choice of tokens, as has been suggested recently, but every algorithm's choice of text chunks for its dictionary is different. Perhaps a better way to approach the problem would be to find the best unambiguous slot sequence (used in a simple regular expression for greedily matching the space-less text into chunks) by a hill-climbing algorithm, using Shannon entropy to calculate the theoretical compression factor, and then check whether there is a universal, generally optimal slot sequence or whether several locally optimal sequences exist for different sections of the VMS. This is something I wanted to try before, but was unsure how to proceed. I think I'll do it this weekend.
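
Roughly what I have in mind, as a Python sketch (the slot inventory below is a simplified placeholder, not the actual Zattera sequence, and swapping adjacent slots is just one possible hill-climbing move):

Code:
import math, random, re
from collections import Counter

# Placeholder slot inventory: each slot is a set of optional glyph strings.
SLOTS = [("q",), ("o", "y"), ("l", "r", "k", "t"), ("ch", "sh"),
         ("e", "ee"), ("d", "k", "t"), ("a", "o"), ("i", "ii"),
         ("n", "r", "l", "m"), ("y",)]

def slots_to_regex(slots):
    """Greedy left-to-right matcher: every slot is optional, longest alternative first."""
    parts = ["(?:%s)?" % "|".join(sorted(alts, key=len, reverse=True))
             for alts in slots]
    return re.compile("".join(parts))

def chunk(text, pattern):
    """Split a space-less string into chunks by repeated greedy matching."""
    chunks, i = [], 0
    while i < len(text):
        m = pattern.match(text, i)
        if m and m.end() > i:
            chunks.append(text[i:m.end()])
            i = m.end()
        else:                       # nothing matches: emit one raw character
            chunks.append(text[i])
            i += 1
    return chunks

def bits_per_char(chunks, text_len):
    """Shannon entropy of the chunk distribution, per original character."""
    counts = Counter(chunks)
    n = sum(counts.values())
    h = -sum(c / n * math.log2(c / n) for c in counts.values())
    return h * n / text_len

def hill_climb(text, slots, steps=200):
    """Swap adjacent slots at random and keep changes that lower the cost."""
    best = list(slots)
    best_cost = bits_per_char(chunk(text, slots_to_regex(best)), len(text))
    for _ in range(steps):
        cand = list(best)
        i = random.randrange(len(cand) - 1)
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        cost = bits_per_char(chunk(text, slots_to_regex(cand)), len(text))
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

# text = "".join(line.replace(" ", "") for line in paragraph_lines)
# best_slots, cost = hill_climb(text, SLOTS)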

Sorry for hijacking your thread for this off-topic discussion. :)
I tested a simple question: how predictable is the text, depending on where you look?

Same setup everywhere. Take a fragment, try to predict what comes next, and rank the real continuation against many alternatives. If there is structure, the real one should rank high. If not, it should look random.
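
Concretely, the ranking test can be set up roughly like this (a simplified sketch; the character n-gram "glue" score and the way distractors are drawn from other lines are illustrative assumptions, not necessarily what my actual code does):

Code:
import math, random
from collections import Counter

def train_char_ngrams(lines, n=3):
    """Character n-gram counts over the whole corpus."""
    counts = Counter()
    for line in lines:
        counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    vocab = max(len(set("".join(lines))), 2) ** n   # crude smoothing denominator
    return counts, vocab

def glue_score(context, continuation, model, n=3, window=4):
    """Log-probability of the n-grams that straddle the join point."""
    counts, vocab = model
    total = sum(counts.values())
    joined = context[-window:] + continuation[:window]
    return sum(math.log((counts[joined[i:i + n]] + 1) / (total + vocab))
               for i in range(len(joined) - n + 1))

def rank_experiment(pairs, model, n_distractors=99, seed=0):
    """Rank each real continuation against continuations taken from other pairs.
    Returns mean rank, top-1 rate and top-5 rate."""
    rng = random.Random(seed)
    all_conts = [cont for _, cont in pairs]
    ranks = []
    for ctx, real in pairs:
        pool = [c for c in all_conts if c != real]
        distractors = rng.sample(pool, min(n_distractors, len(pool)))
        scored = sorted(distractors + [real],
                        key=lambda c: glue_score(ctx, c, model), reverse=True)
        ranks.append(scored.index(real) + 1)
    top1 = sum(r == 1 for r in ranks) / len(ranks)
    top5 = sum(r <= 5 for r in ranks) / len(ranks)
    return sum(ranks) / len(ranks), top1, top5

# pairs = [(left_part, right_part), ...]   # line halves, or (line end, next line start)
# model = train_char_ngrams([l + r for l, r in pairs])
# print(rank_experiment(pairs, model))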

First, across lines. I used the end of a line to predict the start of the next one. The signal is weak. The real continuation behaves almost like a random candidate. Top-1 is only a few percent and the average rank sits near the middle. So line-to-line continuity is poor.

Then I did the same inside the line. I split each line roughly in half, always on a space, and used the left part to predict the right part.

Here the behaviour changes a lot. The average rank drops to ~16 out of ~100 candidates, the median is 5, top-1 is ~30%, and more than half of the real continuations fall in the top 5. The real continuation is clearly better than average, although not always the best possible one.

So the contrast is clear. Across lines, almost random. Inside lines, stronger signal.

There are caveats. Frequency effects and position inside the line can inflate predictability, and this does not imply a single deterministic continuation. But the gap is large enough that it is unlikely to be just noise.

One detail matters. The real continuation beats the average candidate, but often not the best one. That suggests multiple plausible continuations rather than a fixed sequence.

There seems to be real local structure inside lines, but it does not carry cleanly across line boundaries, and even locally the system seems to select among compatible variants rather than follow a single path.


Note: I can make all my code available if someone wants to check it.
(17-04-2026, 06:37 PM)quimqu Wrote: I can make all my code available if someone wants to check it.

I would love to see your code, not only to check it but to learn from it. I don't understand what you are talking about exactly: what kind of model you train, how you score alternatives, etc.
(17-04-2026, 07:40 PM)nablator Wrote: I would love to see your code, not only to check it but to learn from it. I don't understand what you are talking about exactly: what kind of model you train, how you score alternatives, etc.

I cleaned the code and wrote references and comments.
I checked again line-internal continuity more strictly.

If I split lines in the middle, the real continuation looks very strong at first. It often ranks near the top, with about 30% top-1 and over 50% top-5. So there is clearly structure inside the line.

But once I control for global frequency, most of that signal disappears. Comparing only against equally frequent alternatives, the real continuation is no longer special. Top-1 drops to ~2% and top-5 to ~5%, and the excess score is near or below zero. The same happens even if I switch to rough morphological patterns instead of exact chunks.
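
The frequency control looks roughly like this (a sketch; the log-frequency tolerance and minimum pool size are arbitrary choices of mine, and the scorer can be any context-continuation scoring function, such as the glue score from the earlier sketch):

Code:
import math, random
from collections import Counter

def frequency_matched_rank(pairs, model, scorer, tol=0.25, n_distractors=99, seed=0):
    """Rank each real continuation only against distractors of similar global
    frequency (within `tol` in log-frequency). Returns the mean rank, or None
    if too few items have enough matched controls."""
    rng = random.Random(seed)
    freqs = Counter(cont for _, cont in pairs)
    ranks = []
    for ctx, real in pairs:
        target = math.log(freqs[real])
        pool = [c for c in freqs
                if c != real and abs(math.log(freqs[c]) - target) <= tol]
        if len(pool) < 20:          # not enough frequency-matched controls
            continue
        distractors = rng.sample(pool, min(n_distractors, len(pool)))
        scored = sorted(distractors + [real],
                        key=lambda c: scorer(ctx, c, model), reverse=True)
        ranks.append(scored.index(real) + 1)
    return sum(ranks) / len(ranks) if ranks else None

# Usage (hypothetical): frequency_matched_rank(pairs, model, glue_score)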

So there is local regularity, but it looks mostly formulaic. The line interior is structured, but not strongly driven by immediate context.
(17-04-2026, 09:44 AM)quimqu Wrote: So even after removing all spaces, the signal is still there, and it is strong.

I am not sure I understood your point... But this experiment does not imply that "the line-end anomalies are not caused by or related to spaces".

Line-breaking algorithms -- which split a stream of tokens into lines -- directly change the statistics of token lengths at both ends of the lines. But different token-length distributions imply different word distributions, and those imply different character and digraph distributions. These differences will persist even if you remove spaces from the resulting lines.

For instance, the frequency of the digraph "th" in English is largely determined by the frequency of the common words that use it: "the", "this", "that", "they", "them", "thus", "there", "then", etc.  But those words are shorter than average, so the TLA will make them more common at the end of lines and less common at the start.  Thus, even if you remove spaces from English text after running the TLA, you should still see the frequency of "th" enhanced near the end of lines and depressed near the beginning of lines. 
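
This is easy to check on any English plain text (a rough sketch; the file name is a placeholder, and the trivial greedy filler below is only a crude stand-in for a real TLA, so part of the exercise is seeing whether it already produces the effect):

Code:
def greedy_line_break(tokens, width=40):
    """Fill each line as full as possible without exceeding `width`."""
    lines, cur = [], []
    for t in tokens:
        if cur and len(" ".join(cur + [t])) > width:
            lines.append(cur)
            cur = []
        cur.append(t)
    if cur:
        lines.append(cur)
    return lines

def digraph_rate_at_edges(lines, digraph="th", window=6):
    """Occurrences of `digraph` per character in the first/last `window`
    characters of each line, after spaces are removed."""
    s_hits = e_hits = s_chars = e_chars = 0
    for toks in lines:
        s = "".join(toks)                  # line with spaces deleted
        head, tail = s[:window], s[-window:]
        s_hits += head.count(digraph); s_chars += len(head)
        e_hits += tail.count(digraph); e_chars += len(tail)
    return s_hits / s_chars, e_hits / e_chars

tokens = open("english_sample.txt").read().lower().split()  # placeholder corpus
start_rate, end_rate = digraph_rate_at_edges(greedy_line_break(tokens))
print("th rate at line starts:", start_rate, " at line ends:", end_rate)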

And that is even more true of the more sophisticated line-breaking algorithm (SLA) that the scribe actually used, which included the use of abbreviations and stretching/compression. The effects of the SLA, too, will persist even if you remove the spaces from the resulting lines. In the case of the VMS, the occurrence of an m will be a good hint that there may have been a line break just after it.

And if you join those lines into a single character stream without any spaces, a good pattern detector should be able to locate the line breaks, based on those frequency anomalies.  What is happening is that the pattern recognition algorithm learns to see several short words before one long word, and learns that a line break is more likely to occur there than elsewhere.

But this is true only if the TLA or SLA is given a stream of tokens. If you give it just a stream of characters, without the spaces, obviously there will not be any statistical anomalies around the resulting line breaks, which will be blindly inserted exactly every N characters.

Quote: Maybe what we call "words" are not real words. [... but] Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.

Well, you know my (very strong now) beliefs about the nature of the text.

But independently of my theories, I propose that the following claims are true, and should be confirmed by experiments:

  1. Whatever their function, word spaces were important to the Author. Thus the draft that the Scribe received was a sequence of tokens separated by spaces (and not just a sequence of non-blank characters that the Scribe would split into words at will).
  2. The Scribe was supposed to ignore the line breaks of the draft, within each parag, and insert new line breaks so as to properly fill the space between the text rails.
  3. The Scribe did not have permission to:
  4. .. join tokens by discarding word spaces,
  5. .. break tokens by inserting new spaces,
  6. .. or split a token across a line break.
  7. The Scribe had a handful of abbreviations that he could use to help avoid bad breaks or rail overflow. Changing iin to m was one. Any others?

The arguments for (1) are many.  The argument for (2) is that the lines (other than parag tails) generally fill the space between the rails more or less neatly -- even when the rails are slanted, bent, or broken because of vellum defects that not even the Scribe could have predicted.  Note that (2), in particular, denies the LAAFU theory.

I don't have arguments at hand for items (3)-(6), but I believe that they have been verified long ago.

The argument for (7), of course, is that the frequency of all words ending in iin is depressed at line end, while that of the corresponding words with iin replaced by m is enhanced.
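
A crude way to check this claim by counting (a sketch of my own; `lines` would hold paragraph lines from a transliteration such as EVA, and matching the specific iin/m word pairs would be a sharper test than bare suffix counts):

Code:
def edge_vs_interior_rates(lines, suffix):
    """Share of tokens ending in `suffix`, at line-final position vs elsewhere."""
    final_hits = final_total = inner_hits = inner_total = 0
    for line in lines:
        toks = line.split()
        if not toks:
            continue
        final_total += 1
        final_hits += toks[-1].endswith(suffix)
        for t in toks[:-1]:
            inner_total += 1
            inner_hits += t.endswith(suffix)
    return final_hits / max(final_total, 1), inner_hits / max(inner_total, 1)

# lines = [...]  # paragraph lines from a transliteration
# for suf in ("iin", "m"):
#     print(suf, edge_vs_interior_rates(lines, suf))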

Is there significant evidence against claims (1)-(7)?  Again, the mere observation of statistical anomalies around line breaks is not a valid argument, unless it can be shown that such anomalies cannot be simply side effects of the SLA.  I don't think this has been shown yet. 

All the best, --stolfi
Jorge,

Good point, and I agree with your reasoning about SLA effects surviving space removal.

But just to clarify what I did, after removing spaces I’m not working with tokens anymore. I’m looking at character sequences directly, so in principle I’m removing the “word layer” and any token-based bias.

Even then, the model can still detect line starts and ends quite clearly from character patterns alone. That’s the part I find hard to explain purely as a side effect of line breaking on tokens.

Maybe SLA can still induce that indirectly through token composition, but then it would have to leave a pretty strong and structured footprint at the character level.
(18-04-2026, 07:13 PM)quimqu Wrote: Even then, the model can still detect line starts and ends quite clearly from character patterns alone. That’s the part I find hard to explain purely as a side effect of line breaking on tokens.

Again, by changing the distribution of token lengths around line breaks, the TLA changes the frequencies of words at those places, and therefore the frequencies of characters and n-grams at those places. See the example of "th" in English. 

Thus, even if you delete spaces and join the lines into a single string of characters, the character and n-gram frequencies will still be anomalous around the deleted line breaks.

And this effect is even stronger with the SLA, since the enhanced occurrence of m and reduced occurrence of iin will still be anomalies of the character stream at the places of the deleted breaks.

That is why I still don't consider the LAAFU theory proved.  I still don't see evidence that the statistical anomalies around line breaks cannot be explained as side effects of the SLA.

All the best, --stolfi