The Voynich Ninja

Full Version: About the construction of lines in the MS
(18-04-2026, 01:36 AM)Jorge_Stolfi Wrote: The Scribe had a handful of abbreviations that he could use to help avoid bad breaks or rail overflow. Changing iin to m was one.

Hello Jorge,

I tried the m -> iin substitution in the notebook, both in the line-break detection and in the mid-line continuation tests.

Short version: it doesn't improve things. If anything, it slightly degrades the signal.

In the line-break test, AUC stays basically the same, with only a tiny local gain vs nearby non-cuts. But in the continuation tests, the effect is clearly negative. Real continuations become less distinguishable from alternatives, and final-token end scores drop consistently across sections.

So overall, expanding m to iin does not make the structure more visible in these models, and tends to weaken it.

What kind of change would you have expected to see if the abbreviation hypothesis were correct? I can test any ideas of yours.
(18-04-2026, 11:32 PM)quimqu Wrote: So overall, expanding m to iin does not make the structure more visible in these models, and tends to weaken it.

I am not sure I understand your explanation (what was AUC, again?).  But the expectation was that the replacement of m by iin would indeed make the line-end anomalies weaker.

The test I was thinking of was: in each parag,
  • Replace m by iin everywhere.
  • Join lines into a single token stream.
  • Feed that stream to the SLA with different rail width W.
  • Measure the anomalies around the new line breaks.
  • Compare with the anomalies seen in the original text.

The SLA is what I described previously: like the basic TLA, but with some low-probability random abbreviation of iin to m everywhere, and mandatory abbreviation at line end if it would delay the line break.

As before, the line length limit W should be specified as a max number of EVA characters, glyphs, or mm; not as max number of words.  

The new length limit W should be either ~62% of the parag's average line length, or ~162%, to minimize the chance of new breaks coinciding with the old ones.
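
For concreteness, a minimal Python sketch of such a breaker (the greedy fill, the 5% abbreviation rate, and the toy token stream are placeholder assumptions, not anything specified in this thread):

Code:
import random

def break_lines(tokens, W, abbrev_rate=0.0, force_at_end=False, seed=0):
    """Greedy line breaker; W = max line width in EVA characters."""
    rng = random.Random(seed)
    lines, cur, width = [], [], 0
    for tok in tokens:
        # SLA only: occasional random abbreviation of iin to m.
        if abbrev_rate and tok.endswith("iin") and rng.random() < abbrev_rate:
            tok = tok[:-3] + "m"
        need = len(tok) + (1 if cur else 0)  # +1 for the word space
        # SLA only: mandatory abbreviation at line end if it delays the break.
        if force_at_end and width + need > W and tok.endswith("iin"):
            short = tok[:-3] + "m"
            if width + len(short) + (1 if cur else 0) <= W:
                tok, need = short, len(short) + (1 if cur else 0)
        if cur and width + need > W:
            lines.append(cur)
            cur, width = [tok], len(tok)
        else:
            cur.append(tok)
            width += need
    if cur:
        lines.append(cur)
    return lines

tokens = "qokeedy daiin shedy qokaiin chedy otaiin dam chol".split() * 3
tla = break_lines(tokens, W=25)                                       # plain TLA
sla = break_lines(tokens, W=25, abbrev_rate=0.05, force_at_end=True)  # SLA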

I predict that with this test the anomalies around the new line breaks should be stronger than those created by the plain TLA, but probably weaker than those seen in the current line breaks.  That's because the actual Scribe's algorithm surely used more tricks than just "abbreviate iin to m."

All the best, --stolfi
(19-04-2026, 01:03 AM)Jorge_Stolfi Wrote: The test I was thinking of was: in each parag,
  • Replace m by iin everywhere.
  • Join lines into a single token stream.
  • Feed that stream to the SLA with different rail width W.
  • Measure the anomalies around the new line breaks.
  • Compare with the anomalies seen in the original text.

I tried the test you suggested. 

As I understood it:
  • TLA is just a simple line-breaking algorithm. You take a continuous text stream and cut lines at a fixed width W, without any special tricks.
  • SLA is the same basic algorithm, but with an added abbreviation mechanism. It allows occasional iin -> m, and at line end it forces that abbreviation if it helps delay the break.

So I expanded m -> iin, joined each paragraph into one token stream, then relaid it with two width limits, using both a plain TLA and an SLA-style version with random iin -> m plus forced abbreviation at line end.
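
For reference, the preprocessing step looked roughly like this (a simplified sketch; the helper names are mine, and m is assumed to be strictly word-final):

Code:
def expand_m(tokens):
    # Undo the hypothesized abbreviation: word-final m -> iin.
    return [t[:-1] + "iin" if t.endswith("m") else t for t in tokens]

def relay_widths(lines):
    # Mean original line length in EVA characters (one space per word gap),
    # then the ~62% and ~162% limits suggested above.
    lens = [sum(len(t) for t in ln) + len(ln) - 1 for ln in lines]
    mean = sum(lens) / len(lens)
    return round(0.62 * mean), round(1.62 * mean)

paragraph = [["daiin", "chedy", "qokam"], ["shedy", "qokaiin", "cham"]]
stream = [t for line in paragraph for t in expand_m(line)]
W_short, W_long = relay_widths(paragraph)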

What I get is that the real line breaks are still much stronger than the new ones in all cases, which is fine. But the important part is that the SLA does not come out stronger than the plain TLA. On average it is actually slightly worse.

So with this implementation I am not seeing the pattern real > SLA > plain TLA. I am seeing more like real >> plain and real >> SLA, with SLA <= plain overall.
(19-04-2026, 06:29 PM)quimqu Wrote: What I get is that the real line breaks are still much stronger than the new ones in all cases, which is fine.

I expected them to be stronger... but "much"?

Quote:But the important part is that the SLA does not come out stronger than the plain TLA. On average it is actually slightly worse.

This is strange.  I thought that the final m was an important part of the "LAAFU anomalies".  The SLA should create more m at the new line ends, and suppress them at the old ones.  How could it be worse than TLA?

I suppose you are using a neural network to measure the anomalies, so you cannot tell what exactly it is measuring?

All the best, --stolfi
(20-04-2026, 01:18 AM)Jorge_Stolfi Wrote: I suppose you are using a neural network to measure the anomalies, so you cannot tell what exactly it is measuring?

I’m not using a neural net. It’s a simple score.

Roughly, I learn what typical line endings look like and what typical line beginnings look like, at the character level. Then for any cut I score how “end-like” the left side is and how “start-like” the right side is, compared to the overall background. Real breaks tend to have high scores on both sides.
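
Concretely, the per-cut score is the sum of two log-ratio terms, something like the sketch below (the family key here is a simplified placeholder; the real one is described later in the thread):

Code:
def family(tok):
    # Placeholder family key: a length bin plus the edge characters.
    return (min(len(tok), 8), tok[:1], tok[-1:])

def cut_score(left, right, A_end_score, A_start_score):
    # High when the left token looks line-end-like AND the right token
    # looks line-start-like, relative to line-interior tokens.
    return (A_end_score.get(family(left), 0.0)
            + A_start_score.get(family(right), 0.0))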

AUC (Area Under the Curve) here just means: probability that a real break scores higher than a fake one. 0.5 is no signal, 1.0 is perfect.
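
In code form, assuming two lists of cut scores and counting ties as half a win:

Code:
def auc(real_scores, fake_scores):
    # Probability that a random real break outscores a random fake one.
    wins = sum((r > f) + 0.5 * (r == f)
               for r in real_scores for f in fake_scores)
    return wins / (len(real_scores) * len(fake_scores))

print(auc([3.1, 2.4, 2.9], [0.2, 1.0, 2.5]))  # 0.5 = no signal, 1.0 = perfect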

On the m → iin change:

AUC barely moves:
  • heldout ~0.969 → ~0.968
  • other ~0.966 → ~0.967

So essentially unchanged.

But in the harder comparisons (real breaks vs controls), the separation clearly drops:
  • heldout ~0.88 → ~0.65
  • other ~1.10 → ~0.45

End-of-line scores also go down everywhere, sometimes a lot (e.g. ~1.17 → ~0.70 in Cosmological).

And in your test, SLA does not beat TLA. On average it’s slightly worse (around −0.17 in heldout at W≈0.62, smaller differences elsewhere).

So expanding m to iin does not make the structure more visible in this setup. If anything it weakens it, including at real line ends.

What kind of effect did you expect: a slight improvement, or clearly SLA > TLA?
(20-04-2026, 07:41 AM)quimqu Wrote: It’s a simple score.

Score(start/end) = base-2 logarithm of the ratio of the start (or end) probability to the interior probability.

Python code:
Code:
import numpy as np

# fam_pos: DataFrame with one row per token family and count columns "start", "end", "interior".
tot_end = fam_pos["end"].sum()
tot_start = fam_pos["start"].sum()
tot_int = fam_pos["interior"].sum()

A_end_score, A_start_score = {}, {}

for _, r in fam_pos.iterrows():
    fam = r["family"]

    p_end = r["end"] / tot_end if tot_end else 0.0
    p_start = r["start"] / tot_start if tot_start else 0.0
    p_int = r["interior"] / tot_int if tot_int else 0.0

    # Positive value = enriched at line edge relative to interior.
    A_end_score[fam] = np.log2(p_end / p_int) if p_end > 0 and p_int > 0 else np.nan
    A_start_score[fam] = np.log2(p_start / p_int) if p_start > 0 and p_int > 0 else np.nan
(20-04-2026, 07:41 AM)quimqu Wrote:
(20-04-2026, 01:18 AM)Jorge_Stolfi Wrote: I suppose you are using a neural network to measure the anomalies, so you cannot tell what exactly it is measuring?
I’m not using a neural net. It’s a simple score. Roughly, I learn what typical line endings look like and what typical line beginnings look like, at the character level. Then for any cut I score how “end-like” the left side is and how “start-like” the right side is, compared to the overall background.

Sorry, I was confused by your use of the word "learn".   What do you mean by that? 

Do you mean that, for each character c, you compute

  F0(c) = frequency of c as the word-final character for "middle" words
  FV(c) = frequency of c as the line-final character in the actual VMS text
  FT(c) = frequency of c as the line-final character after re-justifying with TLA
  FS(c) = frequency of c as the line-final character after re-justifying with SLA

then you compute scores mV, mS, mT that compare the distributions FV, FT, FS with F0?  

Or do you use those distributions to build three "line break predictors" for V, T, and S, and then compute scores that compare their performances somehow?
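
The first option could be sketched like this (the total-variation distance is just one plausible way to compare the distributions, not something agreed on here):

Code:
from collections import Counter

def final_char_freqs(lines, line_final=True):
    # line_final=True  -> last char of each line-final word (FV/FT/FS style);
    # line_final=False -> last char of each "middle" word (F0 style).
    cnt = Counter()
    for ln in lines:
        if not ln:
            continue
        for tok in ([ln[-1]] if line_final else ln[:-1]):
            cnt[tok[-1]] += 1
    total = sum(cnt.values())
    return {c: n / total for c, n in cnt.items()}

def tv_distance(p, q):
    # Total-variation distance between two character distributions.
    return 0.5 * sum(abs(p.get(c, 0) - q.get(c, 0)) for c in set(p) | set(q))

# e.g. mV = tv_distance(F0, FV), mT = tv_distance(F0, FT), mS = tv_distance(F0, FS)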

I ask because I wonder if the results you see are being determined by some other specific character, besides m (or n), that is more (or less) frequent in FV but is not created (or suppressed) by SLA.

All the best, --stolfi
(20-04-2026, 11:50 AM)Jorge_Stolfi Wrote: Sorry, I was confused by your use of the word "learn".  What do you mean by that?

Hello Jorge,

by "learn" I just meant estimating empirical frequencies from the training set. 

In this version, I am not using a neural net or a full character n-gram model. Instead, I map tokens to coarse "families" (based on length bin and edge substrings), and estimate how often each family appears in three positions: line start, line end, and line interior.

From that I compute two scores:
end score = log2(P(family | line end) / P(family | interior))
start score = log2(P(family | line start) / P(family | interior))

These scores are then used in several tests to evaluate whether candidate boundaries look more like real line breaks than controls.

So the basic idea is: does the token on the left look unusually line-end-like, and does the token on the right look unusually line-start-like, relative to interior tokens?
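
A sketch of how the fam_pos table feeding the earlier scoring loop could be built (the exact family key, here a length bin plus 2-character edge substrings, is my simplification):

Code:
from collections import Counter
import pandas as pd

def family(tok):
    # Coarse family key: length bin plus edge substrings (assumed length 2).
    return (min(len(tok), 8), tok[:2], tok[-2:])

def build_fam_pos(lines):
    counts = {"start": Counter(), "end": Counter(), "interior": Counter()}
    for ln in lines:
        for i, tok in enumerate(ln):
            pos = "start" if i == 0 else "end" if i == len(ln) - 1 else "interior"
            counts[pos][family(tok)] += 1
    fams = sorted(set().union(*counts.values()))
    return pd.DataFrame({"family": fams,
                         **{p: [counts[p][f] for f in fams] for p in counts}})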

Given the previous m→iin experiment, it looks like the signal is not driven mainly by that specific abbreviation. The structure stays strong even when we remove it, and SLA still doesn’t reproduce it. So it seems that whatever is creating the boundary signal involves multiple patterns, not just m (or n). I can try to check whether specific characters or short patterns are disproportionately contributing to the effect. I'll let you know...
(18-04-2026, 01:36 AM)Jorge_Stolfi Wrote: Is there significant evidence against claims (1)-(7)?  Again, the mere observation of statistical anomalies around line breaks is not a valid argument, unless it can be shown that such anomalies cannot be simply side effects of the SLA.


I believe there is evidence that speaks against claim (2) in particular.

Since I don't want to be accused of forcing my own work on anyone, I'll point here to a finding by a couple of other researchers -- namely, the statistical anomalies among the second words of lines in Quire 20 discovered by Emma May Smith and Marco Ponzi.

How could the tendency of the second words of lines to begin disproportionately with [Sh] and [s] be explained as a byproduct of an algorithm for splitting lines based on word length?
I've been running code to test something very specific: whether, when a line is cut in the Voynich, it simply "has to" take a word of a certain type, or whether there is a finer choice within that type.

First I took all the lines and converted them into pairs of adjacent words: those separated by a real line break, and those within the same line (as a control). Two comparable groups, built as in the sketch below.
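
A sketch of that pairing step (assuming each paragraph is a list of token lists, one per line):

Code:
def make_pairs(paragraph_lines):
    break_pairs, control_pairs = [], []
    for a, b in zip(paragraph_lines, paragraph_lines[1:]):
        if a and b:
            break_pairs.append((a[-1], b[0]))   # pair straddling a line break
    for ln in paragraph_lines:
        control_pairs.extend(zip(ln, ln[1:]))   # pairs within the same line
    return break_pairs, control_pairs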

Then I removed the most obvious signal: typical prefixes and suffixes that we already know mark the beginning or end of a line. This is important because otherwise the model only detects trivial things like "words that end in X tend to go at the end".

Once this strong signal was removed, I looked at what was left. This is where the idea of a "family" comes in: grouping similar words (similar length, same beginning and end, etc.) as variants of the same type.

The key question: within the same family, do all variants behave the same?

The result is clear: no.

Within a family, there are words that appear much more often than they should in line-breaking positions, and others that appear less often. For example, very common forms like "daiin" tend to avoid these positions, while less common variants of the same family appear more often.

This is not a small effect. Even after removing the obvious start/end markers, the real word still ranks near the top among ~100 candidates about 60–65% of the time (top-1), and almost always within the top 5 (~95–99%). But when you force the model to choose only within the same frequency band or family, this advantage drops sharply, and many “equally similar” variants are clearly not interchangeable.
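
The ranking test, as a rough sketch (score stands in for whatever model rates a candidate's fit at a position, and the real word is assumed not to be among the ~100 decoys):

Code:
def rank_of_real(real, decoys, context, score):
    # Rank of the real token among the decoys under the scoring model.
    scored = sorted(decoys + [real], key=lambda w: score(context, w),
                    reverse=True)
    return scored.index(real) + 1  # 1 = ranked first

# top-1 rate = fraction of positions with rank 1    (~60-65% reported above)
# top-5 rate = fraction of positions with rank <= 5 (~95-99% reported above)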

This means that it is not enough to say "here is a word of this type". The system also decides which specific variant to put there.

And this effect does not disappear when you remove the obvious patterns. There is still structure. Not only at the level of the whole word, but also in internal fragments.

The conclusion: the text is not generated by randomly choosing words from a similar set. There is a kind of internal selection. First, a family of possible forms is chosen, and then, within this family, a specific variant is chosen according to the context—especially near line boundaries.

In other words: lines are not just closed with "words of the right type," but with specific forms within that type. This points to a richer structure than would appear if you were just looking at frequencies or overall similarities.

PS. All tests are available in the Kaggle notebook I attached in a previous post.