The Voynich Ninja

Full Version: About the construction of lines in the MS
(23-04-2026, 10:33 PM)Rafal Wrote: I agree. This is unnatural.  There is such a concept in Voynich studies as LAAFU, Line As A Functional Unit. I believe this observation may be considered part of LAAFU. It is indeed a big puzzle and something that does not happen in real texts.

There are definitely many statistical and structural anomalies at the start and end of the lines. 

The question is whether those anomalies are due to line breaks being "functionally" significant, either semantically (as they would be in poetry or tables) or for the "encryption algorithm".  That is the LAAFU theory.  

The alternative BAAA ("Break Anomalies Are Accidental") theory is that those anomalies are due to "non-functional" causes; specifically, that they are consequences of the line breaking process itself. 

This alternative assumes a Scribe who merely copied the stream of tokens of the Author's draft, without understanding or caring about the meaning of the text.   It assumes that the Scribe disregarded the line breaks of the draft and introduced his own breaks based on space considerations alone. These assumptions would exclude any "functional" role for the line breaks.

BAAA also takes into account the possibility that the Scribe often modified the words around breaks in various "non-functional" ways, including abbreviation, compression, stretching, puffing, etc. The abbreviation of iin as m seems well-established, but there may be many other abbreviations in use at line end.  Conversely, the writing generally seems to be more stretched out at the beginning of each line, and that apparently increases the frequency of "detached initials" there (whereby ytchey would become y.tchey, for example).

I believe that BAAA has not yet been properly investigated, and is far from being excluded.  It seems that many LAAFU proponents are still ignoring the effect of line breaking (even with the basic TLA) on the distribution of words at line start.

All the best, --stolfi
(24-04-2026, 07:00 AM)Jorge_Stolfi Wrote: There are definitely many statistical and structural anomalies at the start and end of the lines.

Learn something new every day: the sometimes bloated vords at the beginning, then the crunch towards the end. Now that's analysis.  Smile
(24-04-2026, 06:10 AM)Jorge_Stolfi Wrote: Are you sure this result is real, and not just an artifact of sampling error?

Jorge,

I see your point about the sample size and exact pairs. That’s why I didn’t rely on counting repeated pairs in my tests.

What I did instead was compare the real continuation with many alternatives, both inside lines and across line breaks.

The result is quite consistent:
  • inside the line, the real continuation tends to score clearly better than alternatives
  • across lines, the real continuation behaves almost like a random choice

So the difference shows up even without needing repeated pairs.

I also checked this in a few ways to make sure it’s not an artifact: using large sections (Herbal and Biological), repeating with different random splits, comparing against controls with similar frequencies... And one important point: I removed spaces and line breaks completely and worked only at the character level. Even then, the model can still detect where line breaks were. So the signal is not just coming from token boundaries or spacing. Even in that setup, the same pattern can be seen: strong structure locally, but very weak continuity across lines.
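The character-level setup described above can be sketched roughly as follows. This is a toy illustration with an invented mini-corpus and a plain bigram model, not the actual test code (the real tests use the full transcription and a larger model):

```python
# Rough sketch (not the actual test code) of the continuation check:
# strip spaces and line breaks, fit a character-level model, then see
# how well the true continuation at a position scores against
# alternative characters.  A plain bigram model stands in for the
# real model used in the tests.
from collections import Counter, defaultdict

def bigram_probs(text):
    """P(next char | current char), estimated from a character stream."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

def beats_alternatives(model, context, true_next, alternatives):
    """Fraction of alternative chars the true continuation ties or beats."""
    p_true = model.get(context, {}).get(true_next, 0.0)
    wins = sum(1 for alt in alternatives
               if model.get(context, {}).get(alt, 0.0) <= p_true)
    return wins / len(alternatives)

# Toy "lines" with spaces already removed, as in the setup described.
lines = ["qokeedyqokain", "shedyqokeedy", "daiinchedyqok"]
model = bigram_probs("".join(lines))

# Within a line, a frequent continuation like q -> o scores well
# against arbitrary alternatives:
alts = ["d", "k", "s", "a", "e"]
print(beats_alternatives(model, "q", "o", alts))  # -> 1.0
```

Repeating this scoring at mid-line positions versus positions straddling a (removed) line break is the comparison in question: if the real continuation only beats alternatives within lines, the break carries a detectable signal even with spacing gone.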

I agree that small samples can be misleading with pair counts, but the effect I’m seeing does not depend on that. It appears in setups where sparsity is not the limiting factor.

P.S. I don't know if @nablator has checked my code. There might be errors. In the last version all the tests run from beginning to end. It takes a bit less than an hour to run all the tests on Kaggle, but I am absolutely open to double-check anything.
(24-04-2026, 05:56 PM)quimqu Wrote: P.S. I don't know if @nablator has checked my code.

No, it would take me far too long at my low level of Python understanding. Maybe I'll do something similar in a language I understand and then compare the results. But it looks too much like real work and the weather is too nice to spend any time coding. Smile
(24-04-2026, 05:56 PM)quimqu Wrote: And one important point: I removed spaces and line breaks completely and worked only at the character level.

Ah, OK.  I will try to repeat my analysis without spaces, and by mapping EVA letters to classes (the mapping that I showed on another thread, that mostly erases the difference between languages A and B).

All the best, --stolfi
(24-04-2026, 05:56 PM)quimqu Wrote: And one important point: I removed spaces and line breaks completely and worked only at the character level.

Well, I don't know whether I coded it right, but here is what I got:

[attachment=15317]

For this plot I used my transcription of the Starred Parags section.  I discarded parag head and tail lines, deleted all word spaces, and mapped glyphs to glyph classes:
  Q = {q},
  O = {a,o,y},
  B = {ch,che,sh,she,ee,eee},
  K = {k,t,p,f,ke,te},
  N = {m,im,n,in,iin,iiin,ir,iir},
  D = {d,l,r,s}
Platform gallows (cth etc.) were mapped to BK.  This mapping should reduce the impact of sampling error.
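For anyone wanting to replicate this, the mapping can be coded as a greedy longest-match pass over an EVA string. The sketch below is one guess at the tokenization (in particular, the exact set of platform gallows is an assumption), not stolfi's actual script:

```python
# Sketch (not the actual script) of the glyph-to-class mapping above,
# via greedy longest-match tokenization of an EVA string.
CLASSES = {
    "Q": ["q"],
    "O": ["a", "o", "y"],
    "B": ["ch", "che", "sh", "she", "ee", "eee"],
    "K": ["k", "t", "p", "f", "ke", "te"],
    "N": ["m", "im", "n", "in", "iin", "iiin", "ir", "iir"],
    "D": ["d", "l", "r", "s"],
}
# Invert to glyph -> class code; the platform-gallows list is assumed.
GLYPH_TO_CLASS = {g: c for c, glyphs in CLASSES.items() for g in glyphs}
for gallows in ("cth", "ckh", "cph", "cfh"):
    GLYPH_TO_CLASS[gallows] = "BK"

def to_classes(eva):
    """Map an EVA string to class codes by greedy longest match."""
    out, i = [], 0
    maxlen = max(map(len, GLYPH_TO_CLASS))
    while i < len(eva):
        for n in range(min(maxlen, len(eva) - i), 0, -1):
            cls = GLYPH_TO_CLASS.get(eva[i:i + n])
            if cls:
                out.append(cls)
                i += n
                break
        else:
            i += 1  # skip a character with no class (e.g. a stray 'e')
    return "".join(out)

print(to_classes("qokaiin"))  # -> QOKON
```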

Then I discarded all lines with fewer than 22 letters. Then, for each distance d in 0 to 10 and each line, I picked a pair of characters W1,W2 separated by d chars.  In one run (red plot) the pair was near the center of the line.  In the other run (blue plot) the pair straddled the line break: that is, char W1 was floor(d/2) chars before the end of the line and char W2 was ceil(d/2) chars into the next line (even if it was a tail line).
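The pair-selection rule can be illustrated like this (a sketch of the rule just described, not the script actually used):

```python
# Illustration of the pair-selection rule (not the actual script):
# for a distance d, pick a mid-line pair and a pair straddling the
# break between two consecutive lines, with d characters between
# W1 and W2 in both cases.

def midline_pair(line, d):
    """Two chars near the center of the line, with d chars between them."""
    i = (len(line) - d) // 2 - 1
    return line[i], line[i + d + 1]

def straddling_pair(line, next_line, d):
    """W1 is floor(d/2) chars before line end, W2 is ceil(d/2) chars into the next."""
    w1 = line[len(line) - 1 - d // 2]
    w2 = next_line[(d + 1) // 2]
    return w1, w2

# With d=2 there are two characters between W1 and W2 in both cases:
print(midline_pair("abcdefgh", 2))                 # -> ('c', 'f')
print(straddling_pair("abcdefgh", "ijklmnop", 2))  # -> ('g', 'j')
```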

In total, each run collected ~350 pairs.  From this data I computed the predictability P12 of W2 given W1. This quantity was defined as E2-E12: the entropy E2 (in bits) of the overall distribution of the W2 char, minus the conditional entropy E12, i.e. the entropy of the distribution of W2 for each possible char W1, averaged over the W1 with their respective frequencies.  The plot shows that as a fraction P12/E2.
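The quantity P12 = E2 - E12 can be computed from the collected pairs along these lines (a minimal sketch of the definition above, not the plotting script; the example pairs are invented):

```python
# Minimal sketch of the predictability measure P12 = E2 - E12,
# computed from a list of sampled (W1, W2) character pairs.
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (bits) of the distribution in a Counter."""
    total = sum(counter.values())
    return -sum((n / total) * math.log2(n / total) for n in counter.values())

def predictability(pairs):
    """Return (P12, E2): entropy of W2 minus conditional entropy of W2 given W1."""
    w2_all = Counter(w2 for _, w2 in pairs)
    by_w1 = defaultdict(Counter)
    for w1, w2 in pairs:
        by_w1[w1][w2] += 1
    e2 = entropy(w2_all)
    n = len(pairs)
    # E12: entropy of W2 given each W1, weighted by the frequency of W1.
    e12 = sum(sum(c.values()) / n * entropy(c) for c in by_w1.values())
    return e2 - e12, e2

# Invented, perfectly predictive pairs: W2 is determined by W1, so P12 == E2.
pairs = [("q", "o"), ("k", "e"), ("q", "o"), ("k", "e")]
p12, e2 = predictability(pairs)
print(round(p12 / e2, 3))  # fractional uncertainty reduction -> 1.0
```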

Thus, what the graph shows is that, in the middle of the line, knowing W1 reduces the uncertainty of the next four characters by about 33%, 22%, 15%, and 6%, respectively.  Whereas, across a line break, W1 gives no information about the next four characters.  It is as if the line break were a gap of at least 4 characters.

It is puzzling that the predictability does not drop to zero when d increases beyond 4, but hovers around 2-3%, both inside lines and across line breaks.  

It could mean that there are two kinds of parags, say some that are generally K-rich and some generally K-poor; so if you see a K, there is a slightly higher than average chance that any other char in the same parag is a K.  

Or it may be a bug in my code...

However, the predictability in the first case seems to be determined largely by a few character pairs.  Like, W1='q' is almost always followed by W2='o', and that contributes a lot to P12 when d=0.  If I do the same plot after deleting the class codes Q and O, the predictabilities in mid-line drop to nearly zero:

[attachment=15318]

In retrospect, that is quite understandable, since 'q' is practically non-existent near line end, and 'o' is rather rare.  Thus it is not so much that prophecy does not work across line breaks, but rather that the most successful "prophets" stay away from those places.

I also tried with W1 and W2 defined as digrams rather than individual characters.  At that point the sampling errors may be again too big.  But the same qualitative behavior was observed.

More later.

All the best,--stolfi
Quimqu,  this little experiment only reinforced my conviction that this line of enquiry is a big waste of time.

The question is not whether there are many significant anomalies in the text around line breaks.  That is a fact beyond dispute.  The question is what causes those anomalies -- and how we could determine those causes.

The LAAFU theory is actually two theories.  One theory says that the line break anomalies (LBAs) originate in the plaintext itself, where line breaks would be semantically significant.  Which would be the case, for example, if the plaintext was poetry or hymns, with each line being a verse.  Or if each parag was a formula with certain fixed fields, synchronized with lines:

  Flight: MCXXV  Airline: Baghdad Magic Carpets
  Departure: 20/jan/1340 2:30 pm From: PRI Prague International 
  Arrival: 21/jan/1340 07:02 To: SAM Samarkand North 
  Baggage: 1 trunk  Pets: No   Special meals: No
  Passenger: Nicole Oresme  Class: Plebeian

I think we can dismiss this version of LAAFU because the width of lines on many pages is highly variable and determined by accidents that the Author could not have foreseen, such as illustrations and vellum defects (like the "blotter" area on f112r and f112v).  And yet the text is justified normally within those irregular margins.  Which makes it almost certain that the line breaks were chosen by the Scribe, based mainly on space conditions -- with little or no influence from the contents, or from the line breaks in the Author's draft.  That is, the Scribe just applied some SLA to the unbroken stream of tokens of each parag.

The second version of LAAFU says that the LBAs arise because the encryption method was affected by the line breaks. Like, it could be an algorithm that took each line, whole and separately, as input, and acted on each character in a different way depending on the distance from the ends of the line.

This version is harder to dismiss.  In theory, a sufficiently complicated encryption method could explain all LBAs, including those we haven't detected yet.  

One problem, however, is that (for the reasons above) this theory implies that the encryption was done by the Scribe himself, as he was transcribing the Author's text.  I find it hard to imagine how that arrangement could have worked, exactly, considering that the encryption would depend on distance from the end of the line.  

Again, the alternative to both versions of the LAAFU theory is the BAAA theory ("Break Anomalies Are Accidental") which says that the LBAs are accidental and meaningless side effects of the line-breaking algorithm used by the scribe.  We have already identified two mechanisms that generate LBAs, namely the length bias of basic TLA (interacting with word-to-word correlations), and the apparent option of abbreviating iin by m.  Surely there may be others.  Will they be enough to explain all the LBAs?  

But, either way, I don't see any hope of finding the answers to those questions by computing statistics and staring at the results.  The effect of the q-o pairs on predictability, described in my previous post, is an example of how complicated the "accidental" causes can be.  

So I don't know what to do about this question.

All the best, --stolfi
Jorge,

I tried to replicate your previous experiment on the Herbal and Biological sections, with a similar setup (glyph classes, no spaces, filtering short lines).

The results are qualitatively the same: within lines there is short-range predictability (around 20–25% at d=1, decaying quickly), while across line breaks it drops to ~2–3%, essentially baseline. This suggests a very strong break effect, consistent across both sections.

Separately, I also looked at token-level effects:

– Line starts and line ends are both clearly special, with fairly symmetric signals. 
– However, the dependency between the end of one line and the start of the next is weak. 
– Around <->, the effect is weaker than at the ends and beginnings of lines, but it is asymmetric: the token before the split is more “end-like” than the token after it is “start-like”.

So, to summarize:

– there is local structure within lines 
– line breaks seem to disrupt that structure quite strongly 
– positional effects (start/end, and before <->) are real but not identical 

As you noted, part of the within-line predictability is driven by specific patterns like q→o, and their uneven distribution (e.g. near line starts) can explain some of the effects. Also, strong token families at line ends likely contribute to the boundary signals.

At this point I am not assuming any particular explanation. Maybe it is something old that has been discussed hundreds of times, but for me, discovering that there seems to be a break from line to line is new, and it is something that I will take into account if someday I try to test any theory of my own.

Anyway, thank you for your testing and thoughts!
@quimqu

Quote:– Around <->, the effect is weaker than at the ends and beginnings of lines, but it is asymmetric: the token before the split is more “end-like” than the token after it is “start-like”.

So indeed your data is finding word breaks at the end and start, but not balanced?  What would that suggest about a fabricator who designed this into his conlang?  Word breaks make a text not seem like gibberish.  Was his intention that it had to follow Zipf's law, and would that indicate the 20th century?  This is a low-entropy text!  Excellent post, quimqu  Exclamation