The Voynich Ninja

Full Version: About the construction of lines in the MS
In this [linked thread], when I was thinking of how to determine whether the lines and paragraphs have some structure, I got the idea of creating features for the gaps that form the line splits: that is, looking at the properties of the words just before the line break and just after it.

The idea was to avoid looking at the page as an image and instead work only from the transcription. I turned each paragraph into a sequence of word gaps. For every gap between two consecutive words, I asked whether it was a real line break or just an interior gap.

The model I created was deliberately simple. It did not use exact words as memorized items. It used short word-type signatures and short local segment features. On the left side of a candidate break it looked at the last word and the last two-word tail. On the right side it looked at the first word and the first two-word head. It also measured whether the candidate cut would leave a line length close to the typical line length of that section. So the score had four parts: end quality, start quality, fit to line length, and a smaller central transition term.
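For concreteness, a toy version of that four-part gap score could look like this (the function names, the Gaussian length-fit term, and the 0.5 weight on the transition term are my own illustrative choices, not the exact model):

```python
def gap_score(left_tail, right_head, line_len, typical_len,
              end_logp, start_logp, trans_logp, sigma=8.0):
    """Score a candidate line break between two word sequences.

    end_logp / start_logp / trans_logp are stand-ins for lookup
    tables of word-type signature log-probabilities learned from
    training paragraphs; line_len / typical_len are in characters."""
    end_q = end_logp(left_tail[-2:])       # quality of the line ending
    start_q = start_logp(right_head[:2])   # quality of the next-line opening
    fit_q = -((line_len - typical_len) ** 2) / (2 * sigma ** 2)  # length fit
    trans_q = trans_logp(left_tail[-1], right_head[0])           # transition
    return end_q + start_q + fit_q + 0.5 * trans_q  # smaller central term
```

With neutral probability terms, the score reduces to the length-fit penalty, so a cut near the typical line length outranks one far from it.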

In short, a "good line ending" means that the word just before the break, and the small tail that ends the line, look like the endings that real lines usually have in that section. A "good line opening" means that the word just after the break, and the short opening sequence after it, look like real line openings. This is not a claim about magical words; it is a claim about recurrent statistical shapes.

The main validation was done only on Herbal and Biological, because they are large enough and relatively coherent. I used paragraph-level cross-validation, so train and test never shared the same paragraph. I also repeated the whole split with several random seeds. Then I added a null control: within each training paragraph I randomly permuted which gaps were marked as boundaries, trained the same model on those fake labels, and tested on the real held-out paragraphs. If the signal were an artifact of the pipeline, the null should also score well. It did not.
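The two ingredients of that protocol, the paragraph-level split and the permutation null, can be sketched like this (a simplified illustration, not the actual code):

```python
import random

def permute_boundaries(gap_labels, rng):
    """Null control: keep each paragraph's number of line breaks,
    but shuffle which gaps carry them."""
    labels = list(gap_labels)
    rng.shuffle(labels)
    return labels

def paragraph_split(paragraphs, test_frac, rng):
    """Paragraph-level split: train and test never share a paragraph."""
    ids = list(range(len(paragraphs)))
    rng.shuffle(ids)
    cut = int(round(len(ids) * (1 - test_frac)))
    return ids[:cut], ids[cut:]
```

The null model is then trained on the permuted labels and evaluated on the untouched held-out paragraphs, exactly like the real model.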

These are the main validation results.

Section                    | Real AUC mean | Real AUC sd | Null AUC mean | Null AUC sd
Biological (balneological) | 0.891         | 0.057       | 0.459         | 0.119
Herbal                     | 0.905         | 0.036       | 0.461         | 0.131

Those values are the most important part of the finding. The real model stays very high across repeated train/test splits. The null model collapses close to chance. That does not prove everything, but it is strong evidence that the detection is not coming from a trivial leak or from learning the same paragraphs it later predicts.
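For readers less used to it, the AUC reported here has a simple ranking reading: the probability that a randomly chosen real boundary scores higher than a randomly chosen interior gap, ties counting half. A minimal sketch of that form (my own illustration):

```python
def auc(pos_scores, neg_scores):
    """Ranking form of ROC AUC: the probability that a random real
    boundary outscores a random interior gap (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

So 0.5 is chance level (which is where the null model lands) and 1.0 would be perfect separation.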

The next question was whether the score was really mixing several weak signals or whether the individual components also worked on their own. They did.

Section                    | End-quality AUC | Start-quality AUC | Fit-quality AUC
Biological (balneological) | 0.708           | 0.763             | 0.797
Herbal                     | 0.752           | 0.763             | 0.780

This means that real line breaks are not only associated with a better local transition. They are also associated with better line endings, better next-line openings, and better line-length fit. In other words, the break seems to happen where the text can be segmented into two pieces that both look well formed.

The direction of the effect is also consistent in both sections.

Section                    | Boundary end quality | Interior end quality | Boundary start quality | Interior start quality | Boundary fit quality | Interior fit quality
Biological (balneological) | -2.019               | -2.564               | -1.774                 | -2.508                 | -1.531               | -2.531
Herbal                     | -1.788               | -2.349               | -1.798                 | -2.444                 | -1.453               | -2.663

The scores are log-like and therefore negative, so the less negative values are the better ones. Real line breaks systematically look better than interior gaps on all three dimensions.

After that I tested whether the mechanism was section-specific or partly shared. First I did a cleaner transfer between the two large sections. Herbal was used to score Biological, and Biological was used to score Herbal. Then I trained on both of them together and only then applied the model descriptively to the other sections.

Training                   | Test section               | Boundary AUC | End AUC | Start AUC | Fit AUC
Herbal                     | Biological (balneological) | 0.804        | 0.667   | 0.690     | 0.753
Biological (balneological) | Herbal                     | 0.809        | 0.627   | 0.659     | 0.785

That is lower than within-section validation, as expected, but still strong. So the mechanism is not just a local quirk of one section.

When Herbal and Biological are pooled and used as a source model, the score still separates real boundaries from interior gaps in the other sections as well.

Training            | Test section        | Boundary AUC
Herbal + Biological | Marginal stars only | 0.958
Herbal + Biological | Text-only           | 0.859
Herbal + Biological | Pharmaceutical      | 0.782
Herbal + Biological | Zodiac              | 0.746
Herbal + Biological | Astronomical        | 0.739
Herbal + Biological | Cosmological        | 0.727

I would still call those transfer values descriptive rather than fully validated, because those sections were not cross-validated on themselves in the same way. But they are high enough to suggest that at least part of the lineation logic is shared across the manuscript.

The last part was to look at what actually defines a good ending and a good beginning. The model is not using exact words as rules. It is using recurrent word-type patterns. So the result is better read as families of endings and openings rather than individual tokens.

For Biological, line endings are enriched in short and medium forms such as ol..ly, ol..dy, or..ry, ol..ol, ok..ky, and some larger qo... tails. Line openings are enriched in families like so..dy, so..ey, so..or, so..ol, sa..ar, sa..in, ds..dy, dc..dy, tc..dy, again with some recurring qo... heads.

For Herbal, line endings are enriched in short and medium forms such as ..am, ..dy, da..an, da..am, ok..am, ot..am, ch..ry, and some repeated two-word tails involving da..in. Line openings are enriched in families like yc..or, yc..ol, yc..ey, so..in, so..or, so..ol, dc..ey, dc..dy, ds..dy, yt..dy, tc..in, and ta..ar.

A compact summary of the strongest recurrent families is this:

Section    | Good line endings tend to look like                                   | Good line openings tend to look like
Biological | ol..ly, ol..dy, or..ry, ol..ol, ok..ky, some qo... tails              | so..dy, so..ey, so..or, so..ol, sa..ar, sa..in, ds..dy, dc..dy, tc..dy
Herbal     | ..am, ..dy, da..an, da..am, ok..am, ot..am, ch..ry, some da..in tails | yc..or, yc..ol, yc..ey, so..in, so..or, so..ol, dc..ey, dc..dy, ds..dy, yt..dy, tc..in

So the main result is not that the manuscript has special boundary words that never appear elsewhere. It is that line breaks tend to occur where the preceding segment ends in a statistically good way and the following segment begins in a statistically good way, with the added constraint that the resulting line length also fits the section.

That is why the detection values are so high. The model is not reading the same paragraphs twice. It is learning which kinds of endings, beginnings and line-length fits are typical in training paragraphs, and then finding the same structure in held-out paragraphs. The null control shows that this does not survive random relabelling. So the effect looks real.

I think the safest formulation is this: in Herbal and Biological, line breaks are statistically structured rather than arbitrary. They can be detected very well out of sample because real boundaries have better line-ending quality, better next-line-opening quality, and better line-length fit than ordinary interior gaps. The same scoring scheme also transfers surprisingly well to several other sections, which suggests that at least part of the lineation mechanism may be shared across the Voynich manuscript.
(15-04-2026, 08:25 AM)quimqu Wrote: The null control shows that this does not survive random relabelling. So the effect looks real. I think the safest formulation is this: in Herbal and Biological, line breaks are statistically structured rather than arbitrary.

Good, but, again, that is an expected (and even already verified) consequence of the trivial line-breaking algorithm (provided that the margins are defined by mm or character count, not by word count). It does not insert line breaks at random, but in a way that strongly depends on the lengths of the words before and after the break. As a result, the first word after a line break tends to be longer than average, and the last 1-3 words before a line break tend to be shorter than average. And if the word length distributions are different, the same will almost certainly be true of any other statistic, character- or word-based.

So, before we can take your results above as evidence of LAAFU, you need to repeat the analysis on text that has been re-justified as discussed before.  Namely, discard the first line of each parag, join the remaining lines in a single token stream, and feed that to the trivial line-breaking algorithm, with maximum line length set to about 62% or 162% of the original average line length -- always counting characters, not words.   

The set of these re-justified parags should be the "null control".  I am quite sure that you will see on this  control text the same kind of anomalies that you see on the VMS.  The question is only whether they will be just as strong, or significantly weaker.

And even if the anomalies in this control text are weaker than on the original parags, that is still not yet evidence of LAAFU.  Because the algorithm used by the scribe is a bit more complicated than the trivial one, and the additional complications add to the line-break anomalies.  So we will have to try to simulate these complications too.

All the best, --stolfi
(15-04-2026, 12:33 PM)Jorge_Stolfi Wrote: So, before we can take your results above as evidence of LAAFU, you need to repeat the analysis on text that has been re-justified as discussed before.  Namely, discard the first line of each parag, join the remaining lines in a single token stream, and feed that to the trivial line-breaking algorithm, with maximum line length set to about 62% or 162% of the original average line length -- always counting characters, not words.

The set of these re-justified parags should be the "null control".  I am quite sure that you will see on this  control text the same kind of anomalies that you see on the VMS.  The question is only whether they will be just as strong, or significantly weaker.

And even if the anomalies in this control text are weaker than on the original parags, that is still not yet evidence of LAAFU.  Because the algorithm used by the scribe is a bit more complicated than the trivial one, and the additional complications add to the line-break anomalies.  So we will have to try to simulate these complications too.

Yes, that was a good point, so I tried exactly that control.

I took the same paragraphs, removed the original line breaks (keeping the first line aside as suggested), joined everything into a single token stream, and re-broke the lines with a simple greedy algorithm based only on character count. I tried three regimes: about 0.62×, 1.0× and 1.62× the original average line length.
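The greedy character-count re-wrapper is simple enough to show in full. This is a sketch under my own assumptions (one separating space between words, character counting only, as discussed):

```python
def rewrap(tokens, max_chars):
    """Greedy line breaker: append the next word if it still fits
    (counting characters plus one separating space), otherwise
    close the current line and start a new one with that word."""
    lines, cur, cur_len = [], [], 0
    for w in tokens:
        extra = len(w) + (1 if cur else 0)
        if cur and cur_len + extra > max_chars:
            lines.append(cur)
            cur, cur_len = [w], len(w)
        else:
            cur.append(w)
            cur_len += extra
    if cur:
        lines.append(cur)
    return lines
```

Running it with max_chars set to 0.62, 1.0 or 1.62 times the section's average line length (in characters) gives the three control regimes.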

Then I did a strict out-of-sample test: train only on real paragraphs, and test on held-out paragraphs, comparing real breaks vs re-justified breaks on the same text.

These are the results.

Section    | Dataset     | AUC total | End AUC | Start AUC
Biological | Real        | 0.875     | 0.743   | 0.749
Biological | Rejust 0.62 | 0.555     | 0.519   | 0.530
Biological | Rejust 1.00 | 0.631     | 0.515   | 0.555
Biological | Rejust 1.62 | 0.679     | 0.527   | 0.552
Herbal     | Real        | 0.885     | 0.712   | 0.756
Herbal     | Rejust 0.62 | 0.731     | 0.519   | 0.536
Herbal     | Rejust 1.00 | 0.742     | 0.526   | 0.554
Herbal     | Rejust 1.62 | 0.642     | 0.544   | 0.563

So yes, the trivial line breaker does create signal. The AUC is clearly above chance, especially in Herbal. So that part of your explanation is correct. But it does not explain everything. The real text is consistently much higher:

Section    | Control | Real − control AUC
Biological | 0.62×   | +0.32
Biological | 1.00×   | +0.24
Biological | 1.62×   | +0.20
Herbal     | 0.62×   | +0.15
Herbal     | 1.00×   | +0.14
Herbal     | 1.62×   | +0.24

More importantly, if you look at the actual separation between boundary and interior, the difference is big.

In the controls, boundary vs interior differences are small, around 0.03–0.1.
In the real text they are much larger, around 0.3–0.5.

So the trivial algorithm produces weak preferences, but the Voynich text shows much stronger ones.

Also, this is not only about line length. The biggest gap is in the end and start features. Those are based on local word patterns, not just length. The “fit” term behaves more like you would expect from line-breaking, and I would not overinterpret it.

It seems that the simple character-based line breaker does reproduce part of the effect, but only a fraction of it. When you keep the same text and only change where the lines are cut, the signal drops a lot. The real text still has a much stronger and more consistent boundary structure. So at least in my test, the effect is not just coming from trivial line-breaking.

Line breaks in the Voynich seem to follow two constraints at the same time. Lines tend to fill the available space, but they also tend to end with certain types of word patterns and start with other types of patterns. The model is picking up both things.

This is not typical of ordinary prose, where line breaks are mostly arbitrary. It suggests that the text was produced with local structural preferences, not just written and then wrapped to fit the page.

That does not tell us exactly how the text was generated. It could be stylistic, procedural, or something else. But it does suggest that line breaks are part of the structure, not just a formatting afterthought.
(15-04-2026, 03:12 PM)quimqu Wrote: Line breaks in the Voynich seem to follow two constraints at the same time. Lines tend to fill the available space, but they also tend to end with certain types of word patterns and start with other types of patterns. The model is picking up both things.

This is not typical of ordinary prose, where line breaks are mostly arbitrary. It suggests that the text was produced with local structural preferences, not just written and then wrapped to fit the page.

That does not tell us exactly how the text was generated. It could be stylistic, procedural, or something else. But it does suggest that line breaks are part of the structure, not just a formatting afterthought.

Perhaps it suggests that the meaningful content is in the central part of each line, with filler words added at the beginning and end.

After that control, I tried to look inside the signal instead of only comparing AUC.

Not all real breaks are equally well detected, so I split them into “good” (high score) and “weak” (low score), and compared them with interior gaps.

Section    | Group    | End  | Start | Total | Left entropy | Right entropy
Herbal     | good     | 1.91 | 1.27  | 3.18  | 1.15         | 1.95
Herbal     | weak     | 0.00 | 0.00  | 0.00  | 3.37         | 3.37
Herbal     | interior | 0.27 | 0.31  | 0.57  | 6.24         | 6.19
Biological | good     | 2.20 | 2.37  | 4.57  | 2.39         | 2.31
Biological | weak     | 0.00 | 0.00  | 0.00  | 3.09         | 3.03
Biological | interior | 0.84 | 0.87  | 1.71  | 5.38         | 5.38

Here the pattern is quite clear.

- Good breaks look very structured: high end and start scores, low entropy.
- Interior gaps look very unstructured: low scores, high entropy.
- Weak breaks are in between, but much closer to interior.

So it seems that not all line breaks behave the same. There is a subset that fits very well the “good ending + good beginning” pattern, and another subset that barely fits it.

Then I checked the same idea on the re-justified controls.

They do produce some signal, but it looks much flatter. There are fewer “good” breaks, and the entropy does not drop as much. The structure is weaker and less concentrated.

Also, looking at the actual patterns, in the real text a relatively small set of families explains a large part of the good breaks. In the controls those same families exist, but they are not aligned with the breaks in the same way.

So, it seems that the trivial line breaker explains part of the effect. It creates some bias, especially from word length. But it does not reproduce the strong concentration of “good endings + good beginnings” at the break points.

In the real text, line breaks seem to happen preferentially where both sides look locally well-formed, not just where the line happens to fill.

So it seems that two things are happening at the same time. Lines fill the available space, but they also tend to end and start with specific families of patterns.

That is not what you get if you only re-wrap the same text mechanically.
(15-04-2026, 07:07 PM)quimqu Wrote: So it seems that two things are happening at the same time. Lines fill the available space, but they also tend to end and start with specific families of patterns.

Thanks for doing these tests!  But indeed we should expect the VMS to have stronger line-end anomalies than those that are created by the simple line-breaking algorithm.  Because the real algorithm has two more complications.

First, the m character is probably an abbreviation for some other ending, that the Scribe could use when needed but also just when he felt like it.   I guess that the "some other ending" is iin.  That is, where the trivial algorithm said

  1. If the next word W fits in the remaining space, write W,
  Else break the line and write W.

his algorithm said

  1. If the next word W ends in iin, with X% probability do W = W/iin/m.

  2. If the next word W fits in the remaining space, write W.
  Else if W ends in 'iin', and Z = W/iin/m fits in that space, write Z. 
  Else break line and write W

where W/iin/m means W with iin replaced by m, and X% is the observed ratio of (freq of m-words) to (freq of m- plus iin-words) in the middle of the line. 

This extra step creates an excess of ms as the last word of the line.  I am guessing that this excess of m must be an important component of the line-end anomaly.

The second complication is best left for another post.  

For now, how can we account for the effect of those abbreviations in the control text?  After joining the body lines into a single token stream, I would try replacing all occurrences of m by iin, and then running the token stream through the line-breaking algorithm modified as above.

By the way: there are many pages where the lines are interrupted by plants.  An intrusion would trigger the line-breaking algorithm just as if it was the right rail.  That is, the line-end anomalies should be observed on the last few words before the intrusion, and the line-start anomalies should be observed on the first word after it.  I vaguely recall this having been tested and found to be true.  Is it?

All the best, --stolfi
Jorge, I get your idea, but I changed the way I tested it a bit. I understand your hypothesis as the scribe compressing words when approaching the end of the line, for example with alternations like iin → m.

What I first did was to look at the structure before the final position. Instead of focusing only on the last word, I measured a "line-final-likeness" score for each token depending on its distance to the end of the line. There is a clear progression. In several sections, especially Herbal and Biological, tokens become increasingly end-like as we approach the line ending. So the effect is not just at the last word, it builds up gradually. These end-like tokens tend to belong to specific families and shapes that are well known at line ends in EVA, for example patterns like -dy, -y, -l, -r, or short forms such as dy, dal, lo, which frequently appear in final position and have high end-like scores.
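The positional profile can be computed like this, given any token-level end-likeness score (here the scoring function is just a parameter; in my test it comes from the boundary model):

```python
from collections import defaultdict

def endlikeness_by_distance(lines, end_score):
    """Average a token-level line-final-likeness score by distance
    from the line end (distance 0 = the last token of each line)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        for dist, tok in enumerate(reversed(line)):
            sums[dist] += end_score(tok)
            counts[dist] += 1
    return {d: sums[d] / counts[d] for d in sums}
```

A gradual rise of the averages towards distance 0 is the "tokens become increasingly end-like" effect described above.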

So I thought: maybe the scribe is compressing words as he approaches the end of the line. So I tested the compression idea directly. If the scribe is compressing, we should see that these end-like tokens correspond more often to longer interior forms, for example via prefix matches or small edit distances. I compared real line-final tokens with matched interior controls and checked several criteria (longer prefix matches, Levenshtein neighbours, subsequences).
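The edit-distance side of that check can be sketched like this (a plain Levenshtein implementation plus a hypothetical `expandable_fraction` helper; the distance threshold is illustrative):

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def expandable_fraction(tokens, vocab, max_dist=2):
    """Fraction of tokens with a strictly longer vocabulary neighbour,
    either sharing the token as prefix or within max_dist edits."""
    def has_longer_form(t):
        return any(len(v) > len(t) and
                   (v.startswith(t) or levenshtein(t, v) <= max_dist)
                   for v in vocab)
    return sum(has_longer_form(t) for t in tokens) / len(tokens)
```

Under the compression hypothesis, line-final tokens should show a higher expandable fraction than matched interior controls; in my data they do not.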

The result is negative. Final tokens do not show an excess of longer expandable forms. If anything, they show slightly fewer such relations than the controls.

So the global effect does not seem to be driven by a general compression mechanism. It looks more like a positional preference, where certain variants or families, such as dy, dal, lo, or -dy endings, are increasingly favored as the line end approaches.
(16-04-2026, 09:09 AM)quimqu Wrote: These end-like tokens tend to belong to specific families and shapes that are well known at line ends in EVA, for example patterns like -dy, -y, -l, -r, or short forms such as dy, dal, lo, which frequently appear in final position and have high end-like scores.

Do only the last two EVA characters matter?

Stolfi Wrote:By the way: there are many pages where the lines are interrupted by plants.  An intrusion would trigger the line breaking algorithm just as if it was the right rail.  That is, the line-end anomalies should be observed on the last few words before the intrusion, and the line-start anomalies should be observed on the first word after it.  I vaguely recall this having been tested and found to be true.  Is it?

Good question.

Quimqu: Have you used the words before/after "<->" in your before/after linefeed data or not? If not, what happens if you add them?

[attachment=15126]
(15-04-2026, 07:07 PM)quimqu Wrote: So it seems that not all line breaks behave the same. There is a subset that fits very well the “good ending + good beginning” pattern, and another subset that barely fits it.

It could be that, unsurprisingly, some words fit in the available space without shortening. Is the average length of the "good" words smaller than the average length of the "barely fits" words?