In this thread, while thinking about how to determine whether the lines and paragraphs have some structure, I got the idea of creating features for the gaps that produce the line splits: look at the properties of the words just before the line split and just after.
The idea was to avoid looking at the page as an image and instead work only from the transcription. I turned each paragraph into a sequence of word gaps. For every gap between two consecutive words, I asked whether it was a real line break or just an interior gap.
The model I created was deliberately simple. It did not use exact words as memorized items. It used short word-type signatures and short local segment features. On the left side of a candidate break it looked at the last word and the last two-word tail. On the right side it looked at the first word and the first two-word head. It also measured whether the candidate cut would leave a line length close to the typical line length of that section. So the score had four parts: end quality, start quality, fit to line length, and a smaller central transition term.
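The four-part score can be sketched roughly as follows. This is a minimal sketch under my own assumptions: the `signature` function and the smoothing constant are guesses at a plausible implementation, and the smaller central transition term is omitted for brevity.

```python
import math

def signature(word):
    # Coarse word-type signature in the spirit of the families discussed
    # later (e.g. "olkeedy" -> "ol..dy"); the exact scheme is a guess.
    return word[:2] + ".." + word[-2:] if len(word) > 4 else word

def gap_score(left_words, right_words, end_model, start_model, typical_len):
    # End quality: how typical the last word before the cut is as a line ending.
    end_q = math.log(end_model.get(signature(left_words[-1]), 1e-6))
    # Start quality: how typical the first word after the cut is as a line opening.
    start_q = math.log(start_model.get(signature(right_words[0]), 1e-6))
    # Fit quality: penalise cuts that leave an unusual line length.
    fit_q = -abs(len(left_words) - typical_len) / typical_len
    return end_q + start_q + fit_q
```

A gap whose left tail ends in a typical ending signature and whose right head starts with a typical opening signature scores much higher than an interior gap, which is the whole idea.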
In short, a "good line ending" means that the word just before the break, and the small tail that ends the line, look like the endings that real lines usually have in that section. A "good line opening" means that the word just after the break, and the short opening sequence after it, look like real line openings. This is not a claim about magical words; it is a claim about recurrent statistical shapes.
The main validation was done only on Herbal and Biological, because they are large enough and relatively coherent. I used paragraph-level cross-validation, so train and test never shared the same paragraph, and I repeated the whole split with several random seeds. Then I added a null control: within each training paragraph I randomly permuted which gaps were marked as boundaries, trained the same model on those fake labels, and tested on the real held-out paragraphs. If the signal were an artifact of the pipeline, the null should also score well. It did not.
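A minimal sketch of that protocol, assuming gaps are stored flat with a paragraph id per gap (all names here are mine, not from the original pipeline):

```python
import random

def paragraph_cv_splits(paragraph_ids, n_folds, seed=0):
    # Paragraph-level cross-validation: all gaps from one paragraph go to
    # the same fold, so train and test never share a paragraph.
    rng = random.Random(seed)
    paras = sorted(set(paragraph_ids))
    rng.shuffle(paras)
    fold_of = {p: i % n_folds for i, p in enumerate(paras)}
    for k in range(n_folds):
        train = [i for i, p in enumerate(paragraph_ids) if fold_of[p] != k]
        test = [i for i, p in enumerate(paragraph_ids) if fold_of[p] == k]
        yield train, test

def null_labels(labels, paragraph_ids, seed=0):
    # Null control: within each paragraph, permute which gaps are marked
    # as boundaries, preserving the per-paragraph boundary count.
    rng = random.Random(seed)
    out = labels[:]
    for p in sorted(set(paragraph_ids)):
        idx = [i for i, q in enumerate(paragraph_ids) if q == p]
        vals = [labels[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    return out
```

The key design point is that the null keeps everything else identical (same features, same model, same per-paragraph boundary counts) and only destroys the *placement* of the boundaries.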
These are the main validation results.
| Section | Real AUC mean | Real AUC sd | Null AUC mean | Null AUC sd |
|---|---|---|---|---|
| Biological (balneological) | 0.891 | 0.057 | 0.459 | 0.119 |
| Herbal | 0.905 | 0.036 | 0.461 | 0.131 |
Those values are the most important part of the finding. The real model stays very high across repeated train/test splits. The null model collapses close to chance. That does not prove everything, but it is strong evidence that the detection is not coming from a trivial leak or from learning the same paragraphs it later predicts.
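For reference, the AUC used throughout has a simple rank interpretation: the probability that a randomly chosen real boundary scores higher than a randomly chosen interior gap, with ties counted as half. A minimal implementation of that pairwise form:

```python
def auc(boundary_scores, interior_scores):
    # Probability that a random real boundary outscores a random interior
    # gap, counting ties as half a win (equivalent to the ROC AUC).
    wins = sum((b > i) + 0.5 * (b == i)
               for b in boundary_scores for i in interior_scores)
    return wins / (len(boundary_scores) * len(interior_scores))
```

Under this reading, ~0.90 means a real boundary outscores an interior gap about nine times out of ten, while the null's ~0.46 is indistinguishable from coin-flipping.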
The next question was whether the score was really mixing several weak signals, or whether the individual components also worked on their own. They did.
| Section | End-quality AUC | Start-quality AUC | Fit-quality AUC |
|---|---|---|---|
| Biological (balneological) | 0.708 | 0.763 | 0.797 |
| Herbal | 0.752 | 0.763 | 0.780 |
This means that real line breaks are not only associated with a better local transition. They are also associated with better line endings, better next-line openings, and better line-length fit. In other words, the break seems to happen where the text can be segmented into two pieces that both look well formed.
The direction of the effect is also consistent in both sections.
| Section | Boundary end quality | Interior end quality | Boundary start quality | Interior start quality | Boundary fit quality | Interior fit quality |
|---|---|---|---|---|---|---|
| Biological (balneological) | -2.019 | -2.564 | -1.774 | -2.508 | -1.531 | -2.531 |
| Herbal | -1.788 | -2.349 | -1.798 | -2.444 | -1.453 | -2.663 |
The scores are log-like and therefore negative, so less negative values are better. Real line breaks systematically look better than interior gaps on all three dimensions.
After that I tested whether the mechanism was section-specific or partly shared. First I did a cleaner transfer between the two large sections. Herbal was used to score Biological, and Biological was used to score Herbal. Then I trained on both of them together and only then applied the model descriptively to the other sections.
| Training | Test section | Boundary AUC | End AUC | Start AUC | Fit AUC |
|---|---|---|---|---|---|
| Herbal | Biological (balneological) | 0.804 | 0.667 | 0.690 | 0.753 |
| Biological (balneological) | Herbal | 0.809 | 0.627 | 0.659 | 0.785 |
That is lower than within-section validation, as expected, but still strong. So the mechanism is not just a local quirk of one section.
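The transfer experiment itself reduces to "fit on one section, score another". A schematic version with pluggable fit/score functions (all names are illustrative, and the pairwise AUC helper is repeated here so the sketch stands alone):

```python
def pairwise_auc(pos, neg):
    # Probability that a random positive outscores a random negative.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def transfer_auc(train_gaps, train_labels, test_gaps, test_labels, fit, score):
    # Fit the boundary model on one section, then measure how well its
    # scores separate real boundaries from interior gaps in another.
    model = fit(train_gaps, train_labels)
    scores = [score(model, g) for g in test_gaps]
    pos = [s for s, y in zip(scores, test_labels) if y]
    neg = [s for s, y in zip(scores, test_labels) if not y]
    return pairwise_auc(pos, neg)
```

Because the model only ever sees word-type signatures, not exact tokens, nothing in this setup lets the test section leak into training.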
When Herbal and Biological are pooled and used as a source model, the score still separates real boundaries from interior gaps in the other sections as well.
| Training | Test section | Boundary AUC |
|---|---|---|
| Herbal + Biological | Marginal stars only | 0.958 |
| Herbal + Biological | Text-only | 0.859 |
| Herbal + Biological | Pharmaceutical | 0.782 |
| Herbal + Biological | Zodiac | 0.746 |
| Herbal + Biological | Astronomical | 0.739 |
| Herbal + Biological | Cosmological | 0.727 |
I would still call those transfer values descriptive rather than fully validated, because those sections were not cross-validated on themselves in the same way. But they are high enough to suggest that at least part of the lineation logic is shared across the manuscript.
The last part was to look at what actually defines a good ending and a good beginning. The model is not using exact words as rules. It is using recurrent word-type patterns. So the result is better read as families of endings and openings rather than individual tokens.
For Biological, line endings are enriched in short and medium forms such as ol..ly, ol..dy, or..ry, ol..ol, ok..ky, and some larger qo... tails. Line openings are enriched in families like so..dy, so..ey, so..or, so..ol, sa..ar, sa..in, ds..dy, dc..dy, tc..dy, again with some recurring qo... heads.
For Herbal, line endings are enriched in short and medium forms such as ..am, ..dy, da..an, da..am, ok..am, ot..am, ch..ry, and some repeated two-word tails involving da..in. Line openings are enriched in families like yc..or, yc..ol, yc..ey, so..in, so..or, so..ol, dc..ey, dc..dy, ds..dy, yt..dy, tc..in, and ta..ar.
A compact summary of the strongest recurrent families is this:
| Section | Good line endings tend to look like | Good line openings tend to look like |
|---|---|---|
| Biological | ol..ly, ol..dy, or..ry, ol..ol, ok..ky, some qo... tails | so..dy, so..ey, so..or, so..ol, sa..ar, sa..in, ds..dy, dc..dy, tc..dy |
| Herbal | ..am, ..dy, da..an, da..am, ok..am, ot..am, ch..ry, some da..in tails | yc..or, yc..ol, yc..ey, so..in, so..or, so..ol, dc..ey, dc..dy, ds..dy, yt..dy, tc..in |
So the main result is not that the manuscript has special boundary words that never appear elsewhere. It is that line breaks tend to occur where the preceding segment ends in a statistically good way and the following segment begins in a statistically good way, with the added constraint that the resulting line length also fits the section.
That is why the detection values are so high. The model is not reading the same paragraphs twice. It is learning which kinds of endings, beginnings and line-length fits are typical in training paragraphs, and then finding the same structure in held-out paragraphs. The null control shows that this does not survive random relabelling. So the effect looks real.
I think the safest formulation is this:
in Herbal and Biological, line breaks are statistically structured rather than arbitrary. They can be detected very well out of sample because real boundaries have better line-ending quality, better next-line-opening quality, and better line-length fit than ordinary interior gaps. The same scoring scheme also transfers surprisingly well to several other sections, which suggests that at least part of the lineation mechanism may be shared across the Voynich manuscript.