The Voynich Ninja

Full Version: The structure of the Voynich text and how it may be generated
Hello Jorge,

I tested both types of artificial line breaking.

First, I broke continuous natural texts by a fixed number of words. As expected, that removes the main wrapping artifact, because line position is no longer correlated with word length. Then I also tested breaking by a maximum number of characters, which is closer to what a real scribe would do.
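The two wrapping schemes can be sketched as follows. This is only an illustration under stated assumptions, not the code actually used for the test: the function names are hypothetical, tokens are plain strings, and a word space is counted as one character.

```python
def wrap_by_word_count(tokens, words_per_line):
    """Break a flat token list into lines of a fixed number of words."""
    if words_per_line < 1:
        raise ValueError("words_per_line must be at least 1")
    return [tokens[i:i + words_per_line]
            for i in range(0, len(tokens), words_per_line)]


def wrap_by_char_width(tokens, max_chars):
    """Greedy wrapping: start a new line when the next token would not fit.

    Assumption: one character is charged for each space between tokens.
    A token longer than max_chars is placed alone on its own line.
    """
    lines, current, used = [], [], 0
    for tok in tokens:
        needed = len(tok) if not current else used + 1 + len(tok)
        if current and needed > max_chars:
            lines.append(current)
            current, needed = [], len(tok)
        current.append(tok)
        used = needed
    if current:
        lines.append(current)
    return lines
```

With the character-based version, short words are more likely to fit at a line end and long words are more likely to be pushed to the next line start, which is the wrapping artifact that the fixed word count avoids.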

What I found is that the natural baseline is much less stable than I first thought. If one measures token-level left/right asymmetry on a continuous text and then redistributes the values by artificial line position, the shape of the curve changes a lot depending on the exact metric and on how the wrapping is done. Means are especially unstable, because a few token values can dominate them. Medians flatten the signal too much. So I do not think that the natural-text baseline is reliable enough, at least not in the simple form I first tried.

Because of that, I changed the experiment.

For natural text, I now compute the token-level statistic on the running text itself, without using the artificial lines in the calculation. The artificial lines are used only afterwards, to assign each token to a positional bin such as first, second, middle, penultimate, last. In other words, the line cut no longer changes the token’s measured value. It only changes the group where that value is displayed.
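A minimal sketch of that separation, assuming the per-token values have already been computed on the running text and are stored in a list aligned with the flat token sequence. The names `position_bin` and `bin_values` are hypothetical, and the way short lines are labelled (first and last take priority over second and penultimate) is my own assumption.

```python
def position_bin(index, line_length):
    """Label a token by its place in a line.

    Assumption: on short lines, "first" and "last" take priority over
    "second" and "penultimate".
    """
    if index == 0:
        return "first"
    if index == line_length - 1:
        return "last"
    if index == 1:
        return "second"
    if index == line_length - 2:
        return "penultimate"
    return "middle"


def bin_values(lines, values):
    """Group precomputed per-token values by positional bin.

    `values` is aligned with the flat token sequence, so the line breaks
    decide only the bin and never the value itself.
    """
    bins = {}
    flat_index = 0
    for line in lines:
        for i, _tok in enumerate(line):
            label = position_bin(i, len(line))
            bins.setdefault(label, []).append(values[flat_index])
            flat_index += 1
    return bins
```

Re-wrapping the same text with a different width changes only the `lines` argument; the `values` list stays untouched.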

For the Voynich, I do the analogous thing. I reconstruct each real paragraph as one flat token sequence, compute the token-level statistic on that flat paragraph, and only then project each token back to its original line position and line type.
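The flatten-then-project step might look like the sketch below. It assumes a paragraph is given as a list of lines, each a list of token strings, and it uses the 1/2/3 line-type coding (first, middle, last line of the paragraph) described for the second plot later in this thread. The function name, and treating a one-line paragraph as type 1, are assumptions.

```python
def flatten_paragraph(lines):
    """Flatten a paragraph and remember where each token came from.

    Returns (flat_tokens, origins), where origins[i] is a tuple
    (line_index, position_in_line, line_length, line_type) and
    line_type is 1 = first line, 2 = middle line, 3 = last line.
    Assumption: a one-line paragraph is labelled type 1.
    """
    flat, origins = [], []
    last = len(lines) - 1
    for line_index, line in enumerate(lines):
        if line_index == 0:
            line_type = 1
        elif line_index == last:
            line_type = 3
        else:
            line_type = 2
        for position, tok in enumerate(line):
            flat.append(tok)
            origins.append((line_index, position, len(line), line_type))
    return flat, origins
```

Any statistic computed on `flat_tokens` can then be looked up by index in `origins` to recover the real line position and line type of each token.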

This avoids the trivial boundary problem and lets me ask a different question: not whether line cuts mechanically create edge effects, but whether tokens behave differently when they happen to occur at different real positions in Voynich lines.

In the following plots, positive values mean that, in that position, the token depends more on the previous word than it usually does. Negative values mean it depends more on the next word than usual.
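To make the sign convention concrete, here is one possible stand-in for such a statistic. The exact metric is not spelled out in this post, so everything below is an assumption for illustration: the score is log P(word | previous word) minus log P(word | next word), estimated from raw bigram counts, and "than it usually does" is modelled by subtracting each token's own mean score. The function names and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter, defaultdict


def asymmetry_values(tokens):
    """Per-occurrence score: log P(w | previous) - log P(w | next).

    Assumed stand-in metric, estimated from raw bigram counts.
    Positive means the previous word predicts this occurrence better
    than the next word does. The first and last tokens get None
    because one neighbour is missing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    values = [None] * len(tokens)
    for i in range(1, len(tokens) - 1):
        prev_w, w, next_w = tokens[i - 1], tokens[i], tokens[i + 1]
        p_given_prev = bigrams[(prev_w, w)] / unigrams[prev_w]
        p_given_next = bigrams[(w, next_w)] / unigrams[next_w]
        values[i] = math.log(p_given_prev) - math.log(p_given_next)
    return values


def center_within_token(tokens, values, min_count=2):
    """Subtract each token's own mean score from its occurrences.

    Tokens with fewer than `min_count` scored occurrences get None,
    which plays the role of a frequency threshold.
    """
    by_token = defaultdict(list)
    for tok, v in zip(tokens, values):
        if v is not None:
            by_token[tok].append(v)
    centered = []
    for tok, v in zip(tokens, values):
        scores = by_token[tok]
        if v is None or len(scores) < min_count:
            centered.append(None)
        else:
            centered.append(v - sum(scores) / len(scores))
    return centered
```

Averaging the centered values per positional bin then gives a curve where zero means "this token behaves as it usually does in this position".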

[attachment=15092]

(I also have the plots divided by section, if anyone is interested.)

The interesting result is this. In the Voynich, even when I compare the same token across positions, there is still a systematic positional shift. The effect is strongest in the middle lines of paragraphs. Tokens in first and second position tend to shift in one direction, tokens in penultimate and last position tend to shift in the opposite direction (always inwards, toward the middle of the line from both ends, which I find curious at the least), and middle positions are close to neutral. Tightening the frequency thresholds did not remove that pattern.

So at the moment my view is this: the natural baseline is not a clean reference for absolute values, because it is too sensitive to metric choice and aggregation. However, the Voynich still shows a stable within-token positional effect once the trivial line-break artifact is removed from the calculation itself. That makes me think that the line is not just a passive formatting layer.

I agree with your point that trivial wrapping by character width should create positional distortions in any text. My problem is that, in practice, once one tries to measure this at token level, the natural baseline is much noisier and less informative than I expected. So I am now focusing less on absolute comparison with natural text, and more on whether the same Voynich tokens change behavior depending on their real line position and line type.

That is the stage I am at now.
Just one thought about this finding: it seems that in the inner lines of the paragraphs, the central “words” influence the first two and the last two positions. This is somewhat unexpected, and suggests that line position is not neutral, but interacts with the surrounding context in a non-trivial way.
(14-04-2026, 11:05 AM)quimqu Wrote: I tested both types of artificial line breaking.

Thanks!!

Quote: For natural text, I now compute the token-level statistic on the running text itself, without using the artificial lines in the calculation. The artificial lines are used only afterwards, to assign each token to a positional bin such as first, second, middle, penultimate, last. In other words, the line cut no longer changes the token’s measured value. It only changes the group where that value is displayed.

For the Voynich, I do the analogous thing. I reconstruct each real paragraph as one flat token sequence, compute the token-level statistic on that flat paragraph, and only then project each token back to its original line position and line type.

That makes more sense!  

But there is still one problem: we know that the head line and the first few words of each paragraph are special (the "PAAFU anomaly"). That is quite expected, assuming that the paragraphs are what they seem to be (descriptions and uses of the associated plant) and that the puffs are embellished versions of other glyphs that the Scribe was free to use on the head lines. Therefore, if you use all the lines of the re-justified paragraph, you will get anomalies due to the first line.

To fix that problem, I would discard the first line of the original paragraph, then re-justify the remaining tokens with a new page width (in characters, of course).
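A minimal sketch of that procedure, assuming a paragraph is a list of lines and each line is a list of token strings. The function name is hypothetical, the wrapping is a simple greedy fill, and counting a word space as one unit is an assumption. The `width` argument lets the line width be measured in a unit other than raw character count.

```python
def rejustify_body(paragraph_lines, max_width, width=len):
    """Drop the head line, then greedily re-wrap the remaining tokens.

    `width` maps a token to its size in whatever unit `max_width` uses.
    Assumption: each word space costs one unit.
    """
    tokens = [tok for line in paragraph_lines[1:] for tok in line]
    lines, current, used = [], [], 0
    for tok in tokens:
        w = width(tok)
        needed = w if not current else used + 1 + w
        if current and needed > max_width:
            lines.append(current)
            current, needed = [], w
        current.append(tok)
        used = needed
    if current:
        lines.append(current)
    return lines
```

A one-line paragraph yields an empty result, since its only line is the head line.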

By the way, with a variable-width "font" like Voynichese, it should make some difference whether the new page width is specified in mm (in which case t should count ~3 times as much as e), in glyphs (counting Ch as 1), or in EVA letters (counting Ch as 2). In English, for example, with a Times Roman font, the line-breaking effect on the frequency of "it" at line end should be at least twice as strong as the effect on "me".
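The three units could be compared with token-width functions like the ones sketched here. These are illustrations only: the function names are hypothetical, tokens are assumed to be lowercase EVA strings, only "ch" is treated as a two-letter glyph by default, and the relative widths are placeholders (only the rough 3:1 ratio of t to e comes from the paragraph above; every other letter defaults to 1.0, which is not a measurement).

```python
def eva_letter_width(token):
    """Width in EVA letters: every character counts as 1, so "ch" counts as 2."""
    return len(token)


def glyph_width(token, multi_letter_glyphs=("ch",)):
    """Width in glyphs: each listed multi-letter group counts as 1.

    Assumption: only "ch" is merged by default; extend the tuple as needed.
    """
    count, i = 0, 0
    while i < len(token):
        for group in multi_letter_glyphs:
            if token.startswith(group, i):
                i += len(group)
                break
        else:
            i += 1
        count += 1
    return count


def physical_width(token, relative_widths=None, default=1.0):
    """Approximate physical width as a sum of per-letter relative widths.

    Placeholder values: t = 3.0 and e = 1.0; all other letters use `default`.
    """
    if relative_widths is None:
        relative_widths = {"t": 3.0, "e": 1.0}
    return sum(relative_widths.get(ch, default) for ch in token)
```

Any of these can serve as the width measure of a greedy line-wrapping routine, so that the same text is re-justified under each unit and the resulting line-end frequencies compared.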

And then we can discuss other details of the Scribe's actual line breaking "algorithm"  that can create additional LAAFU-like effects, even if LAAFU is false...

Quote: I also have the plots divided by section, if anyone is interested

Again, mixing all sections will only confuse the issue.  Even if the LAAFU theory is correct, it may work differently in different sections.  I think that the statistics will be easier (or less difficult) to interpret if we consider only one homogeneous section first, like Herbal.  Then later we can look at other sections, each on its own.

For one thing, I expect that PAAFU will be much stronger in Herbal than in Bio. And it is definitely very strong in the Starred Parags, for the same reason as it is in Herbal.

All the best, --stolfi
(14-04-2026, 04:42 PM)Jorge_Stolfi Wrote: To fix that problem, I would discard the first line of the original paragraph, then re-justify the remaining tokens with a new page width (in characters, of course).

Jorge,

In the second plot, line type 1 is the first line of a paragraph, line type 2 is a middle line, and line type 3 is the last line. But OK, you suggest breaking up the original lines and creating fixed-width lines to see if the effect is still visible. I'll give it a try and let you know.

By the way: some final lines have a single word at their far right end. Is there any consensus about that use?
(14-04-2026, 04:54 PM)quimqu Wrote: some final lines have a single word at their far right end. Is there any consensus about that use?

In what section?  

I know of f1r. But page f1r should not be considered Herbal. Like f66r, f85r1, and other isolated pages without figures, it is best placed in an "unknown" section of its own.

I see a right-justified tail line on [link not shown] and f42v. Are there any others?

In the herbal section, I see centered tail lines on f9r, f18r, f22v, f24r, f27r, f31r, f40v, [link not shown] (2x). I don't know what those are. My best guess is just the Scribe trying to make the text look pretty. Since the previous line is always full, they are probably just tail lines. I don't think it will make much difference for the statistics if you include those lines or discard them.

There are four centered or right-justified lines in the Starred Parags section. Three seem to be section titles, and one seems to be a case of the Scribe skipping part of a line and then trying to insert it above the previous line.

All the best, --stolfi

Yes, that kind of finishing line. They are strange, at the least.

Thank you
(14-04-2026, 04:54 PM)quimqu Wrote: By the way: some final lines have a single word at their far right end. Is there any consensus about that use?

Each time someone mentions "consensus" I wonder how that would matter?

Anyway, these cases are clearly identified in all files using the IVTFF format.

I meant whether it is known in other manuscripts, or whether it was a standard form in Latin, for example...
(14-04-2026, 07:31 PM)quimqu Wrote: I meant whether it is known in other manuscripts

Not in Latin manuscripts that I've seen.

Right-aligned and center-aligned words have been called "signatures" and "titles" speculatively. Nobody knows what they are, really. The last time the question came up: [link not shown]
Well, after working further with the asymmetry of the words, I admit that it does not give much information. The asymmetry of the same word may depend on its position in the line and paragraph, and this gives results that are not of interest.

Nevertheless, I think I got something really interesting that I will post in a new thread, as it is a bit out of the initial goal of this thread and I think it has a good potential of making us think about the structure of the text.

I will entitle it "About the construction of lines in the MS"