14-04-2026, 11:05 AM
Hello Jorge,
I tested both types of artificial line breaking.
First, I broke continuous natural texts by a fixed number of words. As expected, that removes the main wrapping artifact, because line position is no longer correlated with word length. Then I also tested breaking by a maximum number of characters, which is closer to what a real scribe would do.
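For anyone who wants to reproduce this, the two wrapping schemes can be sketched roughly like this (function names and parameters are my own, not from any particular library):

```python
def wrap_by_words(tokens, words_per_line):
    """Break a flat token sequence into lines of a fixed word count."""
    return [tokens[i:i + words_per_line]
            for i in range(0, len(tokens), words_per_line)]

def wrap_by_chars(tokens, max_chars):
    """Greedy scribe-like wrapping: fill a line until the next word
    would exceed the character budget (counting separating spaces)."""
    lines, line, used = [], [], 0
    for tok in tokens:
        cost = len(tok) + (1 if line else 0)  # +1 for the space
        if line and used + cost > max_chars:
            lines.append(line)
            line, used = [tok], len(tok)
        else:
            line.append(tok)
            used += cost
    if line:
        lines.append(line)
    return lines

tokens = "the quick brown fox jumps over the lazy dog".split()
wrap_by_words(tokens, 3)   # three words per line
wrap_by_chars(tokens, 15)  # at most 15 characters per line
```

The character version is the one that reintroduces the correlation between word length and line position, since short words can squeeze into the end of a line where long words cannot.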
What I found is that the natural baseline is much less stable than I first thought. If one measures token-level left/right asymmetry on a continuous text and then redistributes the values by artificial line position, the shape of the curve changes a lot depending on the exact metric and on how the wrapping is done. Means are especially unstable, because a few token values can dominate them. Medians flatten the signal too much. So I do not think that the natural-text baseline is reliable enough, at least not in the simple form I first tried.
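To make "left/right asymmetry" concrete: one possible metric (my own illustrative choice here, not necessarily the exact one behind the plots) is to score each token occurrence by how strongly it is bound to its left neighbour versus its right neighbour, e.g. as a difference of pointwise mutual information:

```python
from collections import Counter
from math import log

def asymmetry_scores(tokens):
    """One token-level left/right asymmetry sketch: PMI with the
    previous word minus PMI with the next word. Positive = the token
    is more bound to what precedes it than to what follows it."""
    n = len(tokens)
    uni = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def pmi(a, b):
        pab = bigrams[(a, b)] / (n - 1)
        if pab == 0:
            return 0.0
        return log(pab / ((uni[a] / n) * (uni[b] / n)))

    scores = []
    for i, tok in enumerate(tokens):
        left = pmi(tokens[i - 1], tok) if i > 0 else 0.0
        right = pmi(tok, tokens[i + 1]) if i < n - 1 else 0.0
        scores.append(left - right)
    return scores
```

Whatever the exact metric, the instability problem is the same: PMI-style values are unbounded for rare pairs, so a handful of token occurrences can dominate a mean, which is exactly what I saw with the artificial-line baselines.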
Because of that, I changed the experiment.
For natural text, I now compute the token-level statistic on the running text itself, without using the artificial lines in the calculation. The artificial lines are used only afterwards, to assign each token to a positional bin such as first, second, middle, penultimate, last. In other words, the line cut no longer changes the token’s measured value. It only changes the group where that value is displayed.
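In code terms, the binning step looks something like this (a minimal sketch; the bin names are mine, and how very short lines are handled is a design choice, here the start-of-line bins take precedence):

```python
def position_bin(index_in_line, line_len):
    """Map a token's index within its line to a positional bin.
    For very short lines the bins overlap; this sketch resolves
    ties in favour of first/second."""
    if index_in_line == 0:
        return "first"
    if index_in_line == 1:
        return "second"
    if index_in_line == line_len - 1:
        return "last"
    if index_in_line == line_len - 2:
        return "penultimate"
    return "middle"

def bin_scores(lines, scores):
    """lines: list of token lists; scores: one value per flat token,
    in reading order. The scores were computed on the running text,
    so the line cut only decides where each value is grouped."""
    bins = {}
    it = iter(scores)
    for line in lines:
        for j, _ in enumerate(line):
            bins.setdefault(position_bin(j, len(line)), []).append(next(it))
    return {name: sum(vals) / len(vals) for name, vals in bins.items()}
```

The key point is that `bin_scores` never recomputes anything: the same token occurrence carries the same value no matter where the wrap falls.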
For the Voynich, I do the analogous thing. I reconstruct each real paragraph as one flat token sequence, compute the token-level statistic on that flat paragraph, and only then project each token back to its original line position and line type.
This avoids the trivial boundary problem and lets me ask a different question: not whether line cuts mechanically create edge effects, but whether tokens behave differently when they happen to occur at different real positions in Voynich lines.
In the following plots, positive values mean that, in that position, the token depends more on the previous word than it usually does. Negative values mean it depends more on the next word than usual.
[attachment=15092]
(I also have the plots divided by section, in case anyone finds them interesting.)
The interesting result is this. In the Voynich, even when I compare the same token across positions, there is still a systematic positional shift. The effect is strongest in the middle lines of paragraphs. Tokens in first and second position tend to shift in one direction, tokens in penultimate and last position tend to shift in the opposite direction (always inward from both ends of the line, which I find at least curious), and middle positions are close to neutral. Tightening the frequency thresholds did not remove that pattern.
So at the moment my view is this: the natural baseline is not a clean reference for absolute values, because it is too sensitive to metric choice and aggregation. However, the Voynich still shows a stable within-token positional effect once the trivial line-break artifact is removed from the calculation itself. That makes me think that the line is not just a passive formatting layer.
I agree with your point that trivial wrapping by character width should create positional distortions in any text. My problem is that, in practice, once one tries to measure this at token level, the natural baseline is much noisier and less informative than I expected. So I am now focusing less on absolute comparison with natural text, and more on whether the same Voynich tokens change behavior depending on their real line position and line type.
That is the stage I am at now.