The Voynich Ninja

Full Version: About the construction of lines in the MS
(16-04-2026, 04:56 AM)Jorge_Stolfi Wrote:
Quote:So it seems that two things are happening at the same time. Lines fill the available space, but they also tend to end and start with specific families of patterns.
Thanks for doing these tests!  But indeed we should expect the VMS to have stronger line-end anomalies than those that are created by the simple line-breaking algorithm.  Because the real algorithm has two more complications.

First, the m character is probably an abbreviation for some other ending, that the Scribe could use when needed but also just when he felt like it.   I guess that the "some other ending" is iin.  That is, where the trivial algorithm said

  1. If the next word W fits in the remaining space, write W;
  2. else break the line and then write W.
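The two steps above amount to a greedy first-fit breaker; a minimal sketch (the token list and line width in the example are illustrative, not from the MS):

```python
def break_lines(tokens, width):
    """Greedy line-breaking: put each word on the current line if it
    fits (counting one separating space), else start a new line."""
    lines, current = [], []
    used = 0  # characters used on the current line
    for w in tokens:
        need = len(w) if not current else len(w) + 1  # +1 for the space
        if used + need <= width:
            current.append(w)
            used += need
        else:
            lines.append(current)
            current, used = [w], len(w)
    if current:
        lines.append(current)
    return lines

result = break_lines("daiin chedy qokeedy ol shedy".split(), 12)
# → [['daiin', 'chedy'], ['qokeedy', 'ol'], ['shedy']]
```

Note that this breaker never looks ahead: the decision for each word depends only on the space remaining at that moment.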
I don't think that the remaining space had, in general, any influence on the pre-creation of the last word.
The right margin is quite irregular on many pages; there is space to write larger words that would make the margin more regular, but this doesn't happen.
On pages with more regular right margins, like 81v, it is possible that the last 2 or 3 words in a line were originally separate but were agglutinated (spaces deleted) by the scribe for lack of space.
(16-04-2026, 09:09 AM)quimqu Wrote: There is a clear progression. In several sections, especially Herbal and Biological, tokens become increasingly end-like as we approach the line ending. So the effect is not just at the last word, it builds up gradually.

Yes, and that is expected even from the trivial line-breaking algorithm (TLA).  It creates anomalies on the first token of the line, and on several tokens at the end of the line.

As an extreme example, suppose that, in the original text, 90% of the tokens have two letters, and 10% have 20 letters, alternating at random.  The TLA will enhance the probability of the first token of the line being long, and enhance the probability of the last 7 tokens of the line being short (because up to 7 short words would fit in the space where a long word would not).

The TLA may also create anomalies on the second token of the line, if there is a correlation between the lengths of successive tokens.  In the example above, if 10% of the tokens are long, but after a long token there is a 50% chance that the next token will be long too, then the TLA will also enhance the probability of the second token of the line being long.

Quote:These end-like tokens tend to belong to specific families and shapes that are well known at line ends in EVA, for example patterns like -dy, -y, -l, -r, or short forms such as dy, dal, lo, which frequently appear in final position and have high end-like scores.

Words like al, dy, dal are shorter than average, so they should be enhanced  at or near line end by the TLA alone.   But presumably iin -> m is not the only abbreviation that the Scribe was allowed to use.   (But I don't see any such in the Starred section.)

Quote:So I thought: maybe the scribe is kind of compressing words when approaching the end of the line.

That is expected, and I think we can see that just by looking at the page scans.  Word spaces seem to become narrower near the end of the line.   But it is more complicated than that.

A good professional scribe would strive to (1) end every line except the tail precisely on the right rail, and (2) avoid bad line breaks.  As a minimum, a bad line break is one that creates a very short tail line, with only one or two words.

To achieve these goals, the scribe would (consciously or unconsciously), while getting close to the end of the line, look ahead and try to predict where the next line break could be; and then compress or expand the spacing of the words and glyphs, and possibly the glyphs themselves, to try to achieve goals (1) and (2).  (Good typesetting software like TeX does this too, but uses dynamic programming to find all the optimum line breaks at once for the whole parag.)
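A toy version of that look-ahead idea, in the TeX spirit but much simplified (this is not the actual Knuth-Plass algorithm: it just minimizes the sum of squared leftover space over all feasible break points by dynamic programming, with the last line free):

```python
def dp_break(tokens, width):
    """Choose line breaks minimizing total squared slack (leftover
    space) over all lines at once, TeX-style; the last line costs
    nothing. Assumes every token fits on a line by itself."""
    n = len(tokens)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = minimal cost to set tokens[:i]
    back = [0] * (n + 1)     # back[i] = start of the line ending at i
    best[0] = 0
    for j in range(1, n + 1):
        length = -1          # running length of the line tokens[i:j]
        for i in range(j - 1, -1, -1):
            length += len(tokens[i]) + 1
            if length > width:
                break        # line too long; shorter i won't help
            slack = width - length
            cost = best[i] + (0 if j == n else slack * slack)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # walk the back-pointers to recover the chosen lines
    lines, j = [], n
    while j > 0:
        i = back[j]
        lines.append(tokens[i:j])
        j = i
    return lines[::-1]

lines = dp_break("aaa bb cc dd ee".split(), 7)
# → [['aaa', 'bb'], ['cc', 'dd'], ['ee']]
```

Unlike the greedy breaker, this one may break a line early so that a later line does not come out badly short, which is exactly the "look ahead and predict the next break" behavior described above.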

Quote:So I tested the compression idea directly. If the scribe is compressing, we should see that these end-like tokens correspond more often to longer interior forms, for example via prefix matches or small edit distance. I compared real line-final tokens with matched interior controls and checked several criteria (longer prefix matches, Levenshtein neighbors, subsequences).

The result is negative. Final tokens do not show an excess of longer expandable forms. If anything, they show slightly fewer such relations than the controls.

I don't know what to make of this result.  I looked in the Starred section at the word distribution for the last token of each line, and of the fourth-last one.  Some of the most significant anomalies I can see in the last token:
  • Much higher frequency of am (and other words ending in am, like otam, qokam, dam), aiinal, ary, dal, aral, ldy, oly, dy, y, etc.
  • Much lower frequency of otedy and other words of 4 or more letters, l, r, s, ar (and other words ending in ar like dar, lkar), or (ditto), os (ditto), air, sho, qol, etc.
Some of these differences, like the words ending in am, are probably due to use of abbreviations.  Maybe l sometimes works as an abbreviation like m.

Other anomalies, like the suppression of isolated r and s, may be the result of compression.  If you have ever tried transcribing from the page images, you must have noticed that there is often extra space after an r or s glyph, which is hard to decide whether it is a word space or not.  It seems plausible that those spaces are indeed not word spaces, and thus get compressed to nothing near the end of the line, when space is running short.

Conversely, the excess of dy and y at the end of the line may be the result of the Scribe expanding the text in that region, and writing those common suffixes as separate words (like a Spanish scribe, back when orthography was just a suggestion, might stretch "darme" into "dar me" to meet the rail).  Or maybe they are suffixes that follow r or s, and those ambiguous spaces after those glyphs get stretched to the point that they get transcribed as word gaps.

Anyway, I still think that the line-position anomalies that are alleged as evidence for LAAFU could be explained as mere artifacts of the Scribe's line breaking algorithm.   Thus I still don't see those anomalies as evidence that line breaks are semantically significant, or that they trigger a reset of the hypothetical "encryption algorithm". 

All the best, --stolfi
(16-04-2026, 01:49 PM)Jorge_Stolfi Wrote: The TLA algorithm may create anomalies also on the second token of the line, if there is correlation between the lengths of successive tokens.

Jorge, I think your explanation via a line-breaking mechanism (TLA + spacing adjustments) makes sense for part of the effect, especially the general bias toward shorter tokens near the end of the line. My gradient result is indeed consistent with that kind of mechanism.

However, there are two points where I think the data go beyond a pure TLA explanation.

1. The signal is not only about length. The gradient is computed on a family-based “end-likeness” score trained out-of-sample, and what increases toward the end is not just shortness, but membership in specific token families. A pure TLA would primarily act on length distributions, but here we see structured preferences at the level of variants.

2. I tested the compression hypothesis directly. If line-end effects were mainly driven by local compression or abbreviation, we should see that final tokens correspond more often to longer interior forms. But we do not. In fact, finals show slightly fewer such relations than matched interior controls. So the effect is not behaving like a systematic shortening process.
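For reference, a sketch of the kind of check described here (the function names, the prefix criterion, and the edit-distance threshold are illustrative assumptions, not the actual test code):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def has_longer_expansion(token, lexicon, max_dist=1):
    """True if some strictly longer interior form extends `token` as a
    prefix, or sits within `max_dist` edits of it -- i.e. `token` could
    plausibly be a compressed/abbreviated version of a longer word."""
    return any(len(w) > len(token) and
               (w.startswith(token) or edit_distance(token, w) <= max_dist)
               for w in lexicon)
```

Running a predicate like this over line-final tokens versus matched interior controls, and comparing the two rates, is the comparison whose result came out negative.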

Your point about spacing, ambiguous boundaries (e.g. after r or s), and local expansion or contraction is very plausible, and could explain part of the anomalies. But the results suggest that line-end behavior is not reducible to spacing and line breaking alone. There seems to be a positional preference for certain families or variants that builds up towards the end of the line.
(16-04-2026, 01:19 PM)nablator Wrote:
(15-04-2026, 07:07 PM)quimqu Wrote: So it seems that not all line breaks behave the same. There is a subset that fits very well the “good ending + good beginning” pattern, and another subset that barely fits it.

It could be that, unsurprisingly, some words fit in the available space without shortening. Is the average length of the "good" words smaller than the average length of the "barely fits" words?

That’s a good point, and I already checked it before.

If this was mainly a fitting problem, you’d expect the “good” endings to be clearly shorter than the ones that barely fit. But that’s not what we see. The average length is basically the same. Finals are around ~4.29 characters, and the matched interior controls are also ~4.29. Even if we focus only on strongly end-like tokens, it’s something like ~3.8 vs ~3.9, so still very close.

At the same time, the difference in end-likeness is huge. Finals are around +0.95, while controls are around −0.78. So length is almost identical, but the behavior is completely different.

That makes it hard to explain this just as a packing or fitting effect. Short words alone don’t explain why certain tokens are so strongly preferred at line ends. It looks more like something acting at the level of specific variants or families, not just length. Remember that the model learns the families from a training set and tests them on lines it has not seen, for the Herbal and Biological sections. But when we apply the model to the rest of the sections (also unseen), the results are also pretty good.

(16-04-2026, 12:42 PM)nablator Wrote: Have you used the words before/after "<->" in your before/after linefeed data or not? If not, what happens if you add them?

I will try to answer this question this evening or tomorrow.

(15-04-2026, 05:34 PM)Fontanellean Wrote: Perhaps it suggests that the meaningful content is in the central part of each line, with filler words added at the beginning and end.

No idea. I think this finding can help in checking new hypotheses.
(16-04-2026, 12:42 PM)nablator Wrote: Do only the last two EVA characters matter?

No, it’s not just the last two characters.

Endings like -dy, -y, -l, -r do matter, but they are not enough to explain the effect.

If it were only about the last two EVA characters, then all tokens sharing the same ending should behave similarly. But that’s not what we see. The patterns are much more structured and depend on broader families, not just suffixes.

For example, in Biological, line endings tend to look like ol..ly, ol..dy, or..ry, ol..ol, ok..ky, with some qo... tails. Line openings, on the other hand, prefer quite different families like so..dy, so..ey, so..or, so..ol, sa..ar, sa..in, ds..dy, dc..dy, tc..dy.

In Herbal, line endings are enriched in patterns like ..am, ..dy, da..an, da..am, ok..am, ot..am, ch..ry, while openings again show a different structure, like yc..or, yc..ol, yc..ey, so..in, so..or, so..ol, dc..ey, dc..dy, ds..dy, yt..dy, tc..in, ta..ar.

So even when two tokens share the same final characters (like -dy), their behavior depends strongly on the rest of the form and the family they belong to.

Also, the gradient result argues against a purely suffix-based explanation. The "end-likeness" signal increases progressively over several tokens before the line end, not just at the last word.
(15-04-2026, 12:33 PM)Jorge_Stolfi Wrote:
(15-04-2026, 08:25 AM)quimqu Wrote: The null control shows that this does not survive random relabelling. So the effect looks real.  I think the safest formulation is this: in Herbal and Biological, line breaks are statistically structured rather than arbitrary.

Good, but, again, that is an expected (and even already verified) consequence of the trivial line-breaking algorithm (provided that the margins are defined by mm or character count, not by word count).  It does not insert line breaks at random, but in a way that strongly depends on the lengths of the words before and after the break.   As a result, the first word after a line break tends to be longer than average, and the last 1-3 words before a line break tend to be shorter than average.  And if the word length distributions are different, the same will almost certainly be true of any other statistic, character- or word-based.

So, before we can take your results above as evidence of LAAFU, you need to repeat the analysis on text that has been re-justified as discussed before.  Namely, discard the first line of each parag, join the remaining lines in a single token stream, and feed that to the trivial line-breaking algorithm, with maximum line length set to about 62% or 162% of the original average line length -- always counting characters, not words.   

The set of these re-justified parags should be the "null control".  I am quite sure that you will see on this  control text the same kind of anomalies that you see on the VMS.  The question is only whether they will be just as strong, or significantly weaker.
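A sketch of that re-justification procedure (the paragraph representation and the greedy re-breaker are assumptions about the intended setup; `scale` would be set to 0.62 or 1.62):

```python
def rejustify(paragraphs, scale=0.62):
    """Build a null-control text: drop the first line of each paragraph,
    pool the remaining tokens into one stream, and re-break it greedily
    at `scale` times the original average line length in characters.
    `paragraphs` is a list of paragraphs, each a list of lines, each a
    list of word tokens."""
    kept = [line for para in paragraphs for line in para[1:]]
    avg_len = sum(len(" ".join(l)) for l in kept) / len(kept)
    width = scale * avg_len
    tokens = [w for line in kept for w in line]
    lines, current, used = [], [], 0
    for w in tokens:
        need = len(w) + (1 if current else 0)
        if used + need <= width:
            current.append(w); used += need
        else:
            lines.append(current); current, used = [w], len(w)
    if current:
        lines.append(current)
    return lines
```

The line-position statistics would then be recomputed on the returned lines and compared against the same statistics on the original VMS lines.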

And even if the anomalies in this control text are weaker than on the original parags, that is still not yet evidence of LAAFU.  Because the algorithm used by the scribe is a bit more complicated than the trivial one, and the additional complications add to the line-break anomalies.  So we will have to try to simulate these complications too.

All the best, --stolfi

Given the chance of a constructed language where meaning is not very important, stuffer vords may be used so that the margins are full.  The prefixes and suffixes wouldn't matter; they would apply just the same for stuffer vords used to fill the margins.
@Jorge, @nablator, I ran the tests with the <-> spacing (image intrusions within lines).

The internal split points marked with <-> do show part of the same effect seen at true line endings, but more weakly. I trained the model only on normal paragraph lines from Herbal and Biological, and then tested the tokens immediately before and after <->. If those split points behave like true visual interruptions, the token before <-> should look more like a line ending, and the token after <-> should look more like a line beginning.

That is exactly what happens on the left side of the split. In Biological, the token before <-> has a mean end-like score of +0.42, compared with -0.89 for ordinary interior tokens, although still below true line endings at +0.84. In Herbal, the pattern is very similar: +0.45 before <->, versus -0.60 for interior tokens, again below true line endings at +0.82.

The right side of the split shows a weaker version of the same pattern. In Biological, the token after <-> has a mean start-like score of +0.07, compared with -1.01 for interior tokens, but still far below true line starts at +1.02. In Herbal, it is -0.10 after <->, versus -0.63 for interiors, again much weaker than true line starts at +1.13.

So the best summary is that <-> does not behave exactly like a real line break, but it is not neutral either. The token before the split is clearly more line-end-like than a normal interior token, and the token after the split is slightly more line-start-like than a normal interior token. The effect is therefore real, but partial, and stronger on the left side of the split than on the right side.
(16-04-2026, 01:34 PM)Juan_Sali Wrote: I don't think that the remaining space had in general any influence on the pre-creation of the last word.
The right margin is quite irregular on many pages; there is space to write larger words that would make the margin more regular, but this doesn't happen.
On pages with more regular right margins, like 81v, it is possible that the last 2 or 3 words in a line were originally separate but were agglutinated (spaces deleted) by the scribe for lack of space.

I agree with you. I don't see why the scribe would start compacting the last set of words instead of leaving all words as they are and moving the last word to the next line. Compacting only the last word is understandable in some lines (I repeat, in some lines, not in all lines as in the MS), but compacting 3 or 4 ending words... why?
(16-04-2026, 10:43 PM)quimqu Wrote: The internal split points marked with <-> do show part of the same effect seen at true line endings, but more weakly. ...That is exactly what happens on the left side of the split. The right side of the split shows a weaker version of the same pattern.

In retrospect, that makes sense.   At the right rail, the Scribe must have tried harder to avoid a large gap, and also to avoid creating a "widow" -- a parag tail line with only one or two words.  At a figure intrusion, those defects would have been less objectionable.  Thus, instead of abbreviating aiin as am before an intrusion, he might just leave a blank gap and write the aiin after the intrusion.  And he would not care if the line ended with a single word after the intrusion.

By the way, this is a quick statistic that I computed on the Starred Parags section.  I looked at the word frequencies of the fourth token from the end of each line (second column) and on the last token of each line (third column).  The last signed number is the log of the ratio between the two frequencies, slightly cooked to avoid infinities.  Thus a positive value means a word that occurs more often in the -4 position, and a negative value is the reverse.  This clip is the 19 words with most extreme difference in each direction. 
Code:
am         |   .    | 0.0375 | +12.835
otam       |   .    | 0.0100 | +11.513
ary        |   .    | 0.0088 | +11.379
dal        |   .    | 0.0088 | +11.379
dam        |   .    | 0.0088 | +11.379
ram        |   .    | 0.0088 | +11.379
okam       |   .    | 0.0063 | +11.043
qotam      |   .    | 0.0063 | +11.043
aral       |   .    | 0.0050 | +10.820
chedam     |   .    | 0.0050 | +10.820
kam        |   .    | 0.0050 | +10.820
ldy        |   .    | 0.0050 | +10.820
oly        |   .    | 0.0050 | +10.820
om         |   .    | 0.0050 | +10.820
aiinal     |   .    | 0.0037 | +10.532
lkam       |   .    | 0.0037 | +10.532
olam       |   .    | 0.0037 | +10.532
opam       |   .    | 0.0037 | +10.532
raram      |   .    | 0.0037 | +10.532
...
skaiin     | 0.0026 |   .    | -10.166
tchey      | 0.0026 |   .    | -10.166
aiiin      | 0.0039 |   .    | -10.574
aiir       | 0.0039 |   .    | -10.574
char       | 0.0039 |   .    | -10.574
chody      | 0.0039 |   .    | -10.574
kaiin      | 0.0039 |   .    | -10.574
keey       | 0.0039 |   .    | -10.574
pchedy     | 0.0039 |   .    | -10.574
qokchedy   | 0.0039 |   .    | -10.574
taiin      | 0.0039 |   .    | -10.574
kchedy     | 0.0052 |   .    | -10.861
l          | 0.0052 |   .    | -10.861
otar       | 0.0052 |   .    | -10.861
chedar     | 0.0065 |   .    | -11.084
cheedy     | 0.0065 |   .    | -11.084
chor       | 0.0065 |   .    | -11.084
cheo       | 0.0091 |   .    | -11.420
otedy      | 0.0117 |   .    | -11.672

I think that these numbers show clearly that the ending -iin often gets abbreviated to -m, in all sorts of words.  The full table shows other abbreviation/splitting/joining anomalies.
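For anyone who wants to reproduce a statistic of this kind: a smoothed log frequency ratio avoids the infinities from zero counts. The additive smoothing constant here is a guess, not necessarily the "cooking" used for the table above:

```python
import math

def log_ratio(count_a, total_a, count_b, total_b, eps=0.5):
    """Smoothed log frequency ratio between two samples: positive when
    the word is relatively more frequent in sample A, negative when
    more frequent in sample B. `eps` (additive smoothing) keeps the
    ratio finite when a count is zero."""
    fa = (count_a + eps) / (total_a + eps)
    fb = (count_b + eps) / (total_b + eps)
    return math.log(fa / fb)

# e.g. a word seen 30 times among 800 last tokens and never among
# 800 fourth-last tokens still gets a finite (negative) score
print(log_ratio(0, 800, 30, 800))
```

With sample A as the fourth-last position and sample B as the last position, this matches the sign convention of the table: positive for words favored at position -4, negative for words favored at line end.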

(Did I complain already about this wonderful forum software, which cannot handle spaces correctly, not even in code inserts?)
(16-04-2026, 05:49 PM)quimqu Wrote: That makes it hard to explain this just as a packing or fitting effect. Short words alone don’t explain why certain tokens are so strongly preferred at line ends. It looks more like something acting at the level of specific variants or families, not just length.

But word length (per transliteration data) could be wrong, especially near the right margin. I can't prove it, but I believe that many spaces were skipped (which also happened in Latin manuscripts), and probably more than suggested by Massimiliano Zattera, if one needs to be able to re-space Voynichese completely and unambiguously: the "real" (in my opinion) structure of Voynichese words does not allow daram; it must actually be interpreted as dar am.