The Voynich Ninja

Full Version: About the construction of lines in the MS
First, I must apologize to @quimqu and others here.  I was wrong. 

Or at least half-wrong. I have been claiming that TLA makes the first word of the line longer than average, and the last 1-3 words of the line shorter than average. The first part is true, but the second part is false. I have run some simulations, and to my surprise TLA (or SLA) has practically no effect on the length of words near the end of the line:

[attachment=15225]

The top plot shows the average length of the Nth word, counting from the line start. The bottom plot is the same, but counting from the line end.

In this simulation the words are 50% "long" (7 letters) and 50% "short" (2 letters). The max line length is set at 60 characters (including blanks between words). The SLA algorithm will abbreviate a final "iin" to "m" if that delays the line break. Here is a sample of SLA-justified text:

ytchedy dy ar dy qokaiin ol qokaiin qokaiin qokaiin qokaiin
dy shochdy qokaiin shochdy ytchedy dy shochdy ytchedy ol ol
qokaiin ytchedy shochdy qokaiin shochdy qokaiin ytchedy
shochdy shochdy qokaiin ol ytchedy ol shochdy ytchedy dy
shochdy ar ar dy dy ol ol ytchedy dy ar dy ol shochdy qokam
dy shochdy ar ytchedy ol dy qokaiin ol ol dy qokaiin shochdy
ytchedy ol ol ar ytchedy ytchedy qokaiin dy dy ar qokaiin
qokaiin dy shochdy dy shochdy ol ol shochdy shochdy shochdy
qokaiin ar shochdy ol ytchedy ytchedy ytchedy dy ytchedy
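
For concreteness, here is a minimal Python sketch of this SLA line-filler (not the actual script behind these plots; the token inventory and names are merely illustrative, and TLA is the same procedure without the "iin" clause):

import random

LONG  = ["ytchedy", "shochdy", "qokaiin"]   # "long" tokens, 7 letters each
SHORT = ["dy", "ar", "ol"]                  # "short" tokens, 2 letters each
MAXLEN = 60   # max line length in characters, including blanks

def next_token():
    # each token is chosen independently: 50% long, 50% short
    return random.choice(LONG if random.random() < 0.5 else SHORT)

def sla_break(tokens):
    # Greedy filling: append words while the line still fits in MAXLEN.
    # If the next word ends in "iin", try the "m" abbreviation before breaking.
    lines, cur = [], []
    for w in tokens:
        if len(" ".join(cur + [w])) <= MAXLEN:
            cur.append(w)
        elif w.endswith("iin") and len(" ".join(cur + [w[:-3] + "m"])) <= MAXLEN:
            cur.append(w[:-3] + "m")    # e.g. qokaiin -> qokam, delaying the break
        else:
            lines.append(" ".join(cur))
            cur = [w]
    if cur:
        lines.append(" ".join(cur))
    return lines

lines = sla_break([next_token() for _ in range(100000)])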

The top plot shows that, as I claimed, TLA (and SLA) cause the first word on the line to be longer on average than the overall word length average of 0.5*7 + 0.5*2 = 4.5 (green line).

But the bottom plot shows that the average length of the last word of each line (rightmost point) is practically the same as the overall average! Oops!

That was quite a blow to my intuition. I had observed that shorter words could still be added to the end of the line in those situations where longer words would drop to the next line, and I intuitively concluded that, for that reason, the last word would be more likely to be short. But in fact the breaking of the line depends on the next word, whereas the last word that remains on the line depends on the previous words -- which are independent of the word that caused the break.
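
One can check this numerically on the simulated text (reusing the lines from the sketch above):

words = [ln.split() for ln in lines]
mean = lambda xs: sum(xs) / len(xs)
print("mean first-word length:", mean([len(w[0]) for w in words]))
print("mean last-word length: ", mean([len(w[-1]) for w in words]))
print("mean overall length:   ", mean([len(x) for w in words for x in w]))

The first mean should come out well above 4.5, while the last should sit at the overall average, matching the plots.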

So TLA can explain line-start anomalies, but not line-end anomalies.

One bizarre feature of the top plot is that the average length of the Nth word keeps decreasing below the global average as N increases. That seems to contradict the bottom plot: the average number of words per line is about 10, yet the 15th word from the start (top plot) is shorter than average, while the 5th word from the end (bottom plot) is just average.

That turns out to be an illusion, the result of what could be called "selection" or "survivorship" bias. The lines that have a 15th word at all must contain lots of short words; and since the average length of the 15th word is computed only over those lines, it naturally comes out below the global average. At the extreme, the only lines that have a 19th or 20th word are lines that consist entirely of short words (3 of them in my sample text), so the average length of the 19th and of the 20th word is just 2 letters.

And this bias starts to show even before the 10th word. Lines that have only long words cannot have more than 7 words. Thus the average length of the 8th word is below the global average, because it is computed only over lines that have at least one short word among their first eight; and so on.

The same bias explains the left-hand part of the bottom plot, which shows that the 15th word counting from the end is much shorter than average -- even though the top plot shows that the 5th word counting from the start is precisely average.

This is an artifact that one must be wary of when plotting average word lengths as a function of position, counted in words from the line start or end. Could it be that this selection bias explains some of the claimed line-end anomalies?
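
To see the survivor bias directly, one can tabulate, for each position N, how many lines even have an Nth word, along with the average length at that position (again reusing words from the sketch above):

from collections import defaultdict

tot, cnt = defaultdict(int), defaultdict(int)
for w in words:
    for i, x in enumerate(w):
        tot[i] += len(x)
        cnt[i] += 1
for i in sorted(tot):
    print(f"position {i+1:2d}: {cnt[i]:6d} lines, mean length {tot[i]/cnt[i]:.2f}")

The number of surviving lines should drop sharply beyond position 10 or so, and the mean should fall toward 2 as the deep positions become dominated by all-short lines.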

I will answer @pfeaster in the next post.

All the best, and again apologies --stolfi
(21-04-2026, 12:29 AM)pfeaster Wrote: How could the tendency of the second words of lines to begin disproportionately with [Sh] and [s] be explained as a byproduct of an algorithm for splitting lines based on word length?

Texts in natural languages normally have significant correlations between successive words. That is, the probability Pr(t[i] = Y | t[i-1] = X) is usually very different from Pr(t[i] = Y).

Thus, any process that affects the word frequencies at the start of the line will also affect the frequencies at the 2nd position.  And also at the 3rd position, 4th, and so on, but with decreasing magnitude.

In the case you mention, it must be that words starting with "sh" are more likely to occur after a longer word than after a shorter one.

I did a simulation of this effect by creating two token streams, A and B. Both have 50% "short" tokens with 2 letters and 50% "long" tokens with 7 letters. In stream A, each token is chosen independently, so that there is no correlation between successive tokens. In stream B, a 1st-order Markov process was used to make a long token be followed by another long token 70% of the time (instead of 50%). As a consequence, a short token is also followed by another short one 70% of the time. Here is a sample of the TLA-justified text:

ytchedy dy ar dy ol qokaiin ol ol ol ol ytchedy shochdy
qokaiin shochdy ytchedy dy ar dy qokaiin qokaiin ol dy ar ol
ar ol dy ar ar ol ol ytchedy qokaiin shochdy ytchedy dy
shochdy shochdy shochdy ytchedy ytchedy qokaiin qokaiin dy
ytchedy shochdy ytchedy qokaiin shochdy qokaiin ytchedy ar
shochdy dy ol dy ol ol ol dy ol ar dy qokaiin qokaiin
shochdy dy dy ol ytchedy ytchedy shochdy ol ol dy shochdy dy
ar ol ol shochdy shochdy shochdy qokaiin ar shochdy qokaiin
dy dy dy dy ytchedy shochdy dy dy shochdy ytchedy qokaiin
ytchedy shochdy ytchedy shochdy qokaiin ytchedy ytchedy
shochdy ol ol shochdy qokaiin dy dy ytchedy ytchedy ar ol
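
The correlated stream B can be generated with a two-state Markov chain; here is a rough sketch, reusing the LONG/SHORT token lists from the sketch in my earlier post (p_same = 0.5 reproduces the uncorrelated stream A):

import random

def markov_tokens(n, p_same=0.70):
    # Two-state chain: stay in the same length class with probability p_same,
    # so a long token follows a long one 70% of the time, and short follows short.
    toks, is_long = [], random.random() < 0.5
    for _ in range(n):
        toks.append(random.choice(LONG if is_long else SHORT))
        if random.random() >= p_same:
            is_long = not is_long
    return toks

Since the chain is symmetric, the stationary distribution remains 50% long / 50% short, so only the correlation differs between streams A and B.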

Here are the statistics of word length depending on position along the line:

[attachment=15226]

The plots were truncated to positions 1-7 only. Since all lines of the text have at least 7 tokens, this avoids the selection bias artifacts shown in the previous posts.

The left plot shows that TLA visibly increases the length of words in positions 1 to 4 of the lines when the word lengths are positively correlated; whereas, when there is no correlation, the effect is limited to the first word of the line.

The right plot surprised me again, because it shows that positive length correlation can increase the average length of the last word above the general average word length. I am still struggling to understand the mathematical cause of this result, but it seems real.

All the best, --stolfi
And, to complete the exercise, here is a plot comparing TLA and SLA on the same token stream:

[attachment=15227]

As in the "A" test of the previous post, the input token stream has 50% "long" tokens with 7 letters and 50% "short" ones with two leters, and each word is independently chosen.  The SLA is the same as TLA, except that a word that ends in "iin" will gets that ending replaced by "m" if that change delays a line break that would otherwise occur just before that word.  (The abbreaviation also happens at random in the middle of the lines, with a small probability).

As the plots show, the effect of this sophistication at line-start is barely detectable, and at line-end it is effectively zero. The problem is that, in this simulation, only about 15% of the words end in "iin", the abbreviation fires in only a fraction of those cases, and when it does it shortens the word by only 2 of its 7 letters. Thus the frequency of line-final "m" in the SLA-formatted text is very small.

IIRC, the frequency of line-final "m" is much higher in the VMS, and is one of the main anomalies ascribed to LAAFU. Maybe it could be reproduced by this SLA if we used the actual stream of Voynichese tokens, with line breaks removed and every "m" replaced by "iin".
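
That frequency is easy to measure on the simulated output; a two-liner over the lines produced by the sla_break sketch above:

final_m = sum(1 for ln in lines if ln.split()[-1].endswith("m"))
print("fraction of lines ending in m:", final_m / len(lines))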

BTW, the scripts and data files used to make the plots are available (the link does not survive in this stripped-down view). The main script is do_note_005.sh.

All the best, --stolfi
(22-04-2026, 02:46 AM)Jorge_Stolfi Wrote: In the case you mention, it must be that words starting with "sh" are more likely to occur after a longer word than after a shorter one.

I did a simulation of this effect by creating two token streams....

That's an interesting conjecture, but I believe we could test it more directly than via simulation.

If it's true that words starting with [Sh] are more likely to occur after longer words than shorter ones, then -- in the absence of meaningful line-based patterning -- we should see this effect elsewhere too.  If we were to exclude the beginnings and ends of lines from consideration and calculate the average length of mid-line words that precede words starting with [Sh], it should differ significantly from the norm.

Anyone care to check this, either just for Q20 or more generally?
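
Concretely, I am imagining something like this (a rough sketch; the tokenized-lines input, the comma handling, and the mid-line window of positions 4 to N-4 are all assumptions):

def mean_len_before_sh(lines):
    # Mid-line words only: stay a few positions away from both line edges.
    pre_sh, mid = [], []
    for ln in lines:
        ws = ln.replace(",", " ").split()
        for i in range(3, len(ws) - 4):
            mid.append(len(ws[i]))
            if ws[i + 1].startswith("sh"):
                pre_sh.append(len(ws[i]))
    return sum(pre_sh) / len(pre_sh), sum(mid) / len(mid)

If the first returned mean is clearly above the second, that would support the conjecture without any line-breaking machinery.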
(22-04-2026, 11:36 PM)pfeaster Wrote: That's an interesting conjecture, but I believe we could test it more directly than via simulation.

If it's true that words starting with [Sh] are more likely to occur after longer words than shorter ones, then -- in the absence of meaningful line-based patterning -- we should see this effect elsewhere too.  If we were to exclude the beginnings and ends of lines from consideration and calculate the average length of mid-line words that precede words starting with [Sh], it should differ significantly from the norm.

I ran this test on the Starred Parags section (SPS).

First, the bad news: the results depend entirely on whether I ignore commas or treat commas as word spaces.

First, I discarded parag heads, parag tails, and lines with fewer than 10 words. Among the remainder, I looked at word pairs (w1,w2), where w1 is in position 4 to N-5 (counting from 1), N is the number of words on the line, and w2 is the word immediately after w1.

In the tables below, with bar graphs, the first column is the length L of word w1. The second column is the count of pairs where w1 has length L. The third column is the count of pairs where w1 has length L and w2 begins with "sh". The fourth column is the ratio F of those two counts, that is, the frequency of an "sh" word after a word of length L. The bar graph shows F, where each "X" is worth 0.005 (0.5%).

For this test I used my own transcription of the SPS, which differs slightly from Rene's. Mostly, I was a bit more liberal with the use of commas, and I disagreed with some of his parag breaks.
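
For reference, the tabulation amounts to something like this (a sketch, not the actual script; the discarding of parag heads and tails is assumed to have been done already):

from collections import defaultdict

def sh_after_length(lines, commas_as_spaces):
    n_all, n_sh = defaultdict(int), defaultdict(int)
    for ln in lines:
        ln = ln.replace(",", " ") if commas_as_spaces else ln.replace(",", "")
        ws = ln.split()
        N = len(ws)
        if N < 10:
            continue                    # keep only lines with at least 10 words
        for i in range(3, N - 5):       # w1 in positions 4 to N-5, counting from 1
            n_all[len(ws[i])] += 1
            if ws[i + 1].startswith("sh"):
                n_sh[len(ws[i])] += 1
    for L in sorted(n_all):
        F = n_sh[L] / n_all[L]
        print(f"{L:03d} {n_all[L]:5d} {n_sh[L]:4d} {F:6.3f} " + "X" * round(F / 0.005))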

In both cases there does seem to be a positive correlation between the length of w1 and the probability of w2 starting with "sh". Thus TLA or SLA should indeed enhance the frequency of "sh" words in position 2 of the line.

However, the counts of "sh" words are rather small, therefore affected by sampling noise; and they depend on the handling of commas in some complicated ways. For instance, if I ignore commas, there are 12 words of length 1, but none is followed by an "sh" word. If I treat commas as spaces, there are 61 words of length 1, and 5 of them are followed by an "sh" word:

  l sheedain
  l shedy
  l sheo
  y sheo
  l shalshy

Also, if I ignore commas there are 16 words w1 of length 9, and 3 of them are followed by an "sh" word w2, creating a tall spike on the bar graph:

  lsheedain shear
  oteedaiin sheedy
  cheolchal shchy


But if I treat commas as word spaces, there are 22 words of length 9, and only 2 are followed by an "sh" word (lsheedain splits into l,sheedain and moves to the length-1 list above):

  oteedaiin sheedy
  cheolchal shchy

And these three w1 words (lsheedain, oteedaiin, cheolchal) do look like cases of missing word spaces. That is, they should have had commas in them...

And we don't know which spaces the Scribe considered to be word breaks when he applied his line-breaking algorithm...

All the best, --stolfi

statistics for words 4..-5
considering only non-head, non-tail lines with at least 10 words
ignoring commas
with at least 10 words
!! <f103r.2;U> 15 words, taking 4..10
!! <f103r.3;U> 14 words, taking 4..9
!! <f103r.6;U> 12 words, taking 4..7
330 head lines seen
330 tail lines seen
406 body lines read
314 body lines with at least 10 words
981 word pairs collected

001    12    0  0.000
002    55    1  0.018 XXXX
003    43    2  0.047 XXXXXXXXX
004   146   10  0.068 XXXXXXXXXXXXXX
005   256   15  0.059 XXXXXXXXXXXX
006   240   19  0.079 XXXXXXXXXXXXXXXX
007   142    9  0.063 XXXXXXXXXXXXX
008    58    3  0.052 XXXXXXXXXX
009    16    3  0.188 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
010     8    0  0.000
011     4    0  0.000
014     1    0  0.000
TOT   981   62  0.063 XXXXXXXXXXXXX

statistics for words 4..-5
considering only non-head, non-tail lines with at least 10 words
treating commas as word spaces
with at least 10 words
!! <f103r.2;U> 15 words, taking 4..10
!! <f103r.3;U> 14 words, taking 4..9
!! <f103r.6;U> 13 words, taking 4..8
330 head lines seen
330 tail lines seen
406 body lines read
380 body lines with at least 10 words
1568 word pairs collected

001   61    5  0.082 XXXXXXXXXXXXXXXX
002  121    4  0.033 XXXXXXX
003   94    5  0.053 XXXXXXXXXXX
004  261   19  0.073 XXXXXXXXXXXXXXX
005  404   22  0.054 XXXXXXXXXXX
006  352   27  0.077 XXXXXXXXXXXXXXX
007  186   14  0.075 XXXXXXXXXXXXXXX
008   62    4  0.065 XXXXXXXXXXXXX
009   22    2  0.091 XXXXXXXXXXXXXXXXXX
010    3    0  0.000
011    2    0  0.000
TOT 1568  102  0.065 XXXXXXXXXXXXX
I am sorry to say this, but I believe that this line of investigation is and will be a big waste of time.  I hope you can see it even if you don't believe in the Chinese Origin theory and the SPS=SBJ claim.

The line-break anomalies (statistical anomalies at the start and end of lines) are real, sure.

But a first problem is that it seems hard even to describe what the anomalies really are. There is a bunch of differences in character statistics that are basically corollaries of the different word frequencies at those positions, because the frequency of a character or digraph in a certain context is determined by the frequencies, in that context, of the words that contain that character or digraph. And the differences in word distributions at different positions seem to be rather complicated, with no simple rule.

Moreover, there are so many potential "boring" causes that could explain those anomalies: the length bias of basic TLA, the use of abbreviations, and the stretching and shrinking that suppresses word spaces or creates bogus ones.

The Scribe may have consciously or unconsciously favored line breaks in certain word contexts, like before words with gallows.  Or the Author's draft may have had line breaks at semantically significant places, like sentence boundaries or where we today would put a comma; and the Scribe then may have tended to accept those breaks if they happened to fall near the end of his line. 

And then there are the many errors introduced at all stages -- spelling errors by the Author, reading and writing errors by the Scribe, errors added by BEEPs, and transcriber errors -- including missing, bogus, and ambiguous word spaces.

And the main problem is that we do not know what all those "boring" causes are, nor how much effect they have on the line-break anomalies.

So, even if the LAAFU thesis is correct, I see no hope of disentangling its "signal" from all that "noise" of those "boring" causes.  

And those "boring" causes make the anomalies so complicated that I see no hope of getting any insight into anything by tabulating and plotting a bunch of statistics.  

It is like trying to understand the ecology of a forest by tabulating the heights, tail lengths, wing spans, and the colors of eyes and noses of every animal that goes through a checkpoint.  Those numbers are "meaningful" in principle, but they would mash together ants and antelopes, beavers and butterflies, cranes and crocodiles, zebras and zombies...  One could spend a lifetime staring at those numbers and analyzing them in a million sophisticated ways, but would never get any useful conclusion out of them.

All the best, --stolfi
(23-04-2026, 02:38 PM)Jorge_Stolfi Wrote: I am sorry to say this, but I believe that this line of investigation is and will be a big waste of time.

I don't agree, Jorge.

According to the data, and as a summary of my coding in this thread, the lines seem to have some strong initial tokens, some strong final tokens, and a middle region. But what I find more interesting is that each line seems independent from the next or previous one. The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.
Quote: The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.

I agree. This is unnatural.

In real text you can predict the next word with some probability based on the previous word. For example, if you have an English sentence "I am washing..." then there are a few obvious choices for the next word, like "I am washing myself", "I am washing dishes", or even "I am washing my dog", but over 99% of other words won't fit there.

There is one case where the next word becomes very unpredictable: the end of a sentence. But assuming that each line of the Voynich text is a full sentence would be very weird. Such things may happen in poetry, but we don't have the slightest hint that the Voynich is poetry.

There is a concept in Voynich studies called LAAFU (Line As A Functional Unit). I believe this observation may be considered part of LAAFU. It is indeed a big puzzle, and something that does not happen in real texts.
(23-04-2026, 08:39 PM)quimqu Wrote: According to the data, and as a summary of my coding in this thread, the lines seem to have some strong initial tokens, some strong final tokens, and a middle region. But what I find more interesting is that each line seems independent from the next or previous one. The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.

Are you sure this result is real, and not just an artifact of sampling error?

I just tried looking at whether word k+2 can be predicted from word k, both in the middle of the line and across line breaks.

Specifically, I took all 483 parag body lines (neither head nor tail) with at least 10 words in the Starred Parags section, considering commas as word spaces. I did two passes, (A) and (B). From each line, I extracted a pair of words {W1,W2} where
(A) W1 is word N/2-1 and W2 is word N/2+1 of the N words on the line; or
(B) W1 is word N-1 of the line and W2 is word 1 of the next line.

I skipped one word because the last word of the line is known to be anomalous, due to abbreviations, compression, etc. This might reduce the predictability a lot, but it should affect (A) and (B) equally.

Anyway, this potential problem turned out not to matter. What happened is that practically every pair {W1,W2} occurred only once in (A) and (B) together. Thus 483 pairs seem to be way too small a sample to determine whether W2 can be predicted from W1.
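
In code, the extraction was essentially this (a sketch with illustrative names; the head/tail filtering is omitted):

def skip_one_pairs(lines):
    # (A) mid-line: W1 = word N/2-1, W2 = word N/2+1, skipping one word between them.
    # (B) across the break: W1 = word N-1, W2 = word 1 of the next line.
    A, B = [], []
    for k, ln in enumerate(lines):
        ws = ln.split()
        N = len(ws)
        if N < 10:
            continue
        A.append((ws[N // 2 - 2], ws[N // 2]))    # 1-based positions N/2-1 and N/2+1
        if k + 1 < len(lines) and lines[k + 1].split():
            B.append((ws[-2], lines[k + 1].split()[0]))
    return A, B

Counting the pairs with collections.Counter then shows almost every {W1,W2} occurring exactly once.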

So, how can you say that there is definitely no correlation between words across a line break? Should I try again without skipping one word between W1 and W2?

All the best, --stolfi
(23-04-2026, 08:39 PM)quimqu Wrote:
(23-04-2026, 02:38 PM)Jorge_Stolfi Wrote: I am sorry to say this, but I believe that this line of investigation is and will be a big waste of time.

I don't agree, Jorge.

According to the data, and as a summary of my coding in this thread, the lines seem to have some strong initial tokens, some strong final tokens, and a middle region. But what I find more interesting is that each line seems independent from the next or previous one. The results for that are strong, indicating that there is almost no connection from the last words of a line to the first words of the next line. This, in my opinion, doesn't happen in a "normal" text and is a clue that we should take into account.

Sounds like a conlang, where there would be liberty with words that did not hold much meaning?