The Voynich Ninja

Full Version: Words around text intrusions/image breaks and towards the start/end of lines
So apparently a pretty common thing in medieval times was to make lines very cleanly justified by any means necessary, including abbreviation.

I hypothesized that this would mess with statistics around drawings and line breaks. I filtered by lines not at the top or bottom of a paragraph, looked at a bunch of statistics, and made some observations.

Observation 1: The letter "e" is significantly less common in words around breaks, except for at the start of lines. It's about 40% as common as expected, before both line breaks and text intrusions, and 60% as common after text intrusions. Similarly, the letter q is around 60% as common both before line breaks and text intrusions, but 19% as common after text intrusions. I believe that all of this is due to omitting those characters when needed to properly align the text. The letter k at 70% may also commonly be removed for abbreviation.
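For anyone who wants to reproduce ratios of this kind, here is a minimal Python sketch of one way to compute them. It is not the script used for the numbers above: the function name, the input format (each line as a list of EVA words), and the choice of mid-line words as the "expected" baseline are all assumptions made for illustration. Note that counting "s" this way also counts the s inside "sh".

```python
from collections import Counter

def glyph_ratio(lines, glyph, position):
    """Rate of `glyph` in words at `position` ('first' or 'last'),
    divided by its rate in mid-line words (the assumed baseline)."""
    target = Counter()
    base = Counter()
    for words in lines:
        if len(words) < 3:
            continue  # need at least one mid-line word
        picked = words[0] if position == "first" else words[-1]
        target["glyphs"] += len(picked)
        target["hits"] += picked.count(glyph)
        for w in words[1:-1]:
            base["glyphs"] += len(w)
            base["hits"] += w.count(glyph)
    if not target["glyphs"] or not base["hits"]:
        return None  # not enough data to form a ratio
    target_rate = target["hits"] / target["glyphs"]
    base_rate = base["hits"] / base["glyphs"]
    return target_rate / base_rate
```

A value of 0.4 from this function would correspond to "40% as common as expected" under these assumptions; a different baseline (for example, all words in the text) would give different numbers.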

Observation 2: The letter "s" is about 3 times as common in words at the start of lines and both before and after drawings, and 63% more common at the end of lines. For words sandwiched between two drawings, it's 6 times more common. My guesses as to the meaning of this:
  • Based on varying decreases in the counts of words like "al", "ar", "ol", "or", and "aiin", with increases to words starting with "s", it could be a padding character.
  • "sh" is 38% as common as expected at the end of lines and 61% as common before text intrusions, so part of it could be abbreviation.
  • One of the s-words may serve to mark the start of a sentence.

Observation 3: Words at line end and before text intrusions are shorter on average, by around 0.3 glyphs and 0.5 glyphs, respectively. There are also a lot of short words that seem to almost exclusively appear in these positions, like "sy", "oly", "oldy", "dy", "oky", "ldy", "ary", "lol", etc. My guess is that these words are either abbreviations or nonsense to pad for length.
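A rough sketch of how the length gap and the "almost exclusively line-final" words could be measured, in Python. The function names, the thresholds, and the input format (each line as a list of words) are hypothetical choices, not the method actually used for the figures above.

```python
from collections import Counter
from statistics import mean

def length_gap(lines):
    """Mean length of mid-line words minus mean length of line-final words.
    Only lines with at least 3 words are used; raises StatisticsError
    if there are none."""
    usable = [ws for ws in lines if len(ws) >= 3]
    last = [len(ws[-1]) for ws in usable]
    mid = [len(w) for ws in usable for w in ws[1:-1]]
    return mean(mid) - mean(last)

def mostly_line_final(lines, min_count=3, threshold=0.8):
    """Words seen at least `min_count` times whose share of line-final
    occurrences is at least `threshold` (both cutoffs are arbitrary)."""
    total = Counter(w for ws in lines for w in ws)
    final = Counter(ws[-1] for ws in lines if ws)
    return sorted(w for w, n in total.items()
                  if n >= min_count and final[w] / n >= threshold)
```

The same two functions can be pointed at the word before a text intrusion by splitting each line at the intrusion and treating the left part as its own "line".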


Observation 4: This has been [link], but words at the start of lines often have a "d" or "y" for padding if the word starts with "ch" or "sh", creating a lot of words that almost exclusively appear at line start. The word "sho" is a strange exception, and is also a word that appears primarily at the start of lines. "d" is also usable in place of "s" for most of the words where it seems to be used as padding.
(28-01-2026, 11:27 PM)zachary.kaelan Wrote: common thing in medieval times was to make lines very cleanly justified by any means necessary,

That is still true today.  Decent book publishers do not use line fillers and leave the tail line of paragraphs short.  But they will use hyphenation and tweak the spaces between words, sometimes even between letters, in order to get an even right rail (right edge of the text).

The VMS Scribe did not use hyphenation (that we know of), but it seems that he used abbreviations.  Many lines end in am, and m seems less common at other places along the line.

On the other hand, he was not overly concerned with appearance.  His lines are all bent and tilted and unevenly spaced, and he seems to have regarded the right rail like the Pirate's Code: more like a suggestion than a binding rule...

Quote:I hypothesized that this would mess with statistics around drawings and line breaks. ... Words at line end and before text intrusions are shorter on average, by around 0.3 glyphs and 0.5 glyphs, respectively.

You don't need an obsessive-compulsive Scribe for that.  

The trivial line-breaking algorithm is: if there is enough space on the current line for the next word, write it and continue.  If not, break the line there and start a new line at the left rail.

People don't seem to realize, but this banal algorithm results in the first word of each line being longer than average, with the last 1-3 words of each line being shorter than average.  
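The effect can be tried out with a small simulation. The Python sketch below implements the trivial line-breaking rule as described and measures mean word length by position. The "text" is hypothetical: 20000 words with lengths drawn uniformly from 1 to 9, wrapped at an arbitrary width of 40 characters, so it only illustrates the mechanism and says nothing about the VMS itself.

```python
import random
from statistics import mean

def greedy_wrap(words, width):
    """Put each word on the current line if it fits (with one space
    before it); otherwise start a new line with that word."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines

def position_means(lines):
    """Mean word length at line start, at line end, and over all words."""
    usable = [ws for ws in lines if len(ws) >= 2]
    first = mean(len(ws[0]) for ws in usable)
    last = mean(len(ws[-1]) for ws in usable)
    overall = mean(len(w) for ws in lines for w in ws)
    return first, last, overall

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical word stream: lengths uniform on 1..9, content irrelevant.
    words = ["x" * rng.randint(1, 9) for _ in range(20000)]
    first, last, overall = position_means(greedy_wrap(words, width=40))
    print(f"first={first:.2f} last={last:.2f} overall={overall:.2f}")
```

A word starts a line only when it failed to fit in the space left over, and long words fail that test more often than short ones, so the line-initial mean should come out above the overall mean. Since the overall mean is a mixture of line-initial and other words, the remaining positions must then average below it.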

If the first word of the line is longer, it means that the frequency of each word at the start of a line is different from its frequency at the end, and different from its frequency elsewhere along the line.  The word "therefore" will be more common than expected at the beginning of lines, while the words "the", "is", "if", "in" will be much more common at the end of lines.

If the word frequencies are different, then the frequencies of characters (and digraphs etc) will be different too.  Because the frequency of a character is determined mainly by whether it occurs in the most common words.  The digraph "th" is common in English because it occurs in many common words like "the", "this", "that", "they", "them", "their", "there", "thus", "then", "than"...  But these words are shorter than average, so they will be more common at the end of lines than at the beginning.  And thus the same will happen to the frequency of "th".

Thus anyone claiming that X or Y is more common than expected at the start or end of lines should check whether that anomaly is not caused by this effect.
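One possible way to run such a check, sketched in Python: shuffle the words of the text, re-wrap them with the trivial algorithm, and see how often the shuffled text shows the same line-start statistic as the real one. Everything here is an assumption for illustration, in particular the use of a single line width (the widest observed line), which ignores the narrower lines next to drawings.

```python
import random

def greedy_wrap(words, width):
    """Trivial line breaking: append a word if it fits, else start a new line."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines

def line_start_share(lines, predicate):
    """Fraction of line-initial words for which predicate(word) is true."""
    firsts = [ws[0] for ws in lines if ws]
    return sum(1 for w in firsts if predicate(w)) / len(firsts)

def null_shares(lines, predicate, trials=200, seed=0):
    """The same statistic for shuffled word order plus greedy re-wrapping.
    Shuffling keeps word frequencies but removes any other link between
    a word and its position in the line."""
    rng = random.Random(seed)
    words = [w for ws in lines for w in ws]
    width = max(len(" ".join(ws)) for ws in lines)  # assumed common width
    out = []
    for _ in range(trials):
        rng.shuffle(words)
        out.append(line_start_share(greedy_wrap(words, width), predicate))
    return out
```

For example, with `predicate = lambda w: w.startswith("d")`, an observed share that sits well inside the range of `null_shares(...)` would be explainable by line breaking alone, while one far outside that range would need another explanation.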

Quote:The letter "e" is significantly less common in words around breaks ... The letter "s" is about 3 times as common in words at the start of lines and both before and after drawings, and 63% more common at the end of lines ... 

These anomalies could be due to the letter e being used mostly in medium-length words like  cheedy and used less in both longer and shorter words.

Quote:There are also a lot of short words that seem to almost exclusively appear in these positions, like "sy", "oly", "oldy", "dy", "oky", "ldy", "ary", "lol", etc. My guess is that these words are either abbreviations or nonsense to pad for length.

Those may be abbreviated words indeed.

Quote:This has been [link], but words at the start of lines often have a "d" or "y" for padding if the word starts with "ch" or "sh", creating a lot of words that almost exclusively appear at line start.

It makes no sense to have padding at the start of a line.  It may be simply that many long words start with y or d.

Quote:The word "sho" being a strange exception, and also a word that appears primarily at the start of lines.

The Scribe was not very consistent about the width of spaces. He often left wider spaces between certain pairs of letters, such as between an o and a gallows.  Transcribers have a hard time deciding whether some gaps are word breaks or not. Thus those Sho at line start may have been part of a long word that got split in the transcription.

And I must insist on two pieces of advice:

  1. Letter and digraph statistics are projections of word statistics.  Thus the frequency of "u" in English is a combination of the frequencies of the words that have "u", like "thus", "tetanus", "butane", "Zulu", and "weltanschauung".  For that reason, such statistics are often more confusing than illuminating.  It is like trying to study the ecology of a place by counting animals with no tails, like gorillas, toads, and cockroaches; with short tails, like turtles and elephants; and with long tails,  like lions and snakes.  From those numbers one may learn that the Sahara is different from the Amazon, but one would never understand why, or go beyond that.  Thus, whenever possible, compute and study the frequency of words rather than glyphs.
  2. There is no such thing as the frequency of a letter or word in a language. Statistics are properties of specific texts, not of their languages.  Different texts can naturally have very different frequencies, even for basic function words like "the", "and", "is", because their topic and style influence the frequency of words, sentence structures, tenses, etc.  Therefore, when computing such statistics, be careful to use a text that is as homogeneous as possible.  Do not lump labels with paragraphs, Herbal with Bio, etc.  It is better to study just one section of the VMS in detail than try to understand all sections at the same time -- even if you compute the statistics separately for each section.
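As a minimal illustration of both points, here is a Python sketch that computes relative word frequencies separately for each section instead of pooling the text. The record format (pairs of section label and word) and the function names are hypothetical; how a real transcription file is split into sections is not shown.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each word within one homogeneous sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def per_section(records):
    """records: iterable of (section, word) pairs.
    Returns {section: {word: relative frequency}} with no pooling."""
    by_section = {}
    for section, word in records:
        by_section.setdefault(section, []).append(word)
    return {s: relative_freqs(ws) for s, ws in by_section.items()}
```

Comparing the resulting tables side by side makes it visible when a word that looks common "in the VMS" is really common in one section only.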

All the best, --stolfi
(29-01-2026, 07:22 AM)Jorge_Stolfi Wrote: Thus, whenever possible, compute and study the frequency of words rather than glyphs.

I didn't mention it, but I did, actually. Specifically, I looked at the rates of words in specific positions compared to words anywhere not at the start or end of the line. 

Despite the word "aiin" occurring 529 times in the text, it never occurs at the start of a line, and the same goes for "ain". The words "daiin" and "dain" occur about twice as often as expected at the start of lines. The word "shedy" is 19% as common as expected at the start of lines, and 91% of occurrences of the word "dshedy" are at the start of lines. Other words that are abnormally common at the starts of lines include dshey, yshey, dchor, ychor, dchey, ychey, dcheey, ycheey, dshedy, yshedy, dshor, dchol, and ycheol. 
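Roughly, the comparison looks like the Python sketch below. This is a simplified reconstruction rather than the exact script: the function names are made up here, lines are lists of words, and "expected" is taken to mean the word's rate among mid-line words (anything not first or last in its line).

```python
from collections import Counter

def start_vs_mid(lines):
    """For each word seen mid-line: its rate among line-initial words
    divided by its rate among mid-line words. Lines with fewer than
    3 words are skipped because they have no mid-line position."""
    usable = [ws for ws in lines if len(ws) >= 3]
    first = Counter(ws[0] for ws in usable)
    mid = Counter(w for ws in usable for w in ws[1:-1])
    n_first, n_mid = sum(first.values()), sum(mid.values())
    return {w: (first[w] / n_first) / (mid[w] / n_mid) for w in mid}

def share_at_start(lines, word):
    """Fraction of all occurrences of `word` that are line-initial."""
    total = sum(ws.count(word) for ws in lines)
    at_start = sum(1 for ws in lines if ws and ws[0] == word)
    return at_start / total if total else None
```

A ratio of 2.0 from `start_vs_mid` corresponds to "twice as often as expected", and `share_at_start` gives figures like the 91% quoted for "dshedy". Words that never occur mid-line get no ratio at all, which is why the second function is needed for them.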

(29-01-2026, 07:22 AM)Jorge_Stolfi Wrote: Many lines end in am, and m seems less common at other places along the line.

It does also occur a solid 18 times before text intrusions, most commonly as the trigraphs "dam" or "dom".

(29-01-2026, 07:22 AM)Jorge_Stolfi Wrote: The trivial line-breaking algorithm is: if there is enough space on the current line for the next word, write it and continue.  If not, break the line there and start a new line at the left rail.

People don't seem to realize, but this banal algorithm results in the first word of each line being longer than average, with the last 1-3 words of each line being shorter than average.

The discrepancy in word length can probably at least mostly be attributed to that, though on some folios like [link] there's definitely a bunch of what objectively looks like filler:

[Image: 30fJpNI.png]
[Image: T3raHD7.png]

(29-01-2026, 07:22 AM)Jorge_Stolfi Wrote: These anomalies could be due to the letter e being used mostly in medium-length words like  cheedy and used less in both longer and shorter words.

More investigation is required.

(29-01-2026, 07:22 AM)Jorge_Stolfi Wrote: It makes no sense to have padding at the start of a line.  It may be simply that many long words start with y or d.

Of course it makes no sense, but it's what they appeared to have been doing. Maybe they're nulls meant to make things more difficult. Maybe it's genuinely a stupid way to pad for length. Maybe they play some role in the cipher. Maybe it's shorthand to mark the start of sentences. Maybe it's something arbitrarily decided on for reasons we'll never know.
(29-01-2026, 10:19 PM)zachary.kaelan Wrote: Despite the word "aiin" occurring 529 times in the text, it never occurs at the start of a line, similar with "ain".

I take that as evidence that aiin and ain are not complete words, but suffixes that got split off in the transcription file.

The Scribe has the habit of leaving extra space after r and s. On a quick check, many of the aiin indeed follow an r or s, and often that glyph is a word by itself.  These cases are likely instances of saiin and raiin that got incorrectly split.  But I see that there are other instances of aiin beyond those, so the truth may be more complicated...

Quote:The words "daiin" and "dain" occur about twice as often as expected at the start of lines.
  

There the situation may be the opposite: many of those daiin may be prefixes of longer words.

Curiously it seems that, of the ~70 instances of pairs daiin.X. at the start of a line, most of them are unique.  In the transcription I use, the only such pairs that occur more than once are 4 daiin.Shey, 3 daiin.Chey, 2 daiin.al, 2 daiin.Cheey, and 2 daiin.ol = 13 lines total.  I suppose that the excess of line-initial daiin is more than that, correct?
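For anyone who wants to repeat this count on another transcription, a minimal Python sketch follows. The function names are hypothetical and lines are assumed to be already split into lists of words; the counts will depend on the transcription used.

```python
from collections import Counter

def initial_pairs(lines, first_word):
    """Count the second word X over all lines that begin with first_word."""
    return Counter(ws[1] for ws in lines
                   if len(ws) >= 2 and ws[0] == first_word)

def repeated(pairs):
    """Pairs seen more than once, and the number of lines they cover."""
    rep = {x: n for x, n in pairs.items() if n > 1}
    return rep, sum(rep.values())
```

Calling `repeated(initial_pairs(lines, "daiin"))` returns the repeated daiin.X pairs together with their combined line count, which can then be compared with the excess of line-initial daiin.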

Quote: on some folios like [link] there's definitely a bunch of what objectively looks like filler:

[Image: 30fJpNI.png]
[Image: T3raHD7.png]

Why do you say so?  saiin and okol are common words, so those may be just cases where the Scribe thought it was OK to split a word across a plant.  And dy.dy or dydy is not uncommon either.

Quote:Maybe it's genuinely a stupid way to pad for length.

I meant, the Scribe would not know whether he would need padding until he got near the end of the line.

Quote:Maybe it's shorthand to mark the start of sentences.

There are many possible explanations of that general nature.

For instance, some manuscripts of the time have some mark in the left margin of a line if a sentence/topic/paragraph starts anywhere within that line.

Quote:Maybe it's something arbitrarily decided on for reasons we'll never know.

When we finally get to read the text, the explanation of many of those baffling mysteries will be obvious.  Like the abbreviations and other weird quirks of any medieval manuscript...

All the best, --stolfi
(28-01-2026, 11:27 PM)zachary.kaelan Wrote: The letter "s" is about 3 times as common in words at the start of lines and both before and after drawings, and 63% more common at the end of lines.


Before you draw conclusions about the character  s you need to take into account the fact that language A text is different from language B text.  s words are more common in A: the frequency of character  s is 2.3% in language A and 1.2% in B.

( I need to make it clear that I am using the GC transliteration and only counting 101-s and 101-t characters. )

Also, look at the spline transfer plots of the line positions of words starting  s. In both A and B words peak at the start of lines, but in A  they rise in frequency towards the end.


[attachment=13792]
[attachment=13793]


For words ending  s there is a slight rise towards line ends in language A.

[attachment=13794]
[attachment=13795]
But also here is a curiosity about words starting  s. The parts of the words that follow seem themselves to be genuine words that appear frequently. This seems to suggest that initial  s is something of a nonsense character, that the writer will often start a line by putting down  s and then continuing with another word.


[attachment=13797]
(30-01-2026, 10:32 AM)dashstofsk Wrote: Also, look at the spline transfer plots of the line positions of words starting  s. In both A and B words peak at the start of lines, but in A  they rise in frequency towards the end.

Beware that the Scribe tends to leave a wider space after an s or r, which often gets transcribed as a word space.  Also the shapes of r and s are often ambiguous so one is often transcribed as the other.  

The prevalence of these problems will be different for different scribes (or different lengths of practice).  And the width of the bogus spaces after r and s, like the widths of characters in general, may vary along the line as the Scribe becomes more conscious of the approaching right rail and starts to squeeze things closer together.

One way to mitigate these problems would be to delete all spaces, map all s to r, map all a and y to o, and then look for occurrences of a given substring like raiiin or or in the compressed lines.  This trick would probably conflate unrelated words, but would mostly eliminate the noise due to the most common transcription errors.
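A minimal Python sketch of that normalization, under a few assumptions that are not part of the recipe above: both " " and "." are treated as word spaces, the query is normalized the same way as the lines before searching, and the s inside "sh" is mapped along with every other s (a real run might want to protect "sh" first). The function names are made up for the example.

```python
# Delete word spaces, map s -> r, and map a and y -> o.
_TABLE = str.maketrans({"s": "r", "a": "o", "y": "o", " ": None, ".": None})

def compress(line):
    """Return the line with word spaces removed and the lossy glyph
    mapping applied."""
    return line.translate(_TABLE)

def count_hits(lines, query):
    """Count non-overlapping occurrences of the normalized query in the
    normalized lines."""
    q = compress(query)
    return sum(compress(line).count(q) for line in lines)
```

For instance, `count_hits(lines, "saiin")` searches for "roiin" and so also finds "s.aiin", "r.aiin" and "raiin", which is the point of the trick, at the price of conflating words that may be unrelated.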

By the way, smoothing the distributions with splines may obfuscate important details.  The bias towards long tokens at line start is in principle limited to the first token only; the splines make it seem that the effect extends for several tokens after the first.  

On the other hand, the word frequencies in the second position will be affected by the line-breaking algorithm, because there is correlation between nearby words.  E.g. "ache" may be extra common in second position if "stomach ache" is common, because "stomach" is a long word.  Consider also "take XXXX daily" where the XXXX can be one or more words.

All the best, --stolfi
(28-01-2026, 11:27 PM)zachary.kaelan Wrote: "d" is also usable in place of "s" for most of the words where it seems to be used as padding.


This is correct. Perhaps it helps to view the exchange for the most frequent words that start with these characters. It seems to suggest that these two characters are interchangeable when they appear at the start of a word. Almost always every  s word gives an  8 word. This is curious and in my opinion is evidence of artificiality in the text.
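A table of this kind can be built along the following lines. This Python sketch is only an approximation of the idea, not the code behind the attachments: it uses EVA-style "s" and "d" rather than the GC codes, skips words starting with "sh" on the assumption that "sh" is a separate glyph, and all names and defaults are hypothetical.

```python
from collections import Counter

def initial_swap_table(words, a="s", b="d", exclude_prefix="sh", top=10):
    """For the most frequent words starting with `a`, list the counts of
    the word itself, of the remainder without `a`, and of the
    counterpart that starts with `b` instead."""
    counts = Counter(words)
    a_words = [w for w, _ in counts.most_common()
               if w.startswith(a) and len(w) > 1
               and not w.startswith(exclude_prefix)]
    rows = []
    for w in a_words[:top]:
        rest = w[1:]
        rows.append((w, counts[w], rest, counts[rest],
                     b + rest, counts[b + rest]))
    return rows
```

Each row has the form (s-word, count, remainder, count, d-word, count). Rows where the third count is zero are the cases where the exchange does not hold.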

[attachment=13798]
[attachment=13799]
(29-01-2026, 10:19 PM)zachary.kaelan Wrote: a bunch of what objectively looks like filler:

Unfortunately, Zachary, imgur images are blocked to people in the UK. Are you able to post the images instead of giving links? Thanks.
(30-01-2026, 11:11 AM)dashstofsk Wrote: But also here is a curiosity about words starting with s.  The parts of the words that follow seem themselves to be genuine words that appear frequently. This seems to suggest that initial s is something of a nonsense character, that the writer will often start a line by putting down s and then continuing with another word.

Many languages have phonetic tone, meaning that the pitch or pitch pattern changes the meaning of words.  Not just Chinese and other East Asian languages: Swedish and various scattered languages of Africa and the Americas are like that too.  (In Romance languages, a rising tone at the end of a sentence generally changes it from affirmative to interrogative; and a higher pitch is used to emphasize certain words.) 

There are several ways to denote tones in writing, like italics, question marks, diacritics, pitch brackets ˹˺ etc.  One common way is to use diacritics on vowels to specify one of a few pitch patterns, like in the pinyin rendering of Mandarin: e.g. zhǔ for the third "dipping" tone.  An alternative is to use digit suffixes or superscripts, like zhu3 or zhu³.

But another way is to encode pitch levels rather than pitch profiles.  Using digits 1,2,3 for low, medium, and high pitch, the Mandarin third tone would be written as zh21u3 meaning that the syllable is to be pronounced by varying the pitch from middle to low to high.  This notation is more cumbersome, but it can handle different dialects or languages with different tone profiles, like Cantonese (6 tones) or Vietnamese (8 tones), with a single notation.  It may also be chosen when one is not quite sure about the set of tone profiles of the language.  

However the digits may be inserted anywhere in the syllable, so that same word could be written 213zhu, zh2u13, zhu213 etc. And when the next syllable starts with the same pitch that the previous one ended with, one could omit the starting pitch of the second one: so instead of m1a2 m2a2 m21a3 one could write just m1a2 ma ma13.

(And then there may be regular changes in the pitch pattern of a syllable depending on that of the previous syllable, if there is no pause between them.  But the picture is complicated enough as it is.)

All this to say that, if the language is tonal, and some letters like a, o, y denote pitch levels, then the Scribe may omit redundant pitch codes in the middle of a line, as above, but write them out at the start of a line, to make reading easier.

And this is only to point out that there may be many explanations for the "LAAFU" effects that do not imply a semantic or cryptological function for line breaks, and are compatible with these being chosen by the Scribe with the trivial line-breaking algorithm.

All the best, --stolfi