The Voynich Ninja - Analysis of patterns at the very start and end of lines, per line type

When looking at the Voynich text it helps to separate different kinds of lines instead of treating the whole corpus as one block. A line at the start of a paragraph does not behave like a line in the middle, and a label line does not look like a body line. By checking the very first and last characters of each line, and comparing them with what is normal for the overall corpus, some clear patterns appear. Paragraph beginnings have their own favorite starters, body lines are more balanced, and lines outside of paragraphs follow yet another rule. What follows is a summary of these differences, line type by line type.
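The tallying step described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the figures in this thread: the input format (pairs of line type and text, with words separated by "."), the line-type names, and the sample tokens are all assumptions, and each transcription character is counted as one symbol.

```python
from collections import Counter, defaultdict

def edge_char_shares(lines):
    """Return (start_shares, end_shares): for each line type, the share of
    each character seen at the very start and at the very end of a line.

    `lines` is an iterable of (line_type, text) pairs; words inside `text`
    are assumed to be separated by '.'.
    """
    starts = defaultdict(Counter)
    ends = defaultdict(Counter)
    for line_type, text in lines:
        words = [w for w in text.split(".") if w]
        if not words:
            continue
        starts[line_type][words[0][0]] += 1    # first character of first word
        ends[line_type][words[-1][-1]] += 1    # last character of last word

    def normalise(counter):
        total = sum(counter.values())
        return {ch: n / total for ch, n in counter.items()}

    return ({t: normalise(c) for t, c in starts.items()},
            {t: normalise(c) for t, c in ends.items()})

# Invented toy sample, only to show the call; not real transcription data.
sample = [
    ("label", "otedy"),
    ("label", "okary"),
    ("para_initial", "pchedy.qokeey.dar"),
    ("body", "daiin.shey.qokam"),
]
start_shares, end_shares = edge_char_shares(sample)
```

Comparing these per-type shares with the same shares computed over all words of the corpus gives the "versus global" contrasts discussed below.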

Labels
These short label lines behave differently from running text.

Starts: They very often begin with o (about 50%). Compared with the overall corpus, that’s a strong tilt toward o. Against their own local word mix it’s only mildly unusual, but versus the full corpus it stands out.
Ends: They tend to end in y, with smaller bumps for m and d.
Summary: Labels have an o- opening habit and a fairly y-heavy ending, but otherwise don’t diverge wildly from their own local vocabulary.

Paragraph-initial lines
This is where the strongest cue lives.

Starts: Paragraph-initial lines strongly favor p, t, k, and f. Those openings are over-represented both against their own local word stock and against the whole corpus. By contrast, starts like o, q, c, s are under-represented here.
Ends: Endings don’t separate these lines much, with the clear exception that m is noticeably more common at the very end.
Distance: The divergence at the first character is high (JS_init ≈ 0.38 vs local, 0.52 vs global). In other words, the very first character of these lines is much less like the rest.
Summary: If you see a line starting with p, t, k, or f, odds are it’s the first line of a paragraph. This entry code is the sharpest positional signal in the manuscript.
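For readers unfamiliar with the JS figures, here is a minimal sketch of a Jensen-Shannon divergence between two character distributions. It assumes base-2 logarithms, so the value runs from 0 (identical) to 1 (no overlap). The post does not say which base was used, or whether JS_init is the divergence or its square root, so this will not necessarily reproduce the numbers above. The two example distributions are invented.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two dicts mapping
    character -> probability. Missing characters count as probability 0."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence KL(a || m); terms with a[k] == 0 vanish
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Invented distributions: line-initial characters vs. a reference word stock.
line_initial = {"p": 0.5, "t": 0.3, "k": 0.2}
reference = {"o": 0.4, "q": 0.3, "c": 0.2, "p": 0.1}
divergence = js_divergence(line_initial, reference)
```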

Body lines (within paragraphs)
These are ordinary running lines within the paragraph (not the first line, nor the last line).

Starts: A mixed, stable recipe: d, s, y, q, o dominate, with c notably lower than in the general word stock.
Ends: y is slightly lower than global norms, while m at line end is clearly higher than we’d expect from the local body vocabulary.
Distance: Starts diverge modestly (JS_init ~ 0.12); ends are low (~0.066).
Summary: This is the baseline flow of the text: predictable starts and the familiar y, n, l, r, m mix at the end, with m a recurring tail.

Last lines of paragraphs
The last lines of paragraphs form a group that is broadly similar to the body lines.

Starts: Again y, s, d, o, q; c is much lower than its local/global baselines.
Ends: y and n are higher; l and r are lower. Log-odds confirm m, y, g are over-represented at the very end, while l, r, o are under.
Distance: Modest at the start (JS_init ~ 0.14), low at the end (~0.048).
Summary: Another "normal text" profile; very close to the first body set (the intra-paragraph lines), with slightly stronger y at both ends.
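The log-odds comparison mentioned above can be illustrated as follows. This is one common definition (a smoothed log odds ratio), not necessarily the one behind the figures in this post; the smoothing constant `alpha` and the counts in the example are assumptions.

```python
import math

def log_odds(count_a, total_a, count_b, total_b, alpha=0.5):
    """Smoothed log odds ratio of an event in sample A versus sample B.

    Positive: the event is over-represented in A relative to B.
    Negative: under-represented. `alpha` avoids division by zero.
    """
    odds_a = (count_a + alpha) / (total_a - count_a + alpha)
    odds_b = (count_b + alpha) / (total_b - count_b + alpha)
    return math.log(odds_a / odds_b)

# Invented counts: lines ending in 'm' among paragraph-final lines (A)
# versus words ending in 'm' in a reference word stock (B).
score_m = log_odds(count_a=30, total_a=100, count_b=10, total_b=100)
```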

Non-paragraph lines (outside any paragraph)
Standalone or detached lines show a different entry behavior.

Starts: A very strong bias to o- (about 66% of line starts). Against the whole corpus this is striking, though against their own local word mix it’s milder.
Ends: y is high (≈ 47%), and m is also above baseline.
Distance: Start vs global is notable (JS_init ≈ 0.21), while start vs. local and ends are relatively low.
Summary: These lines have an o-anchor at the very start; the second character contributes far less than the first.

The body of the text (lines within and ending paragraphs), taken together, makes up about 72% of all words. Its behaviour is fairly steady: lines usually begin with d, s, y, q, or o, and they usually end with y, n, m, l, or r. Some small rules repeat across the text, such as d being followed by a, c, s, or o at the start of words, and l being followed by a or o at the very end of a line. These habits stay in place regardless of which type of body line you look at.

Looking at the manuscript as a whole, the sharpest divide appears right at the first character of each line. Lines that begin a paragraph usually open with a small set of markers such as p, t, k, or f. Lines that stand outside any paragraph, by contrast, almost always start with o. Once the line is underway and you move inside the word, the contrasts between line types are still there, but they are far less pronounced than the jolt that comes at the very first step.

I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.

(15-09-2025, 06:06 PM)MarcoP Wrote: I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.

Let me check later

(15-09-2025, 06:06 PM)MarcoP Wrote: I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.

OK, I had a calculation error. I reposted the thread with my findings. I think they might not be something new, but maybe it is a good idea to have them summarized in one thread.

Thanks for noticing it.

(15-09-2025, 04:38 PM)quimqu Wrote: That initial step is much more dependent on the line type (paragraph-initial, paragraph-internal, paragraph-final, or lines outside paragraphs) than anything that comes after. Once you move further into the word, the distributions of character transitions look much more alike across types.

Forget the words in the first lines of paragraphs. Or even the first N words of a paragraph, even if they extend into the second line (e.g. in pages where the lines are short because of drawings.) Those lines are special for a few known and probably many unknown reasons.

As for other lines: it has been observed recently that the algorithm that a scribe uses to break paragraphs into lines, in any language, will naturally result in the first word of each line being longer than average, and the last 2-3 words being shorter than average.

This fact should have an effect on letter and letter pair statistics, since they are normally dominated by the letters and pairs that appear in the most frequent words. For instance, in English the most common words that begin with "th" are relatively short: "the", "this", "that", "then", "they", "them", "there". If these words are suppressed at the beginning of lines, the frequency of "th" in line-initial position will be reduced.
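The line-initial half of this effect is easy to reproduce with a toy simulation. The sketch below assumes the simplest possible model: word lengths drawn uniformly at random, a fixed line width, and a "scribe" who moves a word to the next line whenever it does not fit. All of these are illustrative assumptions, and the sketch makes no claim about line-final words.

```python
import random
from statistics import mean

def greedy_break(word_lengths, width):
    """Fill each line left to right; a word that does not fit in the
    remaining space starts a new line. Returns a list of lines, each a
    list of word lengths."""
    lines, current, used = [], [], 0
    for n in word_lengths:
        needed = n if not current else n + 1   # one unit for the word space
        if current and used + needed > width:
            lines.append(current)
            current, used = [n], n
        else:
            current.append(n)
            used += needed
    if current:
        lines.append(current)
    return lines

rng = random.Random(0)                          # fixed seed for repeatability
lengths = [rng.randint(1, 9) for _ in range(20000)]
lines = greedy_break(lengths, width=40)

avg_all = mean(lengths)
# Skip the very first line: its first word was not pushed down by a break.
avg_first = mean(line[0] for line in lines[1:])
```

In this model the words that open a line are exactly the ones that failed to fit on the previous line, and a long word is more likely to fail than a short one, so `avg_first` comes out above `avg_all` even though the words themselves carry no positional information.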

But that line-breaking bias is not the only phenomenon that could explain those anomalies. For instance, the Scribe may have been instructed to avoid or prefer inserting a line break between certain pairs of words. However, while such rules are common today in professional typesetting (e.g. avoid breaking between a number and a unit of measure), they probably did not apply to manuscripts from the 1400s.

Another possible factor is that there are many spaces of intermediary width between the usual inter-glyph and inter-word spaces. In the available transcriptions, some of these ambiguous spaces are marked "," some are marked ".", some are just omitted. These ambiguous spaces tend to occur before or after specific letters. Incorrect reading of those spaces may therefore change the statistics of word-initial letters and digraphs.

Just to explain this theory, suppose that every word that contains ol was incorrectly split before the ol. Then l would become much more common as the next-char after word-initial o in words that are not line-initial. Conversely, suppose that every non-line-initial word that started with oa was mistakenly joined to the previous word. That would decrease the frequency of a after word-initial o in words that are not line-initial.
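The first of these two hypothetical errors can be played out on a handful of tokens. The word list below is invented for the demonstration and the splitting rule is the thought experiment from the paragraph above, not a claim about any actual transcription.

```python
from collections import Counter

def second_after_o(words):
    """Share of each second character among words that start with 'o'."""
    counts = Counter(w[1] for w in words if len(w) >= 2 and w[0] == "o")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def split_before_ol(words):
    """Simulate a reading error: break each word before its first
    internal 'ol' (an 'ol' already at the word start is left alone)."""
    out = []
    for w in words:
        i = w.find("ol", 1)
        if i > 0:
            out.extend([w[:i], w[i:]])
        else:
            out.append(w)
    return out

# Invented tokens, for illustration only.
words = ["okaiin", "otedy", "cheol", "qokol", "oaiin", "olkeedy"]
before = second_after_o(words)
after = second_after_o(split_before_ol(words))
```

Here the share of l after word-initial o doubles, from 0.25 to 0.5, purely because of the spurious word breaks.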

People have been puzzling about those statistical anomalies for decades, with little progress. I don't expect that the explanations will jump out by themselves from just looking at the numbers. Progress is more likely to come by thinking of a possible explanation, then collecting statistics that would most clearly confirm or negate that theory. The old and boring "scientific method". That is how two important facts were discovered recently: that (in any language) the scribe's natural line-breaking algorithm distorts the distribution of line-initial and line-final words, and that the somewhat fixed formula of herbal captions greatly distorts the distribution of words, letters, and digraphs depending on the position within the paragraph.

All the best, --jorge

Thank you Jorge, as always, very interesting reading your thoughts.

I did this study because there seems to be a sort of pattern at the beginning of the lines. First lines of paragraphs are evidently different, lines outside paragraphs tend to start with o, and lines within paragraphs tend to start with one of a small group of glyphs: d, s, y, q, o. I think there is almost no misunderstanding with potential existing spaces, as the analysis covers just the initial word and the final word of the line.

Thanks again

(15-09-2025, 10:33 PM)quimqu Wrote: I think there is almost no misunderstanding with potential existing spaces, as the analysis is just the initial word and the final word of the line.

Wrong spaces would not affect the word-initial probabilities at the start of body lines, but would affect the word-initial probabilities for other words of the line. Aren't the latter being used as reference, to declare the former anomalous?

All the best, --jorge

(16-09-2025, 02:50 AM)Jorge_Stolfi Wrote: Wrong spaces would not affect the word-initial probabilities at the start of body lines, but would affect the word-initial probabilities for other words of the line. Aren't the latter being used as reference, to declare the former anomalous?

Yes, you are right. For your information, I treated each ',' as a space. I understand that this could affect the results, since splitting at every ',' increases the size of the corpus by 7.7%.
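How much this choice matters can be checked by tokenizing the same text both ways. The sketch below assumes a transcription in which '.' marks a word space and ',' an uncertain one; the sample line is invented, so the increase it shows is not the 7.7% figure.

```python
import re

def count_words(text, comma_is_space):
    """Count words in a transcription line where '.' is a word space and
    ',' is an uncertain space that is either honoured or ignored."""
    if comma_is_space:
        pattern = r"[.,\s]+"
    else:
        text = text.replace(",", "")   # ignore uncertain spaces: join the parts
        pattern = r"[.\s]+"
    return len([w for w in re.split(pattern, text) if w])

# Invented line, for illustration only.
line = "daiin.ol,kedy.qo,kain.chey"
strict = count_words(line, comma_is_space=False)
loose = count_words(line, comma_is_space=True)
increase = (loose - strict) / strict
```

Running the start and end statistics under both settings shows directly how sensitive the reference distributions are to the uncertain spaces.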

(15-09-2025, 04:38 PM)quimqu Wrote: Body lines (within paragraphs)
These are ordinary running lines within the paragraph (not the first line, nor the last line).
Starts: A mixed, stable recipe: d, s, y, q, o dominate, with c notably lower than in the general word stock.

I don’t like throwing out baseless theories or overcomplicating things, but I can’t help noticing that the characters most common at the beginnings of lines (except the first lines of paragraphs) are also the ones that most resemble Arabic numerals in the Voynich.

d => 8
s => 2
y => 9
q => 4
o => 0

Just a random thought I wanted to leave here.

(15-09-2025, 04:38 PM)quimqu Wrote: The body of the text (lines within and ending paragraphs), taken together, makes up about 72% of all words. Its behaviour is fairly steady: lines usually begin with d, s, y, q, or o, and they usually end with y, n, m, l, or r.

I like to separate the line pattern statistics by scribal section. It shows some important distinctions, e.g. the liking for initial o at Line Start is mainly Scribe 1 for Herbal A. Scribe 2 and Scribe 3 hate it, although Scribe 3 has greater tolerance for the subvariant initial ol. In contrast, Scribe 3 despises line start q while the others either tolerate or like it.

It can vary per folio so there's a case for not stopping even at scribal level, but I guess the sample sizes would become far too low.

Quote: Some small rules repeat across the text, such as d being followed by a, c, s, or o at the start of words, and l being followed by a

Voynichese has a habit of the second glyph in a line start word often being the glyph that is underrepresented as an initial. The best example is probably the ch glyph, which massively underperforms at line start across the main sections as a word-initial glyph but simultaneously overperforms as a word-middle glyph in clusters like initial ych and initial dch. It's conspicuous at line start but I think something not completely dissimilar might also be going on in the top row of paragraphs as well.

(16-09-2025, 06:17 PM)tavie Wrote: I like to separate the line pattern statistics by scribal section.

Thank you, tavie, great work in what you posted in the thread!

I also tried to look for some sort of series or correlations between the first letters of the lines in the vertical direction, because it is really odd to find this kind of pattern. It might be just a visual effect to make it look like a real enciphered text, but in that case, why not use the other glyphs, which could make it even more credible?
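One simple way to test for such a vertical relationship is to compare how often two consecutive lines open with the same glyph against the rate expected if line starts were independent draws from the same frequency table. This is a sketch of that idea only, not the procedure actually used; the sequence of initials below is invented.

```python
from collections import Counter

def vertical_repeat_rate(initials):
    """Return (observed, expected): the share of consecutive line pairs
    that open with the same glyph, and the share expected if each line
    start were drawn independently from the overall frequencies."""
    pairs = list(zip(initials, initials[1:]))
    observed = sum(a == b for a, b in pairs) / len(pairs)
    n = len(initials)
    expected = sum((c / n) ** 2 for c in Counter(initials).values())
    return observed, expected

# Invented sequence of line-initial glyphs, top to bottom on a page.
initials = list("ddsyqoddsy")
observed, expected = vertical_repeat_rate(initials)
```

An observed rate well above or below the expected one, on real data and with enough lines, would suggest that the first glyph of a line depends on the line above it. With samples as small as a single page the difference will mostly be noise.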

Again, thanks for your work!