Analysis of patterns at the very start and end of lines, per line type

quimqu > 6 hours ago
When looking at the Voynich text it helps to separate different kinds of lines instead of treating the whole corpus as one block. A line at the start of a paragraph does not behave like a line in the middle, and a label line does not look like a body line. By checking the very first and last characters of each line, and comparing them with what is normal for the overall corpus, some clear patterns appear. Paragraph beginnings have their own favorite starters, body lines are more balanced, and lines outside of paragraphs follow yet another rule. What follows is a summary of these differences, line type by line type.
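The tallying step described above can be sketched roughly as follows. This is only a minimal illustration, not the code behind the numbers in this post: the input format (pairs of a line-type tag and a transcribed line with `.` as word separator), the tag names, and the helper names `edge_profiles` and `shares` are all assumptions, and the toy lines are invented.

```python
from collections import Counter, defaultdict

def edge_profiles(lines):
    """Count the first and last character of each line, grouped by line type."""
    starts = defaultdict(Counter)
    ends = defaultdict(Counter)
    for line_type, text in lines:
        glyphs = text.replace(".", "").strip()  # drop the assumed word separator
        if not glyphs:
            continue
        starts[line_type][glyphs[0]] += 1
        ends[line_type][glyphs[-1]] += 1
    return starts, ends

def shares(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {ch: n / total for ch, n in counter.items()}

# Toy input, invented for illustration only (not real transcription data).
toy_lines = [
    ("par_initial", "pchedy.okal.daiin"),
    ("body", "daiin.chedy.qokam"),
    ("body", "shedy.okaiin.dal"),
    ("label", "otedy"),
]
starts, ends = edge_profiles(toy_lines)
print(shares(starts["body"]))
```

The same `shares` helper can be applied to the whole corpus to get the global baseline that each line type is compared against.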

Labels
These short label lines behave differently from running text.
- Starts: They very often begin with o (about 50%). Against their own local word mix that is only mildly unusual, but against the full corpus it is a strong tilt toward o.
- Ends: They tend to end in y, with smaller bumps for m and d.
- Summary: Labels have an o- opening habit and a fairly y-heavy ending, but otherwise don’t diverge wildly from their own local vocabulary.
Paragraph-initial lines
This is where the strongest cue lives.
- Starts: Paragraph-initial lines strongly favor p, t, k, and f. Those openings are over-represented both against their own local word stock and against the whole corpus. By contrast, starts like o, q, c, s are under-represented here.
- Ends: Endings don’t separate these lines much, with the clear exception that m is noticeably more common at the very end.
- Distance: The divergence at the first character is high (JS_init ≈ 0.38 vs. local, 0.52 vs. global): the very first character of these lines behaves much less like the rest of the text.
- Summary: If a line starts with p, t, k, or f, odds are it’s the first line of a paragraph. This entry code is the sharpest positional signal in the manuscript.
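For readers unfamiliar with the JS figures quoted in the Distance bullets, here is a minimal sketch of a Jensen-Shannon divergence between two first-character distributions. It assumes base-2 logarithms (so values fall between 0 and 1); the post does not say which base, or whether JS_init is the divergence or its square root, so treat this as one plausible reading. The function names are made up for the example.

```python
from collections import Counter
from math import log2

def distribution(chars):
    """Relative frequency of each character in an iterable of characters."""
    counts = Counter(chars)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from m; zero-probability terms contribute 0
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy strings of first characters, invented for illustration only.
line_initial = distribution("ptkfpt")
word_initial = distribution("odqscy")
print(js_divergence(line_initial, word_initial))
```

Identical distributions give 0 and distributions with no characters in common give 1, which is a useful sanity check before running it on real counts.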
Body lines (within paragraphs)
These are ordinary running lines within the paragraph (not the first line, nor the last line).
- Starts: A mixed, stable recipe: d, s, y, q, o dominate, with c notably lower than in the general word stock.
- Ends: y is slightly lower than global norms, while m at line end is clearly higher than we’d expect from the local body vocabulary.
- Distance: Starts diverge modestly (JS_init ~ 0.12); ends are low (~0.066).
- Summary: This is the baseline flow of the text: predictable starts and the familiar y, n, l, r, m mix at the end, with m a recurring tail.
Last lines of paragraphs
These are the final lines of paragraphs; their profile is broadly similar to that of the body lines.
- Starts: Again y, s, d, o, q; c is much lower than its local/global baselines.
- Ends: y and n are higher; l and r are lower. Log-odds confirm that m, y, g are over-represented at the very end, while l, r, o are under-represented.
- Distance: Modest at the start (JS_init ~ 0.14), low at the end (~0.048).
- Summary: Another "normal text" profile; very close to the body set (the intra-paragraph lines), with slightly stronger y at both ends.
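The log-odds comparison mentioned for line-final characters can be sketched like this. It is only a guess at the general shape of such a calculation: the additive smoothing constant `alpha` and the function name are assumptions, and the counts in the example call are invented.

```python
from math import log

def log_odds(count_pos, total_pos, count_base, total_base, alpha=0.5):
    """Log-odds ratio of a character at a position versus a baseline.

    Positive values mean over-represented at the position, negative values
    mean under-represented. alpha is a small additive smoothing term
    (an assumption of this sketch) that keeps zero counts finite.
    """
    p = (count_pos + alpha) / (total_pos + 2 * alpha)
    q = (count_base + alpha) / (total_base + 2 * alpha)
    return log(p / (1 - p)) - log(q / (1 - q))

# Invented counts: a character ending 30 of 100 lines,
# versus ending 100 of 1000 words in the baseline.
print(log_odds(30, 100, 100, 1000))
```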
Non-paragraph lines (outside any paragraph)
Standalone or detached lines show a different entry behavior.
- Starts: A very strong bias to o- (about 66% of line starts). Against the whole corpus this is striking, though against their own local word mix it’s milder.
- Ends: y is high (≈ 47%), and m is also above baseline.
- Distance: Start vs global is notable (JS_init ≈ 0.21), while start vs. local and ends are relatively low.
- Summary: These lines have an o-anchor at the very start; the second character contributes far less than the first.
The body of the text (lines within and ending paragraphs), taken together, makes up about 72% of all words. Its behaviour is fairly steady: lines usually begin with d, s, y, q, or o, and they usually end with y, n, m, l, or r. Some small rules repeat across the text, such as d being followed by a, c, s, or o at the start of words, and l being followed by a or o at the very end of a line. These habits stay in place regardless of which type of body line you look at.

Looking at the manuscript as a whole, the sharpest divide appears right at the first character of each line. Lines that begin a paragraph usually open with a small set of markers such as p, t, k, or f. Lines that stand outside any paragraph, by contrast, almost always start with o. Once the line is underway and you move inside the word, the contrasts between line types are still there, but they are far less pronounced than the jolt that comes at the very first step.
RE: Distinct patterns at the very start of lines in the VMS

MarcoP > 5 hours ago

I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.

RE: Distinct patterns at the very start of lines in the VMS

quimqu > 4 hours ago

(5 hours ago) MarcoP wrote: I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.


Let me check later
RE: Distinct patterns at the very start of lines in the VMS

quimqu > 1 hour ago

(5 hours ago) MarcoP wrote: I am not sure I understand. Does the table say that 42% of lines starting with o- start with oa-? If so, that doesn't sound right.


OK, I had a calculation error. I reposted the thread with my findings. I think they might not be something new, but maybe it is a good idea to have them summarized in one thread.

Thanks for noticing it.
RE: Distinct patterns at the very start of lines in the VMS

Jorge_Stolfi > 53 minutes ago

(6 hours ago) quimqu wrote: That initial step is much more dependent on the line type (paragraph-initial, paragraph-internal, paragraph-final, or lines outside paragraphs) than anything that comes after. Once you move further into the word, the distributions of character transitions look much more alike across types.

Forget the words in the first lines of paragraphs. Or even the first N words of a paragraph, even if they extend into the second line (e.g. in pages where the lines are short because of drawings.) Those lines are special for a few known and probably many unknown reasons.

As for other lines: it has been observed recently that the algorithm that a scribe uses to break paragraphs into lines, in any language, will naturally result in the first word of each line being longer than average, and the last 2-3 words being shorter than average.

This fact should have an effect on letter and letter pair statistics, since they are normally dominated by the letters and pairs that appear in the most frequent words. For instance, in English the most common words that begin with "th" are relatively short: "the", "this", "that", "then", "they", "them", "there". If these words are suppressed at the beginning of lines, the frequency of "th" in line-initial position will be reduced.
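The line-initial half of that effect is easy to reproduce with a toy simulation. The sketch below assumes a simple greedy rule (keep adding words until the next one does not fit, then start a new line), random word lengths from 1 to 10, and a line width of 40 characters; none of these choices come from the manuscript, and the function name is made up.

```python
import random

def break_lines(word_lengths, width):
    """Greedy line breaking: start a new line when the next word does not fit."""
    lines, current, used = [], [], 0
    for n in word_lengths:
        # Width needed if this word joins the current line (one space between words)
        needed = n if not current else used + 1 + n
        if current and needed > width:
            lines.append(current)
            current, used = [n], n
        else:
            current.append(n)
            used = needed
    if current:
        lines.append(current)
    return lines

rng = random.Random(0)  # fixed seed so the run is repeatable
lengths = [rng.randint(1, 10) for _ in range(20000)]
lines = break_lines(lengths, width=40)

mean_all = sum(lengths) / len(lengths)
mean_first = sum(line[0] for line in lines) / len(lines)
print(f"mean word length overall:      {mean_all:.2f}")
print(f"mean length of line-initial:   {mean_first:.2f}")
```

Longer words are more likely to be the one that fails to fit, so they are pushed to the start of the next line more often than short words. The sketch does not try to check the claim about the last 2-3 words of a line.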

But that line-breaking bias is not the only phenomenon that could explain those anomalies. For instance, the Scribe may have been instructed to avoid or prefer inserting a line break between certain pairs of words. (However, while such rules are common today in professional typesetting, e.g. avoid breaking between a number and a unit of measure, they probably did not apply to manuscripts from the 1400s.)

Another possible factor is that there are many spaces of intermediary width between the usual inter-glyph and inter-word spaces. In the available transcriptions, some of these ambiguous spaces are marked "," some are marked ".", some are just omitted. These ambiguous spaces tend to occur before or after specific letters. Incorrect reading of those spaces may therefore change the statistics of word-initial letters and digraphs.

Just to explain this theory, suppose that every word that contains ol was incorrectly split before the ol. Then l would become much more common as the next-char after word-initial o in words that are not line-initial. Conversely, suppose that every non-line-initial word that started with oa was mistakenly joined to the previous word. That would decrease the frequency of a after word-initial o in words that are not line-initial.
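The first half of that thought experiment can be made concrete with a few lines of code. This is a sketch on an invented word list, with hypothetical helper names; it only shows the direction of the effect, not its size in any real transcription.

```python
from collections import Counter

def split_before_ol(words):
    """Simulate a misreading that inserts a word break before every 'ol'."""
    out = []
    for w in words:
        parts = w.replace("ol", ".ol").split(".")
        out.extend(p for p in parts if p)  # drop empty pieces
    return out

def next_after_initial_o(words):
    """Count the character that follows a word-initial o."""
    return Counter(w[1] for w in words if w.startswith("o") and len(w) > 1)

# Toy word list, invented for illustration only.
words = ["chol", "okol", "otedy", "daiin"]
before = next_after_initial_o(words)
after = next_after_initial_o(split_before_ol(words))
print(before)
print(after)
```

In the toy list no word starts with ol before the split, and two do afterwards, so l goes from absent to the most common follower of word-initial o.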

People have been puzzling about those statistical anomalies for decades, with little progress. I don't expect that the explanations will jump out by themselves from just looking at the numbers. Progress is more likely to come by thinking of a possible explanation, then collecting statistics that would most clearly confirm or negate that theory. The old and boring "scientific method". That is how two important facts were discovered recently: that (in any language) the scribe's natural line-breaking algorithm distorts the distribution of line-initial and line-final words, and that the somewhat fixed formula of herbal captions greatly distorts the distribution of words, letters, and digraphs depending on the position within the paragraph.

All the best, --jorge
RE: Analysis of patterns at the very start and end of lines, per line type

quimqu > 44 minutes ago

Thank you Jorge, as always, very interesting reading your thoughts.

I did this study because there seems to be a sort of pattern at the beginning of the lines. First lines of paragraphs are evidently different, lines outside paragraphs tend to start with o, and lines within paragraphs tend to start with one of a small set of glyphs: d, s, y, q, o. I think there is almost no room for misunderstanding from potentially ambiguous spaces, as the analysis only looks at the initial word and the final word of each line.

Thanks again