Why and how the text could be Bavarian

RE: Why and how the text could be Bavarian - JoJo_Jost - 01-05-2026

I will now focus all my further research here in this thread, since it all relates to the Bavarian hypothesis and I am not allowed to mention the Bavarian hypothesis in "my" other threads. Structurally, I would have preferred to keep these results separate, but I accept the reasoning behind this restriction.
---

I'm still working on the cores. The cores contain too little information for longer words, so my cipher produces too many short ones. This led me to revisit a basic assumption: that the written spaces are word boundaries.

If they are not, then a sequence like y-qo, which appears split by a space in the manuscript (...y (space) qo...), could in fact be a continuous trigram "yqo" that the writer is concealing. In other words: there are word-end markers and word-start markers (we know that), and the writer may insert a space according to a rule whenever those occur. The result would be space placement that is rule-based rather than meaning-based. That would explain a lot.

Since word boundaries could be directly related to line length, I tested whether line-initial glyphs correlate with the length of their lines, and whether they might encode information about how many words follow.

[Attached image: anfangsmarker_zeilenlaenge.png]

3910 lines, EvaZ3b transcription, foldout and circular-text folios excluded (these cause distortions because they are not running text).
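A minimal sketch of how mean line length per line-initial glyph could be tabulated. It is not the script behind the figure. The input format (one transcribed line per string), the treatment of ch and sh as single glyphs, and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: "ch" and "sh" each count as one glyph; everything else is one character.
DIGRAPHS = ("ch", "sh")

def glyphs(line):
    """Split a transcribed line into glyphs, ignoring spaces and dot separators."""
    text = line.replace(" ", "").replace(".", "")
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

def mean_length_by_initial(lines):
    """Return {initial glyph: (number of lines, mean glyphs per line)}."""
    lengths = defaultdict(list)
    for line in lines:
        g = glyphs(line)
        if g:
            lengths[g[0]].append(len(g))
    return {k: (len(v), sum(v) / len(v)) for k, v in lengths.items()}

# Toy strings for demonstration only, not manuscript data.
toy = ["pchedy qokeey", "kaiin ol", "pol chey"]
print(mean_length_by_initial(toy))
```

How glyphs are counted (for example whether benched gallows or ii/iii sequences count as one unit) changes the absolute lengths, so the class boundaries should be re-checked under whichever convention is used.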

The four gallows in the LONG class are roughly equidistant: p->f -3.14, f->t -3.42, t->k -2.29. Mean step ~3 glyphs.
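As a quick arithmetic check on "roughly equidistant", using only the three differences quoted above (the variable names are illustrative):

```python
# Differences in mean line length between successive gallows, as quoted in the post.
steps = {"p->f": -3.14, "f->t": -3.42, "t->k": -2.29}

mean_step = sum(steps.values()) / len(steps)
sizes = [abs(v) for v in steps.values()]
spread = max(sizes) - min(sizes)

print(f"mean step: {mean_step:.2f}")       # mean step: -2.95
print(f"spread of steps: {spread:.2f}")    # spread of steps: 1.13
```

The mean step is about 3 glyphs, but the individual steps range from 2.29 to 3.42, so the spacing is only approximately even.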

A similar step (~3.0) shows up between ch and o in the SHORT class.

The MIDDLE class is internally homogeneous (s, d, q, y, l all between 35.28 and 36.02 glyphs - range only 0.74 across five line-initials).

The gallows order by line length is the inverse of their corpus frequency: k > t > p > f by frequency, but p > f > t > k by line length when used as initial.

The rarer the gallows glyph, the longer the line it heads (for whatever reason).

Important: The effect is nearly robust against excluding first-folio lines.

What this implies: I don't claim that the initial glyph encodes an exact word count. Unfortunately, the variation within the class is too great for that. But there is clearly some structural information carried by the initial glyph beyond the obvious LAAFU patterns.

(Has this specific class structure been quantified before? References welcome.)

Then I looked at the end markers; more on that later....

J

RE: Why and how the text could be Bavarian - JoJo_Jost - 01-05-2026

After looking at the line-start markers, I examined the line-end markers.
(3,852 lines with at least 5 glyphs, excluding foldout and circular text folios)

At first, the picture was monotonous but familiar: 5 end glyphs (y, n, m, l, r) account for 89% of all line ends. At the single-glyph level, there is practically no coupling to the next line.
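A minimal sketch of how the share of the most frequent line-final glyphs could be computed. It assumes one transcribed line per string and that the final glyph is the last character of the line; the function name is hypothetical.

```python
from collections import Counter

def final_glyph_share(lines, top=5):
    """Return the `top` most common line-final characters and their combined share."""
    finals = Counter(line.rstrip()[-1] for line in lines if line.strip())
    total = sum(finals.values())
    top_items = finals.most_common(top)
    share = sum(count for _, count in top_items) / total
    return top_items, share

# Toy strings for demonstration only, not manuscript data.
toy = ["daiin chey", "ol shedy", "qokam", "otar", "chol", "okaiin", "dchs"]
items, share = final_glyph_share(toy)
print(items[0], round(share, 3))
```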

It only got interesting at the token level.

When considering the LAST TOKEN of a line as a whole (i.e., up to the last space), three structurally distinct classes emerge.

Class A: Full tokens at the end of a line
Tokens that mostly occur in the middle of the text (>50% internal), but when they appear at the end of a line, show a p-onset effect with factor >= 2x global:

[Attached image: tab_class_a_en.png]

10 tokens with factor >= 2x. Aggregated across all 10: 156 transitions, of which 38 are p-onsets (24.36%) compared to 7.35% globally. Factor 3.31x.

These tokens are actual words in the text that sometimes happen to appear at the end of a line. When this happens, a p-onset follows clearly disproportionately.
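A minimal sketch of the kind of count described here: for each line-final token, how often the following line starts with p, and the resulting factor relative to a global rate. The function names and input format are assumptions. Note that the sketch simply pairs each line with the next one in the list, so it does not distinguish paragraph or page breaks from ordinary line breaks.

```python
from collections import defaultdict

def onset_after_last_token(lines, onset="p"):
    """Map each line-final token to [transitions, next lines starting with `onset`]."""
    stats = defaultdict(lambda: [0, 0])
    for cur, nxt in zip(lines, lines[1:]):
        words = cur.split()
        if not words or not nxt.strip():
            continue
        token = words[-1]
        stats[token][0] += 1
        stats[token][1] += int(nxt.lstrip().startswith(onset))
    return dict(stats)

def factor(hits, transitions, global_rate):
    """Observed onset rate divided by the global onset rate."""
    return (hits / transitions) / global_rate

# Aggregated Class A figures from the post: 38 p-onsets in 156 transitions, 7.35% globally.
print(round(factor(38, 156, 0.0735), 2))  # 3.31

# Toy strings for demonstration only, not manuscript data.
toy = ["a qokeey", "pol b", "c ol", "dar e"]
print(onset_after_last_token(toy))
```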

Class B: End-heavy tokens (>35% occurrence at the end)

[Attached image: tab_class_b_en.png]

Aggregated across all 11 tokens: 320 transitions, of which 9 are p-onsets (2.81%) compared to 7.35% globally. Factor 0.38x. Not a single f-onset across all 320 transitions.

Class C: Short end tokens (2-3 glyphs) with p-onset = 0

[Attached image: tab_class_c_en.png]

Aggregated across all 4 tokens: 141 transitions, of which 0 (!!!) are p-onsets. With a global expectation of 7.35%, approximately 10 p-onsets would be expected. Observed: 0.

The statistical probability that this pattern occurs by chance is practically zero.
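To put a number on "practically zero", here is a minimal sketch under a simple binomial model. It assumes that the 141 transitions are independent, that each has the global p-onset rate of 7.35%, and that the tokens were not chosen because of their zero count. None of these assumptions is established by the data above.

```python
n = 141          # transitions after Class C tokens (from the post)
p = 0.0735       # global p-onset rate (from the post)

expected = n * p              # expected number of p-onsets
p_zero = (1 - p) ** n         # probability of observing none under the model

print(f"expected p-onsets: {expected:.2f}")   # expected p-onsets: 10.36
print(f"P(0 of {n}) = {p_zero:.1e}")
```

Under these assumptions the probability of seeing no p-onsets is on the order of 2 in 100,000. If the transitions are not independent, or the class was defined after looking at the counts, the real figure would be larger.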

---

What follows from these three classes:

There are three structurally distinct functions at the end of a line, with measurably different effects on the start of the next line.

Class A (full words): triggers p strongly (factor 3.31x aggregated, individual tokens up to 7.4x)
Class B (end-heavy tokens): strongly avoids p (factor 0.38x), no f
Class C (short end tokens): completely blocks p (factor 0x, 0 out of 141)

---

Methodologically important in connection with my assumption that spaces are not word boundaries:

This separation is visible ONLY at the token level, i.e., taking spaces into account. At the pure glyph level (without spaces), the findings become blurred. The last 4-5 glyphs of a line look similar, regardless of whether the last token is 2 or 5 glyphs long.

It follows that: spaces carry structural information. Their position is correlated with the sequence properties of the line; it is not random.

---

To clarify:

The VMS has a strict rule across line boundaries.
0% p-onset in 141 transitions is not a "trend"; it is a prohibition.
Such rules are not found in random pseudotext.

Spaces are not random layout markers.
If the token boundary (position of the space) triggers a hard rule, then spaces are structurally relevant.
Exactly what they are (true word boundaries or rule-based positions) remains open; what matters is: they are not without function.

The system operates at the token level, not at the glyph level.
The rule recognizes "ol as a token at the end of a line" - it does not recognize "ol as a bigram." Tokens exist as units in the VMS system.

The concentration of end glyphs into 5 values (89%) fits this pattern.
The system reduces the permissible end positions to a small number.
If spaces were word boundaries, the distribution of end glyphs would have to be broader - in Middle High German, endings are distributed across 10-11 letters.
---
Three hypotheses regarding what the pure end markers (Classes B and C) might be:

Hypothesis 1: Sentence or paragraph end markers
They conclude a block of content. What follows is something different or shorter, not a new long block. This is consistent with the resulting avoidance of p and f.

Hypothesis 2: Line fillers
If there was still space at the end of the line, the author resorted to a standard formula. The actual content continues normally on the next line.

Hypothesis 3: Position-dependent token selection
Certain tokens are systematically selected for line-end positions, independent of preceding content. The blocking of p-onset reflects a structural rule of the writing system, not a semantic constraint.

Do you have any ideas?

RE: Why and how the text could be Bavarian - dashstofsk - 03-05-2026

Are you claiming that 54.5% of the 11 lines that end qokeey (making 6 in total) start with p?

RE: Why and how the text could be Bavarian - tavie - 03-05-2026

I think JoJo means that 54.5% of lines ending with qokeey have a following line that starts with p, rather than that they are in a line starting with p.*

I count 3 line end instances of qokeey in the Balneological section. Two have a p starting the next line.
I count 7 line end instances of qokeey in the Stars section. Four have a p starting the next line (all paragraph breaks).
So that's 6/10, which is roughly what JoJo has.

A few years ago, I looked at how "mid paragraph" line end glyphs may differ from Top Line line end glyphs and Paragraph End line end glyphs. Final y was relatively more common at paragraph end, and for Balneological and Stars this was both in the form of final dy and ey. This was at the expense of final am, which was extremely common at line end but comparatively rarer at paragraph end. This would be consistent with (but not proof of) final am being an abbreviation when there is less right margin space. We covered this briefly in the Chinese Thread, I think.

It's worth exploring if it is an attraction between qokeey and p or if it is an attraction of certain finals or their subvariants to paragraph ends. But that might be hard to do since the numbers are small here.

*Lines starting with p and ending with qokeey would be extremely unexpected because qokeey is unlikely to be a Top Line word (in fact /keey/ is a big driver of the missing /ke/ in the Stars section).

RE: Why and how the text could be Bavarian - tavie - 03-05-2026

(01-05-2026, 10:58 AM) JoJo_Jost Wrote:

To clarify:

The VMS has a strict rule across line boundaries.
0% p-onset in 141 transitions is not a "trend"; it is a prohibition.
Such rules are not found in random pseudotext.

Spaces are not random layout markers.
If the token boundary (position of the space) triggers a hard rule, then spaces are structurally relevant.
Exactly what they are (true word boundaries or rule-based positions) remains open; what matters is: they are not without function.

The system operates at the token level, not at the glyph level.
The rule recognizes "ol as a token at the end of a line" - it does not recognize "ol as a bigram." Tokens exist as units in the VMS system.

The concentration of end glyphs into 5 values (89%) fits this pattern.
The system reduces the permissible end positions to a small number.
If spaces were word boundaries, the distribution of end glyphs would have to be broader - in Middle High German, endings are distributed across 10-11 letters.
---
Three hypotheses regarding what the pure end markers (Classes B and C) might be:

Hypothesis 1: Sentence or paragraph end markers
They conclude a block of content. What follows is something different or shorter, not a new long block. This is consistent with the resulting avoidance of p and f.

Hypothesis 2: Line fillers
If there was still space at the end of the line, the author resorted to a standard formula. The actual content continues normally on the next line.

Hypothesis 3: Position-dependent token selection
Certain tokens are systematically selected for line-end positions, independent of preceding content. The blocking of p-onset reflects a structural rule of the writing system, not a semantic constraint.

Adding also that it took me a while to work out what you meant by p-onset. The text above in particular reads like an AI wrote this and it's not easy to work out exactly what you mean.

RE: Why and how the text could be Bavarian - JoJo_Jost - 03-05-2026

@dashstofsk Tavie's reading was correct -- the question was about qokeey at the end of a line and what appears at the start of the following line.

@tavie First a request, since I can't edit the original post anymore: could you delete it with a note like "I'm withdrawing this post due to a methodological error, see following posts for explanation"? Then people won't have to read it only to realize afterward that it was a waste of time.

Now the correction. I checked by hand in Notepad++ - you were right. It's a real blunder and I have to retract the post.

7 of 11 qokeey end-occurrences sit at paragraph ends. In 6 of those 7, the next paragraph starts with p. So my "qokeey -> p" effect was almost entirely the well-known top-line p effect that you and others have described. I just hadn't noticed - that's clearly on me.

Two things contributed. First, I'd been running my calculations on a cleaned EVA transcription that had stripped the <$> markers, so paragraph ends weren't visible in my data at all. But that's only the surface cause. The deeper one is methodological - it was my mistake; I should have thought of that, of course. It also means the other calculations are wrong too.

On the AI question, as I wrote before: Claude does the calculations - I couldn't run them at this volume otherwise. I cross-check against ChatGPT (deliberately given less context, to catch errors), and only post when both agree. That's worked so far... well, until now.

The flaw here was that both ran on the same cleaned file without <$> markers, so neither could see the issue.
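A minimal sketch of how the transitions could be split once paragraph ends are kept in the data. It assumes each line is a string and that paragraph-final lines still carry the <$> marker at their end; the function name and input format are hypothetical.

```python
PARA_END = "<$>"  # paragraph-end marker, as mentioned in the post

def split_transitions(lines, token, onset="p"):
    """Count next-line onsets after `token`, separately for paragraph-final lines.

    Returns {"within_paragraph": [transitions, hits], "paragraph_end": [transitions, hits]}.
    """
    counts = {"within_paragraph": [0, 0], "paragraph_end": [0, 0]}
    for cur, nxt in zip(lines, lines[1:]):
        is_para_end = cur.rstrip().endswith(PARA_END)
        words = cur.replace(PARA_END, "").split()
        if not words or words[-1] != token:
            continue
        key = "paragraph_end" if is_para_end else "within_paragraph"
        counts[key][0] += 1
        counts[key][1] += int(nxt.lstrip().startswith(onset))
    return counts

# Toy strings for demonstration only, not manuscript data.
toy = [
    "daiin qokeey<$>", "pol chedy",
    "shedy qokeey", "okaiin ol",
    "ol qokeey<$>", "pchedy",
]
print(split_transitions(toy, "qokeey"))
# {'within_paragraph': [1, 0], 'paragraph_end': [2, 2]}
```

Only the within_paragraph counts speak to a line-to-line effect; the paragraph_end counts mostly reflect what paragraph-initial lines look like.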

I also let Claude do a final language pass on my drafts because writing statistical reasoning in English isn't my strong suit :D Some AI phrasing inevitably ends up in the text - you spotted that, and it's a fair call.

RE: Why and how the text could be Bavarian - dashstofsk - 03-05-2026

(03-05-2026, 10:59 AM) tavie Wrote: have a following line that starts with p

So, the 11 lines that end qokeey are followed by 6 lines that start p. That does seem correct.

But the same does not hold for okain. According to the table 36.4% of the 12 lines that end okain should be followed by lines that start p. But 36.4% of 12 comes to 4.368. That is not a whole number. So the numbers are somehow wrong.

RE: Why and how the text could be Bavarian - dashstofsk - 03-05-2026

(01-05-2026, 08:00 AM) JoJo_Jost Wrote:

The high value for p can be explained simply by the fact that p at the start of a line is very much a language B phenomenon. Moreover, language B pages (typically quires 13 and 20) also have longer lines. So it should not surprise you to see a correlation between frequency of p start lines and line length.

Because of the language and writing and text differences between the sections I believe you would get more meaningful results if you were to perform your analysis separately for each of the main work groups:

Herbal A1
Herbal B2
Pharma A1
Quire 13
Quire 20
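A minimal sketch of such a per-group breakdown. It assumes each line comes with a section label taken from some external folio-to-section mapping (not shown), and it measures length as character count without spaces, which is a simplification. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def length_by_section_and_initial(records):
    """records: iterable of (section, line). Returns {section: {initial char: mean length}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for section, line in records:
        text = line.replace(" ", "")
        if text:
            # Simplification: length in characters, so "ch" counts as two.
            buckets[section][text[0]].append(len(text))
    return {
        section: {initial: mean(lengths) for initial, lengths in by_initial.items()}
        for section, by_initial in buckets.items()
    }

# Toy records for demonstration only, not manuscript data.
toy = [
    ("Quire 13", "pol chedy"),
    ("Quire 13", "pchedy qokeey ol"),
    ("Herbal A1", "daiin chol"),
]
print(length_by_section_and_initial(toy))
```

If the ordering of initials by line length holds inside each group, it is less likely to be a by-product of mixing sections with different typical line lengths.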

RE: Why and how the text could be Bavarian - JoJo_Jost - 03-05-2026

@dashstofsk

I’ve done a lot of other tests as well, with many more results, but I think that would take us too far afield here. But you’re right, of course - there are clear differences both across the different sections and in the language (A/B). Though we have to be careful here. It’s not exactly simple and can get very complex fast.

But in this thread, I’m more interested in the question of what the core elements are - in other words, the bigger picture. I’m looking for an anchor to crack the text. So the question is: why are there line-start markers, why line-end markers, and perhaps even paragraph-end markers (I still need to test the latter more closely)?

What is the background behind this structure? How can it be explained without having to resort to anachronistic ciphers?

I now have an initial idea of how this could be represented using 15th-century methods, but I’m still in the middle of testing that - yet the problem remains that we really know nothing about the underlying language, if indeed there is one.

RE: Why and how the text could be Bavarian - tavie - 03-05-2026

I'll get back to you on the rest - but I'd like to second dashstofsk's question on the p-start percentages. They do look weird. Can you explain where the 36.4% is coming from for okain's 12 line end instances, or the 22.2% for otedy's 10 instances?
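One way to narrow this down is to list which whole-number counts would round to a reported percentage. A minimal sketch, with a hypothetical function name:

```python
def consistent_counts(pct, max_n=20, decimals=1):
    """All (hits, n) with n <= max_n whose percentage rounds to `pct`."""
    return [
        (k, n)
        for n in range(1, max_n + 1)
        for k in range(n + 1)
        if round(100 * k / n, decimals) == pct
    ]

print(consistent_counts(36.4, max_n=12))  # [(4, 11)]
print(consistent_counts(22.2, max_n=10))  # [(2, 9)]
print(consistent_counts(54.5, max_n=12))  # [(6, 11)]
```

So 36.4% matches 4 out of 11 and 22.2% matches 2 out of 9, but neither matches any count out of 12 or 10 respectively. That would fit a denominator one smaller than the stated number of instances, though only the original tables can confirm what was actually counted.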