The Voynich Ninja

Full Version: Why and how the text could be Bavarian
I will now focus all my further research here in this thread, since it all relates to the Bavarian hypothesis and I am not allowed to mention the Bavarian hypothesis in "my" other threads. Structurally, I would have preferred to keep these results separate, but I accept the reasoning behind this restriction.
---

I'm still working on the cores. The cores contain too little information for longer words, so my cipher produces too many short ones. This led me to revisit a basic assumption: that the written spaces are word boundaries.

If they are not, then a sequence like y-qo, which appears split by a space in the manuscript (...y (Space) qo...), could in fact be a continuous trigram "yqo" that the writer is concealing. In other words: there are word-end markers and word-start markers (we know that), and the writer may insert a space, according to a rule, wherever those occur. The result would be space placement that is rule-based rather than meaning-based. That would explain a lot.

Since word boundaries could be directly related to line length, I tested whether line-initial glyphs correlate with the length of their lines, and whether they might encode information about how many words follow.

[attachment=15379]

3910 lines, EvaZ3b transcription, foldout and circular-text folios excluded (these cause distortions because they are not running text).
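The grouping itself is simple to reproduce. Here is a minimal sketch, with made-up toy lines rather than the actual EvaZ3b data; note that taking `line[0]` treats each character as a glyph, which is a simplification (multi-character EVA glyphs such as ch would need a proper glyph tokenizer).

```python
from collections import defaultdict

def mean_length_by_initial(lines):
    """Group lines by their first glyph and return the mean line length per glyph."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            # Simplification: first character = first glyph (ignores EVA digraphs like 'ch')
            buckets[line[0]].append(len(line))
    return {g: sum(v) / len(v) for g, v in buckets.items()}

# Illustrative toy lines, not real manuscript data
lines = ["pchedy.qokeedy.shedy", "tol.shedy", "kar.ol", "pochol.daiin.okeedy"]
print(mean_length_by_initial(lines))
```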

The four gallows in the LONG class are roughly equidistant: p->f: 3.14, f->t: 3.42, t->k: 2.29 glyphs. Mean step ~3 glyphs.

A similar step (~3.0) shows up between ch and o in the SHORT class.

The MIDDLE class is internally homogeneous (s, d, q, y, l all between 35.28 and 36.02 glyphs - range only 0.74 across five line-initials).

The gallows order by line length is the inverse of their corpus frequency: k > t > p > f by frequency, but p > f > t > k by line length when used as initial.

The rarer the gallows glyph, the longer the line it heads (for whatever reason).
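The "inverse order" claim can be quantified with a rank correlation. Using only the two orderings stated above (not the underlying counts), Spearman's rho comes out at -0.8: strongly negative, though strictly speaking not a perfect inversion, since p and f swap places between the two orders.

```python
# Rank orders as stated above:
# corpus frequency: k > t > p > f ; line length (as line-initial): p > f > t > k
freq_rank = {"k": 1, "t": 2, "p": 3, "f": 4}
len_rank  = {"p": 1, "f": 2, "t": 3, "k": 4}

# Spearman's rho for two rankings of the same n items
n = len(freq_rank)
d2 = sum((freq_rank[g] - len_rank[g]) ** 2 for g in freq_rank)
rho = 1 - 6 * d2 / (n * (n**2 - 1))
print(rho)  # -0.8
```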

Important: the effect is largely robust to excluding first-folio lines.

What this implies: I don't claim that the initial glyph encodes an exact word count. Unfortunately, the variation within the class is too great for that. But there is clearly some structural information carried by the initial glyph beyond the obvious LAAFU patterns.

(Has this specific class structure been quantified before? References welcome.)

Then I looked at the end markers; more on that later...

J
After looking at the line-start markers, I examined the line-end markers. 
(3,852 lines with at least 5 glyphs, excluding foldout and circular text folios)

At first, the picture was monotonous but familiar: 5 end glyphs (y, n, m, l, r) account for 89% of all line ends. At the single-glyph level, there is practically no coupling to the next line.

It only got interesting at the token level.

When considering the LAST TOKEN of a line as a whole (i.e., everything after the last space), three structurally distinct classes emerge.

Class A: Full tokens at the end of a line
Tokens that mostly occur in the middle of the text (>50% internal) but that, when they appear at the end of a line, show a p-onset effect with a factor >= 2x the global rate:

[attachment=15385]

10 tokens with factor >= 2x. Aggregated across all 10: 156 transitions, of which 38 are p-onsets (24.36%) compared to 7.35% globally. Factor 3.31x.

These tokens are actual words in the text that sometimes happen to appear at the end of a line. When they do, a p-onset follows at a clearly disproportionate rate.
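The factor computation can be sketched as follows. The token lists come from the attachments, so the transition data here is illustrative; only the aggregate numbers (156 transitions, 38 p-onsets, 7.35% global rate) are from the post.

```python
def p_onset_factor(transitions, global_p_rate):
    """transitions: list of first glyphs of the line following the token class.
    Returns (observed p-onset rate, factor vs. the global expectation)."""
    n = len(transitions)
    p_hits = sum(1 for g in transitions if g == "p")
    rate = p_hits / n
    return rate, rate / global_p_rate

# Class A aggregate: 156 transitions, 38 of them p-onsets; 7.35% global rate.
# The non-p glyphs below are placeholders - only the counts matter.
rate, factor = p_onset_factor(["p"] * 38 + ["q"] * 118, 0.0735)
print(f"{rate:.2%}, factor {factor:.2f}x")  # 24.36%, factor 3.31x
```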

Class B: End-heavy tokens (>35% occurrence at the end)

[attachment=15383]

Aggregated across all 11 tokens: 320 transitions, of which 9 are p-onsets (2.81%) compared to 7.35% globally. Factor 0.38x. Not a single f-onset across all 320 transitions.

Class C: Short end tokens (2-3 glyphs) with p-onset = 0

[attachment=15384]

Aggregated across all 4 tokens: 141 transitions, of which 0 (!) are p-onsets. With a global expectation of 7.35%, approximately 10 p-onsets would be expected. Observed: 0.

The statistical probability that this pattern occurs by chance is practically zero.
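"Practically zero" can be made concrete. Under a simple binomial model with the global p-onset rate (which assumes the 141 transitions are independent - a simplification), the probability of observing zero p-onsets is on the order of 10^-5:

```python
p = 0.0735   # global p-onset rate
n = 141      # transitions in Class C

prob_zero = (1 - p) ** n   # binomial P(X = 0)
expected  = n * p          # expected p-onset count under the global rate
print(f"expected ~{expected:.1f} p-onsets, P(0 observed) = {prob_zero:.2e}")
```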

---

What follows from these three classes:

There are three structurally distinct functions at the end of a line, with measurably different effects on the start of the next line.

Class A (full words): triggers p strongly (factor 3.31x aggregated, individual tokens up to 7.4x)
Class B (end-heavy tokens): strongly avoids p (factor 0.38x), no f
Class C (short end tokens): completely blocks p (factor 0x, 0 out of 141)

---

Methodologically important in connection with my assumption that spaces are not word boundaries:

This separation is visible ONLY at the token level, i.e., taking spaces into account. At the pure glyph level (without spaces), the findings become blurred. The last 4-5 glyphs of a line look similar, regardless of whether the last token is 2 or 5 glyphs long.

It follows that spaces carry structural information. Their position is correlated with the sequence properties of the line; it is not random.
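The token-level vs. glyph-level distinction can be illustrated with two views of the same line end. Tokenizing at spaces preserves the last-token class; a fixed glyph window does not. The lines below are toy examples, assuming the common EVA convention of '.' as the word separator.

```python
def last_token(line):
    """Token-level view: everything after the last space ('.' in EVA)."""
    return line.rstrip().split(".")[-1]

def last_glyphs(line, n=4):
    """Glyph-level view: the last n characters, spaces removed."""
    return line.replace(".", "")[-n:]

line_a = "qokeedy.shedy.ol"   # ends in the short token 'ol'
line_b = "qokeedy.shol"       # ends in the longer token 'shol'
print(last_token(line_a), last_token(line_b))    # ol shol  - classes stay distinct
print(last_glyphs(line_a), last_glyphs(line_b))  # dyol shol - boundary info lost in line_a
```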

---

To clarify:

The VMS has a strict rule across line boundaries.
0% p-onset in 141 transitions is not a "trend"; it is a prohibition.
Such rules are not found in random pseudotext.

Spaces are not random layout markers.
If the token boundary (position of the space) triggers a hard rule, then spaces are structurally relevant.
Exactly what they are (true word boundaries or rule-based positions) remains open; what matters is: they are not without function.

The system operates at the token level, not at the glyph level.
The rule recognizes "ol as a token at the end of a line" - it does not recognize "ol as a bigram." Tokens exist as units in the VMS system.

The concentration of end glyphs into 5 values (89%) fits this pattern.
The system reduces the permissible end positions to a small number.
If spaces were word boundaries, the distribution of end glyphs would have to be broader - in Middle High German, endings are distributed across 10-11 letters.
---
Three hypotheses regarding what the pure end markers (Classes B and C) might be:

Hypothesis 1: Sentence or paragraph end markers
They conclude a block of content. What follows is something different or shorter, not a new long block. This is consistent with the observed avoidance of p and f.

Hypothesis 2: Line fillers
If there was still space at the end of the line, the author resorted to a standard formula. The actual content continues normally on the next line.

Hypothesis 3: Position-dependent token selection
Certain tokens are systematically selected for line-end positions, independent of preceding content. The blocking of p-onset reflects a structural rule of the writing system, not a semantic constraint.

Do you have any ideas?