The Voynich Ninja

Pages: 1 2 3 4 5

The first glyph of every line in the VMS – a statistical anomaly, or perhaps even part of the cipher?

This is a “by-product” of my statistical investigations in connection with the Bavarian hypothesis. I think this might interest others too, and I’m curious to see if you can verify it, or if it’s perhaps already known (I haven’t read about it yet). I’m posting it in the Text Analysis section because that’s where it belongs.

During the many analyses I’ve carried out, I’ve noticed that the first glyph in a line doesn’t match the translations. The effect is visible across the entire corpus and statistically distinguishes position 1 clearly from all others.
I already knew / suspected this regarding the first glyph on the page; most of you are probably aware of that. But the effect is more widespread.

A brief definition:

Anomaly rate: the proportion of tokens containing at least one internal bigram with a negative PMI – i.e. a pair of characters that occurs together less frequently in the manuscript than would be expected under independent distribution.
(PMI: Pointwise Mutual Information measures how surprising it is that two characters appear next to each other.)
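For anyone who wants to check the definition, here is a minimal sketch of how such an anomaly rate could be computed. It assumes the usual formula PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with P(a) estimated from all glyph occurrences and P(a, b) from all adjacent pairs inside tokens. The function names and the estimation details are my assumptions, not necessarily what was used for the figures in this post.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(a, b): PMI} for every adjacent glyph pair seen inside tokens.

    Each token is a sequence of glyph units (a string if one character is one
    unit, or a tuple such as ("ch", "o", "l") for multi-character units).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tok in tokens:
        unigrams.update(tok)
        bigrams.update(zip(tok, tok[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def is_anomalous(token, pmi):
    """True if the token contains at least one internal bigram with PMI < 0."""
    # Pairs missing from the table are treated as neutral (assumption).
    return any(pmi.get(pair, 0.0) < 0 for pair in zip(token, token[1:]))


def anomaly_rate(tokens, pmi):
    """Share of tokens that contain at least one negative-PMI bigram."""
    if not tokens:
        return 0.0
    return sum(is_anomalous(t, pmi) for t in tokens) / len(tokens)
```

The PMI table would be built once from the whole corpus and then applied to the tokens at each line position.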

Note

Only lines with at least three tokens are considered (to exclude labels and other elements). Interestingly, the result is actually slightly better for lines with more than three letters (letters, not tokens).

1. For each token at line positions 1 to 5, the average length and the anomaly rate are calculated:

[attachment=15169]

If we group all tokens from position 2 onwards (32,425 words), they have an average length of 4.49 and an anomaly rate of 28.2%. The first word of each line deviates significantly from this, with 4.91 and 41.5%.

[attachment=15170]

On average, the first token of each line is 0.6 units longer than the token at position 2, and its internal strings are statistically anomalous roughly twice as often as at any other position.
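The per-position comparison in step 1 could be sketched roughly as follows. This is only an illustration: `position_stats` is a hypothetical helper, lines are assumed to be lists of tokens, and `is_anomalous` is whatever predicate implements the anomaly definition above.

```python
from collections import defaultdict


def position_stats(lines, is_anomalous, max_pos=5, min_tokens=3):
    """Return {position: (mean token length, anomaly rate)} for positions 1..max_pos.

    Lines with fewer than min_tokens tokens are skipped (labels and similar).
    """
    lengths = defaultdict(list)
    flags = defaultdict(list)
    for line in lines:
        if len(line) < min_tokens:
            continue
        for pos, tok in enumerate(line[:max_pos], start=1):
            lengths[pos].append(len(tok))
            flags[pos].append(bool(is_anomalous(tok)))
    return {
        pos: (sum(lengths[pos]) / len(lengths[pos]),
              sum(flags[pos]) / len(flags[pos]))
        for pos in sorted(lengths)
    }
```

Token length here is counted in glyph units, so the result depends on how the transliteration is split into units.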

2. I was then interested in the question of what happens when the first letter is removed.

The result: if one removes the first unit of each token, the anomaly rate changes to varying degrees depending on the line position:

[attachment=15171]

At position 1, removing the first unit reduces the anomaly rate by 20 percentage points; at all other positions, by only 7 to 8.

The effect at position 1 is about 2.5 times as strong.

The conspicuous sequence is therefore at the beginning of the token. If one removes precisely the first unit, the remaining word behaves statistically like a normal Voynich word.
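The strip-the-first-unit test in step 2 can be reproduced along these lines. A minimal sketch, assuming tokens are sliceable sequences of glyph units and `is_anomalous` is the anomaly predicate; the helper names are hypothetical.

```python
def rate(tokens, is_anomalous):
    """Share of tokens flagged by the predicate (0.0 for an empty list)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(bool(is_anomalous(t)) for t in tokens) / len(tokens)


def strip_first_effect(tokens, is_anomalous):
    """Return (rate before, rate after, drop in percentage points)
    when the first unit of every token is removed."""
    before = rate(tokens, is_anomalous)
    after = rate([t[1:] for t in tokens], is_anomalous)
    return before, after, 100 * (before - after)
```

Running this separately on the tokens of each line position would give the per-position drops that are compared above.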

3. "The Magnificent Seven"

There are mainly seven glyphs in the first position.

[attachment=15172]

Total: 85.5 %

The remaining 14.5 per cent are distributed among rarer units such as ch, sh, k and others.

However, this is not the normal frequency. The overall frequency of these seven units in the corpus is significantly lower.
E.g.: p accounts for 0.84 per cent of all units in the VMS, but 7.2 per cent of all first glyphs in a line – an overrepresentation by a factor of 8.5. For s, the factor is 5.7. The distribution at position 1 therefore does not follow the general frequency of the corpus.
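The factor is simply the share at position 1 divided by the share in the whole corpus: for p, 7.2 / 0.84 is about 8.57, which matches the quoted 8.5 if the inputs were rounded. A small sketch of how the factors could be computed for all units at once, assuming lines are lists of tokens and tokens are sequences of glyph units (the function name is hypothetical):

```python
from collections import Counter


def overrepresentation(lines):
    """Return {unit: share among line-initial units / share among all units}."""
    all_units = Counter(u for line in lines for tok in line for u in tok)
    first_units = Counter(line[0][0] for line in lines if line and line[0])
    n_all = sum(all_units.values())
    n_first = sum(first_units.values())
    return {
        unit: (count / n_first) / (all_units[unit] / n_all)
        for unit, count in first_units.items()
    }
```

A value above 1 means the unit is more frequent at the start of a line than in the corpus overall.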

4. I was also interested in whether there are section-specific distributions:

Result: The seven units are unevenly distributed across the sections. Each figure indicates the proportion of lines with this initial marker that appear in the respective section:

Herbal Astro Balneo Cosmo Pharma Recipes

[attachment=15173]

Each unit has its own section distribution. "o" is concentrated in Astro, "q" in Herbal and Balneo, "p" in Recipes (f103 to f116), while p is practically absent in Astro.

If the seven markers were distributed randomly across the sections, deviations as large as those measured would be statistically extremely unlikely. A chi-square test confirms what is already evident in the table: each marker has its own section preference.
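For those who want to rerun the test on their own counts, here is a minimal sketch of the chi-square statistic for a marker-by-section table of raw line counts (not percentages). It is plain Python and only returns the statistic and the degrees of freedom; for a table of 7 markers by 6 sections that would be 6 * 5 = 30 degrees of freedom. The function name is mine, and it assumes every row and every column has a non-zero total.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows, one row per marker, one column per section,
    containing observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if marker and section were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

If SciPy is available, `scipy.stats.chi2_contingency` should give the same statistic together with a p-value.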

5. Then I thought to myself, hmm, perhaps the effect is distorted by the first line.

In 72 per cent of cases, the first line of a page begins with a gallows glyph (t, p, k, f). That is a known fact. However, if one removes these first 227 page lines from the analysis, the key figures hardly change:

[attachment=15174]

The key finding therefore stems from the normal text lines, not from the potentially ‘decorative’ page beginnings.
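The control in step 5 amounts to dropping the first line of every page before recomputing the figures. A sketch under the assumption that the transliteration is available as (folio, line number, tokens) records in manuscript order; the record layout and the function name are hypothetical.

```python
def drop_page_initial_lines(records):
    """Remove the first line seen for each folio and keep the rest.

    records: iterable of (folio, line_number, tokens) in manuscript order.
    """
    seen = set()
    kept = []
    for folio, line_no, tokens in records:
        if folio not in seen:
            # First line encountered for this folio: skip it.
            seen.add(folio)
            continue
        kept.append((folio, line_no, tokens))
    return kept
```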

6. Summary

The first glyph of every line in the Voynich Manuscript exhibits statistical behaviour that differs from all other positions:
a) The first token is longer and statistically more anomalous
b) The effect is limited to the first glyph
c) 85 per cent of the lines begin with one of only seven units
d) The choice of this unit depends significantly on the section

----
Interpretation:

How this finding should be interpreted remains open. Possible directions:

- a calligraphic convention that manifests as a statistical glyph difference?

- a linguistic element (e.g. a clitic prefix) that is mandatory at the start of a line?

- what I believe: a specific hint at part of a cipher functionality. This is supported by the fact that I can already detect changes in the words following the initial letters, particularly p and t, but this is not yet statistically significant. Perhaps I’ll come back to this later.

---

In my opinion, the effect can't be dismissed as an artefact. What do you think?

Do you have any idea what this might be?
I’m interested in comments and, above all, whether this phenomenon has been described before. And, of course, in counter-evidence, hypotheses that might put it into a comprehensible context, etc...

Jojo

PS: It could at least suggest that, when attempting translations, one should try leaving out the first letter as a test.

(19-04-2026, 09:22 AM) JoJo_Jost Wrote: I’m interested in comments and, above all, whether this phenomenon has been described before. And, of course, in counter-evidence, hypotheses that might put it into a comprehensible context, etc...

The statistics for the first character of lines are part of the evidence for LAAFU (the line as a functional unit).

Their sequence is also not random; there are line-start patterns: Voynich Day 2024 [link] + [link] of the study by [link].

Oh blimey, I would have been surprised too...

When I first saw the page initial and subsequently the paragraph initial strong leanings toward ‘normal’ words being prefixed by a gallows, I imagined it was a means to flag to the reader which cipher table to begin with for that page/paragraph. A “Pch” meant use table P’s ch row to start.

Could line-initial prefixes then indicate starting that line with the ‘y’ or ‘d’ etc… row in the same encryption table?

I’ve also wondered if a space was only changing rows in a table such that each ch of consecutive chol chor or each d of daiin dain were different plaintext letters or syllables.

As for labelese, I always wondered if the ‘o’ table consisted of letters that also act as numerals.

Of course, I haven’t done anything but imagine the idea of a multi-table polyalphabetic encryption system that would fit into the time period.

(19-04-2026, 01:05 PM) Grove Wrote: When I first saw the page initial and subsequently the paragraph initial strong leanings toward ‘normal’ words being prefixed by a gallows, I imagined it was a means to flag to the reader which cipher table to begin with for that page/paragraph. A “Pch” meant use table P’s ch row to start.

Could line-initial prefixes then indicate starting that line with the ‘y’ or ‘d’ etc… row in the same encryption table?

I’ve also wondered if a space was only changing rows in a table such that each ch of consecutive chol chor or each d of daiin dain were different plaintext letters or syllables.

As for labelese, I always wondered if the ‘o’ table consisted of letters that also act as numerals.

Your hypothesis about multiple cipher tables is interesting, as it might align with my observations in the data. However, I am still looking for a simpler solution because, as you yourself say, this type of cipher is somewhat anachronistic, having only been (officially) described 30 to 60 years later. Perhaps, however, it was already known in 1430, or the VMS is an early precursor, or the vellum lay in a cellar for a long time until someone misused it for testing purposes, which would explain the different developments of the cipher. Who knows.

But I can mention a few figures that I did not include in the main post, as they are preliminary but could be relevant to your idea.

Bigram differences between the various markers

Idea: If the seven initial markers were indeed different cipher tables (or different rows within a table, or something similar), then one indication might be that the internal bigram statistics differ between the seven groups.
That is why I have carried out a comparison:

[attachment=15175]

Some bigrams appear to be highly group-specific (op, yp, ed); there are clear differences here. Others seem to be distributed more uniformly, such as (yk).

This is likely what would result if p- and o-rows were created from partially overlapping but partially distinct sets of glyphs – in other words, precisely the pattern that such a table/row-switching model would produce when generating an otherwise very homogeneous cipher (so that one cannot simply recognise that they are different tables or the like).
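A comparison of this kind could be set up roughly as follows. This is only a sketch under my own assumptions: lines are grouped by their first glyph unit, and relative bigram frequencies are computed over all tokens of those lines. The grouping rule and the function name are hypothetical and may differ from what was actually done for the table.

```python
from collections import Counter, defaultdict


def bigram_profile_by_marker(lines):
    """Return {marker: {bigram: relative frequency}}.

    lines: list of lines, each a list of tokens (sequences of glyph units).
    The marker of a line is the first unit of its first token.
    """
    counts = defaultdict(Counter)
    for line in lines:
        if not line or not line[0]:
            continue
        marker = line[0][0]
        for tok in line:
            counts[marker].update(zip(tok, tok[1:]))
    profiles = {}
    for marker, counter in counts.items():
        total = sum(counter.values())
        profiles[marker] = (
            {bg: n / total for bg, n in counter.items()} if total else {}
        )
    return profiles
```

Bigrams that are frequent in one profile and rare or absent in the others would be candidates for group-specific pairs.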

[I would like to emphasise that this investigation is still preliminary.]

Regarding your observation concerning the start of a paragraph
Your suggestion that ‘Pch’ means ‘Table P, row ch used at the start’ aligns with what can be structurally interpreted there.

On the idea of the o-table as digits

This is an interesting hypothesis given the distribution I see for "o" in my data: just under 36% of all lines beginning with "o" are found in the Astro section, the section with the most label-like short tokens. If "o" were to be a subsystem used for digits and labels, this concentration would not be a coincidence.

The "o" lines in the Astro section should exhibit different distributions of token lengths (shorter, more repetitive) than "o" lines in Herbal or Pharma. That would then need to be checked.
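One simple way to check this would be to compare mean token length and a repetition measure between the two groups of lines. The sketch below uses the type/token ratio as the repetition measure, which is my choice for illustration only; tokens must be hashable (strings or tuples of units).

```python
def length_and_repetition(tokens):
    """Return (mean token length, type/token ratio) for a list of tokens.

    A lower type/token ratio means the same forms are repeated more often.
    """
    if not tokens:
        raise ValueError("need at least one token")
    mean_len = sum(len(t) for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return mean_len, type_token_ratio
```

The function would be called once on the tokens of "o" lines in Astro and once on those in Herbal or Pharma. Since the type/token ratio depends on sample size, the two samples should be of similar size for a fair comparison.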

On the idea of the space as a switch. I’m also struggling with the spaces. At the moment, I believe they do not form word boundaries – but I cannot prove it.

But if one imagines that the word boundary itself signals a change to a different row within a table, this could explain the unusually high repetition of the same surface forms. Or am I mistaken there?

Then sequences such as chol chor chol, and others like them, would not be repetitions, but simply successive plaintext values that happen to share a cipher surface.

However, I haven’t yet found a way to test this. One would have to consider what "signature" such a type of structure would leave behind. I’m thinking about it; however, as I said, I’m looking for a simpler logic...

I’ve had lots of what-ifs that I couldn’t figure out how to validate, but this is the VMS and that’s kind of expected.

Guess I could add that what I call Titles don’t appear to follow this line initial convention…. Hmmm maybe they do still have a line initial that fits.

Grove Wrote: Guess I could add that what I call Titles don’t appear to follow this line initial convention…. Hmmm maybe they do still have a line initial that fits.

After a quick look at a few, I think these ones from [link] are clearly odd word starts:

<f8r.T1.8;F>     dcho.daiin

<f8r.T2.13;F>   okokchodm
<f8r.T3.21;F>   schol.saim

(19-04-2026, 09:22 AM) JoJo_Jost Wrote: Do you have any idea what this might be?

I am still of the opinion that things like this are evidence for the manuscript being a fabricated hoax. The manuscript was not written in one go. Each section was written separately, with gaps of time between them. The gaps were long enough for the writer to lose some fluency in the method of fabrication, so that at the start of each new section the writer had to relearn or perhaps refine the method to make the task of writing the longer text sections (quires 13, 20) easier, or perhaps he just wanted to do things differently. Knowing that no-one would ever understand the text, the writer could change or adapt it as he wished. He was under no obligation to make the text consistent.

(19-04-2026, 09:22 AM) JoJo_Jost Wrote: Each unit has its own section distribution

With regard to the letter s being concentrated at the line start in quires 13 and 20, I have earlier given my opinion that this character is just a nonsense character. [ [link] and in subsequent posts in that thread. ]

Likewise for character p. [ [link] ]

Yes, this specific data does seem to support the theory that it’s a hoax, at first glance. But if you look at the underlying data, you realize that it’s too structured to be a hoax. From a purely historical perspective, it wouldn’t make sense to develop such a complex system for a hoax—one that adheres almost slavishly to very specific rules over 200 pages. The consistency in word usage, the syllabic structure within the words and at the glyph level—even with a cipher known at the time, one couldn’t achieve that level of perfection. The structure corresponds to an underlying language.

And the “S” fits perfectly with the placement of “and”—that can't be a coincidence either.
For a long time, I also thought the "P" at the beginning of a page was a zero glyph, but given the results above, it’s clearly a marker that makes sense.


Posters, in order of the posts above: JoJo_Jost, nablator, JoJo_Jost, Grove, JoJo_Jost, Grove, Grove, dashstofsk, dashstofsk, JoJo_Jost