The first glyph of every line in the VMS – a statistical anomaly, or perhaps even part of the cipher?
This is a “by-product” of my statistical investigations in connection with the Bavarian hypothesis. I think this might interest others too, and I’m curious to see whether you can verify it, or whether it’s perhaps already known (I haven’t read about it yet). I’m posting it in the Text Analysis section because that’s where it belongs.
During the many analyses I’ve carried out, I’ve noticed that the first glyph in a line doesn’t match the translations. The effect is visible across the entire corpus and statistically distinguishes position 1 clearly from all others.
I already knew / suspected this regarding the first glyph on the page; most of you are probably aware of that. But the effect is more widespread.
A brief definition:
Anomaly rate: the proportion of tokens containing at least one internal bigram with a negative PMI – i.e. a pair of characters that occurs together less frequently in the manuscript than would be expected under independent distribution.
(PMI: Pointwise Mutual Information measures how surprising it is that two characters appear next to each other.)
Note: Only lines with at least three tokens are considered, to exclude labels and other elements. Interestingly, the result is actually slightly better if the filter is lines with more than three letters (letters, not tokens).
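For anyone who wants to check the definition, here is a minimal sketch of how the anomaly rate could be computed. It is not necessarily my exact pipeline: it assumes each token is a string in which every character is one unit (a real transliteration would need ch, sh etc. mapped to single symbols first), and the function names are made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI (log base 2) for every adjacent pair of units seen inside the tokens."""
    unit_counts = Counter()
    pair_counts = Counter()
    for tok in tokens:
        unit_counts.update(tok)
        pair_counts.update(zip(tok, tok[1:]))
    n_units = sum(unit_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = unit_counts[a] / n_units
        p_b = unit_counts[b] / n_units
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

def is_anomalous(token, pmi):
    """True if the token contains at least one internal bigram with negative PMI."""
    # Assumption: a pair missing from the table counts as neutral (PMI 0).
    return any(pmi.get(pair, 0.0) < 0 for pair in zip(token, token[1:]))

def anomaly_rate(tokens, pmi):
    """Proportion of tokens that are anomalous."""
    if not tokens:
        return 0.0
    return sum(is_anomalous(t, pmi) for t in tokens) / len(tokens)
```

The PMI table would be built once from the whole corpus and then reused for every line position, so that all positions are judged against the same baseline.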
1. For each token at line positions 1 to 5, the average length and the anomaly rate are calculated:
If we group all tokens from position 2 onwards (32,425 words), they have an average length of 4.49 and an anomaly rate of 28.2%. Even so, the first word of each line deviates significantly, with 4.91 and 41.5%.
On average, the first token of each line is 0.6 units longer than the token at position 2, and it contains an anomalous internal bigram about 1.5 times as often as the tokens at later positions (41.5% versus 28.2% pooled).
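The per-position figures can be reproduced with something like the following sketch. It assumes the text is available as a list of lines, each line a list of tokens, and that an anomaly test is passed in as a function; `position_stats` and its parameters are illustrative names only.

```python
from collections import defaultdict

def position_stats(lines, is_anomalous, min_tokens=3, max_pos=5):
    """Average token length and anomaly rate for line positions 1..max_pos.

    lines        -- list of lines, each a list of tokens (strings of units)
    is_anomalous -- function taking a token and returning True or False
    """
    lengths = defaultdict(list)
    flags = defaultdict(list)
    for line in lines:
        # Skip short lines (labels and similar), as described in the note.
        if len(line) < min_tokens:
            continue
        for pos, tok in enumerate(line[:max_pos], start=1):
            lengths[pos].append(len(tok))
            flags[pos].append(bool(is_anomalous(tok)))
    return {
        pos: (sum(lengths[pos]) / len(lengths[pos]),
              sum(flags[pos]) / len(flags[pos]))
        for pos in sorted(lengths)
    }
```

Each entry of the result is a pair (average length, anomaly rate), so position 1 can be compared directly with positions 2 to 5.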
2. I was then interested in the question of what happens when the first letter is removed.
The result: if one removes the first unit of each token, the anomaly rate changes to varying degrees depending on the line position:
At position 1, removing the first unit reduces the anomaly rate by 20 percentage points; at all other positions, by only 7 to 8.
The effect at position 1 is about 2.5 times as strong.
The conspicuous sequence is therefore at the beginning of the token. If one removes precisely the first unit, the remaining word behaves statistically like a normal Voynich word.
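A rough sketch of this stripping test, under the same assumptions as before (lines as lists of tokens, one character per unit, hypothetical function names). One further assumption: the anomaly test is applied unchanged to the shortened token, i.e. the PMI table is not recomputed after stripping.

```python
def strip_effect(lines, is_anomalous, min_tokens=3, max_pos=5):
    """Drop in anomaly rate (percentage points) per line position
    when the first unit of every token is removed."""
    counts, before, after = {}, {}, {}
    for line in lines:
        if len(line) < min_tokens:
            continue
        for pos, tok in enumerate(line[:max_pos], start=1):
            counts[pos] = counts.get(pos, 0) + 1
            before[pos] = before.get(pos, 0) + bool(is_anomalous(tok))
            # tok[1:] is the token without its first unit.
            after[pos] = after.get(pos, 0) + bool(is_anomalous(tok[1:]))
    return {
        pos: 100.0 * (before[pos] - after[pos]) / counts[pos]
        for pos in sorted(counts)
    }
```

A clearly larger value at position 1 than at positions 2 to 5 is the pattern described above.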
3. "The Magnificent Seven"
There are mainly seven glyphs in the first position.
Total: 85.5%
The remaining 14.5 per cent are distributed among rarer units such as ch, sh, k and others.
However, this is not the normal frequency. The overall frequency of these seven units in the corpus is significantly lower.
For example, p accounts for 0.84 per cent of all units in the VMS, but 7.2 per cent of all first glyphs in a line, an overrepresentation by a factor of about 8.5. For s, the factor is 5.7. The distribution at position 1 therefore does not follow the general frequency of the corpus.
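The factor is simply the share at line start divided by the share in the whole corpus (for p: 7.2 / 0.84, roughly 8.5 given rounding). A minimal sketch, again assuming lines as lists of tokens with one character per unit; the function name is illustrative, and the three-token line filter is left out for brevity.

```python
from collections import Counter

def overrepresentation(lines):
    """Ratio of each unit's share among line-initial glyphs
    to its share among all units in the corpus."""
    first = Counter(line[0][0] for line in lines if line and line[0])
    overall = Counter(u for line in lines for tok in line for u in tok)
    n_first = sum(first.values())
    n_all = sum(overall.values())
    return {
        u: (first[u] / n_first) / (overall[u] / n_all)
        for u in first
    }
```

A value near 1 means the unit is as common at line start as anywhere else; values well above 1 mark the line-initial favourites.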
4. I was also interested in whether there are section-specific distributions:
Result: The seven units are unevenly distributed across the sections. Each figure indicates the proportion of lines with this initial marker that appear in the respective section:
[Table; columns: Herbal, Astro, Balneo, Cosmo, Pharma, Recipes]
Each unit has its own section distribution. "o" is concentrated in Astro, "q" in Herbal and Balneo, "p" in Recipes (f103 to f116), while p is practically absent in Astro.
If the seven markers were distributed randomly across the sections, deviations of the size measured here would be statistically extremely unlikely. A chi-square test confirms what is already evident in the table: each marker has its own section preference.
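For those who want to rerun the test, here is a small sketch of a chi-square test of independence written out by hand. The input would be a table of raw line counts (one row per marker, one column per section), not percentages; `chi_square` is an illustrative name, and the sketch assumes no row or column is entirely zero.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if marker and section were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

With seven markers and six sections the table has (7 - 1) x (6 - 1) = 30 degrees of freedom; the statistic is then compared against the chi-square distribution with that many degrees of freedom.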
5. Then I thought to myself, hmm, perhaps the effect is distorted by the first line.
In 72 per cent of cases, the first line of a page begins with a gallows glyph (t, p, k, f). That is a known fact. However, if one removes these first 227 page lines from the analysis, the key figures hardly change:
The key finding therefore stems from the normal text lines, not from the potentially ‘decorative’ page beginnings.
6. Summary
The first glyph of every line in the Voynich Manuscript exhibits statistical behaviour that differs from all other positions:
a) The first token is longer and statistically more anomalous
b) The effect is limited to the first glyph
c) 85 per cent of the lines begin with one of only seven units
d) The choice of this unit depends significantly on the section
----
Interpretation:
How this finding should be interpreted remains open. Possible directions:
- a calligraphic convention that manifests as a statistical glyph difference?
- a linguistic element (e.g. a clitic prefix) that is mandatory at the start of a line?
- what I believe: a specific hint at part of a cipher functionality. This is supported by the fact that I can already detect changes in the words following the initial letters, particularly p and t, but this is not yet statistically significant. Perhaps I’ll come back to this later.
---
In my opinion, the effect can’t be dismissed as an artefact. What do you think?
Do you have any idea what this might be?
I’m interested in comments and, above all, whether this phenomenon has been described before. And, of course, in counter-evidence, hypotheses that might put it into a comprehensible context, etc...
Jojo
PS: It could at least suggest that, when attempting translations, one should try leaving out the first letter as a test.