The Voynich Ninja - The first glyph of every line

@dashstofsk

Yes, those are exactly the statistical effects I mentioned earlier as a potential problem - thanks for pointing that out. I'm always glad when someone flags these, because such potential errors can cause a lot of unnecessary work. ;)

Here's how I'd propose we calculate it: we divide the pages into three groups - "p"-poor pages, pages with a moderate distribution of "p", and pages where "p" occurs frequently. Then we compute the respective effects within each group for P-starters and non-P-starters, and check whether the P-starter effect is homogeneous. If it varies systematically with p-density, we've identified a confound rather than a genuine phenomenon; if it holds up consistently, the effect is robust. Would this approach be acceptable to you?
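To make the proposal concrete, here is a minimal sketch of such a stratified check. It is not the calculation actually run in this thread. The input format (a list of `(page_id, tokens)` pairs in EVA), the function name `stratified_p_effect`, the density measure (share of tokens containing "p"), and the effect measure (rate of a chosen target token in p-start lines versus other lines) are all assumptions made for illustration.

```python
from collections import defaultdict

def stratified_p_effect(lines, target, n_groups=3):
    """lines: list of (page_id, tokens); target: predicate on a token.

    Returns, per p-density group (lowest density first), the target rate
    per token in p-start lines, in other lines, and their ratio
    (None where a rate is undefined or the baseline is zero).
    """
    # Per-page density of "p": share of tokens that contain the glyph.
    page_tokens = defaultdict(list)
    for page, tokens in lines:
        page_tokens[page].extend(tokens)
    density = {pg: sum("p" in t for t in toks) / len(toks)
               for pg, toks in page_tokens.items() if toks}

    # Rank pages by density and cut the ranking into equal-sized groups.
    ranked = sorted(density, key=density.get)
    group_of = {pg: i * n_groups // len(ranked) for i, pg in enumerate(ranked)}

    # counts[group][is_p_start] = [target hits, tokens seen]
    counts = [[[0, 0], [0, 0]] for _ in range(n_groups)]
    for page, tokens in lines:
        if not tokens or page not in group_of:
            continue
        cell = counts[group_of[page]][tokens[0].startswith("p")]
        cell[0] += sum(1 for t in tokens if target(t))
        cell[1] += len(tokens)

    result = []
    for other, p_start in counts:
        rate_p = p_start[0] / p_start[1] if p_start[1] else None
        rate_o = other[0] / other[1] if other[1] else None
        ratio = rate_p / rate_o if rate_p is not None and rate_o else None
        result.append({"p_start": rate_p, "other": rate_o, "ratio": ratio})
    return result
```

If the ratio is similar in all three groups, the effect does not depend on p-density. If it climbs or falls from group to group, that points to the confound. Note that this sketch counts the line-initial token itself; whether to exclude it is a separate design choice.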

Well, it is your hypothesis. It is for you to provide the evidence for it. As for myself I believe I have already 'solved' the manuscript - it has no solution. I am firmly in the hoax / meaningless text / artificial fabrication camp and have done enough to convince myself of this.

As for character p I believe that there is nothing special about it. I have tried earlier, in two linked posts, to give some evidence for believing so.

If however you do wish to investigate p further then perhaps you might care to note that in quire 13 there are 182 ( 0.6% ) of these characters, of which only 88 come at the start of a word and an even smaller number come at the start of a line. Their distribution across the pages seems to be nicely regular, but they are probably too few in number to lead to any significant conclusion.

I just wanted to know whether you'd agree with this type of calculation as evidence - not asking you to run it! ;)

Here's the result:

[attachment=15274]

The p-effect clearly remains after this decomposition, which makes the hoax hypothesis a bit less plausible. ;)

(23-04-2026, 01:55 PM)JoJo_Jost Wrote: which makes the hoax hypothesis a bit less plausible

You will need to explain how you have come to that conclusion. The p-effect seems just to show that pages that have a raised frequency of lines starting with p have a raised frequency of gallows words. I cannot see how this supports any of the meaningful text hypotheses: continuous narrative, natural language, shorthand or cypher.

(23-04-2026, 06:17 PM)dashstofsk Wrote: You will need to explain how you have come to that conclusion. The p-effect seems just to show that pages that have a raised frequency of lines starting with p have a raised frequency of gallows words. I cannot see how this supports any of the meaningful text hypotheses: continuous narrative, natural language, shorthand or cypher.

The point remains that a generator capable of producing a hoax in 1430 that could replicate the fascinating interplay between structure and variability in the VMS is anachronistic.

It is far more likely that the cipher provides the structure and an underlying language provides the variability, while maintaining a deeper underlying structure. When examining the ciphers of that era, this is far less anachronistic....

(23-04-2026, 08:11 PM)JoJo_Jost Wrote: The point remains that a generator capable of producing a hoax in 1430 that could replicate the fascinating interplay between structure and variability in the VMS is anachronistic.

It is far more likely that the cipher provides the structure and an underlying language provides the variability, while maintaining a deeper underlying structure. When examining the ciphers of that era, this is far less anachronistic....

There is structure; however, the author's capacity to fool you came from an era with fancier tools. You can build a conlang from a substitution cipher and yet not use words from that language, or use them only on rare occasions, since most words are made up. MS-408 is not random; that is known. Its low entropy implies that words are repeated more often. That's not common for well-used languages, where rules are agreed upon in order to communicate. I cannot make heads or tails of the p-effect from your statistics; it may simply be the style of the author.

@oeesordy:

On the entropy point: I've tested that against actual MHD corpora. "Ortloff von Baierland" shows 75.4% Levenshtein-1 connectivity, "Breslauer Arzneibuch" 72.8% — basically the same range Timm claimed was unique to artificial text.

So "low entropy / many word repetitions" isn't diagnostic of a hoax - it's what you get with german medieval medical texts in general.
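For anyone who wants to check figures like these on their own corpus, here is a minimal sketch of one way to compute a Levenshtein-1 connectivity score. The definition used here is an assumption: the share of distinct word types that have at least one other type at edit distance exactly 1. Whether the numbers above were computed over types or tokens is not stated in this thread, and the function name `lev1_connectivity` is made up for illustration.

```python
from collections import defaultdict

def lev1_connectivity(words):
    """Share of distinct word types with at least one other type at
    Levenshtein distance 1 (one substitution, insertion, or deletion)."""
    vocab = {w for w in words if w}
    connected = set()

    # Substitutions: two types of equal length that differ in exactly one
    # position share a pattern in which that position is masked.
    buckets = defaultdict(set)
    for w in vocab:
        for i in range(len(w)):
            buckets[w[:i] + "\0" + w[i + 1:]].add(w)
    for group in buckets.values():
        if len(group) > 1:
            connected |= group

    # Insertions/deletions: dropping one character from w gives another type.
    for w in vocab:
        for i in range(len(w)):
            shorter = w[:i] + w[i + 1:]
            if shorter in vocab:
                connected.add(w)
                connected.add(shorter)

    return len(connected) / len(vocab) if vocab else 0.0
```

The masking trick avoids comparing every pair of types, so it stays fast on a full transliteration. Running the same function on the VMS and on a comparison corpus only makes sense if both are tokenized the same way.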

As for the conlang-from-substitution idea: that would require an author inventing a language with consistent morphology and syntax. A medieval scribe around 1430? That seems rather unlikely.

And I don’t want to discuss modern forgery theory in this thread, because that’s a completely different topic and would distract too much from the actual subject… I ask for your understanding. ;)

I’ve discovered another interesting phenomenon. I don’t think this has been discussed yet? Or am I mistaken?

The pch family in the P lines.

Closed forms: pchedy, pchdy, pchey, opchedy, opchdy, opchey, qopchedy, qopchdy, qopchey.

Key findings:

A total of 208 occurrences of the family
106 of these appear in lines beginning with p (51%)
Lines beginning with p account for only 7% of all lines
p ≈ 4 × 10^-64, so highly significant.
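The test behind that p-value is not stated. One plausible reading is a one-sided binomial tail: if each of the 208 occurrences landed in a p-start line independently with probability 0.07, how likely are 106 or more hits? The sketch below computes that with exact rational arithmetic. It uses the rounded figures from the list, so it need not reproduce 4 × 10^-64 exactly, and it assumes p-start lines are about as long as other lines.

```python
from fractions import Fraction
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    p = Fraction(p)
    q = 1 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k, n + 1))

# Rounded figures from the post: 106 of 208 occurrences, 7% baseline.
tail = binomial_upper_tail(106, 208, Fraction(7, 100))
print(f"{float(tail):.1e}")
```

Exact fractions are used because floating-point terms this small are easy to lose to underflow or rounding. The independence assumption is generous here, since tokens of the family can share a line.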

Position in the line:

10% are the first token (at the beginning of the line)
84% are in the middle of the line
6% are at the end of the line

This is therefore not merely a phenomenon at the beginning of the line in the style of Grove. Most pch lemmas appear later in the line.

2. The question: Does this also apply to the other gallows families tch, kch, and fch?

I ran the same test for the tch, kch, and fch families (same pattern, all endings included this time):

pch — very strong, all endings cluster consistently in p-start lines (6-12x above baseline)
tch — mixed: some endings cluster in t-start lines (3-5x), others don't (-y is actually under baseline)
kch — almost flat, one ending shows a mild cluster (-or, 6x)
fch — too few cases to judge reliably

pch is the only family where the effect is uniform across all endings.

pch is the exception among the four Gallows.

3. We know that many pages start with P; does this have anything to do with the pch concentration?

Of the 106 pch tokens in p-start lines, 19 are on the first line of a page and 87 are further down the page. Density per 100 lines:

First line of the page + p-start: 24 pch/100
Middle of the page + p-start: 29 pch/100
No p-start line: 2 pch/100

In p-start lines at the top of the page and in the middle of the page, the density is roughly the same. About 12–14 times above the baseline.

Conclusion:

p-start lines at page top show about 11.8x baseline density, and p-start lines away from page top about 14.0x.
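The fold figures are just each density divided by the baseline density. A minimal sketch using only the rounded densities quoted above (the function name is made up for illustration):

```python
def fold_over_baseline(density, baseline):
    """How many times higher a density is than the baseline density."""
    if baseline <= 0:
        raise ValueError("baseline density must be positive")
    return density / baseline

# Rounded densities from the post, in pch tokens per 100 lines.
print(fold_over_baseline(24, 2))  # first line of the page + p-start
print(fold_over_baseline(29, 2))  # middle of the page + p-start
```

With the rounded inputs this gives 12.0 and 14.5. The 11.8x and 14.0x quoted in the conclusion presumably come from the unrounded densities, which are not given in the post.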

(23-04-2026, 08:11 PM)JoJo_Jost Wrote: The point remains that a generator capable of producing a hoax in 1430 that could replicate the fascinating interplay between structure and variability in the VMS is anachronistic.

I think I understand you now. You are of the opinion that any hoax would have needed some sort of generating algorithm? And that you do not believe that any such generator would have been likely at that time? And created all the oddities and irregularities in the manuscript? In particular it would have needed to recreate your p-effect or the statistically irregular distribution of gallows words in quire 13 that I showed earlier?

Neither do I believe that the writer used any generating algorithm, or stopped after every word to consult some code table or otherwise to decide the next word. He seems to have had the ability just to write, line after line, paragraph after paragraph, a whole page in one uninterrupted sitting. The fluency of the writing shows this.

The scenario I like to imagine is that the writer invented an alphabet in order to create the hoax manuscript. He used the alphabet primarily for this manuscript but otherwise did not use it much for any other purpose, so his own use of it was a bit fitful. He would have written one or two pages and then finished work for the day. Perhaps on the next day, or perhaps sometime during the next week, or perhaps the following year if he was starting on a new section, he sat down again to write. But at this new sitting he approached the task with a fresh mindset, so that the language and writing were a bit different to what he wrote previously. Whether intentionally or not he might have been led to use more gallows words, or liked to use more daiin words, or use more of a certain prefix or suffix, or liked to amuse himself by forming rare and longer words. Even if the writer wanted to write consistently he would not have succeeded. Humans are not good at mimicking randomness. Psychological tricks have a habit of misleading you into falling for pattern repeats. This is what Gaskell and Bowern observed in their experiment.

The writer had a way of generating text to give it a semblance of genuineness. But he was not an automaton, did not blindly follow some set algorithm. He gave himself a free hand to add variability and to write in a personal style. Wrote what he liked. Was untroubled by any of the psychological tricks. No-one was going to be able to understand the manuscript anyway.

So on some pages there would appear a greater frequency of gallows words. On other pages a lesser frequency. One consequence of this is that it gives your p-effect. Once again, there is nothing special about the EVA-p character.

But it is not just the frequency of gallows words that is irregular and statistically significant. It is true for many other parameters. My code can tabulate and do simulations to show it. The top words, top prefixes, top suffixes, top character pairs, frequencies of gallows words, frequencies of words that appear once in the quire, frequencies of long words, frequencies of words of two characters, frequencies of words that contain characters EVA-s, d, e, a, n, r. Far too many of these are statistically anomalous. There appears to be no uniform core parameter that would be expected of a meaningful narrative.
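The code mentioned above is not shown in this thread, so the following is only a generic sketch of how such a simulation could look, not a reconstruction of it. It takes per-page `(hits, tokens)` counts for one parameter, measures how far the pages scatter around the pooled rate, and compares that with pages simulated under a single shared rate. All names are made up for illustration.

```python
import random

def dispersion(pages, rate):
    """Sum over pages of (observed - expected)^2 / expected."""
    return sum((hits - rate * n) ** 2 / (rate * n) for hits, n in pages if n)

def heterogeneity_p_value(pages, n_sim=2000, seed=0):
    """pages: list of (hits, tokens) per page.

    Monte Carlo p-value for the null hypothesis that every token on
    every page is a hit with the same pooled probability.
    """
    rng = random.Random(seed)
    total_hits = sum(hits for hits, _ in pages)
    total = sum(n for _, n in pages)
    if total == 0 or total_hits == 0:
        return 1.0  # nothing to test
    rate = total_hits / total
    observed = dispersion(pages, rate)

    exceed = 0
    for _ in range(n_sim):
        # Redraw every page with the same size but the pooled rate.
        sim = [(sum(rng.random() < rate for _ in range(n)), n)
               for _, n in pages]
        if dispersion(sim, rate) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

A small p-value says the pages differ more than a single fixed rate would allow. It does not say why they differ: a change of topic and a change of the writer's mood would both produce it.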

For me, the problem with your scenario is precisely those 200 pages. Just look at this last connection between pch and P lines, and at other recognizable patterns - for example, how certain initial markers avoid specific glyphs, and so on. It’s hard to attribute that to random human writing. Nor can the differences between the sections, with their varying vocabularies, be attributed to it. Then there are the same morphological endings (ey, edy, aiin, ol) that run through the entire text, the fixed bi- and trigrams, all the positional rules, the same positional behavior, the same token-internal patterns. Across hundreds of pages?

As far as I know, Gaskell and Bowern’s test series were limited to 2 or 3 pages - on that scale, similar structures are still conceivable - but not such a clear, consistent structure spanning 200 pages and over weeks, months, or years of writing. And they themselves recognized this as a “problem.” Too bad they didn’t test it ;), but well, poor research subjects…

In my opinion, no one - no matter how intelligent or organized they are - can sustain this kind of complex structural consistency across 200 pages by inventing it on the fly.

And Bowern’s own remarks point in that direction as well:

At the word level, for instance, Bowern and Lindemann’s analyses indicate that Voynichese is distinct from familiar languages, in that the occurrence and order of letters is much more predictable than it is in familiar languages. (This predictability, Bowern and a colleague add in a subsequent article from 2022, is typical of intuitively generated gibberish, designed to mimic “what written language ought to look like.”) But beyond the word level—at the level of large sections of text—Voynichese is, in fact, similar to familiar languages.

The sole VMS metric which our gibberish samples are unable to replicate is the VMS’s unusually large bias in character placement within words (charbias_words_mean). This is likely related to a well-documented feature of the VMS in which certain glyphs appear almost exclusively at the start or end of words.

Have you run your tests on Bavarian yet? That’s really exciting ;)