The Voynich Ninja

Dear all,

In this new paper, companion to the testable signatures paper we recently shared on this forum, we test whether Currier's idea of two distinct "languages" (A and B) in the Voynich manuscript holds up under modern statistical scrutiny. We do this in two complementary ways, using character-pair ratios (like how often 'd' appears versus 'l' on a given page):

First, a generative (unsupervised) model looks at the raw character counts with no knowledge of Currier's labels and asks: how many distinct groups does the data itself support? It independently selects two groups and assigns pages in a way that substantially overlaps with Currier's A/B split.

Second, and perhaps more importantly, a predictive model tests whether knowing a page's A/B label actually lets you forecast its character statistics on unseen pages. The result: using character-pair ratios, it predicts held-out page labels with 89.2% accuracy on text the model has never seen.

The A/B distinction is not just a pattern Currier saw: a model rediscovers it blind, and it survives predictive cross-validation.
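
For readers who want a feel for the character-pair ratio idea, here is a minimal Python sketch. It is not the model from the paper: the smoothing constant, the nearest-centroid rule, the function names, and the toy page strings are all assumptions made purely for illustration.

```python
import math

def log_ratio(text, a="d", b="l", smoothing=0.5):
    """Smoothed log ratio of the counts of glyph `a` versus glyph `b` on one page.

    The smoothing constant is an assumption; it only avoids log(0) on short pages.
    """
    return math.log((text.count(a) + smoothing) / (text.count(b) + smoothing))

def leave_one_out_accuracy(pages):
    """pages: list of (text, label) pairs.

    Each page is held out in turn and assigned the label whose mean log ratio,
    computed on the remaining pages only, lies closest to its own value.
    """
    feats = [(log_ratio(text), label) for text, label in pages]
    correct = 0
    for i, (x, true_label) in enumerate(feats):
        train = feats[:i] + feats[i + 1:]
        centroids = {}
        for label in {lab for _, lab in train}:
            vals = [v for v, lab in train if lab == label]
            centroids[label] = sum(vals) / len(vals)
        predicted = min(centroids, key=lambda lab: abs(x - centroids[lab]))
        correct += predicted == true_label
    return correct / len(feats)

# Made-up strings in an EVA-like transliteration, NOT real folio text.
toy_pages = [
    ("daiin chol chor shol", "A"),
    ("chol daiin shol cthol", "A"),
    ("qokedy shedy dal qokeedy", "B"),
    ("chedy qokedy shedy otedy", "B"),
]

print(leave_one_out_accuracy(toy_pages))
```

The point of the sketch is only the mechanics: one ratio per page, and every prediction made from pages other than the one being scored. A real test would use many ratios at once and a proper transliteration.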

The A/B label is the dominant axis of variation, but it only explains about 29% of inter-page variance. There's a lot of structure left to account for!
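
One common way to put a number on "share of variance explained by a label" is the ratio of between-group to total sum of squares (eta squared). The sketch below assumes that definition for a single per-page statistic; the paper may well define its 29% figure differently, and the function name and toy values are hypothetical.

```python
def variance_explained(values, labels):
    """Between-group sum of squares divided by total sum of squares.

    values: one number per page (for example a log character-pair ratio).
    labels: the group label of each page, in the same order.
    """
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    between_ss = 0.0
    for label in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == label]
        group_mean = sum(group) / len(group)
        between_ss += len(group) * (group_mean - grand_mean) ** 2
    return between_ss / total_ss

# Toy numbers only: two pages per group.
print(variance_explained([1.0, 2.0, 3.0, 4.0], ["A", "A", "B", "B"]))
```

A value well below 1, as reported here, means that pages sharing the same label still differ a lot from one another.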

The dataset and methodology will be released in two steps: first to a closed group of specialists, then to the general public later this year.

I hope you enjoy reading it.

Interesting work — I've been looking at some of the same boundary-structure questions.
One thing I noticed separately is how boundary distributions shift between Currier A and B when you look at the internal structure of the tokens, rather than just the n-gram co-occurrences. The A/B split seems to produce different boundary concentration profiles: in A, morpheme boundary positions are slightly more spread out, while B shows a higher concentration at the edges. It's not a drastic change, but it is very consistent, even across different segmentation strategies.
Does your classifier pick this up as part of its 89.2% signal, or is it mostly riding the lexical distribution (different vocabulary distributions A vs. B)?

Labyrinthinesecurity

petronio