Hi all,
Some of you may have seen my earlier work confirming the Currier A/B distinction quantitatively. That paper showed the distinction is real, recoverable without labels, and predictive. But it also left a puzzle on the table that I could not explain at the time. I now have an explanation, and it leads somewhere unexpected.
Of the eleven character pairs I tested, one behaved paradoxically. The e/ch pair had essentially zero global correlation with the A/B split, yet it produced the strongest signal of all pairs at folio boundaries. And when included in clustering, it actively destroyed the A/B partition: removing it doubled the clustering accuracy.
How can a pair be simultaneously invisible globally, maximally informative locally, and destructive to classification? That combination is not possible under a simple two-language model. Something more structured is going on.
The answer turns out to be surprisingly clean. If you look at the vowel that follows the digraphs CH and SH across the manuscript, you find that folios split into two sharply separated groups. The gap between these two states is enormous, and a two-state binomial mixture model fits with a 2,549-point AIC improvement over a single state. Of 197 folios, 195 are assigned unambiguously.
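For readers who want to try this kind of model comparison themselves, here is a minimal sketch of a one-state versus two-state binomial fit compared by AIC. It is not the pipeline behind the numbers above: the per-folio counts in `toy_counts` are invented for illustration, the function names are hypothetical, and the two-state fit uses a plain EM loop with arbitrary starting values.

```python
import math

def log_binom_pmf(k, n, p):
    """Log-probability of k successes in n trials at rate p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
    log_coef = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coef + k * math.log(p) + (n - k) * math.log(1 - p)

def one_state_loglik(counts):
    """Single shared rate; the maximum-likelihood rate is the pooled fraction."""
    p = sum(k for k, n in counts) / sum(n for k, n in counts)
    return sum(log_binom_pmf(k, n, p) for k, n in counts)

def e_step(counts, p_lo, p_hi, w):
    """Return (posterior P(high state) per folio, mixture log-likelihood)."""
    resp, loglik = [], 0.0
    for k, n in counts:
        a = math.log(w) + log_binom_pmf(k, n, p_hi)
        b = math.log(1 - w) + log_binom_pmf(k, n, p_lo)
        m = max(a, b)
        total = math.exp(a - m) + math.exp(b - m)
        resp.append(math.exp(a - m) / total)
        loglik += m + math.log(total)
    return resp, loglik

def fit_two_state(counts, iters=200):
    """EM for a two-component binomial mixture."""
    p_lo, p_hi, w = 0.25, 0.75, 0.5  # arbitrary starting values (assumption)
    for _step in range(iters):
        resp = e_step(counts, p_lo, p_hi, w)[0]
        hi_n = sum(r * n for r, (k, n) in zip(resp, counts))
        lo_n = sum((1 - r) * n for r, (k, n) in zip(resp, counts))
        p_hi = sum(r * k for r, (k, n) in zip(resp, counts)) / max(hi_n, 1e-12)
        p_lo = sum((1 - r) * k for r, (k, n) in zip(resp, counts)) / max(lo_n, 1e-12)
        w = min(max(sum(resp) / len(resp), 1e-9), 1 - 1e-9)
    resp, loglik = e_step(counts, p_lo, p_hi, w)
    return (p_lo, p_hi, w), loglik, resp

# Invented toy data: (count of E after CH/SH, total CH/SH tokens) per folio.
toy_counts = [(2, 40), (3, 35), (1, 50), (38, 45), (30, 33), (41, 48)]

(p_lo, p_hi, w), ll_two, resp = fit_two_state(toy_counts)
ll_one = one_state_loglik(toy_counts)

aic_one = 2 * 1 - 2 * ll_one  # one free parameter: the shared rate
aic_two = 2 * 3 - 2 * ll_two  # three free parameters: two rates and a weight
aic_gain = aic_one - aic_two  # positive means the two-state model is preferred
```

The posterior values in `resp` are what "assigned unambiguously" would mean in this sketch: a folio whose posterior is very close to 0 or 1 belongs clearly to one state.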
This is not the same thing as the Currier A/B split, although it correlates with it. It is sharper, it operates at the individual folio level rather than at section boundaries, and it persists within the Herbal section alone (where the A/B boundary is supposed to be clean).
I call it a boolean switch: a single binary parameter, set once per folio.
Here is where it gets interesting. If the switch were just replacing graphemes uniformly, every word containing those graphemes would respond the same way. They do not.
When you group words into templates, you find three classes:
- Fixed O templates: these are locked to O in both switch states.
- Fixed E templates: these are locked to E in both switch states.
- Switchable templates: these respond strongly to the switch.
Template identity accounts for 93.5% of the variance. The folio switch accounts for only 7.9%.
So the system has two components: a template structure that determines which contexts are switchable, and a boolean parameter that modulates the switchable ones. The Currier A/B distinction is a blurred projection of this system, not the system itself.
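The three-way split can be sketched as a simple rule on each template's E-fraction in the two switch states. This is only an illustration of the idea: the function name, the `lock` and `shift` thresholds, and the template labels `T1` to `T4` are all assumptions made up for the example, not values from the analysis.

```python
def classify_template(e_frac_in_o_state, e_frac_in_e_state, lock=0.1, shift=0.3):
    """Label a template from its E-fraction in each folio switch state.

    lock:  how close to 0 or 1 both fractions must be to count as fixed
           (hypothetical threshold).
    shift: minimum change between states to count as switchable
           (hypothetical threshold).
    """
    if e_frac_in_o_state <= lock and e_frac_in_e_state <= lock:
        return "fixed O"
    if e_frac_in_o_state >= 1 - lock and e_frac_in_e_state >= 1 - lock:
        return "fixed E"
    if abs(e_frac_in_e_state - e_frac_in_o_state) >= shift:
        return "switchable"
    return "unclassified"

# Invented example values: (E-fraction on O-state folios, E-fraction on E-state folios).
toy_templates = {
    "T1": (0.02, 0.05),
    "T2": (0.97, 0.99),
    "T3": (0.15, 0.85),
    "T4": (0.45, 0.55),
}

labels = {name: classify_template(*fracs) for name, fracs in toy_templates.items()}
```

The "unclassified" fallback is there so that a template with a mixed but unresponsive ratio is not silently counted as switchable.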
The e/ch pair is paradoxical because it responds to the boolean switch, but only in switchable template contexts. In clustering, the e/ch ratio injects variance along a dimension that does not align with the primary A/B axis. Mystery solved.
Now for the part that surprised me the most
Everything above is derived purely from text statistics. I had no reason to expect it would connect to anything visual. But then I found Koen's morphometric study of the Herbal plant illustrations, which classifies plants as A-type or B-type based on twelve visual features: stem-root lines, flower morphology, daisy-type flowers, grass elements, root platforms, leaf venation, and so on. This classification was done entirely from the drawings, with no reference to text statistics.
I cross-validated my boolean switch against the morphometric classification on 101 Herbal folios (excluding quire 8). The results:
Boolean switch vs. morphometrics: 96.0% agreement, Cohen's kappa = 0.870, Fisher's exact p = 3.5 x 10^-15.
Currier vs. morphometrics: 78.2% agreement, Cohen's kappa = 0.106.
Read that kappa for Currier again: 0.106. Once you correct for base rates, Currier's section-level labels have almost no predictive power for plant morphology. The boolean switch, derived from a single text ratio, predicts the visual classification of the plant drawings with near-perfect accuracy.
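To see why raw agreement and kappa can diverge so much, here is a small sketch of Cohen's kappa computed from a 2x2 table. Both tables are invented purely to show the base-rate effect; they are not the contingency tables behind the figures above, and `cohens_kappa` is a hypothetical helper name.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance if the two labelings were independent.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented tables: rows are one labeling, columns the other.
balanced = [[45, 5], [5, 45]]    # 90% raw agreement, even base rates
imbalanced = [[80, 10], [8, 2]]  # 82% raw agreement, one label dominates

kappa_balanced = cohens_kappa(balanced)
kappa_imbalanced = cohens_kappa(imbalanced)
```

In the imbalanced table most of the agreement comes from both labelings saying the majority label almost all the time, so the chance-corrected score drops close to zero even though raw agreement stays high.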
Every Currier discordance is resolved by the switch
Of the 27 folios where my switch disagrees with Currier's label, 18 have morphometric data.
In all 18 cases, the plant illustration sides with my switch, not with Currier. The probability of that under the null is 3.8 x 10^-6.
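The quoted probability matches 0.5^18, so the null here is presumably that each discordant folio is equally likely to side with either labeling (that reading is an assumption; the null is not spelled out above). A short sketch of the one-sided sign test under that assumption, with a hypothetical function name:

```python
from math import comb

def one_sided_sign_test(successes, trials, p_null=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 18 of 18 discordant folios side with the switch.
p_value = one_sided_sign_test(18, 18)
```

With 18 out of 18 the sum has a single term, so the result is simply 0.5^18, which is about 3.8 x 10^-6.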
These are not marginal cases. Folios such as f31r, f34v, f39r, f43r, and f46r are all traditionally classified as Currier A because they fall in the f1-f57 range. But their text is E-dominant, and their plant illustrations show B-type features (daisies, grass, root platforms, unidirectional leaves). Conversely, folios such as f87r, f90r, and f93v are traditionally Currier B, but their text is O-dominant and their plants show A-type features (stem-root lines, A-type flowers and calyxes).
The switch is not just a better statistical classifier. It is detecting the same organizational principle that the illustrator or illustrators were following.