Hi everyone, I'm new here. I've been working on a structural analysis of the Voynich manuscript and wanted to share the results.
The paper takes a statistical approach to the herbal section, treating the script as a system to be characterised rather than a language to be identified. The core finding is that label morphemes encode plant-architecture features (stem type, root form, leaf shape, complexity) that can be tested against the illustrations - and that prose behaves differently depending on how much descriptive work the labels already do.
It's built on falsification throughout - every claim has a permutation test, several hypotheses were killed along the way (including my original language candidate), and there's an explicit claims ledger in the appendix showing what survived and what didn't.
What the paper does NOT claim: decipherment, a source language, or readings. The last line is literally "What it does not yield - and may never yield through structural analysis alone - is a reading."
Happy to take questions or pushback - that's what it's for.
Paper description
This paper presents a falsifiable structural model of the Voynich manuscript (Beinecke MS 408), based on computational analysis of the complete ZL IVTFF 2b transcription (36,234 tokens, 226 folios, 8 sections). Rather than attempting to identify an underlying natural language, the study asks what kind of system the manuscript implements, and answers through holdout-validated formal analysis, independent unsupervised confirmation, and cross-modal testing against the manuscript's illustrations.
The model establishes five principal findings. First, a four-layer morphological grammar classifies 91-97% of tokens across six stratified holdout blocks spanning five manuscript sections, three or more scribal hands, and both Currier languages, with zero stacking-order violations in any block and no parameter adjustment after model freeze. Second, the invariant formal system is deployed in at least six distinct compositional regimes -- loop-based prose, topic-dominant chaining, nominal labelling, weakened-loop variant, closure-weighted operational mode, and balanced connective mode -- varying systematically by section and hand. Two regimes were discovered only upon unsealing the sealed reserve holdout, demonstrating that the taxonomy expands under evaluation. Third, discourse-framing density in text predicts visual complexity of herbal illustrations (Spearman rho = 0.600, p < 0.0001, n = 43), confirmed by pre-registered holdout with minimal attenuation. At the label level, specific morphemes predict specific plant features across five independent visual channels, and morpheme bundles predict multi-feature plant profiles compositionally (LOO AUC p = 0.0006). Fourth, a 17-mapping codebook decodes plant architecture from herbal labels at 58.5% accuracy across 72 folios and is bidirectional: image features recover label morpheme sets above chance (p < 0.0001), with forward-greater-than-inverse asymmetry diagnostic of selective encoding rather than cipher. Labels and prose perform complementary, load-balanced functions confirmed by an adaptive compensation mechanism (rho = -0.337, p = 0.011). Fifth, the system meets 8 of 10 criteria for restricted technical notation while failing the criterion most diagnostic of natural language: lexical recoverability.
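For readers less familiar with the statistics quoted above, here is a minimal sketch of how a Spearman correlation with a permutation-based p-value can be computed. It is not taken from the paper's analysis scripts: the function names, the two-sided test, and the toy numbers are all assumptions made for illustration.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value: shuffle y, count |rho| >= observed |rho|."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= abs(observed):
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical per-folio values, invented for this example only.
    framing_density = [0.12, 0.30, 0.25, 0.41, 0.18, 0.35, 0.22, 0.50]
    visual_complexity = [2, 5, 3, 6, 3, 4, 2, 7]
    rho, p = permutation_p(framing_density, visual_complexity, n_perm=2000)
    print(f"rho = {rho:.3f}, permutation p = {p:.4f}")
```

The permutation approach makes no distributional assumption: it only asks how often a random pairing of folios produces a correlation at least as strong as the observed one.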
These findings are independently triangulated: a rule-based grammar, holdout replication across two evaluation stages, and unsupervised HMM recovery of grammar classes from suffix sequences alone (NMI = 0.181, entity purity 0.53) converge on the same structural conclusions. The architecture is inconsistent with simple cipher, random generation, hoax, or classical mnemonic systems.
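As a rough illustration of the NMI figure above, the sketch below computes normalized mutual information between two labelings of the same tokens, for example rule-based grammar classes against HMM states. The normalization used here (arithmetic mean of the two entropies) is an assumption; the abstract does not say which variant the paper uses, and the labels in the example are invented.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two equal-length label sequences."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the mean entropy (an assumed choice of normalizer)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        # Both labelings are constant; return 0.0 by convention in this sketch.
        return 0.0
    return mutual_information(a, b) / ((ha + hb) / 2)


if __name__ == "__main__":
    # Hypothetical class names and state ids, one pair per token.
    grammar_classes = ["prefix", "core", "core", "suffix", "core", "suffix"]
    hmm_states = [0, 1, 1, 2, 0, 2]
    print(f"NMI = {nmi(grammar_classes, hmm_states):.3f}")
```

NMI is 1 when one labeling fully determines the other and 0 when they are statistically independent, so a value such as 0.181 indicates partial rather than complete agreement.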
The study also situates the manuscript within the documented manuscript ecology of the eastern Mediterranean, presenting quantitative visual comparisons against six comparator manuscript traditions. The herbal section aligns closely with early encyclopedic Qazwini copies (Euclidean distance 2.37), while the zodiac section occupies a distinct visual regime matching no tested tradition, combining Latin computational diagram architecture with Byzantine Greek medico-astrological content and a unique figurative encoding system.
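The kind of distance comparison described above can be sketched as follows, assuming each manuscript section or tradition is summarized as a vector of numeric visual features. The feature values and comparator names below are placeholders invented for this example, and the paper's actual features and any scaling it applies are not specified in this abstract.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def nearest_tradition(target, comparators):
    """Return (name, distance) of the comparator closest to the target vector."""
    distances = {name: dist(target, vec) for name, vec in comparators.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    # Hypothetical feature vectors; real ones would need a shared scale.
    herbal_section = [3.0, 4.0, 1.5]
    comparators = {
        "tradition_A": [0.0, 0.0, 0.0],
        "tradition_B": [3.0, 5.0, 1.0],
    }
    name, d = nearest_tradition(herbal_section, comparators)
    print(f"closest: {name} (distance {d:.2f})")
```

Because Euclidean distance depends on the units of each feature, a raw figure is only interpretable alongside the distances to the other comparators and the scaling applied to the features.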
The manuscript is best understood as a structured, sectionally differentiated technical system with partially recoverable semantics -- structurally technical but lexically local. Its grammar is real and invariant. Its regimes are real and section-specific. Its text and images interact. Its labels carry structured semantic content. What it does not yield through structural analysis alone is a reading.
This deposit includes the pre-submission draft (v5.0), analysis scripts, data files, and figures.