I am sharing two companion papers that address the Voynich Manuscript through the lens of Bayesian Model Selection and Cognitive Science.
Paper 1: Epistemological Hygiene & The Zero-Patch Standard
This paper argues that the field has been constrained by the Patching Fallacy: the acceptance of hypotheses (like Natural Language or Ciphers) that only fit the evidence by introducing unconstrained auxiliary parameters.
By applying a strict Zero-Patch Standard, the paper demonstrates that a Structured Reference System ($H_{\mathrm{ref}}$) is the information-theoretically minimal model. It is deduced directly from the corpus invariants (Rigid Morphology, High Hapax, Sectional Disjointness) rather than postulated and patched.
Paper 2: Cognitive Optimization in External Memory Systems
This follow-up provides the mechanistic explanation for that structure. It reinterprets Stolfi’s "Crust-Mantle-Core" and Currier’s partitions not as linguistic anomalies, but as Cognitive Optimizations for manual retrieval in a paper-based database.
Voynichese has been compared with several known languages; while some patterns have arguably been identified, none has allowed the text to be interpreted.
The script was written smoothly, as if by a native speaker, not someone who was encrypting the text in real time.
The manuscript is believed to originate in NE Italy, roughly in the 15th century.
If you look at the languages spoken in NE Italy around that time, one of the most common languages was Friulian.
While many people spoke Friulian, it was primarily a spoken language, not a written one, unlike the other languages of the region at that time: Latin, Venetian, and Italian.
The Benandanti and other folk medicinalists in NE Italy around that time primarily spoke Friulian.
And here's my hypothesis: What if Voynichese isn't primarily a written language, but the work of a native speaker who tried to transcribe their primarily oral (not written) language phonetically? If the language had distinct sounds, the transcriber might need to invent new characters to represent those sounds in the phonetic transcription.
The evidence would be that Voynichese follows the linguistic patterns we would expect if a speaker of Friulian invented a phonetic alphabet to transcribe the spoken language. Also, some specific words in the Voynich text, especially those included in or next to images, could be shown to correlate with meaningful words in Friulian.
Here is some evidence that I would like to share; I admit that I was aided by AI in generating this information. But please, do not reject this simply because it was AI-aided. I hope someone with more experience on these topics can either refute this idea or explore it further. I'm sharing it here because I saw very few references on this site or elsewhere to the Friulian language's relevance to the manuscript.
Friulian (Furlan) is a Rhaeto-Romance language spoken in the Friuli-Venezia Giulia region of northeastern Italy. In the early 15th century — when the manuscript's vellum was prepared — it occupied a unique position in the European linguistic landscape. It was a living vernacular spoken by hundreds of thousands of people, yet it possessed no established written standard. Legal documents in the region were written in Latin, then Venetian, then Italian. Friulian was the tongue of the field, the market, and the hearth. It was heard, not read.
THE UNWRITTEN LANGUAGE PROBLEM
If a 15th-century practitioner — a healer, an apothecary, a folk scholar embedded in the Benandanti tradition of Friulian agrarian magic — wished to write down spoken Friulian systematically, there was no orthography to follow. He could not simply write Friulian in standard letters the way a Florentine could write Tuscan, because no standard spelling of Friulian existed. The solution, rational and elegant, would be to invent a phonetic script — one that captured the sounds of the language as heard, consistently applied, without the interference of competing orthographic traditions. Such a script, written by a single disciplined hand applying self-invented phonological rules, would produce exactly the statistical signature we observe in Voynichese: lower-than-natural-language entropy, rigid positional rules for characters (because the script encodes phonological constraints directly), and high word-form consistency (because one or perhaps two individuals are transcribing the spoken language).
FRIULIAN'S DISTINCTIVE PHONOLOGICAL FEATURES
Several of Friulian's phonological characteristics would produce distinctive statistical signatures in a phonetic transcription, and several of these match properties of Voynichese that have long puzzled researchers.
Preservation of final consonants. Unlike Italian and Venetian, which dropped most Latin final consonants, Friulian keeps them. Words end in -c, -t, -n, -l with a frequency unmatched in neighboring Romance varieties. A phonetic script for Friulian would require dedicated word-final consonant glyphs — and Voynichese has several characters that appear almost exclusively at word-end.
The palatal /j/ consonant. Friulian preserves the Latin /j/ as a palatal glide, writing it "j" or "i". Italian palatalized this to "gi-"; Venetian affricated it to "z-". Only in the Rhaeto-Romance family does the sound survive as a clean /j/. This has direct consequences for how a Friulian speaker would transcribe certain words — evidence we examine directly in the month names below.
Pervasive clitic pronouns. Friulian's spoken grammar deploys subject clitics — small unstressed pronouns — before virtually every finite verb. The masculine singular "al" (he) and third-person plural "a" appear so frequently in spoken Friulian that any phonetic transcription would be dominated by these short recurring morphemes. This matches Voynichese's most distinctive property: the dominance of short, highly repetitive word-forms.
The Month Names: Direct Testimony
The astrological section of the manuscript (folios f70v–f73v) contains something unique: ten month names written in readable Latin script, inscribed next to the zodiac signs in what researchers call the "Third Script." These are not in the unknown Voynich alphabet. They can be read directly, and they constitute our firmest linguistic anchor point in the entire manuscript.
MARC: FINAL CONSONANTS AND RHAETO-ROMANCE IDENTITY
The spelling "Marc" (f70v2, Pisces/March) with its preserved final hard -c is phonologically impossible in standard Italian (Marzo) or Venetian (Marzo). Both have lost the Latin final consonant entirely. The form appears in Catalan and Occitan as "Març," but crucially it also fits Friulian perfectly — and indeed, final consonant preservation is one of the defining characteristics of Rhaeto-Romance languages, the family to which Friulian belongs. The writer who spelled March with a final -c was not writing Italian. They were writing from a phonological system that keeps what Italian discards.
YUNY: THE PALATAL GLIDE THAT RULES OUT ITALIAN
The month name for June, written on folio f72r2 (Gemini), appears as "Yuny" or "Yony" — using a Y-initial to represent the /j/ sound. This is the single most diagnostic piece of evidence the month names provide. In Italian, June is "Giugno" — the original /j/ has been completely absorbed into the palatal affricate "gi-". In Venetian it is "Zugno" — affricated to "z-". Neither could produce a "Y-" spelling. But in Friulian, the palatal glide /j/ is preserved as a distinct sound, and a writer representing it phonetically would naturally write "y" or "j." The spelling "Yuny" is the phonetic transcription of a Friulian speaker's ear.
AUGST: THE GERMANIC SUBSTRATE
The spelling "Augst" (f72v3, Leo/August) with its compressed consonant cluster "-gst" rather than the smooth Romance "Agosto" points to a speaker at the Italian-Germanic linguistic border. Friuli sits precisely at that interface, where Austrian and Tyrolean German influence on the calendar vocabulary was real and documented. The German form "August" with its Germanic consonant cluster was preserved in early Friulian usage in ways that smooth Italian "Agosto" never would be. The writer who spelled August as "Augst" was hearing the month through a Germanic substratum — an Alpine, border-region ear.
These examples describe, in their spelling choices, a phonological profile that belongs to the Friulian-Alpine borderland of the early 15th century: a Rhaeto-Romance speaker, at the Italian-German interface, whose ear had absorbed Germanic calendar forms but whose vowels and consonants were those of the mountains and the river valleys of Friuli.
The Rosettes Folio: A Map in Friulian
The six-page foldout known as the Rosettes folio (f85v–f86r) contains what most researchers now read as a map or cosmological diagram featuring nine circular "rosettes" connected by causeways, with illustrations of buildings, towers, and what appears to be volcanic or mountainous terrain. The architectural detail is particularly significant: the castle in the northeastern rosette displays swallowtail or Ghibelline crenellations, a style closely associated with northern Italy and specifically with the Scaliger family of the Verona-Friuli region from the 14th century onward. The map, if it is one, could depict a Friulian landscape.
The word "otol", appearing directly before a tower marker in the Stolfi transcription, is the most convincing single word match this investigation has produced. Breaking it as "o" (prepositional/article prefix) + "tol" (→ Friulian tôr, tower, with diminutive lateral suffix -ol), we arrive at o tôrol — "at/of the little tower" or "turret." The diminutive suffix -ol/-ul is genuinely productive in Friulian: cjastelul (little castle), furnul (little oven) follow exactly this pattern. A map label meaning "turret" or "small tower" placed directly adjacent to a drawn tower is, if the reading is correct, a nearly perfect crib.
The word "oal" in the same ring is potentially significant as an encoding of Friulian val (valley) — one of the most common geographic terms in northern Italian place names (Val Camonica, Valpolicella, Val Gardena). The "o/v" correspondence requires an assumption about the sound assigned to that glyph, but it is not arbitrary: in some 15th-century northern Italian handwriting traditions, the letters "o" and "v" shared visual ambiguity.
The word "sarald" resists simple grammatical analysis and may be a proper name — exactly the kind of place-name label a map would carry. The sequence "-ald" appears in Germanic-influenced northern Italian place names (Reinald, Gerwald, Serravalle compressed), and Friuli's long history under Germanic rule (Lombard, then Frankish, then Patriarchate of Aquileia) left Germanic elements throughout its toponymy.
THE "-AIIN" FAMILY: A VERB PARADIGM
One of the most statistically prominent features of Voynichese is the family of words sharing the ending "-aiin": daiin, aiin, saiin, otaiin, qokaiin, okaiin. These six forms together account for thousands of tokens. Under the Friulian hypothesis, this family represents a single verb ending — the 3rd person plural present tense "-an/-in" combined with various Friulian clitic prefixes and verb stems.
In a text recording oral Friulian medical and botanical knowledge, the construction "they take... they add... they boil... they gather..." would appear constantly. The formulaic register of folk medicine — highly repetitive, structurally invariant — naturally produces exactly the low-entropy, high-repetition statistical profile that has puzzled Voynich researchers for decades. The low entropy is not a cipher artifact. It is the statistical signature of formulaic spoken language.
Medieval star catalogues, following the tradition of Al-Sufi's Book of Fixed Stars (transmitted to Europe through Latin translations in the 12th–13th centuries), named individual stars by their position on the constellation figure: "the eye of the bull" (Aldebaran), "the tail of the lion" (Denebola), "the foot of Orion" (Rigel). The Voynich star labels, under our reading, follow this same convention — each label encoding "al-[position on figure]" in a Friulian phonetic rendering of the Arabic astronomical tradition. The suffixes "-ar" (arm/wing), "-am" (hand), "-al" (another positional term) would then be the Friulian or Venetian equivalents of these positional descriptors.
The Entropy Problem — Resolved?
The most persistent objection to any natural-language hypothesis for the Voynich Manuscript is its anomalously low entropy. Voynichese is more predictable, more constrained, more repetitive than any known natural language written in a conventional orthography. This has led many researchers to suspect the text is either a cipher, a constructed language, or deliberate nonsense. The Friulian phonetic hypothesis offers a genuinely novel account of this anomaly — one that does not require any of those explanations.
When a language is written phonetically for the first time, by a single inventor applying self-consistent rules, the result is orthographically more regular than naturally evolved writing systems, not less. There are no inherited irregular spellings from Latin roots. There are no learned Latinate forms intruding on vernacular words. There are no competing dialect spellings. The inventor hears a sound, assigns a glyph, and applies that rule every time. The resulting text has lower orthographic entropy than, say, medieval Latin or Italian, precisely because it lacks all the historical noise that those traditions accumulated.
Furthermore, the content of the manuscript — if our hypothesis is correct — would itself be low-entropy by nature. Oral folk medicine recipes are formulaic in every language and culture. "Take the root... add water... boil for... apply to..." repeated with botanical variation across hundreds of pages would produce a text whose statistical profile is not that of a narrative or a letter, but of a formulaic instructional corpus. The low entropy is not an artifact of encoding. It is the signature of formulaic oral genre, faithfully transcribed.
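The entropy claim above is easy to make concrete. Below is a minimal sketch (my own illustration, not drawn from either paper) of the two statistics usually discussed in this context: per-character Shannon entropy and the conditional bigram entropy often reported as anomalously low for Voynichese. The toy strings are invented stand-ins for "formulaic" text:

```python
from collections import Counter
import math

def char_entropy(text):
    """Shannon entropy (bits/char) of a character stream."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_bigram_entropy(text):
    """H(next char | current char) in bits -- the second-order
    statistic cited as anomalously low for Voynichese."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total          # joint probability of the bigram
        p_cond = n / first_counts[a]  # probability of b given a
        h -= p_pair * math.log2(p_cond)
    return h

# Repetitive "recipe" text, as the formulaic-genre argument describes
formulaic = "take the root add water boil it apply it " * 50
print(char_entropy(formulaic), conditional_bigram_entropy(formulaic))
```

On text like this, the conditional entropy drops well below the raw character entropy, which is the general shape of the argument being made above.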
The Challenge of Analyzing a Dynamic Text: Why the Voynich Manuscript Resists Systematic Analysis by Torsten Timm
Abstract:
The Voynich Manuscript (MS 408) presents a unique analytical challenge that transcends conventional cryptanalytic approaches. This paper examines why standard analytical frameworks—whether linguistic, cryptographic, or statistical—consistently fail to produce definitive results. It argues that the manuscript's most fundamental property is its continuous evolution throughout the text, creating a dynamic system where local predictability coexists with dramatic large-scale transformation. By examining the manuscript through the lens of network analysis and developmental processes, the paper demonstrates that the text exhibits properties consistent with organic growth rather than rule-governed production. This perspective reconciles apparently contradictory observations and provides a framework for understanding why the manuscript has resisted systematic analysis for over a century.
Introduction:
I am sharing a new discovery regarding the Rosettes Map (Folio 86v). My research shows that this map is not just a drawing; it is a geometric layout of the world and a warning about the end times.
The Discovery of "T" and "Code 4":
The T-Axis: If you look at the nine rosettes, you can see an invisible T-shape that connects them. The vertical line connects the top and bottom, and the horizontal line connects the sides. This "T" is the key to the whole map.
The Tilt (Karkata): Notice that the T-axis is not perfectly straight; it is tilted. In geometry, a tilted axis means the world is out of balance. This shows that power has shifted to one side.
Code 4: This "T" divides the circular world into 4 parts. These represent the 4 corners of the earth (North, South, East, and West). The "Code 4" shows how a central power (the fortified city) is trying to control all four corners of the world at once.
Fikra (Insight):
Everything in the Voynich Manuscript is visible if you look at the shapes. You don't need to read the letters to see the truth. The author had a vision of the future, showing a city with high towers that would dominate the world. The T-axis shows us the direction, the Code 4 shows us the global scale, and the Tilt shows us that the end-time sequence has begun.
Conclusion:
This is a "Warning Map." The geometry doesn't lie. I invite you to look at the angles and the 4-part division to see the reality for yourselves. This is not just history; it is what is happening now.
If you haven't read my first post on this, it'll explain a lot about what I'm going to present.
I don't believe God plays dice.
In the last post I had just split the Voynich into my 0ed (Currier A) and ed+ (Currier B) pages. This post is going to be less chart pretty and more statistics. Prepare to be bored mindless with numbers.
The Voynich
Total Pages
0ED pages: 104
ED+ pages: 121
Tokens per page (mean / median)
0ED: 84.6 / 80
ED+: 211.2 / 145
Unique tokens per page (mean / median)
0ED: 66.8 / 65
ED+: 140.4 / 110
Global hapax ratio (mean / median)
0ED: 0.146 / 0.142
ED+: 0.144 / 0.131
Reuse ratio (mean / median) - How often words repeat.
0ED: 0.725 / 0.755
ED+: 0.784 / 0.800
Variance of token length (mean / median)
0ED: 2.69 / 2.64
ED+: 2.77 / 2.68
Proportion of long tokens (length ≥ 7)
0ED: 0.145
ED+: 0.182
Unique bigrams per page (mean / median)
0ED: 60.5 / 61
ED+: 80.9 / 79
Bigram repetition rate
(1 − unique_bigrams / total_bigrams)
0ED: 0.791
ED+: 0.861
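For anyone wanting to reproduce these per-page numbers, here is a minimal sketch of the statistics as I read them. The post doesn't give exact formulas, so the hapax and reuse definitions below are my assumptions; the bigram repetition rate follows the formula given above (1 − unique_bigrams / total_bigrams). The sample page is invented:

```python
from collections import Counter
from statistics import variance

def page_stats(tokens, global_counts):
    """Per-page statistics; definitions are assumed, not the post's code."""
    counts = Counter(tokens)
    n = len(tokens)
    # assumed: fraction of this page's tokens occurring exactly once corpus-wide
    hapax = sum(1 for t in tokens if global_counts[t] == 1) / n
    # assumed: fraction of tokens that repeat somewhere on the page
    reuse = sum(c for c in counts.values() if c > 1) / n
    bigrams = [t[i:i + 2] for t in tokens for i in range(len(t) - 1)]
    return {
        "tokens": n,
        "unique": len(counts),
        "hapax_ratio": hapax,
        "reuse_ratio": reuse,
        "len_variance": variance([len(t) for t in tokens]),
        "long_ratio": sum(1 for t in tokens if len(t) >= 7) / n,
        # the post's formula: 1 - unique_bigrams / total_bigrams
        "bigram_repetition": 1 - len(set(bigrams)) / len(bigrams),
    }

pages = [["daiin", "chedy", "daiin", "qokeedy", "ol", "chedy", "shedy"]]
global_counts = Counter(t for p in pages for t in p)
print(page_stats(pages[0], global_counts))
```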
TLDR; ED+ pages
Are much longer (influenced by the Balneological and Recipes sections)
Introduce more total vocabulary
Reuse prior vocabulary more than 0ED pages
Contain longer tokens
Use more bigram types
Repeat bigrams more heavily.
Have roughly the same hapax creation rate.
Sometimes things have to break before you can fix them.
So, there are a lot of statistics showing that ED+ pages are significantly different from 0ED pages, and much of this has been discovered and mulled over since Currier spotted that difference. But what I'm about to show you will, I think, make you reconsider that difference.
I was working on splat repair. It occurred to me that some 2,000 splats exist in Takahashi. If the Voynich contains information, that is a lot of lost information. So I started working on a suite of repair tools, some bits borrowed from OCR, some from spell checking. In that suite I did something different: 'leak-free' testing. I would train on tokens from the herbal section and then test on the recipes section. Or I'd alternate folios: train on one, test on the other. This allowed comparisons between two sections of the Voynich without one set of data contaminating the other. So when I spotted 0ed and ed+, the repair idea popped right up.
When I trained my model on 0ed pages and tested on ed+ pages, here's what I found.
OOV tokens:
(OOV = out of vocabulary)
10,389 - That's the number of tokens seen on ed+ pages that were not seen on 0ed pages. A fairly large number.
Bigram-illegal tokens:
170 - That's how many tokens on ed+ pages had bigrams that were not seen on the 0ed pages.
Repairability of OOV tokens:
SUB or DEL (token-level): 82.15%
SUB or DEL (type-level): 66.49%
SUB/DEL/INS (token-level): 82.69%
SUB/DEL/INS (type-level): 67.33%
This is the big one. Of those 10,389 tokens found on the ed+ pages that didn't exist on 0ed pages, 82% could be made into a 0ed token with one simple character deletion or substitution. These two sets of pages are not that different.
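The leak-free OOV test and the repairability check can be sketched roughly like this. The vocabularies are toy stand-ins, and one_edit_repairable implements only the SUB-or-DEL variant quoted above; this is my reconstruction, not the author's code:

```python
def one_edit_repairable(token, vocab):
    """True if one substitution or deletion turns `token` into an
    attested word (the SUB-or-DEL criterion described in the post)."""
    alphabet = {ch for w in vocab for ch in w}
    for i in range(len(token)):
        if token[:i] + token[i + 1:] in vocab:       # try a deletion
            return True
        for ch in alphabet:                          # try substitutions
            if token[:i] + ch + token[i + 1:] in vocab:
                return True
    return False

train = {"daiin", "chedy", "shedy", "qokedy", "ol"}   # stand-in 0ed vocab
test_tokens = ["qokeedy", "chedy", "shed", "xxxxx"]   # stand-in ed+ tokens

# leak-free: vocabulary comes only from the training pages
oov = [t for t in test_tokens if t not in train]
repairable = [t for t in oov if one_edit_repairable(t, train)]
print(f"OOV: {len(oov)}, repairable: {len(repairable)}")
```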
Next, I reversed that test. I trained on ed+ pages and tested on ed0 pages.
OOV tokens (Vocab-OOV):
1,714 - That's a huge difference. Over 10,000 tokens were seen on ed+ pages that didn't exist on 0ed, but only 1,714 tokens were on 0ed that were not on ed+.
Bigram-illegal tokens:
17 - Only 17 tokens in ed0 had bigrams that were not on ed+.
When you compare those numbers based on total tokens in the test,
0ED → ED+
OOV tokens: 10,389
Test tokens: 25,554
OOV ratio: 40.66%
ED+ → 0ED
OOV tokens: 1,714
Test tokens: 8,797
OOV ratio: 19.48%
And, 82.21% of ed+ tokens could be repaired to make ed0 tokens with a single substitution or deletion.
More notes:
High-frequency backbone
In 0ED
Top 10 most frequent tokens → 100% shared with ED+
Top 20 → 100% shared
Top 50 → 100% shared
In ED+:
Top 10 → 70% shared
Top 20 → 75% shared
Top 50 → 78% shared
Exclusive bigrams
0ED-only bigrams: 12
ED+-only bigrams: 60
TLDR2;
0ED vocabulary is largely contained within ED+.
ED+ expands well beyond 0ED.
Every high-frequency backbone token in 0ED exists in ED+.
195 of 207 0ED bigrams survive in ED+
That is 94% retention
ED+ expands the bigram alphabet
Adds 60 new bigrams
Bigram space grows from 207 → 255
And here's the first chart. This is the vocabulary growth between 0ED and ED+. This is a very smooth growth rate. This suggests that there was no big shift between 0ED and ED+. Despite all of those differences above, it's still the same base "engine" chugging along with no dramatic change.
Sometimes, broken things deserve to be repaired.
In ED+ there are 4,260 unique tokens that do not exist in 0ED pages (different from the OOV above). If we take those tokens and we do a simple 1 edit distance repair to a token in 0ED:
2,870 are edit distance 1 away.
1,123 are edit distance 2 away.
218 are edit distance 3 away.
49 are edit distance greater than 3 away.
I gotta bold this to make sure it's seen
Around 94% of ED+-only vocabulary is within edit distance ≤2 of 0ED.
Let me put that another way.
Out of 4,260 unique tokens in ED+ pages,
3,993 can be made into a 0ED token by changing at most 2 characters.
I had to keep repairing things...
I set up a chain: I took all of the tokens that were edit distance 1 from a 0ED token, added them to the pool of reachable tokens, and then re-checked all of the tokens that were edit distance >= 2 against that expanded pool.
Repeating this chain of checking and re-checking edit distance came to an abrupt stop at generation 6:
Gen 1: 2,337
Gen 2: 630
Gen 3: 160
Gen 4: 30
Gen 5: 9
Gen 6: 1
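The generation-by-generation chaining described above can be sketched as follows, assuming plain Levenshtein distance (the post doesn't specify which edit operations it allows per step). The seed and pool vocabularies are toy examples:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def chain_generations(seed_vocab, pool, max_dist=1):
    """Repeatedly absorb pool tokens within `max_dist` edits of anything
    already reached; returns per-generation counts and the leftovers."""
    reached, remaining, gens = set(seed_vocab), set(pool), []
    while True:
        new = {t for t in remaining
               if any(edit_distance(t, r) <= max_dist for r in reached)}
        if not new:
            return gens, remaining
        gens.append(len(new))
        reached |= new
        remaining -= new

seed = {"daiin"}
pool = {"dain", "dai", "saiin", "xxxx"}
gens, left = chain_generations(seed, pool)
print(gens, left)
```

The full scan is quadratic in vocabulary size, so a real run over thousands of types would want indexing, but the logic is the same.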
1,093 tokens were still unreachable. So, I relaxed the rules a bit: first I allowed edit distance 2, and then edit distance 3.
After 2 rounds of editing like this, I was left with 15 tokens that could not be chained back to 0ED. Every single one of those tokens was longer than 8 characters, so I considered them to be possible transcription errors or concatenations of two words. I compared them to shorter tokens and was able to split each into 2 words, all of which were then edit distance 1 or 2 from a 0ED token or a previously repaired token.
OK, so every single token on the ED+ pages could be chained, at edit distance 3 or less per step, back to a 0ED token.
I checked Zandberg/Landini.
99.07% of the ED+-only vocabulary was absorbed within ≤3 edits.
I had 45 tokens left over. 45 / 45 (100%) have a split where both halves were within ≤3 edits of another checked token.
Ok,... that can't possibly be right. It means I can edit any word a few times... ok 6 at most, maybe 7, and I can make every single word from one half a book match the other half.
Voynich words are more similar to one another than words in Latin or English are. To old hands at Voynich research, this is no huge surprise. But it does show that what appear to be two very different sections of the book are only a few edit distances apart.
Conclusion.
I'm likely going to get beat up over this but, here goes:
Currier languages A and B are not distinct languages, and Currier himself noted that.
0ED pages were likely created prior to the ED+ pages. I said likely! I don't have solid proof but the difference in vocabulary and bigrams suggests it.
0ED and ED+ are not behaving like normal text. Well, the whole Voynich doesn't behave like normal text so no surprise there.
0ED and ED+ look like two regimes, but not two vocabularies: ED+ is almost entirely built from 0ED by tiny edits. The same "engine", different settings.
So, I hope I've given enough evidence to show how these two regimes are different, but the same underlying system. I'll be interested in hearing your thoughts.
The big question now is:
Why does a lexical "engine" make a drastic switch like "ed" if the vocabulary isn’t actually changing much?
I think I can answer that. But that's for another post.
Disclaimer: I have tried to review all of these numbers and I believe them to be reasonably accurate. I may have missed some but hopefully nothing drastic.
Functional Resolution: The Reactive Geometric Labeling Model (RGLM) and the Entropy Anomaly
Hi everyone,
I have been working on a functional approach to the MS 408 text-image relationship, and I am excited to share a formal model that provides a reproducible explanation for the manuscript's low entropy.
Instead of looking for a natural language or a complex cipher, my research focuses on Reactive Geometric Labeling (RGLM / MEGR in Spanish). The core thesis is that the "labels" and text blocks are isomorphic to the visual morphology, density, and spatial distribution of the illustrations.
Key findings of the model:
• Isomorphic Determinism: The word length and prefix/suffix distribution (like the D/C and O/SH families) correlate directly with the geometric complexity of the drawing.
• Entropy Resolution: The low entropy isn't a linguistic feature but a functional one; the "vocabulary" is constrained by the recurring visual patterns it labels.
• Predictability: The model allows us to predict certain lexical clusters based on the specific arrangement of botanical or pharmaceutical elements in the folios.
I have registered the full methodology and the preliminary report on Zenodo to ensure open access and peer review.
I would love to hear your thoughts, especially from those focused on computational linguistics and pattern recognition. I am open to testing the model against specific folios suggested by the community.
Best regards, Emmanuel Jiménez Independent Researcher
"Figure 1: Application of the Reactive Geometric Labeling Model (RGLM) on folio 2r. Note the correlation between visual complexity and specific lexical clusters."
New member here. I want to be upfront about a few things before presenting what I've been working on, because I know this community has seen a lot of "I've cracked it" posts, and I don't want to waste your time.
What this is not:
- A decipherment
- A translation
- A claim that I know what language the VM is written in
What this is:
- A morphological model that decomposes ~30% of the VM vocabulary cleanly using a prefix-root-suffix system
- A hypothesis (the "corrupted copy" hypothesis) that explains why the remaining 70% resists decomposition
- A set of testable structural predictions, some of which I've checked against known data and which seem to hold
How it started:
This came out of a completely unrelated discussion about the Phaistos Disc, where I was exploring structural-functional approaches to undeciphered texts with an AI language model (Claude). On a whim, I tried applying the same approach to the VM — building a synthetic grammar from scratch based purely on internal structure, with no prior assumption about what language family it belongs to. I expected it to fail. It didn't fail as completely as I expected.
The core idea in brief:
The VM text behaves like an agglutinative language with stable prefixes (qo-, ch-/che-, sh-, da-, ol-/o-), productive roots (ke-, te-, ka-), and grammatically meaningful suffixes (-dy, -y, -in, -ain). These decompose the highest-frequency words cleanly: qokedy, qokeedy, qokeey, chedy, shedy, daiin, dain, and their families.
The model makes specific positional predictions that appear to hold:
- Words ending in -in/-ain avoid line-final position (they mark continuation)
- Words ending in -y can close lines (terminal)
- da- words (daiin, dair) are strongly line-initial (>20% of occurrences)
I then tested this against the full transcription (Takahashi version from the Stolfi interlinear) line by line. Results: roughly 30% clean decomposition, 38% partial, 32% fail. The failures concentrate on gallows characters, which the model doesn't address at all — that's the biggest gap.
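The positional predictions listed above can be checked mechanically. Here is a minimal sketch of the kind of test involved, using invented EVA-like lines rather than the real transcription; positional_profile is my own helper, not part of the attached paper:

```python
def positional_profile(lines, suffix):
    """Fraction of tokens ending in `suffix` that sit line-final.
    A value near 0 on real data would support the 'continuation
    marker' prediction for -in/-ain."""
    hits = final = 0
    for line in lines:
        toks = line.split()
        for i, t in enumerate(toks):
            if t.endswith(suffix):
                hits += 1
                final += (i == len(toks) - 1)
    return final / hits if hits else None

# toy EVA-like lines (illustrative only, not real transcription data)
lines = [
    "daiin chedy qokeedy chedy shedy",
    "daiin shedy qokain chol dy",
    "dain qokedy chedy otedy ary",
]
print(positional_profile(lines, "ain"))
print(positional_profile(lines, "dy"))
```

On real data one would also need a null model (shuffled lines) to show the avoidance is stronger than chance, but this is the basic measurement.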
The "corrupted copy" hypothesis:
The reason I think the remaining 70% is noisy rather than wrong: the VM may not be an original composition. If a 15th-century European scribe copied an older text in a language they couldn't read — character by character, purely as visual patterns — you'd expect exactly the pattern we see: high-frequency morphemes preserved (because the scribe's hand learned them as motor patterns), interior morpheme boundaries smeared, rare forms absorbed into common ones, and vowel distinctions (the e/ee/eee system) rendered inconsistently.
This would also explain why the VM has natural-language statistics but resists decipherment: the source was a natural language, but the copying process added a layer of systematic noise.
Typological direction (most speculative part):
The morphological profile — agglutinative, prefix-based, ergative-looking alignment (da- as possible ergative marker), SOV-compatible word order, r/l alternation in the ol/or/al/ar system — doesn't match any European language. It does align typologically with Hurrian, Urartian, and Northeast Caucasian languages. I'm not claiming the VM is in Hurrian. I'm saying the type of grammar matches that corridor better than anything in Europe. The Diakonoff-Starostin "Alarodian" connection means these structural features are shared across multiple families in the region.
What I'm looking for from this community:
1. Has anyone tested positional constraints of suffixes systematically? The -in/-ain line-avoidance and da- line-initial preference are the model's strongest testable predictions. If these are already known/published, I'd like to know.
2. Gallows integration. The model completely ignores gallows characters (~15-20% of the text). If anyone has ideas about how cth/ckh/cph/cfh might fit into an agglutinative prefix system, I'm very interested.
3. Currier A vs B. The model currently treats the text as uniform. If A and B have different morphological profiles, that's important — it could mean different source texts, different scribal hands, or dialectal variation.
4. Where am I reinventing the wheel? I'm new to VM research specifically. If someone has already proposed an agglutinative model, or tested prefix/suffix positional behavior, or explored Caucasian typological parallels, please point me to their work. I'd rather build on existing research than duplicate it.
5. Where am I obviously wrong? I can take it. That's why I'm here.
Full paper attached. It includes the complete morpheme inventory, decomposition test results with line-by-line analysis, the corrupted copy argument, typological comparison with Hurrian/NEC languages, historical transmission scenarios, and full references.
Transparency note: This was developed collaboratively with Claude (Anthropic's AI). The hypothesis and direction are mine; the systematic testing, frequency analysis, and typological comparison were done with AI assistance. The AI was also used to stress-test the model — the initial assessment was actually quite harsh, identifying major gaps (gallows, semantic unfalsifiability, cherry-picked examples) before we refined the framework. I mention this because I think the methodology is legitimate and worth being honest about.
Thanks for reading. Looking forward to being told why I'm wrong.
I am new, so please forgive my lack of completely understanding the current state of the research.
I was wondering what the current status is of the hypothesis that Voynichese is a delta cipher, in other words, the idea that it is the transition from one word to the next that encodes information -- which characters are dropped and added and where, etc. This could encode plaintext letters or the numbers of an intermediate Polybius Square. I know papers like those of Timm and Schinner looked at word similarity in transitions, but I couldn't tell the degree to which this sort of encoding was ruled out. I also would have expected modern professional cryptography to have cracked it by now if this were the case, but perhaps I have too much faith in that.
Ok, so I have this paper I've been working on, and I have a very rough draft on Zenodo. I've decided to post the things I've been digging into on ninja in the hope that additional pairs of eyes will clue me in to things I've been missing before I make a complete fool of myself and submit it for peer review. For all of these tests, I've used the Takahashi EVA transliteration (I'm old and have used it for years), cross-verified against the Zandberg/Landini EVA.
I'm going to try to break all of this down into multiple posts because I have a lot of territory to cover. Each will refer back to previous ones. Much of what I'll cover won't be new territory to the old hands at the Voynich. Some may be.
The bigram "ed"
It's been known for many years that the bigram "ed" is just plain odd. It occurs in the Voynich as a midfix 4,474 times and as a suffix 186 times, and never as a prefix. That may not sound remarkable on its own, but this chart shows just how striking it is.
That is "ed" compared to the top 100 bigrams by total count and by percentage of pages. It is in the top 10 by total bigram count (#9), yet it occurs on only about 56% of pages.
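The positional breakdown above (midfix vs. suffix vs. prefix) is straightforward to reproduce. Here is a small sketch of the counting logic on toy EVA-like tokens; the token list is illustrative, not the real transliteration.

```python
from collections import Counter

def bigram_positions(tokens, bigram="ed"):
    """Count where a bigram occurs inside tokens: as a prefix
    (token start), suffix (token end), or midfix (anywhere else)."""
    counts = Counter()
    n = len(bigram)
    for tok in tokens:
        for i in range(len(tok) - n + 1):
            if tok[i:i + n] != bigram:
                continue
            if i == 0:
                counts["prefix"] += 1
            elif i == len(tok) - n:
                counts["suffix"] += 1
            else:
                counts["midfix"] += 1
    return counts

# Toy tokens only; run against the Takahashi transliteration
# to reproduce the midfix/suffix/prefix figures quoted above.
print(bigram_positions(["qokedy", "chedy", "oked", "edaiin"]))
```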
Currier and "ed"
Currier noticed a difference when he described his language A and language B, but he could never quite put his finger on all of the differences between the two. I'll suggest that the big thing he noticed was the bigram "ed".
This chart shows the locations of the bigram "ed", with the background shaded to represent Currier A and Currier B. The dot colors are: blue = no "ed" bigrams on the page, orange = exactly one, and green = two or more.
Side note: you'll notice two orange dots early in the herbal section, one of them on f11r. On both of those pages, "ed" occurs once, inside a hapax token. In total, there are 19 pages where "ed" occurs only once; on 6 of those it sits inside a hapax token.
So, just from comparing Currier's partition to "ed", we see a very close match. He apparently never assigned the zodiac section to either language, so it has a white background.
"ed" by section
The first thing I noticed was that the first 25 folios contain only those two occurrences of "ed". That seemed pretty odd for a bigram that's in the top 10 by count. So, I decided to dig further.
This chart shows the bigram "ed" by section. I lumped the occurrences into buckets: no "ed" on the page, one "ed" per page, and low, medium, and high buckets that split the per-folio "ed" count into three groups of roughly 40 pages each. This chart is also normalized by folio word count, which shows the differences even better than the previous chart. On the left you'll again see the first 25 folios, with only the two hapax-token occurrences. At f26r, "ed" gets introduced, but not all at once: it skips around between pages with "ed" and pages without. The pharma section does the same thing; some pages have "ed", some do not. The same goes for the zodiac: about half have either no "ed" or a single "ed". Balneo, rosettes, and recipes all have the highest count and ratio of "ed" in the entire Voynich.
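The bucketing and normalization described above can be sketched as follows. This is my reconstruction of the logic, not the author's actual script: the 2+ pages are ranked by raw count and cut into tertiles, and each page gets an "ed per 1000 words" rate.

```python
def bucket_pages(ed_counts, word_counts):
    """Assign each page a bucket (none / single / low / medium / high)
    by raw 'ed' count, and a rate normalized per 1000 words.
    Pages with 2+ occurrences are split into tertiles by count."""
    pages = sorted(ed_counts)
    multi = sorted((p for p in pages if ed_counts[p] >= 2),
                   key=lambda p: ed_counts[p])
    tertile = max(1, len(multi) // 3)  # roughly equal thirds
    buckets = {}
    for p in pages:
        c = ed_counts[p]
        if c == 0:
            label = "none"
        elif c == 1:
            label = "single"
        else:
            rank = multi.index(p)
            label = ("low", "medium", "high")[min(rank // tertile, 2)]
        rate = 1000.0 * c / word_counts[p]  # ed per 1000 words
        buckets[p] = (label, rate)
    return buckets
```

With real per-folio counts, the normalized rate is what lets a short recipes page and a dense balneo page be compared fairly.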
"ed" by sheet?
Now here's where things get a bit strange. I'm not going to interject my own theory here; I'm going to be really interested in hearing yours.
I downloaded the quire diagram from Voynich.nu and converted it into a CSV that I could import into Python. I then changed the background color to match the quire sheet number. With one exception (f27v), all of the herbal-section pages where the bigram "ed" is highest fall on the same sheet. But they're intermixed with sheets that contain no "ed" at all.
F26 and F31 are on sheet 2
F33 and F41 are on sheet 1
F34 and F40 are on sheet 2
F41 and F48 are on sheet 1
F43 and F46 are on sheet 3
F50 and F55 are on sheet 1
You can also see a similar pattern in pharma. All of its pages have a relatively low "ed" count, with those in the middle having a higher normalized count, and those appear on sheets marked as sheet 1. Again, no theory, but if the Voynich is in some semblance of chronological order, this, combined with the no-"ed" pages in other sections, made me seriously scratch my head.
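For anyone who wants to replicate the sheet grouping, here is a sketch of loading a folio-to-sheet CSV and attaching "ed" counts per sheet. The column names (`folio`, `quire`, `sheet`) are my assumptions; adjust them to match however you export the Voynich.nu quire diagram.

```python
import csv
from collections import defaultdict

def sheets_by_ed(csv_path, ed_counts):
    """Group folios by (quire, sheet) from a folio->sheet CSV and
    report each folio's 'ed' count, so ed-rich and ed-free sheets
    can be compared directly.  Column names are assumed, not fixed."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            key = (row["quire"], row["sheet"])
            groups[key].append(row["folio"])
    return {key: {f: ed_counts.get(f, 0) for f in folios}
            for key, folios in groups.items()}
```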
Which came first?
One thing Dr. Davis has mentioned in some of her talks is that she believes the folios are not in their original order (I can't wait to see the results of that!). Looking at these charts, it struck me as interesting that the "ed" bigram appears in clusters and groups, not so much by region as by quire sheet. Since we truly have no idea what order this book was written in, I developed a theory. Assume that all of the pages where "ed" never occurs, or occurs only inside a hapax token, were created first, and that the bigram "ed" was brought into prominence later (or the reverse of that). What kind of differences would the two groups have? So, I split the Voynich into two "halves": the 0ed half, pages where "ed" never occurs or occurs once inside a hapax token, and the ed+ half, pages where it occurs at least once outside a hapax token.
Here's a CSV list of the pages I identified and began classifying as 0ed and ed+ pages.
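The split rule above can be stated compactly in code. This is a sketch of my reading of the criterion (a page is 0ed unless it has at least one "ed" inside a non-hapax token); the function and argument names are mine.

```python
def classify_page(ed_tokens, token_freq):
    """Classify a page as '0ed' or 'ed+' under the split described
    above: pages with no 'ed', or with 'ed' only inside hapax
    tokens, count as 0ed; any 'ed' in a non-hapax token makes the
    page ed+.
    ed_tokens: the tokens on this page that contain 'ed'.
    token_freq: corpus-wide frequency of each token type."""
    non_hapax = [t for t in ed_tokens if token_freq.get(t, 0) > 1]
    return "ed+" if non_hapax else "0ed"
```

Applied over all pages, this reproduces the two "halves" listed in the CSV, and makes the borderline cases (the 6 hapax-only pages) explicit rather than judgment calls.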
So, this is how I entered the rabbit hole. There's a bit to digest here when you consider the implications so I'll end the post here. But, there's also lots more to pile on top of this so I'll be referring back to this post. I'll be sure to link it when I continue this in a new thread in the near future™.
Thanks for looking it over and I'm eager to hear opinions.
I wanted to open a discussion on a specific possibility that seems to be gaining traction with the recent studies coming out. We have all seen the discussion around Greshko's "Naibbe Cipher" and how it generates text that statistically resembles the Voynich.
Then there is Pincar's model, which identifies a very specific dependency in the text. Essentially, he demonstrates that the ciphertext depends not just on the current symbol but on the previous one as well, meaning the system has "memory", or context. And let's be clear, he himself highlights the following in his article: "This model identifies the structure, but not the content. We cannot determine: the identity of the source language, the semantic meaning of any word, or whether the manuscript contains meaningful information."
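One simple way to see what "memory" means here, quite apart from Pincar's specific model, is to compare the entropy of the next character with and without knowledge of the previous one. This is a generic illustration of the measurement, not his method:

```python
import math
from collections import Counter

def conditional_entropy(text):
    """Return (H(X), H(X|prev)) for the characters of text.
    If conditioning on the previous character sharply lowers the
    entropy of the next one, the sequence has one-step 'memory'."""
    pairs = Counter(zip(text, text[1:]))       # (prev, next) counts
    prevs = Counter(text[:-1])                 # prev counts
    total = sum(pairs.values())
    h_x = -sum((c / total) * math.log2(c / total)
               for c in Counter(text[1:]).values())
    h_cond = -sum((c / total) * math.log2(c / prevs[a])
                  for (a, b), c in pairs.items())
    return h_x, h_cond
```

On a fully deterministic sequence like "abababab", H(X|prev) drops to zero while H(X) stays near 1 bit; a strongly context-dependent cipher would sit somewhere in between, which is the kind of gap these studies quantify.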
If we start from that basis, that there is a mechanical dependency between the previous state and the current one, could we be looking at the text output of a three-part volvelle or cipher disk?
A device with concentric rings (outer, middle, inner) would naturally force the rigid Prefix + Root + Suffix word structure that we see throughout the manuscript.
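To make the idea concrete, here is a toy simulation of such a device. Everything here is invented for illustration: the ring inventories are EVA-flavored made-up glyph groups, and the advancement rule (each ring turns by an amount derived from the previous output) is just one way a physical wheel could give the text the one-step memory discussed above.

```python
import random

def volvelle_words(n, seed=0):
    """Generate n words from a mock three-ring volvelle: each word is
    prefix + root + suffix, and each ring advances by an amount that
    depends on what was just produced, so the current word depends
    mechanically on the previous state."""
    outer = ["qo", "o", "ch", "sh", ""]          # prefix ring
    middle = ["ked", "ted", "kai", "tai", "ol"]  # root ring
    inner = ["y", "dy", "in", "aiin", ""]        # suffix ring
    rng = random.Random(seed)                    # random start position only
    pos = [rng.randrange(5) for _ in range(3)]
    words = []
    for _ in range(n):
        p, r, s = outer[pos[0]], middle[pos[1]], inner[pos[2]]
        words.append(p + r + s)
        # Deterministic advancement keyed to the previous output:
        # this is where the "memory" enters.
        pos[0] = (pos[0] + len(r)) % 5
        pos[1] = (pos[1] + len(s) + 1) % 5
        pos[2] = (pos[2] + len(p)) % 5
    return words
```

A useful test of the volvelle hypothesis would be whether any such advancement rule can reproduce not just the Prefix + Root + Suffix skeleton but the observed statistics (e.g. the positional behavior of bigrams like "ed").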
I am curious to hear your thoughts. At first glance, what jars you about this idea? Does a mechanical "wheel" explanation fail to account for any specific linguistic features you've noticed?