02-11-2025, 12:19 AM
Since some will think this is a joke or AI trash, I ask one thing. My preprint is under review on Zenodo. Allow me the courtesy of waiting until that review is approved or denied; if approved, I'll post the link and you can decide. If denied, you can move this to GPT Garbage and I'll shut my mouth.
After a full-scale computational analysis of the Voynich text using the public EVA transcription by Stolfi (via voynich.nu), I can now demonstrate that Voynichese is not random, not a hoax, and not a cipher — it is a language system with measurable grammar. Over eighty thousand tokens were segmented into morphemes using unsupervised boundary tests, yielding roughly 4,700 unique roots and affixes. These distribute into a strict four-slot sequence (prefix → root → stem → postfix), consistent across herbal, astronomical, and recipe sections.
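For anyone who wants to poke at the segmentation step, here is a minimal sketch of one standard unsupervised boundary test (right branching entropy over character prefixes). It is an illustration rather than my exact pipeline; the threshold and the toy EVA-style tokens are placeholders.
[code]
# Minimal sketch of an unsupervised boundary test (branching entropy).
# Threshold and toy tokens are placeholders, not the values used in the study.
from collections import Counter, defaultdict
from math import log2

def right_branching_entropy(words):
    """Entropy of the next character given each word prefix."""
    successors = defaultdict(Counter)
    for w in words:
        w = w + "$"                       # end-of-word marker
        for i in range(1, len(w)):
            successors[w[:i]][w[i]] += 1
    entropy = {}
    for prefix, counts in successors.items():
        total = sum(counts.values())
        entropy[prefix] = -sum((c / total) * log2(c / total)
                               for c in counts.values())
    return entropy

def segment(word, entropy, threshold=1.5):
    """Insert a boundary wherever branching entropy spikes above threshold."""
    cuts = [0]
    for i in range(1, len(word)):
        if entropy.get(word[:i], 0.0) >= threshold:
            cuts.append(i)
    cuts.append(len(word))
    return [word[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

# toy usage with EVA-style tokens
tokens = ["qokeedy", "qokedy", "okedy", "chedy", "shedy", "qokaiin", "okaiin"]
H = right_branching_entropy(tokens)
for t in tokens:
    print(t, "->", segment(t, H))
[/code]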
The key finding: every valid word form obeys the same internal rule chain. When the slot transitions are multiplied (0.540 × 0.463 × 0.320), the product is 0.080, matching the 8.0 % rate of grammatically complete words observed in the corpus (95 % CI: 0.078–0.082). Randomization and ablation tests destroy this ratio completely, showing that the structure is not a statistical coincidence.
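The arithmetic is easy to check with the numbers quoted above; note that multiplying the three transitions treats the slots as approximately independent, which is part of what the randomization tests probe.
[code]
# Quick check of the slot-transition product against the observed rate.
# The three probabilities and the 8.0 % figure are the ones quoted above.
p_prefix_root  = 0.540
p_root_stem    = 0.463
p_stem_postfix = 0.320

product = p_prefix_root * p_root_stem * p_stem_postfix
print(f"product of slot transitions: {product:.4f}")   # ~0.0800
print("observed complete-word rate: 0.080 (95% CI 0.078-0.082)")
[/code]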
Conditional entropy, KL divergence, and HMM syntax modeling all converge on the same conclusion: Voynichese exhibits predictive, rule-based morphology indistinguishable from natural-language behavior. The effect persists through control tests and holds across all sections of the manuscript.
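For readers who want to see what two of those statistics look like in code, here is a small sketch of conditional entropy over adjacent tokens and a KL-style divergence against the independence model you get by shuffling. The toy token list is illustrative only, and the HMM step is not shown here.
[code]
# Conditional entropy H(next | current) and KL divergence between the
# observed bigram distribution and the independence (shuffled) model.
# Toy tokens only; the real runs use the full EVA corpus.
from collections import Counter
from math import log2
import random

def conditional_entropy(tokens):
    """H(next | current) over adjacent token pairs, in bits."""
    pairs = list(zip(tokens, tokens[1:]))
    bigrams, firsts, n = Counter(pairs), Counter(x for x, _ in pairs), len(pairs)
    return -sum((c / n) * log2(c / firsts[x]) for (x, _), c in bigrams.items())

def kl_vs_independence(tokens):
    """KL( P(x,y) || P(x)P(y) ): how far real bigrams sit from a shuffled control."""
    pairs = list(zip(tokens, tokens[1:]))
    bigrams, n = Counter(pairs), len(pairs)
    firsts, seconds = Counter(x for x, _ in pairs), Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((firsts[x] / n) * (seconds[y] / n)))
               for (x, y), c in bigrams.items())

tokens = "daiin chedy qokeedy shedy daiin qokaiin chedy daiin shedy qokeedy".split()
shuffled = tokens[:]
random.shuffle(shuffled)
print("H(next|current) real:    ", round(conditional_entropy(tokens), 3))
print("H(next|current) shuffled:", round(conditional_entropy(shuffled), 3))
print("KL vs independence, real:", round(kl_vs_independence(tokens), 3))
[/code]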
Cross-section generalization (no randomization): a simple next-token model trained on the herbal section achieves ≈ 50 % top-1 accuracy on held-out astronomical and recipe lines, versus ≈ 29 % for a unigram baseline — evidence of genuine structural consistency.
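A minimal version of that cross-section test: fit a bigram next-token predictor on one section, score top-1 accuracy on lines from another, and compare against an always-predict-the-most-frequent-token baseline. The tiny section lists below are placeholders for the real herbal and astronomical splits.
[code]
# Bigram next-token predictor trained on one section, scored on another,
# versus a unigram (most frequent token) baseline. Section data here is
# a placeholder for the EVA transcription split by manuscript section.
from collections import Counter, defaultdict

def fit_bigram(lines):
    nxt = defaultdict(Counter)
    for line in lines:
        for a, b in zip(line, line[1:]):
            nxt[a][b] += 1
    return {a: c.most_common(1)[0][0] for a, c in nxt.items()}

def top1_accuracy(model, fallback, lines):
    correct = total = 0
    for line in lines:
        for a, b in zip(line, line[1:]):
            correct += (model.get(a, fallback) == b)
            total += 1
    return correct / total if total else 0.0

herbal = [["qokeedy", "chedy", "daiin"], ["shedy", "qokaiin", "chedy", "daiin"]]
astro  = [["qokeedy", "chedy", "daiin"], ["shedy", "chedy", "daiin"]]

model = fit_bigram(herbal)
fallback = Counter(t for line in herbal for t in line).most_common(1)[0][0]
print("bigram model accuracy:", round(top1_accuracy(model, fallback, astro), 3))
print("unigram baseline:     ", round(top1_accuracy({}, fallback, astro), 3))
[/code]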
Finally, position-based modeling shows that the grammar itself is spatially ordered: the relative probabilities of form-classes change smoothly with token position in each line, producing left-to-right grammatical zones that collapse to noise under randomization.
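A sketch of that positional profile: bin each line into relative positions and track the share of each form-class per bin. The classify() rule below is a stand-in keyed on the final character; the real form-classes come from the morphological model described above.
[code]
# Position-dependent form-class profile: relative class frequencies per
# line-position bin. classify() is a placeholder rule, not the study's model.
from collections import Counter, defaultdict

def classify(token):
    # placeholder form-class rule based on the token's final character
    return {"y": "A", "n": "B", "l": "C"}.get(token[-1], "D")

def positional_profile(lines, n_bins=5):
    """Form-class distribution per relative line position."""
    bins = defaultdict(Counter)
    for line in lines:
        for i, tok in enumerate(line):
            pos = min(int(n_bins * i / len(line)), n_bins - 1)
            bins[pos][classify(tok)] += 1
    return {pos: {cls: c / sum(counts.values()) for cls, c in counts.items()}
            for pos, counts in sorted(bins.items())}

lines = [["qokeedy", "chedy", "daiin", "chol"],
         ["shedy", "qokaiin", "daiin", "dal"],
         ["qokedy", "chedy", "okaiin", "chol"]]
for pos, dist in positional_profile(lines).items():
    print(pos, {k: round(v, 2) for k, v in sorted(dist.items())})
[/code]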
All work is reproducible using public data; raw statistics and code are privately archived for academic release. I’m seeking collaborators with linguistic expertise to help formalize and publish the results.
Forgive me if I have posted this in the wrong place — this is my first post here. And if you wish to comment, reply, or contact me, feel free. However, I’m not a linguist — I’m a truck driver. Consider your vocabulary accordingly.
Reference:
Data derived from the public EVA interlinear transcription archive by Jorge Stolfi (voynich.nu), accessed via Volker Tamagothi's extractor interface.
[attachment=11961]
Figure 1 - Voynichese Patterns Align with Natural Language, Not Random Text. Normalized conditional entropy: Voynich clusters tightly with the Latin Vulgate, while the shuffled text shoots toward randomness.
[attachment=11963]
Figure 2 - Token-length histogram (log-normal decay). Word-length stability across the corpus matches natural-language distributions, confirming internally consistent morphology.
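If anyone wants to reproduce the Figure 2 check, a log-normal fit to token lengths takes only a few lines; scipy is used just for the fit, and the token list stands in for the full corpus.
[code]
# Sketch of the Figure 2 check: a log-normal fit to token lengths.
# The token list is a placeholder for the ~80k-token EVA corpus.
import numpy as np
from scipy import stats

tokens = "daiin chedy qokeedy shedy qokaiin chol dal okaiin qokedy otedy".split()
lengths = np.array([len(t) for t in tokens], dtype=float)
shape, loc, scale = stats.lognorm.fit(lengths, floc=0)
print("mean token length:", round(lengths.mean(), 2))
print("log-normal sigma:", round(shape, 3), "median (scale):", round(scale, 3))
[/code]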
[attachment=11964]
Figure 3 - Voynich transition probability matrices. The structured corpus (left) shows coherent token-to-token dependencies absent in the randomized control (right), demonstrating non-random grammatical behavior.
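The matrices in Figure 3 are row-normalized bigram counts; one way to build them is sketched below, with the randomized control being the same computation on shuffled tokens.
[code]
# Row-normalized token-to-token transition matrix, as in Figure 3.
# The randomized control is the same matrix computed after shuffling.
import numpy as np

def transition_matrix(tokens):
    vocab = sorted(set(tokens))
    idx = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(tokens, tokens[1:]):
        M[idx[a], idx[b]] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return vocab, np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)

tokens = "daiin chedy qokeedy chedy daiin shedy chedy daiin".split()
vocab, T = transition_matrix(tokens)
print(vocab)
print(T.round(2))
[/code]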
[attachment=11983]
Figure 4 - Position-Dependent Grammatical Structure in Voynichese. The relative probabilities of four anonymous form-classes shift smoothly with token position, forming stable left-to-right grammatical zones throughout the manuscript. In randomized controls these gradients vanish, confirming positional syntax and rule-governed word formation unique to genuine language systems.
