09-09-2025, 03:34 PM
Hello,
Rather than trying to decode the VMS outright, I think we can answer other questions about the manuscript with the help of AI. For example, we should be able to decide with high probability whether the text carries real meaning or not, and I propose a system for doing that.
My own take: what best fits the evidence is that someone commissioned the VMS to be exactly what it is: a mysterious, impressive book that nobody can read because it contains no meaning. This is highly plausible and fits the physical, linguistic, and cultural evidence of the time. I believe it's a pseudo-text, made to look like a language but carrying no actual meaning, produced to impress, and I think AI can help us decide one way or the other.
The question to explore is: what kind of system is this text most like?
The idea is to generate large sets of synthetic manuscripts under different assumptions and see which "universe" the Voynich statistically belongs to. For example:
- Ciphered Latin/Hebrew/Italian texts (various substitution styles)
- Real languages reformatted to look Voynich-like
- Constructed languages with invented grammar
- Structured pseudo-languages (rules for prefixes/suffixes, but no meaning)
- Shorthand/abbreviation systems treated as full glyph sets
Then we can measure each synthetic universe against a Voynich "fingerprint panel" (word shapes, entropy, Zipf’s law, affix patterns, section differences, etc.). Rather than asking "what does it say?", this approach asks "what system is it most like?" If structured pseudo-language consistently fits better than ciphered Latin or conlang universes, that's powerful evidence.
This wouldn’t solve the translation, but it would be an important step in understanding the MS and it would be one box checked off.
Does this kind of “synthetic benchmarking” sound worth trying? Has anyone attempted something like this at scale?
Anyway, here's where AI did a lot of the work in building an outline for how the experiment might go with only off-the-shelf tools. The goal is to see which universe (ciphered language, real language, conlang, structured pseudo-language, shorthand/abbreviation, etc.) best reproduces the Voynich’s full statistical “fingerprint.”
To be clear, I don't have expertise in this kind of research. I'm only looking at where AI can point us, to help check off some boxes and let those with the expertise run with it.
1) Define the universes (generate many fakes)
Make 200–2,000 synthetic manuscripts, each matched in length and page/line structure to the VM. Each fake follows one hypothesis with tunable knobs:
A. Ciphered Natural Language
- Source corpora: medieval Latin, Italian, Occitan, Hebrew (public domain).
- Ciphers to implement:
- Simple monoalphabetic substitution
- Homophonic substitution (n symbols per plaintext char)
- Syllabic substitution (map digraphs/trigraphs)
- Nomenclator (frequent words → special symbols)
- Extras: line-initial/line-final rules, abbreviation expansion (Latin breviographs), occasional nulls.
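A minimal sketch of two of these cipher generators (monoalphabetic and homophonic), assuming the plaintext corpus is already loaded as a string; the numbered glyph pool here is a placeholder, not a real EVA-style inventory:

```python
import random

def make_glyph_pool(n, rng):
    """Placeholder symbol inventory; a real run would map to an EVA-like glyph set."""
    pool = ["g{:03d}".format(i) for i in range(n)]
    rng.shuffle(pool)
    return pool

def monoalphabetic_cipher(plaintext, seed=0):
    """Simple substitution: each plaintext letter maps to exactly one symbol."""
    rng = random.Random(seed)
    alphabet = sorted({c for c in plaintext if c.isalpha()})
    table = dict(zip(alphabet, make_glyph_pool(len(alphabet), rng)))
    # Keep word boundaries so word-level statistics stay comparable.
    return " ".join("".join(table[c] for c in w if c in table) for w in plaintext.split())

def homophonic_cipher(plaintext, homophones=3, seed=0):
    """Homophonic substitution: each letter maps to one of several symbols,
    chosen per occurrence, which flattens the symbol-frequency distribution."""
    rng = random.Random(seed)
    alphabet = sorted({c for c in plaintext if c.isalpha()})
    pool = make_glyph_pool(len(alphabet) * homophones, rng)
    table = {c: pool[i * homophones:(i + 1) * homophones] for i, c in enumerate(alphabet)}
    return " ".join("".join(rng.choice(table[c]) for c in w if c in table) for w in plaintext.split())

if __name__ == "__main__":
    latin = "in principio creavit deus caelum et terram"
    print(monoalphabetic_cipher(latin))
    print(homophonic_cipher(latin))
```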
B. Real Language (no cipher) shaped to VM layout
- Rewrap real Latin/Italian texts into Voynich-like lines/paragraphs and enforce a few layout quirks (e.g., frequent “q-” line-initial tokens) to probe the effect of mise-en-page alone.
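A toy illustration of the rewrapping step, assuming a words-per-line profile has already been extracted from the VM transcription (the `[4, 5, 6]` profile below is a placeholder):

```python
import random

def rewrap(text, words_per_line_profile, seed=0):
    """Rewrap a real text into lines whose word counts follow a target profile
    sampled from the VM (the profile passed here is a placeholder)."""
    rng = random.Random(seed)
    words, lines, i = text.split(), [], 0
    while i < len(words):
        n = rng.choice(words_per_line_profile)
        lines.append(" ".join(words[i:i + n]))
        i += n
    return "\n".join(lines)

print(rewrap("gallia est omnis divisa in partes tres quarum unam incolunt belgae", [4, 5, 6]))
```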
C. Conlang (meaningful but invented)
- Generators:
- Finite-state morphology (prefix–stem–suffix classes).
- PCFG (probabilistic context-free grammar) with phonotactics.
- Dialects: Currier-A/B style parameter shifts (suffix set, token length).
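A rough sketch of the finite-state morphology idea, with invented (placeholder) affix and phonotactic inventories; the key difference from Universe D is that stems come from a fixed lexicon with Zipf-like reuse, so "content words" recur the way they would in a meaningful text:

```python
import random

# Placeholder inventories; in a real run these are the tunable knobs.
PREFIXES = ["qo", "o", "ch", "sh", ""]
ONSETS = ["k", "t", "d", "l", "r", ""]
NUCLEI = ["a", "e", "o", "ai", "ee"]
SUFFIXES = ["dy", "n", "in", "aiin", "y", ""]

def make_lexicon(n_stems, rng):
    """Invent a fixed stem lexicon so the same 'words' recur across the text."""
    return ["".join([rng.choice(ONSETS), rng.choice(NUCLEI), rng.choice(ONSETS)])
            for _ in range(n_stems)]

def generate_conlang(n_tokens, n_stems=300, seed=0):
    rng = random.Random(seed)
    lexicon = make_lexicon(n_stems, rng)
    weights = [1.0 / (i + 1) for i in range(n_stems)]   # Zipf-like stem reuse
    tokens = [rng.choice(PREFIXES) + rng.choices(lexicon, weights=weights)[0] + rng.choice(SUFFIXES)
              for _ in range(n_tokens)]
    return " ".join(tokens)

print(generate_conlang(30))
```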
D. Structured Pseudo-Language (no semantics)
- Automata with rules like:
- prefix ∈ {qo, q, o, ch, …}, stem over Σ = {a, i, e, o, y}, suffix ∈ {dy, n, in, ain, aiin, …}
- position-dependent variants (line-initial gets more “q”)
- tunable affix productivity, stem entropy, and run-lengths
- Also include a human-plausible generator: Markov/HMM with simple constraints to simulate “fast fluent scribbling.”
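A minimal automaton of this kind, with placeholder affix sets and one simple position-dependent rule; affix productivity, stem length, and the line-initial bias are the knobs to sweep:

```python
import random

PREFIXES = ["qo", "q", "o", "ch", ""]
STEM_CHARS = "aieoy"
SUFFIXES = ["dy", "n", "in", "ain", "aiin", ""]

def pseudo_word(rng, line_initial=False):
    """One meaningless token built from prefix + stem + suffix rules."""
    # Position-dependent variant: line-initial tokens favour q-prefixes.
    prefixes = ["qo", "q"] * 3 + PREFIXES if line_initial else PREFIXES
    stem = "".join(rng.choice(STEM_CHARS) for _ in range(rng.randint(1, 4)))
    return rng.choice(prefixes) + stem + rng.choice(SUFFIXES)

def pseudo_page(n_lines=30, words_per_line=(6, 10), seed=0):
    """One synthetic page with VM-like line structure but no semantics."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        n = rng.randint(*words_per_line)
        lines.append(" ".join(pseudo_word(rng, line_initial=(i == 0)) for i in range(n)))
    return "\n".join(lines)

print(pseudo_page(5))
```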
E. Shorthand/Abbreviation Universe
- Start with Latin prose; compress using a learned set of ~30–60 breviographs/abbreviations (e.g., -us, -rum, -que), then hide the mapping (treat the breviographs as glyphs). Vary aggressiveness.
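A toy version of the compression step; the five-entry table below is a placeholder for the ~30–60 breviographs a real run would curate or learn:

```python
# Placeholder breviograph table: common Latin sequences -> single stand-in glyphs.
BREVIOGRAPHS = {"rum": "#", "que": "&", "per": "%", "pro": "+", "us": "@"}

def abbreviate(latin_text, table=BREVIOGRAPHS):
    """Greedily replace frequent sequences with single 'glyphs', then treat the
    result as an opaque symbol stream (the mapping is hidden from the analysis)."""
    out = []
    for word in latin_text.split():
        for seq, glyph in sorted(table.items(), key=lambda kv: -len(kv[0])):
            word = word.replace(seq, glyph)
        out.append(word)
    return " ".join(out)

print(abbreviate("dominus vobiscum et cum spiritu tuo per omnia saecula saeculorum"))
```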
2) Build the Voynich “fingerprint panel”
Compute the same metrics for the true VM and for every synthetic manuscript:
Token/Type structure
- Zipf slope & curvature; Heaps’ law α, K
- Word-length distribution; KS/AD distance vs VM
- Vocabulary growth by page/quire
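A sketch of these first panel metrics, assuming each manuscript is already tokenized into a plain Python list of word strings (numpy/scipy from the tooling section):

```python
import collections
import numpy as np
from scipy import stats

def zipf_slope(tokens):
    """Slope of log-frequency vs log-rank over the word-frequency list."""
    freqs = np.array(sorted(collections.Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    return stats.linregress(np.log(ranks), np.log(freqs)).slope

def heaps_curve(tokens):
    """Vocabulary size after each token, for fitting Heaps' law V = K * N**alpha."""
    seen, growth = set(), []
    for t in tokens:
        seen.add(t)
        growth.append(len(seen))
    return growth

def word_length_ks(tokens_a, tokens_b):
    """Kolmogorov-Smirnov distance between two word-length distributions."""
    return stats.ks_2samp([len(t) for t in tokens_a], [len(t) for t in tokens_b]).statistic
```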
Local dependencies
- Char bigram/trigram distributions; JS divergence
- Conditional entropy H(Xₙ|Xₙ₋₁) and H(wordₙ|wordₙ₋₁)
- Mutual information vs distance (Hilberg-style curve)
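Sketches for the dependency metrics, computed in bits over a raw character string (word breaks kept as spaces):

```python
import collections
import math

def char_bigram_dist(text):
    """Normalized character-bigram distribution."""
    counts = collections.Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def conditional_entropy(text):
    """H(X_n | X_{n-1}) over characters, in bits."""
    pairs = collections.Counter(zip(text, text[1:]))
    contexts = collections.Counter(text[:-1])
    total = sum(pairs.values())
    return -sum((c / total) * math.log2(c / contexts[a]) for (a, b), c in pairs.items())
```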
Morphology & segmentation
- Morfessor/BPE: number of subunits, affix productivity, stem/affix ratio
- Family proliferation: counts of {dain, daiin, daiiin}-type ladders
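Morfessor/BPE handle the segmentation itself; the ladder and productivity counts can be approximated directly, as in this sketch (the i-run collapsing rule and the candidate suffix list are assumptions for illustration, not established facts about the VM):

```python
import collections
import re

def ladder_families(tokens):
    """Group word types that differ only in the length of an internal i-run
    (e.g., dain / daiin / daiiin), one proxy for family proliferation."""
    families = collections.defaultdict(set)
    for t in set(tokens):
        families[re.sub(r"i+", "i", t)].add(t)
    return {k: sorted(v) for k, v in families.items() if len(v) > 1}

def suffix_productivity(tokens, suffixes=("aiin", "ain", "dy", "in", "y", "n")):
    """Share of word types ending in each candidate suffix (a crude productivity score)."""
    types = set(tokens)
    return {s: sum(t.endswith(s) for t in types) / len(types) for s in suffixes}
```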
Positional/structural signals
- Line-initial vs line-final token profiles
- Paragraph-initial bias
- Page/section (Herbal/Astro/Balne/Recipes) drift metrics (KL divergence)
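A sketch for the positional profiles and the drift metric, assuming each page is represented as a list of lines and each line as a list of tokens:

```python
import collections
import math

def positional_profiles(pages):
    """Counts of which tokens open and close lines across a set of pages."""
    first, last = collections.Counter(), collections.Counter()
    for page in pages:
        for line in page:
            if line:
                first[line[0]] += 1
                last[line[-1]] += 1
    return first, last

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Smoothed KL(P || Q) in bits between two token-count profiles
    (e.g., Herbal vs Recipes sections)."""
    keys = set(p_counts) | set(q_counts)
    p_tot = sum(p_counts.values()) + eps * len(keys)
    q_tot = sum(q_counts.values()) + eps * len(keys)
    return sum(((p_counts.get(k, 0) + eps) / p_tot) *
               math.log2(((p_counts.get(k, 0) + eps) / p_tot) /
                         ((q_counts.get(k, 0) + eps) / q_tot))
               for k in keys)
```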
Compressibility / model fit
- LZMA/PPM/ZPAQ ratios
- n-gram perplexity (n=3..7) trained on one half, tested on the other
- Tiny Transformer perplexity (character-level) trained on each universe, tested on VM (cross-perplexity)
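The compression ratios come straight from the standard library; the cross-perplexity sketch below uses crude add-epsilon smoothing, where a real run would use Kneser-Ney or the small character LM mentioned above:

```python
import bz2
import collections
import lzma
import math
import zlib

def compression_ratios(text):
    """Compressed size / raw size under three standard-library compressors."""
    raw = text.encode("utf-8")
    return {name: len(fn(raw)) / len(raw)
            for name, fn in (("lzma", lzma.compress), ("bz2", bz2.compress), ("zlib", zlib.compress))}

def ngram_cross_perplexity(train_text, test_text, n=4, eps=1e-6):
    """Character n-gram model fit on one text, perplexity measured on another."""
    grams = collections.Counter(train_text[i:i + n] for i in range(len(train_text) - n + 1))
    contexts = collections.Counter(train_text[i:i + n - 1] for i in range(len(train_text) - n + 2))
    vocab = max(len(set(train_text)), 1)
    log_prob, m = 0.0, 0
    for i in range(len(test_text) - n + 1):
        g = test_text[i:i + n]
        p = (grams.get(g, 0) + eps) / (contexts.get(g[:-1], 0) + eps * vocab)
        log_prob += math.log2(p)
        m += 1
    return 2 ** (-log_prob / max(m, 1))
```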
Clustering/embedding
- UMAP/t-SNE on character n-gram vectors; silhouette vs VM cluster
- Rank-order correlation (Kendall τ) of frequency lists
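A sketch using packages from the tooling section (CountVectorizer for char n-grams, umap-learn for the 2-D map); note the Kendall τ comparison only makes sense between samples that share a symbol space, e.g., two sections of the same transcription:

```python
from collections import Counter
import umap                                   # umap-learn
from scipy.stats import kendalltau
from sklearn.feature_extraction.text import CountVectorizer

def embed_corpora(texts, n=3, seed=0):
    """2-D UMAP embedding of character n-gram count vectors, one point per manuscript."""
    X = CountVectorizer(analyzer="char", ngram_range=(n, n)).fit_transform(texts)
    return umap.UMAP(n_components=2, random_state=seed).fit_transform(X)

def rank_correlation(tokens_a, tokens_b, top_k=200):
    """Kendall tau over the frequencies of shared top-k word types."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    shared = [w for w, _ in ca.most_common(top_k) if w in cb]
    tau, _pvalue = kendalltau([ca[w] for w in shared], [cb[w] for w in shared])
    return tau
```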
3) Scoring: which universe fits best?
Use multiple, complementary criteria:
- Distance aggregation: Normalize each metric to z-scores, then compute a weighted composite distance of each synthetic to the VM. Rank universes by median distance.
- Model selection via Approximate Bayesian Computation (ABC): treat generator knobs as priors, accept parameter settings whose synthetic stats fall within ε of the VM, and compare posterior mass across universes.
- Held-out validation: Fit knobs on half the VM; test distances on the other half (and per section).
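A sketch of the distance aggregation and the rejection-ABC step, assuming the fingerprint panel has been collected into a pandas DataFrame with one row per synthetic, one column per metric, and a 'universe' label (all names here are placeholders):

```python
import numpy as np
import pandas as pd

def z_distances(df, vm_row):
    """Composite z-scored distance of every synthetic manuscript to the real VM.
    df: one row per synthetic, metric columns plus a 'universe' label.
    vm_row: a Series with the same metric columns computed on the VM."""
    metrics = [c for c in df.columns if c != "universe"]
    mu = df[metrics].mean()
    sigma = df[metrics].std(ddof=0).replace(0, 1.0)
    z_syn = (df[metrics] - mu) / sigma
    z_vm = (vm_row[metrics] - mu) / sigma
    return np.sqrt(((z_syn - z_vm) ** 2).sum(axis=1))

# League table: median composite distance per universe.
#   df["distance"] = z_distances(df, vm_row)
#   print(df.groupby("universe")["distance"].median().sort_values())
#
# Rejection ABC: keep synthetics within epsilon of the VM and compare how much
# of the accepted mass each universe retains.
#   accepted = df[df["distance"] < epsilon]
#   print(accepted["universe"].value_counts(normalize=True))
```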
4) Robustness checks
- Ablations: remove line-position rules or suffix ladders—does fit collapse?
- Overfitting guard: ensure no generator is trained directly on VM tokens (only statistics), and verify generalization across sections.
- Adversarial baseline: try to force ciphered Latin to match VM—if it still lags pseudo-language on multiple metrics, that’s strong evidence.
5) Tooling (all off-the-shelf)
- Python: numpy, pandas, scikit-learn, matplotlib, networkx
- NLP/stat: morfessor, sentencepiece (BPE), nltk for n-grams
- Compressors: built-in lzma, bz2, zlib; optional PPMd via a Python wrapper
- Dimensionality reduction: umap-learn, scikit-learn (t-SNE/UMAP)
- Lightweight Transformers (optional): transformers with a tiny char-LM
6) Workflow & timeline (lean team)
Week 1–2: Data wrangling (VM EVA, Latin/Italian corpora), page/line schema, metric code scaffolding
Week 3–6: Implement generators A–E; unit tests; produce first 500 synthetics
Week 7–8: Compute full fingerprint panel; initial ranking
Week 9–10: ABC fitting per universe; robustness/ablations
Week 11–12: Write-up, plots, release code & datasets (repro pack)
7) Readouts you can trust (what “success” looks like)
- A league table: per-universe composite distance to VM (with error bars)
- Posterior plots: which parameter regions (e.g., high suffix productivity, low stem entropy) best match VM
- Confusion matrix from a classifier trained to tell universes apart using the fingerprint; if the VM gets classified as “structured pseudo-language” with high confidence, that’s decisive.
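One way to get that confusion matrix with off-the-shelf scikit-learn, assuming X holds the fingerprint metrics per synthetic and y its universe label (A–E); the random forest is just one reasonable default, not a prescribed choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def universe_classifier(X, y, vm_features):
    """Train a classifier to tell universes apart from fingerprints alone,
    then ask it where the real VM falls."""
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    y_pred = cross_val_predict(clf, X, y, cv=5)        # out-of-fold predictions
    cm = confusion_matrix(y, y_pred)                   # which universes are separable?
    clf.fit(X, y)
    vm_probs = clf.predict_proba(np.asarray(vm_features).reshape(1, -1))[0]
    return cm, dict(zip(clf.classes_, vm_probs))
```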
8) “Citizen-science” version (solo, laptop-friendly)
- Implement Universe D (pseudo-language) and Universe A(1) (mono-substitution over Latin).
- Compute a mini fingerprint: Zipf slope, word-length KS, bigram JS, compression, Morfessor affix productivity.
- Generate 100 synthetics for each universe; plot distance distributions vs VM.
- If pseudo beats ciphered Latin on 4/5 metrics, you’ve got a publishable note.
9) Pitfalls & how to avoid them
- Layout leakage: VM line/page structure matters—always replicate it in synthetics.
- Cherry-picking metrics: pre-register the metric set; report all.
- Over-tuning: do ABC on one half; evaluate on the other.
- Section bias: score by section and overall; the winner should be consistent.