17-01-2026, 02:31 PM
(17-01-2026, 02:17 PM)Jorge_Stolfi Wrote:
Quote: In the meantime, go ahead and ask ChatGPT to build you a 12,000-entry dictionary with 99.67% corpus coverage.
(17-01-2026, 01:57 PM)Rafal Wrote: What are your entries? Words? That makes some problem because Voynich manuscript has only about 8000 unique words.
And many of those are hapaxes (words that occur only once). How could you have deduced their meaning, since it seems that you still don't know precisely what the language is? (Romance and Latin are very different things. For one thing, Romance has no cases...)
All the best, --stolfi
PS. And I bet that if I asked ChatGPT to create a 12,000 entry dictionary with 99.67% coverage of the VMS lexicon, it would not blink once and say "Sure! Here it is ...."
PPS. There is a lot more than 1% of errors in the VMS, including words split in half, joined words, words with illegible or ambiguous glyphs, etc. So if your lexicon has 99.67% coverage, that alone says "bullshit"...
Professor Stolfi - I appreciate the direct challenge. You've done more rigorous statistical work on this manuscript than most, so I'll respond in kind.
On the hapax problem:
You're right that hapaxes can't be deduced from frequency alone. My approach to single-occurrence words:
1. Morphological decomposition - If a hapax follows documented affix patterns, I infer meaning from the components (sketched below). This is standard practice for agglutinative and constructed languages.
2. Section context - A hapax in the herbal section adjacent to a plant illustration has constrained possibilities.
3. Low-confidence flagging - Not all 12,000 entries have equal confidence. Hapaxes are marked accordingly.
Is this perfect? No. But "I don't know" is also an entry in a working dictionary.
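To make item 1 concrete, here is a minimal Python sketch of the kind of decomposition I mean. The affix lists, roots, and glosses are invented placeholders, not entries from my dictionary; the point is only the mechanism: match documented affixes, look up the residue, and never promote a hapax above low confidence.

```python
# Illustrative sketch only: the affixes, roots, and glosses below are
# invented placeholders, not entries from my actual dictionary.

KNOWN_PREFIXES = {"qo": "PREP?", "o": "ART?"}        # hypothetical prefix glosses
KNOWN_SUFFIXES = {"dy": "NOM?", "in": "GEN?"}        # hypothetical suffix glosses
KNOWN_ROOTS    = {"ked": "root-A", "teo": "root-B"}  # hypothetical root glosses

def decompose_hapax(token):
    """Try to analyse a single-occurrence word as prefix + root + suffix.

    Returns (analysis, confidence). Confidence is 'low' by default,
    because a hapax can never be cross-checked against other occurrences.
    """
    for pre in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        for suf in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
            if token.startswith(pre) and token.endswith(suf):
                root = token[len(pre):len(token) - len(suf)]
                if root in KNOWN_ROOTS:
                    analysis = (KNOWN_PREFIXES[pre], KNOWN_ROOTS[root], KNOWN_SUFFIXES[suf])
                    return analysis, "low"           # inferred, never attested twice
    return None, "unknown"                           # the honest "I don't know" entry

print(decompose_hapax("qokeddy"))   # (('PREP?', 'root-A', 'NOM?'), 'low')
print(decompose_hapax("xyzzy"))     # (None, 'unknown')
```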
On "Romance and Latin are very different":
Fair criticism, and I was imprecise. I'm not claiming the manuscript is in Latin or in a Romance language. I'm claiming:
- The pharmaceutical terminology shows phonetic correspondence to Latin botanical/medical terms
- The grammatical structure shows case marking (which rules out pure Romance, as you note)
- The substrate appears to include Semitic elements (Arabic medical vocabulary was pervasive in medieval pharmacy)
"Latin-Semitic hybrid with systematic abbreviation" is closer to my actual claim than "Romance."
On 99.67% coverage being "bullshit":
You make a legitimate point about transcription errors, split words, joined words, and ambiguous glyphs. Let me restate more honestly:
99.67% of tokens in the transcription I used map to dictionary entries. That transcription has errors. My dictionary also has errors. The number reflects the internal consistency of my analysis, not the ground-truth accuracy of the manuscript.
If that distinction wasn't clear, it should have been. Thank you for highlighting it.
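For clarity, this is all the figure means operationally; the token list and dictionary below are placeholders, not my data:

```python
# Sketch of the coverage calculation as I intend it: token coverage against a
# chosen transliteration, not ground-truth accuracy. Data here is a placeholder.

def token_coverage(tokens, dictionary):
    """Fraction of corpus tokens (not unique types) found in the dictionary."""
    mapped = sum(1 for t in tokens if t in dictionary)
    return mapped / len(tokens) if tokens else 0.0

# Hypothetical example: 10,000 tokens, of which 9,967 map to an entry.
example_tokens = ["daiin"] * 9967 + ["?" + str(i) for i in range(33)]
example_dict = {"daiin"}
print(f"{token_coverage(example_tokens, example_dict):.2%}")   # 99.67%
```

Counting tokens rather than unique word types is also why the figure can exceed what 8,000 unique words might suggest: frequent words dominate the token count.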
On your ChatGPT challenge:
You're right that an LLM would happily generate 12,000 plausible-looking entries. The difference is whether those entries:
- Follow consistent morphological rules across the corpus
- Produce contextually appropriate translations by section
- Survive statistical validation against control corpora
I claim mine do. That claim is testable, and I expect to be tested.
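As one example of the kind of control test I have in mind (not my actual pipeline; the rules and sample words below are toy placeholders): compare how often the affix rules parse the real transcription versus the same words with their glyphs shuffled. If the rules fit shuffled text just as well, they are describing glyph frequencies, not morphology.

```python
# Sketch of one possible control test, not my actual validation pipeline.
import random

def parses(word, prefixes=("qo", "o"), suffixes=("dy", "in")):
    """True if the word matches a (hypothetical) prefix+suffix template."""
    return any(word.startswith(p) for p in prefixes) and \
           any(word.endswith(s) for s in suffixes)

def parse_rate(words):
    """Fraction of words the rule set can segment."""
    return sum(parses(w) for w in words) / len(words)

def shuffled_control(words, seed=0):
    """Control corpus: same glyphs per word, order randomised."""
    rng = random.Random(seed)
    out = []
    for w in words:
        chars = list(w)
        rng.shuffle(chars)
        out.append("".join(chars))
    return out

real_words = ["qokedy", "okain", "qokain", "otedy"]   # stand-in sample
print("real:   ", parse_rate(real_words))
print("control:", parse_rate(shuffled_control(real_words)))
```

The same comparison can be run against Latin, Arabic, or meaningless-but-Voynich-like control corpora; the dictionary only earns anything if the real text scores clearly better than the controls.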
