22-05-2026, 06:36 AM
(22-05-2026, 03:54 AM)kckluge Wrote: (21-05-2026, 02:28 PM)Jorge_Stolfi Wrote: Can you be more specific?
Sure. See Schinner, Andreas (2007) 'The Voynich Manuscript: Evidence of the Hoax Hypothesis', Cryptologia, 31:2, 95 - 107 (specifically, p 100-103). Key passage: "The Levenshtein distance of two character strings is an integer ranging from 0 (exact match) to the maximum of the two string lengths (no similarity), denoting the number of elementary edit operations necessary to make both strings equal. Mapping this number to the interval [0,100] yields a 'percentage of dissimilarity' for two tokens. In Figure 3, the similar token repetition distance distribution Pn for the VMS compared with normal texts is presented. Here n denotes the number of other tokens between two similar ones, i.e., n = 0 corresponds to the situation of two alike tokens in immediate vicinity. Two words are considered 'similar' if their dissimilarity as defined above is less or equal to 30%; it turns out that the precise value (+/-10%) of this threshold changes Pn only quantitatively, not qualitatively."
I cannot read that article from home; I will have to go to the university tomorrow for that. Meanwhile, I would say that the statistic above is obviously very dependent on the text, more than on the language. A technical manual, like a materia medica or herbal, can be much more repetitive than a novel. Which sort of Chinese texts did they use?
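Going only by the passage quoted above, and not by the article itself, the statistic can be sketched roughly as follows. The function names, the `max_gap` cutoff, the division by the longer token length, and the choice to count every similar pair (rather than only the nearest one) are assumptions of this sketch, not details confirmed by Schinner's paper.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Levenshtein distance mapped to [0, 100].
    Assumption: the mapping divides by the longer token length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(a, b) / longest


def repetition_distances(tokens, threshold=30.0, max_gap=10):
    """Count similar token pairs by the number n of tokens between them.
    Assumption: every similar pair up to max_gap is counted."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        for n in range(max_gap + 1):
            j = i + n + 1
            if j >= len(tokens):
                break
            if dissimilarity(tok, tokens[j]) <= threshold:
                counts[n] += 1
    return counts
```

Running `repetition_distances` on a herbal, a ship's log, and a novel of equal length would be one way to see how much the resulting distribution depends on the kind of text rather than on the language.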
Quote:Understood, and the word count of the Roger Bacon excerpt was picked to match the sample of Voynich text I was comparing it to.
Again, a discursive text like Roger Bacon's would not be an adequate comparison to the VMS, which looks like it is mostly a herbal and a collection of recipes.
Quote:I think if you're positing rates that are unusually large compared to known rates of scribal errors in manuscript texts then that's potentially a problem (and way back in this thread I suggested using rates of scribal errors in Greek mss. copied by an "illiterate" scribe as a reasonable proxy for the Voynich).
This is a clip from a large manuscript from the St. Gallen monastery in Switzerland
[attachment=15674]
I cannot judge the main text; but the red text in that clip ("Here ends Bonaventura's Dialogue Between Soul and Reason") has five spelling errors in 7 words, which were corrected by another scribe -- an error rate of 71%. Obviously the monk who had been anointed as The Exclusive Keeper Of The Red Inkwell did not know Latin -- not even enough to tell that "Bonaventura" was a proper name; that, being a genitive, the ending should be "-æ", "-ae", or "-ę" (as corrected), not "e"; and that "raaonem" was not a valid word.
As I posted before, I suspect that the main source of errors in the VMS was the Scribe, who apparently did not know the language, misreading the Author's draft -- which possibly was in a semi-cursive handwriting. That would explain, for example, why d seems to have been often replaced by k and l, but not by t or other letters; and why words ending in ir seem to occur in the same contexts as words ending in iin.
Quote:If you think the Voynich Mss. text is Chinese, don't start by translating Voynichese into Chinese. Start by showing how to translate Chinese into Voynichese. And yes, I understand that you don't actually think it's Chinese per se, and per my comment below I understand that you can't follow this path given the nature of your claim. From my point of view, that's a problem.
Well, again, my claimed solution is indeed not of the type that you seem to expect. I am not claiming that I found how to translate Chinese into Voynichese, or vice-versa. I am claiming to have identified the plaintext.
It is as if there were a mysterious encrypted manuscript from 1610, and someone pointed out that the lengths of its words matched those of Act III of Shakespeare's Hamlet:
to be or not to be that is the question whether tis nobler in the
bd df bu rcb hl eh kosh js irb avckrmao rtdhwag dha kdhacd gw vms
Wouldn't that count as a solution, or at least a substantial advance towards it? Even though one would still be unable to specify the encryption scheme (a Vigenère cipher with a one-time pad for key?) or decode any part of the text that was not from Hamlet?
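The word-length comparison in this analogy is easy to check mechanically. Here is a minimal sketch; `length_pattern` and `match_offsets` are hypothetical helper names, and the two strings are just the example lines above.

```python
def length_pattern(text: str) -> list[int]:
    """Sequence of word lengths, splitting on whitespace."""
    return [len(word) for word in text.split()]


def match_offsets(candidate_text: str, cipher_text: str) -> list[int]:
    """Word offsets in candidate_text where the word lengths of
    cipher_text match a run of consecutive candidate words exactly."""
    cand = length_pattern(candidate_text)
    ciph = length_pattern(cipher_text)
    k = len(ciph)
    return [i for i in range(len(cand) - k + 1) if cand[i:i + k] == ciph]


plain = "to be or not to be that is the question whether tis nobler in the"
cipher = "bd df bu rcb hl eh kosh js irb avckrmao rtdhwag dha kdhacd gw vms"

# An empty list would mean the candidate plaintext does not fit anywhere.
print(match_offsets(plain, cipher))
```

A match on word lengths alone says nothing about how individual letters were enciphered, which is exactly the situation described: the plaintext is identified while the scheme remains unknown.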
(Of course it could be that the plaintext of that hypothetical manuscript was some other text entirely, but the encrypted text was re-spaced to match Hamlet, just to throw would-be crackers off track. And it could be that the true VMS content is encoded by steganography, as subtle variations of glyph shapes, but the supporting text was chosen to be a translation of the SBJ with the same goal...)
Quote:It's a strawman to suggest anyone doesn't understand that these are things that have some variance around mean values. I think you exaggerate the extent to which "those statistics are greatly affected by topic, style, nature of the text, cost of vellum, etc."
I am not exaggerating. When people cite the statistics of English, Latin, etc. they are usually referring to statistics in discursive texts -- like novels, newspapers, theological treatises, etc. Consider the following text:
December third: measured latitude twenty three south, longitude ninety seven west.
December fourth: measured latitude twenty four south, longitude ninety seven west.
December fifth: measured latitude twenty four south, longitude ninety eight west.
...
December twentieth: heavy storm, position not measured.
December twenty-first: measured latitude twenty nine south, longitude ninety six west.
December twenty second: squashed a mutiny. Hanged three men.
December twenty third: measured latitude thirty south, longitude ninety five west.
...
That would be a grammatically correct English text, fully meaningful, but with statistics very different from those of Moby Dick. Its lexicon may have fewer than 100 word types, its repetitiousness would be off the charts, and it may not use the word "the" even once...
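To put rough numbers on this, here is a small sketch that counts tokens and word types in the log entries shown above (the elided "..." entries are simply left out). The tokenizer, which lowercases and keeps only runs of the letters a-z, and the name `lexical_stats` are assumptions of this sketch.

```python
import re
from collections import Counter

LOG = """\
December third: measured latitude twenty three south, longitude ninety seven west.
December fourth: measured latitude twenty four south, longitude ninety seven west.
December fifth: measured latitude twenty four south, longitude ninety eight west.
December twentieth: heavy storm, position not measured.
December twenty-first: measured latitude twenty nine south, longitude ninety six west.
December twenty second: squashed a mutiny. Hanged three men.
December twenty third: measured latitude thirty south, longitude ninety five west.
"""


def lexical_stats(text: str) -> dict:
    """Token count, type count, type/token ratio, and count of 'the'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": ratio,
        "the": counts["the"],          # Counter returns 0 for absent words
        "most_common": counts.most_common(5),
    }


print(lexical_stats(LOG))
```

Feeding the same function an equally long excerpt of a novel would give a point of comparison; in a full log the type count should grow far more slowly with length than in discursive prose.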
Quote:the fact that hill climbing n-gram stats decrypters work as well as they do would appear to be an existence proof that they are not.
AFAIK, those methods work to the extent that the encryption is simple substitution and the plaintexts are discursive and not obfuscated with nulls or polyalphabetic substitutions -- so that their stats are those of discursive texts (modulo the substitution). They would not work if the plaintext was just a list of street addresses of KGB spies around the world. Wouldn't those methods fail if one inserted a random Russian word and a random Finnish word between any two words of the English plaintext?
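The null-word obfuscation suggested in that last question can be sketched as below. The tiny word lists (transliterated Russian and Finnish) and the function name `interleave_nulls` are illustrative assumptions only; a real test would use large dictionaries and then apply the substitution cipher on top.

```python
import random

# Illustrative null-word lists; any large dictionaries would do.
RUSSIAN = ["gorod", "voda", "kniga", "okno", "zima"]
FINNISH = ["talo", "vesi", "kirja", "ikkuna", "talvi"]


def interleave_nulls(plaintext: str, null_lists, seed: int = 0) -> str:
    """Insert one random word from each list in null_lists
    between every two consecutive words of plaintext."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    words = plaintext.split()
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i < len(words) - 1:
            for nulls in null_lists:
                out.append(rng.choice(nulls))
    return " ".join(out)


padded = interleave_nulls("to be or not to be", [RUSSIAN, FINNISH])
print(padded)
```

After this step only every third word is English, so the letter n-gram statistics of the padded text are a blend of three languages, which is not what an English-trained hill climber is scoring against.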
A couple of months ago I posted an anagram of an English phrase to this thread. Being very short, it should have been easily cracked by the cryptographers in the audience, even by brute force. But it seems that only one of the readers did so. Presumably all the others had their efforts thwarted because one of the words was "daiin".
Quote:As for the (statistical) significance of the spacings of the cribs, there is a limit to how closely I've been following the thread after the initial "daiin" match, and I don't want to get into why I'm unconvinced of the significance of the "daiin" spacing.
I am aware of that, and I am working on making the argument irrefutable.
All the best, --stolfi