The Voynich Ninja

Full Version: The 'Chinese' Theory: For and Against
(22-05-2026, 03:54 AM)kckluge Wrote:
(21-05-2026, 02:28 PM)Jorge_Stolfi Wrote: Can you be more specific?

Sure. See Schinner, Andreas (2007) 'The Voynich Manuscript: Evidence of the Hoax Hypothesis', Cryptologia,  31:2, 95 - 107 (specifically, p 100-103). Key passage: "The Levenshtein distance of two character strings is an integer ranging from 0 (exact match) to the maximum of the two string lengths (no similarity), denoting the number of elementary edit operations necessary to make both strings equal. Mapping this number to the interval [0,100]  yields a 'percentage of dissimilarity' for two tokens. In Figure 3, the similar token repetition distance distribution Pn for the VMS compared with normal texts is presented. Here n denotes the number of other tokens between two similar ones, i.e., n = 0 corresponds to the situation of two alike tokens in immediate vicinity. Two words are considered 'similar' if their dissimilarity as defined above is less or equal to 30%; it turns out that the precise value (+/-10%) of this threshold changes Pn only quantitatively, not qualitatively."
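For readers who want to try the measure in the quoted passage, here is a minimal Python sketch. It is not Schinner's code. The function names, the `max_gap` cutoff, and the choice to count only the nearest following similar token are assumptions made for illustration; the 30% threshold and the normalization by the longer string come from the quote.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def dissimilarity(a: str, b: str) -> float:
    """Levenshtein distance mapped to [0, 100] using the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 100.0 * levenshtein(a, b) / longest

def repetition_gaps(tokens, threshold=30.0, max_gap=10):
    """For each token, find the nearest following similar token and count its gap n.

    n is the number of other tokens in between, so n = 0 means adjacent tokens.
    Assumption: only the nearest match within max_gap is counted.
    """
    gaps = Counter()
    for i, tok in enumerate(tokens):
        for n in range(max_gap + 1):
            j = i + n + 1
            if j >= len(tokens):
                break
            if dissimilarity(tok, tokens[j]) <= threshold:
                gaps[n] += 1
                break
    return gaps
```

Dividing each count in `repetition_gaps` by the total would give an empirical estimate of the distribution Pn, under the assumptions stated above.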

I cannot read that article from home; I will have to go to the university tomorrow for that.  Meanwhile, I would say that the statistic above is obviously very dependent on the text, more than on the language.  A technical manual, like a materia medica or herbal, can be much more repetitive than a novel.  Which sort of Chinese texts did they use?

Quote:Understood, and the word count of the Roger Bacon excerpt was picked to match the sample of Voynich text I was comparing it to.

Again, a discursive text like Roger Bacon's would not be an adequate comparison to the VMS, which looks like it is mostly a herbal and a collection of recipes.  

Quote:I think if you're positing rates that are unusually large compared to known rates of scribal errors in manuscript texts then that's potentially a problem (and way back in this thread I suggested using rates of scribal errors in Greek mss. copied by an "illiterate" scribe as a reasonable proxy for the Voynich).

This is a clip from a large manuscript from the St. Gallen monastery in Switzerland

[attachment=15674]

I cannot judge the main text; but the red text in that clip ("Here ends Bonaventura's Dialogue  Between Soul and Reason") has five spelling errors in 7 words, which were corrected by another scribe -- an error rate of 71%.  Obviously the monk who had been anointed as The Exclusive Keeper Of The Red Inkwell did not know Latin -- not even enough to tell that "Bonaventura" was a proper name; that, being a genitive, the ending should be "-æ", "-ae", or "-ę" (as corrected), not "e"; and that "raaonem" was not a valid word.

As I posted before, I suspect that the main source of errors in the VMS was the Scribe, who apparently did not know the language, misreading the Author's draft -- which possibly was in a semi-cursive handwriting.  That would explain, for example, why d seems to have been often replaced by k and l, but not by t or other letters; and why words ending in ir seem to occur in the same contexts as words ending in iin.

Quote:If you think the Voynich Mss. text is Chinese, don't start by translating Voynichese into Chinese. Start by showing how to translate Chinese into Voynichese. And yes, I understand that you don't actually think it's Chinese per se, and per my comment below I understand that you can't follow this path given the nature of your claim. From my point of view, that's a problem.

Well, again, my claimed solution is indeed not of the type that you seem to expect.  I am not claiming that I found how to translate Chinese into Voynichese, or vice-versa.  I am claiming to have identified the plaintext. 

It is like if there was a mysterious encrypted manuscript from 1610, and someone pointed out that the lengths of the words matched those of Act III of Shakespeare's Hamlet:
  to be or not to be this is the question whether tis nobler in the 
  bd df bu rcb hl eh kosh js irb avckrmao rtdhwag dha kdhacd gw vms
Wouldn't that count as a solution, or at least a substantial advance towards it?  Even though one would still be unable to specify the encryption scheme (a Vigenère cipher with a one-time-pad for key?) or decode any part of the text that was not from Hamlet?  
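The word-length comparison in that hypothetical can be checked mechanically. The sketch below uses the two lines exactly as given above; the function names `length_pattern` and `find_pattern` are made up for illustration.

```python
def length_pattern(text: str) -> list[int]:
    """Lengths of the whitespace-separated words of a text."""
    return [len(w) for w in text.split()]

def find_pattern(lengths: list[int], pattern: list[int]) -> list[int]:
    """Offsets at which a run of word lengths occurs inside a longer run."""
    m = len(pattern)
    return [i for i in range(len(lengths) - m + 1) if lengths[i:i + m] == pattern]

plain = "to be or not to be this is the question whether tis nobler in the"
cipher = "bd df bu rcb hl eh kosh js irb avckrmao rtdhwag dha kdhacd gw vms"

# True when the two lines have the same sequence of word lengths.
same_lengths = length_pattern(plain) == length_pattern(cipher)
```

With a longer candidate plaintext, `find_pattern` would list every offset where the ciphertext's length sequence fits, which gives a rough idea of how often such a match could arise by chance.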

(Of course it could be that the plaintext of that hypothetical manuscript was some other text entirely, but the encrypted text was re-spaced to match Hamlet, just to throw would-be crackers off track.  And it could be that the true VMS content is encoded by steganography, as subtle variations of glyph shapes, but the supporting text was chosen to be a translation of the SBJ with the same goal...)

Quote:It's a strawman to suggest anyone doesn't understand that these are things that have some variance around mean values. I think you exaggerate the extent to which "those statistics are greatly affected by topic, style, nature of the text, cost of vellum, etc."

I am not exaggerating.  When people cite the statistics of English, Latin, etc. they are usually referring to statistics in discursive texts -- like novels, newspapers, theological treatises, etc.  Consider the following text

  December third: measured latitude twenty three south, longitude ninety seven west.
  December fourth: measured latitude twenty four south, longitude ninety seven west.
  December fifth: measured latitude twenty four south, longitude ninety eight west.
  ...
  December twentieth: heavy storm, position not measured.
  December twenty-first: measured latitude twenty nine south, longitude ninety six west.
  December twenty second: squashed a mutiny. Hanged three men.
  December twenty third: measured latitude thirty south, longitude ninety five west.
  ...

That would be a grammatically correct English text, fully meaningful, but with statistics very different from those of Moby Dick.  Its lexicon may have fewer than 100 word types, its repetitiousness would be off the charts, and it may not use the word "the" even once...
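To put rough numbers on that, here is a small Python sketch that counts word tokens, word types, and occurrences of "the" in the first three log lines above. The function name `lexical_stats` and the crude letters-only tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def lexical_stats(text: str) -> dict:
    """Token count, type count, type/token ratio, and count of 'the'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ratio = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": ratio,
        "the": counts["the"],
    }

log = """December third: measured latitude twenty three south, longitude ninety seven west.
December fourth: measured latitude twenty four south, longitude ninety seven west.
December fifth: measured latitude twenty four south, longitude ninety eight west."""

stats = lexical_stats(log)
```

Each further line of the log adds about eleven tokens but at most a couple of new types, so the type/token ratio keeps falling as the log grows.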

Quote:the fact that hill climbing n-gram stats decrypters work as well as they do would appear to be an existence proof that they are not.

AFAIK, those methods work to the extent that the encryption is simple substitution and the plaintexts are discursive and not obfuscated with nulls or polyalphabetic substitutions -- so that their stats are those of discursive texts (modulo the substitution).  They would not work if the plaintext was just a list of street addresses of KGB spies around the world. Wouldn't those methods fail if one inserted a random Russian word and a random Finnish word between any two words of the English plaintext?

A couple of months ago I posted an anagram of an English phrase to this thread.  Being very short, it should have been easily cracked by the cryptographers in the audience, even by brute force.  But it seems that only one of the readers did so.  Presumably all the others had their efforts thwarted because one of the words was "daiin". 

Quote:As for the (statistical) significance of the spacings of the cribs, there is a limit to how closely I've been following the thread after the initial "daiin" match, and I don't want to get into why I'm unconvinced of the significance of the "daiin" spacing.

I am aware of that, and I am working on making the argument irrefutable.

All the best, --stolfi
(22-05-2026, 06:36 AM)Jorge_Stolfi Wrote: I cannot judge the main text; but the red text in that clip ("Here ends Bonaventura's Dialogue  Between Soul and Reason") has five spelling errors in 7 words, which were corrected by another scribe -- an error rate of 71%.  Obviously the monk who had been anointed as The Exclusive Keeper Of The Red Inkwell did not know Latin -- not even enough to tell that "Bonaventura" was a proper name; that, being a genitive, the ending should be "-æ", "-ae", or "-ę" (as corrected), not "e"; and that "raaonem" was not a valid word.

You posted this several times and it really isn't a good example of scribal errors. There is only one error, the "e" in "bone" should be an "a". The missing loop of "b" is not an error, it was perfectly readable without it. Most (like 99%) late medieval manuscripts did not use any ę/æ/œ whatsoever, it was not a mistake to write "e" instead of "ę". It's not "raaonem" but "racionem", a medieval spelling, extremely frequent in "ti"+vowel patterns. The "ci" ligature was written without a dot, like the first one in Expli"ci"t.
(21-05-2026, 06:53 PM)nablator Wrote:
(21-05-2026, 02:28 PM)Jorge_Stolfi Wrote: Well, you have that text above to play with.  Do you see "babble-like sequences" in it?
Anyway, this doesn't happen in the VMS [...]

There is something funny going on...

As you may have guessed, that file is the digital file of the Shennong Bencaojing that I was using, read in Mandarin pinyin, obfuscated by an arbitrary spelling into Latin letters and digraphs.  I located the Chinese original entries corresponding to those highly repeated long strings.  The numbers <...> refer to my SBJ file.

6x, 10 words
  vuef jhut quky jhokd mep bifa yitd duitd jhotd xuikd 
  [jiǔ shí] qīng shēn bù lǎo, yán nián shén xiān.
  [久食]轻身不老,延年神仙。
  [Prolonged ingestion] makes the body light, wards off aging, extends one's years, and turns one into a divine immortal.
  <b1.2.018> <b1.2.019> <b1.2.020> <b1.2.021> <b1.2.022> <b1.2.023>
 
8x, 8 words
  vuef fet quky jhokd yitd duitd mep bifa 
  [jiǔ fú] qīng shēn yán nián bù lǎo.
  [久服]轻身延年不老。
  [Prolonged intake] makes the body light, extends one's years, and wards off aging.
  <b1.1.004> <b1.2.025> <b1.2.033> <b1.2.034> <b1.2.043> <b1.2.052> <b1.3.076> <b1.4.094>
 
7x, 6 words
  vuef fet quky jhokd mep bifa   
  [jiǔ fú] qīng shēn bù lǎo.
  [久服]轻身不老。
  [Prolonged intake] makes the body light and wards off aging.
  <b1.1.005> <b1.2.029> <b1.2.046> <b1.3.078> <b1.5.105> <b1.6.113> <b2.2.015>

The repetitions, by themselves, are not surprising: in a succinct materia medica, one should expect the same indications and (in this case) benefits to appear in many recipes.
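Repeated word sequences like the ones listed above can be found with a simple n-gram index. This is only a sketch, not the program that produced the counts in this thread. The function name, the entry ids `e1` to `e3`, and the filler words `abc` and `xyz` are made up; the other words are copied from the obfuscated strings above.

```python
from collections import defaultdict

def repeated_ngrams(entries: dict, n: int, min_count: int = 2) -> dict:
    """Map each n-word sequence to the sorted ids of the entries containing it.

    Only sequences found in at least min_count different entries are kept.
    """
    seen = defaultdict(set)
    for entry_id, text in entries.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            seen[tuple(words[i:i + n])].add(entry_id)
    return {gram: sorted(ids) for gram, ids in seen.items() if len(ids) >= min_count}

# Toy entries: hypothetical ids and filler, with words taken from the strings above.
entries = {
    "e1": "abc vuef fet quky jhokd mep bifa",
    "e2": "vuef fet quky jhokd mep bifa xyz",
    "e3": "vuef jhut quky jhokd mep bifa",
}

six_word_repeats = repeated_ngrams(entries, 6)
four_word_repeats = repeated_ngrams(entries, 4)
```

Running the same index over a VMS transliteration, one locus per entry, would be one way to compare the two texts for long repeats, subject to how words are delimited in the transliteration.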

What is more surprising is that no similar repetitions are seen in the VMS.  I will try to identify the SPS parags that match those entries and let you know.  

Meanwhile, there is another curious detail. Many of those long repeated strings (marked in red above) occur close together in my SBJ file, with numbers ranging from b1.2.018 to b1.2.046.  That again is not strange given the way the SBJ is organized.  What is curious is that I have tried to find the matching parag for a dozen recipes in that number range, and I could not come up with a single convincing match.  

There may be other explanations for this failure, but one guess is that those recipes happened to fall into (a) one of the three "super-parags" on f108v, f11r, and f111v, each with ~10 stars; or (b) in one of the four missing pages.

And here is what the little devil on my left shoulder is whispering: perhaps it was case (b), and Wilfrid noticed those repetitions, and thought that they would make the book look like gibberish, and thus strategically "lost" those four pages.  After all, the book was a Bacon original, so he was not trying to cheat anyone, just removing a source of potential unnecessary confusion, right? 

All the best, --stolfi
(22-05-2026, 10:28 AM)nablator Wrote: The missing loop of "b" is not an error, it was perfectly readable without it.

Please! 

It may have been perfectly readable in spite of all those errors -- but they were errors.  So much so that another scribe had to correct them.  The [link] is precisely "Bonaventurae dialogus inter animam et rationem".  What the Rubricator wrote was like "This is the end of shekespearz omelet".
  • The "b" should have been uppercase since it is a proper name.
  • The name is "Bona" not "Bone"
  • "Bonaventure" is the English version of the name.  In Latin (from Italian) the name was Bonaventura.  In that sentence the name should have been in the genitive: -ae in Classical Latin, or -æ in Medieval/Church Latin, or ę in scribal abbreviation.  Ending it with -e was a grammatical error.  
  • The last word should have been "rationem" not "racionem".  This is not a valid spelling; and anyway that is not how the book's name is written.  Even if that is what he intended to write, it would still have been an error.
  • The Proofreader added a dot over that "i" because it was needed, even after he corrected the "c" to "t".   It would have been needed even if it was a "ci".   Especially if it was a "ci", precisely to avoid confusion with "a".  [attachment=15685] (from that same book, page 63).
  • Indeed there is a missing dot over the "ci" of "Explicit".  The Proofreader missed it, or let it pass because the "c" was well separated from the "i" so it would not be confused with "a".  Either way, that would make six errors in seven words...

And what is the thesis that you are trying to defend?  That the VMS must not have any errors?  Because not even a Scribe who could not read the language would make any errors?

All the best, --stolfi
(21-05-2026, 03:40 PM)eggyk Wrote: What I mean is a word that only appears once or twice in many recipes, but always in the same position when used within a recipe. For example, the word "grows" or the phrase "also known as". If you were to find a VMS match for a word like this, showing a positional match across recipes, that would be interesting.

And that is indeed (part of) what I am doing.  

In my Chinese SBJ file, 216 out of 363 entries (~60%) have a 主 ("Mainly for") within the first 10 Chinese characters.  In the VMS, that 主 maps to daiin or a few variations thereof.  I count 148 out of 243 "good" parags (~60%) that have one of those strings within the first 50 letters (ignoring spaces).  And on average 1 Chinese character corresponds almost exactly to 5 EVA letters.
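The "crib within the first N units" test can be sketched as below. This is not the program behind the counts above. The function names are made up, the variant set is only an example, and both sample strings are toy inputs rather than real SBJ or VMS text. One function serves both sides: Chinese entries are passed as lists of characters, Voynichese parags as lists of words, and offsets are counted in characters or letters with spaces ignored.

```python
def crib_offsets(units, cribs) -> list[int]:
    """Offsets, in characters/letters with spaces not counted, of units that are cribs."""
    offsets, pos = [], 0
    for unit in units:
        if unit in cribs:
            offsets.append(pos)
        pos += len(unit)
    return offsets

def has_early_crib(units, cribs, window: int) -> bool:
    """True if some crib starts within the first `window` characters/letters."""
    return any(off < window for off in crib_offsets(units, cribs))

# Toy inputs (not real entries): one Chinese string, one Voynichese-like line.
chinese_units = list("久服轻身主不老")
eva_units = "okeey daiin chedy".split()

chinese_hit = has_early_crib(chinese_units, {"主"}, window=10)
eva_hit = has_early_crib(eva_units, {"daiin", "dain"}, window=50)

# The two proportions quoted above, both close to 0.6.
sbj_fraction = 216 / 363
sps_fraction = 148 / 243
```

Note the assumption that a crib counts as "within the window" when it starts there; requiring it to end inside the window would shift the counts slightly.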

Quote:With something like "daiin", it is so common throughout the text that it will undoubtedly coincidentally match with something.

Even when I have a single 主 in the Chinese text, my programs often fail to find a good match -- that has a daiin or one of its allowed variants, and the right number of EVA letters before and after it, plus or minus a few letters.  
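One way to picture that kind of positional test is the Python sketch below. It is not the actual matching program. The function names, the absolute slack of 3 letters, and the sample line are assumptions; only the 5-letters-per-character ratio comes from the post above.

```python
def letters_around(words, index: int) -> tuple[int, int]:
    """Letters (spaces ignored) before and after the word at `index`."""
    before = sum(len(w) for w in words[:index])
    after = sum(len(w) for w in words[index + 1:])
    return before, after

def candidate_positions(words, variants, chars_before: int, chars_after: int,
                        ratio: float = 5.0, slack: int = 3) -> list[int]:
    """Indices of crib variants whose surrounding letter counts fit the Chinese entry.

    chars_before and chars_after are the Chinese characters before and after the
    crib character; each is expected to map to about `ratio` EVA letters.
    """
    hits = []
    for i, word in enumerate(words):
        if word not in variants:
            continue
        before, after = letters_around(words, i)
        if (abs(before - ratio * chars_before) <= slack
                and abs(after - ratio * chars_after) <= slack):
            hits.append(i)
    return hits

# Hypothetical four-word line, not an actual VMS locus.
line = "okeey daiin chedy qokal".split()
fits = candidate_positions(line, {"daiin", "dain"}, chars_before=1, chars_after=2)
```

A tight slack makes accidental fits rarer but also rejects true matches whenever the 5:1 ratio drifts, which is one possible reading of why the search "often fails".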

But I have another good crib, 气 ("vital essence" or "vital energy") that matches chedy in the VMS or a few variations thereof.  It occurs in 242 out of 363 recipes.   When the recipe has both 主 and 气,  my programs do what you show in your images.  Even better when the recipe has more than one of either keyword.  

I have 2-3 other candidates for cribs, but I am not sure of them yet.

All the best, --stolfi
Thank you for sharing. I pivoted, went to voynich.nu, looked at the ZL 3b file, and followed up on the daiin positional questions. One thing that stands out (and I fully recognize that looking at this positional issue is NOT a new idea) is this: when daiin is the last word of a line, which word holds the penultimate position, and is that regular enough to reveal more structure? To be clear, I mean physical line endings here, as in any document. ZL 3b shows:

Penultimate word / Lines with that penultimate word / Of those, lines ending in daiin / %
chol / 43 / 10 / 23%
otol / 13 / 3 / 23%
dy / 18 / 3 / 17%
shol / 19 / 3 / 16%

That's wild! It goes down to dy next, at 18 / 3 / 17%. Tiltman and others definitely flagged that certain words prefer front, middle, or back positions, so there is nothing conceptually new here. But digging into the examples of chol etc. right before daiin:

f35v.19    schokey chol daiin
f37v.23    yto chol daiin
f47r.9    chaiin okal chol daiin
f49v.26    otol chol daiin
f56v.16    chol chol daiin
f16r.8    shotchy ydain yky shody otol daiin
f42v.2    dshey tchey y kchey chtchy dan dain otol daiin
f15v.8    ytchor chor ol oiin oty shol daiin
f20v.4    sho or aiin shol daiin
f22v.14    olshly shol daiin

And this is mainly herbal, and then pharma, right? Why does herbal signal the word before daiin the most? In other sections, those same words don't. So is daiin in the SPS marking a field, while daiin in herbal is "caused" by the penultimate word, or grammatically required? And chol is wild: it occurs before other words too... but why before daiin 23% of the time and before other words 77% of the time? Some preference but not a rule; a predictor but not a trigger.
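For anyone who wants to redo this count, here is a minimal Python sketch. It assumes the transliteration has already been reduced to one string of space-separated words per physical line; parsing the ZL file itself is not shown, and the function name is made up. Three of the sample lines are taken from the list above, and the fourth is a made-up line added so that one case does not end in daiin.

```python
from collections import Counter

def penultimate_stats(lines, final: str = "daiin") -> dict:
    """For each word: (lines where it is penultimate, of those ending in `final`, share)."""
    as_penultimate, before_final = Counter(), Counter()
    for line in lines:
        words = line.split()
        if len(words) < 2:
            continue
        as_penultimate[words[-2]] += 1
        if words[-1] == final:
            before_final[words[-2]] += 1
    return {
        w: (n, before_final[w], before_final[w] / n)
        for w, n in as_penultimate.items()
    }

sample_lines = [
    "schokey chol daiin",
    "yto chol daiin",
    "olshly shol daiin",
    "otol chol shey",  # made-up line, not from the manuscript
]
stats = penultimate_stats(sample_lines)
```

Comparing each word's share against the overall fraction of lines ending in daiin, section by section, would show whether 23% is actually high for that section.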

Now I need a glass of wine... lol



(22-05-2026, 04:22 AM)Jorge_Stolfi Wrote:
(20-05-2026, 01:35 AM)JoeyB Wrote: FIRST: which SBJ or ZHB version is the right comparator. And SECOND, whether the source text could have existed in the 'right form' around 1400. And I definitely don't have anything useful to add there.

The answer will require a rather long post; it will come.  But in brief: as @richforto noted, the version of the Shennong Bencaojing (SBJ) that became the Starred Parags section (SPS) of the VMS was not the "reconstructed" file that I had been using, but a version that was embedded in some later materia medica; most likely the Zhenghe Bencao (ZHB), which was composed around 1080 CE.  On block-printed copies of the ZHB circulating in the 1300-1400 CE time frame, the SBJ text would have been clearly marked out by being printed in double-size font and white-on-black instead of black-on-white.  Thus a doctor or scholar who had a copy of the ZHB could have read it out aloud easily, excluding certain ~500 CE additions that had become as "sacred" as the SBJ itself.

Quote:BUT, the other issue seems to be more testable: if this is a positional-distance hypothesis, can the method we use to match also recover the rooster/f105v32-38 pair when we run the whole thing blind? Which files should I grab for that? EG, compare all SPS paragraphs against all SBJ/ZHB entries without preselecting the rooster pair,

Yes, and that is exactly what I am doing now for every SPS entry, including the Rooster one.  Except that I am testing only against the 243 "good" SPS parags -- excluding those which have two or more stars on the margin, which presumably are in fact two or more parags run together.  So I expect that at most 2/3 of the SBJ entries will have an identifiable match.

However, the Rooster entry is a very easy case because it is the only one with eight instances of 主. Thus f105v.32 is the only parag with enough daiin to even begin to compete, even allowing for "quillos" like dain, laiin, or kain.  So it easily comes out as the match for the Rooster entry.  I will discuss this in detail later.

Apart from the Rooster one, there are a few SBJ entries with three 主, and  a few with two. The vast majority has only one.   On these, my programs will often find two or more parags that seem to have a daiin (or quillos thereof) at the right places.  So I have to put those entries on hold for now.

On the other hand, I now have a couple more cribs that can be used in the comparison, notably 气 qì (some sort of "vital energy"; sort of like the Western "humors" but completely different).  That is a rather common character in the SBJ, and seems to correspond to chedy (or sometimes slight variations thereof, like chdy or cheda) in the SPS.  By including those cribs in the list of keywords to be matched I can often resolve those ambiguities, and make already identified matches more certain.

Quote:I started looking at and playing with files in the ic.unicamp....Notes/077 folder but before I go too far down the road, if the method is bad I don't want to keep going, and if there are files that are final versions or authoritative I'd want to use those.

Sorry, those programs and files were not meant for other people's use.  They are still messy and buggy tools that I am using to match SBJ recipes to SPS parags. A task that is progressing, but much more slowly than I hoped, for reasons that I will detail later.  Feel free to use those programs, but I cannot guarantee them.

All the best, --stolfi
There is an entry in the Shennong Bencaojing (SBJ) that has two instances of the character 血 = "blood".  That character seemed another good candidate for a crib.  I found one parag in the Starred Parags Section (SPS) that matched that entry fairly well, if I asked my programs to look only for the four other cribs that I already had.  But there was no Voynichese word in that parag that could be the translation of both characters.

Then I noticed that in the pinyin reading of that entry the two 血 characters had different readings, one as xuè and the other as xiè.  I asked Google AI which was the correct reading. Here is the reply:

Quote: You did not copy the pinyin incorrectly [...]

The character 血 is a polyphone (多音字) that has two standard Mandarin readings: the literary reading (xuè) and the colloquial reading (xiě). 

Because Shanghan Lun (伤寒论) / Jingui Yaolue formulas and symptoms are classical medical texts, the choice between these two readings depends on how the words are structured in Traditional Chinese Medicine (TCM). 

1. 肠澼脓血 (cháng pì nóng xuè)
  • Why it is xuè: In this phrase (which refers to dysentery with pus and blood), 脓血 (nóngxuè) is a tightly bound, formal compound noun ("pus-blood"). 

  • The Rule: In Mandarin, when 血 forms a stable, multi-syllable compound or technical/medical term (like 血液 xuèyè, 血管 xuèguǎn, or 充血 chōngxuè), it almost always takes the formal literary reading xuè.

2. 下血赤白 (xià xiè chì bái)
  • Why it is xiè (or colloquially xiě): In this phrase (referring to the passing of red and white stool/blood), 下血 is a verb-object construction where 下 acts as a verb ("to discharge / pass down") and 血 stands alone as the direct object noun ("blood"). 

  • The Rule: When 血 stands alone as an independent noun or functions as the object after a verb (like 吐血 tùxiě "to vomit blood" or 流血 liúxiě "to bleed"), native speakers naturally shift to the colloquial reading xiě.

Sigh.  If my problem was not already hard enough...

All the best, --stolfi
nablator Wrote:
Jorge_Stolfi Wrote:I cannot judge the main text; but the red text in that clip ("Here ends Bonaventura's Dialogue  Between Soul and Reason") has five spelling errors in 7 words, which were corrected by another scribe -- an error rate of 71%.  Obviously the monk who had been anointed as The Exclusive Keeper Of The Red Inkwell did not know Latin -- not even enough to tell that "Bonaventura" was a proper name; that, being a genitive, the ending should be "-æ", "-ae", or "-ę" (as corrected), not "e"; and that "raaonem" was not a valid word.

You posted this several times and it really isn't a good example of scribal errors. There is only one error, the "e" in "bone" should be an "a". The missing loop of "b" is not an error, it was perfectly readable without it. Most (like 99%) late medieval manuscripts did not use any ę/æ/œ whatsoever, it was not a mistake to write "e" instead of "ę". It's not "raaonem" but "racionem", a medieval spelling, extremely frequent in "ti"+vowel patterns. The "ci" ligature was written without a dot, like the first one in Expli"ci"t.

Hi Nablator, of course I fully agree with what you write.

I investigated a little about the origins of the name "Bonaventura" and it's a nickname based on "bona ventura" (good luck). Therefore the genitive "boneventure" (or the equivalent "bonaeventurae") was clearly understandable to those familiar with Franciscan lore.


"Amoris stimulū Bonevēture"
Amoris stimulum Boneventure

[attachment=15832]
(29-05-2026, 08:43 AM)MarcoP Wrote: I investigated a little about the origins of the name "Bonaventura" and it's a nickname based on "bona ventura" (good luck). Therefore the genitive "boneventure" (or the equivalent "bonaeventurae") was clearly understandable to those familiar with Franciscan lore.

That may have been "understandable", and a common error -- but it was wrong.  It is not me saying that, it is the Proofreading scribe who corrected that "e" to "a", together with all the other Rubricator's mistakes.

Whatever the origin, "Bonaventura" was a proper first name (not a nickname, although it was based on the phrase "bona ventura") and it was Italian not Latin.  So the internal declension of "Bona" to "Bone" was not appropriate.  

And the final "e" in your image clip seems to have a cedilla on top instead of on the bottom.  Isn't that evidence that a simple "-e" ending would be incorrect (as the Proofreader apparently thought)?

And anyway the whole attribution to St. Bonaventura seems to be an error -- the true author was Jacobus Mediolanensis (Giacomo da Milano?).

And even if we conceded that "Bone" was correct in that "explicit", and the Proofreader was wrong, we still have the other five or six errors...

I still stand by my point: that "explicit" rubric is an example of how many errors a scribe can make if he is not fluent in the language he is copying.

All the best, --stolfi
The temptation to label as “errors” cultural expressions that are different from our modern point of view is always hard to resist. We are so used to Latin as a frozen, dead language, that we cannot imagine that in the middle ages it was still very much alive and subject to all the changes and individual variants that are typical of living languages.

Emphasis mine:

University of Oxford Wrote:The most obvious difference in appearance between Medieval Latin and Classical Latin is in how words were spelled. Although Classical spellings were generally retained for inherited vocabulary, changes in pronunciation which had happened over the centuries — many the same as those which had led to the divergence of the everyday Romance languages from Latin and from each other — influenced the corresponding spelling of the words. Thus we often find ci before a vowel where the Classical spelling would have been ti (e.g. racio for ratio), and the diphthongs ae and oe which had come to be pronounced the same as the simple e sound are often written e. (We also find as a result examples where ae or oe are written where the expected spelling would be just e.) Other alternations in spelling arising from changes in pronunciation are the interchange of b and v, the insertion or deletion of h, the use of single consonants for double ones (and vice versa), and the substitution of y for i. Sometimes spellings were also influenced by the pronunciation of a word in the everyday local language related to or derived from the Latin word (or thought to have been so).
