The Voynich Ninja

Full Version: The 'Chinese' Theory: For and Against
(06-05-2026, 03:02 PM)Jonas Barnun Wrote: I tried to map some other words in the Rooster, specifically the words for 鸡, 下 and 子, each of which appears at least twice in the Chinese. Are you able to find potential candidates for these characters?

I am trying, but not in a very efficient manner.  I spent a lot of time over the past month trying to automate the matching of SBJ entries to SPS parags, but it is still not good enough. I must improve the metric I use for measuring the quality of the match.  Maybe if I had spent that time doing the matching by hand I would have got more results by now...

One potential problem is that the same Chinese character may map to two or more different words in the VMS language.  For example, take 鸡子 = "chicken child" = "egg" in that Rooster recipe.  If the Dictator who read the book to the Author was Cantonese or Vietnamese, he may have said the word for "egg" in his language, rather than the word for "chicken" then the word for "child". 

Quote:Especially the versions of shennong bencao jing are multiple, so the possibilities are also quite numerous on that front as well, we are not talking about a canonical text.

I am aware that there are variations.  For instance, just this weekend I found that in the "cinnabar" recipe, the digital files have a 治 character while other sources have 主. It seems that the former is a relatively modern change, and it would have been the latter in 1400.

And the whole 鸡白蠹 = "chicken white grub" part of the "Rooster" recipe, being related to farming rather than medicine, was probably already missing in the copy that was the source for the VMS.

My digital file has quite a few notations in parentheses that specify the class of patients to which the following diseases or benefits apply, like (女子) in the Rooster recipe.  Those were generally omitted by the Author, and may have been absent from the original source as well.

But so far those variations have been a minor problem.  There are also errors in the digital files available from the net, including a few recipes that were mashed together.

Quote:It would be helpful for me to have the rooster in EVA pronunciation (latin characters) to better visualize and avoid the trap of the Voynich script misreading.

Beware that the EVA encoding is not meant to be how Voynichese is pronounced.  It was designed to be almost pronounceable, but only because it was hoped that it would be helpful when transcribing the text and talking about it.   We still have not the foggiest idea of how daiin or chedy were actually pronounced.  It may well be that d is actually a vowel and a is a consonant, or a tone mark...

The EVA text of f105v.32 is in [link not visible in this view], in the section "Aligning..."

All the best, --stolfi
Thank you Stolfi for sharing the EVA file, it's super helpful and fully understood, the EVA is just a mere convention to codify the characters in more easily readable letters with no phonetic value. 

(07-05-2026, 04:38 AM)Jorge_Stolfi Wrote: One potential problem is that the same Chinese character may map to two or more different words in the VMS language.  For example, take 鸡子 = "chicken child" = "egg" in that Rooster recipe.  If the Dictator who read the book to the Author was Cantonese or Vietnamese, he may have said the word for "egg" in his language, rather than the word for "chicken" then the word for "child".

Regarding this, I can't speak for Vietnam, but I can speak from my experience of China. There is a distinction between speaking in a dialect and reading a text in a dialect. Spoken Wu or Cantonese typically has specific syntactic structures and specific words that are uncommon or nonexistent in the Chinese lingua franca. However, when a Wu or Cantonese speaker reads a classical text, he reads it word by word using his dialect's pronunciation, without changing any words. Take the Wu dialect as an example (and even Mandarin today): we don't say 鸡子 for "egg", we say 鸡蛋 = "chicken egg", as there is a specific character for eggs nowadays. However, when one reads the original text, one would still read 鸡子 (with a pronunciation that depends on the dialect). That's how Chinese was maintained as a single language: the characters were conserved while each region pronounced them very differently and had local words that were used only in the spoken language. In other words, if the text says 子 in two places, the reader would typically read 子 in both places and would not switch to a different dialectal word in either instance.

I did more digging in the Rooster; it's hard to find equivalents for other words that have more than one occurrence (杀 and 血 also have at least two occurrences each). To me the aiin / daiin hypothesis seems to hold pretty well, but so far it's hard to find other matches. I would be delighted if your efforts are more fruitful than mine! I have to say I am very impressed by what you have done, especially since you don't speak Chinese!

Jonas
(07-05-2026, 03:36 PM)Jonas Barnun Wrote: Regarding this, I can't speak for Vietnam, but I can speak from my experience of China. There is a distinction between speaking in a dialect and reading a text in a dialect. Spoken Wu or Cantonese typically has specific syntactic structures and specific words that are uncommon or nonexistent in the Chinese lingua franca. However, when a Wu or Cantonese speaker reads a classical text, he reads it word by word using his dialect's pronunciation, without changing any words. Take the Wu dialect as an example (and even Mandarin today): we don't say 鸡子 for "egg", we say 鸡蛋 = "chicken egg", as there is a specific character for eggs nowadays. However, when one reads the original text, one would still read 鸡子 (with a pronunciation that depends on the dialect). [...] In other words, if the text says 子 in two places, the reader would typically read 子 in both places and would not switch to a different dialectal word in either instance.

That may be bad news, because it removes one possible explanation for why it is hard to find correspondences.  Possibly the language is not a Chinese "dialect" but some non-Chinese language like Vietnamese or Tibetan.  But then I would expect the inconsistencies to be bigger.

So far the cribs I have are
  • MAIN-USES: 主治 or just 主 in the SBJ, daiin (and variations) in the SPS
  • QI:   气 in the SBJ, chedy (and variations) in the SPS
  • MAKES:  令 in the SBJ, ched (and variations) in the SPS
  • LONG-TAKE: 久服 or 久食 in the SBJ, qokeedy or qokaiin (and variations) in the SPS

These cribs seem to work in most cases, although sometimes one needs to allow substantial variations.  The last one is still somewhat uncertain.

The variations may include adding or omitting i and e in certain contexts; omitting q before o; or switching between glyphs that may have similar phonetic values (like [aoy]), that have similar shapes when written in "cursive" (like [dkls], or ir <-> iin), or that seem to be beautified versions of other letters (like p and f).

For instance, in the Rooster alone, as you can see in that write-up, I must accept dair and laiin as variations of daiin. In the Flying Squirrel entry I must accept dain.  And the pairing of those two entries with f105v.32 and f112v.35 is practically certain because they are lone outliers in the respective parag length distributions. 

Conversely, there are many occurrences of daiin, laiin, dain, etc. in the SPS which cannot be translations of 主治/主.  Needless to say, the more liberal I am about the allowed variations of those keywords, the more false matches I will find, and the less convincing the matches will become.  I am still working out where the sweet spot lies between too strict and too liberal.
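To make the strict-versus-liberal trade-off concrete, the variation rules listed above can be sketched as a normalization function. This is only an illustration under my own assumptions, not the procedure actually used for the matching: the function names, the class labels "A" and "D", and the decision to collapse runs of i and e (a narrow reading of "adding or omitting") are all hypothetical, and the p/f rule is left out because the post does not say which letters they stand for.

```python
import re

# Hypothetical equivalence classes taken from the variations listed above.
# The labels "A" and "D" are arbitrary placeholders, not phonetic claims.
_CLASSES = str.maketrans({"a": "A", "o": "A", "y": "A",
                          "d": "D", "k": "D", "l": "D", "s": "D"})

def normalize(word):
    """Reduce an EVA word to a loose key so that allowed variants collide."""
    w = word.lower()
    w = re.sub(r"^q(?=o)", "", w)   # drop an initial q when it precedes o
    w = re.sub(r"i+", "i", w)       # collapse runs of i (iin -> in)
    w = re.sub(r"e+", "e", w)       # collapse runs of e (ee -> e)
    w = re.sub(r"ir$", "in", w)     # treat a final ir like a final iin
    return w.translate(_CLASSES)    # merge [aoy] and [dkls]

def is_variant(word, canonical):
    """True if both words reduce to the same loose key."""
    return normalize(word) == normalize(canonical)

for w in ["daiin", "dair", "laiin", "dain", "chedy"]:
    print(w, is_variant(w, "daiin"))
```

With these rules dair, laiin and dain all reduce to the same key as daiin, while chedy does not. The same rules also merge words such as kaiin or saiin with daiin, which is exactly the source of the extra false matches mentioned above; tightening or loosening the classes is one way to explore the sweet spot.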

Quote:杀 and 血 also have at least two occurrences each

I am trying to figure out the Voynichese for 血 = "blood".  In the table below, each line is an SBJ recipe that (1) has a tentative match that I have already found in the SPS and (2) contains that character.  Some of those recipes contain it two or more times.
Code:
# 血

b1.1.014 | five_clays       | ~ | f103r.01 | CLAY | ... 血    ... | ... opchedyp?olchep ...
b1.4.090 | dragon_bone      | ~ | f104v.01 | DRAG | ... 血    ... | ... chedaiinqoteedchockhy ...
b1.4.096 | red_rooster      | + | f105v.32 | ROOS | ... 血    ... | ... taiinoteeyoteeoolotaiin ...
b3.3.088 | willow_blossom   | ~ | f107r.21 | WILL | ... 血    ... | ... qokalraiiinokaiin...

b1.1.014 | five_clays       | ~ | f103r.01 | CLAY | ... 下血   ... | ... arotchysallkchysarain...
b1.2.061 | heavenly_essence | ~ | f114r.39 | HEAV | ... 下血   ... | ... dainopchedyqetcharyteedyq...

b1.2.061 | heavenly_essence | ~ | f114r.39 | HEAV | ... 止血   ... | ... qetcharyteedyqoteyqoedaii...

b1.2.061 | heavenly_essence | ~ | f114r.39 | HEAV | ~~~ 瘀血   ... | ~~~ daiinokarqoeedainych...
b3.5.119 | peach_kernel     | ~ | f104r.30 | PCHK | ~~~ 瘀血   ... | ~~~ lkeodaiinqokeeolkecheyokeo ...

b2.2.066 | wall_moss        | ~ | f113r.34 | WMOS | ... 血气暴热 ... | ... hkaiinqopchdyqoteedy.rchedy ...

b2.3.090 | mulberry_root    | + | f105v.08 | MULB | ... 血病   ... | ... oeesaiiinolokaiino. ..

b1.2.061 | heavenly_essence | ~ | f114r.39 | HEAV | ... 血瘕   ... | ... okarqoeedainychooedainop...
b3.5.119 | peach_kernel     | ~ | f104r.30 | PCHK | ... 血瘕   ... | ... cheodainqokaralchdokchechy ...

b3.5.119 | peach_kernel     | ~ | f104r.30 | PCHK | ... 血闭   ... | ... daiinqokeeolkecheyokeody ...
b1.4.096 | red_rooster      | + | f105v.32 | ROOS | ... 血闭   ... | ... ockhhyysheyckhysheo ...


The last two columns are the compound or phrase from the SBJ entry that contains 血, and the section of the matched SPS parag that roughly corresponds to that 血, based on the count of EVA characters that separate it from the nearest keywords.  The lines that have just 血 are those where I guessed that the character is used as a noun, as in "diarrhea with pus and blood" or "boosts blood and qi".  The others are occurrences where I guessed that it is used as an adjective, as in 血闭, which Google translates as "amenorrhea" ("absence of menstruation").

There seem to be matches between the EVA strings, enough that I will try to add 血 to my set of cribs.  But there are also notable failures to match.  That may be because the Dictator used a different reading of the character or compound, or because I picked the wrong SPS parag. (I am reasonably confident only about the two identifications marked '+').  

All the best, --stolfi

Don't mention Tibetan or Jessica Scott Dunn will turn up
Here's my twice-yearly reminder to please limit the size of any quoted text, for the benefit of other forum users. Especially when replying with a single line that's hard to see under the quote-wall.
(08-05-2026, 12:00 PM)Koen G Wrote: Here's my twice-yearly reminder to please limit the size of any quoted text, for the benefit of other forum users. Especially when replying with a single line that's hard to see under the quote-wall.

Yes, apologies..
Hi Stolfi,

I am really impressed by the amazing work you are doing on this. It's quite something, I have deep respect for that!

(07-05-2026, 07:03 PM)Jorge_Stolfi Wrote: That may be bad news, because it removes one possible explanation for why it is hard to find correspondences.  Possibly the language is not a Chinese "dialect" but some non-Chinese language like Vietnamese or Tibetan.  But then I would expect the inconsistencies to be bigger.

On second thought, I wonder whether Japanese might have the properties you are looking for. From what I know, Japanese has been heavily influenced by Chinese, but it has two families of readings for each character: the onyomi, which is derived from the Chinese pronunciation, and the kunyomi, which is derived from the vernacular local pronunciation. Example: 山 = mountain is "San" in onyomi (from Shan in Chinese) and "Yama" in kunyomi.

They also have a tradition for reading Classical Chinese texts which is quite fascinating, as they would spontaneously read each character as onyomi or kunyomi depending on the context. (As a rule of thumb, but don't quote me on that, I'd say that a character standing alone would more likely be kunyomi, and two characters composing a concept would more likely be onyomi; but there are probably a lot of exceptions and this is just my own interpretation.) The result is that, while reading a Classical Chinese text, a Japanese reader would switch between onyomi and kunyomi pronunciations depending on the context of each character. To add to the complexity, it's not uncommon for a character to have several variants of its onyomi and kunyomi pronunciations. This leads to a lot of variability for the same character in the case of a Japanese reader. Unfortunately my knowledge of Japanese is limited, and I am not able to tell how a Japanese person would read the Shennong Bencao Jing.

(07-05-2026, 07:03 PM)Jorge_Stolfi Wrote: So far the cribs I have are
  • MAIN-USES: 主治 or just 主 in the SBJ, daiin (and variations) in the SPS
  • QI:   气 in the SBJ, chedy (and variations) in the SPS
  • MAKES:  令 in the SBJ, ched (and variations) in the SPS

One small remark: if the original language used in the VMS is Chinese, it would be quite surprising for 令 and 气 to have such similar spellings, as 气 is pronounced Ki in older Chinese (Tchi in Mandarin today), while 令 starts with L and is normally a longer word (Ling / Lian). So could this confirm that the language used here is not directly Chinese? Or that one of the two is not a good match? Or that Voynichese is not phonetic at all?

(07-05-2026, 07:03 PM)Jorge_Stolfi Wrote: For instance, in the Rooster alone, as you can see in that write-up, I must accept dair and laiin as variations of daiin. In the Flying Squirrel entry I must accept dain.  And the pairing of those two entries with f105v.32 and f112v.35 is practically certain because they are lone outliers in the respective parag length distributions.

Assuming the original language is Chinese, I would not be too bothered by some level of variation in the same word if it was transcribed phonetically. Even today, when Mandarin has been pretty much standardized, pronunciations may fluctuate, especially to an untrained ear (for example, what a Chinese person considers the B sound can sound like P or B to a Western ear, D can sound like D or T, J can sound like Z or J, and some dialects mix L and N). So this part is not a serious issue.

Kind regards,
Jonas
(08-05-2026, 04:06 PM)Jonas Barnun Wrote: Or that Voynichese is not phonetic at all?

From a purely rhetorical standpoint---I've made my doubts about the identification itself abundantly clear and have nothing new to add at this time---I think it bears mentioning how much strain the "SPS = SBJ" identification puts on what I took to be basic tenets of the "Chinese Theory":
  1. Voynichese is phonetic; see above
  2. Voynichese word breaks delineate syllables; Tiltman's "suffixes" are treated as words to get enough -aiin-type endings in the Rooster identification, though I am aware that another theory properly on another thread underwrites this
  3. Voynichese is "Chinese", better expressed as "From the Mainland Southeast Asian Language Area"; Japanese largely lacks finals and is very amenable to Romanization (though I do recognize other proposals are still in play)
  4. "Chinese"/MSEA Languages account for the low entropy; the many-to-one match system destabilizes the connection with the entropy data

The COT as I understood it was supposed to put constraints on how the identification of a text like the Shennong Bencaojing would be done, and important constraints are being relaxed or discarded in this discussion. I understand this might be a case where the correct conjecture was made for incorrect reasons---the many bad arguments for heliocentrism should be familiar to most Voynicheers as an example. However, as someone who lent "the COT" a lot more goodwill than you may realize on the grounds that it really does kind of look like Pinyin or chữ Quốc ngữ, it seems like most of the reasons I did so are not relevant to the identification here. That's all conceptually before we rehearse all the reasons I don't think this identification is proved on its own terms, so, paradoxically, I think accepting "SPS = SBJ" increasingly looks like rejecting much of the "Chinese Theory".
(08-05-2026, 05:35 PM)rikforto Wrote: From a purely rhetorical standpoint---I've made my doubts about the identification itself abundantly clear and have nothing new to add at this time---I think it bears mentioning how much strain the "SPS = SBJ" identification puts on what I took to be basic tenets of the "Chinese Theory":

As I wrote before, those who are convinced that the VMS is encrypted have come to expect that a valid solution will be like the solution to a typical cryptographic puzzle.  Namely, someone deduces the encryption algorithm, inverts it, and presto -- it automatically turns the whole VMS into a plausible original plaintext in a known language, with few if any spelling errors, with common vocabulary and proper orthography and grammar.  They rightly dismiss "solutions" where the decryption "algorithm" requires manual tuning or picking alternatives at every step.  They know that, with such slack, any random string can be "decrypted" into a text in any language that is grammatically correct and sort of seems to make sense.  At least, if one does not read more than one sentence at a time.

But the "SPS=SBJ" theory implies that the VMS is not encrypted; and that the authorship, motivation, means, and creation process are totally unlike those of encrypted documents, Medieval or not.  Thus the VMS is not a cryptographic puzzle, but more like a lost language puzzle.  And solutions to problems of this type are not clean, instantaneous, automatic.  There will be no "decryption algorithm".

And I understand that those who casually browse through this thread and see that I must accept variant spellings and splitting of words, omitted parts, unknown language, etc. may get the impression that the matching of SBJ entries to SPS parags has enough slack to "succeed" independently of whether the theory is correct or  not.   To see that such is not the case one has to look at the proposed matchings (like the Rooster or Bee Larva one) and estimate the probability that the keyword-to-keyword distances could match that well by mere chance.  

Quote:[The COT says that] Voynichese is phonetic

Most likely.  The statistics and word structure are compatible also with some codebook-like cipher; but that would be extremely impractical for a book of this size.

Quote:Voynichese word breaks delineate syllables; Tiltman's "suffixes" are treated as words to get enough -aiin-type endings in the Rooster identification, though I am aware that another theory properly on another thread underwrites this

Mostly, yes.  Independently of the COT or of the SPS=SBJ claim, the structure of most Voynichese "words" is consistent with each "word" being a single syllable.  However, in a small fraction of cases, the structure of the "word" is consistent with it being two or more syllables  stuck together. 

Conversely, the existence of dubious word spaces means that, in some cases, a VMS "word" may be only part of a syllable.  Some of those cases are marked by commas in the transcription, but there must be many cases where the spurious break is marked with a period. 

By the way, note that in Mandarin (and presumably in all other candidate languages) a syllable may be just a single vowel, without any consonant.  Thus, when the Dictator read  辟恶 "pì è" the Author may have written down "piè".

Nevertheless, one important piece of evidence for the SPS=SBJ claim is that, after removing the fields that were (consistently) omitted by the Author, on average each Chinese character corresponds almost exactly to one VMS "word" -- a bit more if commas are treated as spaces, a bit less if commas are ignored.
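As a minimal sketch of what the two counting conventions mean in practice, assuming a transcription where a period marks a word space and a comma marks a dubious one (the function name and the sample line are made up for illustration and are not taken from the manuscript):

```python
import re

def count_words(eva_line, commas_as_spaces):
    """Count Voynichese "words" in one transcription line.

    Assumes '.' marks a word space and ',' marks a dubious word space.
    """
    if commas_as_spaces:
        pieces = re.split(r"[.,]+", eva_line)                  # every comma is a break
    else:
        pieces = re.split(r"\.+", eva_line.replace(",", ""))   # commas are ignored
    return len([p for p in pieces if p])

sample = "daiin.che,dy.qokeedy"   # made-up line, not from the manuscript
print(count_words(sample, True), count_words(sample, False))
```

Dividing each of the two counts by the number of hanzi in the trimmed SBJ entry gives the upper and lower words-per-character ratios referred to above.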

Moreover, even though each word has a variable number of EVA letters, on average each Chinese character corresponds rather closely to 5 EVA letters, not counting spaces.

These ratios are verified not only in the total length of the entries, but also in the spacing between  keywords.  For instance, here is the SBJ entry named "dragon bone" (which is actually about fossil bones):

<b1.4.090> 73 hanzi 
龙骨味甘平主治心腹鬼注精物老魅咳逆泄利脓血女子漏下症瘕坚结小儿热气惊痫齿主治小儿大人惊痫癫疾狂走心下结气不能喘息诸痉杀精物久服轻身通神明延年生山谷

Excluding the parts that are consistently omitted in the VMS (the "Flavor" 味甘平 , the "Provenance" 生山谷, and the patient type tags 女子 = "women", 小儿 = "children", and 小儿大人 = "children and adults") we are left with this:

<b1.4.090>  59 hanzi
       2    龙骨
    2 13 主治 心腹鬼注精物老魅咳逆泄利脓
    1  7 血  漏下症瘕坚结热
    1  3 气  惊痫齿
    2  9 主治 惊痫癫疾狂走心下结
    1  9 气  不能喘息诸痉杀精物
    2  7 久服 轻身通神明延年

The third column has the keywords in that entry for which I have a tentative VMS translation: 主治 = "main uses", 血 = "blood", 久服 = "prolonged consumption", etc.  The fourth column has the strings before, between, and after those keywords.  The first two columns are the respective character counts.
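The trimming and splitting shown in the table above can be reproduced with a few lines of code. This is only a sketch under simplifying assumptions: the omitted fields are removed by plain substring replacement (which would also delete a legitimate 女子 or 小儿 elsewhere in an entry), and the names ENTRY, OMITTED, KEYWORDS, trim and split_on_keywords are mine, not part of any existing tool.

```python
import re

ENTRY = ("龙骨味甘平主治心腹鬼注精物老魅咳逆泄利脓血女子漏下症瘕坚结小儿热气惊痫齿"
         "主治小儿大人惊痫癫疾狂走心下结气不能喘息诸痉杀精物久服轻身通神明延年生山谷")

# Fields assumed to be omitted by the Author. The longer patient tag comes
# first so that 小儿大人 is removed before 小儿.
OMITTED = ["味甘平", "生山谷", "小儿大人", "女子", "小儿"]
KEYWORDS = ["主治", "久服", "血", "气"]

def trim(entry):
    """Remove whitespace and the consistently omitted fields."""
    text = re.sub(r"\s+", "", entry)
    for field in OMITTED:
        text = text.replace(field, "")
    return text

def split_on_keywords(text):
    """Split into alternating gap strings and keywords (gaps at even indices)."""
    pattern = "(" + "|".join(re.escape(k) for k in KEYWORDS) + ")"
    return re.split(pattern, text)

trimmed = trim(ENTRY)
parts = split_on_keywords(trimmed)
gaps = [len(p) for p in parts[0::2]]   # strings before, between, and after keywords
keys = parts[1::2]                     # the keywords themselves, in order
print(len(ENTRY), len(trimmed), keys, gaps)
```

For this entry the sketch gives 73 hanzi before trimming and 59 after, with gap lengths 2, 13, 7, 3, 9, 9, 7, which agrees with the counts in the table.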

This trimmed entry has 59 hanzi (Chinese characters), so by the proportion observed so far we expect the matching SPS paragraph to have about 59 x 5.06 = ~298 EVA letters.  And indeed the best matching SPS parag has 285 letters:

<f104v.1>    3.15  285(-13)  0.200
            3(-7)    1.94 pch
  daiin    64(-1)    0.00 opcheedyoraroltcheeyopchedyolearaiiralycheodaiincheekaindamyched
  aiinqot  37(+1)    0.01 eedchockhyotaiinydaiinqokamdyotararal
  chedo    25(+9)    0.94 tairoramshodchedyqotaiino
  daiin    47(+1)    0.01 okeolockhhycholqokeedyqotairoeedaiinoldlqoteedy
  cheda    44(-1)    0.01 iinchokarqotolqotchedcholcheyqolchedyqoeeeyq
  okeedy   32(-3)    0.04 dcheolchdeeyoeeodainsairolchedal

Here the first column is my tentative VMS translation of the keywords identified in the SBJ entry.  Namely, the 主治 keyword is assumed to correspond to the first occurrence of daiin, its "canonical" translation.  The character 血 is matched with the EVA string aiinqot, which seems to be highly correlated with it in other pairs (although the 血 may in fact turn out to be only part of that string).  And so on.

The discrepancy between the actual and predicted total lengths is -13 EVA letters (~4%), which would be equivalent to ~2.6 hanzi in the original.

Moreover, the counts of EVA letters between the VMS keywords also match the counts of hanzi between the hanzi keywords.  For instance, the gap between 主治 and 血 in the SBJ entry is 13 hanzi long.  The corresponding gap in the SPS was therefore expected to be 13 x 5.06 = ~65 EVA letters.  In fact the gap between the first daiin and the next aiinqot is 64 letters, just one letter (equivalent to ~0.2 hanzi) short of the expectation.  And the same holds for the other six gaps.  The biggest discrepancy is in the gap between the first 气=chedo after the aiinqot and the next daiin, which is 25 EVA letters instead of the predicted 3 x 5.06 = ~16; a discrepancy of +9 EVA letters (equivalent to 1.8 hanzi).
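For anyone who wants to recheck the gap arithmetic, here is a small sketch that recomputes the gap lengths from the strings in the f104v.1 listing and compares them with the raw products of the hanzi counts and the 5.06 ratio. The names RATIO and ROWS are mine, and the raw products may differ by a letter or so from the rounded figures quoted in this post, which are left as they are.

```python
RATIO = 5.06  # EVA letters per hanzi, the figure quoted above

# (keyword that starts the row, hanzi in the SBJ gap, EVA string of the SPS gap)
ROWS = [
    ("",        2,  "pch"),
    ("daiin",   13, "opcheedyoraroltcheeyopchedyolearaiiralycheodaiincheekaindamyched"),
    ("aiinqot", 7,  "eedchockhyotaiinydaiinqokamdyotararal"),
    ("chedo",   3,  "tairoramshodchedyqotaiino"),
    ("daiin",   9,  "okeolockhhycholqokeedyqotairoeedaiinoldlqoteedy"),
    ("cheda",   9,  "iinchokarqotolqotchedcholcheyqolchedyqoeeeyq"),
    ("okeedy",  7,  "dcheolchdeeyoeeodainsairolchedal"),
]

actual = [len(gap) for _, _, gap in ROWS]       # EVA letters in each gap
expected = [n * RATIO for _, n, _ in ROWS]      # unrounded predictions
total = sum(actual) + sum(len(key) for key, _, _ in ROWS)

for (key, n, _), a, e in zip(ROWS, actual, expected):
    label = key or "(start)"
    print(f"{label:8s} {n:2d} hanzi  expected {e:5.1f}  actual {a:3d}  diff {a - e:+5.1f}")
print("total EVA letters:", total)
```

The recomputed gap lengths are 3, 64, 37, 25, 47, 44 and 32, and together with the six keywords they add up to the 285 letters given for the parag.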

Can you claim that those numbers are just chance coincidences?

Quote:Voynichese is "Chinese", better expressed as "From the Mainland Southeast Asian Language Area"; Japanese largely lacks finals and is very amenable to Romanization (though I do recognize other proposals are still in play)

I don't understand what you mean by "lacks finals". As @Jonas_Barnun explained, if the Dictator was Japanese, he probably read each Chinese character as a single syllable, with its "Chinese" reading.  So that "Japanese" language would be essentially another "dialect" of Chinese.

Quote:"Chinese"/MSEA Languages account for the low entropy; the many-to-one match system destabilizes the connection with the entropy data

The trope that the VMS has "low entropy" is about character entropy, which depends entirely on the spelling system and the transcription alphabet.  IIRC the word entropy, which does not depend on spelling or encoding, is well within the normal range: about 10 bits per word.

Variant spellings, spelling errors, missing or spurious word spaces, etc. can affect the word entropy; but some will increase it, some will decrease it.  Thus even a 10% error rate would have little effect on the word entropy. 
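To illustrate the distinction, here is a minimal sketch of a word entropy computation, assuming the usual first-order definition (Shannon entropy of the word frequency distribution). The function name and the toy sample are mine; the 10 bits figure above comes from memory, not from this code.

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy, in bits per word, of the empirical word distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

toy = "daiin chedy daiin qokeedy".split()   # made-up sample
respelled = [w.upper() for w in toy]        # a consistent one-to-one respelling
print(word_entropy(toy), word_entropy(respelled))
```

Any consistent one-to-one respelling of the words leaves this number unchanged, which is the sense in which word entropy does not depend on the spelling system. Variant spellings of the same word split its count across several keys and raise the value, while merging distinct words lowers it.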

Quote:paradoxically, I think accepting "SPS = SBJ" increasingly looks like rejecting much of the "Chinese Theory".

Well, strictly speaking the two theories are independent.  The SPS being a translation/transcription of the SBJ does not imply that it was created like the COT proposed.  Maybe a bored Chinese ship doctor stranded in Alexandria encoded the whole SBJ with a codebook cipher, replacing each Chinese character with a randomly assigned number in some Roman-like number notation, just to kill time while waiting for the monsoon season to be over; and that papyrus manuscript was then copied onto paper by a Venetian merchant, whose widow then sold it to a Bavarian doctor from Trent who donated it to a convent near Basel where it was copied onto vellum by the convent's five dirty lesbian nuns dreaming of hot baths.  But ...

All the best, --stolfi
“The character 血 is matched with the EVA string aiinqot”

Hi Jorge, does this mean spaces are irrelevant in your theory?