The Voynich Ninja

Full Version: The 'Chinese' Theory: For and Against
(15-04-2026, 02:52 PM)rikforto Wrote: Can you elucidate how you arrived at [link] cribs beyond frequency?

For the first crib, 主 = daiin, I thought of comparing the longest entry in the SBJ, namely "red rooster", with the longest parag in the SPS (excluding parags that seemed to be two or more parags smashed together), namely f105v.32. The two were extreme isolated outliers in the entry/parag length distributions, so either f105v.32 was the match of the "rooster", or there would be no match. But fortunately f105v.32 had survived the loss of the four pages and the big parag mash-ups.

The SBJ entry had 92 hanzi while the SPS parag had "only" 74 words. But the "rooster" entry is exceptional also in that it is 9 separate sub-entries, and the first 7 of these use 主.  Comparing the entry to the parag I noticed that there were 5 occurrences of daiin in the latter that could be matched to 5 of those 7 主, in such a way that the gaps between them, scaled by the right factor, were remarkably similar.  And, moreover, the other two 主 could be paired to a laiin and a dair, with equally consistent spacings.

It turned out that the SPS parag omitted some fields of the "rooster" entry, and the last of the 9 sub-entries entirely.  But fortunately those omissions were at the ends of the entry, and did not affect the spacing between the seven 主s.
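
A minimal sketch of the spacing comparison described above, in Python. The positions below are made up for illustration and are not taken from the "rooster" entry or from f105v.32. The least-squares fit and the name fit_scale are assumptions; the post does not say how the scale factor was actually chosen.

```python
def fit_scale(src_pos, dst_pos):
    """Least-squares scale s and offset b such that dst ~ s * src + b.

    src_pos: positions of the keyword in the source entry (hanzi index).
    dst_pos: positions of the candidate hits in the parag (word index).
    Returns (s, b, residuals).
    """
    n = len(src_pos)
    if n != len(dst_pos) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx = sum(src_pos) / n
    my = sum(dst_pos) / n
    sxx = sum((x - mx) ** 2 for x in src_pos)
    if sxx == 0:
        raise ValueError("source positions must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(src_pos, dst_pos))
    s = sxy / sxx
    b = my - s * mx
    residuals = [y - (s * x + b) for x, y in zip(src_pos, dst_pos)]
    return s, b, residuals


# Hypothetical positions of seven keyword instances (illustration only).
hanzi_pos = [3, 14, 25, 40, 52, 63, 71]
word_pos = [2, 11, 20, 32, 42, 50, 57]

s, b, res = fit_scale(hanzi_pos, word_pos)
print("scale:", round(s, 3), "offset:", round(b, 3))
print("largest residual (words):", round(max(abs(r) for r in res), 3))
```

Small residuals relative to the gaps would mean the spacings agree after scaling; large ones would mean they do not.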

The other two cribs were found basically by the same method, but now by comparing other entry-parag pairs that were identified with the help of the first crib. I noticed that 气 qì was a very common character. It always occurred as a word by itself (rather than part of a compound) and had a specialized meaning that was unlikely to have multiple "translations" into Voynichese. That made it a good candidate for a crib. And indeed, comparing the pairs that I had, the positional correlation between 气 and Chedy was rather obvious.

The last crib, 久服 = qokaiin, was found the same way.  But this one is still not entirely certain, because it is not clear whether qokaiin is 久, 服 or the compound; and it may be that the thing is sometimes translated as qokeedy instead (like English translations of 久服 sometimes say "prolonged intake", sometimes "long-term intake", etc.)

Quote:This broad method has been attempted in many languages, and it is always possible to identify potential cribs by assuming high frequency words correspond.

That works best on languages that have high-frequency function words, like articles, prepositions, copulas, etc. Unfortunately Chinese does not have such things; and most words in the SBJ are names of diseases -- which do repeat, but not often enough to be useful with the limited pairings I have. Also, some common Chinese characters like 不 bù = "not" are hard to use as cribs because it is possible that they are translated into Voynichese in several different ways. For example, 不饥 = bù jī, literally "not hungry", is listed as an effect of long-term use of Chinese onions; in the English translation I got, it is rendered as "prevents hunger".

 
Quote:As an aside, you may also want to check if the Shennong Bencaojing has a Zipfian distribution of characters. Usually Chinese characters are not distributed that way, which is another reason I have doubted a Chinese source

It is an interesting question, but, whatever its result, it will not affect the "SPS=SBJ" claim.  For me, it would be like checking whether the table of Catholic patron saints for the days of the year has a Zipf-like word distribution.

Again, statistics -- like word and character frequencies, entropy, distance correlations -- are not properties of a language. They are properties of a specific text, or of a corpus (collection of specific texts). English does not follow Zipf's law. Most English novels and newspaper articles do. But one probably can construct a grammatically correct and meaningful English text of 10003 words that repeats every word exactly seven times.
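
To make the arithmetic concrete, here is a small Python sketch. It only illustrates the counting (1429 distinct words times 7 repeats gives 10003 tokens, with a completely flat frequency profile). The tokens are placeholders like "w0", not a grammatical English text, and the function name flat_text is hypothetical.

```python
from collections import Counter
import random


def flat_text(n_types=1429, repeats=7, seed=0):
    """Token list in which every word type occurs exactly `repeats` times."""
    words = [f"w{i}" for i in range(n_types)] * repeats
    random.Random(seed).shuffle(words)  # the order does not affect the counts
    return words


tokens = flat_text()
counts = Counter(tokens)

print(len(tokens))           # 10003
print(set(counts.values()))  # {7}

# Rank-frequency profile: every rank has the same count, so there is no
# Zipf-like decay at all, even though this is a perfectly definite "text".
ranked = sorted(counts.values(), reverse=True)
print(ranked[0], ranked[-1])  # 7 7
```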

All the best, --stolfi
Jorge, I read your paper about the "rooster paragraph" and responded to it at length. (For those interested, my issues with supposing the text has been altered can be found [link] and [link], and my difficulties extending glosses can be found [link]. My other comments thereabouts in the thread are salient too.)

At present, I am asking for the 11 new identifications so I can evaluate them. If they are not ready for presentation, I understand, but then they cannot be relied upon as proof for the rest of us
(15-04-2026, 08:34 PM)Rafal Wrote: So how do you know it is not a cipher?

First, there is nothing in the VMS itself that remotely suggests that it is in cipher. The only reason people even think of cipher is because of that mistaken conclusion that the language must be "European". Once you accept that the language is an East Asian monosyllabic one, there is no reason to suspect that it may be encrypted.

Second, any complicated cipher (like the Naibbe one, or even a simple Vigenère) would be a pain to write and a pain to read.  Ciphers make sense for letters and other short documents, where the need for secrecy justifies the inconvenience.  Or for short sensitive passages in a longer document.  It seems very unlikely that one would use a complicated cipher for a whole reference book.  The examples I know of whole books written in "hard" cipher are "demos" by cryptographers like Trithemius or Raphael Mnishowski who wanted to show off their skill.  Are there any others?

And third, why would the Author want to encrypt a transcription of the Shennong Bencaojing?  The plain text in phonetic Chinese would be unreadable by anyone else in Europe without a glossary.  Even by the best cryptographers of the time.  Even by today's cryptographers, if they don't know Chinese...

One may think that perhaps he translated the SBJ to Latin (word for word?!?) and then decided to encrypt it to prevent his rivals from learning that they could connect to the cosmic intelligence by eating hemp flowers (recipe b1.6.114). But curiously the sophisticated encryption method that he devised happened to turn his Latin text into something with all the statistical properties of a phonetic rendering of an East Asian language...

All the best, --stolfi
(16-04-2026, 12:08 AM)Jorge_Stolfi Wrote: The only reason people even think of cipher is because of that mistaken conclusion that the language must be "European".


Every time you claim this, I will carry on repeating that it is for you to prove and that you have not proved it.

Making these sorts of generalisations isn't doing your theory any favours.
(15-04-2026, 11:44 PM)rikforto Wrote: Jorge, I read your paper about the "rooster paragraph" and responded to it at length. (For those interested, my issues with supposing the text has been altered can be found [link] and [link], and my difficulties extending glosses can be found [link]. My other comments thereabouts in the thread are salient too.)

Sorry if I did not respond to all those concerns; there were many posts on this thread on those days, and this "wonderful" forum software will mark unread posts 10, 11, 12, 13 as "read" if you read just 10 and reply to it.

But I hope that the evidence will eventually be enough to make those supposed weaknesses irrelevant.

Quote:At present, I am asking for the 11 new identifications so I can evaluate them. If they are not ready for presentation, I understand, but then they cannot be relied upon as proof for the rest of us

I have already checked more than 30 SBJ entries and found matches for more than 20 (the expected ratio).  But with varying degrees of confidence.

[link]. The main information is at the very end.

There are a number of "paragraph evaluation blocks" in that page. Their format is described below.  They were computed for all "good" (non-smashed) parags of the SPS, but only a few of those with best (lowest) scores are shown.

Most of that HTML page is generated by a bunch of python functions that I wrote over the past month or so.   But some parts had to be generated manually, including the detailed parsing of the SPS entry and its matching English translation, the identification of omitted fields, the hanzi keywords to use in the structural matching, the patterns for the possible Voynichese translations, etc.  And then the various "best matches" had to be inspected to see whether they look like real or just coincidences.

There are still some bugs and shortcomings in my functions and in the formatting of those pages.  I am working on that.

IMPORTANT: the alignment of the SPS parag words with the SBJ entry hanzi, shown at the very end of that page, is tentative.  Only the matching of the keywords is reasonably certain; for the words between the keywords, the correct alignment may be off by one word or more.  And even for the keywords there may be other possible assignments; that one shown is the one which gave the smallest badness score.
All the best, --stolfi

In that report there are a number of "paragraph evaluation blocks" that detail the comparison of the SBJ entry with a candidate parag of the SPS.


Each block refers to a specific parsing of the SBJ entry into "hits" (instances of the known keywords, like 主) and "gaps" (the strings before, between, and after those instances).

Each candidate SPS parag was tentatively parsed in the same way, into "hits" and "gaps", by looking for RE patterns that describe the possible Voynichese spellings of those same keywords.  For example, the pattern for the 主 keyword is 'daiin|laiin|dair' when trying for a match under "strict" criteria, or '[dlkrs][ao]iin|dain|[dlkrs][ao]ir' if that strict attempt fails and the search is repeated with more "liberal" criteria.
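
A rough Python sketch of that parsing step, using the two patterns quoted above. Everything else is an assumption made for illustration: the sample words, the function names, the choice to test each whole word with fullmatch (a plain search would also find 'kaiin' inside 'qokaiin' under the liberal pattern), and the rule for when to fall back to the liberal pattern.

```python
import re

# Patterns for the 主 keyword, as quoted in the post.
STRICT = re.compile(r"daiin|laiin|dair")
LIBERAL = re.compile(r"[dlkrs][ao]iin|dain|[dlkrs][ao]ir")


def parse_parag(words, pattern):
    """Split a list of EVA words into hits and gaps.

    Returns (hits, gaps). `gaps` has len(hits) + 1 entries: the number of
    EVA letters in the non-hit words before, between, and after the hits.
    """
    hits = []
    gaps = [0]
    for w in words:
        if pattern.fullmatch(w):  # assumption: a hit must be a whole word
            hits.append(w)
            gaps.append(0)
        else:
            gaps[-1] += len(w)
    return hits, gaps


def parse_with_fallback(words, min_hits):
    """Try the strict pattern first; retry with the liberal one if too few hits."""
    hits, gaps = parse_parag(words, STRICT)
    if len(hits) < min_hits:
        hits, gaps = parse_parag(words, LIBERAL)
    return hits, gaps


# Made-up sample, not a real parag.
sample = "qokeedy.daiin.chedy.shol.laiin.otedy.dair.qokaiin".split(".")
print(parse_parag(sample, STRICT))
# (['daiin', 'laiin', 'dair'], [7, 9, 5, 7])
```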

The first line of each block is the locus ID of the candidate parag's head line, like "f114.1", followed by a numeric "badness score" that summarizes how badly that parag matches the SBJ entry in question. More precisely, it measures how much the actual lengths of the gaps in the SPS parag differ from the values predicted from the corresponding gaps in the SBJ entry, assuming a ratio of ~5 EVA letters per hanzi. The score would be zero for a perfect match (all gap sizes match the predicted values).

The second line has {N+2} numbers, where {N} is the number of specified keywords (such as '主') in that SBJ entry which were matched in the SPS parag. Those numbers are the percent errors (deviation from predicted value ranges) of the parag's total gap length and of the lengths of the {N+1} individual gaps before, between, and after the identified {N} keywords.

The third line gives the actual total gap length and the {N+1} actual gap lengths.

The fourth line gives the predictions for the numbers on the third line. Namely, the gap sizes of the SBJ entry times 5. They are ranges rather than numbers to account for rounding errors.
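
One way such numbers could be computed, sketched in Python. Only the factor of 5 EVA letters per hanzi, the use of ranges, and the zero score for a perfect match come from the description above. The width of the ranges (half a hanzi either side), the normalization of the percent error, and the root-mean-square combination into a single badness score are all assumptions, as are the function names; the actual report may use different formulas.

```python
import math

LETTERS_PER_HANZI = 5  # ratio quoted in the post


def predicted_range(hanzi_gap, slack=0.5):
    """Assumed range: the hanzi gap plus or minus `slack` hanzi, times 5."""
    lo = max(0.0, (hanzi_gap - slack) * LETTERS_PER_HANZI)
    hi = (hanzi_gap + slack) * LETTERS_PER_HANZI
    return lo, hi


def percent_error(actual, lo, hi):
    """Zero inside [lo, hi]; else distance to the range as % of its midpoint."""
    if lo <= actual <= hi:
        return 0.0
    dist = (lo - actual) if actual < lo else (actual - hi)
    mid = max((lo + hi) / 2.0, 1.0)  # avoid dividing by zero for empty gaps
    return 100.0 * dist / mid


def evaluate(sbj_gaps, sps_gaps):
    """Compare N+1 SBJ gaps (in hanzi) with N+1 SPS gaps (in EVA letters).

    Returns (badness, errors, actuals, ranges); each list has N+2 entries,
    the total first and then the individual gaps.
    """
    if len(sbj_gaps) != len(sps_gaps):
        raise ValueError("both parsings must have the same number of gaps")
    ranges = [predicted_range(sum(sbj_gaps))]
    ranges += [predicted_range(g) for g in sbj_gaps]
    actuals = [sum(sps_gaps)] + list(sps_gaps)
    errors = [percent_error(a, lo, hi) for a, (lo, hi) in zip(actuals, ranges)]
    badness = math.sqrt(sum(e * e for e in errors) / len(errors))
    return badness, errors, actuals, ranges


# Made-up gaps: 2, 3, and 1 hanzi against 10, 15, and 5 EVA letters.
badness, errors, actuals, ranges = evaluate([2, 3, 1], [10, 15, 5])
print(badness, errors)
```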

The following {N+1} lines show the parsing of the SPS parag defined by keyword patterns.  The left column shows the "hits" (the actual substrings of the parag that matched the keyword patterns), the right column is the gap after each hit.

If the parag had multiple possible matches for the specified keyword patterns, the block shows the choices of the {N} matches that gave the lowest badness score.
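
A brute-force sketch of that selection in Python, for illustration only. It assumes that candidate hits are whole words identified by their index in the parag, that unchosen candidates count as ordinary gap text, and that the score is a plain sum of absolute differences from predicted gap lengths; the real badness score is presumably more elaborate. The names gaps_for and best_choice are hypothetical.

```python
from itertools import combinations


def gaps_for(words, chosen):
    """EVA-letter gap lengths before, between, and after the chosen indices."""
    bounds = [-1] + list(chosen) + [len(words)]
    return [sum(len(w) for w in words[a + 1:b])
            for a, b in zip(bounds, bounds[1:])]


def best_choice(words, candidates, predicted):
    """Pick N of the candidate word indices minimizing a simple badness.

    predicted: N + 1 predicted gap lengths in EVA letters.
    Returns (score, chosen_indices, gaps), or None if there are fewer
    than N candidates.
    """
    n = len(predicted) - 1
    best = None
    for chosen in combinations(candidates, n):
        gaps = gaps_for(words, chosen)
        score = sum(abs(g - p) for g, p in zip(gaps, predicted))
        if best is None or score < best[0]:
            best = (score, chosen, gaps)
    return best


# Made-up parag with three candidate hits, of which only two are wanted.
words = ["daiin", "chedy", "daiin", "shol", "otedy", "daiin"]
print(best_choice(words, [0, 2, 5], [0, 20, 0]))
# (1, (0, 5), [0, 19, 0])
```

The number of combinations grows quickly with the number of candidates, so this exhaustive search is only practical for short parags with few hits.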

All the substrings of the parag that matched the keyword patterns are highlighted in bold, even those that were not considered hits for score computation.
 
All sizes are in EVA characters.
(16-04-2026, 12:49 AM)tavie Wrote:
(16-04-2026, 12:08 AM)Jorge_Stolfi Wrote: The only reason people even think of cipher is because of that mistaken conclusion that the language must be "European".

Every time you claim this, I will carry on repeating that it is for you to prove and that you have not proved it.

And every time I must apologize for the sweeping generalization.  Agreed, there must be many people who got to different conclusions by different lines of reasoning.

But somehow they all apparently came to strongly reject the possibility that it may be plaintext in an East Asian language.  Anyone who seriously admitted this possibility would have seen what I see, a hundred years ago.  How did they get there?

Quote:[it] isn't doing your theory any favours.

Well, I don't need people to believe in my theory.  I am not after YouTube subscriptions or selling books about it.  

I presented the evidence I have because I felt that I must.   Really, I think it should have been enough to convince anyone who was not already committed to some other theory.   

But if some people are not convinced, okay, it is their "right"...  I will keep hacking, and eventually post more...

By the way, the "SPS=SBJ" claim is independent of the "Chinese Origin" theory per se.  It says only that the SPS is a mostly word-for-word transcription of the Chinese SBJ, or of a mostly word-for-word translation of the same into some other East Asian language.  It does not say anything about how that came to be. 

All the best, --stolfi
Also, some people are agnostic about it.
Not in the sense of: "I refuse to know", but in the sense: "I know that I don't know".
(16-04-2026, 03:51 AM)Jorge_Stolfi Wrote:
(16-04-2026, 12:49 AM)tavie Wrote:
(16-04-2026, 12:08 AM)Jorge_Stolfi Wrote: The only reason people even think of cipher is because of that mistaken conclusion that the language must be "European".

Every time you claim this, I will carry on repeating that it is for you to prove and that you have not proved it.

And every time I must apologize for the sweeping generalization.  Agreed, there must be many people who got to different conclusions by different lines of reasoning.

But somehow they all apparently came to strongly reject the possibility that it may be plaintext in an East Asian language.  Anyone who seriously admitted this possibility would have seen what I see, a hundred years ago.  How did they get there?

Worth saying I precipitated this by saying that you have not shown how the text is encoded and you took this to be a claim it's a cipher. I mean you have not shown how some version of the Shennong Bencaojing has been converted to Voynichese or recovered a version of the Shennong Bencaojing from it even if I accept your tentative identifications. The process you use is not capable of doing that, and has not yet opened an avenue to doing so. I would also note that a novel, undeciphered orthography functions much like a cipher practically, and I don't see how splitting that hair would make the issue that the orthography is undescribed less acute, which is why I let that comment pass initially.

This also isn't the first time I've told you that I am not starting from the assumption that the text is European, and at any rate you have seen me criticize linguistic claims that would back a European origin when I feel those claims are not warranted. I've been candid that I view the Alpine region as "first among equals" when talking about the origin, but my objections to the identification with the Shennong Bencaojing are focused on the substance of those claims, not whether or not the presentation in the VMS supports it. For instance, I think identifying 主 with daiin runs into the problem that daiin is not distributed through the SPS in a way that is amenable to the formulaic use in the Shennong Bencaojing. I have outlined conceptual issues with identifying 主 with a family of Voynich words to get around that. You may find this unconvincing, and I have responded to your justifications for this approach, but importantly these criticisms are not logically dependent on my thoughts about the artifact's origin.

I have some comments about the specific new identification you linked above, but those are going to take a bit more time to write up
Quote:First, there is nothing the VMS itself that remotely suggests that it is in cipher.  The only reason people even think of cipher is because of that mistaken conclusion that the language must be "European".  Once you accept that the language is an East Asian monosyllabic one, there is no reason to suspect that it may be encrypted.

You know, a script vs a cipher thing is one of my "fixations" that I developed while working with the Rohonc Codex ;) So let me share a few ideas.

Quote:Second, any complicated cipher (like the Naibbe one, or even a simple Vigenère) would be a pain to write and a pain to read.
I agree. But a cipher doesn't have to be complicated. A cipher may be a simple substitution, a sign for a letter. In such a case, once you memorize the signs you can write with the same fluency as when writing in a standard script.

It is not a level of complication that makes something a cipher.
A cipher is a notation that is meant, on purpose, to hide the meaning of the text from some people.
You develop a cipher despite having some standard way of writing because your standard script is known to your enemies (or more generally to some random people that you want to hide the meaning from).

On the other hand, a natural script is something that some group of people (in extreme cases just one man) uses because they don't know anything else or because it is just the best option for them. People writing in a natural script aren't afraid that someone may understand the text; in most cases (a book, a newspaper, a legal paper) they want everyone to be able to understand it.

Why aren't Linear B, the Cherokee Syllabary, or even the Rohonc Codex (if I am right) ciphers? Because, just as you have a first language, you have a first script, and it was the first script for their users.

And if I understand the Chinese theory correctly, Voynichese wasn't a first script for anyone there.
As I understand it, in this theory some European traveller wanted to have a copy of some Chinese treatise.
Somehow he didn't choose the most obvious options:
- he didn't write it down translated to his own language
- he didn't write it with original Asian letters
- he didn't write it transliterated with Latin letters

But he decided to invent a totally new notation for that.
People are usually "lazy" and do their job with as little effort as possible. Here he seems to break this rule.

Some explanation could be that he wanted to hide the meaning. It would make the text a cipher.
But you say it isn't a cipher but a natural script. Yet someone invented it even if he had easier, already available options. Why?
And why is it a script?
(16-04-2026, 11:19 AM)rikforto Wrote: Worth saying I precipitated this by saying that you have not shown how the text is encoded

Indeed, I still don't know how each Chinese character was pronounced by the hypothetical Dictator, nor how the Author encoded those sounds into glyphs.   But I believe that the evidence is categorical that each Chinese character got mapped to a Voynichese word -- most of the time, and possibly with certain variations that could be errors, sandhi, or inflections.

Quote:and you took this to be a claim it's a cipher.

What I meant is that you are still looking at the VMS as a cryptographic puzzle, and therefore you are demanding a solution in the form that cryptographers expect for such puzzles: namely, a fairly complete description of the encoding algorithm, and a full plaintext that results from applying the inverse algorithm to the ciphertext.

Well, sorry, but it is almost certain that there will be no such solution.  For the SPS, and for the other sections.  Because this is not a cryptography puzzle, it is more like a lost language puzzle. Like deciphering a Sumerian, Minoan, or Etruscan text.  What I found is a Rosetta Stone; and even though it seems to be an amazingly good one (long, mostly word for word, etc), building a dictionary for the language and understanding the writing system will be a long and hard process, a few words at a time.

Quote:I would also note that a novel, undeciphered orthography functions much like a cipher practically, and I don't see how splitting that hair would make the issue that the orthography is undescribed less acute, which is why I let that comment pass initially.

The main difference between an unknown orthography and a cipher is that the latter is usually designed to be unambiguous and easily invertible, which the former often is not.

For instance, the Arabic and Hebrew scripts usually omit all short vowels, even though those may be semantically significant.  Italian spelling does not mark the stressed syllable, even though there are pairs of words like ancora (anchor) and ancora (again, still) that differ only in that detail. The casual English transcription system for Chinese words, like "feng shui", omits the all-important tones.  Even in English you have "read" (present) and "read" (past).

And, as I mentioned before, some stenography systems omit the voiced/unvoiced distinction altogether, and other supposedly essential details (like the final "s" of plurals).
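
A toy Python illustration of that point. It is an English stand-in with made-up sample words, not a model of Arabic, Hebrew, or any stenography system: once the writing rule drops information (here, all vowels), distinct words collapse to the same spelling and the mapping can no longer be inverted mechanically.

```python
from collections import defaultdict


def drop_vowels(word):
    """Toy lossy 'orthography': write only the consonants."""
    return "".join(c for c in word if c not in "aeiou")


def collisions(words, encode):
    """Group words by their encoded form; keep only the ambiguous spellings."""
    groups = defaultdict(list)
    for w in words:
        groups[encode(w)].append(w)
    return {k: v for k, v in groups.items() if len(v) > 1}


sample = ["bat", "bit", "but", "cat", "cot", "dog"]
print(collisions(sample, drop_vowels))
# {'bt': ['bat', 'bit', 'but'], 'ct': ['cat', 'cot']}
```

A reader who knows the language resolves 'bt' from context; a decoding algorithm alone cannot.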

Quote:For instance, I think identifying 主 with daiin runs into the problem that daiin is not distributed through the SPS in a way that is amenable to the formulaic use in the Shennong Bencaojing.

But not every daiin is a 主, just like not every -ed in English is a past tense suffix.  

And, conversely, not every 主 was written as daiin. In the rooster entry, out of the seven 主, one became laiin, another became dair.  Variations which seem quite compatible with in and d being written in cursive handwriting on the Author's draft.  

Again, a messy correspondence like that would discredit a proposed solution to a cryptographic puzzle, but would be quite expected from an improvised transcription effort like the one implied here.

All the best, --stolfi