The Voynich Ninja

Full Version: The 'Chinese' Theory: For and Against
@stolfi You didn't answer my question about Spaces (it probably got lost in the mix), so I took the initiative and asked Claude myself. Since it's AI, take the following with a grain of salt, but it's still interesting:

The VMS’s restriction to a few high-frequency token beginnings (Top 5 = 72%) and very few token endings (Top 5 = 91%) corresponds quantitatively to Chinese syllable phonotactics: approx. 19–20 initials and 3–5 codas in Mandarin around 1400. If each token = one syllable, the limited variety of initial and final glyphs is a necessary consequence of Chinese syllable structure and consistent with it.
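For anyone who wants to check figures like these on their own transliteration, here is a minimal Python sketch of the "top 5 share" measurement. It assumes the tokens are already split on spaces and treats each transliteration character as one glyph, which is a simplification (multi-character glyphs such as EVA ch or sh would need to be merged first). The function name and the sample tokens are only illustrative.

```python
from collections import Counter

def top_k_coverage(tokens, k=5):
    """Return (initial_share, final_share): the fraction of tokens whose
    first / last character is among the k most common first / last characters."""
    tokens = [t for t in tokens if t]          # drop empty strings
    if not tokens:
        return 0.0, 0.0
    firsts = Counter(t[0] for t in tokens)     # counts of token-initial characters
    lasts = Counter(t[-1] for t in tokens)     # counts of token-final characters
    n = len(tokens)
    initial_share = sum(c for _, c in firsts.most_common(k)) / n
    final_share = sum(c for _, c in lasts.most_common(k)) / n
    return initial_share, final_share

# Illustrative tokens only, not a real sample of the manuscript.
sample = "daiin chedy qokeedy shedy daiin ol chol qokain".split()
print(top_k_coverage(sample, k=2))
```

The same function can be run on a syllable-per-token text in any language for comparison, so the claim does not have to be taken on faith.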

I would like to emphasize that I don’t have the slightest knowledge of Chinese and therefore cannot verify this in any way....
(14-05-2026, 06:59 AM)JoJo_Jost Wrote: Since I’m currently looking into word boundaries and the glyph stream, I asked myself how Chinese theory can be reconciled with the 7–8 rules that define about 90% of all spaces in VMS?... If it were written phonetically, would that mean the Chinese language implicitly follows these rules?

Sorry for missing your question.

The answer will depend on the particular "Chinese" language (the Net says that there are more than 50 of them) and on the phonetic notation used. 

For Mandarin (the "Chinese" of Beijing) there is a widely used phonetic notation called pinyin, which uses Latin letters plus diacritics.  It looks like this:
  • léi shǔ zhǔ duò tāi lìng yì chǎn shēng píng gǔ

Each syllable (word) there corresponds to one Chinese character.  In that notation, it is fairly straightforward to recover the spaces if they are deleted.  In fact, compounds of two or more syllables are usually strung together without a space.  There are only a few ambiguous cases where the syllables of a compound have to be separated by an apostrophe.  Like, shengú could be shēng ú or shēn gú; the orthography rules say that one of them is the default split, and the other one needs an apostrophe (I forgot which one).

Sometimes, when Mandarin names and places are mentioned in English texts, the diacritics are omitted.  Technically Mao Zedong (Mao Tse-Tung in older spelling) could be Máo Zédōng or Mào Zēdǒng or a few dozen other names, and only one of them is correct, but usually the context says which one is meant.  Even without diacritics, the splitting of compounds into syllables (like Zedong -> Ze Dong) is almost always trivial and unambiguous.
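To make the "almost always unambiguous" point concrete, here is a small Python sketch that lists every way a toneless compound can be cut into syllables. The SYLLABLES set is a tiny hand-picked assumption, not the full Mandarin inventory, and the "fangan" example is mine, chosen only to show what an ambiguous case looks like.

```python
# Hypothetical mini-inventory of toneless pinyin syllables (assumption:
# a real test would load the full list of valid syllables instead).
SYLLABLES = {"mao", "ze", "dong", "fan", "fang", "gan", "an"}

def segmentations(text, syllables=SYLLABLES):
    """Return all ways to split `text` into a sequence of known syllables."""
    if not text:
        return [[]]                      # exactly one way to split the empty string
    results = []
    for k in range(1, len(text) + 1):
        head = text[:k]
        if head in syllables:
            for rest in segmentations(text[k:], syllables):
                results.append([head] + rest)
    return results

print(segmentations("zedong"))   # a single split: ze + dong
print(segmentations("fangan"))   # two splits, so an apostrophe would be needed
```

A string with exactly one segmentation needs no spaces or apostrophes to be read back; a string with several is one of the cases the orthography rules have to settle.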

Another variant of pinyin uses digit suffixes 1-4, instead of diacritics, to indicate the tones. Like this:
  • lei2 shu3 zhu3 duo4 tai1 ling4 yi4 chan3 sheng1 ping2 gu3

In this notation the spaces are trivially recoverable, since every syllable ends with a digit.
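As a quick illustration, a Python sketch that puts the spaces back into a digit-suffixed transcription. It assumes every syllable is a run of lowercase letters followed by exactly one tone digit, which holds for the examples in this post but may not hold for every romanization; the function name is just for illustration.

```python
import re

def split_numbered_syllables(text):
    """Split a space-free tone-numbered transcription into syllables.
    Assumes each syllable is lowercase letters followed by one tone digit."""
    return re.findall(r"[a-zü]+[0-9]", text)

squashed = "lei2shu3zhu3duo4tai1ling4yi4chan3sheng1ping2gu3"
print(" ".join(split_numbered_syllables(squashed)))
```

The same pattern works on any transcription that marks tones with one trailing digit per syllable.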

This numeric encoding of tones is more often used for tonal monosyllabic languages other than Mandarin, especially those that have more than four tones.  In Cantonese, the "dialect" of Hong Kong, the same characters that correspond to that pinyin text would be read and phonetically transcribed as
  • to4 syu2 zyu2 do6 toi1 ling6 ji6 caan2 saang1 ping4 guk1

Vietnamese is another monosyllabic tonal language. It has an official script and spelling system based on Latin letters with lots of diacritics.  The literal translation of that text into Vietnamese, as would be read by a scholar from the same Chinese characters, would be written down as something like this:
  • đà thử chủ đọa thai linh dị sản sinh bình cốc.

(At least that is what Google says...)  A more natural translation, that tries to respect the grammar and modern vocabulary of Vietnamese, would be
  • chuột đà chủ trị phá thai giúp dễ đẻ sống ở thung lũng bằng phẳng.

Either way, I bet that recovering deleted spaces would be almost as easy as with pinyin.

All the best, --stolfi
I’m not entirely sure why the underlying syllables have to be Asian in origin, other than the fairly weak match to a specific document. I think a lot of languages have some very frequently used syllables, and some would make similar sense to zhu if it was indeed a set of recipes or even if the frequent syllable was a definite article.
(19-05-2026, 04:47 PM)Grove Wrote: I’m not entirely sure why the underlying syllables have to be Asian in origin

Well, if the text matches a specific Chinese document practically word for word, that conclusion seems all but certain.

But if you don't accept that particular claim, there is still the fact that the length distribution and internal structure of the VMS words is consistent with the structure of syllables, and incompatible with them being polysyllabic words, even under any encryption that maps longer words to longer strings.  

Thus we are left with (a) each word of the Voynichese language is a single syllable, or (b) the words of the Voy ni che se lan gua ge were split into syllables as part of the encoding, or (c) the encryption algorithm maps words of arbitrary length to strings of bounded length, with a syllable-like structure.

Alternative (c) would basically be a codebook- or nomenclator-type cipher, where each word is replaced by a number according to a dictionary.  Many years ago I described [link hidden in this view] that would produce numbers with a syllable-like structure.

Quote:some [languages] would make similar sense to zhu if it was indeed a set of recipes or even if the frequent syllable was a definite article.

The Voynichese daiin (and sometimes slight variants like dainkaiin and laiin, and the abbreviation dam) does not behave like a definite article.  That hypothesis has been explored and dismissed long ago.  In the Starred Parags section, it does behave like a keyword that comes after the title of the recipe and introduces a list of indications.  Maybe it works like that in the Herbal sections too; I haven't looked into it.

It may be significant that daiin is one of a very few words that are about as frequent in Herbal-A as in Herbal-B.

All the best, --stolfi
I may be misunderstanding but it seems like two issues may be getting mixed together, and I definitely don't understand all the rooster/f105v.32 nuances...

One issue reads as historical: FIRST, which SBJ or ZHB version is the right comparator, and SECOND, whether the source text could have existed in the 'right form' around 1400. I definitely don't have anything useful to add there.

BUT, the other issue seems to be more testable: if this is a positional-distance hypothesis, can the method we use to match also recover the rooster/f105v32-38 pair when we run the whole thing blind? Which files should I grab for that?

E.g., compare all SPS paragraphs against all SBJ/ZHB entries without preselecting the rooster pair, and ask whether the repeated ones line up better for f105.v32-38 and rooster than for the other pairings that pop out. Same with the daiin/dair/etc. pairs: are they popping out better than other plausible pairs?
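In case it helps pin down what I mean by "blind", here is a rough Python sketch. It is not the method from the Notes folder: the similarity function is passed in as a parameter, and the toy one used below (comparing only how many positions each item has) is a placeholder assumption. All names and the data are hypothetical.

```python
from itertools import product

def all_pair_scores(paragraphs, entries, similarity):
    """Score every (paragraph, entry) pairing; higher score = better match."""
    return {
        (p, e): similarity(paragraphs[p], entries[e])
        for p, e in product(paragraphs, entries)
    }

def rank_of_pair(scores, claimed):
    """Rank of the claimed pairing among all pairings (1 = best, ties share a rank)."""
    target = scores[claimed]
    return 1 + sum(1 for s in scores.values() if s > target)

# Placeholder similarity (assumption): closer counts of keyword positions score higher.
def toy_similarity(a, b):
    return -abs(len(a) - len(b))

# Made-up positions of a repeated keyword in each paragraph / entry.
paragraphs = {"p1": [1, 2, 3], "p2": [1]}
entries = {"e1": [4, 5, 6], "e2": [7]}
scores = all_pair_scores(paragraphs, entries, toy_similarity)
print(rank_of_pair(scores, ("p1", "e1")))
```

If the claimed pair does not rank near the top when nothing is preselected, the match is not doing much work; if it does, that is at least some evidence the method is picking up something real.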

Is this the right road method-wise? I started looking at and playing with files in the ic.unicamp....Notes/077 folder but before I go too far down the road, if the method is bad I don't want to keep going, and if there are files that are final versions or authoritative I'd want to use those.

thanks,
Joey
(19-05-2026, 08:25 PM)Jorge_Stolfi Wrote: Thus we are left with (a) each word of the Voynichese language is a single syllable, or (b) the words of the Voy ni che se lan gua ge were split into syllables as part of the encoding, or (c) the encryption algorithm maps words of arbitrary length to strings of bounded length, with a syllable-like structure.

With regard to option (b), assuming Latin is typical (I picked Latin because the rules for breaking Latin words into syllables are straightforward): while the resulting "vocabulary" is highly Zipfian, the type/token ratio is way too low compared to the Voynich text, because the total number of possible syllables is relatively limited -- you add types much less frequently as the number of tokens grows.
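The "types are added less and less often" effect is easy to see with a type-growth curve. A minimal Python sketch, assuming the text is already a list of tokens (syllables or Voynichese words); the function name and the step size are arbitrary choices of mine.

```python
def type_growth(tokens, step=1000):
    """Return [(tokens_read, distinct_types_so_far), ...], sampled every
    `step` tokens and once more at the end of the text."""
    seen = set()
    curve = []
    total = len(tokens)
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve

# Tiny illustrative input; a real comparison would use the full texts.
print(type_growth(list("aabac"), step=2))
```

Plotting this curve for syllabified Latin next to the one for the Voynich text, at the same token counts, would show directly whether the syllable curve flattens out much earlier.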
(19-05-2026, 08:25 PM)Jorge_Stolfi Wrote: [..]
The Voynichese daiin (and sometimes slight variants like dainkaiin and laiin, and the abbreviation dam)

There is no proof for some of them being just „slight variants“ or even an „abbreviation“ for daiin or anything else.
(20-05-2026, 03:01 PM)kckluge Wrote: the rules for breaking Latin words into syllables are straightforward

Not really: there are some difficulties especially if you use medieval Latin without phonetic distinction of v/u, i/j and exceptions to the basic rules if you want to do it right, like prefixes that you're not supposed to break so you need to check the etymology...

Quote:the total number of possible syllables is relatively limited

I estimate it to 800-1000 in long texts (without many exotic words), more than double what Mandarin has (~400).
(20-05-2026, 03:43 PM)nablator Wrote:
(20-05-2026, 03:01 PM)kckluge Wrote: the rules for breaking Latin words into syllables are straightforward

Not really: there are some difficulties especially if you use medieval Latin without phonetic distinction of v/u, i/j and exceptions to the basic rules if you want to do it right, like prefixes that you're not supposed to break so you need to check the etymology...

OK, *relatively* straightforward. Certainly compared to (say) English. Code was based on [link hidden in this view]. Yes, I recognize that the results of that code won't be 100% correct because there are subtleties in applying those rules.

Quote:
Quote:the total number of possible syllables is relatively limited

I estimate it to 800-1000 in long texts (without many exotic words), more than double what Mandarin has (~400).

Latin text used was an excerpt from Bacon's _Opus Majus_, Bk 4, beginning "postquam manifesta est necessitas". 

Zipf's Law fit to the 300 most frequent "words" (Latin syllables): ln(freq) = -0.968299 * ln(rank) + 3.637783. 777 types, 24895 tokens, 18.79% hapax, type/token ratio = 0.03121
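For anyone who wants to reproduce numbers of this kind on their own text, here is a Python sketch of the calculation. It is not the code used for the figures in this post. It assumes an ordinary least-squares fit of ln(freq) against ln(rank) on raw counts, and it takes the hapax share relative to the number of types; both are my assumptions, and a different frequency normalization would change only the intercept, not the slope.

```python
import math
from collections import Counter

def vocab_stats(tokens, top_n=300):
    """Type/token ratio, hapax share (relative to types), and a least-squares
    Zipf fit ln(freq) = slope * ln(rank) + intercept over the top_n types."""
    if not tokens:
        raise ValueError("need at least one token")
    freqs = sorted(Counter(tokens).values(), reverse=True)
    n_types, n_tokens = len(freqs), len(tokens)
    hapax = sum(1 for f in freqs if f == 1)
    top = freqs[:top_n]
    xs = [math.log(rank) for rank in range(1, len(top) + 1)]
    ys = [math.log(f) for f in top]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0      # a single type gives no slope to fit
    intercept = mean_y - slope * mean_x
    return {
        "types": n_types,
        "tokens": n_tokens,
        "ttr": n_types / n_tokens,
        "hapax_share": hapax / n_types,
        "slope": slope,
        "intercept": intercept,
    }

# Consistency check on the ratio quoted above.
print(round(777 / 24895, 5))  # 0.03121
```

Feeding the syllabified Latin and the Voynich tokens through the same function keeps the comparison like for like.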

All of which is getting into the weeds. The point is that *if* Voynichese were a cipher that breaks up words into smaller chunks, the process that breaks them up is unlikely to be syllabification (and likely isn't deterministic in general?) due to the extremely low TTR that results.
(20-05-2026, 06:22 PM)kckluge Wrote: All of which is getting into the weeds. The point is that *if* Voynichese were a cipher that breaks up words into smaller chunks, the process that breaks them up is unlikely to be syllabification (and likely isn't deterministic in general?) due to the extremely low TTR that results.

The TTR of Voynichese can be reduced as much as you need... by simplifications, equivalences, re-spacing. However the babble-like sequences of similar words would not produce plausible Latin (or Chinese).