The Voynich Ninja - An Artificial Construction

Pages: 1 2 3 4 5 6 7

(14-05-2026, 10:06 AM)oshfdk Wrote: Here it is. I repeated @dashstofsk's computation as described in the original post. This is what I would expect from a natural language: a lot of underrepresented combinations (and a few hugely overrepresented ones). Nothing like the Voynich MS chart, in which the upper left corner mostly consists of numbers close to one.

@dashstofsk concluded that it was an artificial language because his table looked totally unlike what one would get from a "European" language (including Hebrew, Arabic, Turkish, etc.).

In fact, it is not even clear how one could build such a table for those languages, because there is no obvious and manageable way to split words into prefix-core-suffix. Thus the striking difference that @dashstofsk saw is mainly due to the peculiar structure of the VMS words, with a small number of "slots", each with its own set of alternatives. That structure has been modeled in many ways by many people.

But Mandarin and other monosyllabic languages also have a broadly similar "slot" structure, and thus admit natural prefix-core-suffix decompositions. (The one I suggested is arbitrary; there are many other possibilities.)

And, comparing your table for pinyin with @dashstofsk's table for the VMS, the similarities are much more striking than the differences. Note that in the VMS table there are significant deviations from unity even in the part where the sampling error should be small, say the first three rows and the first seven columns. There, we see 1.31 for qokedy and 0.24 for ky.

In fact, one could argue that those discrepancies are evidence against his thesis, because they show that the choice of suffix is not independent of the choice of prefix.

And that was expected, even without considering frequencies. For example, according to [link unavailable], the strings that can follow a gallows letter depend on what came before it. Specifically, a word can have at most two of the elements X = {{ch} {sh} {ee} {che} {she} {eee}} and at most three of Q ∪ D ∪ N = ({q} {d} {l} {r} {s} {n} {in} {iin} {iiin}). Thus, for instance, if @dashstofsk's prefix has an X element, the suffix can have at most one X.

And Mandarin pinyin has similar constraints too. Ignoring the tones, there is a limited repertoire of vowel combinations: all single vowels, many vowel pairs, but only a few vowel triples. Thus, in any prefix-core-suffix segmentation, if the prefix and core have two vowels, the suffix will almost always have none.

But it is true that the deviations from unity in your pinyin table are more dramatic than those in @dashstofsk's VMS table. There are several possible explanations for that which still allow Voynichese to be a natural language:

1) In the written Voynichese language, the prefix and suffix of each syllable happen to be more independent than they are in Mandarin.
2) "The accented vowel" was a bad choice for the core of a Mandarin pinyin syllable.
3) The VMS text has many more errors than the Mandarin pinyin one. We know that there are transcription errors, because many glyphs are hard to identify and transcribers often pick one possible reading at random. Transcribers also disagree on word spaces, so what is ykaral to one may be y karal or ykar al to others. And some transcribers will, consciously or unconsciously, base their decisions on what they came to view as "valid" prefixes and suffixes. And then there is an unknown amount of spelling and spacing errors made by the BEEEPers, the Scribe, and the Author himself. All these errors will make the prefixes seem more independent of the suffixes than they would be in correct Voynichese.
4) IIUC, the @dashstofsk tables were computed over all language B pages, mixing text from different sections. Thus the lexicon of his text was quite varied. The pinyin file I provided is (as you may have guessed) the Shennong Bencaojing (SBJ), a collection of 365 "recipes" with a rather limited vocabulary (~630 distinct words) and a specific format. There is a small set of "keywords" (like "wèi" = "flavor", "zhǔ" = "mainly for", "yī míng" = "another name", "shēng" = "provenance", etc.) that occur in practically every recipe. These features will make the distribution of prefix x suffix pairs much "lumpier" than it would be in a more varied text. In other words, even though the pinyin text is fairly long, there is substantial sampling error at the lexicon level. It would be interesting to see @dashstofsk's tables computed over [link unavailable] alone (which, as you know, I am claiming to be a transcription of the SBJ in some unidentified language).

In conclusion, I don't think that @dashstofsk's tables prove that Voynichese is not a natural language. Not at all. One may even argue that they are evidence that it is...

All the best, --stolfi

(14-05-2026, 02:17 PM)Jorge_Stolfi Wrote: In conclusion, I don't think that @dashstofsk's tables prove that Voynichese is not a natural language. Not at all. One may even argue that they are evidence that it is...

You can of course choose how to interpret the evidence. One thing to note, your suggestion of splitting pinyin by any main vowel is at odds with what I intended and mixing many different B together dilutes the signal. Below are six charts: the same pinyin file split by 'i', two samples of Latin, one sample of modern English, the Voynich MS, and the pinyin split by a class of characters, as you suggested. The color coding assigns values of 0.1 and below to red and values of 10.0 and above to green, and interpolates log-linearly in between.

If we only compare the first five charts, Voynich is an absolutely clear outlier among Chinese, Latin and English. If we add the pinyin split that you proposed at the lower right corner, it still looks closer to other languages to me, but this can be argued either way. However, that is not important at all, because for any single central character or sequence B (like k in Voynichese), as in the pinyin example in the upper left corner, the distinction between a natural language and Voynichese is very clear.

[attachment=15557]

(14-05-2026, 02:17 PM)Jorge_Stolfi Wrote: I don't think that @dashstofsk's tables prove that Voynichese is not a natural language

It was never my intention to prove this. Just to provide additional evidence for the text being artificial and constructed.

(14-05-2026, 03:19 PM)dashstofsk Wrote: It was never my intention to prove this. Just to provide additional evidence for the text being artificial and constructed.

However, I do claim this. I think this result is incompatible with any ordinary script and text for any natural language. If anyone finds any character or sequence B that will produce Voynich-like statistical patterns of A-C pairs when splitting the corpus by any fixed boundaries (spaces, any characters, whatever) and then extracting and counting all ABC sequences, I will be very surprised.
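For anyone who wants to try reproducing the tables, here is a minimal Python sketch of the kind of computation being discussed. It is not the code behind the posted charts. It assumes that each word is split at the first occurrence of the core B, and that the expected count of an A-C pair is the product of the prefix and suffix totals divided by the number of split words. The names `split_on_core` and `ratio_table` are made up for illustration.

```python
from collections import Counter

def split_on_core(word, core):
    """Split a word at the first occurrence of core; return (prefix, suffix) or None."""
    idx = word.find(core)
    if idx < 0:
        return None
    return word[:idx], word[idx + len(core):]

def ratio_table(words, core, top=10):
    """Return (prefixes, suffixes, table), with table[(a, c)] = actual / expected."""
    pairs = Counter()
    for w in words:
        parts = split_on_core(w, core)
        if parts is not None:
            pairs[parts] += 1
    total = sum(pairs.values())
    pre, suf = Counter(), Counter()
    for (a, c), n in pairs.items():
        pre[a] += n
        suf[c] += n
    # Keep only the most frequent prefixes and suffixes, most frequent first.
    prefixes = [a for a, _ in pre.most_common(top)]
    suffixes = [c for c, _ in suf.most_common(top)]
    table = {}
    for a in prefixes:
        for c in suffixes:
            # Expected count if prefix and suffix were chosen independently (kept as float).
            expected = pre[a] * suf[c] / total
            table[(a, c)] = pairs[(a, c)] / expected
    return prefixes, suffixes, table
```

A value of 1.0 in a cell means that the pair occurs exactly as often as independence would predict; a value of 0.0 means the pair never occurs in the sample.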

I cannot believe I'm about to say this as The Voynich Ninja's Most Vocal Chinese Theory Detractor™, but I am going to suggest running this again with two changes:

1) Break it on <a>, not <i>. The letter <i> in Pinyin can be a medial as well as a nuclear vowel, so it can appear in any of three central "slots". I wouldn't have expected <i> to act like Voynichese <k>, simply because it's not being used consistently the way the slot alphabet model predicts.
2) Use an encoding with tone numbers (e.g., mei2). This adds a tone slot and decreases the number of choices for the vowel slot by like a factor of 5, the net effect being it acts more like a Voynichese slot alphabet.

I'm not sure if you need a larger corpus, but it might help. (Doesn't someone have that?) At any rate, the finding from breaking across all nuclear vowels has enough in common with Voynichese to warrant putting the best version of the test through, I think.

(14-05-2026, 08:36 PM)rikforto Wrote: Break it on <a>, not <i>. The letter <i> in Pinyin can be a medial as well as a nuclear vowel, so it can appear in any of three central "slots". I wouldn't have expected <i> to act like Voynichese <k>, simply because it's not being used consistently the way the slot alphabet model predicts

Use an encoding with tone numbers (e.g., mei2). This adds a tone slot and decreases the number of choices for the vowel slot by like a factor of 5, the net effect being it acts more like a Voynichese slot alphabet.

Thank you for the suggestions. First of all, three disclaimers:
1) the code to generate the charts was created by Codex, but it looks legit and the results look plausible; I'm fluent in Python
2) I had HSK3 level in Mandarin Chinese some ~20 years ago, so I used to be familiar with some 500+ characters (including the traditional and cursive ones, which I learned for fun) and pinyin, and even though I forgot most of it, I think I generally know what I'm doing here
3) however, I don't treat this very seriously and don't double check everything, so it would be nice if someone reproduced this

I fixed an error Codex made in the first version of the pinyin split by i, where it wouldn't detect ì ǐ í ī as variants of i. As a result, in the first chart I made, only syllables like shuái and qiàn were processed, while píng and shì weren't. The fixed version is now the bottom left chart. Next to it are the pinyin split by a and the pinyin split by a with numeric pinyin.
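One way to avoid that kind of miss is to strip the tone marks before looking for the core letter. The sketch below is only an illustration of the idea, not the fix that was actually applied. It assumes the pinyin uses the four standard tone diacritics, and the function name `strip_tone_marks` is hypothetical.

```python
import unicodedata

# Combining grave, acute, caron and macron: the four pinyin tone marks.
TONE_MARKS = {"\u0300", "\u0301", "\u030c", "\u0304"}

def strip_tone_marks(syllable):
    """Remove pinyin tone diacritics but keep the diaeresis of 'ü'."""
    # NFD separates each accented vowel into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recombines what is left, e.g. 'u' + diaeresis back into 'ü'.
    return unicodedata.normalize("NFC", kept)
```

With this, píng and shì both contain a plain i after stripping, so a split on i sees them as well as shuái and qiàn.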

Note that while the Chinese graphs are whiter and have more cells closer to 1.0, they still show the natural random pattern with missing pairs and very low frequency pairs that is completely absent from the top 6 prefixes/suffixes in Voynichese. If we exclude the artificial cross lines made by ckh and chckh combinations from the Voynichese graph, the first prefixes/suffixes that show any significant deviation from random co-occurrence would have rank 7 or more. This doesn't happen in natural writing. I suspect that the evolution of natural languages would never perfectly balance characters or syllables in common words.

[attachment=15569]

Quote:I cannot believe I'm about to say this as The Voynich Ninja's Most Vocal Chinese Theory Detractor™

The most vocal only because I stopped caring :)

(14-05-2026, 03:03 PM)oshfdk Wrote: One thing to note, your suggestion of splitting pinyin by any main vowel is at odds with what I intended and mixing many different B together dilutes the signal.

Agreed.

However, this comment and the comparison of the top left and bottom right tables show a problem with this approach. Namely, in order to show that LanguageX "does not behave like Voynichese", you would have to compute the table for every possible spelling system for LanguageX, every possible set of prefix-core-suffix splitting rules, and any choice of the core B that gives sufficient data.

To make the table for Mandarin/pinyin slightly less incomparable to that of Voynichese, you could consider mapping the tone diacritic to a digit 1-4 immediately preceding the vowel, define that digit as the core, and use any of the four digits as the target core "B". For example, "xiáng" would become "xi2ang" and would be parsed as «xi-2-ang». Then, if you pick "4" as B, you will still have ~190 word types in that sample file.
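The mapping described above could look roughly like this in Python. This is a sketch under assumptions: the digits 1-4 are assigned to macron, acute, caron and grave in that order (the usual tone numbering), syllables without a tone mark are left unchanged and yield no parse, and the names `tone_to_digit` and `parse_core` are hypothetical.

```python
import unicodedata

# Combining macron, acute, caron, grave -> tones 1 to 4 (assumed numbering).
TONE_DIGIT = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}

def tone_to_digit(syllable):
    """Replace the tone diacritic by a digit placed right before the marked vowel."""
    out = []
    for ch in unicodedata.normalize("NFD", syllable):
        digit = TONE_DIGIT.get(ch)
        if digit is None:
            out.append(ch)
            continue
        # Walk back over other combining marks (e.g. the diaeresis of 'ü')
        # to reach the base vowel, then insert the digit before it.
        i = len(out) - 1
        while i > 0 and unicodedata.combining(out[i]):
            i -= 1
        out.insert(i, digit)
    return unicodedata.normalize("NFC", "".join(out))

def parse_core(syllable):
    """Split a digit-marked syllable into (prefix, tone digit, suffix), or None."""
    for i, ch in enumerate(syllable):
        if ch in "1234":
            return syllable[:i], ch, syllable[i + 1:]
    return None
```

For example, `tone_to_digit("xiáng")` gives "xi2ang", and `parse_core("xi2ang")` gives ("xi", "2", "ang").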

Another problem is that the color scheme maps to red every cell whose combination is absent from the file, even if its expected count is well below 1. The English and Latin tables have lots of red because they have lots of prefixes and suffixes, therefore most prefix-suffix combinations simply have no chance of showing up in the sample text. The Voynichese tables have little red because there are very few prefixes and suffixes, and thus most combinations can show up.

This problem could be alleviated by using a color formula like

(yellow)*alpha*gamma + (blue)*(1-alpha)*gamma + (gray)*(1-gamma)

where alpha is a number in [0,1] that is a function of log(actualCount/expectedCount), as you have now, and gamma is a number that is near 1 when the expected count is large, and near 0 when the expected count is 1 or less. (But be sure to compute the expectedCount as float, not int.)
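A possible Python reading of that formula follows. Several details are assumptions, since the description above leaves them open: alpha rises log-linearly from 0 at a ratio of 0.1 to 1 at a ratio of 10 (the thresholds of the current charts), so overrepresented pairs tend to yellow; gamma is taken as 1 - 1/expectedCount above 1 and 0 otherwise; and the RGB values are arbitrary.

```python
import math

YELLOW = (255, 255, 0)
BLUE = (0, 0, 255)
GRAY = (128, 128, 128)

def alpha_from_ratio(actual, expected, lo=0.1, hi=10.0):
    """Log-linear position of actual/expected between lo (-> 0) and hi (-> 1)."""
    if actual <= 0:
        return 0.0
    t = (math.log(actual / expected) - math.log(lo)) / (math.log(hi) - math.log(lo))
    return min(1.0, max(0.0, t))

def gamma_from_expected(expected):
    """0 when expected <= 1, approaching 1 as expected grows (assumed form)."""
    return 0.0 if expected <= 1.0 else 1.0 - 1.0 / expected

def cell_color(actual, expected):
    """Blend yellow/blue by alpha, then fade toward gray when the expected count is small."""
    a = alpha_from_ratio(actual, expected)
    g = gamma_from_expected(float(expected))  # float, not int
    return tuple(
        round(y * a * g + b * (1 - a) * g + n * (1 - g))
        for y, b, n in zip(YELLOW, BLUE, GRAY)
    )
```

With this choice, a pair that is absent but was only expected 0.5 times comes out plain gray instead of being flagged as strongly underrepresented.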

Quote:If we add the pinyin split that you proposed at the lower right corner, it still looks closer to other languages to me

The visual similarity is mostly due to the fact that, with those splitting rules, there are basically only six suffixes that occur in significant numbers: empty, "n", "ng", "o", "i", "u". For this reason, that table has four columns of almost solid red, and that is the main similarity between that table and those of English and Latin. If we exclude those four columns, the last table looks a lot more similar to Voynichese and less similar to the other languages.

Quote:the distinction between a natural language and Voynichese is very clear.

True if the "a" is understood as "one". But there are still some 50 monosyllabic languages to try, each with a zillion possible spelling systems and p-c-s parsing rules...

All the best, --stolfi

In exercises like these, one always has numerous options, and each combination of options will lead to a different result. In practice, it is tempting to select a single set of options, based on some subjective (or even subconscious) preferences.

What do I mean by that?
Examples for Pinyin transcribed Mandarin.

Option 1: "how to represent tones"
a) one can simply leave out the tone information. (This is of course a very significant loss of information)
b) one can use the most common 'accent' representation, and thereby end up with multiple vowels
c) one can represent the tone by a separate character
c1) this separate character can be written after the vowel
c2) this separate character can be written at the end of the syllable
(c1 and c2 often coincide)

Option 2: "how to distinguish the two different sounds both represented by 'i' "
a) Follow the Pinyin writing. Both sounds are mapped onto the same character
b) Knowing the simple rule (it depends on the preceding consonant), write them as different characters

Option 3: "how to distinguish the two different sounds represented by 'u' and 'ü' "
a) Ignore the difference. Both sounds are mapped onto the same character
b) Follow the Pinyin writing, which leads to incorrect Mandarin
c) Knowing the rule, write them correctly as different characters
(The rule here is a bit more tricky. Some consonants are always followed by 'u', and 'u' is written.
Some consonants are always followed by 'ü' and then also 'u' is written.
Some consonants can be followed by both, and only then 'ü' is written when it sounds like that.)

This already results in 4x2x3 = 24 combinations, leading to different statistics
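As an illustration of Option 3c, the 'u'/'ü' rule can be sketched in a few lines of Python. This assumes toneless, lowercase syllables in standard Pinyin spelling, where 'ü' is written as 'u' directly after j, q, x and y. The function name `restore_u_umlaut` is hypothetical.

```python
def restore_u_umlaut(syllable):
    """Write 'ü' explicitly where standard Pinyin spells it as 'u'."""
    # Only a 'u' directly after j, q, x or y stands for 'ü';
    # in 'jiu' or 'you' the 'u' is a real 'u' and is left alone.
    if len(syllable) >= 2 and syllable[0] in "jqxy" and syllable[1] == "u":
        return syllable[0] + "\u00fc" + syllable[2:]  # "\u00fc" is 'ü'
    return syllable
```

Syllables such as "lu" and "lü" pass through unchanged, because after l and n Pinyin already writes the two sounds differently.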

(14-05-2026, 11:42 PM)ReneZ Wrote: This already results in 4x2x3 = 24 combinations, leading to different statistics

This is true, and that's why I'm willing to consider as many charts as needed. My intuition tells me that no natural language would behave in the way the Voynich Manuscript's k words behave, simply because this is not the way language sequences are constructed. No elements of any language, be it words, syllables or phonemes, would produce a near-perfectly independent distribution of popular prefixes and suffixes for any fixed central sequence. There is no mechanism in any language I know that could cause this even by accident. I have no proof of this and I don't know if it's possible to prove this.

In any language, allowed elements are usually restricted by the preceding elements. For example, you can start a sentence with any letter, but you cannot continue "this day was ver..." with most letters of the alphabet; you are restricted to only a few letters that will make sense. The independence that the Voynich MS charts show demonstrates that in the Voynich MS the most popular suffixes after some central characters can be chosen independently of the prefix. The prefix doesn't appear to affect which suffixes you can use. This looks unlike any language to me.

(14-05-2026, 10:24 PM)Jorge_Stolfi Wrote: Another problem is that the color scheme maps to red every cell whose combination is absent from the file, even if its expected count is well below 1.

Just a quick comment, I probably won't have time for a full reply now. The charts only show the 10 most popular prefixes and suffixes, but we can limit ourselves to the first 5 and still see the effect clearly. There is no way any of the expected counts in the top left 5x5 area on any of the charts are below 1. I can add the expected values to the charts if you wish. Mathematically, any cell that is both to the right of and below another cell has a lower (or equal) expected count. If you see a value above 0 and below 1 in any cell, this means that the cells in the rectangle above and to the left of this cell have expected values above 1 (for act/exp to be above 0 and below 1, the actual count must be non-zero, so it cannot be less than 1, and therefore the minimum possible expected count is the reciprocal of the value in the cell).
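The argument can be checked with a small Python sketch. It assumes that expected counts are computed as row total times column total divided by the grand total, with prefixes and suffixes sorted by decreasing frequency. The marginal totals below are invented numbers for illustration, not values from any of the charts.

```python
def expected_counts(row_totals, col_totals, total):
    """Expected count of each cell under independence (floats)."""
    return [[r * c / total for c in col_totals] for r in row_totals]

def min_expected_from_ratio(ratio):
    """Lower bound on the expected count of a cell showing 0 < actual/expected < 1."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must be strictly between 0 and 1")
    # actual >= 1 and actual / expected == ratio, hence expected >= 1 / ratio.
    return 1.0 / ratio

# Hypothetical marginals, sorted from most to least frequent.
rows = [50, 30, 20]
cols = [60, 25, 15]
table = expected_counts(rows, cols, 100)

# Expected counts never increase when moving right or down.
monotone = all(
    table[i][j] >= table[i][j + 1] and table[i][j] >= table[i + 1][j]
    for i in range(len(rows) - 1)
    for j in range(len(cols) - 1)
)
```

So a cell showing, say, 0.25 must have an expected count of at least 4, and every cell above and to the left of it has an expected count at least as large.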
