The Voynich Ninja - An Artificial Construction

Pages: 1 2 3 4 5 6 7

(14-05-2026, 11:42 PM)ReneZ Wrote: Option 2: "how to distinguish the two different sounds both represented by 'i' "

By the way, in pinyin the "consonants" y and w are actually silent. Logically, "yí" and "wú" should be written just "í" and "ú".

AFAIK those dummy consonants are hacks (maybe used already by Matteo Ricci?) to remove ambiguity when compound terms are written together without tone marks. Thus "mai" is one syllable, while "mayi" is a compound with two syllables, "ma" and "yi" (which is basically "i"). Apparently this ambiguity does not arise with other vowels.

When I found that the histogram of word type lengths for Vietnamese had a nice symmetrical distribution, I checked that of Mandarin in pinyin, and was disappointed because it was rather lopsided. It turned out that the culprit was those two dummy consonants. Erasing the w and y from all words made that histogram symmetric too.

Which shows how a small spelling quirk can ruin language statistics and confuse the statisticians...
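The y/w check described above can be sketched in a few lines. This is only an illustration, not the script actually used in the post: the function name `type_length_histogram`, the `drop` parameter and the tiny sample list are all assumptions, and tone marks are ignored.

```python
from collections import Counter

def type_length_histogram(words, drop=""):
    """Histogram of lengths over distinct word types.

    Characters listed in `drop` are deleted from every word first
    (hypothetical parameter, e.g. drop="yw" for the pinyin dummy consonants).
    """
    table = str.maketrans("", "", drop)
    types = {w.translate(table) for w in words}
    types.discard("")  # a word made only of dropped characters vanishes
    return Counter(len(t) for t in types)

# Toy sample, not real corpus data.
sample = ["yi", "wu", "ma", "mai", "yang", "wang", "shi", "ma"]
with_dummies = type_length_histogram(sample)
without_dummies = type_length_histogram(sample, drop="yw")
```

Note that deleting letters can merge word types (here "yang" and "wang" both become "ang"), so the number of types may shrink as well as their lengths.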

All the best, --stolfi

(15-05-2026, 12:04 AM)oshfdk Wrote: This is true, that's why I'm willing to consider as many charts as needed. My intuition tells me that no natural language would behave in the way the Voynich Manuscript's k words behave, simply because this is not the way language sequences are constructed. No elements of any language, be it words, syllables or phonemes, would produce a near perfect independently uniform distribution of popular prefixes and suffixes for any fixed central sequence. There is no mechanism in any language I know that could cause this even by accident. I have no proof of this and I don't know if it's possible to prove this.

But what if these are numbers?
These would get very uniform.

(15-05-2026, 12:27 AM)ReneZ Wrote: But what if these are numbers?
These would get very uniform.

Yes, I think lists of numbers may work like Voynichese in this test.

Edit: However, I'm not sure about a simple nomenclator style cipher. It's relatively easy to test, just make a similar chart but with words as basic elements. In a nomenclator the selection of words that follow should depend on the words that precede them. I did run similar tests a few years ago, and I think Voynichese behaved rather uniformly at different scales. But worth retesting of course.

(15-05-2026, 12:04 AM)oshfdk Wrote: My intuition tells me that no natural language would behave in the way the Voynich Manuscript's k words behave, simply because this is not the way language sequences are constructed. No elements of any language, be it words, syllables or phonemes, would produce a near perfect independently uniform distribution of popular prefixes and suffixes for any fixed central sequence.

There are two separate properties here. For each core,
P1. The word types with that core are all possible combinations of prefixes and suffixes.
P2. The prefixes and suffixes are chosen independently when composing the text.
Property 1 could be true, for example, if the text is encrypted with a codebook cipher. In fact, that could also make the distribution of word type lengths nicely fit a binomial function choose(n,k), like that of the VMS (and of Vietnamese...).

Property 1 does not hold for European languages in the plain (or with simple substitution ciphers), for sure. But monosyllabic languages come closer to satisfying it. Typically, the set of potential syllables (those that are allowed by phonetic constraints) is only a few thousand. On the other hand, the size of any language's vocabulary -- the number of terms whose meaning cannot be deduced from the grammar, but must be explicitly listed in a dictionary -- seems to be sort of like a "linguistic universal": several tens of thousands.

Monosyllabic languages generally cope with that mismatch by (1) having many homophones - words like English "to", "two", and "too", with very different meanings but same sounds, that can be distinguished by context; (2) massive use of compounds - combinations of two or more syllables with specific meanings that are only vaguely related to the meanings of the parts, like English "typewriter", and (3) using a large percentage of the potential syllables.

Joining all possible prefixes (initial consonants), cores (initial glides and main vowels), and suffixes (final glides and final consonants) of Mandarin would give [link text unavailable]. However, each core has a limited subset of compatible prefixes and suffixes. Taking these restrictions into account reduces the number of potential syllables to fewer than 2900 (not sure about the exact number). Of those, about 1300 (more than 44%) are actually used in Mandarin. That is, Property 1 is much closer to being satisfied by the Mandarin words with a given core than it is by all English words that contain "t".

Property 2 is normally not seen in natural languages, because meanings are assigned to individual prefix-core-suffix combinations; any text will use only a "random" subset of the possible combinations, and the frequencies of those combinations will be "random" too. Even the monosyllabic languages that come close to satisfying Property 1 will usually fail Property 2. That Shennong Bencao file that I posted uses only ~630 of the ~1300 meaningful Mandarin syllables, and surely it does not satisfy either property.

For the same reason, Property 2 is also not expected in a natural language text encoded with a codebook cipher, not even one that satisfies Property 1.

But, in spite of what one may think glancing at the colored tables, Voynichese does not satisfy Property 2 either. The frequencies of word types with a given core are not simply the prefix frequencies times the suffix frequencies. The deviations from independence are smaller than what we see in the SBJ file, but they exist and are significant. For example, in the Starred Parags section, I count

  56 otedy    56 oteedy
   2 ytedy    12 yteedy

  47 okal     44 okar
   0 ykal      6 ykar
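One way to check counts like these against Property 2 is to compare each 2x2 table with the counts expected if prefix and suffix were chosen independently (row total times column total divided by grand total). The sketch below is an assumption about how such a check could be done, not the method used in the post; the function names are illustrative.

```python
def expected_counts(table):
    """Counts expected under independence of rows (prefixes) and columns (suffixes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Rows: prefix o-, prefix y-. Columns: the two suffix variants quoted above.
ted_table = [[56, 56], [2, 12]]   # otedy, oteedy / ytedy, yteedy
kal_table = [[47, 44], [0, 6]]    # okal, okar / ykal, ykar

ted_stat = chi_square(ted_table)
kal_stat = chi_square(kal_table)
```

For a 2x2 table the statistic has one degree of freedom, so it can be compared with 3.84, the usual 5% critical value. Some expected counts here are small, so an exact test would be a safer choice for a real analysis.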

Quote:in the Voynich MS most popular suffixes after some central characters can be chosen independently of the prefix. The prefix doesn't appear to affect which suffixes you can use.

So this claim seems to be false.

Granted, Voynichese seems to be somewhat closer to satisfying Property 2 than Mandarin is. But there are all those possible explanations that I listed for why this may be happening.

All the best,

Thank you for running that again oshfdk!

I'm surprised my tone-number conjecture did worse, but that's why we look at data. So, the logic of the "slots" alone does not drive the similarities we're seeing in the <a> chart and to a lesser extent the nuclear vowel chart, which is good to know. After reflection, I suppose it makes sense. Sound change famously causes uneven distributions in the phonotactics of a language, so that roughness is an expected feature of tracking more distinctions. You would need a representation that evened that out, which I believe phonological representations are very unlikely to do. I would predict this roughness would be present for any Romanized Mainland Southeast Asian language because that roughness is a product of sound change which affects all languages.

I cannot slam the door closed with this analysis, and I invite a counterexample, but I do take this as further grounds to argue Voynichese is not a phonological representation of an underlying language. The emphasis of my conclusion here is on "phonological representation", but that comes part and parcel with the hypothesis that there is no underlying language, and that's a fair interpretation as well.

I think I found a way to make the result more universal by removing all parameters. This was inspired by Mauro's Nbit grammars, Rene's comment about many possible combinations and BPE encoding used to tokenize texts for machine learning tasks.

The idea is to perform a deterministic transformation to identify the most frequent tokens (independent of the writing system or any delimiters) of each text and then compare the pair statistics for the top 15 most frequent tokens. So, there are no central characters and no arbitrary splitting algorithm; there is nothing to fine-tune.

This works as follows: initially the text is treated as a sequence of Unicode characters (one character is one token).

At each step the algorithm identifies the most frequent pair of adjacent tokens (initially single characters), assigns a new token value to this pair, replaces the pair everywhere in the text with the new token, and then computes the minimum description length (MDL) of the new text (this part was suggested by Gemini; I asked whether something similar to Mauro's Nbit metric could be used here). If the new MDL is smaller than the previous MDL, the new token is accepted and the whole step is repeated. If not, the update is reverted and the algorithm proceeds to the next most frequent pair, checking whether replacing it with a new token would reduce the MDL. If no such pair is found at all, the algorithm stops, having identified the single best possible pair encoding of this text according to this algorithm.
In other words, the algorithm tries to compress the text by replacing pairs of tokens with new tokens until it reaches the point where there is no way to further reduce the total size of the representation (token sequence + token vocabulary), coming to a single most compact BPE representation of the text.
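A minimal sketch of this loop is given below. It is not the code behind the charts: the cost function in `description_length` (entropy-coded token sequence plus 8 bits per vocabulary character and one separator per entry) is an assumed stand-in for the MDL formula, which the post does not spell out, and all function names are illustrative.

```python
import math
from collections import Counter

def description_length(tokens):
    """Assumed MDL in bits: entropy-coded sequence + 8 bits per vocabulary character."""
    counts = Counter(tokens)
    n = len(tokens)
    sequence_bits = -sum(c * math.log2(c / n) for c in counts.values())
    vocab_bits = 8 * sum(len(t) + 1 for t in counts)  # +1 for a separator
    return sequence_bits + vocab_bits

def merge_pair(tokens, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def mdl_bpe(text):
    """Greedy pair merging; a merge is kept only if it lowers the description length."""
    tokens = list(text)  # one Unicode character per token
    best = description_length(tokens)
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        for pair, _ in pairs.most_common():  # most frequent pair first
            candidate = merge_pair(tokens, pair)
            cost = description_length(candidate)
            if cost < best:
                tokens, best = candidate, cost
                break  # accepted: recount pairs and repeat the step
        else:
            return tokens  # no pair reduces the cost, so stop

result = mdl_bpe("ab" * 160)
```

Each accepted merge shortens the token sequence, so the loop always terminates. Trying every pair on every step is slow for large texts; a real run would likely need incremental pair counts.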

This lets us find a comparable text representation regardless of the language and the writing system. One strong piece of evidence that it works correctly: when this algorithm is run on the pinyin and the Chinese-character versions of bencao, the resulting most frequent tokens largely overlap. The Chinese sequences up to three characters long and the corresponding pinyin sequences both made it into the top 15.

After completing the unique tokenization we get the top 15 tokens and compute the expected and actual counts of their combinations, that is, for tokens 'da' and 'in' we would count the actual number of 'dain' and the expected number given the counts of 'da' and 'in' tokens in the text. Then we produce the same actual/expected charts as before.
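The actual/expected comparison can be sketched as follows. The exact formula for the expected count is not given in the post; the version here (number of adjacent positions times the product of the two token frequencies) is an assumption, as is the name `pair_table`.

```python
from collections import Counter

def pair_table(tokens, top=15):
    """Map each (a, b) pair of the `top` most frequent tokens to (actual, expected).

    Expected count assumes tokens are drawn independently:
    (n - 1) adjacent positions * p(a) * p(b).
    """
    counts = Counter(tokens)
    n = len(tokens)
    frequent = [t for t, _ in counts.most_common(top)]
    actual = Counter(zip(tokens, tokens[1:]))
    table = {}
    for a in frequent:
        for b in frequent:
            expected = (n - 1) * counts[a] * counts[b] / (n * n)
            table[(a, b)] = (actual[(a, b)], expected)
    return table

# Toy token sequence, not real data.
demo = pair_table(["da", "in", "da", "in", "da", "in"])
```

A ratio of actual to expected near 1 for most cells is what "independent selection of tokens" would look like in such a table.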

The results shown are for English, Latin, Arabic, Chinese characters, Chinese pinyin and Voynichese, plus three special charts at the bottom, explained below. The texts are of different sizes, between 20 and 800 kilobytes in UTF-8 form; the algorithm is not very sensitive to the size of the text. Voynichese is a clear outlier, with many more token pairs that appear close to the number of times they would appear if the tokens were selected independently of one another. Among the texts, Arabic looks the most similar to Voynichese, but is still very far from it.

For clarity the spaces are represented as mid height dots in the tables (to make them visible) and I also replaced Voynichese spaces (.,) with the same symbol.
We can try other languages and texts, but I suspect that none of them would reach the level of uniformity of Voynichese.

Given that a possible explanation for various peculiarities of the text is a lot of errors made in preparing it, I also computed the chart for "mangled English", where each character was replaced with 15% probability by a similar looking or similar sounding letter, producing a result like this: "soil for the supdort of the planf. The raot, therefare, fulfils a". The resulting chart looks somewhere between English and Voynichese. However, I'm not sure we can treat this result as valid, since what happened here is a nearly perfect machine randomization using a modern statistically stable algorithm, something unlikely to happen when creating the actual MS. I don't think actual mistakes made by humans would be this random; there would still be many more patterns. In any case, reaching the same state as the chart of Voynichese requires mangling 30% of the characters, which makes the English text completely unreadable. If we are talking about this number of mistakes, the text is essentially lost.
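A mangling step of this kind might look like the sketch below. The confusion map `SIMILAR` is purely hypothetical (only p/d, t/f and o/a are suggested by the sample sentence above), and unlike the description in the post, characters without an entry in the map are never altered, so the effective error rate is lower than `rate`.

```python
import random

# Hypothetical confusion map: each letter maps to its possible replacements.
SIMILAR = {
    "p": "d", "d": "p",
    "t": "f", "f": "t",
    "o": "a", "a": "o",
    "n": "m", "m": "n",
}

def mangle(text, rate=0.15, seed=0):
    """Replace each mappable character with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    out = []
    for ch in text:
        if ch in SIMILAR and rng.random() < rate:
            out.append(rng.choice(SIMILAR[ch]))
        else:
            out.append(ch)
    return "".join(out)

mangled = mangle("soil for the support of the plant. The root, therefore, fulfils a")
```

Raising `rate` to 0.30 would correspond to the heavier mangling level mentioned above.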

To me all this is a very strong indication that Voynichese is not a straightforward representation of any natural language, be it phonetic or logographic, faithful or with some errors (but still readable).

I also tried a nomenclator: assigning to each distinct word in an English text a unique decimal or Roman numeral. While Roman numerals look more like ordinary language, decimal numerals absolutely overshoot the mark and look even more uniform than Voynichese, something I think Rene hinted at previously in this thread.
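The nomenclator experiment can be sketched as below. How the numbers were assigned is not stated in the post; numbering words in order of first appearance is an assumption, and `nomenclator` and `to_roman` are illustrative names.

```python
def to_roman(n):
    """Standard Roman numeral for a positive integer (greedy subtraction)."""
    symbols = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in symbols:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def nomenclator(words, numeral=str):
    """Replace each distinct word by a numeral, numbered by first appearance."""
    codes = {}
    encoded = []
    for word in words:
        if word not in codes:
            codes[word] = numeral(len(codes) + 1)
        encoded.append(codes[word])
    return encoded

words = "the cat saw the dog".split()
decimal_text = " ".join(nomenclator(words))
roman_text = " ".join(nomenclator(words, numeral=to_roman))
```

The encoded text can then be fed to the same tokenization and pair-count steps as the other texts.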

(One note about reading the charts, because this looked confusing to me at first: the chart for Voynichese lists the d + aiin actual count as 0. This is not a mistake, because d+aiin in the best compressed BPE is encoded as a separate token daiin and not as a combination of d + aiin.)

[attachment=15592]

I don't mean to be a pain, but that doesn't solve the problem of word boundaries.... I think this applies to Chinese characters and, of course, VMS, because the spaces here are most likely not normal spaces, at least not all of them... So what exactly are you comparing here? That's the interesting question...

(16-05-2026, 05:21 AM)JoJo_Jost Wrote: and, of course, VMS, because the spaces here are most likely not normal spaces, at least not all of them...

Just to avoid a misunderstanding....
While I clearly expressed that I have some doubt that the apparent spaces in the MS delimit units that are equivalent with words in a plain text, I am also not aware of any evidence that they are not.
This is just a hypothesis.
Or, rather than a hypothesis, I want to point out that this is an assumption that appears completely reasonable, but is not a given.

The fact that certain character shapes preferably appear adjacent to these apparent spaces could be seen as evidence for either option. (I don't think it is valid evidence either way).

(15-05-2026, 12:24 AM)Jorge_Stolfi Wrote: By the way, in pinyin the "consonants" y and w are actually silent. Logically, "yí" and "wú" should be written just "í" and "ú".

Both can be heard. In fact, in a post from many years ago, I (tongue in cheek) suggested that the only difference between Voynich words starting with o vs. qo is that these represent the arbitrary choice between initial (Mandarin) "i" and "yi".

(16-05-2026, 05:39 AM)ReneZ Wrote: Just to avoid a misunderstanding....
While I clearly expressed that I have some doubt that the apparent spaces in the MS delimit units that are equivalent with words in a plain text, I am also not aware of any evidence that they are not.
This is just a hypothesis.
Or, rather than a hypothesis, I want to point out that this is an assumption that appears completely reasonable, but is not a given.

The fact that certain character shapes preferably appear adjacent to these apparent spaces could be seen as evidence for either option. (I don't think it is valid evidence either way).

I definitely agree with you, but that’s not the point—we’re still comparing apples to oranges here.

If a text has only 7–8 rules to define 90 percent of the spaces (the rest are likely artifacts), then you CANNOT compare this highly rule-based system to a normal language, in which there are significantly more rules (well over 50 for MHD and Latin, depending on the text type). That in itself makes no sense. It doesn’t matter whether they are actually word boundaries or not.

I think that’s the problem: it logically weakens the validity of any statistics based on word boundaries in the VMS.

Or am I wrong?
