The Voynich Ninja - An Artificial Construction


(16-05-2026, 05:22 PM)oshfdk Wrote: Here is, side by side, the computation run on an English text and on Voynichese with spaces and with all spaces removed before processing. As you can see, for the most frequent 5 token combinations there is not much difference. English shows a lot of unbalanced prefix-suffix pairs (mostly just missing from the text), Voynichese only shows a few.

The difference between Voynichese and other languages seems to be a matter of degree, not a fundamental one.

The visual comparison is misleading because the color scale is such that the only info we see is whether a cell is zero (dark red) or nonzero (some light pastel color). Thus a *single* occurrence of daiin chedy is enough to make it seem that "the occurrence of chedy is independent of whether the previous word was daiin or not". But in fact the "almost white" Voynichese cells have ratios higher than 2 or as low as 0.5.
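To make the "ratio" in those cells concrete, here is a minimal sketch of one way to compute it. It is not necessarily the computation behind the charts: it assumes plain whitespace-separated words rather than whatever tokens were actually used, the name `pair_ratios` is made up, and the sample string is invented for illustration, not taken from a transcription.

```python
from collections import Counter

def pair_ratios(words):
    """Return {(a, b): observed / expected} for each adjacent pair seen in `words`.

    expected = (times a is the first word of a pair)
             * (times b is the second word of a pair) / (number of pairs),
    i.e. the count we would expect if the two positions were independent.
    Pairs that never occur are simply absent (their ratio would be 0).
    """
    pairs = list(zip(words, words[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    return {
        (a, b): count * n / (first_counts[a] * second_counts[b])
        for (a, b), count in pair_counts.items()
    }

# Invented sample, for illustration only.
sample = "daiin chedy daiin shedy qokeedy chedy daiin chedy".split()
ratios = pair_ratios(sample)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

A ratio near 1 means the pair occurs about as often as independence predicts; a cell that is merely "nonzero" can still be far from 1 in either direction.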

In my previous post I gave several possible explanations for why Voynichese would seem to be more "Cartesian" than other languages, while still being a natural language. They included the unknown incidence of spelling errors in the VMS text. You did a test with artificially misspelled English, and it did make English look more Cartesian, correct? Perhaps the amount of errors in Voynichese is more than what you assumed?

Another variant of that "spelling errors" explanation is that the Author's spelling of the language could be mapping many different words to the same string of glyphs. I am still unable to distinguish spoken English "and" from "end", "man" from "men", etc.; if I were to take a dictation of a string of unfamiliar words, I would write both vowels as "e". If the Author was an Englishman or German writing French, he might have omitted all diacritics as being "just another case of French silly nonsense". If he was an Arab merchant writing Finnish, he might have omitted all short vowels, because surely anyone who speaks Finnish can guess them from the context, no? And so on...

And if the Author was taking dictation, he probably would use some form of shorthand, not the language's standard spelling. And shorthand systems typically will use the same sign to represent multiple similar sounds, like "p" and "b", "è" and "à", etc.

Either way, if Voynichese "daiin" and "chedy" each represent several different words of the language, the pair "daiin chedy" may be common because "bright star" is common, while "daiin shedy" may be common because "water pipe" is common -- even though the document never uses "bright pipe".
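A toy sketch of that effect, under loud assumptions: the "scribe" here writes every "a" as "e" (a hypothetical rule, in the spirit of the "man"/"men" example), and the sample text is invented. In the spoken text, number agreement means "man walk" never occurs; in the written text, the merged word "men" is followed by both verb forms.

```python
from collections import Counter

# Hypothetical lossy spelling rule: every "a" is written as "e".
MERGE = str.maketrans({"a": "e"})

def written(word):
    """Spell a word the way the hypothetical scribe would."""
    return word.translate(MERGE)

def pair_counts(words):
    """Count adjacent word pairs."""
    return Counter(zip(words, words[1:]))

# Invented sample in which "man" always takes "walks" and "men" always takes "walk".
spoken = "man walks men walk man walks men walk".split()
scribed = [written(w) for w in spoken]

spoken_pairs = pair_counts(spoken)
scribed_pairs = pair_counts(scribed)

print(spoken_pairs[("man", "walks")], spoken_pairs[("man", "walk")])
print(scribed_pairs[("men", "welks")], scribed_pairs[("men", "welk")])
```

The dependency between subject and verb form is still in the language; it is only the many-to-one spelling that hides it from a pair-count table.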

And yet another possible explanation is that the VMS may be a terse style of prose which omits most function words and violates syntax for the sake of brevity. That is, instead of

"Goblin's Carrot is a tall bush that grows on the mountains. Its bark, made into a tea, will cure baldness and increase one's chance of picking up girls at the tavern"

it may say

"Goblin carrot: tall bush; mountain. Bark tea: bald head, girls"

except without any punctuation, of course. Then one will see many pairs like "bush mountain" and "tea bald" that would not occur in a normal grammatical text.

And this possibility brings up another point: when comparing Voynichese with other languages, one must use texts that are hopefully in the same style. That is, the Herbal section should be compared to herbals, the Starred Parags section should be compared to a book of recipes (or whatever one guesses that it may be), etc. And even then there is a huge range of styles...

All the best, --stolfi

(16-05-2026, 10:12 PM)Jorge_Stolfi Wrote: The difference between Voynichese and other languages seems to be a matter of degree, not a fundamental one.

I'm not sure what a fundamental distinction would look like in a quantitative analysis. Voynichese so far is a clear outlier among all faithful representations of plaintext languages. It is actually at the opposite end of the spectrum from the Chinese texts you provided (both pinyin and characters) when using this scale. So this is the fundamental distinction: being the top one. When someone presents a sample of any historical plaintext (historical to make sure it's not specifically designed to subvert this test, which maybe is possible using modern computational techniques) that surpasses Voynichese in this metric of maximum independence of token pairs, this will become just a question of degree.

(16-05-2026, 10:12 PM)Jorge_Stolfi Wrote: You did a test with artificially misspelled English, and it did make English look more Cartesian, correct? Perhaps the amount of errors in Voynichese is more than what you assumed?

Yes, this would work. The only caveat is that the text would probably become unrecoverable, because it appears that an error rate of more than ~30% is needed to replicate the statistics of Voynichese. And surely if it has this kind of error rate, any attempt at matching it with existing plaintexts is futile. Also I think that systematic mistakes, like misspelling 'man' as 'men', won't affect the results that much; some actual randomness is required.
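For anyone who wants to try this, a minimal sketch of random error injection. The assumptions are mine, not a description of the test that was actually run: the rate is applied per word, each error replaces one letter with a different random lowercase letter, and `inject_errors` is a made-up name.

```python
import random
import string

def inject_errors(words, rate, seed=0):
    """Return a copy of `words` where each word is misspelled with probability `rate`.

    A misspelling replaces one randomly chosen character with a different
    random lowercase letter. Per-word rate and single-letter substitution
    are assumptions of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    noisy = []
    for word in words:
        if word and rng.random() < rate:
            i = rng.randrange(len(word))
            choices = [c for c in string.ascii_lowercase if c != word[i]]
            word = word[:i] + rng.choice(choices) + word[i + 1:]
        noisy.append(word)
    return noisy

text = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(inject_errors(text, rate=0.3)))
```

A systematic mistake would instead be a fixed substitution applied everywhere, which only relabels words and leaves the pair structure intact; that is why the randomness matters here.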

(16-05-2026, 10:12 PM)Jorge_Stolfi Wrote: And yet another possible explanation is that the VMS may be a terse style of prose which omits most function words and violates syntax for the sake of brevity. That is, instead of

"Goblin's Carrot is a tall bush that grows on the mountains. Its bark, made into a tea, will cure baldness and increase one's chance of picking up girls at the tavern"

it may say

"Goblin carrot: tall bush; mountain. Bark tea: bald head, girls"

except without any punctuation, of course. Then one will see many pairs like "bush mountain" and "tea bald" that would not occur in a normal grammatical text.

As far as I can see, this specific change would rather skew the results in the opposite direction. To make the results Voynichese-like, both "tall bush" and "bush tall" should appear roughly the same number of times, since their expected co-occurrence is the same. This terse technical style would probably make one of them greatly underrepresented. Edit: on the other hand, it's actually hard to predict how certain styles would affect the outcome, since very predictable patterns would just get merged into a single token. I'd say it's not very clear what the result of this style would be. Isn't it roughly the style of SJB? Then it's very different from Voynichese.
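A quick way to check the order asymmetry on any sample, as a sketch only: it works on whitespace-separated words (an assumption; the charts use tokens), `order_counts` is a made-up helper, and the sample is the terse example from the quote with "tall bush" repeated once for illustration.

```python
from collections import Counter

def order_counts(words, a, b):
    """Return (count of 'a b', count of 'b a') among adjacent word pairs."""
    pairs = Counter(zip(words, words[1:]))
    return pairs[(a, b)], pairs[(b, a)]

# Terse sample from the quoted example, punctuation removed, extended for illustration.
terse = "goblin carrot tall bush mountain bark tea bald head girls tall bush".split()
forward, backward = order_counts(terse, "tall", "bush")
print(forward, backward)
```

If one order dominates like this across many pairs, the text is far from the balanced picture that Voynichese shows.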

(16-05-2026, 10:12 PM)Jorge_Stolfi Wrote: And this possibility brings up another point: when comparing Voynichese with other languages, one must use texts that are hopefully in the same style. That is, the Herbal section should be compared to herbals, the Starred Parags section should be compared to a book of recipes (or whatever one guesses that it may be), etc. And even then there is a huge range of styles...

My claim is that a faithful representation of any natural language in any style won't reach the levels of Voynichese. Repetitive texts like recipes or encyclopedias, especially, are most likely a step in the wrong direction. If anything, maybe some fancy experimental prose might close the gap, like authors who start each word in a sentence with a different letter, or similar experiments that make the text much less predictable. Also maybe simple word lists, like a shopping list or a list of names, might work. Or not, because if the text gets too unpredictable, this will first and foremost lead to shorter tokens, and intraword letter combinations would begin to dominate the chart.
