Putting a large amount of null text into a cipher actually happened in the past.
See, for example, the Copiale cipher.
It is not a 1-for-1 substitution but rather a homophonic cipher. Each ciphertext character stands for a particular plaintext character, but several ciphertext characters may encode the same plaintext character. For example, all the unaccented Roman characters encode a space.
If you check, these nulls make up about 50% of the text.
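To make the mechanism concrete, here is a minimal Python sketch of a homophonic substitution with nulls. The homophone table is invented purely for illustration and is not the actual Copiale mapping; the point is only that the space gets a whole pool of interchangeable null symbols, so decoding just inverts the table and the nulls collapse back to spaces.

Code:
import random

# Invented homophone table (not the real Copiale mapping): each plaintext
# letter has one or more ciphertext symbols, and the space is encoded by a
# whole pool of interchangeable nulls, as in the Copiale cipher.
HOMOPHONES = {
    "a": ["1", "7"],
    "b": ["2"],
    "c": ["3", "8"],
    " ": list("mnopqrstuv"),  # ten null symbols, all meaning "space"
}

def encode(plaintext):
    # Pick one homophone at random for each plaintext character.
    return "".join(random.choice(HOMOPHONES[ch]) for ch in plaintext)

print(encode("abc cab"))  # e.g. "123q312"
print(encode("abc cab"))  # very likely different, e.g. "728t872"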
There is also the Ave Maria cipher, which may not have nulls but is very "bloated": single letters are encoded as whole words. You could of course add some null words to make the result more grammatical.
For example, the word “Hello” is encoded as
Optimus dominus immortalis immortalis imperator
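A toy version of that scheme in Python (the table below is an invented stand-in for Trithemius's real word lists; only the entries needed for "Hello" match the example above):

Code:
# Toy Ave Maria-style cipher: each plaintext letter becomes a whole word,
# so the ciphertext reads like a pious Latin text. The table is an invented
# stand-in for Trithemius's real lists.
TABLE = {
    "h": "Optimus",
    "e": "dominus",
    "l": "immortalis",
    "o": "imperator",
}

def encode(word):
    return " ".join(TABLE[ch] for ch in word.lower())

print(encode("Hello"))
# Optimus dominus immortalis immortalis imperator

Adding nulls to this scheme would then just mean reserving some words that the decoder knows to skip.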
If the text of the Voynich can indeed be split in two, into null/filler text and non-null/real text, then how does one do that? It seems to me that the repetitive text is going to constitute the null/filler text. It remains to work out how to partition the text. This seems easy at the extremes: clearly identifying the most repetitive text as filler text and the most abnormal text as real text. However, at the boundary between the two types of text it seems more difficult to determine which is which. But maybe it is only necessary to identify some of the real text, even if identifying all of it is difficult. This would mean focusing only on the most abnormal words.

If the real text were encoded using a simple substitution cipher, then applying frequency analysis to a sufficiently large quantity of real text would be enough to decipher it. If the real text used a more complex substitution cipher, then deciphering it could be much harder.

It would still remain to determine the algorithm used to generate the filler text. It also remains to determine whether the filler text is spread relatively evenly throughout the pages of the manuscript or not. How would someone writing the text know when to include real words and when to have filler words, or was it arbitrary? And how could someone reading back the text spot which words were filler and which were real?
It could be something simple, like: every word whose second letter is a gallows character is a filler word.
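Such a rule is trivial to state in code. A minimal sketch, assuming an EVA transliteration where the four simple gallows are k, t, p and f; the rule itself is only an illustration, not a claim about the actual scheme:

Code:
# Toy word-level partition of an EVA transliteration: treat every word
# whose second glyph is a simple gallows (k, t, p, f in EVA) as filler,
# everything else as real. Purely illustrative.
GALLOWS = set("ktpf")

def is_filler(word):
    return len(word) >= 2 and word[1] in GALLOWS

words = "daiin okeedy qokeey chedy otaiin shey".split()
print("filler:", [w for w in words if is_filler(w)])      # okeedy, otaiin
print("real:  ", [w for w in words if not is_filler(w)])  # the rest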
It becomes more complicated if the division between null/filler and real cuts across words, so that it operates at the glyph level rather than the word level. I am inclined to think, and hope, that this is not the case.
While it is possible in principle, I suspect an encoding like this would leave traces in various character combination statistics, allowing for a clear separation of character combinations into two classes. I'm not aware of any computational results that would support this. I don't think there is a very clear boundary between statistically ordinary and unusual words, for example. When you look at character combination statistics, there appears to be a very smooth continuum. I would expect some boundary if part of the text was the actual encoding and another large part was just nulls.
Also, producing a large number of statistically plausible nulls is not an easy task in itself. Voynichese shows low string repetition counts for multiword strings compared to natural text. I prefer explaining this with the use of a one-to-many cipher. (With a simple one-to-many substitution cipher it would be enough to have 2 options per character to reduce repetitions dramatically, since for a string of 20 characters to repeat perfectly one would have to make the same choice 20 times in a row.)
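The arithmetic of that parenthetical is easy to check with a toy sketch (two homophones per character are simulated here by nothing more than letter case, which is enough for counting repetitions):

Code:
import random

# Toy one-to-many substitution: each character has exactly two homophones
# (lower and upper case here), chosen at random on every encoding.
def encode(plaintext):
    return "".join(random.choice([ch, ch.upper()]) for ch in plaintext)

s = "abcdefghijklmnopqrst"   # a 20-character string
print((1 / 2) ** 20)         # chance two encodings match: ~9.5e-07
print(encode(s))             # e.g. aBcDEfghIjKlmnOpqRst
print(encode(s))             # almost certainly a different string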
If a large portion of the manuscript was gibberish created by randomly stringing filler words, I would expect a lot of coincidental repetitions. It's hard to produce pages and pages of gibberish without repeating oneself.
(22-06-2025, 05:50 PM)oshfdk Wrote: If a large portion of the manuscript was gibberish created by randomly stringing filler words, I would expect a lot of coincidental repetitions. It's hard to produce pages and pages of gibberish without repeating oneself.
I would expect that each page would contain some filler text and some real text with the larger proportion being filler. So, I doubt there would be any pages with just filler text and no real text. Likewise, I doubt there would be any pages with just real text and no filler text.
So, I suspect the gibberish would be interspersed with real words. The question of how the gibberish text was produced is akin to the question that people who argue the manuscript is a hoax, and therefore all gibberish, have to address.
(22-06-2025, 05:50 PM)oshfdk Wrote: While it is possible in principle, I suspect an encoding like this would leave traces in various character combination statistics, allowing for a clear separation of character combinations into two classes. I'm not aware of any computational results that would support this. I don't think there is a very clear boundary between statistically ordinary and unusual words, for example. When you look at character combination statistics, there appears to be a very smooth continuum. I would expect some boundary if part of the text was the actual encoding and another large part was just nulls.
If you can provide statistical evidence for this, that would be interesting.
(22-06-2025, 06:05 PM)Mark Knowles Wrote: I would expect that each page would contain some filler text and some real text with the larger proportion being filler. So, I doubt there would be any pages with just filler text and no real text. Likewise, I doubt there would be any pages with just real text and no filler text.
So, I suspect the gibberish would be interspersed with real words.
I don't think this changes much as long as there is more gibberish than real words. Gibberish generated without proper randomization would probably drive repetitions up.
(22-06-2025, 06:05 PM)Mark Knowles Wrote: The question of how the gibberish text was produced is akin to the question that people who argue the manuscript is a hoax, and therefore all gibberish, have to address.
Agree.
(22-06-2025, 06:05 PM)Mark Knowles Wrote: If you can provide statistical evidence for this, that would be interesting.
Ok, suppose some of the character sequences in the manuscript are the filler and some are the actual ciphertext. This means that character sequences would be sampled from two different distributions with different properties. Let's try a simple experiment: take the 1000 most common substrings (character sequences of any length, including dots, commas, etc.) from the ZL EVA transcription and plot a bar chart of their repeat counts (on a logarithmic scale to make the data easy to see).
Then repeat the same experiment with Opus Majus, and then with a version of Opus Majus in which 60% of the words are replaced with random gibberish of the same lengths.
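For anyone who wants to reproduce this, a rough sketch of the procedure (with substring lengths capped for tractability, placeholder file names, and plain uniform-random gibberish; plotting the logarithm of the returned counts gives charts like the ones below):

Code:
import random
import string
from collections import Counter

def top_substrings(text, n=1000, max_len=8):
    # Count all substrings of 2..max_len characters (a tractable stand-in
    # for "substrings of any length") and return the n most common.
    counts = Counter()
    for length in range(2, max_len + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return counts.most_common(n)

def replace_with_gibberish(text, fraction=0.6):
    # Replace a fraction of the words with random gibberish of the same lengths.
    words = text.split()
    for i in random.sample(range(len(words)), int(fraction * len(words))):
        words[i] = "".join(random.choices(string.ascii_lowercase, k=len(words[i])))
    return " ".join(words)

# Placeholder file names for the ZL EVA transcription and Opus Majus.
voynich = open("voynich_zl_eva.txt", encoding="utf-8").read()
opus = open("opus_majus.txt", encoding="utf-8").read()

print(top_substrings(voynich)[:5])
print(top_substrings(opus)[:5])
print(top_substrings(replace_with_gibberish(opus))[:5])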
In the image below you can see the three resulting charts (bar colors encode the lengths of the corresponding substrings, and every 100th substring is shown below the chart for a bit better exposition of the data). And while the Voynich and Opus Majus charts look quite similar in shape (not in color, though, highlighting the unusually strict structure of Voynichese, with many repeating sequences of 5-6 characters), the chart that combines Opus Majus with truly random gibberish looks quite different from both.
[attachment=10864]
This doesn't really prove anything, but I expect that in most cases, if the Voynich MS was created as a combination of two distributions - meaningful and meaningless - this would have been well within reach of modern computational analysis for many years now, and someone somewhere would have created a reasonable split of Voynichese into two statistically dissimilar distributions according to some simple rule (as you said yourself, there should be a rule governing which parts to read and which to skip).
(22-06-2025, 07:19 PM)oshfdk Wrote: This doesn't really prove anything, but I expect that in most cases, if the Voynich MS was created as a combination of two distributions - meaningful and meaningless - this would have been well within reach of modern computational analysis for many years now, and someone somewhere would have created a reasonable split of Voynichese into two statistically dissimilar distributions according to some simple rule (as you said yourself, there should be a rule governing which parts to read and which to skip).
I am not suggesting that the filler text is random gibberish. It makes sense that there is structure to it: "structured" gibberish, if you like. Likewise, the real text will not be random; how it is structured will depend on the sophistication of the cipher used, if not simple substitution (assuming the underlying language is Latin, for example). So, I am not sure that it is helpful to think of them as two dissimilar distributions.
At the moment you could partition all words into two categories: those for which the 2nd glyph is one of the 4 simple gallows characters and those for which that is not the case. (It could also include all words that begin with an 'o' or end with a 'g'.) However, I suspect a somewhat more sophisticated partition.
And now a slightly more realistic example: I typed about 500 words of random language-like gibberish and, instead of proper random data, replaced 60% of the words in Opus Majus with random pieces of this human-produced gibberish (the selection of gibberish words is still properly random). As you can see, this chart's shape is even more unlike the rest: there are sudden drops where my badly randomized gibberish switches to plaintext frequencies.
I think when two distributions with different entropies are combined you can expect jumps like that. But all of this is, of course, more of a hunch than science.
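The premise is at least easy to illustrate: natural language has skewed character frequencies, uniform gibberish does not, so samples from the two have clearly different per-character entropies. A toy check (the Latin line is just a sample sentence, and the gibberish is uniform random letters, which is a simplification):

Code:
import math
import random
import string
from collections import Counter

def char_entropy(text):
    # Per-character Shannon entropy: H = -sum(p * log2(p))
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

latin = "in principio erat verbum et verbum erat apud deum"
# Uniform random gibberish with the same word-length structure:
gibberish = " ".join(
    "".join(random.choices(string.ascii_lowercase, k=len(w))) for w in latin.split()
)

print(round(char_entropy(latin), 2))      # ~3.7: skewed natural frequencies
print(round(char_entropy(gibberish), 2))  # typically above 4: much flatter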
[attachment=10865]
Maybe I should explain why this hypothesis appeals to me.
(1) It helps to explain the very repetitive nature of Voynichese.
(2) It accounts for the presence of Voynichese words that are not repetitive in nature and have their own structure.
(3) Unlike the "hoax"/gibberish theory, which implies all the text in the Voynich manuscript is meaningless and therefore the manuscript as a whole must be meaningless, this hypothesis implies the manuscript is perfectly meaningful and so not the product of a hoax or psychosis.
Whilst I haven't studied Trithemius's Steganographia, I believe it provides another contemporary example of the use of filler text.
But for this theory to work well, one probably needs to devise a good way of splitting the text into null/filler and real text. And I am not sure how best to do that.
I am still of the mindset that the most distinctive and abnormal words are the ones we should pay most attention to, whereas historically in Voynich research they have often tended to be ignored.