The Voynich Ninja

Full Version: The Naibbe cipher
(05-08-2025, 12:15 AM)Juan_Sali Wrote: Some thoughts on the Naibbe cipher.
The Naibbe cipher with 6 tables has 18 different ways to encrypt every plaintext letter.
If I have understood correctly how the random choice of encryption works, then in the case of simplified spacing it will split the sample for every letter into 18 encrypted parts of equal size.
For every plaintext letter: are there 18 samples of equal size for the unigrams, prefixes and suffixes needed for the encryption?

Page 10:
The Naibbe cipher's ability to disguise a given plaintext bigram in one of 36 (6×6) ways through letter-by-letter table selection is by far its greatest strength

The problem is again the random encryption: the samples of the 36 ways will have the same size, a weakness.
A non-random choice would be better; the scribes' preferences for some ways over others would split the 36 ways into samples of different sizes, making them more difficult to decrypt.

Thanks for your question! The cipher can generate 36 different ways of encrypting the same bigram, through 6 prefix options and 6 suffix options per letter. But critically, not all prefixes and not all suffixes are equally likely to be chosen. The single commonest way to encrypt a given bigram occurs ~25/169 of the time, when the prefix and suffix are each encrypted using the primary table (i.e., 5/13 × 5/13). The 4 rarest ways to encrypt a given bigram each occur ~1/169 of the time, when the prefix and suffix are each encrypted using a tertiary table (i.e., 1/13 × 1/13). In short, the different ways of encrypting a given bigram can vary in their frequencies by a factor of 25, purely randomly—and that doesn't even include the possibility of a scribe developing a non-random preference for certain bigrams. In addition, the available prefix and suffix options all vary in their glyph lengths, so the 36 different encryptions of a given bigram can vary in their glyph lengths, too.
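The factor-of-25 spread follows directly from multiplying the two table-selection probabilities. A minimal sketch, assuming table weights proportional to (5, 2, 2, 2, 1, 1) out of 13 — only the primary 5/13 and tertiary 1/13 values are stated above, so the middle weights are an illustrative guess that sums to 13:

```python
from fractions import Fraction
from itertools import product

# Hypothetical per-table selection weights (out of 13). Only the primary
# (5/13) and tertiary (1/13) weights are stated in the thread; the middle
# values are an assumption chosen so the weights sum to 13.
weights = [5, 2, 2, 2, 1, 1]
probs = [Fraction(w, 13) for w in weights]

# Each bigram = prefix + suffix, each half encrypted with one of 6 tables,
# giving 6 x 6 = 36 ways with multiplicative probabilities.
combo_probs = sorted((p * q for p, q in product(probs, probs)), reverse=True)

print(len(combo_probs))                  # 36 ways to encrypt one bigram
print(combo_probs[0])                    # 25/169: primary table for both halves
print(combo_probs[-1])                   # 1/169: tertiary table for both halves
print(combo_probs[0] / combo_probs[-1])  # factor-of-25 frequency spread
```

With these weights there are exactly 4 combinations at 1/169 (the 2 tertiary tables paired with each other), matching the "4 rarest ways" above.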

To your general point about the re-spacing scheme: I'm incredibly open to people riffing on the general structure of the Naibbe cipher and coming up with their own re-spacing schemes, table approaches, etc. This is intended to be a useful reference model that delivers well-defined, reliable performance.
Yes, I fully agree with your observations, oshfdk and magnesium! A nomenclator is wonderfully simple and transparent, and it's easy to see what it does well or poorly: its relationship with the plaintext is straightforward. That's what makes it a valuable thought experiment in my eyes. In some ways, the idea fits well with Bowern and Lindemann's conclusions about the linguistics of the Voynich Manuscript:

Quote:Our work argues that the character level metrics show Voynich to be unusual, while the word and line level metrics show it to be regular natural language and within the range of a number of plausible languages. The higher structure of the manuscript itself is completely consistent with natural language and is very unlikely to be manufactured. This therefore implies that the script is not structure-preserving in that the graphemes are not one to one, but they do encode words in a regular orthography.

I also like how Rene could replicate quasi-reduplication with his clever mod2 system. Anyway, I don’t feel I understand enough of Voynichese to have strong convictions or theories of my own, so I don’t think I can contribute anything at the moment.
Just out of curiosity, what does it look like when you use the cipher to "solve" part of the MS?
(05-08-2025, 08:31 AM)Koen G Wrote: Just out of curiosity, what does it look like when you use the cipher to "solve" part of the MS?

I have parsed all of Voynich B as if it were exactly a Naibbe ciphertext. To be extremely conservative, I have skipped all tokens that cannot be made using the exact procedure I describe in the paper (though theoretically I could generate some, if not most, currently missing word types by mashing 2+ valid Naibbe word types together).

The quality of the Naibbe cipher's coverage varies widely by folio, from 62% of all folio tokens (f57r) to 93% (f81r). In a shared folder, I have an Excel file with every token of Voynich B parsed, as well as a PDF that shows the text parsed as if it were literally a Naibbe ciphertext.


A word of warning: I very deliberately did not want to fine-tune the cipher's specific prefix and suffix assignments too much in the pursuit of yielding readable text...as doing so would have been exactly equivalent to a decryption attempt of the VMS where the current version of the Naibbe cipher is taken to be the actual VMS cipher. So unsurprisingly, the inferred Naibbe plaintext is Latinish gibberish.
Really interesting stuff!

This cipher is very "verbose", isn't it? To get a word of 5 letters you need to write down about 15 Voynich letters or so.

It generally agrees with the intuition of many people that either the original text is very heavily abbreviated, or it is quite short, with nulls in the Voynich text or several symbols encoding a single letter.

But frankly I don't believe the Voynich Manuscript was created this way. It simply feels like too much work.
Also the labels wouldn't make sense; they would in most cases be single letters or at most pairs of letters.
(05-08-2025, 01:19 PM)Rafal Wrote: Really interesting stuff!

This cipher is very "verbose", isn't it? To get a word of 5 letters you need to write down about 15 Voynich letters or so.

Yes, it is! This is the fundamental tradeoff with the Naibbe cipher: extremely reliable replication of a lot of word-level VMS properties, at the expense of verbosity.

Quote:But frankly I don't believe the Voynich Manuscript was created this way. It simply feels like too much work.
Also the labels wouldn't make sense; they would in most cases be single letters or at most pairs of letters.

The labels are one of the biggest issues with the Naibbe cipher, as I mentioned in the paper and presentation. But the overall properties of VMS text imply that, if there's a plaintext lurking there, the cipher has to be verbose to some degree. When applying René's "NOT the solution" cipher to the Divine Comedy, for example, the median word type is 3 letters long, and the longest word type is 4 letters long, a milder version of the same fundamental problem. If the VMS "labels" were each only a few plaintext letters long at most, then how would that change how we interpret them?
As we're talking about possible cipher constructions that are consistent with the VMS, I want to highlight one of the issues I ran into when developing the Naibbe cipher. I'll call it the type generation efficiency problem. Here's the gist of the problem:

According to Reddy and Knight (2011), the VMS is ~38,000 tokens long and contains ~8,100 word types. Because the Naibbe cipher is based on Voynich B, let's use Voynich B as our reference here. Within the VT transliteration of the VMS, Voynich B contains ~23,000 tokens and ~4,900 word types. These exact values will vary from transliteration to transliteration, so let's take them as directional, with a ±10% error.

The type generation efficiency problem is simply this: The VMS text generation method needs to reliably generate close to these numbers of unique word types at these varying text lengths, while obeying other constraints. In effect, a VMS-mimic cipher needs to generate a broadly VMS-like Heaps' law curve.
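Heaps' law describes how the number of unique word types grows with text length. A minimal sketch of how one might trace such a curve for any token stream; the toy corpus with Zipf-like weights below is purely illustrative, not VMS data:

```python
import random

def heaps_curve(tokens, step=1000):
    """Cumulative (tokens seen, unique types seen) pairs for a token stream."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy corpus with a Zipf-ish frequency profile (illustrative assumption)
random.seed(0)
vocab = [f"w{i}" for i in range(6000)]
zipf_weights = [1 / (rank + 1) for rank in range(len(vocab))]
tokens = random.choices(vocab, weights=zipf_weights, k=23000)

for n, v in heaps_curve(tokens, step=5000):
    print(n, v)
```

Running the same function over a transliteration of Voynich B versus a candidate ciphertext makes the efficiency gap discussed below directly visible.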

As it turns out, this is extremely hard to do consistently. René's "NOT the solution" cipher provides us with an excellent example of why this is a difficult problem. I have shared a working Excel version of the cipher.

For those who aren't familiar, René has described the cipher, including all the caveats implied by its name.

There is a lot to like about this cipher: It's easy to do, it's easy to read, it achieves low conditional character entropy, it obeys the VMS word grammar, it passes the eye test, etc. And to be clear, this cipher is, well, not the solution: René is very clearly making simplifying assumptions and is explicit about having done so.

One of this particular cipher's modes of failure is that, relative to the VMS, it's just not efficient enough at generating unique word types. It can encrypt 30,000+ letters of Italian as ciphertexts in the neighborhood of 14,000 tokens long—but when encrypting the Divine Comedy, these ciphertexts contain only ~600 unique word types. Across a tract of Voynich B that's roughly 14,000 tokens long, the text contains ~3000 unique word types or more:

[Image: ltJIKbH.png]

The question, then, is: How do you take something like René's cipher and make it more efficient at generating unique word types?

One obvious answer is that the cipher could be expanded to have more substitution options per plaintext alphabet letter. More options per letter would mean there would be more ways of representing a given plaintext n-gram, thereby generating more unique word types. But there's a catch: The more substitution options you add, the more you tend to slice and dice the frequency of a plaintext n-gram. If there are S options per letter in a homophonic substitution cipher, there are S ways to represent a unigram, S^2 ways to represent a bigram, S^3 ways to represent a trigram, and so on. To match the behavior of the VMS, you need a few dozen of these word types to have absolute frequencies of 0.45% or more within the text, so you can't slice and dice too terribly much.
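The slicing effect can be put in rough numbers. A small sketch, assuming for simplicity that the S options per letter are chosen uniformly (the Naibbe tables are deliberately non-uniform); the 2% bigram frequency is an illustrative value:

```python
# With S uniformly chosen substitution options per plaintext letter, an
# n-gram with plaintext frequency f is spread over S**n ciphertext forms,
# each with expected frequency f / S**n.
def sliced_frequency(f, n, s):
    """Expected frequency of each ciphertext form of an n-gram of frequency f."""
    return f / s ** n

# e.g. a bigram that makes up 2% of the plaintext:
f = 0.02
for s in (2, 4, 6):
    print(s, s ** 2, sliced_frequency(f, 2, s))

# For a ciphertext word type to reach the ~0.45% threshold mentioned above,
# f / s**n must stay above 0.0045, which caps how many uniformly chosen
# options you can afford per letter.
```

This is why non-uniform table probabilities help: they let a few forms of each n-gram stay frequent while the total number of possible forms remains large.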

Another way to generate word types would be to have longer plaintext n-grams. More letters in the n-gram means more ways of representing the n-gram within the cipher, thereby generating more unique word types. But the VMS's token and word type length distributions place powerful constraints here, and so does the VMS's anomalously low entropy. For the text's entropy to be consistent with most natural languages, a putative VMS cipher has to be verbose (aka often mapping a string of 2+ Voynichese glyphs to a given plaintext letter). But if the cipher is verbose, then the observed token and word type length distributions place major constraints on how many plaintext letters you can cram into the average ciphertext token. There are only so many unique word types within the VMS that are 8 or more glyphs long, however you define a glyph.

As @Koen has noted, abbreviation is not a reliable solution here, as abbreviation only works because you can abbreviate down to only the most informative parts of a given n-gram, which may not necessarily lend themselves to generating low conditional character entropy.

To make sure you have a steady supply of common words in just the right proportions, you could also alter the plaintext re-spacing method so you make shorter n-grams more often. Shorter n-grams tend to have higher plaintext frequencies than longer n-grams do, and a letter-by-letter substitution scheme slices and dices these n-grams' frequencies much less than those of longer n-grams. But then you run into another problem: the observed frequency-rank distribution of VMS word types.

Within this general kind of cipher, if there's an appreciable probability of there being unigrams in the plaintext, they will automatically become many of the commonest word types in the ciphertext. Assuming no unigrams are created, the same is also true for bigrams. So looking across the commonest word types, many of them will represent the shortest types of n-grams your plaintext contains. Given the way your plaintext is structured, can your substitution scheme reproduce which words are commonest within the VMS?
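As a toy illustration of this effect: the English-like letter frequencies, the random 1-/2-gram re-spacing, and the 3 homophones per letter below are all assumptions for the sketch, not the Naibbe parameters. The commonest ciphertext word types end up being encodings of the commonest unigrams:

```python
import random
from collections import Counter

random.seed(1)

# Illustrative plaintext with a roughly English-like frequency skew
letters = "etaoinshrdlu"
plain = "".join(random.choices(letters, weights=range(len(letters), 0, -1), k=20000))

# 3 homophones per letter; each substitute is 2 characters long (e.g. "e0")
subs = {c: [f"{c}{i}" for i in range(3)] for c in letters}

# Re-space the plaintext into random 1- and 2-letter n-grams, then
# substitute each letter independently to form ciphertext tokens
tokens = []
i = 0
while i < len(plain):
    n = random.choice([1, 2])
    gram = plain[i:i + n]
    tokens.append("".join(random.choice(subs[c]) for c in gram))
    i += n

# Unigram-derived tokens are 2 characters long; bigram-derived ones are 4
top = Counter(tokens).most_common(10)
for t, c in top:
    print(t, c, "unigram" if len(t) == 2 else "bigram")
```

Even with only modest frequency skew in the plaintext, the head of the ciphertext frequency-rank list is dominated by encodings of the shortest n-grams, which is exactly the constraint the VMS's observed commonest words impose.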

At bottom, the Naibbe cipher is structured the way it is partly because the degree of verbosity and the number of substitution options per letter combine to let the cipher reliably generate word types with broadly VMS-like efficiency, while also replicating many aspects of Voynich B's observed frequency-rank distribution of word types. In a ciphertext of 20,000-21,000 tokens, it reliably generates roughly 4500-4700 unique word types, on pace with Voynich B. Timm and Schinner's self-citation algorithm also does an excellent job of achieving this efficiency.

If the VMS is in fact a ciphertext, the combination of the cipher and the plaintext content needs to explain why the ciphertext contains as many unique word types as it does. What's more, any reference model cipher for the VMS needs to be able to do something like this.
This is really interesting. Would love to hear Nick Pelling's thoughts on this.
Thanks for the very interesting Voynich Day presentation, magnesium. I was thinking about the limitations of the Naibbe cipher in replicating the observed strong positionality of Voynich text (i.e. that certain vord tokens appear disproportionately at the beginning, middle or end of lines). And I was wondering if this could be reflected in Naibbe by keeping a running total of drawn card values across a line and using it to modify the table selection? Perhaps this would add an intolerable amount of complexity to the cipher, though.
Personally, I think it's fascinating that the curve is so close to those for moderately inflected natural languages like Italian or German, or the very simple Latin of the Vulgate.

As Lindemann said in his discussion of the subject (which he tackled through MATTR): “an intriguing aspect of the Voynich text is that while the distribution of characters in the text is extremely unusual, in many ways the distribution of words is not.” (Crux of the MATTR)

[attachment=11144]