As we're talking about possible cipher constructions that are consistent with the VMS, I want to highlight one of the issues I ran into when developing the Naibbe cipher. I'll call it the
type generation efficiency problem. Here's the gist:
According to Reddy and Knight (2011), the VMS is ~38,000 tokens long and contains ~8100 word types. Because the Naibbe cipher is based on Voynich B, let's use Voynich B as our reference here. Within the VT transliteration of the VMS, Voynich B is ~23,000 tokens long and contains ~4900 word types. These exact values are going to vary from transliteration to transliteration, so let's take them as directional with a ±10% error.
The type generation efficiency problem is simply this: The VMS text generation method needs to reliably generate close to these numbers of unique word types at these varying text lengths, while obeying other constraints. In effect, a VMS-mimic cipher needs to generate a broadly VMS-like Heaps' law curve.
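To make the Heaps' law comparison concrete, here is a minimal sketch of how a types-vs-tokens curve can be computed from any token stream. The synthetic vocabulary and the Zipf-like weights below are my own illustrative assumptions, not VMS data:

```python
import random

def heaps_curve(tokens, step=1000):
    """Return (token_count, type_count) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy demo with a synthetic vocabulary and Zipf-like frequencies
# (illustrative only; real VMS tokens would come from a transliteration).
random.seed(0)
vocab = [f"w{n}" for n in range(6000)]
weights = [1 / (r + 1) for r in range(len(vocab))]
stream = random.choices(vocab, weights=weights, k=23000)
for n_tokens, n_types in heaps_curve(stream, step=5000):
    print(n_tokens, n_types)
```

Running the same function over a candidate ciphertext and over Voynich B gives two curves that can be compared directly at matching token counts.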
As it turns out, this is extremely hard to do consistently. René's "NOT the solution" cipher provides us with an excellent example of why this is a difficult problem. I provide a working Excel version of the cipher here.
For those who aren't familiar, René describes the cipher here, including all the caveats implied by the cipher's name.
There is a lot to like about this cipher: It's easy to do, it's easy to read, it achieves low conditional character entropy, it obeys the VMS word grammar, it passes the eye test, etc. And to be clear, this cipher is, well, not the solution: René is very clearly making simplifying assumptions and is explicit about having done so.
One of this particular cipher's modes of failure is that, relative to the VMS, it's just not efficient enough at generating unique word types. It can encrypt 30,000+ letters of Italian as ciphertexts in the neighborhood of 14,000 tokens long, but when the plaintext is the Divine Comedy, those ciphertexts contain only ~600 unique word types. A tract of Voynich B roughly 14,000 tokens long, by contrast, contains ~3000 unique word types or more.
The question, then, is: How do you take something like René's cipher and make it more efficient at generating unique word types?
One obvious answer is that the cipher could be expanded to have more substitution options per plaintext alphabet letter. More options per letter would mean more ways of representing a given plaintext n-gram, thereby generating more unique word types. But there's a catch: The more substitution options you add, the more you tend to slice and dice the frequency of a plaintext n-gram. If there are S options per letter in a homophonic substitution cipher, there are S ways to represent a unigram, S^2 ways to represent a bigram, S^3 ways to represent a trigram, and so on. To match the behavior of the VMS, a few dozen of these word types need absolute frequencies of 0.45% or more within the text, so you can't slice and dice too terribly much.
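The slicing effect comes down to a couple of lines of arithmetic. The 3% bigram frequency and the homophone counts in this sketch are illustrative assumptions of mine, not measured values:

```python
# With S homophones per letter, a plaintext n-gram has S**n distinct
# ciphertext renderings, so a uniform choice among homophones cuts each
# rendering's frequency to freq / S**n.
def renderings(n, S):
    return S ** n

def rendering_freq(ngram_freq, n, S):
    """Expected frequency of any single rendering under uniform choice."""
    return ngram_freq / renderings(n, S)

# e.g. a bigram with 3% plaintext frequency and S = 2 homophones per letter:
print(rendering_freq(0.03, n=2, S=2))   # 0.03 / 4 = 0.0075, i.e. 0.75%

# With S = 4, the same bigram's renderings each fall to 0.03 / 16, well
# below the ~0.45% threshold needed for the commonest VMS word types.
print(rendering_freq(0.03, n=2, S=4))
```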
Another way to generate word types would be to use longer plaintext n-grams. More letters in the n-gram means more ways of representing it within the cipher, thereby generating more unique word types. But the VMS's token and word type length distributions place powerful constraints here, and so does the VMS's anomalously low entropy. For the text's entropy to be consistent with most natural languages, a putative VMS cipher has to be verbose (i.e., it often maps a string of 2+ Voynichese glyphs to a single plaintext letter). But if the cipher is verbose, then the observed token and word type length distributions place major constraints on how many plaintext letters you can cram into the average ciphertext token. There are only so many unique word types within the VMS that are 8 or more glyphs long, however you define a glyph.
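This length budget can be sketched back-of-the-envelope style. The figures here (about 2 glyphs per plaintext letter, a practical ceiling of about 8 glyphs per token) are assumptions of mine for illustration, not measurements:

```python
# If a verbose cipher averages `glyphs_per_letter` ciphertext glyphs per
# plaintext letter, and tokens top out around `max_token_glyphs` glyphs,
# then only so many plaintext letters fit in one ciphertext token.
def max_ngram_length(glyphs_per_letter=2.0, max_token_glyphs=8):
    return int(max_token_glyphs // glyphs_per_letter)

print(max_ngram_length())        # ~4 plaintext letters per token at most
print(max_ngram_length(2.5, 8))  # drops to 3 with a more verbose mapping
```

The point is simply that verbosity and long plaintext n-grams pull in opposite directions: the more verbose the cipher, the shorter the n-grams its tokens can carry.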
As @Koen has noted, abbreviation is not a reliable fix here: abbreviation works by cutting a given n-gram down to its most informative parts, and those parts do not necessarily lend themselves to generating low conditional character entropy.
To make sure you have a steady supply of common words in just the right proportions, you could also alter the plaintext re-spacing method so you make shorter n-grams more often. Shorter n-grams tend to have higher plaintext frequencies than longer n-grams do, and a letter-by-letter substitution scheme slices and dices these n-grams' frequencies much less than those of longer n-grams. But then you run into another problem: the observed frequency-rank distribution of VMS word types.
Within this general kind of cipher, if there's an appreciable probability of unigrams appearing in the re-spaced plaintext, they will automatically become many of the commonest word types in the ciphertext. If no unigrams are created, the same holds for bigrams. So many of the commonest word types will represent the shortest n-grams your plaintext contains. Given the way your plaintext is structured, can your substitution scheme reproduce which words are commonest within the VMS?
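A quick simulation illustrates why the shortest n-grams crowd the top of the frequency-rank list. The letter inventory, weights, unigram probability, and homophone scheme below are all illustrative assumptions of mine, not a model of any real cipher:

```python
from collections import Counter
import random

random.seed(1)
letters = "etaoinshrdlu"
letter_w = [12, 9, 8, 8, 7, 7, 6, 6, 6, 4, 4, 3]  # rough English-like weights
S = 2                                             # homophones per letter
sub = {c: [f"{c}{k}" for k in range(S)] for c in letters}

def encipher_ngram(ng):
    """Substitute each plaintext letter with a randomly chosen homophone."""
    return "".join(random.choice(sub[c]) for c in ng)

tokens = []
for _ in range(20000):
    if random.random() < 0.3:   # 30% of n-grams are unigrams (assumption)
        ng = "".join(random.choices(letters, weights=letter_w, k=1))
    else:                       # the rest are bigrams
        ng = "".join(random.choices(letters, weights=letter_w, k=2))
    tokens.append(encipher_ngram(ng))

# The commonest ciphertext word types are overwhelmingly enciphered
# unigrams (length-2 strings here, since each homophone is 2 characters):
# a unigram rendering's frequency is only divided by S, a bigram's by S**2.
top = Counter(tokens).most_common(10)
print(top)
```

Even with unigrams making up only 30% of the n-grams, their renderings dominate the head of the distribution, which is exactly the constraint the VMS's observed commonest words impose on the substitution scheme.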
At bottom, the Naibbe cipher is structured the way it is partly because its degree of verbosity and its number of substitution options per letter combine to let the cipher reliably generate word types with broadly VMS-like efficiency while also replicating many aspects of Voynich B's observed frequency-rank distribution of word types. In a ciphertext of 20,000-21,000 tokens, it reliably generates roughly 4500-4700 unique word types, on pace with Voynich B. Timm and Schinner's self-citation algorithm also does an excellent job of achieving this efficiency.
If the VMS is in fact a ciphertext, the combination of the cipher and the plaintext content needs to explain why the ciphertext contains as many unique word types as it does. What's more, any reference model cipher for the VMS needs to be able to do something like this.