Question about unicity distance - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Question about unicity distance (/thread-4715.html)
Question about unicity distance - kckluge - 21-05-2025

I could swear there was a reference to unicity distance in a recent comment, and I was going to ask this there, but for the life of me I can't find it -- so here we go...

Unicity distance "is the length of an original ciphertext needed to break the cipher by reducing the number of possible spurious keys to zero in a brute force attack" (Wikipedia). It's a potential way of trying to formalize the "too many degrees of freedom" critique of proposed Voynich solutions as well as other possible prunings (i.e., if you can't get a coherent stretch of plaintext from a 28+ character long stretch of the ciphertext, you haven't proposed a credible solution of the text as a simple monoalphabetic cipher).

What I can't quite figure out is how to correctly compute the unicity distance for something like Brumbaugh's proposed cipher. To recap for those unfamiliar with his "solution", encipherment proceeded in two stages:

1) Convert from plaintext letters to digits using the following grid:

a, j, v = 1
b, k, r = 2
c, l, w = 3
d, m, s = 4
e, n, x = 5
f, o, t = 6
g, p, y = 7
h, q, u = 8
i, -us, z = 9

2) Replace each digit with one of several glyphs corresponding to that digit.

So, for instance, his hypothetical enciphering of one of the labels near the upper right corner of one of the pages is

P E P P E R --> 7 5 7 7 5 2 --> (EVA) s a r ch a r

so each of the three instances of the digit 7 gets replaced by a different Voynich script glyph ('s', 'r', and 'ch'), while both 5's get replaced by EVA 'a'. (For the purposes of this discussion, ignore that EVA 'r' also maps to the digit 2 here -- that has to do with how Brumbaugh sees variant forms that get grouped together as 'r' in EVA...)

Is there someone out there who knows how to compute the unicity distance for a cipher like this, and if so could you walk me through it?

Thanks,
Karl

RE: Question about unicity distance - dashstofsk - 22-05-2025

This is a one-way encryption. Almost impossible to get back to the original. But why would the authors of the VMS want to do this? They themselves would not be able to decipher their own writing. And to do this for ~36000 words, and for each word to have to stop to think about which conversion to use?

It is my belief that people are thinking too hard about the manuscript. The authors probably wanted to use a more primitive method. The solution is probably simpler.

RE: Question about unicity distance - kckluge - 22-05-2025

(Yesterday, 07:20 AM)dashstofsk Wrote: This is a one-way encryption. Almost impossible to get back to the original. But why would the authors of the VMS want to do this?

The question wasn't whether Brumbaugh was wrong -- it's clear he was -- it was a specific technical question about how to compute unicity distance for such a cipher. Even with multiple options for individual words, there almost certainly has to be some text length such that only one set of possible options for the words forms a coherent text.

RE: Question about unicity distance - oshfdk - 22-05-2025

I think this particular scheme is practically reversible, with occasional ambiguities, but only if the second stage is reversible (the digit is unambiguously identified by the glyph). Maybe with some practice it's possible to learn to read it slowly without external tools.
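To make the grid concrete before the worked example that follows, here is a minimal sketch of stage 1 (letters to digits) in Python. The table is from the opening post; the names and structure are mine, and the one-to-many digit-to-glyph stage is deliberately left out:

```python
# Stage 1 of the cipher described in the opening post: letters -> digits.
# Stage 2 (digit -> one of several glyphs) is omitted, since it is
# one-to-many and doesn't affect the letter/digit mapping.
GRID = {
    1: "ajv", 2: "bkr", 3: "clw", 4: "dms", 5: "enx",
    6: "fot", 7: "gpy", 8: "hqu", 9: "iz",  # the digraph "-us" also maps to 9
}
LETTER_TO_DIGIT = {c: str(d) for d, letters in GRID.items() for c in letters}

def encode(plaintext: str) -> str:
    """Map letters to digits, passing spaces through unchanged."""
    return "".join(LETTER_TO_DIGIT.get(c, c) for c in plaintext.lower())

print(encode("pepper"))              # 757752, matching the PEPPER example
print(encode("you like it or not"))  # 768 3925 96 62 566
```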
For example (suppose we know that the plaintext is in English and the spacing is preserved):

4631524 3933 46315 3856852 768 3925 96 62 566

Let's start with the short words: 96 and 62. 96 is (I | US | Z) + (F | O | T); given English, it's either IT or IF. 62 is (F | O | T) + (B | K | R); there should be a vowel, so it starts with O and the word is OR. (Edit: or it could be OK, which would work for IF OK, so it's more ambiguous than I thought, but I still think it's possible to deduce the correct plaintext.) Now, IF OR looks odd, so it's more likely IT OR.

Then 566 and 768. 566 is (E | N | X) + (F | O | T) + (F | O | T); even without the context given by IT OR ..., I see only one word here: NOT (which means all standalone 566's in the text will be NOTs). 768 is (G | P | Y) + (F | O | T) + (H | Q | U); this one is very easy, given that the first and the last letters are relatively rare. I can only see YOU here and no other alternatives.

So, we have: YOU 3925 IT OR NOT. It's possible to just guess and check, but let's try and do it the slow way. 3925 is (C | L | W) + (I | US | Z) + (B | K | R) + (E | N | X). Let's try 9 as I, as the most frequent option, especially given that the first letter is a consonant (though there is a possibility of CUS/LUS too): YOU ?I?? IT OR NOT. The ending is I + (B | K | R) + (E | N | X), which given normal letter combinations in English is most likely I + (B | K | R) + E, so YOU ?I?E IT OR NOT. Not that many words would work here as (C | L | W) + I + (B | K | R) + E (WIRE, for example), but this is most likely LIKE.

4631524 3933 46315 3856852 YOU LIKE IT OR NOT

Now we can just check whether WHETHER fits 3856852, and then there are only 3 words left to crack. 3933 is (C | L | W) + (I | US | Z) + (C | L | W) + (C | L | W), etc. I think after a while one will get used to common combinations of symbols and the decoding will get much easier.

RE: Question about unicity distance - oshfdk - 22-05-2025

(Yesterday, 08:17 AM)kckluge Wrote: The question wasn't whether Brumbaugh was wrong -- it's clear he was -- it was a specific technical question about how to compute unicity distance for such a cipher. Even with multiple options for individual words, there almost certainly has to be some text length such that only one set of possible options for the words forms a coherent text.

(Edit: the below is wrong, I'll add a correction in a later post.)

Given there is no answer so far, I'd say that one can just naively apply the Shannon formula, as shown in the Wikipedia article. That's what I would usually do when I have no idea what I'm doing. So,

U = Hk / D

Here Hk is the key entropy. Given no specific key constraints, this is the binary logarithm of the number of ways to group the Latin alphabet (+ one extra sequence like "us" to make it split evenly) into sets of three. D is the information density in bits per character. English, as per the same article, has about 1.5 bits per character, but we reduce the source set by a factor of 3 by merging letters into groups.

Without much thinking (I'm not sure I can get to the exact solution here in reasonable time), Hk for splitting 27 items into 9 labelled groups of 3 items is (I delegated this to Claude, so won't vouch for this) lb(27! / (3!)^9), which is roughly 70.

U = 70 / 0.5 = 140

So, you need about 140 characters of the ciphertext to identify the key for an English plaintext using this encoding scheme. I'd say this looks reasonable.
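For anyone who wants to reproduce this arithmetic, a short sketch that replays the post's own numbers, including the density estimate of 1.5 / 3 = 0.5 bits per character that the later posts revise:

```python
from math import factorial, log2

# Key entropy: 27 tokens (a-z plus "-us") split into 9 labelled groups of 3,
# i.e. the multinomial coefficient 27! / (3!)^9.
Hk = log2(factorial(27) // factorial(3) ** 9)

D = 1.5 / 3  # the density estimate used in this post (corrected later)

print(round(Hk))      # ~70 bits of key entropy
print(round(Hk / D))  # ~140 characters of ciphertext
```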
RE: Question about unicity distance - Rafal - 22-05-2025

Quote: So, you need about 140 characters of the ciphertext to identify the key for an English plaintext using this encoding scheme.

I'm not familiar with that measure, but one source says:

Quote: Note that just because you have 30 characters of ciphertext for a substitution cipher, this does not mean you will be able to break it. The Unicity distance is a theoretical minimum. In general you may need 4 times more characters (100 or more) to reliably break substitution ciphers. The same problem exists with the other ciphers.

I also have a feeling that this value depends on the type of the cipher used. As we don't know the type of cipher used in the VM, we don't know how to compute it. We don't even know the language, which is probably a problem as well.

But the Voynich Manuscript is really long. Common sense tells us that with this amount of text it should be easily broken. Yet it isn't.

RE: Question about unicity distance - oshfdk - 22-05-2025

(Yesterday, 10:47 AM)Rafal Wrote: I'm not familiar with that measure, but one source says:

From my understanding this is not exactly correct. Maybe the author is talking about statistical methods of breaking a cipher, but then the unicity distance seems to be of little use. Unicity distance looks like a solid theoretical average metric for the brute force exhaustive attack. With a known type of cipher (e.g., simple alphabet substitution) one should be able to reliably identify the key using just a few bits more than the unicity distance amount of ciphertext (although it might take a prohibitively long time in practice). On the other hand, if one doesn't know the exact type of the cipher, or if the attack is not an exhaustive brute force, which is the case in most real life situations, then the unicity distance simply doesn't apply, as far as I understand.

(Yesterday, 10:47 AM)Rafal Wrote: I also have a feeling that this value depends on the type of the cipher used. As we don't know the type of cipher used in the VM, we don't know how to compute it. We don't even know the language, which is probably a problem as well.

Yes, I agree. I was just answering a specific question about possible ways to compute the unicity distance. The Voynich Manuscript is definitely long enough for all kinds of attacks, but its script is definitely weird and ambiguous enough to prevent computing even simple metrics unambiguously.

RE: Question about unicity distance - davidd - 22-05-2025

Missing or reordered pages are a big problem for this cipher, unless they restart on every page. Restarting on every page would produce a lot more repetition.

RE: Question about unicity distance - oshfdk - 22-05-2025

After some thinking I found the unicity distance formula weird, so I reread the article, and sure enough I had misunderstood what D is: it's not the information density, it's the difference between the maximum information density given the source character set size and the actual information density.

Which probably means that for D we need to compute the difference between lb(9) ≈ 3.17 and the actual per-character information of an English text after mapping 27 source tokens to 9 ciphertext tokens. But here we have a problem: I think this latter number depends on the actual mapping used.

Intuitively this is easy to understand. When grouping letters into sets of 3, one can pair each of the popular letters with two significantly rarer ones and make a code that would be almost readable by just replacing each digit with the most common letter associated with it, so the information density will be closer to plain English. Alternatively, one can group all the frequent letters together and all the rare letters together and get a code with so much information loss that it would be very hard or even impossible to break.

So, I'm not sure how to approach this. If we are talking about the unicity distance for an arbitrary grouping of source characters into triplets, then maybe some averaging should be done for D. And in a bad enough case this distance can be infinite (no information redundancy left after mapping letters to digits). Again, as I said in the very beginning, I have no idea what I'm doing, and it could even be the first time I've encountered this "unicity distance" metric at all.
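One rough way to see this mapping-dependence numerically is to compare the unigram entropy of the digit stream under the two kinds of grouping. This is only a sketch: the letter frequencies are approximate, the "lumped" grouping is invented for illustration, and unigram entropy ignores context, so it overstates the true information rate:

```python
from math import log2

# Approximate English single-letter frequencies (ballpark figures only).
FREQ = {'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070, 'n': .067,
        's': .063, 'h': .061, 'r': .060, 'd': .043, 'l': .040, 'c': .028,
        'u': .028, 'm': .024, 'w': .024, 'f': .022, 'g': .020, 'y': .020,
        'p': .019, 'b': .015, 'v': .010, 'k': .008, 'j': .002, 'x': .002,
        'q': .001, 'z': .001}

def digit_entropy(groups):
    """Unigram entropy (bits per digit) of the digit stream under a grouping."""
    ps = [sum(FREQ[c] for c in g) for g in groups]
    total = sum(ps)
    return -sum(p / total * log2(p / total) for p in ps)

spread = ["ajv", "bkr", "clw", "dms", "enx", "fot", "gpy", "hqu", "iz"]  # Brumbaugh's grid
lumped = ["eta", "oin", "shr", "dlc", "umw", "fgy", "pbv", "kjx", "qz"]  # frequent letters together

print(round(digit_entropy(spread), 2))  # ~3.05, close to the lb(9) ~ 3.17 ceiling
print(round(digit_entropy(lumped), 2))  # ~2.62, noticeably more information lost
```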
RE: Question about unicity distance - oshfdk - 23-05-2025

Looks like I had to sleep on it; I think now I know how to apply the formula to this cipher. Note that I'm not a cryptographer, so treat this with a generous portion of salt.

U = H(k) / D

Assuming all keys are equally likely, H(k) is just the logarithm of the total number of possible keys. For Brumbaugh's cipher this is fixed and easy to compute combinatorially, as in the earlier post. H(k) reflects how many keys we need to go through in a brute force attack.

And D is the logarithm of the total number of all possible strings in the given alphabet divided by the number of all meaningful strings (those that we would find plausible as the solution to the cipher), per each new character of the plaintext. For example, for English, appending a new character creates 26 possible new random strings; at the same time, given 1.5 bits per character in English (as per the Wikipedia article), adding a new character to a string would on average produce 2 ** 1.5, or 2.83, new meaningful strings. So, D for a simple substitution in English is log2(26 / 2.83), or log2(26) - 1.5, or ~3.2.
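As a sanity check of this definition: plugging in the numbers for a plain monoalphabetic substitution gives roughly 28 characters, which is presumably where the "28+ character" figure in the opening post comes from. A minimal sketch:

```python
from math import factorial, log2

# Plain monoalphabetic substitution over 26 letters.
Hk = log2(factorial(26))  # ~88.4 bits: any permutation of the alphabet is a key
D = log2(26) - 1.5        # ~3.2 bits of redundancy per character of English

print(round(Hk / D))      # ~28 characters of ciphertext
```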
Overall, D reflects how rare meaningful strings are. For example, if all possible combinations of Latin letters (say, AAA, AAB, AAC, ... ZZZZZZY, ZZZZZZZ, ...) were meaningful to us, we wouldn't be able to decode the cipher, because any key would produce a different meaningful solution; the proportion of all strings to meaningful strings would be 1, its logarithm would be 0, and U would be H(k) / 0, or infinite.

Instead of U as the length, we can think about the charset size raised to the power of U as the total number of all possible ciphertexts of length U characters, and then the equation in pseudomath would just become:

log(total number of possible ciphertexts of length U) = log(total number of possible keys) / log(number of new random plaintext strings when adding one character / mean number of new meaningful plaintext strings when adding one character)

or:

how long a ciphertext we'll need = how many keys we need to go through vs. how rare meaningful plaintexts are among random plaintexts

which looks understandable to me. The more keys we have to go through and the larger the number of possible meaningful plaintexts is, the longer the ciphertext should be to leave us with just one single working combination of a key and a meaningful plaintext.

This would work as is for a simple substitution cipher, but Brumbaugh's cipher is lossy: a few source characters are mapped to the same digit. The effect of this is that for the same number of meaningful and random plaintexts we have a lower number of digit sequences, and, what is important, the same digit sequence will correspond to multiple plaintexts and possibly multiple meaningful plaintexts. If the goal is to identify the correct key AND the correct unique plaintext, then the outcome we are looking for is that, after bruteforcing all possible keys, not only is there just one digit sequence corresponding to a meaningful English plaintext, but this sequence also corresponds to only one meaningful English plaintext.

So, when computing D we can take the logarithm of (all of them per character): the total number of possible digit sequences (A), divided by the total number of digit sequences corresponding to meaningful plaintexts (B), and multiplied by the total number of meaningful plaintexts divided by the total number of corresponding unique digit sequences (C).

(A) is simply 9, as there are 9 distinct digits. Assuming a uniform distribution of plaintexts, (C) is 3, as there are 3 times as many characters (27) as there are digits (9). (B) is 2.83 * 3, since by adding a new single digit we effectively add 3 possible characters (there is a caveat here, see below).

So, if D = log2(A / B * C), then D = log2(9 / (2.83 * 3) * 3) = log2(9 / 2.83) ≈ 1.67.

If this is the right way of looking at it, then the result is actually smaller than the 140 I computed based on a wrong assumption in my previous post:

U = 70 / 1.67 ≈ 42

*the caveat: I think B depends on the specific mapping of plaintext into digits. In principle, some variants of the key for Brumbaugh's cipher would be nearly impossible to use, making the cipher too lossy to recover the plaintext no matter its length, and hence no amount of ciphertext would lead to a unique solution.
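Putting the corrected computation in one place, a sketch of this post's reasoning (with the same caveat that B, and therefore D, really depends on the specific grouping):

```python
from math import factorial, log2

# Key entropy, as before: 27 tokens into 9 labelled groups of 3.
Hk = log2(factorial(27) // factorial(3) ** 9)  # ~70 bits

A = 9         # possible digit sequences added per new character
B = 2.83 * 3  # digit sequences covering meaningful plaintexts (key-dependent!)
C = 3         # plaintexts mapping to each digit sequence (27 letters / 9 digits)

D = log2(A / B * C)   # = log2(9 / 2.83) ~ 1.67 bits of redundancy per digit
print(round(Hk / D))  # ~42 digits of ciphertext for a unique key + plaintext
```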