(26-03-2025, 06:34 PM)nablator Wrote: I don't understand why you said that each of the strings in each column can't be statically mapped to the English alphabet.
I think I got confused by the 26/26/26 mapping, treating it as if it were set in stone. As I understand now, there are lots of ways to split these words into three chunks, especially if there are additional code points for uppercase letters and punctuation (and spaces?). I'm not sure it makes sense to optimize for 26 specifically: that would imply that each English letter is present in each of the three positions in the text, that there are no spaces, and that the system is case-insensitive.
(26-03-2025, 07:10 PM)nablator Wrote: Frequencies:
1st column:
0.1538 F
0.1325 Ti
0.1197 K
0.1154 m
0.0812 n
0.0684 T
2nd column:
0.1368 a
0.1325
0.1154 m
0.1026 ai
0.0769 iii
0.0641 ae
0.0598 ei
0.0513 oo
3rd column:
0.1325 yun
0.1239 g
0.0769 an
0.0726 z
If all of the prefixes, infixes and suffixes map into the same set (plaintext letters, syllables, n-grams), the top elements should have very similar frequency distributions; I'd expect them to be somewhat more similar than in this split. Maybe this can be used as the optimization criterion when looking for the best split?
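A minimal sketch of how that criterion might be scored, assuming the split is given as three lists of chunks (one per position); the distance measure and the idea of comparing rank-ordered frequencies are my assumptions, not nablator's actual method:
Code:
from collections import Counter

def rank_freqs(chunks, k=10):
    """Relative frequencies of the k most common chunks, ordered by rank."""
    counts = Counter(chunks)
    total = sum(counts.values())
    top = [c for _, c in counts.most_common(k)]
    top += [0] * (k - len(top))            # pad if fewer than k distinct chunks
    return [c / total for c in top]

def split_score(prefixes, infixes, suffixes, k=10):
    """Sum of pairwise L1 distances between the three rank-frequency profiles.
    A lower score means the three positions have more similar distributions."""
    profiles = [rank_freqs(s, k) for s in (prefixes, infixes, suffixes)]
    score = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            score += sum(abs(a - b) for a, b in zip(profiles[i], profiles[j]))
    return score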
(26-03-2025, 11:03 PM)oshfdk Wrote: I think I got confused by the 26/26/26 mapping, treating it as if it were set in stone. As I understand now, there are lots of ways to split these words into three chunks, especially if there are additional code points for uppercase letters and punctuation (and spaces?). I'm not sure it makes sense to optimize for 26 specifically: that would imply that each English letter is present in each of the three positions in the text, that there are no spaces, and that the system is case-insensitive.
There are lots of ways to split these words into three chunks, but few partitions are 1) unambiguous (needed for reversibility; you get a lot of cases like "ab" + "c" = "a" + "bc" in almost all partitions) and 2) compatible with the size of the alphabet, plus maybe 1 for space. Punctuation is possible but unlikely in a ciphertext, so it makes sense to optimize for 27 or fewer, because not all letters have to be present in all three positions in such a short text.
Getting the sizes of all three sets close to 27 is extremely hard, so I was happy to find a 26-26-26 partition, but it is not completely free of ambiguity. I'll try to improve it.
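A minimal sketch of how the ambiguity of a candidate partition could be checked by brute force, assuming the three chunk sets are given; the function names and the approach (enumerate every prefix/infix/suffix split per word and flag words with more than one valid parse) are mine, not necessarily nablator's procedure:
Code:
def parses(word, prefixes, infixes, suffixes):
    """All ways to write word as prefix + infix + suffix from the three sets
    (empty chunks allowed)."""
    out = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            p, m, s = word[:i], word[i:j], word[j:]
            if p in prefixes and m in infixes and s in suffixes:
                out.append((p, m, s))
    return out

def ambiguous_words(words, prefixes, infixes, suffixes):
    """Words with more than one valid split, i.e. where the partition is not reversible."""
    return [w for w in words if len(parses(w, prefixes, infixes, suffixes)) > 1]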
I hoped that getting close enough to the correct partition would make it possible to identify a few of the most frequent letters or spaces.
This is the word token length distribution of this cipher:
[attachment=10223]
I think we can safely ignore the only word of length 2, since it's at the very end of the ciphertext and could be a truncated token due to the lack of further plaintext.
To me it looks a bit suspiciously narrow for a sum of three independent token lengths; I would expect a longer, more gradual right tail. But I don't have a good working understanding of how multiple independent distributions combine into one, so I decided to run a test.
I ran 10 million iterations, combining three sets of tokens of lengths 0 to 5 with random length distributions within each set, and then did the same for two sets.
Result for 3 sets (the known distribution is shown as the counts of the actual word tokens; the combined distributions are scaled to the same total number of tokens):
Code:
Best sets, % of token lengths from 0 to 5: [[3, 37, 33, 12, 5, 7], [3, 14, 40, 38, 0, 2], [38, 17, 41, 0, 0, 1]]
Combined distribution: [ 0 1 8 24 43 50 45 28 15 9 4 0 0 0 0 0]
Known distribution: [ 0 0 0 36 50 57 55 33 2 0 0 0 0 0 0 0]
Error: 42.97
Result for 2 sets:
Code:
Best sets, % of token lengths from 0 to 5: [[0, 2, 4, 42, 44, 5], [23, 19, 25, 26, 1, 2]]
Combined distribution: [ 0 1 3 26 47 51 56 33 7 3 0]
Known distribution: [ 0 0 0 36 50 57 55 33 2 0 0]
Error: 16.16
Interestingly, both solutions have one set with a large percentage of the empty token (length 0): about 1/4 for the 2-set solution and 38% for the 3-set solution. The 2-set solution has a much lower mean squared error. Obviously, it's possible to approximate the distribution perfectly with a 1-set solution.
So, for random combinations of chunks from a few sets, 2-set solutions seem to produce a closer length distribution than 3-set solutions, but it was still impossible to find a 2-set solution that would mimic the distribution perfectly.
Now, the actual combinations of tokens in the cipher are not random, so the above doesn't necessarily apply to the plausibility of splitting the tokens into three parts. But I would reexamine the rationale for choosing a three-way split.
I think I will also repeat the experiment with no empty tokens.
On the other hand, maybe random sampling is not enough to find a good approximation of the distribution: it should still be possible to use 3 sets to mimic the word length distribution perfectly by allowing only 1-character tokens in two of the three sets and setting the third set to the target distribution shifted by two places. If the random search missed this solution, who knows what else it missed.
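For what it's worth, when chunk lengths are drawn independently, the combined word-length distribution is the convolution of the per-set length distributions. Below is a minimal sketch of the kind of random search described above, assuming per-set distributions over lengths 0 to 5 and squared-error scoring against the observed counts; the function names and the exact scoring are my assumptions, not necessarily what was actually run:
Code:
import random

# Observed word-length counts of the cipher, lengths 0..8 (from the table above).
KNOWN = [0, 0, 0, 36, 50, 57, 55, 33, 2]

def convolve(p, q):
    """Length distribution of the concatenation of two independent chunks."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def random_dist(n=6):
    """Random probability distribution over chunk lengths 0..n-1."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def score(dists):
    """Squared error between the scaled combined distribution and the known counts."""
    combined = dists[0]
    for d in dists[1:]:
        combined = convolve(combined, d)
    total = sum(KNOWN)
    scaled = [p * total for p in combined]
    scaled += [0.0] * max(0, len(KNOWN) - len(scaled))
    return sum((a - b) ** 2 for a, b in zip(scaled, KNOWN))

# Random search over three per-set length distributions
# (far fewer iterations than the 10 million mentioned in the post).
best = min(([random_dist() for _ in range(3)] for _ in range(100_000)), key=score)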

If spaces are encoded, they should account for 12-17% of all characters (I sampled a few English texts). In this post I assume there are spaces and that they are encoded as dedicated ciphertext codes; either assumption may be untrue for the actual cipher.
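A trivial sketch of that sampling check, assuming an English plaintext sample is available as a string:
Code:
def space_share(text):
    """Fraction of characters in a plaintext sample that are spaces."""
    return text.count(" ") / len(text)

# e.g. space_share(open("sample.txt", encoding="utf-8").read())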
The following are the top prefixes and suffixes of all words (each percentage is the share of word tokens that start or end with the given prefix/suffix); a sketch of how such counts could be tallied follows the two lists.
Top most common prefixes:
T: 22.65%
F: 20.09%
K: 16.67%
Ti: 15.38%
m: 12.39%
Ka: 9.83%
f: 9.83%
n: 8.97%
Fi: 6.41%
Top most common suffixes:
n: 28.63%
g: 13.25%
un: 13.25%
yun: 13.25%
uy: 8.55%
y: 8.55%
an: 7.69%
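The sketch mentioned above, assuming the ciphertext is available as a list of word tokens; the function name and the affix-length cutoff are mine:
Code:
from collections import Counter

def affix_shares(words, max_len=3, suffix=False):
    """Share of word tokens starting (or, with suffix=True, ending) with each
    affix of length 1..max_len, sorted from most to least common."""
    counts = Counter()
    for w in words:
        for n in range(1, min(max_len, len(w)) + 1):
            counts[w[-n:] if suffix else w[:n]] += 1
    return {a: c / len(words) for a, c in counts.most_common()}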
There are several potential prefixes that could encode the space character, but only two suffixes fall in the right frequency bracket: -g and -yun.
If we look at the distributions of -g and -yun in the text, there is a large gap of 38 consecutive words without the -g suffix but only 25 without the -yun suffix, which makes interpreting -yun as the space slightly more likely.
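A small sketch of that gap measurement, again assuming the ciphertext is a list of word tokens; the helper name is mine:
Code:
def max_gap(words, suffix):
    """Longest run of consecutive words that do not end with the given suffix."""
    gap = longest = 0
    for w in words:
        if w.endswith(suffix):
            gap = 0
        else:
            gap += 1
            longest = max(longest, gap)
    return longest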
If -yun is a space, then the following excerpt seems to indicate that Fiii and Fai are very short words (not unlike "it is" in the other cipher). Another short word would then be Tai, near the beginning of the ciphertext (not in the image below).
[attachment=10236]
Repeated word n-grams, listed as (word n-gram), count (a sketch for finding these follows the lists):
4-grams:
(Tiaeean, bar, nmaei, Teib), 2
3-grams:
(Tiaeean, bar, nmaei), 2
(bar, nmaei, Teib), 2
(Tian, Fooyun, Fiiiyun), 2
(fiiinuy, Timyun, Taeaei), 2
2-grams:
(bar, nmaei), 3
(FaiT, Kzz), 2
(Tiaeean, bar), 2
(nmaei, Teib), 2
(qeK, niinuy), 2
(meaei, Tiavinn), 2
(Tian, Fooyun), 2
(Fooyun, Fiiiyun), 2
(TiiiT, fiiinuy), 2
(fiiinuy, Timyun), 2
(Timyun, Taeaei), 2
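A minimal sketch of how these repeats could be collected, assuming the ciphertext is a list of word tokens; the function name is mine:
Code:
from collections import Counter

def repeated_ngrams(words, n):
    """Word n-grams that occur more than once, with their counts."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {g: c for g, c in grams.items() if c > 1}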
Could it be that there are different encryption rules for mixed-case words and others for all-lowercase words, maybe even for camel case (uppercase in the middle)?
Like (and not assuming anything here is correct, just as an example):
Ciphertext   Plaintext   Rules applied
FaiT         WHAT        Mixed-case
Tiiiig       THREE!      Mixed-case
nmaei        HELLO       All-lowercase
fieeean      HOWEVER     All-lowercase
Kimb         HALT        Mixed-case
Cipher letter   Plaintext letter   Notes
t               O
e               T/E/V/W            Position-dependent
a               H/E                Context-dependent
i               O/R                Context-dependent
Example encryption process:
Split the input into words; for every word, if it contains uppercase letters, use the mixed-case rules, otherwise use the all-lowercase rules (with some exceptions). Then combine the results.
Example: encrypt "WHAT" with the mixed-case rules: F (W) + a (H) + i (A) + T (T) → "FaiT"
Encrypt "THE": all-lowercase exception → "qar"
(28-03-2025, 09:59 AM)Scarecrow Wrote: Could it be that there are different encryption rules for mixed-case words and others for all-lowercase words, maybe even for camel case (uppercase in the middle)?
This is exactly the problem with short ciphertexts: anything is possible. Normally I would try to break an unknown cipher by iterating over various methods of encryption and trying to exclude them one by one as quickly as possible. So, instead of looking for ways to support a particular method, one looks for ways to prove it wrong and move on. With a long ciphertext it's quite easy to identify unrealistic or highly improbable examples in the ciphertext for most methods of encryption, until you stumble upon the right one. With a short ciphertext it's impossible to find counterexamples, so you have to perform a deep search of methods, following each of them to its logical end. This is slow and tedious, and this is what I think most authors of experiments like "let me design a cipher that looks like Voynichese and you try to break it" don't really understand. Design the cipher, OK, but do provide some 50 kilobytes of ciphertext, and then we can run some meaningful statistical attacks like those we can run on Voynichese. Inventing a cipher that is virtually unbreakable from just a few lines of ciphertext is not very hard and doesn't prove any point.
byatan refers to an unknown thread; what about this one from 2019:
The Voynich Ninja > Voynich Research > Analysis of the text > Voynich text generation