The Voynich Ninja

Full Version: How multi-character substitution might explain the Voynich's strange entropy
(27-06-2025, 01:15 PM)Koen G Wrote: We don't know if Voynichese is a phonetic representation at all. Whatever it is, is very much dependent on its writing system.

All the 'quirks' that we are seeing are aspects of writing (incomplete list):
- The small selection of paragraph-starting characters
- The predominance of single-leg gallows on top lines
- The predominance of Eva-m and Eva-g at line ends
- The more subtle 'rightwards and downwards' trends of Patrick Feaster.

This makes it quite unlikely that the MS text is a direct rendition of a spoken text.
(27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity

OK, and I was commenting on plots labelled "Entropy," in the interest of not multiplying jargon without necessity.  But the formal relationship between the two is clear enough.  Meanwhile I have had time to illustrate my comment with a graph.

Imagine a tireless Cuva Character Selector that generates symbols, independently, according to probabilities observed in a VMS transliteration.  By construction, the CCS's true entropy is 3.85 bits at all orders, shown as a blue dashed line below:
[attachment=10918]
The blue dots represent calculated conditional entropy estimates when we have only 144 000 characters of output to analyze.  The dramatic decrease is a pure artifact of limited sample length.  Orders 1, 2, and 3 are reliable; the estimates for n > 3 falsely suggest correlations in the Character Selector.  The computational procedure is what it is, but the high-n results should not be interpreted as "entropy" in the received sense.  Maybe renaming it will help!
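For anyone who wants to reproduce the effect, here is a minimal Python sketch along these lines (not the code behind the figure; the symbol probabilities below are arbitrary placeholders rather than the actual Cuva frequencies, so the true entropy it prints will differ from 3.85 bits):

[code]
import math
import random
from collections import Counter

def conditional_entropy(text, order):
    """Plug-in estimate of H(next char | previous order-1 chars), in bits."""
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    contexts = Counter(text[i:i + order - 1] for i in range(len(text) - order + 1))
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_gram = count / total                   # P(context, char)
        p_cond = count / contexts[gram[:-1]]     # P(char | context)
        h -= p_gram * math.log2(p_cond)
    return h

# Memoryless source: symbols drawn independently with a fixed, made-up skewed
# distribution, so the true conditional entropy is identical at every order.
random.seed(0)
alphabet = "abcdefghijklmnopqrst"
weights = [2 ** (-i / 6) for i in range(len(alphabet))]
sample = "".join(random.choices(alphabet, weights=weights, k=144_000))

probs = [w / sum(weights) for w in weights]
print("true entropy:", round(-sum(p * math.log2(p) for p in probs), 2), "bits")
for n in range(1, 9):
    print("order", n, ":", round(conditional_entropy(sample, n), 2), "bits")
[/code]

The low-order estimates should stay near the true value, while the higher orders should sink simply because most long contexts occur only a handful of times in 144 000 characters.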

The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results.  They show that the VMS generator is correlated on a scale of 2-4 characters, before the supposed-entropy curve is degraded by its underlying numerical limitations.  Any features appearing farther to the right are suspect.  We already know that sample length shapes such curves; to make a strong case that the encoding method also has a differential effect on them, some kind of sample-length validation would help.  How large is your Docta Ignorantia, and how is its length affected by the transformations applied?
(28-06-2025, 09:52 PM)obelus Wrote: The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results.  They show that the VMS generator is correlated on a scale of 2-4 characters, before the supposed-entropy curve is degraded by its underlying numerical limitations.  Any features appearing farther to the right are suspect.  We already know that sample length shapes such curves; to make a strong case that the encoding method also has a differential effect on them, some kind of sample-length validation would help.  How large is your Docta Ignorantia, and how is its length affected by the transformations applied?

Thank you for the explanation; I was struggling a bit to follow the discussion otherwise.

As far as I remember, quimqu was experimenting with GPT models. I think it could still be useful to try longer n-grams with a small GPT model, provided the model is too small to memorize all patterns of length n. If it turns out the model can still improve its score (get lower perplexity) for high-order n-grams despite not having enough weights, it would mean there are rules or correlations between characters up to n positions apart, and the model is able to learn them. Does this sound plausible?
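One rough way to gauge the memorization side of this, sketched in Python (the file name is a placeholder for whatever transliteration is in use), is to count how many distinct character n-grams the corpus actually contains and compare that with the parameter budget of the small GPT:

[code]
def distinct_ngrams(text, n):
    """Number of distinct character n-grams in the text: a rough lower bound on
    how many patterns a model would need to store to fit order-n statistics by rote."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})

# Placeholder file name; substitute whatever transliteration export is in use.
# text = open("voynich_transliteration.txt", encoding="utf-8").read()
# for n in range(2, 13):
#     print(n, distinct_ngrams(text, n))
[/code]

If the distinct n-gram count at some order clearly exceeds what the model's weights could plausibly store, yet perplexity still improves when the context is extended to that order, that would point to learnable structure rather than rote memorization.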
(28-06-2025, 09:52 PM)obelus Wrote:
(27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity
How large is your Docta Ignorantia, and how is its length affected by the transformations applied?

Hi Obelus, thank you for your reply; I'll try to answer. The lengths of the different books analyzed are shown below (a sketch of how such counts can be reproduced follows the listings):

File: Ambrosius_Medionalensis_Cleaned.txt
Total words: 117734
Total characters: 665138
Word length counts:
  Length 1: 2029
  Length 2: 13199
  Length 3: 17140
  Length 4: 15437
  Length 5: 14091
  Length 6: 13050
  Length 7: 12480
  Length 8: 10507
  Length 9: 7776
  Length 10: 5424
  Length 11: 3266
  Length 12: 1777
  Length 13: 804
  Length 14: 405
  Length 15: 193
  Length 16: 59
  Length 17: 35
  Length 18: 14
  Length 19: 8
  Length 20: 11
  Length 21: 5
  Length 22: 6
  Length 23: 3
  Length 24: 2
  Length 25: 2
  Length 26: 5
  Length 28: 1
  Length 29: 1
  Length 30: 2
  Length 33: 1
  Length 34: 1


File: La_reine_margot_clean.txt
Total words: 115730
Total characters: 486685
Word length counts:
  Length 1: 6395
  Length 2: 29893
  Length 3: 17986
  Length 4: 18381
  Length 5: 13470
  Length 6: 9752
  Length 7: 7134
  Length 8: 5016
  Length 9: 3537
  Length 10: 2284
  Length 11: 954
  Length 12: 431
  Length 13: 318
  Length 14: 101
  Length 15: 46
  Length 16: 27
  Length 17: 5


File: Romeo_and_Juliet_clean.txt
Total words: 27716
Total characters: 110486
Word length counts:
  Length 1: 2003
  Length 2: 4804
  Length 3: 5680
  Length 4: 6542
  Length 5: 3456
  Length 6: 1999
  Length 7: 1355
  Length 8: 776
  Length 9: 600
  Length 10: 324
  Length 11: 104
  Length 12: 48
  Length 13: 13
  Length 14: 10
  Length 15: 1
  Length 16: 1


File: de_docta_ignorantia_punts.txt
Total words: 37256
Total characters: 212657
Word length counts:
  Length 1: 174
  Length 2: 5104
  Length 3: 5500
  Length 4: 4923
  Length 5: 4965
  Length 6: 3359
  Length 7: 3491
  Length 8: 2722
  Length 9: 2293
  Length 10: 1657
  Length 11: 1207
  Length 12: 854
  Length 13: 443
  Length 14: 344
  Length 15: 122
  Length 16: 38
  Length 17: 23
  Length 18: 16
  Length 19: 4
  Length 20: 13
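
(For reference, counts like the above can be reproduced with a few lines of Python. This is a generic sketch that splits on whitespace, so the character total covers only characters inside words and the cleaning may differ slightly from the files listed above.)

[code]
from collections import Counter

def word_length_stats(path):
    """Print total words, total characters (inside words), and a word-length histogram."""
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    lengths = Counter(len(w) for w in words)
    print("Total words:", len(words))
    print("Total characters:", sum(n * c for n, c in lengths.items()))
    print("Word length counts:")
    for n in sorted(lengths):
        print(f"  Length {n}: {lengths[n]}")

# word_length_stats("de_docta_ignorantia_punts.txt")
[/code]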


About the ciphers: the multi-character substitution (previously, and wrongly, called a homophonic cipher; note that I corrected the initial post) doubles each word's length, since it adds a 0, 1 or 3 after each character. Note that this cannot be exactly the Voynich cipher, as it leaves no words of length 1, while there are plenty of one-character words in the MS.
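A minimal sketch of that kind of transformation, assuming the inserted digit is picked per character (here at random, purely as a placeholder for whatever assignment rule the actual experiment uses):

[code]
import random

def multi_char_substitute(word, digits="013", rng=random):
    """Insert one of the given digits after every character, doubling the word length."""
    return "".join(ch + rng.choice(digits) for ch in word)

random.seed(1)
print(multi_char_substitute("docta"))   # e.g. 'd0o3c1t1a0' -- always twice as long
print(multi_char_substitute("a"))       # a 1-character word always becomes 2 characters
[/code]

Because every single-character word becomes two characters, the ciphertext can never contain length-1 words, which is exactly the mismatch with the MS noted above.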

I totally agree with you that above, say, 6- or 7-grams, the entropy gives us no information, as there are few words of those lengths or longer and the n-gram model is obviously overfitting (so the estimate tends to 0). I agree with your comment on your data visualization that "the dramatic decrease is a pure artifact of limited sample length". In fact, I don't care much about n-grams greater than 6 in the Voynich analysis. I care especially about the 2-, 3- and 4-gram entropy. Natural-language n-gram entropy analyses show a gentle drop as n increases. The Voynich has a bump: it decreases suddenly at n = 2 and then stays roughly flat until n = 4-5. That means the perplexity/entropy of the next character, given the previous 2 characters (a 2-gram context), drops suddenly, making the character after a 2-character context more predictable. This is not found in natural languages.

By the way: I changed to entropy, as it is the more common metric here on the forum. Please read perplexity and entropy as interchangeable for this analysis (they are obviously not the same quantity, but they are equivalent for the purposes of this comparison).
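Concretely, the equivalence is just a change of scale: perplexity is 2 raised to the entropy in bits (or e raised to the entropy in nats), so both curves carry the same information. A quick numerical check:

[code]
import math

def perplexity_from_entropy(h, base=2):
    """Perplexity = base ** H: the effective number of equally likely next characters."""
    return base ** h

print(perplexity_from_entropy(3.85))              # about 14.4 choices per character
print(math.log2(perplexity_from_entropy(3.85)))   # back to 3.85 bits
[/code]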

Looking forward to your next comments.

(28-06-2025, 10:20 PM)oshfdk Wrote: As far as I remember, quimqu was experimenting with GPT models.

Yes, I started by training nanoGPT and then ran n-gram models (they are different models, even though I work with perplexity/entropy for both of them).

Now I am trying to see if I can combine the models or, at least, their results, and get something interesting.