(28-06-2025, 09:52 PM)obelus Wrote: (27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity
How large is your Docta Ignorantia? and how is its length affected by the transformations applied?
Hi Obelus, thank you for your reply. I'll try to answer you. The lengths of the different books analyzed are shown below (a minimal counting sketch follows the listings):
File: Ambrosius_Medionalensis_Cleaned.txt
Total words: 117734
Total characters: 665138
Word length counts:
Length 1: 2029
Length 2: 13199
Length 3: 17140
Length 4: 15437
Length 5: 14091
Length 6: 13050
Length 7: 12480
Length 8: 10507
Length 9: 7776
Length 10: 5424
Length 11: 3266
Length 12: 1777
Length 13: 804
Length 14: 405
Length 15: 193
Length 16: 59
Length 17: 35
Length 18: 14
Length 19: 8
Length 20: 11
Length 21: 5
Length 22: 6
Length 23: 3
Length 24: 2
Length 25: 2
Length 26: 5
Length 28: 1
Length 29: 1
Length 30: 2
Length 33: 1
Length 34: 1
File: La_reine_margot_clean.txt
Total words: 115730
Total characters: 486685
Word length counts:
Length 1: 6395
Length 2: 29893
Length 3: 17986
Length 4: 18381
Length 5: 13470
Length 6: 9752
Length 7: 7134
Length 8: 5016
Length 9: 3537
Length 10: 2284
Length 11: 954
Length 12: 431
Length 13: 318
Length 14: 101
Length 15: 46
Length 16: 27
Length 17: 5
File: Romeo_and_Juliet_clean.txt
Total words: 27716
Total characters: 110486
Word length counts:
Length 1: 2003
Length 2: 4804
Length 3: 5680
Length 4: 6542
Length 5: 3456
Length 6: 1999
Length 7: 1355
Length 8: 776
Length 9: 600
Length 10: 324
Length 11: 104
Length 12: 48
Length 13: 13
Length 14: 10
Length 15: 1
Length 16: 1
File: de_docta_ignorantia_punts.txt
Total words: 37256
Total characters: 212657
Word length counts:
Length 1: 174
Length 2: 5104
Length 3: 5500
Length 4: 4923
Length 5: 4965
Length 6: 3359
Length 7: 3491
Length 8: 2722
Length 9: 2293
Length 10: 1657
Length 11: 1207
Length 12: 854
Length 13: 443
Length 14: 344
Length 15: 122
Length 16: 38
Length 17: 23
Length 18: 16
Length 19: 4
Length 20: 13
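Roughly, counts of this kind can be reproduced with something like the following (whitespace tokenization, and "Total characters" taken as the sum of word lengths; the actual cleaning I used may differ slightly in those details):
Code:
from collections import Counter

def word_length_counts(path):
    # read a cleaned plain-text file and tally word lengths
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    counts = Counter(len(w) for w in words)
    print(f"File: {path}")
    print(f"Total words: {len(words)}")
    print(f"Total characters: {sum(n * c for n, c in counts.items())}")
    print("Word length counts:")
    for n in sorted(counts):
        print(f"Length {n}: {counts[n]}")

word_length_counts("de_docta_ignorantia_punts.txt")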
About the ciphers: the multi-character substitution (previously, and wrongly, named a homophonic cipher; please note that I corrected the initial post) doubles each word's length, since it appends a 0, 1 or 3 after each character. Note that this cannot be exactly the Voynich cipher, because no words of length 1 survive, while the MS contains plenty of length-1 words.
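To make the length effect concrete, here is a toy version of that substitution (the digit is chosen at random here purely for illustration; the actual scheme may pick it differently). Every character is followed by one digit, so a word of length n becomes length 2n, and in particular no length-1 words can survive:
Code:
import random

def multi_char_substitute(word, rng=random):
    # append one digit (0, 1 or 3) after every plaintext character
    return "".join(c + rng.choice("013") for c in word)

plain = "ignorantia"
cipher = multi_char_substitute(plain)
print(plain, "->", cipher, f"(length {len(plain)} -> {len(cipher)})")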
I totally agree with you that above, say, 6-7-grams the entropy gives us no information: there are few words of those lengths and above, so the n-gram model is obviously overfitting (and tending to 0). I also agree with the comment on your data visualization: "The dramatic decrease is a pure artifact of limited sample length." In fact, I don't care much about n-grams greater than 6 in the Voynich analysis; I care especially about the 2-, 3- and 4-gram entropy. For natural languages, n-gram entropy analyses give a soft drop as n increases. The Voynich has a bump: it decreases suddenly at n=2 and then stays "flat" until n=4-5. That means the perplexity/entropy of the next character, given 2 characters (2-gram), drops suddenly, making the next character after 2 characters much more predictable. This is not found in natural languages.
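For clarity, this is the kind of quantity I mean: the entropy of the next character given the previous n-1 characters, estimated from raw counts over the character stream. The sketch below keeps word breaks as single spaces and applies no smoothing; my actual runs may differ on those details:
Code:
import math
from collections import Counter

def ngram_conditional_entropy(text, n):
    # H(next char | previous n-1 chars), estimated from raw counts
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    h = 0.0
    for g, c in grams.items():
        p_joint = c / total              # P(context, next char)
        p_cond = c / contexts[g[:-1]]    # P(next char | context)
        h -= p_joint * math.log2(p_cond)
    return h                             # bits per character

with open("de_docta_ignorantia_punts.txt", encoding="utf-8") as f:
    text = " ".join(f.read().split())    # normalize word breaks to single spaces
for n in range(2, 7):
    print(f"{n}-gram: {ngram_conditional_entropy(text, n):.3f} bits/char")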
By the way: I switched to entropy because it is the most common metric here on the forum. Please read "perplexity" and "entropy" as interchangeable for this analysis (they are obviously not the same quantity, but they are equivalent for the comparisons made here).
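Concretely, with entropy H measured in bits per character, perplexity is just 2^H, so the two metrics rank the texts the same way (the values below are examples only, not taken from the data above):
Code:
# entropy H in bits per character  ->  perplexity = 2 ** H
for h in (2.0, 2.5, 3.0):
    print(h, "bits/char  ->  perplexity", 2 ** h)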
Looking forward to your next comments.
(28-06-2025, 10:20 PM)oshfdk Wrote: As far as I remember, quimqu was experimenting with GPT models.
Yes, I started by training nanoGPT and then ran n-gram models (they are different models, even though I work with perplexity/entropy for both of them).
Now I am trying to see if I can combine the models or, at least, their results, and get something interesting.