(27-06-2025, 01:15 PM)Koen G Wrote: We don't know if Voynichese is a phonetic representation at all. Whatever it is, it is very much dependent on its writing system.
All the 'quirks' that we are seeing are aspects of writing (incomplete list):
- The small selection of paragraph-starting characters
- The predominance of single-leg gallows on top lines
- The predominance of Eva-m and Eva-g at line ends
- The more subtle 'rightwards and downwards' trends of Patrick Feaster.
This makes it quite unlikely that the MS text is a direct rendition of a spoken text.
(27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity
OK, and I was commenting on plots labelled "Entropy," in the interest of not multiplying jargon without necessity. But the formal relationship between the two is clear enough. Meanwhile I have had time to illustrate my comment with a graph.
Imagine a tireless Cuva Character Selector that generates symbols, independently, according to probabilities observed in a VMS transliteration. By construction, the CCS's true entropy is 3.85 bits at all orders, shown as a blue dashed line below:
[attachment=10918]
The blue dots represent calculated conditional entropy estimates if we have only 144 000 characters of output to analyze. The dramatic decrease is a pure artifact of limited sample length. Orders 1, 2, and 3 are reliable, but orders n > 3 falsely suggest correlations in the Character Selector. The computational procedure is what it is, but the high-n results should not be interpreted as "entropy" in the received sense. Maybe renaming it will help!
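A minimal sketch of this kind of check (not the exact procedure behind the plot: the alphabet, the probabilities and the estimator details below are placeholders) draws characters independently and then estimates conditional entropy from a finite sample:

Code:
import math
import random
from collections import Counter

def conditional_entropy(text, ctx_len):
    # Estimate H(next char | previous ctx_len chars) in bits, from raw counts.
    ctx_counts, pair_counts = Counter(), Counter()
    for i in range(ctx_len, len(text)):
        ctx = text[i - ctx_len:i]
        ctx_counts[ctx] += 1
        pair_counts[ctx, text[i]] += 1
    total = sum(pair_counts.values())
    return -sum(c / total * math.log2(c / ctx_counts[g])
                for (g, _), c in pair_counts.items())

# Placeholder stand-in for the Cuva unigram distribution (the real one gives 3.85 bits).
alphabet = "abcdefghijklmnopqrstuvw"
weights = [random.random() + 0.05 for _ in alphabet]
probs = [w / sum(weights) for w in weights]

# The Character Selector: symbols drawn independently, so the true conditional
# entropy is the same at every order.
sample = "".join(random.choices(alphabet, weights=weights, k=144_000))

print("true entropy at every order:",
      round(-sum(p * math.log2(p) for p in probs), 2), "bits")
for n in range(1, 9):   # n-gram order = a context of n - 1 characters
    print(f"order {n}: estimate {conditional_entropy(sample, n - 1):.2f} bits")

The estimates up to about order 3 sit near the true value; beyond that they fall off purely because most long contexts occur only once.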
The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results. They show that the VMS generator is correlated on a scale of 2-4 characters, before the supposed-entropy curve is degraded by its underlying numerical limitations. Any features appearing farther to the right are suspect. We already know that sample length shapes such curves; to make a strong case that the encoding method also has a differential effect on them, some kind of sample-length validation would help. How large is your Docta Ignorantia, and how is its length affected by the transformations applied?
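One concrete form such a validation could take (only a sketch; the file name and truncation lengths are placeholders): truncate the same corpus, before and after each transformation, to several lengths and recompute the curve. Any feature that moves with the truncation point is a finite-sample effect, not a property of the encoding.

Code:
import math
from collections import Counter

def cond_entropy(text, ctx_len):
    # H(next char | previous ctx_len chars) in bits, from raw counts.
    ctx, pair = Counter(), Counter()
    for i in range(ctx_len, len(text)):
        c = text[i - ctx_len:i]
        ctx[c] += 1
        pair[c, text[i]] += 1
    total = sum(pair.values())
    return -sum(k / total * math.log2(k / ctx[g]) for (g, _), k in pair.items())

text = open("corpus.txt", encoding="utf-8").read()   # placeholder: one corpus, one transformation
for size in (25_000, 50_000, 100_000, 200_000):
    sample = text[:size]
    curve = [round(cond_entropy(sample, n - 1), 2) for n in range(1, 7)]
    print(size, curve)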
(28-06-2025, 09:52 PM)obelus Wrote: The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results. They show that the VMS generator is correlated on a scale of 2-4 characters, before the supposed-entropy curve is degraded by its underlying numerical limitations. Any features appearing farther to the right are suspect. We already know that sample length shapes such curves; to make a strong case that the encoding method also has a differential effect on them, some kind of sample-length validation would help. How large is your Docta Ignorantia, and how is its length affected by the transformations applied?
Thank you for the explanation; I was struggling a bit to follow the discussion otherwise.
As far as I remember, quimqu was experimenting with GPT models. I think it could still be useful to try longer n-grams with a small GPT model, provided the model is too small to memorize all patterns of length n. If it turns out the model can still improve its score (reach lower perplexity) for high-order n-grams despite not having enough weights, it would mean there are rules or correlations between characters separated by up to n positions, and the model is able to learn them. Does this sound plausible?
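A rough, self-contained sketch of the kind of test I mean (this is not nanoGPT, and the corpus path, context lengths and layer sizes are placeholders): train a deliberately tiny next-character model on contexts of length n and watch the held-out bits per character. If the score keeps improving as n grows even though the model cannot possibly memorize all length-n patterns, it must be exploiting some regularity.

Code:
import torch
import torch.nn as nn

text = open("voynich_transliteration.txt", encoding="utf-8").read()   # placeholder path
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class TinyCharModel(nn.Module):
    # Deliberately under-parameterised, so it cannot memorise every n-gram.
    def __init__(self, vocab, n, emb=8, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.net = nn.Sequential(nn.Linear(n * emb, hidden), nn.ReLU(),
                                 nn.Linear(hidden, vocab))
    def forward(self, x):
        return self.net(self.emb(x).flatten(1))

for n in (2, 3, 4, 6, 8):
    X = data.unfold(0, n, 1)[:-1]        # all length-n contexts
    y = data[n:]                         # the character following each context
    split = int(0.9 * len(X))
    model = TinyCharModel(len(chars), n)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(300):                 # a few hundred full-batch steps
        opt.zero_grad()
        loss_fn(model(X[:split]), y[:split]).backward()
        opt.step()
    with torch.no_grad():
        val = loss_fn(model(X[split:]), y[split:]) / torch.log(torch.tensor(2.0))
    print(f"context {n}: held-out {val.item():.3f} bits/char")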
(28-06-2025, 09:52 PM)obelus Wrote: (27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity
How large is your Docta Ignorantia, and how is its length affected by the transformations applied?
Hi Obelus, thank you for your reply; I will try to answer. The lengths of the different books analyzed are as follows:
File: Ambrosius_Medionalensis_Cleaned.txt
Total words: 117734
Total characters: 665138
Word length counts:
Length 1: 2029
Length 2: 13199
Length 3: 17140
Length 4: 15437
Length 5: 14091
Length 6: 13050
Length 7: 12480
Length 8: 10507
Length 9: 7776
Length 10: 5424
Length 11: 3266
Length 12: 1777
Length 13: 804
Length 14: 405
Length 15: 193
Length 16: 59
Length 17: 35
Length 18: 14
Length 19: 8
Length 20: 11
Length 21: 5
Length 22: 6
Length 23: 3
Length 24: 2
Length 25: 2
Length 26: 5
Length 28: 1
Length 29: 1
Length 30: 2
Length 33: 1
Length 34: 1
File: La_reine_margot_clean.txt
Total words: 115730
Total characters: 486685
Word length counts:
Length 1: 6395
Length 2: 29893
Length 3: 17986
Length 4: 18381
Length 5: 13470
Length 6: 9752
Length 7: 7134
Length 8: 5016
Length 9: 3537
Length 10: 2284
Length 11: 954
Length 12: 431
Length 13: 318
Length 14: 101
Length 15: 46
Length 16: 27
Length 17: 5
File: Romeo_and_Juliet_clean.txt
Total words: 27716
Total characters: 110486
Word length counts:
Length 1: 2003
Length 2: 4804
Length 3: 5680
Length 4: 6542
Length 5: 3456
Length 6: 1999
Length 7: 1355
Length 8: 776
Length 9: 600
Length 10: 324
Length 11: 104
Length 12: 48
Length 13: 13
Length 14: 10
Length 15: 1
Length 16: 1
File: de_docta_ignorantia_punts.txt
Total words: 37256
Total characters: 212657
Word length counts:
Length 1: 174
Length 2: 5104
Length 3: 5500
Length 4: 4923
Length 5: 4965
Length 6: 3359
Length 7: 3491
Length 8: 2722
Length 9: 2293
Length 10: 1657
Length 11: 1207
Length 12: 854
Length 13: 443
Length 14: 344
Length 15: 122
Length 16: 38
Length 17: 23
Length 18: 16
Length 19: 4
Length 20: 13
About the ciphers: the multi-character substitution (previously wrongly named a homophonic cipher; please note that I corrected the initial post) doubles each word's length, as it adds a 0, 1 or 3 after each character. Note that this cannot be exactly the Voynich cipher, since no words of length 1 are left, while there are plenty of one-character words in the MS.
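Schematically, a cipher of that shape looks like the sketch below (the rule deciding which digit follows which character is only an illustrative assumption, not my exact mapping):

Code:
# Illustrative fixed rule: each letter always gets the same digit from {0, 1, 3} appended.
# (The actual assignment of digits to letters is an assumption for this sketch.)
def digit_for(ch):
    return "013"[ord(ch) % 3]

def encipher_word(word):
    return "".join(ch + digit_for(ch) for ch in word)

def encipher_text(text):
    return " ".join(encipher_word(w) for w in text.split())

print(encipher_text("de docta ignorantia"))
# every output word is exactly twice as long as its plaintext word,
# so no one-character words can survive the transformation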
I totally agree with you that above, let's say, 6- or 7-grams the entropy gives us no information: there are few words of those lengths or longer, so the n-gram model is obviously overfitting (and the estimate tends to 0). I agree with the note on your plot, "The dramatic decrease is a pure artifact of limited sample length." In fact, I don't care much about n-grams greater than 6 in the Voynich analysis; I care especially about the 2-, 3- and 4-gram entropy. Natural-language n-gram entropy analyses give a soft drop as n increases. The Voynich has a bump: it drops suddenly at n = 2 and then stays "flat" until n = 4-5. That means the perplexity/entropy of the next character given 2 characters (the 2-gram case) decreases suddenly, making the character that follows two characters much more predictable. This is not found in natural languages.
By the way, I changed to entropy as it is the most common metric here in the forum. Please read perplexity and entropy as "the same" (they are obviously not the same, but they are equivalent for the purposes of this analysis).
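To be precise about that equivalence: with the conditional entropy H measured in bits per character, perplexity = 2^H and H = log2(perplexity), so any drop in one is a monotone drop in the other and the two curves can be read interchangeably.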
Looking forward to your next comments.
(28-06-2025, 10:20 PM)oshfdk Wrote: As far as I remember, quimqu was experimenting with GPT models.
Yes, I started by training nanoGPT and then I ran n-gram models (they are different models, even if I work with perplexity/entropy for both of them).
Now I am trying to see if I can combine the models or, at least, their results, and get something interesting.
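One simple way to combine their results (only a sketch of plain linear interpolation, not something I have settled on; the probabilities and the mixing weight below are made-up placeholders) is to mix the two models' per-character predictions on a held-out text and measure the perplexity of the mixture:

Code:
import numpy as np

def mixed_perplexity(p_ngram, p_gpt, lam=0.5):
    # p_ngram, p_gpt: each model's probability for the character that actually
    # occurred at every position of a held-out text.
    p_mix = lam * np.asarray(p_ngram) + (1 - lam) * np.asarray(p_gpt)
    return 2 ** (-np.mean(np.log2(p_mix)))

# toy example with made-up probabilities at three positions
print(mixed_perplexity([0.2, 0.5, 0.1], [0.3, 0.4, 0.2], lam=0.5))

If the mixture scores better than either model alone, the two are capturing at least partly different regularities.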