How multi-character substitution might explain the Voynich's strange entropy - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: How multi-character substitution might explain the Voynich's strange entropy (/thread-4769.html)
RE: How multi-character substitution might explain the Voynich's strange entropy - ReneZ - 28-06-2025

(27-06-2025, 01:15 PM)Koen G Wrote: We don't know if Voynichese is a phonetic representation at all.

Whatever it is, it is very much dependent on its writing system. All the 'quirks' that we are seeing are aspects of writing (incomplete list):

- The small selection of paragraph-starting characters
- The predominance of single-leg gallows on top lines
- The predominance of Eva-m and Eva-g at line ends
- The more subtle 'rightwards and downwards' trends of Patrick Feaster

This makes it quite unlikely that the MS text is a direct rendition of a spoken text.


RE: How multi-character substitution might explain the Voynich's strange entropy - obelus - 28-06-2025

(27-06-2025, 10:16 PM)quimqu Wrote: I was actually working with perplexity

OK, and I was commenting on plots labelled "Entropy," in the interest of not multiplying jargon without necessity. But the formal relationship between the two is clear enough.

Meanwhile I have had time to illustrate my comment with a graph. Imagine a tireless Cuva Character Selector that generates symbols, independently, according to probabilities observed in a VMS transliteration. By construction, the CCS's true entropy is 3.85 bits at all orders, shown as a blue dashed line below:

[graph: conditional entropy estimates vs. n-gram order]

The blue dots represent calculated conditional entropy estimates if we have only 144,000 characters of output to analyze. The dramatic decrease is a pure artifact of limited sample length. Orders 1, 2, and 3 are reliable; n > 3 falsely suggests correlations in the Character Selector. The computational procedure is what it is, but the high-n results should not be interpreted as "entropy" in the received sense. Maybe renaming it will help!

The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results. They show that the VMS generator is correlated on a scale of 2-4 characters, before the supposed-entropy curve is degraded by its underlying numerical limitations. Any features appearing farther to the right are suspect.

We already know that sample length shapes such curves; to make a strong case that encoding method also has a differential effect on them, some kind of sample-length validation would help. How large is your Docta Ignorantia, and how is its length affected by the transformations applied?
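A minimal sketch of the experiment described above (not obelus's actual code): characters are drawn independently from a fixed frequency table, so the true conditional entropy is the same at every order, and order-n estimates are then computed from a 144,000-character sample. The alphabet and weights below are placeholders; frequencies measured from a real VMS transliteration would be needed to reproduce the 3.85-bit figure.

```python
import math
import random
from collections import Counter

def conditional_entropy(text, n):
    """Estimate H(next char | previous n-1 chars) in bits from n-gram counts."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    return -sum((c / total) * math.log2(c / contexts[g[:-1]])
                for g, c in ngrams.items())

# Placeholder alphabet and weights -- substitute character frequencies
# measured from an actual VMS transliteration to match the 3.85-bit case.
random.seed(0)
alphabet = "abcdefghijklmnopqrstuvwx"
weights = [random.random() for _ in alphabet]

# "Character Selector": i.i.d. draws, so the true conditional entropy is
# identical at every order; any decrease in the estimates below is purely
# a finite-sample artifact of analysing only 144,000 characters.
sample = "".join(random.choices(alphabet, weights=weights, k=144_000))

for n in range(1, 9):
    print(n, round(conditional_entropy(sample, n), 3))
# Orders 1-3 stay close to the order-1 value; higher orders drift downward
# because most long contexts are seen only once or twice.
```

Replacing the i.i.d. sample with an actual transliteration should give the grey-star type of curve, where genuine short-range correlations appear before the same finite-sample roll-off takes over.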
RE: How multi-character substitution might explain the Voynich's strange entropy - oshfdk - 28-06-2025

(28-06-2025, 09:52 PM)obelus Wrote: The unscrambled Cuva sample (continuous text) is shown as grey stars, in satisfactory agreement with your results. [...]

Thank you for the explanation; I was struggling a bit to follow the discussion otherwise. As far as I remember, quimqu was experimenting with GPT models.

I think it could still be useful to try longer n-grams with a small GPT model, if the size of the model won't let it memorize all patterns of length n. If it turns out that a model can improve its score (get lower perplexity) for high-order n-grams despite not having enough weights, it would mean there are some rules or correlations between characters separated by n positions, and that the model is able to learn these rules. Does this sound plausible?


RE: How multi-character substitution might explain the Voynich's strange entropy - quimqu - 29-06-2025

(28-06-2025, 09:52 PM)obelus Wrote: How large is your Docta Ignorantia, and how is its length affected by the transformations applied?

Hi obelus, thank you for your reply; I will try to answer you. The lengths of the different books analyzed are as follows:

File: Ambrosius_Medionalensis_Cleaned.txt
Total words: 117734
Total characters: 665138
Word length counts (length: count): 1: 2029, 2: 13199, 3: 17140, 4: 15437, 5: 14091, 6: 13050, 7: 12480, 8: 10507, 9: 7776, 10: 5424, 11: 3266, 12: 1777, 13: 804, 14: 405, 15: 193, 16: 59, 17: 35, 18: 14, 19: 8, 20: 11, 21: 5, 22: 6, 23: 3, 24: 2, 25: 2, 26: 5, 28: 1, 29: 1, 30: 2, 33: 1, 34: 1

File: La_reine_margot_clean.txt
Total words: 115730
Total characters: 486685
Word length counts (length: count): 1: 6395, 2: 29893, 3: 17986, 4: 18381, 5: 13470, 6: 9752, 7: 7134, 8: 5016, 9: 3537, 10: 2284, 11: 954, 12: 431, 13: 318, 14: 101, 15: 46, 16: 27, 17: 5

File: Romeo_and_Juliet_clean.txt
Total words: 27716
Total characters: 110486
Word length counts (length: count): 1: 2003, 2: 4804, 3: 5680, 4: 6542, 5: 3456, 6: 1999, 7: 1355, 8: 776, 9: 600, 10: 324, 11: 104, 12: 48, 13: 13, 14: 10, 15: 1, 16: 1

File: de_docta_ignorantia_punts.txt
Total words: 37256
Total characters: 212657
Word length counts (length: count): 1: 174, 2: 5104, 3: 5500, 4: 4923, 5: 4965, 6: 3359, 7: 3491, 8: 2722, 9: 2293, 10: 1657, 11: 1207, 12: 854, 13: 443, 14: 344, 15: 122, 16: 38, 17: 23, 18: 16, 19: 4, 20: 13

About the ciphers: the multi-character substitution (previously wrongly named a homophonic cipher; please note that I have corrected the initial post) doubles each word's length, as it adds 0, 1 or 3 after every character. Note that this cannot be 100% the Voynich cipher, since no words of length 1 remain, while there are plenty of 1-character words in the MS.
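A minimal sketch of the kind of transformation quimqu describes, assuming each plaintext letter is replaced by a fixed two-character group (the letter followed by one of '0', '1', '3'); the assignment rule below (alphabet position mod 3) is only an illustrative guess, not quimqu's actual table.

```python
import string

# Illustrative multi-character substitution: each letter maps to itself
# plus one of the symbols '0', '1', '3', so every word doubles in length.
# The mod-3 assignment is a guess for demonstration purposes only.
SUFFIXES = "013"
TABLE = {c: c + SUFFIXES[i % 3] for i, c in enumerate(string.ascii_lowercase)}

def encipher_word(word: str) -> str:
    return "".join(TABLE.get(c, c) for c in word.lower())

def encipher_text(text: str) -> str:
    return " ".join(encipher_word(w) for w in text.split())

print(encipher_text("de docta ignorantia"))
# -> d0e1 d0o3c3t1a0 i3g0n1o3r3a0n1t1i3a0
# Every word doubles in length: a 1-letter plaintext word becomes 2 characters,
# which is why no length-1 words survive the transformation.
```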
I totally agree with you that above, let's say, 6-7-grams, the entropy gives us no information: there are few words of those lengths or longer, and the n-gram model is obviously overfitting (so the estimate tends to 0). I also agree with the comment on your data visualization that "the dramatic decrease is a pure artifact of limited sample length." In fact, I don't care much about n-grams greater than 6 in the Voynich analysis; I care especially about the 2-, 3- and 4-gram entropies. Natural-language n-gram entropy analyses show a soft drop as n increases. The Voynich has a bump: the entropy decreases suddenly at n = 2 and then stays "flat" until n = 4-5. That means the perplexity/entropy of the next character, given 2 characters (2-gram), drops suddenly, making the character after those 2 characters more predictable. This is not found in natural languages.

By the way, I changed to entropy as it is the most common metric here in the forum. Please read perplexity and entropy as "the same" (they are obviously not the same, but they are equivalent in terms of this analysis). Looking forward to your next comments.

(28-06-2025, 10:20 PM)oshfdk Wrote: As far as I remember, quimqu was experimenting with GPT models.

Yes, I started by training nanoGPT and then I ran n-gram models (they are different models, even though I work with perplexity/entropy for both of them). Now I am trying to see whether I can combine the models or, at least, their results, and get something interesting.
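Since the thread switches between perplexity (from the GPT experiments) and entropy (from the n-gram counts), a small conversion sketch may help readers line the two up. It is not part of quimqu's code and only assumes the language-model loss is the usual average cross-entropy in nats per character.

```python
import math

def bits_from_nats(loss_nats: float) -> float:
    """Average cross-entropy in nats per character -> bits per character."""
    return loss_nats / math.log(2)

def perplexity_from_bits(h_bits: float) -> float:
    """Per-character perplexity corresponding to an entropy of h_bits bits."""
    return 2.0 ** h_bits

# Example with a made-up loss value of 2.10 nats/char:
loss = 2.10
h = bits_from_nats(loss)              # about 3.03 bits per character
print(h, perplexity_from_bits(h))     # perplexity == exp(loss), about 8.17
```

With that conversion, a drop of one bit in the conditional entropy at n = 2 corresponds to halving the per-character perplexity, which is one way to quantify the "bump".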