Need advice for testing of hypotheses related to the self-citation method - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Need advice for testing of hypotheses related to the self-citation method (/thread-4765.html)

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

PS. I think I can express a bit more clearly now why I am not impressed by T&T's claims.

In their argument T&T implicitly or explicitly assume that Prob(A|not H) is practically zero; that is, they assume that a manuscript that is not a hoax cannot have the "context-dependent repetitions" that they observed, because they did not observe them in a few other non-hoax books that they analyzed. Conversely they claim that Prob(A|H) is much higher, because the hypothetical forger may well have generated the VMS using a method, like the SCM, that accidentally created such repetitions. And indeed, if Prob(A|H) is much greater than Prob(A|not H), then Bayes's formula gives Prob(H|A) ≈ 1, no matter what the prior Prob(H) is.

However, my Prob(A|not H) is actually quite high. If the nature of the text is what the illustrations suggest (herbal, pharmacopoeia, list of diseases, etc.), then I do expect that it will have a lot more "context-dependent repetitions" than a novel or chronicle.

And conversely my Prob(A|H) is rather low, because I cannot see how or why the forger would have used a generation method that produced a text with the observed "natural" properties of the VMS (Zipf's law, vocabulary size, word entropy, etc.) but with a word structure quite unlike that of a European language, plus those "context-dependent repetitions".

I don't see the SCM as a plausible answer to that question. The "self-citation" part is relatively easy to execute, but does not seem to be a natural choice for the hypothetical forger, and would require a non-trivial "warm-up" period to create a stable seed text that could then be used to start the VMS. But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word.
And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.

I would expect that a forger who set out to create an "alien" book of lore would use a simpler method, without caring for statistics or consistency, like the "method" (or lack thereof) that Edward Kelley used to create the [link hidden] books. If that crude product could fool a mathematician like Dee, it would surely fool whoever was the intended VMS victim.

But then, if Prob(A|H) ≈ Prob(A|not H), then Bayes's formula says that Prob(H|A) ≈ Prob(H). That is, one's prior probability of the VMS being a hoax is not significantly changed by learning that it has "context-dependent repetitions". And, in fact, if Prob(A|H) is less than Prob(A|not H), learning of observation A actually lowers one's probability that the VMS is a hoax.

All the best, --jorge

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

(07-07-2025, 09:34 AM) Jorge_Stolfi Wrote: But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word. And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.

Well said, I wholly agree with this.

RE: Need advice for testing of hypotheses related to the self-citation method - oshfdk - 07-07-2025

(07-07-2025, 09:34 AM) Jorge_Stolfi Wrote: In their argument T&T implicitly or explicitly assume that Prob(A|not H) is practically zero; that is, they assume that a manuscript that is not a hoax cannot have the "context-dependent repetitions" that they observed, because they did not observe them in a few other non-hoax books that they analyzed.
Conversely they claim that Prob(A|H) is much higher, because the hypothetical forger may well have generated the VMS using a method, like the SCM, that accidentally created such repetitions.

In full support of this view, I can just repeat what I said before specifically for my area of interest: context-dependent repetitions are very likely when using a one-to-many cipher; they basically work like a cache in software. Suppose that, in a simple one-to-many substitution, encoding "brine" takes five lookups into some coding table, but you just encoded "bring" as qokchdy two lines above. You may then want to skip the one-to-many step, copy over the "brin" part as it was, and add one of the codes for "e" to get qokcheey. But in this scheme the SCM part is an optimization technique and not the generation method. (All examples are made up.)

RE: Need advice for testing of hypotheses related to the self-citation method - ReneZ - 07-07-2025

(07-07-2025, 10:04 AM) Mauro Wrote: (07-07-2025, 09:34 AM) Jorge_Stolfi Wrote: But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word. And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.

Fully agreed too. Besides the "natural" properties there is the need to preserve the "slot table" properties.

RE: Need advice for testing of hypotheses related to the self-citation method - ReneZ - 07-07-2025

Again back to the topic of the thread...

Some years ago I set up two texts that should be useful as comparison texts for the Voynich MS, in the frame of this analysis. Let me just explain the first of these. I took a part of Pliny the Elder, which I have set up (again) [link hidden].
I then did a word-for-word translation into Roman numerals, in a way to increase the appearance of similar words near each other. Of course, a word-for-word translation does not change the number of exact repeats. This text is found [link hidden], with one word per line.

The method was to take the most frequent words from the input text and assign these numbers from 1 upwards. After that, whenever a new word was encountered while processing the text, it received the next number.

The exercise was repeated with a different word pattern, with the result [link hidden].

RE: Need advice for testing of hypotheses related to the self-citation method - dashstofsk - 07-07-2025

Is it possible to download the text created using Torsten's method? I seem to recall seeing some of it somewhere, and recall that there is some computer program for generating this. If it were possible to have ~24,000 words then it would be possible to get a statistical measure of its closeness to the language B text.

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

(07-07-2025, 12:24 PM) dashstofsk Wrote: Is it possible to download the text created using Torsten's method? I seem to recall seeing some of it somewhere, and recall that there is some computer program for generating this. If it were possible to have ~24,000 words then it would be possible to get a statistical measure of its closeness to the language B text.

The sample is here: [link hidden]. About 10K words iirc. If you can run a Java program you'll find the full code in the GitHub repository and you can create any text you wish (you can also tweak the parameters in the source code).
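
For readers who want to build a similar comparison text, the numbering scheme ReneZ describes earlier in the thread (most frequent words numbered from 1 upwards, every other word numbered at its first occurrence, each number written as a Roman numeral) might be sketched as follows. This is only an illustration under assumptions: the names `to_roman` and `encode_as_numerals` and the cut-off parameter `n_frequent` are hypothetical, and the thread does not say how many "most frequent" words were pre-assigned or how the numerals were spelled.

```python
from collections import Counter


def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral in subtractive notation.

    Values above 3999 are written with repeated 'M' (an assumption; the
    thread does not say how large numbers were handled).
    """
    if n <= 0:
        raise ValueError("only positive integers can be converted")
    symbols = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in symbols:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def encode_as_numerals(words, n_frequent=10):
    """Replace each word by the Roman numeral of its assigned number.

    n_frequent is a hypothetical cut-off: that many of the most frequent
    words receive the numbers 1, 2, ... in order of decreasing frequency.
    """
    numbering = {}
    # Step 1: pre-assign the lowest numbers to the most frequent words.
    for word, _count in Counter(words).most_common(n_frequent):
        numbering[word] = len(numbering) + 1
    # Step 2: every other word gets the next free number when first seen.
    encoded = []
    for word in words:
        if word not in numbering:
            numbering[word] = len(numbering) + 1
        encoded.append(to_roman(numbering[word]))
    return encoded


if __name__ == "__main__":
    sample = "herba est in aqua et herba est in vino".split()
    print(" ".join(encode_as_numerals(sample, n_frequent=3)))
```

Because the mapping is one-to-one, exact word repeats are preserved, as ReneZ notes. Words that first appear close together receive consecutive numbers and therefore similar-looking numerals, which is presumably what increases the appearance of similar words near each other.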

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

Basic stats of Torsten's text:

Total characters analyzed in cleaned text: 61794
Total literal characters: 50962
Total space characters: 10832
Character set size (including space character): 21
Character set: aohcidlyernskmtqpfgx
Total number of words: 10832
Total words in dictionary: 2228 (one every 4.86176 text words)
Total hapax legomena: 1202 (one every 1.853577 dictionary words)
1st-order entropy (single chars including space): 3.793886 bit/character
1st-order entropy (single chars excluding space): 3.788256 bit/character
2nd-order entropy (bigrams): 5.816449 bit/bigram
Words entropy: 9.196313 bit/word

Same stats for Herbal B (RF1a-n transcription):

Total characters analyzed in cleaned text: 22387
Total literal characters: 18636
Total space characters: 3751
Character set size (including space character): 22
Character set: yoehdackirlstnqpmfgxj
Total number of words: 3751
Total words in dictionary: 1478 (one every 2.537889 text words)
Total hapax legomena: 1069 (one every 1.382601 dictionary words)
1st-order entropy (single chars including space): 3.871553 bit/character
1st-order entropy (single chars excluding space): 3.867437 bit/character
2nd-order entropy (bigrams): 5.979712 bit/bigram
Words entropy: 9.313931 bit/word

Same stats for Balneological B:

Total characters analyzed in cleaned text: 41654
Total literal characters: 34685
Total space characters: 6969
Character set size (including space character): 23
Character set: eoyhdlkacqsitrnpmfgjub
Total number of words: 6969
Total words in dictionary: 1561 (one every 4.464446 text words)
Total hapax legomena: 1047 (one every 1.490927 dictionary words)
1st-order entropy (single chars including space): 3.820924 bit/character
1st-order entropy (single chars excluding space): 3.806198 bit/character
2nd-order entropy (bigrams): 5.639265 bit/bigram
Words entropy: 8.558229 bit/word

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

(07-07-2025, 12:05 PM) ReneZ Wrote: Some years ago I set up two texts that should be useful as comparison texts for the Voynich MS, in the frame of this analysis.

In case you are interested, I added to the [link hidden] that I posted previously some artificially generated entries:

All the best, --jorge

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

By the way, I call your attention to the [link hidden] among that list of language samples I posted. It is the only English sample that is not a "narrative" of some sort, but more like a catalog or recipe book, because it is mostly a herbal and a collection of generic recipes for things like pills, teas, etc. Thus it is expected to be more repetitious and formulaic than the others.

However it is still not a good comparison to the MS, because its entries are still too verbose, with a lot of commentary in narrative style. Indeed I suspect that its popularity was due in good part to that "chattiness" component.

Samples: