Need advice for testing of hypotheses related to the self-citation method


RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

PS. I think I can express a bit more clearly now why I am not impressed by T&T's claims.

In their argument T&T implicitly or explicitly assume that Prob(A|not H) is practically zero; that is, they assume that a manuscript that is not a hoax cannot have the "context-dependent repetitions" that they observed -- because they did not observe them in a few other non-hoax books that they analyzed. Conversely they claim that Prob(A|H) is much higher, because the hypothetical forger may well have generated the VMS using a method, like the SCM, that accidentally created such repetitions.

And indeed, if Prob(A|H) is much greater than Prob(A|not H), then Bayes's formula gives Prob(H|A) ≈ 1, no matter what the prior Prob(H) is.

However, my Prob(A|not H) is actually quite high. If the nature of the text is what the illustrations suggest (herbal, pharmacopoeia, list of diseases, etc.), then I do expect that it will have a lot more "context-dependent repetitions" than a novel or chronicle.

And conversely my Prob(A|H) is rather low, because I cannot see how or why the forger would have used a generation method that produced a text with the observed "natural" properties of the VMS (Zipf's law, vocabulary size, word entropy, etc.) but with a word structure quite unlike that of a European language, plus those "context-dependent repetitions".

I don't see the SCM as a plausible answer to that question. The "self-citation" part is relatively easy to execute, but does not seem to be a natural choice for the hypothetical forger, and would require a non-trivial "warm-up" period to create a stable seed text that could then be used to start the VMS. But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word. And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.
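To make the "several coin tosses per word" point concrete, here is a toy sketch of a copy-and-mutate generator. It is not T&T's actual algorithm: the glyph list, the window size, the three mutation rules, and all three probabilities are assumptions picked for illustration, and the seed is only a stand-in.

```python
import random

GLYPHS = "acdehiklnoqrsty"   # hypothetical glyph inventory, for illustration only
WINDOW = 20                  # assumed: how far back the scribe looks for a word to copy

# Hypothetical, hand-picked probabilities; tuning values like these is the hard part.
P_SUBSTITUTE = 0.30
P_ADD_PREFIX = 0.10
P_DROP_LAST = 0.05


def mutate(word, rng):
    """Copy a word, applying three independent biased coin tosses."""
    glyphs = list(word)
    if rng.random() < P_SUBSTITUTE:                      # toss 1: replace one glyph
        glyphs[rng.randrange(len(glyphs))] = rng.choice(GLYPHS)
    if rng.random() < P_ADD_PREFIX:                      # toss 2: prepend a glyph
        glyphs.insert(0, rng.choice("qo"))
    if rng.random() < P_DROP_LAST and len(glyphs) > 2:   # toss 3: drop the last glyph
        glyphs.pop()
    return "".join(glyphs)


def generate(seed_words, n_words, rng):
    """Grow a text by repeatedly copying and mutating a recently written word."""
    text = list(seed_words)
    while len(text) < n_words:
        source = rng.choice(text[-WINDOW:])   # the "self-citation" step
        text.append(mutate(source, rng))
    return text


rng = random.Random(1)
print(" ".join(generate(["daiin", "chedy", "qokeedy"], 30, rng)))
```

Even in this toy version each word needs three biased tosses plus a choice of source word, and nothing in it guarantees a Zipf-like distribution; the probabilities would have to be adjusted by trial and error.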

I would expect that a forger who set out to create an "alien" book of lore would use a simpler method, without caring for statistics or consistency -- like the "method" (or lack thereof) that Edward Kelley used to create the [link] books. If that crude product could fool a mathematician like Dee, it would surely fool whoever was the intended VMS victim.

But then, if Prob(A|H) ≈ Prob(A|not H), then Bayes's formula says that Prob(H|A) ≈ Prob(H). That is, one's prior probability of the VMS being a hoax is not significantly changed by learning that it has "context-dependent repetitions".

And, in fact, if Prob(A|H) is less than Prob(A|not H), learning of observation A actually lowers one's probability that the VMS is a hoax.
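The three cases can be checked numerically with Bayes's formula for two exhaustive hypotheses, H and not H. The prior and the likelihoods below are illustrative values only; none of them come from T&T or from any measurement.

```python
def posterior(prior_h, p_a_given_h, p_a_given_not_h):
    """Bayes's formula for Prob(H|A) when the only alternatives are H and not H."""
    numerator = p_a_given_h * prior_h
    evidence = numerator + p_a_given_not_h * (1.0 - prior_h)
    return numerator / evidence


prior = 0.2  # illustrative prior Prob(H)

# Prob(A|H) much greater than Prob(A|not H): posterior close to 1.
print(posterior(prior, 0.50, 0.001))
# Prob(A|H) equal to Prob(A|not H): posterior equals the prior.
print(posterior(prior, 0.30, 0.30))
# Prob(A|H) less than Prob(A|not H): posterior falls below the prior.
print(posterior(prior, 0.05, 0.40))
```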

All the best, --jorge

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

(07-07-2025, 09:34 AM)Jorge_Stolfi Wrote: But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word. And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.

Well said, I wholly agree with this.

RE: Need advice for testing of hypotheses related to the self-citation method - oshfdk - 07-07-2025

(07-07-2025, 09:34 AM)Jorge_Stolfi Wrote: In their argument T&T implicitly or explicitly assume that Prob(A|not H) is practically zero; that is, they assume that a manuscript that is not a hoax cannot have the "context-dependent repetitions" that they observed -- because they did not observe them in a few other non-hoax books that they analyzed. Conversely they claim that Prob(A|H) is much higher, because the hypothetical forger may well have generated the VMS using a method, like the SCM, that accidentally created such repetitions.

And indeed, if Prob(A|H) is much greater than Prob(A|not H), then Bayes's formula gives Prob(H|A) ≈ 1, no matter what the prior Prob(H) is.

However, my Prob(A|not H) is actually quite high. If the nature of the text is what the illustrations suggest (herbal, pharmacopoeia, list of diseases, etc.), then I do expect that it will have a lot more "context-dependent repetitions" than a novel or chronicle.

In full support of this view, I can just repeat what I said before about my own area of interest: context-dependent repetitions are very likely when a one-to-many cipher is used, because they basically work like a cache in software. Take a simple one-to-many substitution: if encoding "brine" takes five lookups into some coding table, but you just encoded "bring" as qokchdy two lines above, you may want to skip the one-to-many step, copy over the "brin" part as it was, then add one of the codes for "e" and get qokcheey. But in this scheme the SCM part is an optimization technique and not the generation method. (All examples are made up.)
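The cache idea can be sketched as follows. The coding table, the cache size, and the rule of reusing the longest shared prefix are all hypothetical choices made for this illustration; they are not taken from any proposed Voynich cipher.

```python
import random

# Hypothetical one-to-many table: every plaintext letter has several codes.
TABLE = {
    "b": ["qo", "o"], "r": ["k", "t"], "i": ["ch", "sh"],
    "n": ["d", "e"], "g": ["y", "dy"], "e": ["ee", "ol"],
}
CACHE_SIZE = 10  # assumed: how many recently encoded words the scribe glances back at


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def encode(words, rng):
    """Return one list of codes per word, reusing recent prefixes when possible."""
    recent = []  # (plaintext word, its codes), newest last
    result = []
    for word in words:
        best = []
        for plain, codes in recent[-CACHE_SIZE:]:
            n = common_prefix_len(word, plain)
            if n > len(best):
                best = codes[:n]  # copy the already-encoded prefix
        # Only the remaining letters need a lookup in the coding table.
        fresh = [rng.choice(TABLE[c]) for c in word[len(best):]]
        codes = best + fresh
        recent.append((word, codes))
        result.append(codes)
    return result


encoded = encode(["bring", "brine"], random.Random(0))
print(["".join(codes) for codes in encoded])
```

Here "brine" costs one table lookup instead of five, and the two ciphertext words share their first four codes, which is exactly the kind of repetition that depends on the nearby context.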

RE: Need advice for testing of hypotheses related to the self-citation method - ReneZ - 07-07-2025

(07-07-2025, 10:04 AM)Mauro Wrote:
(07-07-2025, 09:34 AM)Jorge_Stolfi Wrote: But the "mutation" part of the SCM would require generating several coin tosses, with non-uniform probabilities, at each word. And these probabilities would have to be finely tuned in order to generate the proper Zipf plot and other "natural" properties.

Well said, I wholly agree with this.

Fully agreed too.
Besides the "natural" properties, there is also the need to preserve the "slot table" properties.

RE: Need advice for testing of hypotheses related to the self-citation method - ReneZ - 07-07-2025

Again back to the topic of the thread....

Some years ago I set up two texts that should be useful as comparison texts for the Voynich MS, in the frame of this analysis. Let me just explain the first of these.

I took a part of Pliny the Elder, which I have set up (again) [link].

I then did a word-for-word translation into Roman numerals, in a way that increases the occurrence of similar words near each other. Of course, a word-for-word translation does not change the number of exact repeats.
This text is found [link], with one word per line.

The method was to take the most frequent words from the input text and assign these numbers from 1 upwards. After that, whenever a new word was encountered while processing the text, it received the next number.
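A minimal sketch of this numbering scheme, as I read the description. The cut-off for "most frequent words" (`n_frequent`) is an assumed parameter, since the number actually used is not stated, and numbers above 3999 are simply written with repeated M's, which may differ from the original files.

```python
from collections import Counter

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n):
    """Convert a positive integer to a Roman numeral (M repeated for thousands)."""
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def translate(words, n_frequent):
    """Replace each word by the Roman numeral of its assigned number."""
    numbers = {}
    # Step 1: the most frequent words get the numbers 1, 2, 3, ...
    for word, _ in Counter(words).most_common(n_frequent):
        numbers[word] = len(numbers) + 1
    # Step 2: any other word gets the next free number when first encountered.
    out = []
    for word in words:
        if word not in numbers:
            numbers[word] = len(numbers) + 1
        out.append(to_roman(numbers[word]))
    return out


print(" ".join(translate("et in et terra in et mare".split(), n_frequent=2)))
```

Because words that first appear close together receive consecutive numbers, their numerals tend to look alike, which is presumably how the scheme increases the number of similar words near each other.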

The exercise was repeated with a different word pattern, result [link].

RE: Need advice for testing of hypotheses related to the self-citation method - dashstofsk - 07-07-2025

Is it possible to download the text created using Torsten's method? I seem to recall seeing some of it somewhere, and recall that there is some computer program for generating this. If it were possible to have ~24,000 words then it would be possible to get a statistical measure of its closeness to the language B text.

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

(07-07-2025, 12:24 PM)dashstofsk Wrote: Is it possible to download the text created using Torsten's method? I seem to recall seeing some of it somewhere, and recall that there is some computer program for generating this. If it were possible to have ~24,000 words then it would be possible to get a statistical measure of its closeness to the language B text.

The sample is here: [link]

About 10K words, iirc. If you can run a Java program, you'll find the full code in the GitHub repository, and you can generate any text you wish (you can also tweak the parameters in the source code).

RE: Need advice for testing of hypotheses related to the self-citation method - Mauro - 07-07-2025

Basic stats of Torsten's text:

Total characters analyzed in cleaned text: 61794
Total literal characters: 50962
Total space characters: 10832

Character set size (including space character): 21
Character set: aohcidlyernskmtqpfgx

Total number of words: 10832
Total words in dictionary: 2228 (one every 4.86176 text words)
Total hapax legomena: 1202 (one every 1.853577 dictionary words)

1st-order entropy (single chars including space): 3.793886 bit/character
1st-order entropy (single chars excluding space): 3.788256 bit/character
2nd-order entropy (bigrams): 5.816449 bit/bigram
Words entropy: 9.196313 bit/word

Same stats for Herbal B (RF1a-n transcription):

Total characters analyzed in cleaned text: 22387
Total literal characters: 18636
Total space characters: 3751

Character set size (including space character): 22
Character set: yoehdackirlstnqpmfgxj

Total number of words: 3751
Total words in dictionary: 1478 (one every 2.537889 text words)
Total hapax legomena: 1069 (one every 1.382601 dictionary words)

1st-order entropy (single chars including space): 3.871553 bit/character
1st-order entropy (single chars excluding space): 3.867437 bit/character
2nd-order entropy (bigrams): 5.979712 bit/bigram
Words entropy: 9.313931 bit/word

Same stats for Balneological B:

Total characters analyzed in cleaned text: 41654
Total literal characters: 34685
Total space characters: 6969

Character set size (including space character): 23
Character set: eoyhdlkacqsitrnpmfgjub

Total number of words: 6969
Total words in dictionary: 1561 (one every 4.464446 text words)
Total hapax legomena: 1047 (one every 1.490927 dictionary words)

1st-order entropy (single chars including space): 3.820924 bit/character
1st-order entropy (single chars excluding space): 3.806198 bit/character
2nd-order entropy (bigrams): 5.639265 bit/bigram
Words entropy: 8.558229 bit/word
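For anyone who wants to compute comparable figures on another text, here is a rough sketch of how such statistics could be obtained. It is not the program that produced the numbers above: the cleaning step (plain whitespace splitting), the use of overlapping bigrams, and the way spaces are counted (one between each pair of words) are assumptions, so small differences from the figures above are to be expected.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a table of counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def basic_stats(text):
    """Word counts and entropies for a text whose words are separated by whitespace."""
    words = text.split()
    cleaned = " ".join(words)          # single spaces between words (assumed cleaning)
    word_counts = Counter(words)
    bigrams = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    return {
        "words": len(words),
        "dictionary": len(word_counts),
        "hapax": sum(1 for c in word_counts.values() if c == 1),
        "char_entropy_with_space": entropy(Counter(cleaned)),
        "char_entropy_no_space": entropy(Counter(cleaned.replace(" ", ""))),
        "bigram_entropy": entropy(bigrams),
        "word_entropy": entropy(word_counts),
    }


for name, value in basic_stats("daiin chedy daiin qokeedy chedy daiin").items():
    print(name, value)
```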

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

(07-07-2025, 12:05 PM)ReneZ Wrote: Some years ago I set up two texts that should be useful as comparison texts for the Voynich MS, in the frame of this analysis.

In case you are interested, I added some artificially generated entries to the [link] that I posted previously:

Vietnamese, [link]
- [link] Random Vietnamese by a variant of Gordon Rugg's table and template method.
- [link] Random Vietnamese by a character Markov chain of order 3.
Voynichese, EVA encoding
- [link] Random Voynichese generated manually by Gordon Rugg.
- [link] Random Voynichese generated automatically by Gordon Rugg.
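For readers unfamiliar with the term, a character Markov chain of order 3 picks each new character at random from those that followed the same three preceding characters in a training text. The sketch below shows the general technique only; it is not the program used for the samples above, and the training string is a placeholder.

```python
import random
from collections import defaultdict


def train(text, order=3):
    """Map every `order`-character context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model


def generate(model, start, length, rng, order=3):
    """Extend `start` one character at a time by sampling from the model."""
    out = start
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # context never seen in training: stop early
            break
        out += rng.choice(followers)  # repeats in the list act as frequency weights
    return out


sample = "the seed of this wormwood is the weakest of the wormwoods "
model = train(sample)
print(generate(model, sample[:3], 60, random.Random(0)))
```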

All the best, --jorge

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 07-07-2025

By the way, I call your attention to the [link] among that list of language samples I posted. It is the only English sample that is not a "narrative" of some sort, but more like a catalog or recipe book -- because it is mostly a herbal and a collection of generic recipes for things like pills, teas, etc. Thus it is expected to be more repetitious and formulaic than the others.

However it is still not a good comparison to the MS, because its entries are still too verbose, with a lot of commentary in narrative style. Indeed I suspect that its popularity was due in good part to that "chattiness" component. Samples:

The Seed of this Wormwood is that which usually Women give their Children for the Worms: Of all Wormwoods that grow here, this is the weakest; I but Doctors commend it, and Apothecaries sell it, the one must keep his Credit, and the other get Money, and that's the key of the work. The Herb is good for somthing, because God made nothing in vain; Will you give me leave to weigh things in the Ballance of Reason; Then thus, The Seeds of the common Wormwood are far more prevalent than the Seed of this, to expell Worms in Children, or People of ripe age: Of both, some are weak, some are strong. The Seriphian Wormseed is the weakest, & happily may prove to be fittest for weakest Bodies (for it is weak enough in all conscience) Let such as are strong take the common Wormseed, for the other will do but little good. Again, neer the Sea many people live, and Seriphium grows neer them, and therfore is more fitting for their Bodies because nourished by the same Air; and this I had from Dr° Reason. In whose Body Dr° Reason dwels not, dwels Dr° Madness, and he brings in his Brethren, Dr° Ignorance, Dr° Folly, and Dr° Sickness, and these together make way for Dr° Death, and the latter end of that man is worse than the beginning. Pride was the cause of Adam's Fall, Pride begate a Daughter, I do not know the Father of it unless the Divil, but she christned it, and call'd it Appetite, and sent her Daughter to tast these Wormwoods, who finding this the least bitter, made the sqeamish Wench extol it to the Skies, though the Vertues of it never reached to the middle Region of the Air.
Its due praise is this; It is weakest, therefore fitter for weak Bodies, and fitter for those Bodies that dwel neer it than those that live far from it: my reason, is The Sea (as those that live far from it know when they comt neer it) casteth not such a smel as the Land doth: The tender Mercies of God being over all his Works, hath by his eternal Providence planted Seriphium by the Sea side, as a fit Medicine for the Bodies of those that live neer it. Lastly, It is known to all that know any thing in the Course of Nature, That the Liver delights in sweet things; if so, it abhors bitter, then if your Liver be weak, it is none of the wisest courses to plague it with an Enemy: if the Liver be weak a Consumption follows; Would you know the Reason? 'tis this, A mans Flesh is repaired by Blood, by a third concoction which transmutes Blood into Flesh ('tis well I said «Conction» for if I had said «Boyling» every Cook would have understood me.) The Liver makes Blood, and if it be weakned that it makes not enough the Flesh wasteth, and why must Flesh alwaies be renewed? Because the eternal God when he made the Creation, made one part of it in continual dependency upon another: And why did he so? Because Himself is only Permanent, to reach us, That we should not fix our affections upon what is transitory, but upon what endures for ever. The result of all is this, If the Liver be weak and cannot make Blood enonough (I would have said «Sanguifie» if I had written only to Schollers) The Seriphian which is the weakest of Wormwoods is better than the best. I have been Critical enough, if not too much.
[...]

Having gathered your Herb you would preserve the Juyce of, when it is very dry (for otherwise your Juyce will not be worth a Button) bruise it very wel in a stone Mortar with a wooden Pestle, then having put it into a Canvas Bag (the Herb I mean, not the Mortar for that will yield but little Juyce) press it hard in a press, then take the Juyce and clarifie it.

All the best, --jorge