Need advice for testing of hypotheses related to the self-citation method

RE: Need advice for testing of hypotheses related to the self-citation method - nablator - 05-07-2025

(04-07-2025, 06:32 PM)Jorge_Stolfi Wrote: So even if there exists an input bit sequence (coin toss outcomes) that causes SCM to output an exact copy of the VMS (which I doubt, but let's assume it does), in order to show that the VMS "has no meaning" one would have to show that this magical bit sequence is not a meaningful message.

There are two "little" problems with this that make it impossible:
1) Thinking in terms of bits and transducers would be anachronistic at a time when cryptography was at the mono-alphabetic substitution (with homophones and nulls) stage,
2) The bit sequence would need to be reconstructed from the VMS in order to recover the message. With multiple possible sources for each target word of the SCM it can't be done. You can't unscramble eggs.

RE: Need advice for testing of hypotheses related to the self-citation method - oshfdk - 05-07-2025

(05-07-2025, 08:06 AM)nablator Wrote:
(04-07-2025, 06:32 PM)Jorge_Stolfi Wrote: So even if there exists an input bit sequence (coin toss outcomes) that causes SCM to output an exact copy of the VMS (which I doubt, but let's assume it does), in order to show that the VMS "has no meaning" one would have to show that this magical bit sequence is not a meaningful message.

There are two "little" problems with this that make it impossible:
1) Thinking in terms of bits and transducers would be anachronistic at a time when cryptography was at the mono-alphabetic substitution (with homophones and nulls) stage,
2) The bit sequence would need to be reconstructed from the VMS in order to recover the message. With multiple possible sources for each target word of the SCM it can't be done. You can't unscramble eggs.

We can replace bits with coin tosses and transducers with whatever concept similar to transducers existed in the XV century. I don't know what transducers are anyway.

As for the unscrambability of the cipher, the below is a very simple (although quite verbose) self-citation cipher based on Voynich script. I think it is readable. Usual disclaimer: I don't think this is how VMS is encoded.

[Attached image: verbose.jpg]

This scheme is a pain to encode/decode, and I guess it's much more complicated than the actual cipher of the Voynich manuscript, but it shows that in principle it doesn't take advanced tech to encode text via self-citation.

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 05-07-2025

(05-07-2025, 08:06 AM)nablator Wrote: There are two "little" problems with this that make it impossible:
1) Thinking in terms of bits and transducers would be anachronistic at a time when cryptography was at the mono-alphabetic substitution (with homophones and nulls) stage,

People back then were as smart as we are today. (Maybe smarter, since they had no TV or Facebook...)

"Bits" and "transducers" is only our modern way of describing the process. Just as we would say that the tables of Soyga are traces of a cellular automaton. But people have always understood those concepts in an intuitive way, without using those words.

While complex encryption schemes were not standard common knowledge in the 1400s, as they would be in the 1500s, individual mathematicians would have been perfectly capable of devising encryption methods as complex as the "self-citation encoding" would be. From Wikipedia, for example:

"In his book Flos, Leonardo de Pisa, also known as Fibonacci (1170–1250), was able to closely approximate the positive solution to the cubic equation x^3 + 2x^2 + 10x = 20. Writing in Babylonian numerals he gave the result as 1,22,7,42,33,4,40 (equivalent to 1 + 22/60 + 7/60^2 + 42/60^3 + 33/60^4 + 4/60^5 + 40/60^6), which has a relative error of about 10^−9."

The SCM as an encryption method is not much more complex than writing a sequence of numbers where each number is the location of a word in the Bible.
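As a quick check of the arithmetic in the Wikipedia passage quoted above, the sexagesimal digits can be summed exactly and substituted back into the cubic. This is only an illustrative sketch; the function names are made up for this example.

```python
from fractions import Fraction

# Sexagesimal digits quoted above: 1;22,7,42,33,4,40
DIGITS = [1, 22, 7, 42, 33, 4, 40]

def sexagesimal_to_fraction(digits):
    """Sum digits[k] / 60**k exactly, using rational arithmetic."""
    return sum(Fraction(d, 60**k) for k, d in enumerate(digits))

def cubic_residual(x):
    """Left side minus right side of x^3 + 2x^2 + 10x = 20."""
    return x**3 + 2 * x**2 + 10 * x - 20

x = sexagesimal_to_fraction(DIGITS)
print(float(x))                  # the approximate root as a decimal
print(float(cubic_residual(x)))  # how far the cubic is from being satisfied
```

A residual very close to zero means the quoted value really is a good approximation of the root.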

Quote: 2) The bit sequence would need to be reconstructed from the VMS in order to recover the message. With multiple possible sources for each target word of the SCM it can't be done.

Indeed the SCM as described is a many-to-one encoding of a bit sequence to a string of words. But what Torsten and Timm observed is only that the VMS often repeats sequences of words that have occurred before, with variations. They then devised a method (the SCM) that generates random text with that same feature, while imitating a few other statistics such as the Zipf plot. But it does not follow that the VMS was created using the SCM!

Back in my day, Gordon Rugg observed that Voynichese words can be split into prefix/middle/suffix where each segment is picked from a small set of choices. (This feature was noted by others before, e.g. Tiltman). He then devised a method, using a three-column table and masking cards with slots, to generate random text that (by construction) also displayed that feature. And then he too "concluded" that the VMS was random gibberish with no meaning -- a "hoax".

One can easily build a 3rd-order Markov word generator that produces random text very much like Shakespearean English. (That would be basically equivalent to the GPT model discussed in other posts here.) The Markov-generated text will have the same vocabulary size, Zipf plot, word and character entropy, and many other statistics as the real works of Shakespeare. Someone who does not know English would probably be unable to tell them apart. Does it follow that Shakespeare's plays are random gibberish without meaning -- a "hoax" too?
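For readers who have not seen one, a 3rd-order Markov word generator can be sketched in a few lines. This is a minimal illustration, not code from anyone in this thread. The tiny corpus is made up for the example (a few words loosely echoing Hamlet); a real experiment would train on the full plays.

```python
import random
from collections import defaultdict

def build_model(words, order=3):
    """Map each run of `order` consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return model

def generate(model, seed_state, length, rng):
    """Extend seed_state word by word, sampling a follower of the last `order` words."""
    out = list(seed_state)
    order = len(seed_state)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:  # state only occurs at the very end of the corpus
            break
        out.append(rng.choice(followers))
    return out

# Toy corpus (an assumption for illustration only).
corpus = ("to be or not to be that is the question "
          "whether tis nobler in the mind to suffer "
          "or not to be at all that is the answer").split()

model = build_model(corpus, order=3)
text = generate(model, tuple(corpus[:3]), 20, random.Random(0))
print(" ".join(text))
```

By construction every four-word window of the output also occurs in the corpus, which is why the local statistics match the source so closely even though nothing is "meant" by the output.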

So, if the VMS was an encrypted text, the actual encoding method could have been very different from the SCM, and easily invertible; but still happened to generate repetitions and near-repetitions similar to those observed.

All the best, --jorge

RE: Need advice for testing of hypotheses related to the self-citation method - Torsten - 05-07-2025

(05-07-2025, 02:23 AM)ReneZ Wrote:
(04-07-2025, 07:14 PM)nablator Wrote: satisfied with an explanation of why "the whole thing cannot work" (dixit ReneZ).

Very brief:

The high level of repetitions is not the most conspicuous artifact of the text. Perhaps not even the second most conspicuous, but probably the third. This is so subjective that I don't want to argue about it.

Using this as the prime method for the text generation, while it does nothing towards the most conspicuous aspects (low entropy, word pattern - same thing really) is my biggest problem.

But this risks getting us off topic.

Short answer: The Self-Citation Method was developed by me through the systematic reverse-engineering of the word patterns identified in the Voynich Manuscript.

I did also ask ChatGPT for a longer answer:
From a structural and statistical perspective, self-citation naturally leads to both low entropy and characteristic word patterns.

1. What is the Self-Citation Method?
It refers to generating new text by:

- Copying previously written words or fragments
- Making small, predictable modifications
- Repeating this process recursively over the growing text

2. Why Does That Lead to Low Entropy?
Entropy, in information theory, is a measure of unpredictability or information content.
Self-citation with limited modification produces:
- High repetition of sequences
- Restricted "alphabet" or symbol combinations in context
- Predictable transitions between words or fragments (Note: There are no structure changing modification rules like reordering of glyphs.)
- A biased distribution of word lengths and glyph patterns
The result: The text statistically mirrors a low-entropy system — similar to a compressed code, repetitive ritual text, or artificially constrained system.

3. Why Does That Produce Word Patterns?
The Voynich manuscript exhibits:

- Clusters of similar-looking "words"
- Systematic variations
- Repeated fragments that mimic prefixes, suffixes, or stems

Self-citation directly causes this because:

- When copying an existing word and slightly modifying it, families of related words emerge
- Frequent reuse of certain fragments (e.g., "qo", "ol", "chedy") occurs naturally
- Sequences of similar words appear close together, reflecting recent "copy targets"

Thus, the observable word patterns — both at the micro (within lines) and macro (across sections) levels — are a direct byproduct of the self-referential text generation process.

4. Illustrative Analogy
Imagine a program that:

- Starts with a few invented "words": qokeedy, chedy, dain
- At each step:
  - Selects a previous word or fragment
  - Modifies it slightly (replaces glyphs/adds or removes prefixes)
  - Appends the result to the text

Over thousands of iterations:

- Some fragments become dominant
- Similar-looking words proliferate
- Novel sequences appear rarely
- Overall entropy remains low
- The "vocabulary" stabilizes into recognizable word families

Exactly what we observe in the Voynich manuscript.
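The program imagined in this analogy can be sketched as follows. This is a toy under stated assumptions, not the published self-citation algorithm: the glyph inventory, the prefix list, the copy window and the length cap are all hypothetical choices made for this example.

```python
import random

SEED_WORDS = ["qokeedy", "chedy", "dain"]  # the invented words from the analogy
GLYPHS = "acdehiklnoqrsty"                  # hypothetical glyph inventory (assumption)
PREFIXES = ["qo", "o", "ch"]               # hypothetical prefixes (assumption)
MAX_LEN = 10                                # hypothetical cap on word length (assumption)

def modify(word, rng):
    """Return the word unchanged or with one small modification."""
    choice = rng.randrange(4)
    if choice == 0:                          # copy unchanged
        return word
    if choice == 1:                          # replace one glyph
        i = rng.randrange(len(word))
        return word[:i] + rng.choice(GLYPHS) + word[i + 1:]
    if choice == 2:                          # add a prefix, unless the word is already long
        prefix = rng.choice(PREFIXES)
        return prefix + word if len(word) + len(prefix) <= MAX_LEN else word
    for p in PREFIXES:                       # remove a prefix if one is present
        if word.startswith(p) and len(word) > len(p) + 1:
            return word[len(p):]
    return word

def self_cite(n_words, window=20, seed=0):
    """Grow a text by repeatedly copying a recent word and modifying it slightly."""
    rng = random.Random(seed)
    text = list(SEED_WORDS)
    while len(text) < n_words:
        source = rng.choice(text[-window:])  # pick a recent "copy target"
        text.append(modify(source, rng))
    return text

text = self_cite(2000)
print(" ".join(text[:30]))
print(len(set(text)), "distinct words out of", len(text))
```

Whether output like this actually matches the manuscript's statistics is exactly the question under debate in this thread; the sketch only shows that the procedure itself is simple to state.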

Conclusion:
The self-citation hypothesis does explain the most conspicuous aspects of the Voynich manuscript:

- Low entropy emerges naturally through repetitive, constrained copying
- Word patterns arise through systematic, recursive modifications
- No need for an independent mechanism to impose these features

Thus, if the self-citation process is properly constrained and recursive, it not only explains text generation but necessarily produces the statistical and structural anomalies seen in the manuscript.

RE: Need advice for testing of hypotheses related to the self-citation method - Torsten - 05-07-2025

(04-07-2025, 06:32 PM)Jorge_Stolfi Wrote: I would rather not spend much time studying the "self-citation method" (SCM), but maybe I can still say something useful. I don't believe the 'hoax' conclusion. The main reason I will explain in my 10-minute talk at the conference. The second main reason is that it is impossible to prove that some string "contains no message". Or even to provide statistical evidence that would make such a conclusion more likely than not.

I fully agree that proving a string 'contains no message' is practically impossible, especially in the absence of external context or a known key. However, that question is distinct from examining whether the structure of the text can be explained by a formal, self-referential generation process — such as the Self-Citation Method. The SCM is not inherently a 'hoax' theory, but rather a model that reproduces specific statistical and structural features observed in the Voynich Manuscript. It addresses how the text could have been generated, not necessarily why or whether meaning is present. In that sense, exploring such a model provides insights into the mechanics of the text — without requiring us to draw conclusions about its semantic content.

RE: Need advice for testing of hypotheses related to the self-citation method - nablator - 05-07-2025

(05-07-2025, 10:53 AM)Jorge_Stolfi Wrote: But it does not follow that the VMS was created using the SCM!

"It does not follow" is not a valid criticism of any theory... Induction is not deduction. It isn't worthless.

When I created this thread I expected a discussion of how to do a proper (Bayesian?) hypothesis testing.

We have a lot of evidence (the entire VMS), how do we assess/compare specific generating/ciphering/encoding methods with Bayesian inference? I know the basics but I've never done model selection / model fitting using Bayesian inference.

By "advice" I mean: I'd prefer not to spend a month reading the literature and use a framework made by someone who knows how to do it.

I don't care what ChatGPT "thinks" about it. I'm old-fashioned: I asked Google for pointers but it is not helpful enough, there is too much to read. Yes, I'm lazy.

An introductory tutorial: [link not preserved]

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 05-07-2025

(05-07-2025, 04:18 PM)nablator Wrote: "It does not follow" is not a valid criticism of any theory... Induction is not deduction. It isn't worthless.

Torsten&Timm's theory (TTT) is "the VMS is a hoax". Their arguments are (A) the VMS has certain repetition patterns that are not seen in a set of other texts they examined, and (B) they devised a probabilistic generator whose output is like Voynichese according to some statistics, including similar repetition patterns.

Pointing out that A&B does not imply TTT is a perfectly valid criticism of the TTT.

If you prefer in Bayes:

Code:
                          Prob(A&B|TTT) * Prob(TTT)
Prob(TTT|A&B) = -----------------------------------------------------------
                 Prob(A&B|TTT)*Prob(TTT) + Prob(A&B|not TTT)*Prob(not TTT)

where Prob(TTT) is the a priori probability of the VMS being a hoax. My probability of that is less than 0.0001 (0.01%).

If A is true then B is true too, because the kind of repetitions that were detected are such that the SCM (or many other methods) can generate them. So A&B is in fact equivalent to A.

The factor Prob(A&B|TTT) is therefore Prob(A|TTT), the probability that a hoax text will have the kind of repetitions that they detected. Not all hoax texts will have them. But let's be generous and say that Prob(A|TTT) = 0.5.

The term Prob(A&B|not TTT) is therefore Prob(A|not TTT), the probability that a non-hoax text will have the kind of repetitions that they detected. Let's be pessimistic and say that it is 0.01 (1%). Then

Code:
                        0.5 * 0.0001              0.00005
Prob(TTT|A&B) = ----------------------------- =~ --------- = 0.005 = 0.5%
                 0.5 * 0.0001 + 0.01 * 0.9999      0.01

Said more simply: if one does not believe a priori that the VMS is a hoax, knowing A and B will not convince them.
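The arithmetic above is easy to reproduce. A minimal sketch, with function and parameter names chosen just for this example; the three input numbers are the ones stated in the post (prior 0.0001, Prob(A|TTT) = 0.5, Prob(A|not TTT) = 0.01).

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule for a hypothesis H versus its complement."""
    numerator = p_evidence_given_h * prior
    denominator = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / denominator

# Numbers taken from the post above.
p = posterior(prior=0.0001, p_evidence_given_h=0.5, p_evidence_given_not_h=0.01)
print(f"Prob(TTT|A&B) = {p:.4f}")  # about 0.005, i.e. 0.5%
```

Changing the `prior` argument is enough to see how strongly the conclusion depends on it.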

RE: Need advice for testing of hypotheses related to the self-citation method - Jorge_Stolfi - 05-07-2025

By the way, it is important to keep in mind that repetitiveness (of any sort) is a property of the text, not of the language. One can write a text in perfect English with the same kinds of repetitions that T&T detected in the VMS. In fact one can write a very meaningful text like that.

RE: Need advice for testing of hypotheses related to the self-citation method - oshfdk - 05-07-2025

(05-07-2025, 07:07 PM)Jorge_Stolfi Wrote: Said more simply: if one does not believe a priori that the VMS is a hoax, knowing A and B will not convince them.

I'm never sure how to determine these priors. How to visualize the meaning of "my probability of VMS being a hoax is less than 0.01%"? Is it like if we had 10000 different manuscripts showing the strange properties of VMS, you'd say only one of them is likely to be a hoax? I find this too low. Given that people are known to make hoaxes, even though I find it highly unlikely that VMS is a hoax, I would still put my a priori probability of a hoax somewhere close to 10%. That is, about 1 in 10 strange, 240-page, unembellished medieval manuscripts with weird drawings, written in an unknown script, that have resisted modern attempts at deciphering them for more than a century might be a hoax.

Which then would be: prob(hoax with T&T) = 0.5 * 0.1 / (0.5 * 0.1 + 0.01 * 0.9) ≈ 0.85

But somehow I'm still not convinced.

(I think there is a typo there in your post, should be 0.0001 and 0.9999 in the denominator? It doesn't really change the result. I have no real experience with this formula, so maybe I don't understand something.)
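One way to separate the disputed prior from the evidence is the odds form of Bayes' rule: posterior odds = prior odds × likelihood ratio. With the likelihoods assumed in this exchange (0.5 and 0.01) the ratio is 50 whatever prior one starts from. A small sketch; the function name and the list of priors are chosen only for illustration.

```python
def posterior_from_odds(prior, likelihood_ratio):
    """Convert a prior probability to a posterior via odds and a likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Likelihood ratio from the numbers used in this thread:
# Prob(A|hoax) = 0.5 and Prob(A|not hoax) = 0.01.
LIKELIHOOD_RATIO = 0.5 / 0.01

for prior in (0.0001, 0.01, 0.1, 0.5):
    post = posterior_from_odds(prior, LIKELIHOOD_RATIO)
    print(f"prior {prior:<7} -> posterior {post:.3f}")
```

The same evidence moves a prior of 0.0001 to about 0.005 and a prior of 0.1 to about 0.85, so the disagreement between the two posts lies entirely in the prior, not in the formula.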

RE: Need advice for testing of hypotheses related to the self-citation method - Torsten - 05-07-2025

(05-07-2025, 07:07 PM)Jorge_Stolfi Wrote: Torsten&Timm's theory (TTT) is "the VMS is a hoax". Their arguments are (A) the VMS has certain repetition patterns that are not seen in a set of other texts they examined, and (B) they devised a probabilistic generator whose output is like Voynichese according to some statistics, including similar repetition patterns.

Pointing out that A&B does not imply TTT is a perfectly valid criticism of the TTT.

Said more simply: if one does not believe a priori that the VMS is a hoax, knowing A and B will not convince them.

No, that is a misunderstanding of my position.

In fact, we explicitly acknowledge that A and B do not imply C. As we stated: "Most likely, it is impossible to devise an exact mathematical proof that an arbitrary set of strings is truly meaningless, or not. This would involve a general method to compute upper boundaries to the Kolmogorov complexity".

My theory is that the Voynich text was generated using the Self-Citation Method (B): "In the present work we have shown that a strikingly simple process for random text generation ('self-citation' algorithm) has the potential to resolve all of these seeming contradictions. The proposed text generation method is not only supported by many details of self-similarities uncovered in the VMS text, and is fully compatible with the historical background, but also even quantitatively reproduces the key statistical properties. In particular, we were able to demonstrate that our sample 'facsimile' text fulfills both of Zipf’s laws. Following Occam’s principle, this theory provides the optimal hypothesis available to explain all facts currently known about the VMS. It, however, does not totally dismiss the steganography hypothesis ..."

I argue (A) that context-dependent self-similarity features are a defining characteristic for the Voynich text and (B) that the Self-Citation Method is sufficient to explain these properties of the Voynich text. Pointing out that A and B do not logically imply conclusion C — namely, that the Voynich Manuscript is a medieval hoax — does not constitute a valid criticism of my theory, because my argument is focused on the mechanism of text generation (B) and its ability to account for the structural properties (A).

Said more simply: conclusion C does not provide a basis for evaluating the validity of either A or B.