The Voynich Ninja

Full Version: Speculative fraud hypothesis
(23-08-2025, 07:51 AM)magnesium Wrote: Why go through all the trouble of making a complex cipher if you could sell the book just as easily with it saying nothing at all?

First, it would be a sure way to make Voynichese look like a real language.  It seems unlikely that a forger of the time could devise a gibberish generation method with that property.  The methods of Rugg and of Timm & Schinner could do that with proper parameters, but no one has shown how the VMS Author could have figured out the proper parameters.

Second, to protect the Author from an angry mark.  Suppose that the Author creates the VMS with a gibberish generator and sells it to Rudolf, claiming that it was Bacon's Third Book obtained from some source in England.  Rudolf then gives the book to the mathematicians and cryptographers in his court to decipher.  After months of effort, they (like many modern Voynichologists) all agree that it must be gibberish.  They cannot really prove it, but they come up with some flawed "proof" like modern "hoax" proponents have done.  Or they just tell Rudolf that they cannot prove it but are fairly confident that it is gibberish; and that is enough for Rudolf.  If so, the Author must have been lying and must himself be the forger.  What would Rudolf do to him?  What he did to Edward Kelley?

Now suppose the same story, but the Author creates the VMS by taking some mix of alchemical herbals with other "legitimate" texts, like Dioscorides or Galen, and encoding it with some bizarre but not too hard cipher.  Then the same as above.  The mathematicians may be able to crack the cipher.  Or they may give up and tell Rudolf "surely gibberish", so Rudolf confronts the Author, and the Author, "by luck", finds the key hidden in the circular text of page f69v.  Either way, Rudolf will be disappointed by the contents, but the Author can claim that he was fooled too, and could not have suspected it, because it is indeed a book of Secret Ancient Knowledge, just not Bacon's; and after all the contents are no more bogus than those of most other herbal/medical books of the time.

I admit that it is rather unlikely that a Forger-Author would reason this way.  But the "hoax" theory itself is far less likely...

All the best, --jorge
If you have to pass cursory inspection with a mark, you might need to be able to plausibly "read" phonemes, giving you an incentive to invent a pronounceable system. If you want it to pass cursory reading inspection, you want there to be no obvious repeats of entire blocks of text, and especially to have unique labels, so that two different things aren't labelled the same. These objectives *would* explain the peculiar properties of the word distributions for labels and for the text - even down to suppressing the repeats you would expect in a natural language, where common turns of phrase recur in lists with only a few words changed.
(23-08-2025, 11:59 AM)Jorge_Stolfi Wrote: First, it would be a sure way to make Voynichese look like a real language.  It seems unlikely that a forger of the time could devise a gibberish generation method with that property.  The methods of Rugg and of Timm & Schinner could do that with proper parameters, but no one has shown how the VMS Author could have figured out the proper parameters.

There was no need to invent an artificial “gibberish generation” mechanism. As D’Imperio already observed, the natural way to produce sequences of meaningless text would be through iterative reuse and modification of existing material: “The scribe, faced with the task of thinking up a large number of such dummy sequences, would naturally tend to repeat parts of neighboring strings with various small changes and additions to fill out the line until the next message-bearing word or phrase” (D’Imperio 1978, p. 31).

It is simply more efficient for a scribe to copy and modify existing words than to continually invent entirely new ones. Consequently, a scribe attempting to generate language-like gibberish would, sooner or later, abandon the laborious task of perpetual invention in favor of the far easier strategy of reduplicating and adapting previously written material—and would ultimately adhere to this approach consistently. This tendency is further supported by experimental findings [link].

One of the key advantages of the self-citation method is that statistical regularities—such as the Zipfian distribution—emerge organically from the process itself, without requiring deliberate adjustment. This is because the method is grounded directly in observations of the Voynich text. I begin by identifying fundamental, corpus-wide patterns—for example, the clustering of similar words across pages. These clusters suggest a mechanism of repetition and gradual variation. The central argument, therefore, is that an iterative process of copying and modification is sufficient to account for the statistical features. The observed word type frequencies can be interpreted as the outcome of a self-reinforcing resonance effect produced by continual copying. Frequently occurring words are more likely to be copied, thereby generating numerous variants. The accumulation of such variants, in turn, increases the probability that the original form will reappear during subsequent cycles of copying. This way the hypothesis accounts for the clustering of similar words within the Voynich Manuscript.

Therefore there is no need to determine or fine-tune external parameters. As noted in Timm & Schinner (2019):
Quote:We deliberately did not fine-tune the algorithm to pick an 'optimal' sample for this presentation. Such a strategy is by itself questionable. Nevertheless, an exhaustive scan of the parameter space (involving thousands of automatically analyzed text samples) verified the overall stability of the proposed algorithm. About 10-20% of the parameter space even yields excellent numerical conformity (≤ 10% relative error) with all considered key features of the real VMS text (entropy values, random walk exponents, token length distribution, etc.).
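
As a crude illustration of the resonance effect (a minimal sketch of my own, not the published algorithm): each new word is drawn from the text already written, so frequent words are copied more often, and occasional mutations feed rare variants back into the pool.

Code:
import random
from collections import Counter

def resonance_demo(n_tokens=20000, p_mutate=0.05):
  # Seed pool; every generated token is appended back, so picking
  # uniformly from the pool favors already-frequent words.
  text = ["daiin", "chedy", "qokaiin"]
  for _ in range(n_tokens):
    word = random.choice(text)
    if random.random() < p_mutate:
      # Crude stand-in for glyph-level modification.
      i = random.randrange(len(word))
      word = word[:i] + random.choice("cdehkloqy") + word[i+1:]
    text.append(word)
  return text

ranked = sorted(Counter(resonance_demo()).values(), reverse=True)
print(ranked[:10])  # a few very frequent types, many rare variants

Even this toy version yields a heavy-headed, long-tailed frequency distribution without any tuned parameters.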
(23-08-2025, 08:23 PM)Torsten Wrote: One of the key advantages of the self-citation method is that statistical regularities—such as the Zipfian distribution—emerge organically from the process itself, without requiring deliberate adjustment. [...] Therefore there is no need to determine or fine-tune external parameters.
Furthermore, regarding "no one has shown how the VMS Author could have figured out the proper parameters": the author(s) didn't need to 'figure out' the parameters. Had they accidentally landed somewhere else in parameter space, we would be very similarly stumped. All that's required for the hypothetical process to be plausible is that the fraction of the parameter space that explains the Voynich is large enough that they could have landed there. Which, it sounds like, is the case for this simple process.

A simple process like this is entirely congruent with '15th century hoax' hypotheses.
(23-08-2025, 08:23 PM)Torsten Wrote: There was no need to invent an artificial “gibberish generation” mechanism. As D’Imperio already observed [...]

She was not stating a fact. She was proposing her version of the "hoax" theory.  Which, in general terms, apparently is the same as yours.  Which has the same problems as yours.

Torsten Wrote:It is simply more efficient for a scribe to copy and modify existing words than to continually invent entirely new ones.

This is correct.  Continually inventing new words is hard.  Copying previous text is much easier.

Torsten Wrote:Consequently, a scribe attempting to generate language-like gibberish would, sooner or later, abandon the laborious task of perpetual invention in favor of the far easier strategy of reduplicating and *adapting* previously written material — and would ultimately adhere to this approach consistently.

Note my emphasis.  The problem is that the "adapting" is far from a simple step.  Voynichese words have a very restricted structure, so the "adapting" must be random but such that it preserves that structure.  At this point the gibberish generation method is not much easier than generating each word from scratch (as Rugg had proposed), and is totally not "natural". 

Much easier and more natural would have been to take any text in Latin or some other readable language, even if an "alchemical herbal", and encode it with a cipher that was easy to apply on the fly but hard or even impossible to decipher.  The resulting text would look like language and have language-like properties, much more so than the VMS.

In fact (if I read you correctly), your justification for your proposed method is that it creates the repetitiousness that you claim to see in the VMS, which is itself a clue that the text is gibberish. Wouldn't the Author have worried about that?

Torsten Wrote:I begin by identifying fundamental, corpus-wide patterns — for example, the clustering of similar words across pages. These clusters suggest a mechanism of repetition and gradual variation. The central argument, therefore, is that an iterative process of copying and modification is sufficient to account for the statistical features. [...]

Paraphrasing your argument: "The VMS text has statistical properties X, Y, and Z, where Z is 'repetitiousness'.  Here is an algorithm that generates gibberish with properties X, Y and Z. Therefore the VMS must be gibberish."

Do you see the logical fallacy there?

Please confirm if this Python code is a sufficiently close approximation of your method:

Code:
from random import random, randint

def TnT(SeedText, Mutate, Prob_Restart, Prob_Mutate):
  # Generates a pseudo-VMS text as a list of strings.
  T = SeedText(1000)  # Create a 1000-word seed text.
  k = None  # Source text index.
  for i in range(35000):
    if i == 0 or random() < Prob_Restart:
      k = randint(0, len(T)-1)  # Restart copying from a random spot.
    word = T[k]; k += 1  # Copy the next source word.
    if random() < Prob_Mutate:
      word = Mutate(word)  # Occasionally modify the copied word.
    T.append(word)  # Generated words become future source material.
  return T
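
For concreteness, here is one way the unspecified pieces could be filled in (my own guesses, purely illustrative; the lexicon and letter set are made up):

Code:
from random import choice, randrange

def SeedText(n):
  # Hypothetical seed: n words drawn from a tiny made-up lexicon.
  lexicon = ["daiin", "chedy", "qokeedy", "shedy", "ol", "aiin"]
  return [choice(lexicon) for _ in range(n)]

def Mutate(word):
  # Hypothetical mutation: replace one letter by a random letter --
  # exactly the kind of structure-destroying change discussed below.
  i = randrange(len(word))
  return word[:i] + choice("acdehiklnoqrsy") + word[i+1:]

text = TnT(SeedText, Mutate, Prob_Restart=0.1, Prob_Mutate=0.05)
print(" ".join(text[1000:1040]))  # a sample of the generated tokens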

Torsten Wrote:Therefore there is no need to determine or fine-tune external parameters

(23-08-2025, 09:03 PM)dexdex Wrote: the author(s) didn't need to 'figure out' the parameters. Had they accidentally landed somewhere else in parameter space, we would be very similarly stumped. All that's required for the hypothetical process to be plausible is that the fraction of the parameter space that explains the Voynich is large enough that they could have landed there. Which, it sounds like, is the case for this simple process.

There was no need for the Author to tune the parameters to produce the Voynichese "language" specifically, but the parameters could not have been random.  The seed text and the algorithm of the Mutate function had to be compatible, and both had to generate the non-trivial word structure that we see in the VMS lexicon.

The Mutate function could not have been just "choose randomly between deleting a random letter, inserting a random letter in a random place, or replacing a randomly chosen letter with some other random letter".  After a short while those mutations would produce a lexicon that is just random strings of letters, with no discernible structure.

For the same reason, the seed text could not have been just a list of random strings of letters. Its words must have had the non-trivial structure we observe in Voynichese, which must have been preserved by Mutate.
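
To be concrete, a Mutate with that property would have to look something like the sketch below, where substitutions are confined to classes of interchangeable glyphs. (The classes shown are loose illustrations, not a real Voynichese grammar.)

Code:
import random

GLYPH_CLASSES = [
  {"k", "t"},    # gallows
  {"ch", "sh"},  # benches
  {"o", "a"},    # circles
]

def mutate_preserving(word):
  # Swap one glyph for another member of its class, so that the
  # positional structure of the word survives the mutation.
  for cls in random.sample(GLYPH_CLASSES, len(GLYPH_CLASSES)):
    for g in cls:
      if g in word:
        alt = random.choice([h for h in cls if h != g])
        return word.replace(g, alt, 1)
  return word  # no recognized glyph: leave the word unchanged

print(mutate_preserving("qokchedy"))  # e.g. "qotchedy" or "qokshedy"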

Moreover, the Author would have had to choose the seed text and the Mutate function so as to produce the pronounced idiosyncrasies we see in the VMS word frequency distribution.  Consider the following word counts from my version of the VMS transcription file:

 105.250 Chdy
  35.125 Shdy

 301.250 Chedy
 236.500 Shedy
 
  34.000 Cheedy
  50.500 Sheedy
 
  26.000 okChdy
   1.000 okShdy

  18.500 okChedy
   3.000 okShedy
   
   0.000 okCheedy
   0.000 okSheedy
   
  38.000 qokChdy
   4.000 qokShdy

  33.000 qokChedy
   5.000 qokShedy

   2.000 qokCheedy
   1.000 qokSheedy

(The fractional numbers result from regarding the ',' separator as a word break with 50% probability).
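
For the record, the fractional counting works roughly like this (a sketch of my own; the actual transcription handling is more involved):

Code:
from collections import Counter

def fractional_counts(lines):
  # '.' is a certain word break; ',' is a word break with probability
  # 1/2.  Each of the 2**n readings of a token with n commas gets
  # weight 2**-n, and every word in it is counted with that weight.
  counts = Counter()
  for line in lines:
    for token in line.split("."):
      pieces = token.split(",")
      n = len(pieces) - 1
      for mask in range(2 ** n):
        weight = 0.5 ** n
        word = pieces[0]
        for j in range(n):
          if mask >> j & 1:       # this comma acts as a break
            counts[word] += weight
            word = pieces[j + 1]
          else:                   # this comma is ignored
            word += pieces[j + 1]
        counts[word] += weight
  return counts

print(fractional_counts(["okCh,dy.okChdy"]))
# Counter({'okChdy': 1.5, 'okCh': 0.5, 'dy': 0.5})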

How could a "parameterless" Mutate function produce these asymmetric word frequencies?

By the way, such asymmetries are normal in natural languages.  Here are some counts from Wells's The War of the Worlds:

    91 brother
     0 brothers
    13 brother's
     
    63 another
     0 anothers
     1 another's
     
    58 other
    11 others
     0 other's
     
     3 mother
     1 mothers
     0 mother's
     
     1 bother
     0 bothers
     0 bother's
     
It so happens that the novel's main character had just one brother, and their mother must have passed away before the invasion...

But here is a way you could further confirm your theory.  If the seed text has N tokens, it has only N-1 consecutive token pairs.  Therefore, if the seed text is only a thousand words or less, the distribution of word types that could follow a given word type would have been very limited, often singular (only one choice, zero entropy).  But each time the source index k is reset, or a word is mutated, new consecutive word pairs are created, and so that distribution becomes broader as more and more words are generated.

For instance, suppose that Chody occurs only once in the seed text, preceded by dol and followed by daiin.  For a while, the algorithm will generate a Chody only after a dol, and then would always generate a daiin next.  But if the pointer k is reset after a Chody output to point to the word ChChy, the generated text will have a new pair starting with Chody, namely Chody ChChy.  Then future Chodys may be followed by either daiin or ChChy.

A similar situation occurs if, after a Chody output, the Mutate function is called and changes T[k] from daiin to kaiin.

Thus, as the algorithm progresses, the next-word distribution will become "blurred" by the inclusion of random previously generated words, and mutations thereof.

In fact, if the algorithm is run for a long enough time, the next-word distribution will tend to be the same for every word type.  The algorithm then would reduce to a Markov chain of order zero, with a word frequency distribution that is invariant under the Mutate function.  The next-word entropy should grow from near zero at the beginning of the algorithm to the entropy of that distribution.
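
Concretely, the test would be something like this (my own instrumentation sketch, not anything from the paper):

Code:
import math
from collections import Counter, defaultdict

def next_word_entropy(tokens):
  # Conditional entropy H(next | current), in bits, over one window.
  followers = defaultdict(Counter)
  for a, b in zip(tokens, tokens[1:]):
    followers[a][b] += 1
  total = len(tokens) - 1
  h = 0.0
  for counts in followers.values():
    n = sum(counts.values())
    h_cond = -sum((c/n) * math.log2(c/n) for c in counts.values())
    h += (n/total) * h_cond
  return h

# Applied to successive windows of the generated text, this value
# should climb from near zero toward the unigram entropy:
# for start in range(0, len(text) - 5000, 5000):
#   print(start, next_word_entropy(text[start:start+5000]))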

Did you observe such an effect in the output of your algorithm?  And/or in the VMS?

All the best, --jorge
Isn't the idea that you are self-copying something you don't understand? The words have an overwhelming probability of 'mutating' by simple errors; the scribe knows the glyphs, so errors are most likely to happen to glyphs that are hard to read. Or they intentionally added a flourish to a word out of boredom. Either way, the new word becomes a copy candidate, but with much lower probability, since there is only one example of it, preserving the Zipfian distribution of words.
(24-08-2025, 09:16 AM)dexdex Wrote: Isn't the idea that you are self-copying something you don't understand?

Yes, of course.

Quote:The words have an overwhelming probability of 'mutating' by simple errors

I am not speaking of "errors" (which is a meaningless term in this context), but of words that do not have the proper structure.  Here is an old analysis of mine.  (The "?" superscript means "either 0 or 1 instances of a symbol from this set".)

[attachment=11314]
[attachment=11315]
[attachment=11316]
[attachment=11317]

How exactly does the TnT model give rise to this word structure?

All the best,  --jorge
(24-08-2025, 10:57 AM)Jorge_Stolfi Wrote: How exactly does the TnT model give rise to this word structure?

Only if there is a long set of rules that have this effect. And that defeats the purpose.
(24-08-2025, 08:46 AM)Jorge_Stolfi Wrote: Did you observe such an effect in the output of your algorithm?  And/or in the VMS?

There was a thread about a similar question: [link]

(12-08-2024, 08:03 AM)obelus Wrote: By chopping the generated text into 75 pseudo-pages of 16 lines each, we can approximate the bulk layout of a vms sample (below).  The statistical traces of pagewise self-citation, if present, should manifest on each page independently.

(13-08-2024, 12:17 PM)obelus Wrote: @Emma May Smith:
Your reasoning that word mutations must eventually converge on a statistically equilibrated vocabulary appears to be borne out by Torsten's simulated text.  Parsing it into 45 pages of 26 lines each,

A nitpick: there is a setting (29 lines per page for the generated text) that should fit better than 16 or 26:

#text.lines_per_page=29
(24-08-2025, 10:59 AM)ReneZ Wrote:
(24-08-2025, 10:57 AM)Jorge_Stolfi Wrote: How exactly does the TnT model give rise to this word structure?

Only if there is a long set of rules that have this effect. And that defeats the purpose.

As far as I can tell, for the self-citation you only require a couple of rules. Create new words by one of the following (see the sketch at the end of this post):
1) taking a previous word and changing a glyph to a similar glyph;
2) adding a prefix from a list of prefixes, if it is not already in the word;
3) concatenating existing words.

The vast majority of the words would fit these rules, especially if we add a little per-scribe personal touch (beautifying marks, a preference for certain word shapes).

I think that is the conceit behind the algorithm, and it is certainly not a complex process.
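
A minimal sketch of those three rules (my own reading of them; the glyph similarities and prefix list are made up):

Code:
import random

SIMILAR = {"k": "t", "t": "k", "o": "a", "a": "o", "r": "s", "s": "r"}
PREFIXES = ["qo", "o", "y"]

def new_word(vocab):
  rule = random.choice([1, 2, 3])
  if rule == 1:
    # Rule 1: change a glyph to a similar glyph.
    w = random.choice(vocab)
    i = random.randrange(len(w))
    return w[:i] + SIMILAR.get(w[i], w[i]) + w[i+1:]
  if rule == 2:
    # Rule 2: add a prefix that the word does not already start with.
    w = random.choice(vocab)
    options = [p for p in PREFIXES if not w.startswith(p)]
    return (random.choice(options) if options else "") + w
  # Rule 3: concatenate two existing words.
  return random.choice(vocab) + random.choice(vocab)

vocab = ["daiin", "chedy", "okar"]
for _ in range(10):
  vocab.append(new_word(vocab))
print(vocab)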