The Voynich Ninja

Full Version: The structure of the Voynich text and how it may be generated
(09-04-2026, 12:16 PM)Rafal Wrote: Remember that the Voynich author didn't have a computer

This is the difficult part.
A human does not follow a strict pattern.
A computer can be told not to follow a strict pattern either, but the result will never be even remotely similar to the human output; the two will always diverge quickly.

In that situation it is quite complicated to find indicators (extracted from the two texts) that show whether one is on the right path or not.
Quote:But I absolutely agree that the human "art" or creativity is not well modeled by machines.

Creativity is a form of human thinking, and we still don't really understand how humans think, or what exact process produces a given result.

We have developed AI that can mimic the final results of some human creative processes, but it is almost certain that it reaches these results in a totally different way than humans do.

Take chess as an example. Computers playing chess analyze millions of positions using the minimax algorithm. Do people playing chess do the same? Probably not, and certainly not consciously. They often rely on this blurry thing called intuition. Yet the final result is similar to a machine playing chess.

I believe that if the Voynich text doesn't have meaning and was generated in some way, then it was generated in a semi-systematic way: not fully systematic and not fully random, something in between. The scribe had a set of rules and common tricks, but was free to choose Rule 1, Rule 2 or Rule 3 at one moment, and at other moments was just improvising.

Recreating it with an algorithm may be hard, and even if you get something similar, you can never be sure it wasn't done another way.

But of course I am really curious to see your results!
Here is an example of the text output:

sokeey chear qot aleedy chokeey cheor cheal alched chochor rchedar
otoar cheol sorshey lcheedy alchey alched olo otoar dytchy chdol chochor
ototar olkeedy oldaiin eeety ykeor lchedy dodl qod polchedy
teeoar chody dod chedain daiinokshody cheedy teeoar chckheed alshedy
oinoly qoka tchoar cheeog ykaly oeeody
okchey
khey olaiin pcheocphy dosg tchoar oteey oaiin pcheocphy tolkeeedy
qoteeod dosg loiin opalkar polaiin polkeey ctheor sheekain chedar chokey
tolkeeedy pchomotor kalol okyytaiin
olshdy shdalo qokain
dyoty ychocphy shoeey kydain lchey shdaly lshechy qokcho
ockhhy keshar okal ykeey ytodaly shoshy skaiiodar scho
araiiin chol kydain chlol lteey arom oraiiin olaiiin cthar
chopchy kedar ykeey cheedy shtor pchody yty qototeeey qokchedy
sorchy chckhy otair dkeey chcthy qokeey dsheedal qody pcheety qofchedy oto
osary chedy ckhey pchomotor qokshedy okey chtaldy cphol cheol qotes
dcheedy qedy kan ctheety oteeg qotas qokshedy ear
yteod ctho toar okoraldy kar dar qotes tochady chtaly
qody qokeody daral shokol okchaldy chotal chdam lkarshar
okoraldy qokeody okary dalam dokedy


Note that the unicode character at the start of line 7 is one of the special glyphs from the EVA transliteration.

The model reproduces many surface-level statistical and positional properties of the Voynich text, but fails to capture its core lexical structure: a highly productive system that generates many new word forms while keeping them densely interconnected.

The words are not drawn from a fixed bag of words. The model starts from an attested vocabulary, but new tokens are continuously created through small mutations of existing ones. These candidates are then selected based on local context and compatibility with similar forms, and finally projected back onto the known vocabulary space. This creates a system where words are partly reused, but also systematically varied, rather than simply sampled independently from a predefined list.

Text is generated token by token using a hybrid mechanism. At each step, the model either copies and mutates a recent token, samples from a compatible “family” of similar forms, or introduces controlled innovations that are snapped back to attested vocabulary. Generation is constrained by a real line skeleton, so line length, position in line, and paragraph-initial context directly influence which forms can appear. Additional biases enforce known properties such as longer words and higher gallows frequency at line and paragraph starts.
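A minimal sketch of such a hybrid generation loop (all names, probabilities and edit-distance thresholds are illustrative assumptions on my part, not the author's actual code):

```python
import random

def lev(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mutate(word, alphabet="ocheadyiklrqts"):
    """One random single-character edit: insert, delete or substitute."""
    i = random.randrange(len(word) + 1)
    op = random.choice(["insert", "delete", "substitute"])
    if op == "insert":
        return word[:i] + random.choice(alphabet) + word[i:]
    if op == "delete" and len(word) > 1 and i < len(word):
        return word[:i] + word[i + 1:]
    if i < len(word):
        return word[:i] + random.choice(alphabet) + word[i + 1:]
    return word

def snap(candidate, vocab):
    """'Snap back' a candidate onto the attested vocabulary."""
    if candidate in vocab:
        return candidate
    return min(vocab, key=lambda w: lev(w, candidate))

def generate_line(vocab, length, recent, p_copy=0.5, p_family=0.3):
    """Hybrid step: copy-and-mutate a recent token, sample from a
    'family' of similar forms, or innovate and snap back to vocab."""
    line = []
    for _ in range(length):
        r = random.random()
        if r < p_copy and (recent or line):
            tok = mutate(random.choice((recent + line)[-5:]))
        elif r < p_copy + p_family and line:
            family = [w for w in vocab if lev(w, line[-1]) <= 2]
            tok = random.choice(family)
        else:
            tok = mutate(random.choice(list(vocab)))
        line.append(snap(tok, vocab))
    return line
```

In this toy version the line skeleton reduces to a target `length`; the positional biases (gallows at paragraph starts, longer line-initial words) would be extra weights on the candidate choice.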

This setup reproduces well the external structure of the text: line lengths, repetition rate, basic entropy, and most positional effects (line start/end distributions, paragraph-initial patterns). However, it fails to reproduce the internal lexical geometry. Vocabulary size is too small, hapax rate is far too low, and both global and local Levenshtein connectivity are weaker than in Voynich. 
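The lexical measures mentioned here can be sketched as follows (illustrative only, with an arbitrary linking distance; not the author's evaluation code):

```python
from collections import Counter

def lev(a, b):
    """Levenshtein edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lexical_profile(tokens, link_dist=1):
    """Vocabulary size, hapax rate, and a crude 'Levenshtein connectivity':
    the fraction of word types with at least one other type within
    edit distance link_dist."""
    counts = Counter(tokens)
    types = list(counts)
    hapax_rate = sum(1 for w in types if counts[w] == 1) / len(types)
    connected = sum(
        1 for w in types
        if any(lev(w, v) <= link_dist for v in types if v != w)
    )
    return {"types": len(types),
            "hapax_rate": hapax_rate,
            "connectivity": connected / len(types)}
```

Comparing these three numbers between the generated sample and the manuscript is what reveals the mismatch described above.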


The model captures how the text looks, but not how it generates many new yet tightly related word forms. For example, the word daiin appears only as part of two longer words.
(10-04-2026, 06:53 PM)quimqu Wrote: The model reproduces many surface-level statistical and positional properties of the Voynich text.

Quimqu, I see that you are taking seriously the theory by Thorsten & Timm that the VMS text was generated by some complicated random process with some kind of feedback ("copy and mutate").

Even if you don't accept my claim about the Starred Parags being a transcription of the Shennong Bencaojing (see the Chinese theory thread), I must warn you that this approach is fundamentally flawed, for at least these reasons:
  • For any collection of statistical measurements (nth order letter or word frequencies, distance correlations, etc.), there is an algorithm that generates text that looks just like the VMS under those measurements.  Or just like the corpus of Shakespeare's plays. In either case, that "success" says nothing about how the original was actually created.
  • Those "copy and mutate" gibberish theories presume that the author started with a "seed" that was a sample of Voynichese text, and used a tuned mutation procedure that was biased towards the statistics that we see.  So those theories do not explain why the text has the strange properties that we see, because they require that the author defined those properties before he started to create the book.
  • It is mathematically impossible to prove that a string is "random", or even to present evidence that it is "probably random". Thus any statistical argument that claims to do that is necessarily flawed. In fact, a string that encodes the maximum possible amount of information (like a zip file or a JPEG image) will look "random".

The only way to show that the VMS is "meaningless" would be to find a deterministic algorithm that is much shorter than the VMS and generates exactly the same text (not just a text with similar statistics). Then one could say that the information content of the whole book is just that short algorithm.

That would be the case, for example, if one proved that the text of the VMS is the digits of Pi encoded in a simple scheme (like a Roman numeral for each group of three decimal digits).
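As a toy illustration of such a scheme (the exact grouping and the hard-coded digit string are my own assumptions, purely for illustration):

```python
def to_roman(n):
    """Standard subtractive Roman numerals, valid for 1..999."""
    vals = [(900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
            (90, "XC"), (50, "L"), (40, "XL"), (10, "X"),
            (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for v, s in vals:
        while n >= v:
            out += s
            n -= v
    return out

PI_DIGITS = "141592653589793238462643383279"  # digits of Pi after "3."

def encode_pi(digits=PI_DIGITS):
    """Spell each group of three decimal digits as a Roman numeral."""
    return [to_roman(int(digits[i:i + 3])) for i in range(0, len(digits) - 2, 3)]
```

The first group, 141, becomes CXLI; the whole output would be far shorter to describe than the "book" it generates.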

That was the case of the mysterious book with large tables of letters that got John Dee totally trapped in bogus conversations with "large language model" angels in a crystal ball.

All the best, --stolfi
(11-04-2026, 02:46 AM)Jorge_Stolfi Wrote: I see that you are taking seriously the theory by Thorsten & Timm that the VMS text was generated by some complicated random process with some kind of feedback ("copy and mutate").

Hello Jorge,

Not really. This is the sort of process that led me to this kind of model, which I don't think is the real thing, but which we can interpret along these lines:

  1. The scribe writes line by line with local context in mind
  2. Words are chosen to resemble nearby words
  3. Similar forms tend to cluster together
  4. New words can be created by small variations of existing ones
  5. Word choice is constrained by position in the line or paragraph
  6. The process favors consistency over long-range structure
  7. The underlying source of words is unknown (bag of words, language, or other)

The thing is that I started by detecting bursts, and then I tried a model that explains the text features summarized in this [link].

So I am not getting into how the words are created; rather, I intend to understand their structure: how they are positioned in the text. What I saw is that, depending on the position, some families of words tend to appear more than others, as they seem to depend on the local context. Again, as I understand it, this can also be a feature of languages. Also, I don't claim that the generated text is Voynichese; it is just a simulation.

Note that points 2, 3, 4 and 7 might just be consequences of what is written, but I am not saying that it is random generation or gibberish. What the numbers really show is that there is some sort of positional and contextual constraint, but again, this could be a feature of a language, who knows?
(11-04-2026, 08:52 AM)quimqu Wrote: [...] Also, I don't claim that the generated text is Voynichese; it is just a simulation.

That's the point though. It's just a simulation, and there is a seemingly infinite number of simulations, none of which gets you any closer to finding meaning. Your next step will be to look at what the seed might be; you'll consider plain text, then different kinds of ciphered texts, and eventually conclude that all of them work but none of them quite works the way it should if it were the actual answer....
Quote:It is mathematically impossible to prove that a string is "random", or even to present evidence that it is "probably random".

Not necessarily.

Let's take a hypothetical long string of the numbers 1-6 written in a medieval book. We made calculations and observed that:
- all numbers have very similar frequencies
- there aren't any longer repeating sequences

In such a case we can propose a scenario that it was probably generated by throwing a six-sided die.
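Both observations can be checked mechanically. A minimal Python sketch (the frequency threshold and the repeat length are arbitrary assumptions, not part of the original argument):

```python
from collections import Counter

def dice_checks(s, max_spread=0.05, min_repeat_len=20):
    """Check the two observations on a string of '1'..'6' characters:
    (a) the six symbols have very similar relative frequencies, and
    (b) no substring of length >= min_repeat_len occurs twice."""
    freqs = Counter(s)
    rel = [freqs[d] / len(s) for d in "123456"]
    flat = max(rel) - min(rel) <= max_spread
    seen = set()
    repeats = False
    for i in range(len(s) - min_repeat_len + 1):
        chunk = s[i:i + min_repeat_len]
        if chunk in seen:
            repeats = True
            break
        seen.add(chunk)
    return flat, not repeats
```

Note that `"123456" * 1000` passes the frequency check but fails the no-repeats check, which is exactly why both observations together are needed.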
(11-04-2026, 11:55 AM)Rafal Wrote: Let's take a hypothetical long string of the numbers 1-6 written in a medieval book. We made calculations and observed that:
- all numbers have very similar frequencies
- there aren't any longer repeating sequences
In such a case we can propose a scenario that it was probably generated by throwing a six-sided die.

That would be a possibility, but those observations do not prove it, nor even make it likely. The sequence of digits of Pi in base 6 has those properties too, but it is not random. (The same holds for sqrt(2), and for the digits of most irrational real numbers in any base.)

And if you take any text, even a very repetitive one, and encode it with the Vigenère cipher using the digits of Pi as the key, you will get a ciphertext that has those same properties too.
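A minimal sketch of that construction, assuming a plain A-Z alphabet and using the decimal digits of Pi as the running key (the function name is made up):

```python
PI_DIGITS = "31415926535897932384626433832795028841971693993751"

def pi_vigenere(plaintext):
    """Vigenere over A..Z using successive decimal digits of Pi as the
    running key of shifts (0..9)."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = int(PI_DIGITS[i % len(PI_DIGITS)])
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)
```

Even the maximally repetitive plaintext "AAAA..." comes out as "DBEBFJ...", spread over many distinct letters, so simple frequency tests no longer betray the repetition.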

(When I was getting my Masters in applied math, a friend of mine was doing his thesis on a simple process for generating infinite strings that were "cube-free" -- that did not have substrings that repeat three times in a row.  For instance the string 0010100110100110100110101 is not "cube-free" because it contains 010011 three times in a row.  The problem was inspired by a rule of chess that declares a draw if the players repeat the same sequence of moves three times in a row.

I forgot the details, but it was something like this. You need at least three letters in the alphabet, say A, B, C. You start with a single A and then do repeated passes where each letter is replaced by a specific string of those three letters, like A -> ABCA, B -> CA, C -> BAC. Thus you get A, then ABCA, then ABCACABACABCA, and so on. I just guessed these particular replacements and they probably don't work, but with the right rules one gets arbitrarily long strings that are cube-free. Possibly they even have equal numbers of As, Bs, and Cs.

This "iterated verbose substitution cipher" process is very simple procedure that someone like Tartaglia or Fibonacci could have thought of and played around with, well before the 15th century.  It is less complicated than the process that was used by the anonymous nerd who created those tables that baffled John Dee (although, AFAIK, those were probably created in the 16th century). The point is that it is actually easier to generate non-random deterministic strings that "look random" under simple statistical test, than to generate truly random sequences...

At some point the other Masters students in the department gave that guy, as a birthday present, a fake Master's thesis, properly formatted and bound -- including the official cardboard thesis cover, fake examiners' signatures, etc. -- titled "The Problem of the Abacas", with every one of its 100 pages filled with a random string of As, Bs, and Cs.)
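The iterated-substitution idea can be made concrete with a substitution that is known to work: the rules guessed above probably fail (as the post itself notes), but the classic Thue-Morse map 0 -> 01, 1 -> 10 provably produces cube-free binary strings; three letters are only needed for the stronger square-free property. A small sketch with a brute-force cube check:

```python
def thue_morse(passes):
    """Iterate the substitution 0 -> 01, 1 -> 10 starting from '0'.
    The resulting strings are overlap-free, hence cube-free."""
    s = "0"
    for _ in range(passes):
        s = "".join("01" if c == "0" else "10" for c in s)
    return s

def has_cube(s):
    """Brute-force: does s contain some substring repeated 3 times in a row?"""
    n = len(s)
    for length in range(1, n // 3 + 1):
        for i in range(n - 3 * length + 1):
            if s[i:i + length] == s[i + length:i + 2 * length] == s[i + 2 * length:i + 3 * length]:
                return True
    return False
```

Eight passes already give a 256-character string with no cube anywhere, while its 1-gram and 2-gram statistics are perfectly balanced: deterministic, yet "random-looking" under simple tests.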
Quote:The sequence of digits of Pi in base 6 has those properties too, but it is not random

Well, I must agree with that.

I still hope that if the Voynich text is semi-mechanically generated and without meaning, then it may at some moment be possible to prove it.
But it may indeed be hard.
I did a new experiment today, not with the model, but trying to focus on the generation of similar words, starting with daiin, the most frequent word in the MS.

I started from a simple version of the similar-token generation idea. If words like daiin are produced by making small variations of nearby forms, then its closest variants, the words at Levenshtein distance ≤ 1, should also tend to appear in similar surroundings. So I took the immediate neighbours of daiin in form, and I compared the words that come before and after them. I also allowed context to continue across line breaks when the paragraph continues, so the comparison was not artificially cut by the layout.

The first result was negative for a naïve version of the idea. The exact neighbouring words of these variants are usually not very similar to the exact neighbouring words of daiin. In other words, just because two words look almost the same does not mean they sit in the same local context.

Then I relaxed the test. Instead of asking whether the neighbours are exactly the same, I asked whether the neighbours are themselves also close variants, again with Levenshtein ≤ 1. This is a fairer test for a model based on repeated local variation. Under this looser comparison, the overlap rises a lot. So the variants of daiin do not usually share the same exact neighbours, but they often share neighbours from the same broad family of similar forms.
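A toy version of this exact-versus-fuzzy neighbour comparison (illustrative only; the real experiment presumably weights by frequency and handles line breaks as described above):

```python
def lev(a, b):
    """Levenshtein edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def neighbours(tokens, word):
    """Sets of words immediately before / after each occurrence of word."""
    left, right = set(), set()
    for i, t in enumerate(tokens):
        if t == word:
            if i > 0:
                left.add(tokens[i - 1])
            if i + 1 < len(tokens):
                right.add(tokens[i + 1])
    return left, right

def context_overlap(tokens, target, fuzzy=False):
    """For each Levenshtein<=1 variant of target, the fraction of its
    left/right neighbours that match target's neighbours: exactly, or
    (fuzzy mode) up to Levenshtein<=1."""
    types = set(tokens)
    tl, tr = neighbours(tokens, target)

    def match(a_set, b_set):
        if not a_set or not b_set:
            return 0.0
        if fuzzy:
            hits = sum(1 for a in a_set if any(lev(a, b) <= 1 for b in b_set))
        else:
            hits = len(a_set & b_set)
        return hits / len(a_set)

    result = {}
    for v in types:
        if v != target and lev(v, target) <= 1:
            vl, vr = neighbours(tokens, v)
            result[v] = (match(vl, tl), match(vr, tr))
    return result
```

Running it in exact mode and then in fuzzy mode on the same token stream reproduces the qualitative pattern described here: low exact overlap, much higher fuzzy overlap.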

That is the important point. A free and almost random variation model would suggest that close variants should circulate in almost the same environments. What I see is more constrained than that. The system does seem to keep words inside a broader field of similar contexts, but not in a loose or fully interchangeable way. Some variants are clearly closer to daiin than others, and they are closer in different directions.

In particular, dain stands out more on the previous-word side, while aiin stands out more on the next-word side. So even inside the immediate neighbourhood of daiin, the closest forms are not behaving in exactly the same way. They are partly related, but they are not simple drop-in substitutes.

Comparison | Main result | Interpretation
Exact neighbours of daiin vs Lev<=1 variants | Usually low overlap | Very similar word forms do not usually share the same exact local context
Fuzzy neighbours, allowing Lev<=1 also for the context words | Overlap rises strongly | The contexts are not identical, but they often belong to the same broader family of similar forms
dain vs daiin | Closer on the previous-word side | dain seems more tied to similar left contexts
aiin vs daiin | Closer on the next-word side | aiin seems more tied to similar right contexts
Overall reading | Partial support, not full support | The text may use local families of related forms, but not as free random substitutions

A simpler way to say it:

Changing a word into a very similar one does not give you the same neighbours.
But it often gives you neighbours that also look similar.

So the system is not freely swapping words. It stays inside small groups of related forms.