The Voynich Ninja

Full Version: About the generation of similar words
Pages: 1 2 3 4
Thank you for the detailed analysis. 

A few comments:

Your finding 2 — that similar words cluster in the same vertical area but not in exact columns — is consistent with an observation from Timm (2014). My results point in the same direction as your results: the scribe's eye tends to scan upward rather than strictly horizontally, producing local vertical clustering without exact column alignment.

Your finding 4 is interesting: the first word in a burst tends to be rare (~6) while the variants are frequent (~70). This could be an artifact of the definition of "core" as the first word in the sequence. If the scribe generates text by modifying words already visible on the page, the common words are the established source vocabulary — they are frequent precisely because they have been copied and modified many times. A rare word appearing near them is more likely to be a new modification the scribe just produced from those common forms. In other words, the frequent words may be the actual sources and the rare first word the output, which would invert the direction of generation relative to your burst model.

Your overall conclusion — "some kind of local variant generation mechanism" with "short textual memory" — is consistent with the self-citation model described in Timm (2014) and Timm & Schinner (2020).
(16-03-2026, 05:55 PM)quimqu Wrote: Only approximately between 5% and 10% of the bursts have a core that is more frequent than its members. That is, on average the variants are much more frequent than the initial word of the burst.

Maybe the core tends to be less common because it is weirder than the peripheral members?  Like, for English, the core of the words 'this', 'that', 'these', 'thus', 'those' may be 'ths' -- if this word occurs even once, e. g. because of a typo?

Is the core generally shorter than the peripheral members?

It is good that you are now analyzing words instead of just digraphs.  But now beware that the occurrence patterns can be radically different for different words, because they have different meanings and functions.  

So don't focus only on patterns and statistics that are common to all words.  Pay attention to individual words (or word clusters) that deviate from the general patterns... 

All the best, --stolfi
(16-03-2026, 05:55 PM)quimqu Wrote: If anyone can think of other controls or tests that could help to better discriminate between possible models, they would be most welcome.

If you have the time, could you do a cluster analysis on [link]? (You must download it, not copy-paste it, because it is UTF-8 but our WWW server displays it as ISO-Latin-1, which turns it into gibberish.)

I could provide an alternative version that uses numeric suffixes to indicate tone, instead of diacritics.

All the best, --stolfi
Quote:If anyone can think of other controls or tests that could help to better discriminate between possible models, they would be most welcome.


For some time I have wanted to suggest a test.

People often say that Voynichese is similar to numbers, especially Roman numerals with sequences like CCC and III.

So what if we made a code with Roman numerals?

Let's take some text, maybe the Book of Genesis. Assign Roman numerals to words: I, II, III, ..., LXXI, LXXII, and so on. If a word repeats, reuse its numeral, of course.

It would work as a huge code cipher and may be the closest thing to Voynich text I can imagine.
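
The proposed test text could be produced along these lines. This is only a sketch: the function names are made up for illustration, words are split on whitespace and lowercased (an assumption, the post does not say how to tokenize), and numerals are assigned in order of first appearance.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral (subtractive notation)."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def roman_codebook_encode(text: str) -> tuple[list[str], dict[str, str]]:
    """Replace each distinct word by the numeral of its first-appearance rank."""
    codebook: dict[str, str] = {}
    encoded = []
    for word in text.lower().split():
        if word not in codebook:
            # New word: give it the next unused number.
            codebook[word] = to_roman(len(codebook) + 1)
        # Repeated words reuse the numeral already assigned.
        encoded.append(codebook[word])
    return encoded, codebook


encoded, codebook = roman_codebook_encode(
    "In the beginning God created the heaven and the earth")
print(" ".join(encoded))
```

Because consecutive new words receive consecutive numerals, neighbouring code words differ by small edits, which is the clustering effect the test is meant to probe.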

Would you be interested to test such text if I made and uploaded it here?
I believe it could also make clusters of similar words as you would assign subsequent numbers to plaintext words.

And one question - do you need the text broken into pages?




It would make possibly the mi
(16-03-2026, 08:58 PM)Rafal Wrote: Would you be interested to test such text if I made and uploaded it here?
I believe it could also make clusters of similar words as you would assign subsequent numbers to plaintext words.
And one question - do you need the text broken into pages?

I can give it a try. If you think the pages can have some sort of effect, then yes, give me the code by pages. I am not sure what will be found. The thing is that Roman numerals use only 7 characters, while Voynich has more than 20 characters (apart from the special ones)...
(16-03-2026, 05:55 PM)quimqu Wrote: 4. Regarding René's question: René was asking if the “core word” tends to be more frequent than the variants.

In this study, I defined the core of the burst as the first word that appears in the sequence, because the goal is to see if the variants are generated from previous words in the text flow. Defining the core as the medoid (the word with the smallest total distance to the set), for example, completely eliminates directionality and is therefore not suitable for this type of question.

With this definition the results are:

Number of bursts analyzed: 3774
average frequency of the core ≈ 6
average frequency of the variants ≈ 70

Only approximately between 5% and 10% of the bursts have a core that is more frequent than its members.
That is, on average the variants are much more frequent than the initial word of the burst.

That is perhaps unexpected, and raises the question: what is the reason that the variant word appeared at that point?

Was it because it is a variant of a previous core word?
Or was it because it is a relatively frequent word that is therefore likely to appear anyway?

The latter has to be the preferred case.

(or the text is backwards...)
(16-03-2026, 07:44 PM)Jorge_Stolfi Wrote: Is the core generally shorter than the peripheral members?

So don't focus only on patterns and statistics that are common to all words.  Pay attention to individual words (or word clusters) that deviate from the general patterns... 

Hi Jorge,

As for whether the core tends to be shorter than the variants, the results do not indicate this. On average, the length of the core and the members of the burst is very similar, and the core is even slightly longer in some cases. What does appear clear is that many variants are exactly the same length as the core. This suggests that a good part of the variants are formed by small internal changes within the word, rather than by adding material to it.

It also seems useful to look at specific families of words that deviate from the general behavior. When doing this, it becomes clear that not all families function in the same way. Some remain almost always the same length and seem to rely mainly on internal substitutions, while in other families the variants tend to be longer. This suggests that the mechanism of variation is not uniform across the entire vocabulary of the manuscript.

Note again that I am using the CUVA transliteration, where some EVA glyphs are grouped as a single glyph.
Quote:I can give a try.

Thanks! I can see you don't quite get what I am trying to generate, but things will become clearer when the text is ready :)
Hello again (warning, this post is long!),

I have been wondering if the method I have been using really gives us the desired information. I think the definition of core and the generation of bursts was perhaps not the best for trying to understand the generation of similar words, so I have reconsidered certain points. Forgive the length of this post but I think it is quite interesting.

About the bursts: before this post, the burst detection followed a fairly natural idea: the burst starts with the first word and the following are considered variations of it. This is simple and works well for segmenting the text, but it has one major problem: it assumes that the first word is the origin of the others. And this does not have to be true.

If the text is generated by modifying words already written on the page, it is most likely that the most frequent words will be the ones that form the basis. Rare words could, on the other hand, be the end result of a transformation. This means that the core of a burst cannot be defined simply as the first word.

From here, the first decision I made was to separate two things that until now I had mixed up: detecting which words form the same group, and understanding how they are generated from each other.

New way to detect bursts: bursts are detected within each page, but with two restrictions:

- Only words with a minimum length of 2 in CUVA are considered, to avoid noise (note that a length of 2 in CUVA can easily correspond to a length of 3, 4 or 5 in EVA).
- And, above all, bursts are not defined with a fixed window, but grow as long as there is continuity, and close when the page ends.

This ensures two important things: there are no overlaps, and all tokens within a burst are in the same real visual context. In addition, each word is stored with its exact position, line and position within the line. This allows the phenomenon to be analyzed in two dimensions, not just as a linear sequence.
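
A rough Python sketch of this kind of page-level detection follows. The post does not spell out what "continuity" means, so the rule used here is purely an assumption: a token continues the current burst if it is within one edit of some word already in it. The names `Token`, `detect_bursts`, `max_dist` and the sample words are hypothetical, and plain strings stand in for CUVA glyph sequences.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    line: int  # line number on the page
    pos: int   # word position within the line


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def detect_bursts(page: list[Token], min_len: int = 2, max_dist: int = 1) -> list[list[Token]]:
    """Split one page into non-overlapping bursts (assumed continuity rule)."""
    bursts, current = [], []
    for tok in page:
        if len(tok.word) < min_len:
            continue  # too short: ignored as noise
        if current and any(levenshtein(tok.word, m.word) <= max_dist for m in current):
            current.append(tok)  # continuity holds, the burst grows
        else:
            if len(current) >= 2:
                bursts.append(current)
            current = [tok]  # start a new candidate burst
    if len(current) >= 2:
        bursts.append(current)  # the page end closes the last burst
    return bursts


page = [Token("chol", 1, 1), Token("shol", 1, 2), Token("y", 1, 3),
        Token("chor", 2, 1), Token("daiin", 2, 2),
        Token("qokeedy", 3, 1), Token("qokedy", 3, 2)]
bursts = detect_bursts(page)
print([[t.word for t in b] for b in bursts])
```

Since the function is called once per page and every token is assigned to at most one burst, bursts cannot overlap or cross a page boundary, and each member keeps its line and position for the two-dimensional analysis.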

Redefinition and detection of the "core" of the burst: Instead of defining a single core, four different types have been defined, each with a different meaning.

- The first is the "anchor core" which is simply the first word of the burst. It serves to order and reference, but does not imply causality.
- The second is the "central core", which is the most formally representative word, the one that most resembles the rest.
- The third is the "frequent core", which is the most repeated form within the burst. This is interesting because it could represent the “base vocabulary” that the copyist reuses.
- The fourth is the most important: the "generative core". This is not defined by position or frequency, but by seeing which word best explains the rest if we assume that words are generated by small modifications.
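
The first three core types can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used for the results: "most resembles the rest" is read as the medoid under Levenshtein distance, plain strings stand in for CUVA glyph sequences, and the generative core is left out because it depends on the parent-child model. The name `burst_cores` and the sample burst are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def burst_cores(burst: list[str]) -> dict[str, str]:
    """Return the anchor, central and frequent core of one burst."""
    anchor = burst[0]  # first word of the burst, no causal claim
    distinct = list(dict.fromkeys(burst))  # distinct words, first-seen order
    # Central core: the word with the smallest total distance to all tokens.
    central = min(distinct, key=lambda w: sum(levenshtein(w, t) for t in burst))
    # Frequent core: the most repeated form within the burst.
    frequent = Counter(burst).most_common(1)[0][0]
    return {"anchor": anchor, "central": central, "frequent": frequent}


cores = burst_cores(["shor", "chol", "chor", "chol", "shol"])
print(cores)
```

In this toy burst the anchor is `shor`, while the central and frequent cores both come out as `chol`, which shows how the first word need not be the most representative one.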

The most important change I've made is not just redefining the core, but completely changing the model. Instead of saying "all words come from the core," what I do is, for each word, look for which other previous word is the best candidate to be its origin.

This creates a network of parent-child relationships within the burst.

Each relationship is calculated by combining three factors: formal similarity between words, spatial proximity within the page, and a slight preference for more frequent forms.

This model is much more realistic, because it allows generation to be progressive and local, not radial.
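
To make the idea concrete, here is a minimal Python sketch of choosing a parent for each word. The three factors come from the description above, but the way they are combined is an assumption: the weights `w_form`, `w_space`, `w_freq`, the Manhattan distance on (line, position) and the log-scaled frequency term are all placeholders, as are the names `Token`, `parent_score` and `assign_parents`.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    line: int  # line number on the page
    pos: int   # word position within the line


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def parent_score(child, cand, freq, max_freq, w_form=0.6, w_space=0.3, w_freq=0.1):
    """Higher is better. All weights are illustrative assumptions."""
    form = 1 - levenshtein(child.word, cand.word) / max(len(child.word), len(cand.word))
    space = 1 / (1 + abs(child.line - cand.line) + abs(child.pos - cand.pos))
    common = math.log1p(freq[cand.word]) / math.log1p(max_freq)
    return w_form * form + w_space * space + w_freq * common


def assign_parents(burst: list[Token]) -> dict[int, int]:
    """Map each token index to the index of its best earlier candidate."""
    freq = Counter(t.word for t in burst)
    max_freq = max(freq.values())
    return {
        i: max(range(i), key=lambda j: parent_score(burst[i], burst[j], freq, max_freq))
        for i in range(1, len(burst))
    }


burst = [Token("chol", 1, 1), Token("daiin", 1, 2),
         Token("shol", 2, 1), Token("chol", 2, 2)]
print(assign_parents(burst))
```

Only earlier tokens are allowed as parents, so the result is a forest of parent-child links in writing order rather than a star around one core.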

The results are quite clear.

[attachment=14700][attachment=14699]

The vast majority of parent-child relationships have a very low distance. In many cases they are exact copies or with a single modification. This indicates that the system does not generate words arbitrarily, but rather by small variations on existing forms.

When looking at the type of modification, the dominant pattern is very clear. Almost 40% are exact copies, and 30% are single-character substitutions. Insertions and deletions are much less frequent.

This suggests a very simple rule: first it is copied, and then, sometimes, it is modified.
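
For reference, one simple way to label a parent-child pair by modification type is sketched below in Python. The category names and the function names are hypothetical, plain strings stand in for CUVA glyph sequences, and anything beyond a single edit is lumped into "other".

```python
from collections import Counter


def classify_edit(parent: str, child: str) -> str:
    """Label how `child` differs from `parent`."""
    if parent == child:
        return "copy"
    if len(parent) == len(child):
        diffs = sum(p != c for p, c in zip(parent, child))
        return "substitution" if diffs == 1 else "other"
    if abs(len(parent) - len(child)) == 1:
        child_is_longer = len(child) > len(parent)
        longer, shorter = (child, parent) if child_is_longer else (parent, child)
        # A single indel: dropping one character of the longer word gives the shorter.
        if any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer))):
            return "insertion" if child_is_longer else "deletion"
    return "other"


def tally_edits(pairs: list[tuple[str, str]]) -> Counter:
    """Count modification types over (parent, child) pairs."""
    return Counter(classify_edit(p, c) for p, c in pairs)


pairs = [("chol", "chol"), ("chol", "shol"), ("okeedy", "qokeedy"),
         ("qokeedy", "okeedy"), ("chol", "daiin")]
print(tally_edits(pairs))
```

Dividing each count by the number of pairs gives the kind of percentage breakdown quoted above.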

Thanks to having the position of each word, the direction of generation can be studied.

[attachment=14698]

A significant portion of the relationships occur within the same line or very close together, often to the right. This fits with sequential writing.

But there are also many relationships between separate lines. This indicates that the system does not depend only on the immediately preceding word (within the burst, of course), but on a wider set of visible shapes on the page.

Furthermore, when the distance is greater, the probability of modification increases and the share of exact copies decreases. In other words, the further away the source is, the more it is transformed (although the effect is small).

[attachment=14696]

When comparing which type of core actually acts as the origin of words, the result is consistent. The first token of the burst (the anchor core) is not the best candidate. In contrast, the central and generative cores more often match the inferred parents. The frequent core also fits well, especially from a formal point of view.

[attachment=14697]

This reinforces the idea that the system is based on already established and reused forms, not on an expansion from a single starting point.

Provisional conclusion as of today: the change in model has been to move from a simple view, where a burst is an expansion around an initial word, to a dynamic view, where words are generated from other visible words through small modifications.

The results indicate a strongly structured system, based on copying, minimal variation and local reuse. It does not appear to be a random or purely linear process.

This opens the door to going one step further: identifying the specific rules of transformation and seeing if they can be formalized as a kind of productive grammar of the text.

I've started to address this issue and I'll give you a glimpse: when you look at the specific modifications, very consistent patterns emerge. There are substitutions that are repeated many times, often in specific positions within the word. It doesn't seem like a random process.

Even more interesting is the case of insertions and deletions. Most of them are concentrated at the beginning of the word. This suggests that a significant part of the variation occurs by adding or removing elements at the beginning.
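
A small Python sketch of how the position of a single insertion or deletion can be located. The function name is hypothetical and plain strings stand in for CUVA glyph sequences. When the extra character sits inside a run of identical characters the position is ambiguous, and this version reports the rightmost possibility.

```python
def indel_position(shorter: str, longer: str):
    """Index in `longer` of the single extra character, or None if the
    two words do not differ by exactly one inserted character."""
    if len(longer) - len(shorter) != 1:
        return None
    i = 0
    # Skip the common prefix; the extra character must sit right after it.
    while i < len(shorter) and shorter[i] == longer[i]:
        i += 1
    return i if longer[:i] + longer[i + 1:] == shorter else None


print(indel_position("okeedy", "qokeedy"))  # extra glyph at the start
print(indel_position("chol", "chols"))      # extra glyph at the end
```

Collecting these indices over all insertion and deletion pairs, optionally divided by word length, gives the positional distribution discussed here.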

Overall, this points to a reduced set of recurrent operations, not to free combinations.
@quimqu Your findings are very consistent with the results in Timm & Schinner 2020, "The generation of the Voynich manuscript text: Results of a statistical analysis" (Cryptologia, 44(5), pp. 387-420). Some specific points of overlap:

Regarding the core: You found that the first token of a burst is not the best candidate for the origin of words. We addressed this for page boundaries: "There was a similar problem for the author of the VMS every time he/she was starting a new (empty) page. In such a case it was probably useful to use another page as source. There is some evidence that the scribe preferred the last completed sheet for this purpose" (p. 11).

Regarding the types of modification: You found that almost 40% are exact copies and 30% are single-character substitutions, with insertions and deletions much less frequent. We formalized three rules (pp. 9-10):

1) Replace one or more glyphs by similar ones. For example, <chol> could be the origin of <shol>, <chor>, <char>, <chal> etc. by substituting similar ligatures.

2) Add or remove a prefix. The most common prefixes (<d->, <ch->, <ok->, <qok->) combine with various suffixes to produce much of the vocabulary (Table 3, p. 10).

3) Combine two source words to create a new word.
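
As a toy illustration of the three rules (not the procedure from the paper), the following Python sketch generates candidate variants from a source word. The prefix list is the one quoted above. The table of "similar" glyph pairs is a hypothetical minimum chosen so that the <chol> example works, and rule 3 is reduced to plain concatenation, which is an assumption.

```python
# Hypothetical similarity pairs, just enough to reproduce the <chol> example.
SIMILAR = [("ch", "sh"), ("o", "a"), ("l", "r")]
# Prefixes named in rule 2.
PREFIXES = ["d", "ch", "ok", "qok"]


def substitute_variants(word: str) -> set[str]:
    """Rule 1: replace one glyph (or ligature) by a similar one."""
    variants = set()
    for a, b in SIMILAR:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                variants.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    variants.discard(word)
    return variants


def prefix_variants(word: str) -> set[str]:
    """Rule 2: add a prefix, or remove one the word already starts with."""
    added = {p + word for p in PREFIXES}
    removed = {word[len(p):] for p in PREFIXES
               if word.startswith(p) and len(word) > len(p)}
    return added | removed


def combine(first: str, second: str) -> str:
    """Rule 3: combine two source words (assumed here: concatenation)."""
    return first + second


print(sorted(substitute_variants("chol")))
print(sorted(prefix_variants("aiin")))
```

Applying `substitute_variants` twice reaches <char> from <chol>, matching the "one or more glyphs" wording of rule 1.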

Regarding directionality: You found that relationships occur both within the same line and between lines, and that the system depends on "a wider set of visible shapes on the page." This suggests the scribe was influenced by what was visible above the current line, not only by the word just written.

Regarding the concentration of changes at word beginnings: You found that insertions and deletions concentrate at the beginning of words. This corresponds to our prefix operations (rule 2), which we identified as one of the three main mechanisms of text generation.

Your finding that "the further away the source is, the more it is transformed" is an interesting new observation that is not explicitly in our paper but is consistent with the model — a more distant source provides a less precise visual template for copying.