The Voynich Ninja

Full Version: About the generation of similar words
(18-03-2026, 03:27 PM)quimqu Wrote:
Metric          Voynich     Natural
lev≤1 giant     ~80%        5–34%
lev≤2 giant     ~93–94%     58–82%
segmentation    83–90%      0–60%

I am still in shock seeing these numbers... Huh

94% of the Voynich words are connected in a giant graph at a Levenshtein distance of ≤2 (at most two single-character edits: insertion, deletion, or substitution). The rest are all hapax legomena, and 90% of that remainder can be explained by dividing the word into smaller words that are all in the giant graph. This means that 99.4% of the words are connected... stunning
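A minimal sketch of how these two measurements (giant component under Levenshtein ≤2, plus segmentation of the leftovers) could be reproduced, assuming a plain vocabulary list. The word list and the one-cut segmentation rule are placeholders; a real run would use the full ~7.8k-type vocabulary.

Code:
# Build a Levenshtein <=2 word graph, measure its giant component, and test
# whether the leftover words split into two in-graph words.
import networkx as nx
from itertools import combinations

def lev(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

words = ["daiin", "aiin", "dain", "dal", "al", "ol", "chol", "cholaiin"]  # placeholder

G = nx.Graph()
G.add_nodes_from(words)
for a, b in combinations(words, 2):
    if abs(len(a) - len(b)) <= 2 and lev(a, b) <= 2:   # length gap > 2 can never pass
        G.add_edge(a, b)

giant = max(nx.connected_components(G), key=len)
print(f"vocab in giant component: {len(giant) / len(words):.1%}")

def splits_into(word: str, vocab: set) -> bool:
    """True if `word` is the concatenation of two in-vocabulary words."""
    return any(word[:i] in vocab and word[i:] in vocab for i in range(1, len(word)))

leftover = set(words) - giant
print("leftover explained by segmentation:", [w for w in leftover if splits_into(w, giant)])

On the toy list, the one isolated word cholaiin is recovered as chol + aiin, mirroring the 90%-of-hapax observation above.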
@quimqu

I found a tool that might aid in your work. I don't know if you are already using it, quimqu? The Voynich transcription parser!

[link]
(17-03-2026, 01:37 PM)nablator Wrote: This is an important fact to take into account: lines were not always written sequentially from top to bottom as any normal text would be. There are many instances of gallows intrusions where the text visibly curves upward to avoid a big gallows glyph on the next line. (This is a big indication of something fishy going on by the way.) So the earlier written words on the same page don't have to be on a line above or to the left on the same line.

This is very important, and I noticed the same back at the time when Gabriel and I did our transliteration activity.
On the other hand, it is not so easy to find many good examples to demonstrate this. 
In a few places it is clear that things were not written line by line, but I have not found sufficiently convincing evidence that this happened on any significant scale.
Best candidates are the biological and stars sections.

The best evidence for non-linear writing I have seen is in small vertical baseline jumps that appear in the same location on several consecutive lines, but that can also have another explanation.
Again, hard to find examples of it happening on any large scale...
(17-03-2026, 01:37 PM)nablator Wrote: There are many instances of gallows intrusions where the text visibly curves upward to avoid a big gallows glyph on the next line.

I know of one example in the Starred Parags section, which is part of a rather strange mess in the layout. That is followed a few lines later by an abrupt change of handwriting, apparently in the middle of a word. I fancy that a Scribe was fired for said mess and a new one hired to finish the page.

Since those examples are supposed to be the evidence proving that the Scribing was not "linear", it would be important to list them all in one place.

It may be that the Scribe struggled to understand the Author's handwriting and thus left blanks here and there, to be filled later after consulting the Author.

It may also be that the Scribe wrote a crooked line by accident, and then on the next line he chose to "puff up" some glyphs where he had enough space to do so.

Quote: [René]: The best evidence for non-linear writing I have seen is in small vertical baseline jumps that appear in the same location on several consecutive lines, but that can also have another explanation.

There is also that Herbal page where the Scribe apparently misunderstood a single-column text that was interrupted by the plant as being two columns, and wrote it that way.

All the best, --stolfi
(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, the question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- it is just the most frequent word, so it appears most frequently, or
- it is the word that most frequently results from a small (or zero) change from the recent words?
And then of course there is the question of whether, or how, we can possibly detect this.

Yes, in a strict sense, almost everything could be derived from "daiin" if you allow multiple small steps, because the whole system forms a single connected network. But that is not specific to "daiin". In fact, it is true for any word inside that network. From any node, you can reach any other through chains of small edits.

So the important question is not whether everything can be derived from “daiin”, but whether the system is actually centered around it. And the graph suggests it is not.

"daiin" is not especially central in the network. It does not directly connect to a large portion of the vocabulary (only 5%), and there are other words that are more central in terms of connectivity and average distance.

So while everything can be connected to "daiin", the same is true for any other word. The structure looks more like a dense web than a system expanding from a single root.

This makes it more likely that "daiin" is simply very frequent, rather than the main generative source.
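For what it's worth, the centrality claim can be checked directly. A sketch, reusing lev() from the earlier sketch on the same kind of toy word list (real figures would need the full vocabulary): degree share stands in for the "only 5%" figure, and closeness centrality for average distance.

Code:
# Is "daiin" structurally central, or just frequent? Builds the same lev<=2
# toy graph as above; lev() comes from the first sketch.
import networkx as nx
from itertools import combinations

words = ["daiin", "aiin", "dain", "dal", "al", "ol", "chol"]  # placeholder
G = nx.Graph()
G.add_nodes_from(words)
G.add_edges_from((a, b) for a, b in combinations(words, 2) if lev(a, b) <= 2)

# Direct-neighbour share of the vocabulary (the "only 5%" figure in the post):
print({w: round(G.degree(w) / (len(words) - 1), 2) for w in words})

# Closeness centrality rewards words with short average edit-chains to all others;
# if the system radiated from daiin, daiin should rank at or near the top.
closeness = nx.closeness_centrality(G)
ranking = sorted(closeness, key=closeness.get, reverse=True)
print("daiin ranks", ranking.index("daiin") + 1, "of", len(ranking), "by closeness")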

(18-03-2026, 06:02 PM)ReneZ Wrote: The background behind my thoughts is related to a question (doubt) I have about the autocopy method.

I’m actually not very convinced by the autocopy idea either.

From what I’m seeing in the CUVA data, almost all the vocabulary (around 94% of tokens) is inside the Levenshtein ≤2 network. That suggests a very dense and coherent system of variation, and not one in the form of a tree (there is no source node, at least none easily seen).

To me, that looks less like words being generated locally during writing by copying and slightly modifying the previous one, and more like a system where the word forms already belong to a sort of dictionary, and the writer is selecting from there (codebook?). In other words, it feels more like a predefined generative system than an on-the-fly autocopy process.

Also, the small set of words outside the main network are almost all hapax, and many of them can be explained as combinations of existing tokens from inside the network. That could point to cases where the author is no longer following the usual formation patterns and instead combines known pieces. ("Oh, I don't have this word in the dictionary now! Let's join some words and carry on.")

So my current intuition is that the structure we see is more consistent with a prior generated scheme and controlled variation, rather than simple local copying from one word to the next.
Here is a visualization of the word network that might be useful for the discussion: 
[link]

Each edge represents an edit distance of 1. To highlight words similar to <daiin>, <ol>, and <chedy>, different colors are used. All nodes for words that contain the glyph <i> are orange. Nodes of words ending in <d> or <y> are purple. Nodes of words containing <ol>, <or>, <ar>, or <al>, or ending with <am> or <os>, are green. All other nodes are blue. The size of a node reflects the number of times the token appears in the VMS.
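A small sketch of these coloring rules as described; the precedence between overlapping rules (e.g. a word that contains <i> and also ends in <y>) is an assumption.

Code:
# Node-colouring rules as described above. The rule order (i-words first,
# then d/y endings, then the ol/or/ar/al/am/os group) is an assumption
# about how overlaps are resolved.
def node_color(word: str) -> str:
    if "i" in word:                      # words containing the glyph <i>
        return "orange"
    if word.endswith(("d", "y")):        # words ending in <d> or <y>
        return "purple"
    if any(s in word for s in ("ol", "or", "ar", "al")) or word.endswith(("am", "os")):
        return "green"                   # the <ol>/<or>/<ar>/<al>/<am>/<os> group
    return "blue"                        # everything else

assert node_color("daiin") == "orange"
assert node_color("chedy") == "purple"
assert node_color("chol") == "green"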

The network graphs for Currier A and B are also available: [link] [link]

[attachment=14756]
The underlying Gephi file for Torsten's network analysis can be found here: [link]
(18-03-2026, 06:02 PM)ReneZ Wrote: Anyway, the question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- it is just the most frequent word, so it appears most frequently

I believe that it is indeed the most common word by itself, in all Herbal pages (A and B) as well as in the Starred Parags section. In fact it is one of the few words that has about the same frequency of occurrence in both Herbal-A and Herbal-B:

[attachment=14765]

(I have a good explanation for that but I am not allowed to say it here.)

All the best, --stolfi
I've been looking at the Voynich word graph (in CUVA, but I'm showing the examples in EVA here) and comparing it to various natural texts, and I think the difference is not so much in whether there is a giant component, but in how it is organized internally.

In the Voynich, almost all of the vocabulary is contained within a single giant component. This means that almost any word can be connected to another by very small changes (at most two edits). In natural texts this is also true for the most frequent tokens in the corpus, but it is not usually the case for almost all of the vocabulary (the set of distinct tokens).

When you look at the internal structure of the Voynich graph, three large, very dominant groups emerge. They are not small clusters but blocks that concentrate the majority of words and are very densely connected. A first group revolves around shapes like chedy, chey, qokeedy, qokeey, okey, with many small variations, especially at the end of the word. A second group is that of daiin and variants such as aiin, dar, dal, okaiin, otaiin, where there are many prefix changes and small substitutions. The third group is much more compact and short, with forms like ol, chol, or, cho, chor, which are combined almost systematically.

Group   EVA examples                          Comment
1       chedy, chey, qokeedy, qokeey, okey    Variation mainly in endings, many very close forms
2       daiin, aiin, dar, dal, okaiin         Prefix changes and small substitutions
3       ol, chol, or, cho, chor               Very compact system with short words


Furthermore, these groups are not completely separate. There are very short words like y, dy, or, al, ar that act as bridges between different areas of the graph, as if they were highly reusable pieces.
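That "bridge" reading is testable in graph terms: bridge words should score high on betweenness centrality. A sketch on the toy lev≤2 graph G built in the sketches above (whether quimqu used this measure is not stated):

Code:
# Do short words bridge regions of the graph? High betweenness centrality
# (share of shortest paths passing through a node) would support that.
import networkx as nx

bet = nx.betweenness_centrality(G)   # exact; pass k=<sample size> on large graphs
for w, score in sorted(bet.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w:>8}  betweenness={score:.3f}  length={len(w)}")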

When you compare this with natural texts, the picture is different. There are also large giant components, especially in long texts, but the giant component absorbs much less of the vocabulary and, above all, the internal structure is much more fragmented. Instead of a few large blocks, many smaller communities appear.

To see it better, here is a comparison with the corpora I tested, all normalized to a maximum of 50,000 tokens:

Corpus                 Tokens   Vocab   % vocab in giant   % tokens in giant   Communities
Voynich                ~38k     ~7.8k   ~94%               ~98%                10
Tirant lo Blanch       50k      5361    79.0%              94.6%               87
Culpepper              50k      4486    67.4%              91.2%               50
De docta ignorantia    37121    5974    58.5%              84.1%               68
Plato (Apologia)       8437     2754    59.3%              82.2%               32
Old Portuguese         4053     1387    60.7%              81.7%               18

The Tirant is the closest case to the Voynich in percentage terms, especially in tokens, but there are still clear differences. Although 79% of the vocabulary is connected, the network is divided into many more communities (more than eighty), not into three or four dominant blocks. In addition, many more words fall outside the giant component.

The general feeling is that in natural language you have many relatively independent word families, whereas in the Voynich it seems that almost all the vocabulary is built from a few very large nuclei, with small and very repetitive variations. It seems not so much a question of how connected it is, but of how compact and homogeneous this system is.
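For reference, a sketch of a per-corpus measurement that would produce the columns of the table above. The actual pipeline is not specified in the thread; this version assumes whitespace tokenization, reuses lev() from the first sketch, counts communities inside the giant component, and uses networkx's Louvain implementation, any of which may differ from what was actually used.

Code:
# Per-corpus statistics: giant-component coverage and community count.
import networkx as nx
from collections import Counter
from itertools import combinations
from networkx.algorithms.community import louvain_communities

def corpus_stats(text: str, max_tokens: int = 50_000) -> dict:
    tokens = text.split()[:max_tokens]          # normalize corpus length
    counts = Counter(tokens)
    vocab = list(counts)

    G = nx.Graph()
    G.add_nodes_from(vocab)
    for a, b in combinations(vocab, 2):
        if abs(len(a) - len(b)) <= 2 and lev(a, b) <= 2:
            G.add_edge(a, b)

    giant = max(nx.connected_components(G), key=len)
    comms = louvain_communities(G.subgraph(giant), seed=0)
    return {
        "tokens": len(tokens),
        "vocab": len(vocab),
        "% vocab in giant": round(100 * len(giant) / len(vocab), 1),
        "% tokens in giant": round(100 * sum(counts[w] for w in giant) / len(tokens), 1),
        "communities (in giant)": len(comms),
    }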
Regarding the system being compact: I think it has always been Torsten's position that the algorithm he gives is merely a computer adaptation of a possible simple scheme, intended for testing statistics.

Humans would likely not have a set-in-stone algorithm for autocopying, but would often be influenced by other factors. It may just come down to a vague set of aesthetic preferences that, when filtered through a simple autocopying scheme, gives emergent behavior that looks schematic.

Certainly the participants of the C&B experiment sometimes adopted the autocopying scheme, but statistically they generated a wide array of results - showcasing that personal preferences and choices (quasirandom though they may be, they are still choices) influence the resulting product. An ordered person might find something with the appearance of structure aesthetically pleasing.