The Voynich Ninja

Full Version: About the generation of similar words
Pages: 1 2 3 4 5 6 7 8
@quimqu

I like your idea, and I am not sure whether you are suggesting that the Voynich might not be a language. How could the Voynich be a language under the method you are proposing? What I have noticed is that in and around plant stems there are shorter vords. Would your system break down if tested near them? I think it would, and I don't think you are wrong. I'm betting that the vords might not mean anything.
(18-03-2026, 02:05 AM)oeesordy Wrote: What I have noticed is that in and around plant stems are shorter words.

The 1-3 words before plant stems should be shorter than average, for the same reason that the 1-3 words at the end of each line should be.  Namely, the Scribe would be more likely to break the line just before a long word than before a short one.  

Conversely, the first word after a plant stem or line break should be longer than average. 

And, IIUC, several people noticed both anomalies around line breaks long ago.  But, until recently, they did not realize that these were a simple consequence of the trivial line-breaking algorithm.

There may be other factors at play, like a possible tendency of the Scribe to break lines before certain key words, or at the places where the draft itself had line breaks (even if the Scribe was supposed to ignore them).  But the natural bias above seems to account for a good part of the observed anomalies.

All the best, --stolfi
(18-03-2026, 02:05 AM)oeesordy Wrote: What I have noticed is that in and around plant stems are shorter words.

(18-03-2026, 05:28 AM)Jorge_Stolfi Wrote: The 1-3 words before plant stems should be shorter than average, for the same reason that the 1-3 words at the end of each line should be.  Namely, the Scribe would be more likely to break the line just before a long word than before a short one.  

Conversely, the first word after a plant stem or line break should be longer than average. 

There are, in fact, particular Voynichese "words" that have a strong affinity for certain positions (such as immediately before or after plant drawing intrusions, as well as at the beginning and ending of lines) -- and others that have a strong aversion to certain positions.

I reported a detailed study, showing the statistical significance of these observations, in a paper at the International Conference on Historical Cryptology in 2024.

(Due to length limitations for the research paper, the results actually shown in the paper are a truncated version of the full analysis.)
From yesterday's analysis it seemed clear that bursts run transversally throughout the Voynich. The theory was that perhaps the core of a burst, the word from which the similar tokens emerged, might be found a few pages earlier in the manuscript. This made me think that limiting burst detection to one page was perhaps wrong, and I wanted to see what happens when I detect bursts with Levenshtein = 1 or Levenshtein = 2 throughout the whole text.

The surprise was these data (maybe it is only a surprise to me and this was already studied; sorry if so):

metric          Voynich    natural languages
lev ≤ 1 giant   ~80%       5–34%
lev ≤ 2 giant   ~93–94%    58–82%
segmentation    83–90%     0–60%

In the Voynich, whether in EVA or CUVA, we have a giant burst, in which word changes have a Levenshtein distance of 1, covering about 80% of the tokens in the corpus, while in natural languages only between 5% and 34% of the tokens are covered. With Levenshtein = 2 the coverage increases to 93–94% in the Voynich and to between 58% and 82% in natural languages.

What's more, of the tokens that fall outside the 93–94% giant burst in the Voynich at Levenshtein = 2, 83% (EVA) and 90% (CUVA) can be segmented into real subwords. That is, practically all of the tokens in the manuscript can be explained by paths of Levenshtein distance at most 2, which does not happen in natural language.
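The "giant burst" coverage can be sketched as a connected-component computation over word types, with an edge whenever two types are within the Levenshtein threshold. This is my own minimal reconstruction on a handful of made-up EVA-like tokens, not the poster's actual pipeline or real Voynich data:

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def giant_component_coverage(tokens, max_dist=1):
    """Fraction of tokens whose type lies in the largest connected
    component of the 'Levenshtein <= max_dist' graph over word types."""
    types = sorted(set(tokens))
    parent = {t: t for t in types}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(types, 2):
        # length difference is a cheap lower bound on edit distance
        if abs(len(a) - len(b)) <= max_dist and levenshtein(a, b) <= max_dist:
            parent[find(a)] = find(b)
    comp_tokens = Counter(find(t) for t in tokens)
    return max(comp_tokens.values()) / len(tokens)

# Toy sample (invented tokens, for illustration only):
sample = ["daiin", "daiin", "dain", "aiin", "okaiin",
          "chedy", "shedy", "chey", "qokeedy"]
print(round(giant_component_coverage(sample, 1), 2))
```

Raising `max_dist` from 1 to 2 merges more types into the giant component, which is the effect reported above for the full corpus.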

This has made me think that perhaps we are facing the type of cipher that Rafal proposed to me with the Roman numerals: a codebook cipher. A natural-language → Voynich dictionary was created prior to writing, and the text was then replaced word by word. The generation of the Voynich words was mechanical, by a system of substitution, addition, or deletion, and each was assigned beforehand to a natural word from a dictionary. It is an option that I am currently exploring and will post about in a new thread apart from this one.
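A mechanical generation scheme of this kind can be sketched very simply: derive each new codebook entry by one random edit (substitution, insertion, or deletion) from the previous entry. Everything here is hypothetical for illustration: the alphabet, the seed word, and the plaintext words are my own choices, and a real codebook would also need to avoid collisions between entries.

```python
import random

# Rough subset of EVA letters, chosen arbitrarily for this sketch
ALPHABET = "acdeiklnoqsty"

def mutate(word, rng):
    """Apply one random edit (substitute, insert, or delete) to a word."""
    ops = ["sub", "ins", "del"] if len(word) > 2 else ["sub", "ins"]
    op = rng.choice(ops)
    i = rng.randrange(len(word))
    if op == "sub":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "ins":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    return word[:i] + word[i + 1:]

def build_codebook(plain_words, seed_word="daiin"):
    """Assign each plaintext word a mechanically derived cipher word,
    each entry one edit away from the previous one."""
    rng = random.Random(0)
    book, current = {}, seed_word
    for w in plain_words:
        current = mutate(current, rng)
        book[w] = current
    return book

book = build_codebook(["noah", "ark", "flood"])
print(book)
```

By construction, consecutive codebook entries are within Levenshtein distance 1 of each other, so a text that introduces related plaintext words close together would produce exactly the clusters of similar cipher words discussed in this thread.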
(17-03-2026, 01:30 PM)quimqu Wrote: René, in the earlier version of the analysis, I defined the “core” simply as the first word of the burst. Under that operational definition, the first word was indeed usually rarer than the other members. But the newer analysis suggests that this first word is often just a temporal anchor, not the best candidate for the actual source form.

I am trying to understand this. 
The earlier result was a clearly statistical outcome. Is that still the case in the newer analysis, or is that more like an impression?
Shouldn't the cases where a second core word is a more frequent word have been included in the overall statistics shown before? In that case they are still a minority.
This depends on how the search was done. After having completed the analysis of any (potential) core word, is the next word to be tested the word immediately following it, or is the search continued at the end of the string of the core with its variants?
(18-03-2026, 03:32 PM)ReneZ Wrote: The earlier result was a clearly statistical outcome. Is that still the case in the newer analysis, or is that more like an impression?

The key point is that what changed is not the data, but the definition of what I call the "core."

In the earlier analysis, the "core" was defined operationally as the first word of the burst (supposition A: the writing and generation of the burst started at the first burst word of the page). Under that definition, the result was clearly statistical: the first word tends to be rarer, simply because it is the earliest occurrence, not necessarily the generative source (supposition A may be wrong). In the newer analysis, the core is no longer fixed as the first word, but is instead selected based on how well it explains the other members of the burst.

To answer your question "Is that still the case in the newer analysis, or is that more like an impression?": it is still statistical. The difference is that now the statistic is computed over a different definition of the core. When we allow any member of the burst to be the core, and select it based on its relationship to the others, we observe that in a substantial number of cases the best candidate is not the first word, and is often more frequent. So this is not an impression, but a result that depends on how the core is defined and selected.

(18-03-2026, 03:32 PM)ReneZ Wrote: Shouldn't the cases, where a second core word is a more frequent word, have been included in the overall statistics shown before? In that case they are still a minority.

In the earlier analysis, those cases were not treated as alternative cores, so they were not counted in that way. In the newer analysis, they are explicitly considered.

To answer you: yes, they are still a minority. In some bursts, the first word remains a reasonable candidate. However, there is a consistent subset of cases where another word provides a better explanation of the rest of the burst, and this subset is large enough to be statistically visible, not just anecdotal.

So the result is something like "the first word is not always the best core, and this happens often enough to matter". This is why I say the newer result is still statistical: it is based on counting how often the best core differs from the first word, under a different definition of core.

(18-03-2026, 03:32 PM)ReneZ Wrote: This depends how the search was done. After having completed the analysis of any (potential) core word, is the next word to be tested the word immediately following it, or is the search continued at the end of the string of core with variants?

In the analysis I showed yesterday, the search was sequential: starting from a word, building a burst around it, and then continuing after that burst. So in that sense it corresponds to your first option.
This means that each word is only used once as a starting point, and words inside a burst are not re-tested as independent cores. That is why the first word naturally becomes the reference in that setup.
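One possible reading of that sequential procedure can be sketched as follows. This is my own reconstruction under an explicit assumption: each following token is compared against the burst's core (its first word), and the scan resumes after the burst, so no token inside a burst is ever re-tested as a core. The actual implementation may differ in its details.

```python
def lev(a, b):
    """Edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sequential_bursts(tokens, max_dist=1, min_len=2):
    """Scan left to right: the first unconsumed token becomes the core,
    following tokens join the burst while within max_dist of the core,
    and the scan resumes after the burst (cores are never re-tested)."""
    bursts, i = [], 0
    while i < len(tokens):
        core, j = tokens[i], i + 1
        while j < len(tokens) and lev(core, tokens[j]) <= max_dist:
            j += 1
        if j - i >= min_len:
            bursts.append(tokens[i:j])
        i = j
    return bursts

# Invented token stream for illustration:
stream = ["daiin", "dain", "daiin", "chedy", "shedy", "chey", "okaiin"]
print(sequential_bursts(stream, max_dist=1))
# -> [['daiin', 'dain', 'daiin'], ['chedy', 'shedy', 'chey']]
```

Relaxing the constraint, as described below, would mean re-running the inner comparison with each member of a detected burst as the candidate core and keeping the one that best explains the rest.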

In the newer version, the idea is to relax that constraint and allow words inside the burst to be reconsidered as potential cores, which can lead to a different result.

As a side note (related to my post 5 minutes before your last one), when looking at the whole manuscript the structure seems much more interconnected, which suggests that this sequential approach may indeed be too restrictive.
OK, thanks. That clarifies my question.
I'll need to think about this a bit more.

The background behind my thoughts is related to a question (doubt) I have about the autocopy method.
I thought that your experiment could shed some light on that question, but perhaps it does not. (I'm not entirely sure about that.)

Anyway, this question is: given that daiin is the most frequent word in the overall text, is the reason that it appears most because:
- this is just the most frequent word, so it appears most frequently
- this is the word that most frequently results from a small (or zero) change from the recent words.
And then, of course, the question of if or how we can possibly detect this.
Quote: This has made me think that perhaps we are facing a type of cipher that Rafal proposed to me with the Roman numerals. A codebook cipher.

Thanks for looking at the codebook cipher option!

Let me share my position on it.
Actually, I don't believe that the Voynich is a codebook cipher, but I don't exclude it. I would give it about a 10% chance or so.

A codebook cipher is a fairly strong candidate, better than a lot of options such as simple substitution of a European language (doesn't work), a homophonic cipher (not enough symbols), a syllabary, an abugida, or more advanced ciphers like Vigenère, and so on.

It is also not anachronistic; we don't have examples, but it seems easy enough to have been used in the 1400s.

It would be very laborious both at encoding and decoding, but I could even imagine some slightly autistic man working 5 years on Voynich Manuscript in his cell not for any practical reason but just out of passion.

The main problem I can see is that the text encoded with codebook cipher should have exactly the same structure as plain natural language. There should be nouns, verbs, conjunctions and a lot of cribs. But we don't have these.

Unless it is more advanced, and words may stand for full words, syllables, or letters, like in the Great Cipher.
But that becomes anachronistic.

So I would check first the simplest option:  one code=one word.

If we want some confirmation of the "autocitation model", we should check whether we can get the same results with a codebook cipher.

My intuition is that codebook cipher may generate a lot of similar words clustered together.
Imagine some new story begins in the Bible, let's say about Noah. Several new words are introduced - Noah, Ark, flood... 
These words could get close numbers - 1221, 1222, 1223...
That would give very similar-looking Roman numerals: MCCXXI, MCCXXII, MCCXXIII.
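The numeral example can be checked directly: consecutive code numbers yield Roman numerals that differ by a single edit. This short sketch (my own, just to verify the intuition) converts the numbers from the example and measures the edit distance between consecutive codes.

```python
def to_roman(n):
    """Integer to Roman numeral (valid for 1..3999)."""
    vals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
            (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
            (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for v, s in vals:
        while n >= v:
            out.append(s)
            n -= v
    return "".join(out)

def lev(a, b):
    """Edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

codes = [to_roman(n) for n in (1221, 1222, 1223)]
print(codes)                                      # ['MCCXXI', 'MCCXXII', 'MCCXXIII']
print([lev(a, b) for a, b in zip(codes, codes[1:])])  # [1, 1]
```

So a codebook that assigns nearby numbers to words introduced together would indeed produce clusters of near-identical cipher words, just like the bursts discussed in this thread.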

I wonder how it would behave.
@Rafal, start a new thread on the Codebook cipher idea, it seems like an idea well worth discussing.
(18-03-2026, 09:42 PM)RobGea Wrote: @Rafal, start a new thread on the Codebook cipher idea, it seems like an idea well worth discussing.

I am working on a cipher that is giving me some good results, which is partially a codebook. The cipher originated from the results of this thread. I hope to post something by tomorrow.