The Voynich Ninja

Full Version: About the generation of similar words
(16-03-2026, 11:44 PM)ReneZ Wrote: That is perhaps unexpected, and raises the question: what is the reason that the variant word appeared at that point?

Was it because it is a variant of a previous core word?
Or was it because it is a relatively frequent word that is therefore likely to appear anyway?

The latter has to be the preferred case.

(or the text is backwards...)

René, in the earlier version of the analysis, I defined the “core” simply as the first word of the burst. Under that operational definition, the first word was indeed usually rarer than the other members. But the newer analysis suggests that this first word is often just a temporal anchor, not the best candidate for the actual source form.

When I compare alternative core definitions, the picture changes. The generative core is more frequent than the other burst members in about 72% of bursts, and the frequency core in about 86%. More importantly, in the inferred parent-child graph, the parent is more frequent than the child in about 33.5% of cases and equally frequent in about 41.6%, so in roughly 75% of the links the parent is at least as frequent as the child.

So the old result still holds descriptively for the first word, but I would no longer interpret it causally. The newer evidence is more consistent with the idea that the scribe often generated forms from already established, relatively frequent words on the page.
(Yesterday, 01:01 PM)quimqu Wrote: About the bursts: before this post, the burst detection followed a fairly natural idea: the burst starts with the first word and the following are considered variations of it. This is simple and works well for segmenting the text, but it has one major problem: it assumes that the first word is the origin of the others. And this does not have to be true.

This is an important fact to take into account: lines were not always written sequentially from top to bottom as any normal text would be. There are many instances of gallows intrusions where the text visibly curves upward to avoid a big gallows glyph on the next line. (This is a big indication of something fishy going on by the way.) So the earlier written words on the same page don't have to be on a line above or to the left on the same line.
(Yesterday, 01:37 PM)nablator Wrote: This is an important fact to take into account: lines were not always written sequentially from top to bottom as any normal text would be. There are many instances of gallows intrusions where the text visibly curves upward to avoid a big gallows glyph on the next line. (This is a big indication of something fishy going on by the way.) So the earlier written words on the same page don't have to be on a line above or to the left on the same line.

This is a good and curious point... I remember I posted some other strange issues I detected.
Based on one of the results of the last analysis, the fact that a large part of the insertions and deletions are concentrated at the beginning of the token, I wanted to check if this could be an artifact of the internal direction of the word. In many natural languages such as Catalan, Spanish, French or German, morphological variation is loaded more often at the end of the word than at the beginning. This made me think of testing a simple hypothesis: maintain the position of the words in the text, but internally invert each token, as if they were written from right to left.
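A minimal sketch of the kind of check this involves, with made-up EVA-like tokens and a crude prefix/suffix classifier (not quimqu's actual model): classify where a variant differs from its parent, then do the same on internally reversed tokens. Reversal swaps "beginning" and "end" verdicts, which is exactly what the test probes.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def change_locus(parent: str, child: str) -> str:
    """Crudely classify where a variant differs from its parent:
    'end' if the pair shares a longer prefix, 'begin' if a longer suffix."""
    pre = common_prefix_len(parent, child)
    suf = common_prefix_len(parent[::-1], child[::-1])
    if pre > suf:
        return "end"
    if suf > pre:
        return "begin"
    return "mixed"

# Made-up parent-child pairs, for illustration only
pairs = [("qokeedy", "qokedy"), ("daiin", "odaiin"), ("chedy", "shedy")]
forward = [change_locus(p, c) for p, c in pairs]
# Reversing every token internally swaps the 'begin' and 'end' verdicts
backward = [change_locus(p[::-1], c[::-1]) for p, c in pairs]
print(forward, backward)
```

If the mechanism really operated on word endings written right to left, the reversed run should be the one where changes concentrate cleanly; the post reports that it is not.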

I have done the test keeping exactly the same parent-child relationship detection model and the same metrics. The result is quite clear. The quality of the model gets slightly worse when the tokens are reversed: the average distance increases and the exact copy ratio decreases. The transformation rules do not become simpler or more regular. The concentration of repeated rules does not improve and the distribution of the change positions remains scattered, without a cleaner structure appearing at the end of the token.

In the original version, the parent is at least as frequent as the child in about 75% of the relationships, and this pattern remains stable. With internal inversion this not only fails to improve; it loses consistency in some cases. Nor is there any clear reorganization of insertions and deletions towards the end of the word, which is what we would expect if we were reading the tokens backwards.

Therefore, reversing the internal order of the words does not explain the data better. This reinforces the idea that the text generation mechanism works preferentially on the beginning of words, and not on the end as would be more common in natural language.
I remember a thread (last year?) about statistics showing a kind of drift in pages generated by Torsten Timm's generator: when words are generated by modification of earlier words on the same page, the average number n of nth-generation words tends to increase from the top to the bottom of the page. The same statistic on the VMS transliteration does not show this drift on its pages.
Okay, let me explain what I meant.

1) Let's take some real text - in my case it was the Book of Genesis

In the beginning God created the heaven and the earth.
And the earth was without form, and void


2) Assign numbers to words: the first word is 1, the second word is 2, and so on. If a word repeats, the assigned number repeats too.
For example, "the" is coded by "2" here, and as you can see, "2" repeats several times.

1 2 3 4 5 2 6 7 2 8
7 2 8 9 10 11 7 12


3) Write the numbers as Roman numerals

I II III IIII V II VI VII II VIII
VII II VIII VIIII X XI VII XII

I transformed programmatically all the Book of Genesis this way. The result for me is somehow similar to Voynich:

Code:
DLXXXI CLIII XXV MCCCXXX MCCCXXV CLI CXII CXXX MCCCVII I CCCXXV CLXXVII XXXVII XXVIIII LXXVIII CCCCLXXXXIII I II DCLVI MLXXXXIII MCCCXXXI CCCXXXV MCCCXXXII XVII CCCXII MCCXXXXVIII L LXXVIII CLXXXVI XVII CCCLI MCCCXXXIII XXXVII XXVIIII LXXVIII CCCCLXXXXIII I CCCLI DCLVI VII XXXVII XXVIIII LXXVIII MCCCXXXI CCCXXXV CCCLI MCCCXXXII MCCCXXXIIII MCCCXXXV XXV MCCCXXX VII CCLXXXXIIII DCXXXXVI CLIII XXV I CCCXXV CCLXXXIIII LXXXV CCLXV DCCLXXXVIIII MCCCXXXVI VII II MCCCXXXVII CXXX MCCCVII LXXVII CCLXXXIIII XVII LXXV MCCCXXVIII LXXVIII CLXXXVI MCCCXXX XXVIIII MXXXXII CLIII XXV DCCLXXX DCLXVII XXXIII LXXV DCCCCLVIIII XXXVII CIIII DCLXXVII CCLXXXXIIII MCCCXXXVI VII IIII XXII LVII MCCCXXIII CCCXXVIII LXXXV DCCCCLXXXVIIII CCCLI CCCII CCLI CCLVI CLXXXVI CCLXXIII CCLXXXVIIII CCXXVII DCCCCLXXXVIIII CLXXXXI MCCCXXXVIII CLIII CCLXXXVIIII CCXXVII MCCXXVII VII CXXXXVIIII CCLXIIII MVII CCLXXXVIIII VII LXXXX CCCLVI XXXXIIII CCCCLXXXXI CCXVIII XVII CCLXXXVIIII CCCXIII CXXXXVIIII CCLXIIII MVII CCLXXXVIIII VII CCLXXXXVI CLIII XXV XXXXIIII CCC XVII DCCCCXXXVI MCXXXXIII XVII DCCCCLVIIII CLIII XXV XVII MCCCXXXVIIII CCCXXIIII MCCCXXIII MCLXVIIII XV LXXV XVI VII MCCCXXXX VII XXII I LXXV DC CLIII XXXXIIII MCCCVII XXV CCCCLXXXXIII LVII CXXXXII XXVIIII LXXVIII CCLXV DXXXXIIII DXXXXVI MCCCXXXXI VII CLIII MCCCXXXVIII XXVIIII LXXVIII DLVIII DXXXXVI DLXXXI MCCCXXXXII VII MCCCXXIII XXII LVII IIII MCCCXXXXIII XXVIIII MCCCVIII MLXV CCCCII CLXXXIIII MCCCXXXXIIII VII IIII XXII MCCCXXXVIII CCCLI CCCII CLIII MLXXX CCCLVI XXXXIIII CCCCLXXXXI MCCCXXXXV VII CCLI CCLVI CCLXXIII LXXV CCXXVII MCCCXXXXVI VII CXXXXVIIII CCLXIIII DCXXXXV CCLXXXXIIII DCXXXXVI CCCXXXV CXXXXII LXXXV CCLXV DCCLXXXVIIII DCXXXXVI VII CCCXXXV LXXV LXXI LXXIIII CCLXVIII
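For reference, the transformation Rafal describes can be sketched as follows, using the additive numeral style of his example (VIIII rather than IX). The function names and the crude punctuation stripping are my own; his actual program may differ.

```python
def to_roman_additive(n: int) -> str:
    """Additive Roman numerals in the style of the example (VIIII, not IX)."""
    vals = [(1000, "M"), (500, "D"), (100, "C"), (50, "L"),
            (10, "X"), (5, "V"), (1, "I")]
    out = []
    for v, s in vals:
        while n >= v:
            out.append(s)
            n -= v
    return "".join(out)

def encode(text: str) -> str:
    """Number each distinct word at its first occurrence, reuse the number
    on repeats, and write the numbers as additive Roman numerals."""
    index = {}
    out = []
    for word in text.lower().replace(",", "").replace(".", "").split():
        if word not in index:
            index[word] = len(index) + 1
        out.append(to_roman_additive(index[word]))
    return " ".join(out)

print(encode("In the beginning God created the heaven and the earth. "
             "And the earth was without form, and void"))
```

Run on the opening of Genesis, this reproduces the numeral lines in steps 2) and 3) above.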

But it is not gibberish. It is an extreme case of code cipher.

I wonder how it compares to natural texts, the Voynich manuscript and Timm-generated text. Will your methods wrongly detect it as "autocitation"?

@Quimqu, if you would like to perform the analysis, I attach the file.
I got excited about burst detection and went a little further. The logical step, now that I have the bursts detected, is to see if there is any logic within the bursts on the same page. What I have done is to analyze how the words are generated within the manuscript, not so much by looking at whether they are similar, but whether they follow some generation mechanism.

The key idea is to look at "bursts", groups of related words, and see if there is a pattern within these groups and if this pattern is repeated within the same page.

The first important result is that within a burst the behavior is very coherent. Words do not change arbitrarily, but evolve with small modifications. This points to a local mechanism, almost as if each word were transformed from the previous one.

However, when moving from one burst to another, the behavior changes. It is not completely random, but it does not follow a simple rule either. There are preferences, but not determinism.

This is clear with a sequential model. I have applied a Markov model not directly on the words, but on the burst type associated with each word.
That is, I first create burst types with a clustering model (k-means), then assign each word a type (derived from its burst), and then analyze the sequence of these types throughout the text.

Therefore, the sequence I model is not:

word1 → word2 → word3

but:

type1 → type2 → type3

This is important because bursts are not contiguous blocks; their words are interleaved in the text. Working at the type level allows me to capture the behavior of the system (or at least to try).

With this approach, it is seen that a Markov model clearly improves over assuming independence:

Code:
Model            | Log-loss | Perplexity
-----------------|----------|-----------
Independent      | 1.19     | 3.29
Markov (order 1) | 1.07     | 2.91
Markov (order 2) | 1.01     | 2.73
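The comparison can be reproduced in miniature. A sketch with a made-up type sequence, fitting and evaluating on the same data with no smoothing (a careful version would use held-out data; the sequence and function names are mine, not quimqu's code):

```python
import math
from collections import Counter, defaultdict

def perplexity_independent(seq):
    """Per-symbol perplexity of a unigram (independence) model fit on seq."""
    counts = Counter(seq)
    n = len(seq)
    logloss = -sum(math.log(counts[s] / n) for s in seq) / n
    return math.exp(logloss)

def perplexity_markov1(seq):
    """Per-symbol perplexity of an order-1 Markov model fit on seq;
    the first symbol is scored with the unigram model."""
    counts = Counter(seq)
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    logs = [math.log(counts[seq[0]] / len(seq))]
    for a, b in zip(seq, seq[1:]):
        logs.append(math.log(trans[a][b] / sum(trans[a].values())))
    return math.exp(-sum(logs) / len(seq))

# Hypothetical burst-type sequence with the persistence seen in the data
types = "AAABBBAAACCCAAABBB"
print(perplexity_independent(types), perplexity_markov1(types))
```

On a sequence with this kind of persistence, the Markov perplexity comes out lower than the independent one, the same direction as the table above.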

This indicates that the type of the next word depends on the type of the previous ones; it is not independent. And the fact that order 2 improves on order 1 shows that this dependence is not only on the immediate step: there is memory over more than one step.

In other words, the system does not generate words independently, but rather tends to maintain or change "mode" (burst type) following sequential patterns.

Furthermore, if we try to predict the burst type of a word, we see that it depends on both context and position:


Code:
Feature       | Importance
--------------|-----------
line_ord      | 0.287
prev_type     | 0.283
prev2_type    | 0.255
pos_norm      | 0.097
pos_in_line   | 0.073
is_line_start | 0.004

I think this is very revealing. Sequential context is very important, but so is vertical position within the page. On the other hand, the beginning of the line does not carry much weight.

When we try to predict whether there will be a type change (change of state), the result is much weaker:

Code:
Metric   | Value
---------|------
Accuracy | 0.556
ROC AUC  | 0.574
Log-loss | 0.684

These values are only slightly above chance, indicating that the model is hardly able to predict when a change will occur. This suggests that the transition between states is not governed by a simple and easily observable or modelable rule, such as position within the line or the immediate context. Although the system shows clear structure within the bursts, the times when it changes from one type to another are much more difficult to predict and are not well explained by simple local variables.

Finally, when we analyze the different types of bursts, we see that they are not arbitrary. Each type has a different internal behavior: some modify mainly the end of the word, others the beginning, others are more balanced. In other words, each cluster corresponds to a specific way of transforming words.

Putting all this together, the model that best explains the behavior is this:

“within a state (burst type), the text evolves in a very coherent and local way. From time to time, the system changes state. This change is not random, but not deterministic either, and depends on both context and position.”

The text appears neither random nor purely linguistic in the classical sense. It has discrete states, short memory, and local transformation rules. It is much more compatible with a generative system with rules than with a natural text without explicit structure.
A brief self-citation

(Yesterday, 04:32 PM)quimqu Wrote: On the other hand, the beginning of the line does not carry much weight.

It is known that the beginnings of lines show some kind of differentiation from the rest of the text. In the case of the bursts, I think it may be significant that line beginnings do not affect the burst type: they might belong to another kind of generative system. I am just putting this into words here for later.
I'll leave it here for today, but first I wanted to look at one last thing: if there was any pattern in how the burst types alternate within the pages.

The intuition was clear. Perhaps the text was alternating types in a regular manner, or following increasing or decreasing sequences. But the results point in a different direction.

Code:
Pattern class | Mean share
--------------|-----------
same          | 0.514
up_jump       | 0.181
down_jump     | 0.181
up_1          | 0.087
down_1        | 0.084

More than half the time, the type stays the same. And when it changes, it often does so in jumps, not in smooth progressions.

If we look at slightly longer sequences, it becomes even clearer:

Code:
Pattern class | Mean share
--------------|-----------
flat_3        | 0.315
nondecreasing | 0.201
nonincreasing | 0.200
aba           | 0.201

There is no strong alternation of the type A B A B, nor are there linear paths. What does appear a lot is the pattern A B A: the system leaves one type, but quickly returns to it.

In other words, the dominant behavior is not alternation, but persistence with small local excursions. The text seems to move around a “state” and make short detours before returning to it.

And this fits very well with everything that has been coming out: strong internal structure within the bursts, real sequential dependency, but changes that do not follow simple rules.

And now, for today I'll leave it here. More tomorrow. :) (I will reply if you have comments, but for today the coding is finished.)
(Yesterday, 03:17 PM)Rafal Wrote: 3) Write the numbers as Roman numerals

I II III IIII V II VI VII II VIII
VII II VIII VIIII X XI VII XII

I transformed programmatically all the Book of Genesis this way. The result for me is somehow similar to Voynich:

DLXXXI CLIII XXV MCCCXXX MCCCXXV CLI CXII CXXX MCCCVII I CCCXXV CLXXVII XXXVII XXVIIII LXXVIII CCCCLXXXXIII I II DCLVI MLXXXXIII MCCCXXXI CCCXXXV MCCCXXXII XVII CCCXII MCCXXXXVIII L LXXVIII CLXXXVI XVII CCCLI...

I had this idea too, back when I found the peculiar distribution of word lengths.

Quote:It is an extreme case of code cipher

Is there a worse name for anything than "code cipher"?*  Is there a cipher that is not in code?

Please, crypto folks, at least call  it "codebook cipher"...

All the best, --stolfi

* Admitted, the mathematician's "partial order" comes close.