The Voynich Ninja

Full Version: An attempt at extracting grammar from vord order statistics.
Davidd, would you like to try Genesis in Latin?

I believe it may be an interesting and important test which may give a different result than the English version.

If anybody wonders - Latin has declension, which is mostly absent in English. While in English it is always, for example, "Egypt", in Latin it may be "Aegypto", "Aegypti", "Aegyptum", etc.

I believe your model is not "clever" enough to know that these are the same word, and it will treat the variants as totally different words.

I wonder how this may impact the found word groups and the "quality" that you compute with your statistics. I am not sure.

I include the Vulgate version of Genesis in case you need it.
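The point about declension can be made concrete with a toy sketch (illustrative only, not part of Davidd's model): a model that compares only surface forms sees three Latin word types where English has one.

```python
from collections import Counter

# English keeps one surface form; Latin declension splits the same lemma
english = "Egypt Egypt Egypt".split()
latin = "Aegypto Aegypti Aegyptum".split()

english_types = len(Counter(english))   # 1 distinct word type
latin_types = len(Counter(latin))       # 3 distinct word types
```

So any purely distributional grouping has to rediscover that the three Latin forms behave alike, instead of getting it for free as in English.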
(09-06-2025, 11:04 PM)Rafal Wrote: Davidd, would you like to try Genesis in Latin?
score: 280
(10-06-2025, 09:23 AM)davidd Wrote: score: 280

You've lost me a bit. What is it that these numbers represent? Some statistical measure? If so, it would be useful to know what the distribution of this measure is.
Quote: You've lost me a bit. What is it that these numbers represent?

The method that Davidd uses finds groups of "similar" words that behave in a similar way in the text. That is, they appear after some specific words, or before some words, etc. Analysis of the Book of Genesis shows that this method actually works to some extent and gives meaningful results - the found groups are actually made up of nouns, verbs, conjunctions, etc. So "Abraham" and "Noah" will fall into one group, and "walk" and "go" into another group.

Now, these groups are not perfect; sometimes (or maybe quite often) a verb may fall into a "noun" group, etc.
We had a feeling that we needed a measure of how "good" these groups are, and Davidd invented such a measure. Understanding exactly how it works would probably require a bit of intellectual labour, but I trust him that it is not bullsh!t  Wink

This measure also seems to work. It has high values for Genesis: around 300 for the English version and 280 for the Latin version. My intuition was that it would be lower for Latin than for English because of declension (see my earlier post), and it proved right. But it is not much lower: 280 vs 300.

For totally random text (Genesis with scrambled words) this value is very low - around 20. And for the Voynich it is somewhere in between - around 100.

So the Voynich feels too regular (it has patterns) for a random text, and too random for a normal text.
Curious, isn't it?  Wink
Ok, let me try to explain the score function.

Step 1:

Imagine there is a group of words that combined make up 10% of the text, so 90% of the words in the text are in other groups. Let's call this group of words "lord".

Q: What is the chance that "the next word" is a member of the "lord" group?
A: 10% if you assume the chance is completely random.

Now imagine a second group of words; let's call this group "the". Say around 5% of the total words of the text belong to this group "the".

Now, when parsing the text, whenever we encounter a word from the group "the", the chance that the next word is from the group "lord" is much higher than the 10% we answered in the question above. It's closer to 40%.

I see that as a signal that is significantly above noise level. 
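This baseline-versus-conditional comparison can be sketched in a few lines (illustrative only; the function and toy text here are invented, not Davidd's actual code):

```python
def transition_lift(tokens, group_a, group_b):
    """Compare the baseline chance that any word is in group_b with the
    chance that the word right after a group_a word is in group_b."""
    group_a, group_b = set(group_a), set(group_b)
    baseline = sum(t in group_b for t in tokens) / len(tokens)
    after_a = [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur in group_a]
    conditional = sum(t in group_b for t in after_a) / len(after_a)
    return baseline, conditional

# Toy text where "lord" strongly follows "the"
text = ("the lord said unto the lord a word and the lord went away "
        "a man came and the lord spoke").split()
base, cond = transition_lift(text, {"the"}, {"lord"})
```

In this toy text the baseline is 20% but the conditional chance after "the" is 100% - exactly the kind of gap that counts as signal.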

Step 2:

There is always some variation, some noise. How far above noise level is that 40%?
This is where the central limit theorem comes to the rescue. If you take enough "samples" of any distribution, their sum will start to behave like a normal distribution (Gaussian).

So imagine the Voynich, the Vulgata, or the King James version was written by throwing some dice to determine what the next word will be.

The noise level (sigma, standard deviation) is proportional to the square root of the number of throws.

If you throw a die ten thousand times, the expected number of sixes is one thousand six hundred and sixty-six, plus or minus on average thirty-seven.
The average noise is thirty-seven. The sigma is thirty-seven.

Step 3:

The software produces these giant tables with some positive values but very many negative values. These are the actual transition frequencies compared to the expected transition frequencies. When you read negative twenty in the table, it means that the actual transition rate was twenty sigma below the expected transition rate.

Step 4:

On average, when taking absolute values, the noise should be one sigma. That is the definition of standard deviation.
The left number in the score tables is just the straightforward average of all these transition deviations, each divided by its individual sigma.
To accentuate the extremes a little bit, I chose to also show the average of the squares of these numbers.
This is very similar to the root mean square (RMS).
Basically the score 280 should be square-rooted to get the RMS (around 16), but I was lazy, and bigger numbers are more beautiful Tongue

Because it is not really the RMS but the RMS squared, I just call it "the score".
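Putting steps 1-4 together, here is a minimal sketch of the scoring idea (a reconstruction from this description, not the actual software; the `score` function, group labels and toy texts are all invented here):

```python
import math
import random
from collections import Counter
from itertools import product

def score(tokens, groups):
    """Mean squared z-score over group-to-group transitions.

    `groups` maps each word to a group label. For every ordered pair
    of groups, the observed transition count is compared with the
    count expected under random ordering, in units of the binomial
    sigma for that pair.
    """
    labels = [groups[t] for t in tokens]
    n = len(labels) - 1                      # number of transitions
    freq = Counter(labels)
    trans = Counter(zip(labels, labels[1:]))
    zs = []
    for a, b in product(freq, repeat=2):
        p = (freq[a] / len(labels)) * (freq[b] / len(labels))
        expected = n * p
        sigma = math.sqrt(n * p * (1 - p))
        zs.append((trans[(a, b)] - expected) / sigma)
    return sum(z * z for z in zs) / len(zs)

# A rigidly patterned toy text versus the same words shuffled
groups = {"the": "DET", "a": "DET", "lord": "N", "word": "N"}
patterned = ["the", "lord", "a", "word"] * 250
shuffled = patterned[:]
random.seed(0)
random.shuffle(shuffled)
high = score(patterned, groups)
low = score(shuffled, groups)
```

As with the real measure, the patterned text scores far above the shuffled version of itself, mirroring the 300-versus-20 gap described for Genesis.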

The score of King James Genesis with the words shuffled is around 20; without shuffling, around 300.

I hope this clarifies the procedure.
Did a second run of vulgata
score 248
(10-06-2025, 11:44 AM)davidd Wrote: Did a second run of vulgata
score 248

Would there be scope to follow the "sub-languages" Zattera identified in his slot-alphabet paper from 2022? If I recall correctly, he argued it would make more sense to study each section of the VM individually, as the slot sequence changes slightly between them, indicating some sort of change in the text generation process. My only worry would be the considerably smaller sample size; I think the 5 sub-languages he identified were Herbal A, Herbal B, Pharmaceutical, Biological and Recipes.
(10-06-2025, 12:06 PM)davidma Wrote: follow Zattera's "sub-languages"

I don't think this will give you much. The text in the herbal sections is frequently broken by the illustrations. Line lengths are much shorter than in the BioB2 and StarsB3 sections. Especially so in the HerbalA1 pages because hand 1 writes larger than hands 2 and 3. Line first word and line last word effects are going to bias your calculations.

(I gave some opinions about line last word effects in an earlier post.)

I believe this suggestion will lead you down a false path.
(10-06-2025, 12:40 PM)dashstofsk Wrote: I believe this suggestion will lead you down a false path.

If the issue is, as suspected, sample size and sample layout (i.e. broken text in the herbal sections vs. continuous text in the recipes and balneological sections), then I suppose one could just limit oneself to the balneological and recipes sections, which should cover enough vord tokens to still produce some meaningful analysis, no?
(10-06-2025, 01:46 PM)davidma Wrote: enough vord tokens to still produce some meaningful analysis no?

BioB2 and StarsB3, being both language B, will probably give similar output, the same as what has already been calculated for language B.

If there is a difference it might be because quire 13 has a higher frequency of just a few words. The top 10 words make up 22% of the total words. For quire 20 the top ten make up 15%. Alternatively, 20% of quire 13 is made up of just 9 words. 20% of quire 20 is made up of 16 words.
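Concentration figures like these can be computed with a short sketch (the helper names here are my own, not from the thread's software):

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all word tokens covered by the k most frequent word types."""
    counts = Counter(tokens)
    return sum(c for _, c in counts.most_common(k)) / len(tokens)

def types_for_coverage(tokens, fraction):
    """Smallest number of word types needed to cover `fraction` of the tokens."""
    counts = Counter(tokens)
    covered, target = 0, fraction * len(tokens)
    for i, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= target:
            return i

toy = "a a a a b b c d e f".split()
cov1 = top_k_coverage(toy, 1)          # the single most frequent type covers 40% here
k_half = types_for_coverage(toy, 0.5)  # two types cover half the tokens
```

Run over the quire 13 and quire 20 transcriptions, the first function would give the 22% / 15% figures and the second the 9-word / 16-word figures quoted above.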