The Voynich Ninja

Full Version: An attempt at extracting grammar from vord order statistics.
Pages: 1 2 3 4 5 6 7 8 9
(27-05-2025, 11:27 AM)davidd Wrote:
(27-05-2025, 09:52 AM)MarcoP Wrote: Hi Davidd,
your results look very readable now, well done! I would only suggest that you sort the members by decreasing frequency, so that the first 3 are the same words as the node labels in the graphs.

feature added, new runs


Edit: I didn't do it in the tooltips shown when hovering over the picture yet, only in the overview below the picture. I will add it to the tooltips as well.


Note: when analysing a section, such as a quire or only the parts with language A, the frequency count is based only on that section, not on the frequency in the whole MS. This may cause orderings to change between quires.
Great work! Looks promising :)
I can't help but think that the apparently 'homogenized' or 'softer' differences in 'more/less likely followed/following' probabilities within VMS vord groups could hint at vords not being whole plaintext words, or at some obfuscation at work beyond a standard 1:1 word-to-vord mapping. That said, I'm looking forward to seeing this analysis applied to other languages and texts.
I had a look at these groups from the Book of Genesis and they actually make sense :)

They are not perfect, but the method clearly finds some real patterns. It won't find fine-grained categories like proper names or personality traits, but it seems to roughly find parts of speech like nouns, verbs, adjectives or conjunctions. As I said, it's not 100% perfect, but it feels well above random.

I would have a question, Davidd. Do you have any simple statistical measure that would tell us how "good" these groups are? Something like a correlation coefficient or some measure of total variance explained?

I wonder if you could do a comparison for:
- Book of Genesis
- some ungrammatical text (could be the Book of Genesis with scrambled word order)
- Voynich Manuscript

and see how good the found groups are. Is Voynich more similar to grammatical or ungrammatical text?
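The scrambled baseline proposed above is easy to produce; a minimal sketch in Python (the sample verse is just an illustration; note that a plain shuffle preserves unigram frequencies by construction, so any difference the grouping method finds must come from word order):

```python
import random

def scramble_word_order(text, seed=42):
    """Return the text with its word order randomized.

    Word counts are preserved exactly, so any difference a
    grammar-induction method finds between the original and the
    scrambled version must come from word order, not word frequencies.
    """
    words = text.split()
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    rng.shuffle(words)
    return " ".join(words)

verse = "in the beginning god created the heaven and the earth"
scrambled = scramble_word_order(verse)
# Same words, same counts, different order.
assert sorted(scrambled.split()) == sorted(verse.split())
```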
Quote:Is Voynich more similar to grammatical or ungrammatical text?

The VMS doesn't look like randomly scrambled text: IIRC, word entropy is comparable to that of some written languages (while scrambling results in maximum entropy). But some of the most prominent patterns are different from those of written languages.

1. There are different sections with different word types.
2. There are positional preferences in lines and paragraphs, like y-/s- line-initially and -m line-finally. There are also the patterns discussed in Patrick Feaster's Malta paper.
3. Last-first combinations: e.g. words ending in -y prefer to be followed by words starting with q- (also discussed in a Cryptologia paper by Emma May Smith and myself). I guess this affects Davidd's graphs.
4. Identical or similar words tend to appear consecutively.
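Pattern 3 above can be checked directly on any tokenised transcription; a minimal sketch in Python (the token list is a made-up toy, not real EVA data):

```python
from collections import Counter

def last_first_bias(tokens, suffix="y", prefix="q"):
    """Compare P(next word starts with `prefix`) overall
    vs. conditioned on the previous word ending in `suffix`."""
    after_suffix = Counter()
    overall = Counter()
    for prev, nxt in zip(tokens, tokens[1:]):
        overall[nxt.startswith(prefix)] += 1
        if prev.endswith(suffix):
            after_suffix[nxt.startswith(prefix)] += 1
    p_overall = overall[True] / max(1, sum(overall.values()))
    p_after = after_suffix[True] / max(1, sum(after_suffix.values()))
    return p_overall, p_after

# Toy token stream in which -y words tend to be followed by q- words.
toks = ["chedy", "qokeey", "daiin", "shedy", "qokain", "ol", "chey", "qoty", "dar"]
base, after = last_first_bias(toks)
# base = 3/8 = 0.375, after = 3/5 = 0.6: the conditional probability is higher.
```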
(28-05-2025, 05:10 PM)MarcoP Wrote: Is Voynich more similar to grammatical or ungrammatical text?

The VMS doesn't look like randomly scrambled text: IIRC, word entropy is comparable to that of some written languages (while scrambling results in maximum entropy). But some of the most prominent patterns are different from those of written languages.

The software I wrote comes to the same conclusion.
I followed the suggestion made by Rafal and made a version that randomizes the word order in Genesis.

Voynich is more like the King James text than the randomly ordered one.

Added some color to the output. The second image is the result when all other words are added to the best-fitting groups.

The experiments with King James Genesis confirm that the system gives meaningful results on a simple 38K-word text. Another interesting experiment could be processing English texts that show a complexity (e.g. MATTR) closer to Voynichese and whose length matches that of relatively homogeneous Voynich sections (e.g. Quire 13 or Quire 20, about 7-11K words).

These are MATTR-500 values for Q20, Q13, Moby Dick, Shakespeare’s Sonnets, the Grete Herball (skipping the initial index) and King James Genesis. The English texts have been converted to lower case and punctuation was removed. Genesis is more repetitive than the other texts and therefore has a lower number of different word types. The Grete Herball appears to be more varied, but is still more repetitive than Voynichese.

voynich___EVA_Q20 0.666412364696
voynich___EVA_Q13 0.535685865663
moby_dick________ 0.538253469990
shakespeare_sonn. 0.525767143674
grete_herball____ 0.436593990497
king_james_______ 0.349082070707
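For reference, MATTR can be computed with a sliding window over the token stream; a minimal sketch in Python (the toy text and the reduced window size are only for illustration — the table above uses a window of 500):

```python
from collections import Counter

def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio: the mean TTR over all
    windows of `window` consecutive tokens."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    counts = Counter(tokens[:window])          # types in the first window
    ttr_sum = len(counts) / window
    for i in range(window, len(tokens)):
        out_tok, in_tok = tokens[i - window], tokens[i]
        counts[out_tok] -= 1                   # slide the window right by one
        if counts[out_tok] == 0:
            del counts[out_tok]
        counts[in_tok] += 1
        ttr_sum += len(counts) / window
    return ttr_sum / (len(tokens) - window + 1)

tokens = "the cat sat on the mat the cat ran".split()
m = mattr(tokens, window=4)  # mean of 6 window TTRs: 5.5/6 ≈ 0.917
```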
(27-05-2025, 11:04 PM)Rafal Wrote: I would have a question, Davidd. Do you have any simple statistical measure that would tell us how "good" these groups are? Something like a correlation coefficient or some measure of total variance explained?

This definitely requires more statistical knowledge than I have; Rene or Nablator can certainly make better suggestions. My unreliable guess is that, as a simple measure of quality, one could compute the probability for a graph to generate the text under examination. For instance, one could use the class-based bigram model discussed in Brown et al., "Class-based n-gram models of natural language".

Pr(Wi | Wi-1) = Pr(Wi | Ci) × Pr(Ci | Ci-1)

The probability of generating word Wi, given that the preceding word is Wi-1, is the probability of Wi occurring as a member of class Ci, multiplied by the probability that class Ci-1 (Wi-1's class) is followed by class Ci (the arrows in Davidd’s graphs, assuming that the weights of the arrows going out of a class add up to 100%).

One can compute such probabilities for all words W1...Wn and multiply all of them to get the overall probability for the whole passage.

I guess that Pr(Wi|Ci) is simply the number of Wi tokens divided by the total number of tokens assigned to Ci.

This system appears to be simple enough, but I think it can only compare models based on the same number of classes and texts of identical length.
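A minimal sketch of this scoring scheme in Python (both factors are the relative-frequency estimates guessed above; the probability of the first word is ignored and smoothing for unseen transitions is omitted, so this is only an illustration of the idea):

```python
import math
from collections import Counter

def class_bigram_logprob(tokens, word2class):
    """Log-probability of a token sequence under a Brown-style class
    bigram model: Pr(w_i|w_{i-1}) = Pr(w_i|c_i) * Pr(c_i|c_{i-1})."""
    classes = [word2class[w] for w in tokens]

    word_counts = Counter(tokens)     # for Pr(w|c) = count(w) / tokens in c
    class_counts = Counter(classes)
    trans = Counter(zip(classes, classes[1:]))  # class-to-class arrows
    trans_from = Counter(classes[:-1])

    logp = 0.0
    for i in range(1, len(tokens)):
        c_prev, c_cur, w = classes[i - 1], classes[i], tokens[i]
        p_emit = word_counts[w] / class_counts[c_cur]
        p_trans = trans[(c_prev, c_cur)] / trans_from[c_prev]
        logp += math.log(p_emit) + math.log(p_trans)
    return logp

# Perfectly alternating classes: the model predicts every step with
# probability 1, so the log-probability is 0.
lp = class_bigram_logprob(["a", "x", "a", "x"], {"a": 0, "x": 1})
```

As noted above, scores are only comparable between models with the same number of classes on texts of the same length, since more classes and shorter texts both inflate the per-step probabilities.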
Quote:Voynich is more like the King James text than the randomly ordered one.

That's certainly an interesting result.

I had a look at the groups your algorithm found in the VM but unfortunately couldn't spot any regularity.
You know, in many languages (though not so much in English) words belonging to the same part of speech, like nouns or verbs, are somehow similar.

Think of Latin verbs: bibere, dormire, obedire, vivere, movere etc. They all have similar endings. 

But in your groups there isn't any pattern. For example your first group is: shol, chey, shy, qotchy, oty, cheo, choty, qo, shaiin, cthody, qotchol, ytaiin, daly, oteody, ckhody, okeom, tsho, qokeeo, shoky, daim

Do you have any ideas Davidd how to follow your research? What would you like to check next?
(30-05-2025, 09:25 PM)Rafal Wrote: I had a look at the groups your algorithm found in the VM but unfortunately couldn't spot any regularity.
You know, in many languages (though not so much in English) words belonging to the same part of speech, like nouns or verbs, are somehow similar.

Think of Latin verbs: bibere, dormire, obedire, vivere, movere etc. They all have similar endings. 

But in your groups there isn't any pattern. For example your first group is: shol, chey, shy, qotchy, oty, cheo, choty, qo, shaiin, cthody, qotchol, ytaiin, daly, oteody, ckhody, okeom, tsho, qokeeo, shoky, daim

The Latin example is very specific: those are verbs in the infinitive, i.e. a single verbal form out of tens of different possibilities from the conjugation of each verb. In order to correctly classify Latin infinitives as a distinct class, you probably need tens of different classes and a large corpus (100,000 words or more, I guess). Davidd’s experiments are designed to work on the limited size of the Voynich text and cannot identify such a fine-grained class of words. Also, we know that Voynichese is not a phonetic rendering of a European natural language, so we can expect that it behaves differently.

But Davidd’s results do show regularities, not too different from those I discussed elsewhere (words belonging to each class tend to be rather homogeneous).

For instance, I checked the top 10 words for the 7 classes for quireM (Quire 13) from Davidd's results. Quire 13 is both rather long and has a particularly low MATTR, so it’s one of the sections that give better hopes for this kind of analysis.
  • qokedy, qokeedy, qokeey, qoky, qotedy, qoteedy, qoty, lchey, lol, sal: the top 7 words start with q- and end with -y
  • shedy, chedy, shey, chey, dy, chckhy, sheedy, sheey, shckhy, cheol: 8 out of 10 start with a bench and end with -y; the top 4 show an even tighter pattern
  • qokain, qol, qokal, qokaiin, dar, qokar, qotal, okal, qotain, qokol: 8 out of 10 start with q- and do not end with -y
  • cheey, aiin, oly, olkain, otain, ain, checkhy, al, o, olkeey: 8 out of 10 start with one of the two “circles” o/a
  • lchedy, otedy, okedy, qokey, okeedy, lshedy, dshedy, oteey, pchedy, otey: 5 out of 10 start with o-gallows and end with -y
  • ol, daiin, dal, dain, y, okain, sol, saiin, okaiin, l: 5 out of 10 end with -n
  • or, r, s, ar, sar, olor, kain, otaiin, sheor, ykeedy: 6 out of 10 end with -r

For the first four classes, the patterns appear to be quite significant, in my opinion. They could suggest that Voynichese grammar is dominated by the last-first glyph combinations at word breaks that Emma May Smith and I discussed in our Cryptologia paper.
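Tallies like "7 out of 10 start with q- and end with -y" can be automated; a minimal sketch in Python (the word list is copied from the first class above):

```python
def affix_share(words, prefix=None, suffix=None):
    """Fraction of words matching the given prefix and/or suffix
    (a None constraint is ignored)."""
    hits = 0
    for w in words:
        if prefix is not None and not w.startswith(prefix):
            continue
        if suffix is not None and not w.endswith(suffix):
            continue
        hits += 1
    return hits / len(words)

class1 = ["qokedy", "qokeedy", "qokeey", "qoky", "qotedy",
          "qoteedy", "qoty", "lchey", "lol", "sal"]
share = affix_share(class1, prefix="q", suffix="y")
# 7 of the 10 words start with q- and end with -y, so share == 0.7
```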
Thanks Marco. I definitely shouldn't write posts late in the evening ;)
And I definitely must study carefully the article written by you and Emma. I've been meaning to do it forever; now I have an extra incentive.

So words which are similar, like qokedy and qokeedy, also behave in a similar way: they have the same patterns when it comes to the previous and next word.
I have a feeling that this can lead us somewhere ;)