An attempt at extracting grammar from vord order statistics. - Printable Version

An attempt at extracting grammar from vord order statistics. - davidd - 19-05-2025

Hi All,

Based on the vords that come before and after each vord, grouping vords into vordgroups looks possible.
Over the past few weeks I have been programming in Python to do some analysis and statistically informed guessing about grammar in Voynichese. I am happy to announce these partial, preliminary results. There seem to be statistically significant groupings of vords that are either more or less likely to precede or follow certain other groupings of vords.

What I am looking for

I would like to turn these results into an academic paper. That is why I am looking for any academic Voynich researcher who would like to collaborate. These results look statistically significant to my amateur eyes, but I still have to do some p-value calculations; I think chi-square would be the appropriate test for this.

Assumptions:
+ A and B are different languages.
+ Each vord matches a word in a real language.
+ The real language has some form of positional grammar, e.g. some prepositions are directly followed by a word in the genitive case.
+ Each vord has only one meaning, and every time a vord is used it carries that one meaning (the statistics will still work even if this one isn't true).

Method:
Language A and language B are processed separately.
For each vord, besides its frequency, also tally the preceding and following vords, respecting paragraph boundaries and ignoring line breaks.
Put all vords that appear at least 4 times into small vordgroups of up to around 5 vords each, based on the similarity of the vords coming before and after them.
Now score all vordgroups against each other and merge the most similar ones, looking not at each individual vord transition but at transitions from vordgroup to vordgroup.
Keep merging until the desired number of vordgroups is left.
Score each of the most frequent vords against all groups to see if any would fit better in another group.
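
The tallying step of the method could be sketched roughly as follows. This is only an illustration and not the script used for the results: the function name `tally_contexts`, the nested-list input format, and the toy paragraphs are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def tally_contexts(paragraphs, min_count=4):
    """paragraphs: list of paragraphs, each a list of lines, each line a list of vords.

    Returns vord frequencies, the preceding/following tallies per vord,
    and the set of vords that appear at least min_count times.
    """
    freq = Counter()
    before = defaultdict(Counter)  # before[v][u] = how often u directly precedes v
    after = defaultdict(Counter)   # after[v][w]  = how often w directly follows v
    for paragraph in paragraphs:
        # Ignore line breaks: flatten all lines of one paragraph into one sequence.
        vords = [v for line in paragraph for v in line]
        freq.update(vords)
        # Respect paragraphs: pairs are only formed inside a paragraph.
        for left, right in zip(vords, vords[1:]):
            after[left][right] += 1
            before[right][left] += 1
    frequent = {v for v, n in freq.items() if n >= min_count}
    return freq, before, after, frequent

# Made-up toy text: two paragraphs, the first one split over two lines.
toy = [
    [["daiin", "chedy"], ["qokain", "chedy"]],
    [["chedy", "qokain", "chedy", "daiin"]],
]
freq, before, after, frequent = tally_contexts(toy)
```

In the toy text the first paragraph ends with "chedy" and the second starts with "chedy", but no chedy-to-chedy transition is counted because the pair would cross a paragraph boundary.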

Safeguards:
Looking only at the more frequent vords leaves less wiggle room than assigning unique vords to some group just to increase the score.
Because the non-frequent vords are counted in the total for the percentage calculation, the total transition frequency to each of the labeled groups comes out lower.

Cons/doubts/possible improvements:
The algorithm I built is made to find these patterns. It has not been tested on random noise or on other language samples.
It takes a lot of time; the merging step takes around one hour on my poor old PC.
There may be some bugs in the searching algorithm. It is not very stable: the groups that come out are different every time. Probably something in Python's memory gives a different order every time. Maybe it is an omen that the method is flawed.
The output is very long, but with over 27000 vords analysed that is somewhat inevitable. A big part of the output is guessing, for all the non-frequent vords, which group they would fit best.
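
One possible way to address the "not tested on random noise" doubt is a shuffled control text: keep the same vords and the same paragraph lengths, but destroy the vord order. If the grouping finds equally strong transitions on the shuffled text, the patterns are probably produced by the algorithm rather than by the manuscript. The sketch below is an assumption about how such a control could be built; `shuffled_baseline` and the toy input are hypothetical names and data.

```python
import random

def shuffled_baseline(paragraphs, seed=0):
    """paragraphs: list of paragraphs, each a flat list of vords.

    Returns a new text with all vords shuffled across the whole text,
    keeping the number of paragraphs and the length of each paragraph.
    """
    rng = random.Random(seed)  # fixed seed so the control text is reproducible
    pool = [v for paragraph in paragraphs for v in paragraph]
    rng.shuffle(pool)
    result, start = [], 0
    for paragraph in paragraphs:
        result.append(pool[start:start + len(paragraph)])
        start += len(paragraph)
    return result

# Made-up toy text.
toy_text = [["daiin", "chedy", "qokain"], ["chedy", "ol", "dar", "ol"]]
control = shuffled_baseline(toy_text, seed=1)
```

Running the full pipeline on several such controls (different seeds) would give a rough idea of how large a "more likely followed by" difference can get by chance alone.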

It really feels great to be standing on the shoulders of giants and looking further than anyone before.
Thanks to the members of the Voynich Ninja and the maintainers of websites about Voynich.
Maybe this work can help provide a breakthrough for somebody else.

The results:

Statistics about language A and language B are in the same file: first all A output, then all B output.
line number  chapter
1            Language A
13           initial groups
99           merging
380          vordgroup stats
847          transition tables
894          moving vords to other groups
1217         guessing but not adding to groups of other vords
4466         vordgroup stats
5024         transition tables
5079         Language B
5087         initial groups
5233         merging
5775         vordgroup stats
6149         transition tables
6197         moving vords to other groups
6850         guessing but not adding to groups of other vords
11434        vordgroup stats
11930        transition tables


Code:
vordgroup 56: cheody 56 446
members:  ['cheody', 'opar', 'shek', 'sheody', 'shody', 'she', 'opchedy', 'psheody', 'opchdy', 'cheky', 'tchy', 'chekaiin', 'ytedy', 'olkchedy', 'ytchey', 'ytody', 'cholky', 'chcphy', 'lkeey', 'ycheedy', 'shor', 'olky', 'sshey', 'shckhey', 'keol', 'teeody', 'shaiin', 'lkeeedy', 'ycheeo', 'cheoty', 'shekeey', 'chotal']
num members:  32
vord count:  446
groupname: cheody
lesser likely following  : chedy  5.16% instead of 16.02%
more likely  following  : daiin  6.95% instead of  3.37%
lesser likely followed by : chedy  8.07% instead of 16.02%
more likely  followed by : qokain 21.08% instead of 15.88%
coming from group  <groupname> followed by <groupname> which has a relative size of <x>
---------------------------------------------------------
<    other> -> 31.39% <cheody>  37.44% -> <    other>
<    chedy> ->  5.16% <cheody>  8.07% -> <    chedy> rel size: 16.02%
<      ol> ->  6.50% <cheody>  6.50% -> <      ol> rel size: 12.64%
<    aiin> -> 11.66% <cheody>  3.14% -> <    aiin> rel size: 10.69%
<    daiin> ->  6.95% <cheody>  3.59% -> <    daiin> rel size:  3.37%
<  qokain> -> 14.13% <cheody>  21.08% -> <  qokain> rel size: 15.88%
<      dar> ->  2.02% <cheody>  2.69% -> <      dar> rel size:  2.14%
<  okaiin> ->  0.90% <cheody>  0.90% -> <  okaiin> rel size:  0.72%
<    okain> ->  0.90% <cheody>  0.45% -> <    okain> rel size:  0.78%
<    okeey> ->  0.67% <cheody>  0.22% -> <    okeey> rel size:  0.52%
<    otar> ->  0.67% <cheody>  0.22% -> <    otar> rel size:  0.58%
<  otaiin> ->  1.57% <cheody>  3.36% -> <  otaiin> rel size:  1.28%
<        o> ->  6.73% <cheody>  3.36% -> <        o> rel size:  4.40%
<      oty> ->  2.02% <cheody>  1.35% -> <      oty> rel size:  1.59%
<    shol> ->  1.35% <cheody>  1.35% -> <    shol> rel size:  0.81%
<      am> ->  2.02% <cheody>  2.91% -> <      am> rel size:  1.92%
<  cheody> ->  2.24% <cheody>  2.24% -> <  cheody> rel size:  1.90%
< chedaiin> ->  0.00% <cheody>  0.00% -> < chedaiin> rel size:  0.36%
<  yteedy> ->  2.47% <cheody>  0.45% -> <  yteedy> rel size:  0.53%
=========================================================
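
As a rough illustration of the chi-square check mentioned in the opening post, the sketch below tests a single line of the output above: the cheody group (446 tokens) is followed by the qokain group 21.08% of the time, against a relative size of 15.88%. It assumes that the percentages are shares of those 446 tokens and that the relative size is a fair expected share; both are assumptions, and the function name `chi_square_one_df` is made up for this example.

```python
import math

def chi_square_one_df(observed, total, expected_share):
    """Chi-square goodness-of-fit test with two categories (hit / miss), 1 degree of freedom."""
    expected_hit = total * expected_share
    expected_miss = total - expected_hit
    observed_miss = total - observed
    chi2 = ((observed - expected_hit) ** 2 / expected_hit
            + (observed_miss - expected_miss) ** 2 / expected_miss)
    # With 1 degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

total = 446                        # vord count of the cheody group
observed = round(total * 0.2108)   # tokens followed by the qokain group
chi2, p = chi_square_one_df(observed, total, 0.1588)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A single small p-value should be read with care here: the groups were built by an algorithm that searches for exactly these differences, and many group pairs are compared at once, so some correction for multiple comparisons (or a comparison against shuffled text) would be needed before calling a result significant.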

RE: An attempt at extracting grammar from vord order statistics. - davidd - 19-05-2025

The name of a vordgroup is the most frequent vord in that group.

RE: An attempt at extracting grammar from vord order statistics. - Eiríkur - 19-05-2025

This is inspiring work. I've thought about doing something similar in Python, but I haven't because I don't have a clear idea what I should try to discover. You've done a great job here, especially including things like the timings. I would keep that feature, with an option for not printing them. I've used the Python profilers in the past. They are good for finding the less efficient parts of a program. I've seen results of programs discovering word networks, which might be something to add. I think you are well on your way to producing a nice, flexible tool.

RE: An attempt at extracting grammar from vord order statistics. - MarcoP - 19-05-2025

Hi,
I find the text file somewhat hard to read, so I haven’t analyzed it in detail.

I tried something similar a few years ago. I wrote two posts based on part-of-speech (POS) tagging software. I only considered Quire 20 and (in the update post) Quire 13. I only experimented with a small number of word-classes (from 5 to 8, IIRC).
The process gives decent results for the English King James Genesis, but that’s a very repetitive text with particularly low MATTR. Voynichese is much tougher, so it’s not clear that we have enough coherent text to process.

[link] (2019)

Here I found that groups of words for Voynichese include types that are more similar to each other than for English: if two words are similar, they behave similarly (while in English this is not necessarily the case: better/butter which/witch bare/bear etc).

I also found that Voynichese grammar tends to have loops, where words belonging to the same category follow each other (a trivial case is the consecutive repetition of identical tokens). I am not sure about how such loops are handled in your experiment. Have you noticed that there are word-classes that tend to appear with consecutive tokens? This is of course something that doesn't happen frequently in English.

[link] (2021)

If one considers the newline separator as a symbol, the inferred grammar is based on LAAFU: there’s a class for line-initial words and a class for line-final words (e.g. ending -m).

I then applied some simple transformations (partly based on [link]) to remove line effects (e.g. word-initial s-, which typically is line-initial, was removed, assuming that saiin is a line-initial variant of aiin).
At this point, I removed the newline symbol from the input data and analyzed whole paragraphs instead.
The positive result I found is that Q20 and Q13 produce grammars that are somewhat comparable. I created graphs where word-classes that often follow each other are connected by arrows (npnp is the paragraph separator).

[Attached image: q13_q20_voynich_grammar.jpg]

It seems that there is some overlap with your results e.g.

the bidirectional arrow between shedy/chedy and qokain/qokaiin
the Q13 sequence qokain->dar->ol

davidd Wrote:vordgroup 1: chedy 1 3760
more likely following : qokain 22.71% instead of 15.88%
more likely followed by : qokain 35.69% instead of 15.88%

vordgroup 6: qokain 6 3726
more likely following : chedy 36.02% instead of 16.02%
more likely followed by : chedy 22.92% instead of 16.02%

vordgroup 8: dar 8 503
more likely following : qokain 23.86% instead of 15.88%
more likely followed by : ol 19.28% instead of 12.64%

Also, your results are clearly influenced by line-effects. E.g. line-initial words like yteedy tend to follow line-final words like am:

davidd Wrote:vordgroup 120: yteedy 120 125
more likely following : am 4.00% instead of 1.92%

RE: An attempt at extracting grammar from vord order statistics. - RobGea - 19-05-2025

(19-05-2025, 03:55 AM)davidd Wrote: There may be some bugs in the searching algorithm. It is not very stable: the groups that come out are different every time.
Probably something in Python's memory gives a different order every time.

Idk if this is useful to you, but some causes could be:
Dictionaries: in Python versions before 3.7, dictionaries are unordered.
Sets: always unordered.
Memory: (I don't know what this is called, but I've come across it). Fix: close all instances of Python between code runs.
Scope(?): lists (and dictionaries(?); watch out if a dictionary value is a list as well) are mutable and passed by reference (out of my hobbyist league, but it has caused me some trouble on occasion).
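
To make the points about sets and ordering concrete: in Python 3 the hashes of strings are randomized per process by default, so iterating over a set of vords can give a different order on every run, and any "first best match wins" step then gives different groups. Whether this is the actual cause of the unstable output is only a guess. The sketch below shows one common remedy, which is to sort before iterating and to break ties by name; the helper names and scores are hypothetical.

```python
def stable_order(vords):
    """Return the vords in a fixed (alphabetical) order, whatever order the set yields them in."""
    return sorted(vords)

def pick_best(scores):
    """Pick the highest-scoring pair; on a tie, the alphabetically first pair wins.

    max() returns the first maximal item it sees, so iterating over the
    sorted keys makes the tie-break independent of dict or set order.
    """
    return max(sorted(scores), key=lambda pair: scores[pair])

# Made-up scores with a tie between two candidate merges.
scores = {("shedy", "qokain"): 0.9, ("chedy", "qokain"): 0.9, ("dar", "ol"): 0.5}
best = pick_best(scores)
ordered = stable_order({"qokain", "chedy", "dar"})
```

Setting the environment variable `PYTHONHASHSEED=0` before starting Python also makes string hashing repeatable between runs, which can help to confirm whether iteration order is the source of the differences.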

RE: An attempt at extracting grammar from vord order statistics. - davidd - 19-05-2025

(19-05-2025, 03:37 PM)RobGea Wrote: idk if this is useful to you but some causes could be. ...

The goal is to discover the same structure even if the order the relationships are parsed in is different.

I will try to run Q13 and Q20 to see if I can replicate MarcoP's results, and I will try the EMS line-start modifications. I was hoping the grammar grouping of vords could shed light on those line-start and line-end vords: which grammar group they most likely fall into, and whether their "non-line-start" counterparts end up in the same group.

RE: An attempt at extracting grammar from vord order statistics. - Ruby Novacna - 19-05-2025

(19-05-2025, 03:55 AM)davidd Wrote: ... These results look statistically significant to my amateur eyes but ...

Hello Davidd!
I didn't manage to understand your idea, could you please present a step-by-step example with a single word of the B language, without using Python?

RE: An attempt at extracting grammar from vord order statistics. - davidd - 19-05-2025

(19-05-2025, 07:30 AM)MarcoP Wrote: I find the text file somehow hard to read, so I haven’t analyzed it in detail.
...
I then applied some simple transformations ... to remove line effects ...

I have not implemented the EMS transformations yet.


I will have a look at producing dot files from my statistics; they should be more legible than raw text files.

The format of the text file is largely based on what is pleasant for me to look at in the terminal. My arbitrary choice of 18 groups was also made on that basis.
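
Producing a dot file from the transition statistics could look something like the sketch below. It writes plain Graphviz DOT text; the function name `to_dot`, the graph name, and the dictionary layout are assumptions for the example, while the three percentages are taken from the "more likely followed by" lines quoted earlier in the thread.

```python
def to_dot(transitions, min_share=0.0):
    """transitions: {(source_group, target_group): percentage}. Returns DOT source text."""
    lines = ["digraph vordgroups {"]
    # Sort the edges so the output file is identical on every run.
    for (source, target), share in sorted(transitions.items()):
        if share >= min_share:
            lines.append(f'    "{source}" -> "{target}" [label="{share:.2f}%"];')
    lines.append("}")
    return "\n".join(lines)

transitions = {
    ("chedy", "qokain"): 35.69,
    ("qokain", "chedy"): 22.92,
    ("dar", "ol"): 19.28,
}
dot_text = to_dot(transitions)
print(dot_text)
```

Saved to a file such as `groups.dot`, the text can be rendered with Graphviz, for example `dot -Tpng groups.dot -o groups.png`. The `min_share` threshold can be used to hide weak transitions and keep the graph readable.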

RE: An attempt at extracting grammar from vord order statistics. - davidd - 19-05-2025

(19-05-2025, 06:59 PM)Ruby Novacna Wrote: Hello Davidd!
I didn't manage to understand your idea, could you please present a step-by-step example with a single word of the B language, without using Python?

Only the first step involves looking at individual vords, so let me try to explain that.
Take the vord "qokeedy" as an example.
Everywhere it appears in the text (or the section under investigation), write down and count which vord comes before "qokeedy" (call these the beforevords) and which vord comes after it (call these the aftervords, haha).
Next, do the same for all other vords in the text, giving each vord its own set of beforevords and aftervords.
In the next step, compare the beforevords/aftervords of every vord in the text to the beforevords/aftervords of "qokeedy" to see how well they overlap. The more they overlap, the higher the score between that vord and "qokeedy".
After doing all this we see that "qokeedy" scores well with the following vords: 'qokeey', 'qokey', 'qokedy'.
We add these 4 vords to an initial vordgroup. That means we combine the beforevords/aftervords of those individual vords, basically treating them as if they were all the same vord in the text.
The hope/assumption is that all of these vords perform a similar grammatical function. Maybe they are all adjectives, usually coming before a noun and after a verb.
Think "pour cold water", "fetch hot water", "pour green water".

Let me know if this clarifies it for you.
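
The overlap score in this explanation could be computed in many ways, and the thread does not say which one is used. As one possible reading, the sketch below turns the beforevords and aftervords counts into shares and sums the shared part of each (a histogram intersection), then averages the two. The function names and all counts are made up for illustration.

```python
from collections import Counter

def share(counter):
    """Turn raw counts into shares that sum to 1 (empty dict if there are no counts)."""
    total = sum(counter.values())
    return {v: n / total for v, n in counter.items()} if total else {}

def overlap(counter_a, counter_b):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    a, b = share(counter_a), share(counter_b)
    return sum(min(a[v], b[v]) for v in a.keys() & b.keys())

def similarity(vord_a, vord_b, before, after):
    """Average of the beforevords overlap and the aftervords overlap of two vords."""
    empty = Counter()
    return (overlap(before.get(vord_a, empty), before.get(vord_b, empty))
            + overlap(after.get(vord_a, empty), after.get(vord_b, empty))) / 2

# Made-up counts, not taken from the manuscript.
before = {
    "qokeedy": Counter({"chedy": 3, "ol": 1}),
    "qokeey": Counter({"chedy": 2, "ol": 2}),
    "daiin": Counter({"dar": 4}),
}
after = {
    "qokeedy": Counter({"qokain": 2, "daiin": 2}),
    "qokeey": Counter({"qokain": 4}),
    "daiin": Counter({"chedy": 4}),
}
close = similarity("qokeedy", "qokeey", before, after)
far = similarity("qokeedy", "daiin", before, after)
```

With these toy counts "qokeedy" scores higher against "qokeey" than against "daiin", so the first two would be candidates for the same initial vordgroup.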

RE: An attempt at extracting grammar from vord order statistics. - Ruby Novacna - 19-05-2025

Thank you, Davidd, for your explanation. I understand the first step: you are looking for the words before and after the word qokeedy. There are 306 occurrences of qokeedy. Do you have a list of words that go with them?