The Voynich Ninja
[split] recurring word sequences - Printable Version




[split] recurring word sequences - doranchak - 31-03-2018

I am very curious about the general case of sequences of n "vords".

For example, instead of considering only patterns of the type XYXY, consider any sequence ABCD.  In other words, consider every combination of 4 vords.

For each combination, compute the expected number of occurrences based on your probability calculations.  Then compare to the actual count, and sort the list of combinations in descending order of the difference between actual and expected.  Which combinations are unusually repetitive (or unusually non-repetitive, i.e. phobic of specific combinations) and are thus statistically significant compared to random distributions of vords?  

I suspect this test might have already been performed - perhaps someone can point me to existing research on this.
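A minimal sketch of the comparison described above, assuming the simplest possible null model: each vord is drawn independently with its overall corpus frequency. The function name `rank_ngrams` and the toy word list are illustrative assumptions, not anything taken from an existing study.

```python
from collections import Counter

def rank_ngrams(words, n=4):
    """Rank observed n-grams by (observed count - expected count).

    Expected counts assume independent draws of each vord with its
    corpus frequency (an assumption of this sketch).
    """
    total = len(words)
    freq = Counter(words)
    positions = total - n + 1  # number of n-word windows in the text
    observed = Counter(tuple(words[i:i + n]) for i in range(positions))
    rows = []
    for gram, count in observed.items():
        p = 1.0
        for w in gram:
            p *= freq[w] / total
        expected = p * positions
        rows.append((gram, count, expected, count - expected))
    # Largest positive difference (most over-represented) first
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows

# Toy input only, not a real passage from the manuscript
sample = "ol shedy qokedy qokeedy daiin ol shedy qokedy qokeedy chedy".split()
for gram, obs, exp, diff in rank_ngrams(sample, n=4)[:3]:
    print(" ".join(gram), obs, round(exp, 4), round(diff, 4))
```

This only ranks combinations that occur at least once. Finding "phobic" combinations would additionally need the expected counts of combinations that never occur, which is only practical for frequent vords.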


RE: alternating patterns - MarcoP - 05-04-2018

(31-03-2018, 08:04 PM)doranchak Wrote: I am very curious about the general case of sequences of n "vords".

For example, instead of considering only patterns of the type XYXY, consider any sequence ABCD.  In other words, consider every combination of 4 vords.

For each combination, compute the expected number of occurrences based on your probability calculations.  Then compare to the actual count, and sort the list of combinations in descending order of the difference between actual and expected.  Which combinations are unusually repetitive (or unusually non-repetitive, i.e. phobic of specific combinations) and are thus statistically significant compared to random distributions of vords?  

I suspect this test might have already been performed - perhaps someone can point me to existing research on this.

Hello Doranchak,
it is surprising how little statistical research has been done on the manuscript. There is so much to discover!

You can find some information about repeating word sequences in a post by Julian Bunn (link not available in this version).

Apparently, there is a single repeating 4-word sequence:
ol shedy qokedy qokeedy
which occurs at f75v.P2.21 and f84r.P.10

There are a few 3-word sequences, but the most promising area for an extensive analysis is two-word sequences: they seem to be numerous enough to produce meaningful statistics.


RE: alternating patterns - Koen G - 05-04-2018

Interesting that there is so little repetition of three- and four-word sequences. Is it easy to run a test against a corpus of similar size? To give it a fair shot it shouldn't be something with fixed formulas like Homer or the Bible. And the language shouldn't be too analytic like modern English. For example "in the name of the" is a sequence of five, compared to Latin "in nomine", two. So my guess is that Latin prose may be a fair shot, though I'd still expect more repetition of sequences.


RE: alternating patterns - MarcoP - 05-04-2018

(05-04-2018, 08:51 PM)Koen Gh. Wrote: Interesting that there is so little repetition of three- and four-word sequences. Is it easy to run a test against a corpus of similar size? To give it a fair shot it shouldn't be something with fixed formulas like Homer or the Bible. And the language shouldn't be too analytic like modern English. For example "in the name of the" is a sequence of five, compared to Latin "in nomine", two. So my guess is that Latin prose may be a fair shot, though I'd still expect more repetition of sequences.

I suggest splitting the thread at post 33, so we can discuss the subject in its own place. This is something I recently discussed with Emma: I should be able to retrieve some comparisons.


RE: [split] recurring word sequences - Koen G - 06-04-2018

Good idea, split.


RE: [split] recurring word sequences - MarcoP - 06-04-2018

These are the data I discussed with Emma.
I searched for repeating word sequences, including those that span two lines. The Python script I used for this test is better than the quick search I did yesterday: in Takahashi's transcription, the VMS appears to have two repeating 4-word sequences.

shedy.qol.shedy.qokaiin
f78r.P.38
f80r.P.25

ol.shedy.qokedy.qokeedy
f75v.P2.21
f84r.P.10

I only checked for sequences at least 3 words long. Adding 2-word sequences could be an interesting experiment.
I compared Quire 13 with samples of a similar length, because the VMS seems to be less homogeneous than works like Genesis or Dante's Inferno. Quire 13 is the only section of the ms with repeating 4-word sequences: other sections would have given a non-zero result only for 3-word sequences.
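A search of this kind can be sketched roughly as follows. This is not the script used for the results above, just a minimal illustration that assumes the transcription has already been reduced to (locus, text) pairs with dot-separated words. The function name `repeated_sequences` and the loci `A.1`, `A.2`, `B.1` are hypothetical.

```python
from collections import defaultdict

def repeated_sequences(lines, n=4):
    """Return {n-gram: [locus of its first word, ...]} for n-grams seen more than once.

    `lines` is a list of (locus, text) pairs, words separated by dots.
    Words are concatenated across line breaks, so a sequence may span lines.
    """
    words, loci = [], []
    for locus, text in lines:
        for w in text.split("."):
            if w:  # skip empty tokens from stray dots
                words.append(w)
                loci.append(locus)
    found = defaultdict(list)
    for i in range(len(words) - n + 1):
        found[tuple(words[i:i + n])].append(loci[i])
    return {g: where for g, where in found.items() if len(where) > 1}

# Toy input with made-up loci and line contents
lines = [
    ("A.1", "daiin.ol.shedy"),
    ("A.2", "qokedy.qokeedy.chedy"),
    ("B.1", "ol.shedy.qokedy.qokeedy"),
]
for gram, where in repeated_sequences(lines, n=4).items():
    print(".".join(gram), where)
```

In this simple form, sequences may also run across paragraph or page boundaries. Restricting them to a single paragraph would require resetting the word buffer whenever the paragraph part of the locus changes.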

Pliny's Historia is made up of relatively short books on different subjects, so it could be a better comparison. My impression is that the number of repeating sequences depends strongly on the style and personal preferences of the author.


RE: alternating patterns - doranchak - 06-04-2018

(05-04-2018, 08:23 PM)MarcoP Wrote: Hello Doranchak,
it is surprising how little statistical research has been done on the manuscript. There is so much to discover!

You can find some information about repeating word sequences in a post by Julian Bunn (link not available in this version).

Apparently, there is a single repeating 4-word sequence:
ol shedy qokedy qokeedy
which occurs at f75v.P2.21 and f84r.P.10

There are a few 3-word sequences, but the most promising area for an extensive analysis is two-word sequences: they seem to be numerous enough to produce meaningful statistics.

That is the kind of work I enjoy doing. My main focus of research is the Zodiac ciphers, and I have spent a lot of time generating statistics, often by randomizing a ciphertext and using the randomizations to estimate the significance of certain observations. My curiosities around Voynich follow a similar pattern of thinking about how the various qualities of Voynichese compare to randomizations of the text. The "phobia" of 3- and 4-word sequence repetitions seems very significant because I would expect shuffles of the text to produce many more than are observed. It would also be very interesting to me to see if the 3- or 4-word repetitions are actually happening over certain distances (i.e., a pattern of words such as ABC might not repeat, but a pattern such as A??BC might, where "?" is a wildcard representing any other Voynich word).
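The wildcard idea (A??BC) could be explored with something like the sketch below. It assumes the text is available as a flat list of words; the function name `gapped_repeats`, the mask representation, and the toy word list are illustrative choices only.

```python
from collections import Counter

def gapped_repeats(words, mask):
    """Count repeated patterns where only positions with a True mask entry must match.

    mask=(True, False, False, True, True) corresponds to the pattern A??BC.
    Wildcard positions are shown as "?" in the returned keys.
    """
    span = len(mask)
    counts = Counter()
    for i in range(len(words) - span + 1):
        window = words[i:i + span]
        key = tuple(w if keep else "?" for w, keep in zip(window, mask))
        counts[key] += 1
    return {k: c for k, c in counts.items() if c > 1}

# Toy input only, not a real passage from the manuscript
words = "ol chedy daiin shedy qokedy ol aiin okal shedy qokedy".split()
print(gapped_repeats(words, (True, False, False, True, True)))
```

Running this over every mask up to some maximum span, and comparing the counts with those from shuffled text, would show whether repetitions cluster at particular distances.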


RE: alternating patterns - MarcoP - 06-04-2018

(06-04-2018, 11:32 AM)doranchak Wrote: That is the kind of work I enjoy doing. My main focus of research is the Zodiac ciphers, and I have spent a lot of time generating statistics, often by randomizing a ciphertext and using the randomizations to estimate the significance of certain observations. My curiosities around Voynich follow a similar pattern of thinking about how the various qualities of Voynichese compare to randomizations of the text. The "phobia" of 3- and 4-word sequence repetitions seems very significant because I would expect shuffles of the text to produce many more than are observed. It would also be very interesting to me to see if the 3- or 4-word repetitions are actually happening over certain distances (i.e., a pattern of words such as ABC might not repeat, but a pattern such as A??BC might, where "?" is a wildcard representing any other Voynich word).

I hope you will stick with the Voynich community: the ms is a true mine of curious patterns and people willing to help us dig are always welcome!

From the very limited comparisons I have examined, I wouldn't say there is a phobia for 3- and 4-word sequences. In particular, repeating 3-word sequences are about as frequent as in Pliny.
A factor that is difficult to evaluate is spelling variation (see Stephan Bax's discussion). Medieval scribes were often inconsistent in their spelling, particularly (though not only) when using abbreviations. On the other hand, the comparison texts I used have been carefully transcribed, reviewed and corrected through the centuries. Factors like these clearly interfere with the exact repetition of sequences.

I also included a simple experiment with a "scrambled" version of Voynich Quire 13, in which I randomly shuffled the words. Random word order gives a much lower number of repeating sequences (only 10 repeating 3-word sequences, versus 49 in the unmodified Quire 13).
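A shuffle comparison of this kind could be sketched as below. This is not the script behind the figures above: it assumes that "number of repeating sequences" means the number of distinct n-word sequences occurring at least twice, and the names `count_repeating` and `shuffled_counts` are made up for the illustration.

```python
import random
from collections import Counter

def count_repeating(words, n=3):
    """Number of distinct n-word sequences that occur at least twice."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sum(1 for c in counts.values() if c > 1)

def shuffled_counts(words, n=3, trials=100, seed=0):
    """Repeat the count on `trials` random permutations of the same words."""
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    results = []
    for _ in range(trials):
        perm = list(words)
        rng.shuffle(perm)
        results.append(count_repeating(perm, n))
    return results

# Toy input only; a real run would use the words of one section
words = "a b c x a b c y a b c".split()
observed = count_repeating(words, n=3)
baseline = shuffled_counts(words, n=3, trials=100)
# Share of shuffles with at least as many repeats as the original order
share = sum(1 for b in baseline if b >= observed) / len(baseline)
print(observed, share)
```

Using many shuffles instead of a single one gives a distribution to compare against, so the observed count can be expressed as an empirical proportion and does not rest on one scrambled sample.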