The Voynich Ninja
Characters repeating across word boundaries - Printable Version


Characters repeating across word boundaries - MarcoP - 29-01-2018

I am not completely sure I haven't made substantial mistakes. So take this with a grain of salt. Also, I don't know if this has been discussed before.

These graphs represent the expected (red) and actual (green) counts for consecutive words in which the last letter of the first word is the same as the first letter of the second word. For instance (where '.' represents a word boundary):

but.these
others.said
seventh.hour

I considered:
a classical Latin text (De Bello Gallico). 51328 words
early modern English (the gospel of John from King James Bible). 19131 words
a medieval Occitan text. 15461 words
medieval Italian (Divine Comedy). 105682 words
VMS - Takahashi's EVA transcription. 37718 words
VMS Currier-D'Imperio transcription, extracted via ivtt. 16120 words

The expected number of occurrences of each letter as -X.X- is computed in this way:
T is the total number of consecutive word pairs
E is the number of these pairs in which the first word ends with -X
S is the number of these pairs in which the second word starts with X-

The expected number of X.X is T*(E/T)*(S/T)=E*S/T 
For instance, in the VMS Takahashi transcription there are 32053 word couples.
In 1082 of these, the first word ends with EVA -o (1082/32053 = 3.4%)
In 7315 couples, the second word starts with o- (7315/32053 = 22.8%)

We expect 22.8% * 3.4% * 32053 = 247 occurrences of o.o (plotted in red), but we only find 114 actual occurrences (plotted in green).
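
The calculation above can be sketched in Python. This is only an illustration, not the script used for the graphs: the function names, the '.' separator and the choice to pair words only within a line are all assumptions (the post does not say how line breaks were handled).

```python
from collections import Counter

def word_pairs(lines, sep="."):
    """Yield consecutive word pairs, never pairing across a line break (an assumption)."""
    for line in lines:
        words = [w for w in line.split(sep) if w]
        yield from zip(words, words[1:])

def expected_count(e, s, t):
    """Expected number of X.X pairs if endings and beginnings were independent: E*S/T."""
    return e * s / t

def boundary_repeats(lines, sep="."):
    """Map each letter X to (expected, actual) counts of -X.X- repetitions."""
    pairs = list(word_pairs(lines, sep))
    total = len(pairs)                                    # T
    ends = Counter(a[-1] for a, _ in pairs)               # E for each letter
    starts = Counter(b[0] for _, b in pairs)              # S for each letter
    actual = Counter(a[-1] for a, b in pairs if a[-1] == b[0])
    return {x: (expected_count(ends[x], starts[x], total), actual[x])
            for x in sorted(set(ends) | set(starts))}

# Toy lines built from the English examples earlier in the post.
sample = ["but.these.others.said", "seventh.hour"]
for letter, (exp, act) in boundary_repeats(sample).items():
    print(letter, round(exp, 2), act)
```

With E = 1082, S = 7315 and T = 32053, expected_count returns about 247, which matches the o.o figure quoted above.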

While all the texts show some degree of difference between expected and actual, I think it is clear that for Latin and English the differences are small and in both directions, while they are much larger for Occitan and Italian and all due to smaller actual counts than expected. This is due to the fact that Occitan and Italian spelling is subject to euphonic transformations that avoid the repetition of the same vowel across word boundaries.

For instance, in Occitan, the final -e of the preposition 'de' is dropped before 'e-'
juntturas.de.las.cambas
fuelhas.d.erba

The same thing happens with the definite articles:
lo.matin
l.omme

One can also note that the most apparent shift from expected values in Latin corresponds to a.a. In my opinion, this is due to one of the very few euphonic variants of Latin: the two versions of the preposition a/ab, the second one being used before words starting with a vowel. 

a milibus
a germanis

ab inimicis
ab aliis
ab armis

In my opinion, this evidence (if confirmed) could support the word-boundary transformations discussed by Emma.

It is clear that, if what we observe in the VMS is due to euphonic transformations, these are quite different from those that happen in Latin languages. In these languages, transformations are limited to short and frequent words (mostly prepositions and articles) and typically affect the end of the words. The phenomena discussed by Emma affect longer words and mostly seem to happen at the beginning of words (I am thinking in particular of the [link not preserved] on the basis of the ending of the preceding word).


RE: Characters repeating across word boundaries - Koen G - 29-01-2018

Thanks, Marco, those are some great graphs again. I shall add them to the graphs thread and link to your thread for ease of retrieval.

Might there be something special about the examples where initial [o] is allowed to follow final [o]? For example, may a specific word-initial combination be excluded from this set, causing the discrepancy in the data?


RE: Characters repeating across word boundaries - MarcoP - 31-01-2018

(29-01-2018, 09:44 PM)Koen Gh. Wrote: Thanks, Marco, those are some great graphs again. I shall add them to the graphs thread and link to your thread for ease of retrieval.

Might there be something special about the examples where initial [o] is allowed to follow final [o]? For example, may a specific word-initial combination be excluded from this set, causing the discrepancy in the data?

Thank you for your comment, Koen!

I have tried to investigate the subject some more, but I couldn't find anything clear.
These 4 tables compare the most common words:
ending -o, 1. most common words before NOT(o) and 2. most common words before o-
starting o-, 3. most common words after NOT(o) and 4. most common words after o-

I have manually highlighted words to show that there is a large overlap between the tables (the highlighting isn't 100% accurate). I cannot see anything special in the o- starting words that appear after -o, nor in the -o ending words that appear before o-.
Another line of investigation could be examining words ending with -o in general. They are not terribly frequent, but not too rare either (a total of about 1000 occurrences). To avoid the appearance of -o.o-, Voynichese must either replace the ending -o before o-, or replace the starting o- after -o. While Emma has already done much work analysing o-, the -o suffix is much more obscure to me. A good field for further work....


RE: Characters repeating across word boundaries - MarcoP - 01-02-2018

I have made a few new experiments, using a random text (in which characters are independently generated, some having higher probabilities) and Fisk's generated text.

The random text produces almost identical bars for the expected and actual occurrences of -X.X-. Fisk's generated text exhibits actual occurrences that are consistently slightly less frequent than expected.

In both cases, the results are quite regular and don't show the large differences that appear in the VMS (and in the Italian / Occitan examples discussed above).
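
The random-text baseline can be reproduced in spirit with a few lines of Python. This is a sketch under stated assumptions, not the experiment described above: the alphabet, the letter weights, the word lengths (1 to 6) and the seed are all invented for illustration. Because every character is drawn independently, the expected and actual counts of -X.X- should come out close to each other.

```python
import random
from collections import Counter

def random_words(n_words, letters="oyaedlr", weights=(5, 4, 3, 3, 2, 2, 1), seed=0):
    """Generate words whose characters are drawn independently with unequal weights."""
    rng = random.Random(seed)
    return ["".join(rng.choices(letters, weights=weights, k=rng.randint(1, 6)))
            for _ in range(n_words)]

def expected_vs_actual(words):
    """Map each letter X to (E*S/T, observed count) for -X.X- repetitions."""
    pairs = list(zip(words, words[1:]))
    total = len(pairs)
    ends = Counter(a[-1] for a, _ in pairs)
    starts = Counter(b[0] for _, b in pairs)
    actual = Counter(a[-1] for a, b in pairs if a[-1] == b[0])
    return {x: (ends[x] * starts[x] / total, actual[x]) for x in ends if x in starts}

table = expected_vs_actual(random_words(20000))
for letter in sorted(table):
    exp, act = table[letter]
    print(f"{letter}.{letter}  expected {exp:8.1f}  actual {act:6d}")
```

Any systematic gap between the two columns in a real text is then a property of that text rather than of the counting method.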


RE: Characters repeating across word boundaries - MarcoP - 07-02-2018

The attached plot is based on a synthetic text generated by Torsten's app (I hope I have not made errors in processing the data).

In this text, word initial 'r' is much more frequent than in the VMS (6.4% vs 1.5%): as a consequence, -r.r- is expected to be the most frequent end-start repetition. The actual count is identical to the expected count (114).
The other repetitions fluctuate slightly above and below the expected counts, but, contrary to what can be observed in the manuscript, there is no clear avoidance of end-start repetition.


RE: Characters repeating across word boundaries - Torsten - 07-02-2018

The reason for the low value for 'y.y' is that words ending in 'y' are frequently followed by words beginning with 'qo'.
The reason for the low values for 'o.o' and 'r.r' is the high values for 'r.o', 'r.a' and 'o.l'.

y.q exp_out 1982 act_out 3494
r.o exp_out 1154 act_out 1378
r.a exp_out  309 act_out  942
o.l exp_out   39 act_out  148

It seems that the text generated by my app didn't contain enough sequences like 'ar.ar.or', 'ar.ar.ain.olr.ar.olor', 'sor.ar.al.ar', 'lor.ar.al' etc.
It seems that source words like 'ar' and 'or' need a higher chance to repeat multiple times.


RE: Characters repeating across word boundaries - Emma May Smith - 07-02-2018

(07-02-2018, 09:43 PM)Torsten Wrote: The reason for the low value for 'y.y' is that words ending in 'y' are frequently followed by words beginning with 'qo'.
The reason for the low values for 'o.o' and 'r.r' is the high values for 'r.o', 'r.a' and 'o.l'.

y.q exp_out 1982 act_out 3494
r.o exp_out 1154 act_out 1378
r.a exp_out  309 act_out  942
o.l exp_out   39 act_out  148

These aren't the reasons, but a restatement of the problem with more specifics. I've worked with Marco over a number of months to find exactly which characters follow others across spaces (we call them last-first combinations). There's a variety of them and they could demonstrate that the text is not a series of unconnected words but interactive pieces of a whole.

What Marco has done, quite sensibly, is narrow down the problem to a simple one: same character repeats across spaces. If proven it opens up the possible acceptance of a wide range of transformations (even beyond last-first combinations) which we've found.


RE: Characters repeating across word boundaries - Torsten - 07-02-2018

@Emma:
If 'y.y' is less frequent than expected, another combination must be more frequent than expected.

I fully agree that the words in the VMS are connected to each other. The question is only which words are connected to each other?
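
Torsten's remark that a deficit in one combination must be offset by a surplus in another follows from the E*S/T formula itself: for a fixed final letter, the expected counts over all possible initial letters add up to the same total as the actual counts. A minimal Python sketch that checks this on a toy sequence (the word sequence is made up for illustration and the function names are hypothetical):

```python
from collections import Counter

def last_first_table(words):
    """Map each (last, first) letter pair to (expected, actual) counts."""
    pairs = list(zip(words, words[1:]))
    total = len(pairs)
    ends = Counter(a[-1] for a, _ in pairs)
    starts = Counter(b[0] for _, b in pairs)
    actual = Counter((a[-1], b[0]) for a, b in pairs)
    return {(x, y): (ends[x] * starts[y] / total, actual[(x, y)])
            for x in ends for y in starts}

def row_surplus(table, last):
    """Sum of (actual - expected) over every pair whose first word ends in `last`."""
    return sum(act - exp for (x, _), (exp, act) in table.items() if x == last)

# Invented sequence of Voynichese-looking words, for illustration only.
words = "daiin.ol.chedy.qokeedy.ol.or.ar.aiin.chedy.qokedy".split(".")
table = last_first_table(words)
for last in sorted({x for x, _ in table}):
    # Rounded to hide floating-point noise; each row nets out to zero.
    print(last, round(row_surplus(table, last), 9))
```

So the zero-sum property says nothing about which combinations carry the surplus; that still has to be read off the table for the actual text.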