(12-11-2019, 04:50 PM)RobGea Wrote: There appears a marked difference between the two in the cosmological section with ch_ appearing much more often within the circles.
Thank you for your interesting observations!
[link] is particularly impressive. One can also see that some zodiac signs (e.g. Scorpio) have a marked preference for ch- while others (e.g. Cancer) don't.
(12-11-2019, 04:50 PM)RobGea Wrote: As an observation for futher research ( i didnt do any stats) sh_words seem to appear earlier in the lines than ch_.
sh_ words seem to like being the second word, f79r, f93r, f103v, [link] are notable.
This is also an interesting phenomenon.
1083 lines have a sh- word appearing before a ch- word
748 lines have a ch- word appearing before a sh- word
The two sets are not disjoint: some lines match both patterns. Also, by "before" I do not mean "immediately before".
For instance the following line matches both patterns, because chedy occurs before shekam and shekam occurs before chl:
sarol chedy shekam qokar chl ykeedy chckhy dalor dy
Excluding lines starting with ch/sh, so that we are not influenced by a line-initial preference for sh-, I find that:
904 lines have a sh- word appearing before a ch- word
659 lines have a ch- word appearing before a sh- word
So, yes, I think we can say that sh- words tend to occur before ch- words, though this is just a preference with many exceptions.
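For anyone who wants to repeat these counts independently, here is a minimal Python sketch of the line test as described above: "before" means anywhere earlier in the line, and one line can match both patterns. The function name `order_patterns` and the `skip_initial_bench` flag are hypothetical, and the input is assumed to be one line of space-separated EVA words, which may not match the transliteration file actually used for the numbers above.

```python
def order_patterns(line, skip_initial_bench=False):
    """Return (sh_before_ch, ch_before_sh) for one line of EVA text.

    Assumption: words are separated by single spaces.
    """
    words = line.split()
    # Optionally ignore lines that start with a ch-/sh- word.
    if skip_initial_bench and words and words[0].startswith(("ch", "sh")):
        return (False, False)
    sh_pos = [i for i, w in enumerate(words) if w.startswith("sh")]
    ch_pos = [i for i, w in enumerate(words) if w.startswith("ch")]
    if not sh_pos or not ch_pos:
        return (False, False)
    # "Before" is not "immediately before": any earlier position counts.
    sh_before_ch = min(sh_pos) < max(ch_pos)
    ch_before_sh = min(ch_pos) < max(sh_pos)
    return (sh_before_ch, ch_before_sh)


example = "sarol chedy shekam qokar chl ykeedy chckhy dalor dy"
print(order_patterns(example))  # (True, True): the line matches both patterns
```

Summing the first and second element over all lines would give the two totals; the version with `skip_initial_bench=True` corresponds to the second pair of counts.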
I also confirm that sh- seems to prefer the second position in a line:
721 sh words (22.2%) appear in the second position of a line
932 ch words (15.6%) appear in the second position of a line
As always, I could have made errors and it would be good to see independent measures about this.
PS:
191 sh- words (5.9%) appear at the end of lines
626 ch- words (10.5%) appear at the end of lines
Could it be that the line-ending preference of ch-words is the cause of sh- typically occurring earlier in a line?
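The positional percentages (second word, last word) could be re-measured along these lines. This is only a sketch: `position_share` is a hypothetical helper, the three input lines are made-up toy data, and single-word lines are counted as both first and last, which may differ from how the figures above were obtained.

```python
def position_share(lines, prefix, index):
    """Fraction of all words starting with `prefix` that sit at `index`.

    index=1 means second word of the line, index=-1 means last word.
    """
    total = 0
    at_index = 0
    for line in lines:
        words = line.split()
        # Turn a negative index into an absolute position for this line.
        target = index if index >= 0 else len(words) + index
        for i, word in enumerate(words):
            if word.startswith(prefix):
                total += 1
                if i == target:
                    at_index += 1
    return at_index / total if total else 0.0


toy_lines = ["daiin shedy chol", "shol chedy", "qokar shey dal chor"]
print(position_share(toy_lines, "sh", 1))   # share of sh- words in second position
print(position_share(toy_lines, "ch", -1))  # share of ch- words at line end
```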
Torsten, I deliberately kept it simple to keep it understandable, and the argument does not change if the factor is not 2 but 1.4 or some other similar number.
You still did not understand my post, because it precisely argues why it should be expected that there are such exceptions where sh words are more frequent than ch words.
The main point was, in any case, that the statistics you present relate to the full text of the MS, whereas you then link them to an explanation for behaviour 'per page', which is presented as a fact while it is no more than speculation.
This speculation is not supported by the outcomes of the queries at voynichese.com
The particular line-initial behaviour of such words is also not explained at all.
Finally, the main point of the original post, that sh-words tend to be shorter than ch-words, is also not explained at all.
The difference now seems to have been measured at about 0.3 characters on average.
The real question about this is whether this number is actually significant.
That means: is this evidence that there is a reason behind it, or is this something that could happen as a result of chance?
My gut feeling is that it looks significant, but gut feelings are extremely dangerous when it comes to statistics.
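One common way to put a number on "could this happen by chance" is a permutation test on the two sets of word lengths. The sketch below is not how anyone in this thread measured it; `permutation_p_value`, the iteration count and the seed are all illustrative choices. It also treats every word token as independent, which is doubtful for this text, so any p-value it gives would be on the optimistic side.

```python
import random


def permutation_p_value(lengths_a, lengths_b, n_iter=10_000, seed=0):
    """Two-sided permutation test for a difference in mean word length."""
    rng = random.Random(seed)
    n_a = len(lengths_a)
    observed = abs(sum(lengths_a) / n_a - sum(lengths_b) / len(lengths_b))
    pooled = list(lengths_a) + list(lengths_b)
    hits = 0
    for _ in range(n_iter):
        # Reassign the ch/sh labels at random and recompute the difference.
        rng.shuffle(pooled)
        part_a, part_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(part_a) / len(part_a) - sum(part_b) / len(part_b))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)
```

Feeding it the list of lengths of all ch-words and the list of lengths of all sh-words would give an estimate of how often a gap of the observed size appears when the labels carry no information.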
About the difference in average length:
I find that ch-words are on average 5.45 EVA characters long; for sh-words the figure is 5.17. The difference is 0.28.
This is the histogram for frequent word couples, where I replaced absolute counts with percentages of all ch-words / sh-words.
[attachment=3681]
The following is the graph for the distribution of word lengths.
[attachment=3682]
The higher number of 3-character sh-words appears to be mostly due to sho (with a small contribution from shy).
The prominence of shedy over chedy appears to be responsible for sh- having more 5-character words than ch-. Other 5-character couples like Xheol / Xheor show a much smaller preference for sh-, or even a slight preference for ch-.
Words 6 characters long or longer tend to uniformly favour ch- over sh-. I think this could just be an effect of sh-words concentrating at lengths 3 and 5. Overall, I think that sho and shedy might by themselves explain the observed difference in length.
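Whether sho and shedy alone account for the gap could be checked by recomputing the mean length with those couples left out. A rough sketch, assuming a word-to-count table is available; `mean_length` is a hypothetical helper and the counts below are invented toy numbers, not the real frequencies.

```python
from collections import Counter


def mean_length(word_counts, prefix, exclude=()):
    """Token-weighted mean length (in EVA characters) of words with `prefix`."""
    tokens = 0
    chars = 0
    for word, count in word_counts.items():
        if word.startswith(prefix) and word not in exclude:
            tokens += count
            chars += len(word) * count
    if tokens == 0:
        raise ValueError("no words left for prefix " + prefix)
    return chars / tokens


# Invented toy counts, for illustration only.
toy = Counter({"sho": 2, "shedy": 2, "sheody": 1, "chedy": 3, "cheody": 1})
print(mean_length(toy, "sh"))                            # all sh- words
print(mean_length(toy, "sh", exclude={"sho", "shedy"}))  # without sho and shedy
print(mean_length(toy, "ch", exclude={"cho", "chedy"}))  # matching ch- exclusion
```

For a fair comparison it is probably best to drop both members of each couple (sho with cho, shedy with chedy) and see whether the remaining difference is close to zero.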
Marco, that's fascinating, thank you. So what you're saying is that sh vs ch are interchangeable with a few very defined exceptions?
(13-11-2019, 08:49 PM)davidjackson Wrote: Marco, that's fascinating, thank you. So what you're saying is that sh vs ch are interchangeable with a few very defined exceptions?
Yes, as nablator wrote above, the two benches are mutually replaceable (in particular as prefixes). I guess this could also be the reason for davidsch's OP; Torsten's numbers point in this direction as well. I am not sure that the exceptions are very defined, but shedy and sho have a considerable impact because they are so frequent: maybe they could explain the difference in average length between ch-words and sh-words.
Thanks for the correlation plots. These are quite useful and address the argument that was proposed.
It is worth looking at this topic more closely.
The first and foremost thing about correlation is that correlation does not point to causality.
However, we should try to understand where the general positive correlation is coming from.
In this exercise we are comparing positive numbers only.
Furthermore, there is a clear influence from the varying page size:
- short pages typically have few ch words and sh words
- long pages can have more ch words and sh words
This automatically causes a trend with the model line going up for increasing numbers, and crossing the origin very close to (0,0).
This can be seen in essentially all plots. The dotted line behaves like this. Its slope is always close to the ratio: number of sh-words divided by number of ch-words.
Let's look at the plot for all words per page. The correlation coefficient is given as +0.55.
This might seem a strong positive correlation but this is quite misleading.
[attachment=3696]
In the low-area word count there is a large group where there is basically no correlation. All combinations of counts occur.
Interestingly, there is also a considerable group of pages where the sh-words are more frequent than the ch-words. These tend to be the longer pages. These two things together are already causing something like a positive correlation.
To illustrate the fact that varying page length dominates these results, let's look at the hypothetical case that the Voynich manuscript text were composed by picking words arbitrarily from a hat. In this case:
- more total words on a page -> more ch words on a page
- more total words on a page -> more sh words on a page
This causes a positive correlation between ch-words and sh-words, without any causality.
(Clearly in this case the text would be meaningless, but not the result of auto-copying).
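The hat argument can be tried out with a small simulation. Everything numeric here is invented for illustration (200 pages, 20 to 1000 words per page, 15% ch-words, 8% sh-words); the only point is that independent draws combined with varying page length already give a clearly positive correlation of per-page counts.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_hat(n_pages=200, p_ch=0.15, p_sh=0.08, seed=1):
    """Draw every word independently; only the page length varies."""
    rng = random.Random(seed)
    ch_counts, sh_counts = [], []
    for _ in range(n_pages):
        n_words = rng.randint(20, 1000)  # assumed spread of page sizes
        ch = sh = 0
        for _ in range(n_words):
            u = rng.random()
            if u < p_ch:
                ch += 1
            elif u < p_ch + p_sh:
                sh += 1
        ch_counts.append(ch)
        sh_counts.append(sh)
    return ch_counts, sh_counts


ch_counts, sh_counts = simulate_hat()
print(round(pearson(ch_counts, sh_counts), 2))
```

Narrowing the page-size range in `rng.randint` makes the coefficient drop towards zero, which is the same effect as restricting the comparison to words that are rare on the long pages.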
Because of all this, it is also interesting to look at the pair chol / shol.
The reason is that these words are relatively infrequent on the longer pages, which are all in Currier-B. As a result, the variation of absolute counts per page is much more limited.
The correlation for this case is given as +0.44.
This seems like a high positive correlation but for people used to working with correlation coefficients, it is not.
The figure is included here for visual verification that for any number of chol counts, the number of shol counts can vary considerably.
[attachment=3697]
To understand which correlation values are typical, here are some more samples:
First, Torsten's pairs, just as a test.
Corr(chedy[515], shedy[438]): 0.84
Corr(chey[350], shey[285]): 0.68
Corr(chol[405], shol[191]): 0.41
Corr(cheey[173], sheey[148]): 0.39
Corr(cho[68], sho[126]): 0.36
Corr(cheol[169], sheol[112]): 0.43
Corr(chy[154], shy[104]): 0.27
Corr(chor[213], shor[97]): 0.23
Corr(cheedy[57], sheedy[85]): 0.60
Then words which frequently appear adjacent:
Corr(or[367], aiin[464]): 0.50
Corr(ar[370], aiin[464]): 0.67
Corr(chol[405], daiin[892]): 0.18
Corr(ol[550], aiin[464]): 0.27
Corr(ol[550], chedy[515]): 0.58
Corr(daiin[892], aiin[464]): 0.20
ch <-> qok:
Corr(chedy[515], qokedy[269]): 0.73
Corr(chey[350], qokey[107]): 0.56
Corr(chol[405], qokol[97]): 0.07
Corr(cheey[173], qokeey[305]): 0.44
Corr(cho[68], qoko[6]): -0.03
Corr(cheol[169], qokeol[51]): 0.43
sh <-> qok:
Corr(shedy[438], qokedy[269]): 0.72
Corr(shey[285], qokey[107]): 0.63
Corr(shol[191], qokol[97]): 0.04
Corr(sheey[148], qokeey[305]): 0.53
Corr(sho[126], qoko[6]): 0.02
Corr(sheol[112], qokeol[51]): 0.27
sh <-> qot:
Corr(shedy[438], qotedy[92]): 0.61
Corr(shey[285], qotey[25]): 0.47
Corr(shol[191], qotol[45]): -0.07
Corr(sheey[148], qoteey[41]): 0.30
Corr(sho[126], qoto[3]): 0.05
Corr(sheol[112], qoteol[12]): 0.13
ok <-> qok:
Corr(okedy[119], qokedy[269]): 0.65
Corr(okey[65], qokey[107]): 0.49
Corr(okal[148], qokal[198]): 0.30
Corr(okeey[173], qokeey[305]): 0.78
Corr(oko[8], qoko[6]): -0.03
Corr(okeol[62], qokeol[51]): 0.34
Corr(okain[117], qokain[219]): 0.68
Corr(okaiin[216], qokaiin[266]): 0.66
Some random pairs with high correlation coefficient:
Corr(ol[550], qol[158]): 0.74
Corr(qokeedy[305], qokeey[305]): 0.71
Corr(al[273], ar[370]): 0.77
Corr(lchedy[118], qokeedy[305]): 0.70
Corr(checthy[28], shecthy[19]): 0.81
Corr(okain[117], otain[74]): 0.70
Corr(qokey[107], shedy[438]): 0.71
Corr(oteedy[96], qokeey[305]): 0.72
What if one used % of occurrences per page, instead of count of occurrences per page? Would this remove the impact of page size? Also: would removing pages with zero matches help?
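Both ideas are easy to try. The sketch below uses hypothetical helpers (`to_shares`, `drop_empty_pages`) and four invented pages, built so that the raw counts correlate strongly purely through page size while the per-page percentages do not. Percentages bring their own problems, though: on short pages they are very noisy, and shares of the same total tend to be slightly negatively correlated by construction.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def to_shares(counts, page_sizes):
    """Convert per-page counts into fractions of the page's word total."""
    return [c / n for c, n in zip(counts, page_sizes)]


def drop_empty_pages(xs, ys):
    """Keep only pages where at least one of the two words occurs."""
    kept = [(x, y) for x, y in zip(xs, ys) if x or y]
    return [x for x, _ in kept], [y for _, y in kept]


# Invented toy data: two short pages and two long pages.
page_sizes = [20, 200, 20, 200]
ch = [2, 20, 4, 40]
sh = [3, 10, 1, 30]

print(round(pearson(ch, sh), 2))  # raw counts
print(round(pearson(to_shares(ch, page_sizes), to_shares(sh, page_sizes)), 2))  # shares
```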