The Voynich Ninja

Full Version: sh_ and ch_ compose the same words
(12-11-2019, 04:50 PM)RobGea Wrote: There appears to be a marked difference between the two in the cosmological section, with ch_ appearing much more often within the circles.

Thank you for your interesting observations!
[link] is particularly impressive. One can also see that some zodiac signs (e.g. Scorpio) have a marked preference for ch- while others (e.g. Cancer) don't.

(12-11-2019, 04:50 PM)RobGea Wrote: As an observation for further research (I didn't do any stats), sh_ words seem to appear earlier in the lines than ch_.
sh_ words seem to like being the second word; f79r, f93r, f103v, [link] are notable.

This is also an interesting phenomenon.

1083 lines have a sh- word appearing before a ch- word
748 lines have a ch- word appearing before a sh- word

The two sets are not disjoint: some lines include both patterns. Also, by "before" I do not mean "immediately before".
For instance the following line matches both patterns, because chedy occurs before shekam and shekam occurs before chl:
sarol chedy shekam qokar chl ykeedy chckhy dalor dy

Excluding lines starting with ch/sh, so that we are not influenced by a line-initial preference for sh-, I find that:
904 lines have a sh- word appearing before a ch- word
659 lines have a ch- word appearing before a sh- word
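A minimal sketch of how counts like these could be reproduced, assuming a transliteration with one manuscript line per string and words separated by whitespace. The function names and the `skip_initial` flag are illustrative only; this is not the script that produced the numbers above. A line matches "sh before ch" if any sh- word precedes any ch- word, not necessarily immediately.

```python
def order_patterns(line, skip_initial=False):
    """Return (sh_before_ch, ch_before_sh) for one line, or None if excluded."""
    words = line.split()
    # Assumption: "lines starting with ch/sh" means the first word has that prefix.
    if skip_initial and words and words[0].startswith(("ch", "sh")):
        return None
    sh_pos = [i for i, w in enumerate(words) if w.startswith("sh")]
    ch_pos = [i for i, w in enumerate(words) if w.startswith("ch")]
    if not sh_pos or not ch_pos:
        return (False, False)
    # Some sh- word precedes some ch- word iff the first sh- is before the last ch-.
    sh_first = min(sh_pos) < max(ch_pos)
    ch_first = min(ch_pos) < max(sh_pos)
    return (sh_first, ch_first)


def count_patterns(lines, skip_initial=False):
    """Count lines matching each pattern; a line may count for both."""
    sh_first = ch_first = 0
    for line in lines:
        result = order_patterns(line, skip_initial)
        if result is None:
            continue
        sh_first += result[0]
        ch_first += result[1]
    return sh_first, ch_first


example = "sarol chedy shekam qokar chl ykeedy chckhy dalor dy"
both = order_patterns(example)  # this line matches both patterns
```

Because one line can match both patterns, the two totals returned by `count_patterns` do not have to add up to the number of lines containing both prefixes.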

So, yes,  I think we can say that sh- words tend to occur before ch- words, though this is just a preference with many exceptions.

I also confirm that sh- seems to prefer the second position in a line:
721 sh words (22.2%) appear in the second position of a line
932 ch words (15.6%) appear in the second position of a line
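For anyone who wants an independent measure, here is a small sketch of the positional share, assuming the percentages above are taken over all sh- (or ch-) words and that each line is a whitespace-separated string. `position_share` is a hypothetical helper name; index 1 is the second word, and -1 can be used for the line-final word.

```python
def position_share(lines, prefix, index):
    """Fraction of words starting with `prefix` that sit at `index` in their line.

    Negative indices count from the end of the line, as in Python lists.
    """
    total = at_index = 0
    for line in lines:
        words = line.split()
        target = index if index >= 0 else len(words) + index
        for i, word in enumerate(words):
            if word.startswith(prefix):
                total += 1
                if i == target:
                    at_index += 1
    return at_index / total if total else 0.0


sample = ["daiin shedy chol", "shol chey", "ol shey chedy chor"]
sh_second = position_share(sample, "sh", 1)
ch_second = position_share(sample, "ch", 1)
ch_last = position_share(sample, "ch", -1)
```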

As always, I could have made errors and it would be good to see independent measures about this.


PS:
191 sh- words (5.9%) appear at the end of lines
626 ch- words (10.5%) appear at the end of lines

Could it be that the line-ending preference of ch-words is the cause of sh- typically occurring earlier in a line?
(12-11-2019, 07:39 AM)ReneZ Wrote:
Going back to the facts, words including 'ch' are roughly twice as numerous as words including 'sh'. More than that, this holds true for similar word patterns, so typically for all word types including a 'ch', by replacing 'ch' with 'sh' one tends to find half the number.

This is 'across the board'. It is an observation, regardless what process is at the origin of this behaviour.

You oversimplify the facts. Words including 'ch' are not always roughly twice as numerous as words including 'sh'. You probably didn't read the previous [link] arguing against such an idea. Only a number of words starting with 'ch' behave this way, but this does not hold for all words starting with 'ch' and 'sh'. I explicitly point to 49 'sh'-words that are used more frequently than their 'ch' counterparts and name words like <sho>, <sheedy>, <she>, and <shee> as examples. Words where 'ch' and 'sh' are preceded by other characters behave differently: 'sh' is, for instance, less frequent after 'k' and 't', but it is not less frequent after 'd'. Words using two instances of 'ch'/'sh' also behave differently (see [link]).

Moreover, what you write misrepresents what I say. I say that for 'sh'-word-types with at least 4 tokens the corresponding 'ch'-word exists. In other words, the corresponding 'ch'-word-type is missing only for rarely used 'sh'-word-types. Additionally, I say that the 'ch'-words are typically used more frequently. I didn't say that they are 'twice as numerous'. Such a statement would oversimplify the statistical results even for words starting with 'ch' and 'sh'.
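The two claims in this paragraph can be checked with a short sketch like the following. It assumes a flat list of word tokens and pairs only words that begin with 'sh' or 'ch'; the function names are hypothetical and this is not the algorithm from the earlier post.

```python
from collections import Counter


def unmatched_sh_types(tokens, min_tokens=4):
    """sh- word types with at least `min_tokens` tokens whose ch- partner never occurs."""
    counts = Counter(tokens)
    missing = []
    for word, n in counts.items():
        if word.startswith("sh") and n >= min_tokens:
            partner = "ch" + word[2:]
            if partner not in counts:
                missing.append(word)
    return sorted(missing)


def sh_dominant_types(tokens):
    """sh- word types that are more frequent than their existing ch- partner."""
    counts = Counter(tokens)
    return sorted(
        word
        for word, n in counts.items()
        if word.startswith("sh") and n > counts.get("ch" + word[2:], 0) > 0
    )


# Toy data only; "shxyz" is an invented word used to show an unmatched type.
toy = ["shedy"] * 5 + ["chedy"] * 7 + ["shxyz"] * 4 + ["shol"] * 2 + ["sho"] * 3 + ["cho"]
```

If the claim holds for the real transliteration, `unmatched_sh_types(tokens, 4)` should come back empty.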


(12-11-2019, 07:39 AM)ReneZ Wrote:
Quote:
There is a reason for this result. The chance for a 'sh'-word to occur on a page increases as more often the corresponding 'ch'-word appears on that page:
[link]
[link]


This is the first time that statistics 'per page' enter the argument, so it is not based on what is written before.

I wrote "There is a reason for this result." This statement was probably not clear enough. It would have been better if I had added a sentence saying that the statistical result indicates "a relation between 'sh'- and 'ch'-words" and that it is necessary to explain this type of relation.


(12-11-2019, 07:39 AM)ReneZ Wrote:
This is where the facts end.

Again, the only suggestion for the behaviour per page is given by the outputs of voynichese.com

If you are interested in checking all 612 'ch'/'sh'-word-pairs please use the algorithm provided in my previous post or send me a message.


(12-11-2019, 07:39 AM)ReneZ Wrote: It is probably worth looking at them in detail to see if there is any evidence of the suggested cause. I find that there are plenty of pages where it does not hold even remotely, even for these very frequent words.

I clearly say that the chance for a 'sh'-word is increased. There is no statement about every page or that the relation between two words is the only factor. Therefore it is no surprise that it is possible to point to pages where 'shedy' is used more frequently and also to pages where 'shedy' is used less frequently than 'chedy'. You even wrote yourself that you expect this outcome: "since a single page is a much smaller sample of text, one should expect a much greater dispersion of the statistics."

The point of the links to voynichese.com is that pages containing at least five instances of 'chedy' also contain the word 'shedy', and that pages containing at least five instances of 'shedy' also contain the word 'chedy'. In the same way, it is only possible to point to one folio with seven instances of 'chey' but no instance of 'shey' (see [link]). This is what I mean by my statement "the chance for a 'sh'-word to occur on a page increases as more often the corresponding 'ch'-word appears on that page".
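A check of this kind can be sketched as below, assuming per-page word counts are available as a mapping from a page label to a dictionary of counts. The helper name and the page labels in the toy data are made up for illustration.

```python
def pages_missing_partner(page_counts, word, partner, min_count=5):
    """Pages with at least `min_count` instances of `word` and none of `partner`."""
    return sorted(
        page
        for page, counts in page_counts.items()
        if counts.get(word, 0) >= min_count and counts.get(partner, 0) == 0
    )


# Toy counts; "pA", "pB" and "pC" are placeholder labels, not real folios.
toy_pages = {
    "pA": {"chedy": 6, "shedy": 2},
    "pB": {"chey": 7},
    "pC": {"chedy": 5},
}
exceptions = pages_missing_partner(toy_pages, "chedy", "shedy")
```

An empty result for a pair means that no page reaches the threshold for one word without also containing its partner.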

When it comes to 'chedy' and 'shedy', see also the paper by Montemurro and Zanette. They write: "there is a strong link between words that share a suffix, as with chedy-shedy in the group shown in Figure 2A" ([link]).

Your further comments are obviously not based on anything I said in my previous post.
Torsten, I deliberately kept it simple to keep it understandable, and the argument does not change if the factor is not 2 but 1.4 or some other similar number. 

You still did not understand my post, because it precisely argues why it should be expected that there are such exceptions where sh words are more frequent than ch words.

The main point was, in any case, that the statistics you present are related to the full text of the MS, whereas you then link it to an explanation for behaviour 'per page', which is then presented as a fact, while it is not more than a speculation.
This speculation is not supported by the outcomes of the queries at voynichese.com

The particular line-initial behaviour of such words is also not explained at all.

Finally, the main purpose of the original post, that sh-words tend to be shorter than ch-words is also not explained at all.
This now seems to have been measured to be 0.3 on average.

The real question about this is whether this number is actually significant.
That means: is this evidence that there is a reason behind it, or is this something that could happen as a result of chance?

My gut feeling is that it looks significant, but gut feelings are extremely dangerous when it comes to statistics.
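One way to put a number on "could this happen by chance" is a permutation test on the difference in mean word length. This is only a sketch of that standard technique; nobody in this thread reports having run it. It assumes two lists of word lengths (one for ch- words, one for sh- words) and treats tokens as exchangeable, which real text may well violate.

```python
import random


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        # Small tolerance so rounding does not hide ties with the observed value.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction: the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


p_clear = permutation_p_value([5] * 10, [3] * 10)
p_none = permutation_p_value([4] * 10, [4] * 10)
```

A small p-value would say that relabelling words at random rarely produces a gap as large as the observed one; it would not say what causes the gap.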
About the difference in average length:

I get that ch-words are on average 5.45 EVA characters long; for sh- the measure is 5.17. The difference is 0.28.

This is the histogram for frequent word couples, where I replaced absolute counts by % on all ch-words / sh-words.

[attachment=3681]

The following is the graph for the distribution of word lengths.

[attachment=3682]

The higher number of 3-character sh-words appears to be mostly due to sho (with a small contribution from shy).
The prominence of shedy over chedy appears to be responsible for sh- having more 5-character words than ch-. Other 5-character couples like Xheol / Xheor show a much smaller preference for sh-, or even a slight preference for ch-.

Words 6 characters long or longer tend to uniformly favour ch- over sh-. I think this could just be an effect of sh- words concentrating at lengths 3 and 5. Overall, I think that sho and shedy might by themselves explain the observed difference in length.
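The idea that sho and shedy alone account for the gap could be tested by recomputing the averages with those pairs left out. The sketch below assumes token-weighted averages over words starting with ch- or sh-, and it drops both members of an excluded pair; both choices are assumptions, and the function names are illustrative.

```python
def mean_length(tokens, prefix, exclude=()):
    """Mean length in characters of tokens starting with `prefix`, minus exclusions."""
    lengths = [len(w) for w in tokens if w.startswith(prefix) and w not in exclude]
    return sum(lengths) / len(lengths) if lengths else float("nan")


def length_gap(tokens, exclude_stems=()):
    """Mean ch- length minus mean sh- length.

    A stem such as "edy" removes both chedy and shedy before averaging.
    """
    ch_excluded = {"ch" + stem for stem in exclude_stems}
    sh_excluded = {"sh" + stem for stem in exclude_stems}
    return mean_length(tokens, "ch", ch_excluded) - mean_length(tokens, "sh", sh_excluded)


toy = ["chedy", "chol", "cheody", "sho", "sho", "shedy", "shol"]
gap_all = length_gap(toy)
gap_trimmed = length_gap(toy, exclude_stems=("o", "edy"))
```

If the gap on the real text shrinks towards zero once the stems "o" and "edy" are excluded, that would support the explanation above.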
Marco, that's fascinating, thank you. So what you're saying is that sh vs ch are interchangeable with a few very defined exceptions?
(13-11-2019, 06:24 AM)ReneZ Wrote:
The main point was, in any case, that the statistics you present are related to the full text of the MS, whereas you then link it to an explanation for behaviour 'per page', which is then presented as a fact, while it is not more than a speculation. This speculation is not supported by the outcomes of the queries at voynichese.com 


If you find it more useful to look at a number or at a chart than into the MS itself this is not a problem. It is only necessary to count some word tokens. The correlation coefficient between the token frequencies for <chedy> and <shedy> for all folios is +0.84 and the correlation coefficient between the token frequencies for <chey> and <shey> for all folios is +0.67. 

Pearson's Correlation(chedy[501],shedy[426]): +0.84 (n=225 folios, p=5.44E-50)
[attachment=3694]

Pearson's Correlation(chey[344], shey[283]) : +0.67 (n=225 folios, p=6.12E-31)
[attachment=3693]

Pearson's Correlation(chol[396], shol[186]) : +0.44 (n=225 folios, p=3.4E-12)
[attachment=3691]

Pearson's Correlation(cheey[174],sheey[144]): +0.4  (n=225 folios, p=3.09E-10)
[attachment=3692]

Pearson's Correlation(cho[68],   sho[130])  : +0.3  (n=225 folios)
Pearson's Correlation(cheol[172],sheol[114]): +0.45 (n=225 folios)
Pearson's Correlation(chy[155],  shy[104])  : +0.28 (n=225 folios)
Pearson's Correlation(chor[219], shor[97])  : +0.22 (n=225 folios)
Pearson's Correlation(cheedy[59],sheedy[84]): +0.63 (n=225 folios)
...
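For readers who want to recompute coefficients like the ones listed, here is a minimal sketch of Pearson's correlation over per-folio token counts. It assumes the text is available as a mapping from folio label to a list of word tokens; `folio_counts` is a hypothetical helper and the folio labels in the toy data are placeholders. It does not compute p-values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / sqrt(sxx * syy)


def folio_counts(folios, word):
    """Token count of `word` on every folio, in the mapping's iteration order."""
    return [tokens.count(word) for tokens in folios.values()]


# Toy data with placeholder labels; every folio is included, even with zero counts.
toy = {
    "A": ["chedy", "shedy", "ol"],
    "B": ["chedy", "chedy", "shedy", "shedy"],
    "C": ["daiin", "ol"],
}
r = pearson(folio_counts(toy, "chedy"), folio_counts(toy, "shedy"))
```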

These results allow the conclusion that the chance for a 'sh'-word to occur on a folio increases the more often the corresponding 'ch'-word appears on that folio. Keep in mind that this doesn't mean that 'sh'-words solely depend on their 'ch'-counterparts or that 'sh' and 'ch' words are the same. It only means that they depend in some way on each other.

Additionally I have calculated the token frequencies for all 612 ch/sh word pairs for all folios. The resulting correlation coefficient is +0.55. To make comparison easier I have also calculated the correlation coefficient for the word frequencies for all 612 ch/sh word pairs for the whole MS. This correlation coefficient is +0.93.

Pearson's Correlation(count(chWords@folios),count(shWords@folios)): +0.55 (n=225 folios*612 word pairs=137700,p=too close to zero to calculate)
[attachment=3690]

Pearson's Correlation(count(chWords@VMS), count(shWords@VMS)): +0.93 (n=612 word pairs,p=4.54E-268)
[attachment=3689]

We are all talking about the same manuscript, no matter what we think about it. It is hard to believe that it should be a problem to agree on facts such as that the only instance of <rolchey> and the only instance of <rolshey> are found on folio [link], or that the only instance of <sheolkain> and the only instance of <cheolkain> are found on folio [link].

(13-11-2019, 06:24 AM)ReneZ Wrote: The real question about this is whether this number is actually significant. That means: is this evidence that there is a reason behind it, or is this something that could happen as a result of chance?

My gut feeling is that it looks significant, but gut feelings are extremely dangerous when it comes to statistics.

Your gut feelings are correct. Your chance to explain the usage of 'ch' and 'sh' in the MS as a result of chance is zero*.

*(The [link] for the word frequencies for all 612 ch/sh word pairs for the whole MS is p=4.54E-268, that is, a decimal fraction with 267 zeros between the decimal point and the digits 454.)
(13-11-2019, 08:49 PM)davidjackson Wrote: Marco, that's fascinating, thank you. So what you're saying is that sh vs ch are interchangeable with a few very defined exceptions?

Yes, as nablator wrote above, the two benches are mutually replaceable (in particular as prefixes). I guess this could also be the reason for davidsch's OP; Torsten's numbers point in this direction as well. I am not sure that the exceptions are very defined, but shedy and sho have a considerable impact because they are so frequent: maybe they could explain the difference in average length between ch-words and sh-words.
Thanks for the correlation plots. These are quite useful and address the argument that was proposed.

It is worth looking at this topic more closely.

The first and foremost thing about correlation is that correlation does not point to causality.

However, we should try to understand where the general positive correlation is coming from.
In this exercise we are comparing positive numbers only.
Furthermore, there is a clear influence from the varying page size:
- short pages typically have few ch words and sh words
- long pages can have more ch words and sh words
This automatically causes a trend with the model line going up for increasing numbers, and crossing the origin very close to (0,0).
This can be seen in essentially all plots. The dotted line behaves like this. Its slope is always close to the ratio:
nr of sh-words divided by nr of ch-words.

Let's look at the plot for all words per page. The correlation coefficient is given as +0.55
This might seem a strong positive correlation but this is quite misleading.

[attachment=3696]

In the area of low word counts there is a large group where there is basically no correlation. All combinations of counts occur.
Interestingly, there is also a considerable group of pages where the sh-words are more frequent than the ch-words. These tend to be the longer pages. These two things together are already causing something like a positive correlation.

To illustrate the fact that varying page length is dominating these results, let's look at the hypothetical case that the Voynich manuscript text were composed by picking words arbitrarily from a hat. In this case:

- more total words on a page -> more ch words on a page
- more total words on a page -> more sh words on a page

This causes a positive correlation between ch-words and sh-words, without any causality.
(Clearly in this case the text would be meaningless, but not the result of auto-copying).
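The hat experiment can be simulated directly. In the sketch below every word on a page is drawn independently, so there is no link between ch- and sh- words beyond page length. The probabilities 0.15 and 0.08, the page-length range 20 to 500 and the seeds are arbitrary illustrative values, not measurements from the manuscript.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def hat_correlation(page_lengths, p_ch=0.15, p_sh=0.08, seed=1):
    """Correlation of ch- and sh- counts per page when words are drawn at random."""
    rng = random.Random(seed)
    ch_counts, sh_counts = [], []
    for length in page_lengths:
        ch = sh = 0
        for _ in range(length):
            draw = rng.random()
            if draw < p_ch:
                ch += 1
            elif draw < p_ch + p_sh:
                sh += 1
        ch_counts.append(ch)
        sh_counts.append(sh)
    return pearson(ch_counts, sh_counts)


length_rng = random.Random(0)
varying_pages = [length_rng.randint(20, 500) for _ in range(225)]
fixed_pages = [260] * 225

r_varying = hat_correlation(varying_pages)  # page length varies widely
r_fixed = hat_correlation(fixed_pages)      # every page has the same length
```

With widely varying page lengths the coefficient comes out clearly positive, while with equal page lengths it stays near zero, even though the draws are independent in both cases.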

Because of all this, it is also interesting to look at the pair chol / shol.

The reason is that these words are relatively infrequent on the longer pages, which are all in Currier-B. As a result, the variation of absolute counts per page is much more limited.
The correlation for this case is given as +0.44
This seems like a high positive correlation but for people used to working with correlation coefficients, it is not.
The figure is included here for visual verification that for any number of chol counts, the number of shol counts can vary considerably.

[attachment=3697]
To understand which correlation values are typical, here are some more samples:

First, Torsten's pairs, just as a test.
Corr(chedy[515], shedy[438]): 0.84
Corr(chey[350], shey[285]): 0.68
Corr(chol[405], shol[191]): 0.41
Corr(cheey[173], sheey[148]): 0.39
Corr(cho[68], sho[126]): 0.36
Corr(cheol[169], sheol[112]): 0.43
Corr(chy[154], shy[104]): 0.27
Corr(chor[213], shor[97]): 0.23
Corr(cheedy[57], sheedy[85]): 0.60

Then words which frequently appear adjacent:
Corr(or[367], aiin[464]): 0.50
Corr(ar[370], aiin[464]): 0.67
Corr(chol[405], daiin[892]): 0.18
Corr(ol[550], aiin[464]): 0.27
Corr(ol[550], chedy[515]): 0.58
Corr(daiin[892], aiin[464]): 0.20

ch <-> qok:
Corr(chedy[515], qokedy[269]): 0.73
Corr(chey[350], qokey[107]): 0.56
Corr(chol[405], qokol[97]): 0.07
Corr(cheey[173], qokeey[305]): 0.44
Corr(cho[68], qoko[6]): -0.03
Corr(cheol[169], qokeol[51]): 0.43

sh <-> qok:
Corr(shedy[438], qokedy[269]): 0.72
Corr(shey[285], qokey[107]): 0.63
Corr(shol[191], qokol[97]): 0.04
Corr(sheey[148], qokeey[305]): 0.53
Corr(sho[126], qoko[6]): 0.02
Corr(sheol[112], qokeol[51]): 0.27

sh <-> qot:
Corr(shedy[438], qotedy[92]): 0.61
Corr(shey[285], qotey[25]): 0.47
Corr(shol[191], qotol[45]): -0.07
Corr(sheey[148], qoteey[41]): 0.30
Corr(sho[126], qoto[3]): 0.05
Corr(sheol[112], qoteol[12]): 0.13

ok <-> qok:
Corr(okedy[119], qokedy[269]): 0.65
Corr(okey[65], qokey[107]): 0.49
Corr(okal[148], qokal[198]): 0.30
Corr(okeey[173], qokeey[305]): 0.78
Corr(oko[8], qoko[6]): -0.03
Corr(okeol[62], qokeol[51]): 0.34
Corr(okain[117], qokain[219]): 0.68
Corr(okaiin[216], qokaiin[266]): 0.66

Some random pairs with high correlation coefficient:

Corr(ol[550], qol[158]): 0.74
Corr(qokeedy[305], qokeey[305]): 0.71
Corr(al[273], ar[370]): 0.77
Corr(lchedy[118], qokeedy[305]): 0.70
Corr(checthy[28], shecthy[19]): 0.81
Corr(okain[117], otain[74]): 0.70
Corr(qokey[107], shedy[438]): 0.71
Corr(oteedy[96], qokeey[305]): 0.72
What if one used % of occurrences per page, instead of count of occurrences per page? Would this remove the impact of page size? Also: would removing pages with zero matches help?
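Both variants are easy to try. The sketch below divides each count by the number of words on the page before correlating, and can optionally drop pages where neither word occurs. It assumes pages are given as lists of word tokens; `relative_correlation` is a hypothetical name and "x" in the toy data stands for any other word. Whether this fully removes the page-size effect is left open here; the code only shows the computation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / sqrt(sxx * syy)


def relative_correlation(pages, word_a, word_b, drop_empty=False):
    """Correlate per-page relative frequencies instead of raw counts."""
    xs, ys = [], []
    for tokens in pages:
        if not tokens:
            continue
        a, b = tokens.count(word_a), tokens.count(word_b)
        if drop_empty and a == 0 and b == 0:
            continue  # skip pages where neither word occurs
        xs.append(a / len(tokens))
        ys.append(b / len(tokens))
    return pearson(xs, ys)


toy_pages = [
    ["chedy", "shedy"] + ["x"] * 8,
    ["chedy", "chedy", "shedy"] + ["x"] * 7,
    ["chedy", "shedy", "shedy"] + ["x"] * 7,
    ["x"] * 10,
]
r_all = relative_correlation(toy_pages, "chedy", "shedy")
r_nonzero = relative_correlation(toy_pages, "chedy", "shedy", drop_empty=True)
```

In this toy data the single page with zero matches is enough to flip the sign of the coefficient, which suggests the zero pages matter at least as much as the normalisation.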