The Voynich Ninja

Full Version: sh_ and ch_ compose the same words
Thanks to @farmerjohn for the additional comparisons. They show that the same effect is also observed for pairs other than ch* / sh*.

To Marco: working with percentages would remove the contribution to the correlation that just comes from page length.  In the imaginary case that I used to illustrate the effect (arbitrarily picking words), the expected correlation would be 0.
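The "words from a hat" case can be tried out with a small simulation. This is only a sketch under invented assumptions: the page lengths (20 to 400 words) and the global shares of ch-words (10%) and sh-words (5%) are made up for illustration and are not taken from the manuscript. Raw counts per page come out strongly correlated purely because of page length, while the per-page fractions show a correlation close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_pages(n_pages=500, p_ch=0.10, p_sh=0.05, seed=1):
    """Draw every word of every page independently 'from a hat'.

    All parameters are hypothetical; they only control the toy model.
    Returns a list of (page_length, ch_count, sh_count).
    """
    rng = random.Random(seed)
    pages = []
    for _ in range(n_pages):
        length = rng.randint(20, 400)  # invented spread of page lengths
        ch = sh = 0
        for _ in range(length):
            u = rng.random()
            if u < p_ch:
                ch += 1
            elif u < p_ch + p_sh:
                sh += 1
        pages.append((length, ch, sh))
    return pages


pages = simulate_pages()

# Correlation of absolute counts: driven by page length alone.
raw_r = pearson([ch for _, ch, _ in pages], [sh for _, _, sh in pages])

# Correlation of fractions: the page-length contribution is removed.
pct_r = pearson([ch / n for n, ch, _ in pages], [sh / n for n, _, sh in pages])

print(f"raw counts: {raw_r:+.2f}   fractions: {pct_r:+.2f}")
```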

Pages with zero matches (for both words) should probably not be removed, as this is one of the possible results.

Finally, this is still not on the topic of David's question....
(14-11-2019, 08:44 AM)ReneZ Wrote: To illustrate the fact that varying page length is dominating these results, let's look at the hypothetical case that the Voynich manuscript text were composed by picking words arbitrarily from a hat. In this case:
- more total words on a page -> more ch words on a page
- more total words on a page -> more sh words on a page

True. To discount this linear relation (on average) between the two, the number of words of a certain type on a page can be compared to the expected number of such words (computed from the global frequency ratio and the total number of words on the page) to see whether the page has more or fewer words of this type than expected. Plotted this way, an excess (+, horizontal axis) or shortage (-, horizontal axis) of ch_ words does not look correlated with an excess (+, vertical axis) or shortage (-, vertical axis) of sh_ words.

[attachment=3699]
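For anyone who wants to redo this, the excess/shortage calculation described above can be sketched as follows. The page tuples are invented numbers, not counts from the manuscript, and the function name `excess_counts` is just a label chosen for this sketch.

```python
def excess_counts(pages):
    """Return (ch_excess, sh_excess) per page.

    pages: list of (total_words, ch_count, sh_count).
    Expected count = global frequency ratio * total words on the page.
    Positive values mean more words than expected, negative values fewer.
    """
    total_words = sum(n for n, _, _ in pages)
    ch_ratio = sum(ch for _, ch, _ in pages) / total_words
    sh_ratio = sum(sh for _, _, sh in pages) / total_words
    return [(ch - ch_ratio * n, sh - sh_ratio * n) for n, ch, sh in pages]


# Hypothetical pages: (total words, ch_ words, sh_ words)
pages = [(100, 12, 4), (200, 18, 12), (50, 5, 2), (150, 15, 7)]

excess = excess_counts(pages)
for (n, ch, sh), (dx, dy) in zip(pages, excess):
    print(f"{n:4d} words: ch excess {dx:+.1f}, sh excess {dy:+.1f}")
```

By construction the excesses sum to zero over all pages, so the points scatter around the origin instead of along a line set by page length.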

I'm sure the same statistic can be done on words starting with any two common letters, on pages of any book and the plot will look as randomly scattered as this one.
Hi nablator, I would love to see the same graph for, say, chedy / shedy only.
I do not understand these conclusions. (In my bowl, apples and pears are both fruit: chedy = shedy.)

Let me ask you Torsten, in your view, could you answer :

from the medieval manuscript viewpoint, for a random hypothetical invented alphabet

If we would use the letter c similar to the letter k, but have a slight preference for one of them,
would you then more likely see if they are 

a) different
b) they behave similar

c) neither

(of course you could try take other letters such as s&z, u &v, v&w b&p etc.)
(14-11-2019, 02:58 PM)MarcoP Wrote: Hi nablator, I would love to see the same graph for, say, chedy / shedy only.
Hi MarcoP,

[attachment=3700]
Thank you, Nablator!
Your plot confirms that, in this case, a significant correlation exists.

From the list at the left of the graph, it seems to me that you have removed pages with no occurrences of either word? Above, Rene suggested to keep them:

Quote:Pages with zero matches (for both words) should probably not be removed, as this is one of the possible results. 
I think that excluding those pages could result in a lower correlation coefficient.

I think it is quite possible that adding together several correlated phenomena produces something that appears to be uncorrelated (as in your overall plot for ch- vs sh-). I also suspect that the ratio of chX to shX being different for different Xs could blur things when all words are added up together.
I'm finding it increasingly difficult to grasp what these plots are telling us :)

With respect to the exclusion of pages without occurrences, on second thought this seems better, especially for cases like chedy/shedy since there is quite a large number of pages where they don't occur and these would just create a pile of points on (0,0) (in this example).

It seems to me as if the observed behaviour could be quite dependent on the 'section' of the MS.

I wonder if it would be possible to have a copy of the Excel file with the counts.
(14-11-2019, 09:07 PM)ReneZ Wrote: With respect to the exclusion of pages without occurrences, on second thought this seems better, especially for cases like chedy/shedy since there is quite a large number of pages where they don't occur and these would just create a pile of points on (0,0) (in this example).

They would all be on a line in the negative x/y quadrant, perfectly correlated! :)

Quote:I wonder if it would be possible to have a copy of the Excel file with the counts.

Sure.

[attachment=3707]
[attachment=3709]
(14-11-2019, 08:44 AM)ReneZ Wrote: However, we should try to understand where the general positive correlation is coming from.
In this exercise we are comparing positive numbers only.

This is exactly the point. For all 'sh'-word-types with at least 4 tokens the corresponding 'ch'-word exists. In other words, there are no word pairs with appreciable negative correlation*.

*(The most "extreme" negative correlation occurs for word pairs like [link] or [link]. In this case, however, the words rarely occur more than once on the same folio. In the case of che/she, only <she> occurs twice on folio [link].
  Pearson's Correlation(count(kchey[21]), count(kshey[6])): -0.05
  Pearson's Correlation(count(che[2]),    count(she[25])) : -0.03)


(14-11-2019, 08:44 AM)ReneZ Wrote: Furthermore, there is a clear influence from the varying page size:
- short pages typically have few ch words and sh words
- long pages can have more ch words and sh words
This automatically causes a trend with the model line going up for increasing numbers, and crossing the origin very close to (0,0).
This can be seen in essentially all plots. The dotted line behaves like this. Its slope is always close to the ratio:
nr of sh-words divided by nr of ch-words.

The correlation coefficient is defined as "dividing the covariance of the two variables by the product of their standard deviations" (see [link]). The deviation for small values is probably small. Therefore small values have less impact, and zero matches only weaken the result.

Let's illustrate this effect with an example. In my previous post I have calculated the correlation coefficient only for the word frequencies for the 612 ch/sh word pairs for the whole MS. If I calculate the correlation coefficient for the token frequencies for all 1211 shWords instead this doesn't have much impact since most of the 599 additional shWords occur only once anyway. Therefore, the correlation coefficient for all 1211 shWords is +0.928 whereas the correlation coefficient for the 612 ch/sh word pairs was +0.93.
Pearson's Correlation(count(chWords@VMS), count(shWords@VMS)): +0.9304 (n= 612 word pairs,p=4.54E-268)
Pearson's Correlation(count(chWords@VMS), count(shWords@VMS)): +0.9280 (n=1211 shWords   ,p=too close to zero to compute)
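The quoted definition (covariance divided by the product of the standard deviations) can be written out directly. The sketch below uses invented count pairs, not the real token frequencies, and the 50 extra singleton sh-words without a ch counterpart are a hypothetical stand-in for the additional shWords. It is only meant to show that a pile of points near the origin shifts the coefficient very little when a few high-count pairs dominate the variance.

```python
from math import sqrt


def pearson(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Invented (ch-word count, sh-word count) pairs, NOT the real Voynich counts.
paired = [(500, 420), (300, 200), (150, 160), (80, 50), (40, 45),
          (20, 10), (10, 12), (5, 3), (2, 2), (1, 1)]

# Hypothetical extra sh-words that occur once and have no ch counterpart.
singletons = [(0, 1)] * 50

r_paired = pearson(*zip(*paired))
r_all = pearson(*zip(*(paired + singletons)))

print(f"pairs only:      {r_paired:+.4f}")
print(f"with singletons: {r_all:+.4f}")
```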

What does this mean for varying page sizes?
To exclude an effect of the folio size, I have calculated the correlation coefficient for folios containing less text (herbal folios in quires 1 to 7) and for folios containing a lot of text (Quires 13 and 20). The correlation coefficient for all 428 ch/sh word pairs for all 111 herbal folios until folio [link] (quires 1 to 7 and folio f57r) is +0.34. The correlation coefficient for all 489 ch/sh word pairs for all of Quires 13 and 20 is +0.65. The correlation coefficient for folios with less text is weaker (+0.34) and the correlation coefficient for folios full of text is stronger (+0.65) than the coefficient for the whole MS (+0.55). In other words, the effect is more dominant for folios containing more text, but it is also observable for folios containing less text.

Pearson's Correlation(count(chWords@folios), count(shWords@folios)): +0.55 (n=612 word pairs * 225 folios=137700,p=too close to zero to compute)
Pearson's Correlation(count(chWords@herbal), count(shWords@herbal)): +0.34 (n=428 word pairs * 111 folios= 47508,p=too close to zero to compute)
[attachment=3704]

Pearson's Correlation(count(chWords@Quire13+20), count(shWords@Quire13+20)): +0.65 (n=489 word pairs * 43 folios= 21027,p=too close to zero to compute)
[attachment=3705]

Additionally, I have calculated the token frequencies for chedy/shedy for Quires 13 and 20. The resulting correlation coefficient is +0.58. (The correlation coefficient for chedy/shedy for the whole MS was +0.84. But to exclude the influence of varying page sizes, the data sample excluded numerous folios containing <chedy> and <shedy>.)

Pearson's Correlation(chedy[501]@VMS,        shedy[426]@VMS)       : +0.836 (n=225 folios,p=5.44E-50)
Pearson's Correlation(chedy[400]@Quire13+20, shedy[360]@Quire13+20): +0.58  (n= 43 folios,p=4.23E-05)
[attachment=3706]

It is notable that the data points above the trend line belong to Quire 13 and the points below the trend line belong to Quire 20. Therefore I have also calculated the correlation coefficient for each quire separately:
Pearson's Correlation(chedy[210]@Quire13, shedy[247]@Quire13): +0.76 (n=20 folios)
Pearson's Correlation(chedy[190]@Quire20, shedy[113]@Quire20): +0.3  (n=23 folios)

The result for Quire 20 explains the calculated result of +0.58 for Quire 13 and 20. It also illustrates the following statement: "No obvious rule can be deduced which words form the top-frequency tokens at a specific location because a token dominating one page might be rare or missing on the next one" (Timm & Schinner 2019, p. 3).

(14-11-2019, 08:44 AM)ReneZ Wrote: Because of all this, it is also interesting to look at the pair chol / chor.

The reason is that these words are relatively infrequent on the longer pages, which are all in Currier-B. As a result, the variation of absolute counts per page is much more limited.
The correlation for this case is given as +0.44

As explained before, the effect is stronger on pages containing more text. Folios with less text simply contain fewer repeated chWords and/or fewer repeated shWords. The reason for the observed correlation coefficient therefore is that chol/shol are relatively infrequent on the longer pages. But even for folios containing less text, like [link], the effect can be obvious (see [link]).


(14-11-2019, 08:44 AM)ReneZ Wrote: This seems like a high positive correlation but for people used to working with correlation coefficients, it is not.
The figure is included here for visual verification that for any number of chol counts, the number of shol counts can vary considerably.

Indeed, +0.44 "only" indicates a moderate positive linear relationship (see [link]).
(14-11-2019, 03:39 PM)Davidsch Wrote: Let me ask you Torsten, in your view, could you answer :

from the medieval manuscript viewpoint, for a random hypothetical invented alphabet

If we would use the letter c similar to the letter k, but have a slight preference for one of them,
would you then more likely see if they are 

a) different
b) they behave similar

c) neither

I would expect that both letters are interchangeable and that in one context letter 'c' is used and in another context letter 'k'. It would for instance surprise me to see multiple instances of 'color' and 'colour' in the same paragraph and I would find it stunning to read 'color' and 'colour' along with 'kolor' and 'kolour'.