I am not sure whether this was mentioned, but the statistical results might also change depending on whether you use only words that begin with ch_ and sh_.
For the other words, where that is not the case, I preferred the following (I don't know how many there were):
words that have ch/sh inside must be split by inserting a space before the ch/sh, so that each becomes 2 words, where the second word starts with ch/sh
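The splitting rule described above could be sketched roughly as follows. This is only an illustration, assuming a lowercase EVA-style transliteration in which the two glyphs are written literally as "ch" and "sh"; the function name `split_before_ch_sh` is made up for this example.

```python
import re

# Zero-width position that is preceded by a non-space character and
# followed by "ch" or "sh". Assumption: lowercase EVA-style text.
_INNER_CH_SH = re.compile(r"(?<=\S)(?=[cs]h)")


def split_before_ch_sh(text: str) -> str:
    """Insert a space before every word-internal ch/sh.

    Words that already start with ch/sh are left untouched, because the
    lookbehind requires a non-space character in front of the match.
    """
    return _INNER_CH_SH.sub(" ", text)


print(split_before_ch_sh("okchor otchor chedy qokeshedy"))
```

A transliteration that writes the second glyph as "Sh" would need the character class adjusted accordingly.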
(15-11-2019, 12:48 PM)Davidsch Wrote: For example if we see ok-chor- everywhere and find only 3 instances ot-chor, we can simply assume those 3 times were a slip of the pen.
(ok-chor and ot-chor have almost an equal count, something like 18 and 20, which is very usual for these ch-sh words)
This is exactly the point. We would expect to find <chedy> everywhere and to see only some rare instances of <shedy>. But this is not the outcome we get. <shedy> is not used instead of <chedy>; it is used together with <chedy>. We would expect a negative correlation between 'ch' and 'sh', but what we find is a positive correlation.
(16-11-2019, 02:54 PM)Davidsch Wrote: words that have ch/sh inside must be split, such a way that there is a space inserted before ch/sh, such that it becomes 2 words, where the second word start with ch/sh
Interesting...
![[Image: 3,width=650,height=650,appearanceId=2.jpg]](https://image.spreadshirtmedia.net/image-server/v1/products/T949A2PA2009PT25X0Y10D117975437FS5256/views/3,width=650,height=650,appearanceId=2.jpg)
(16-11-2019, 06:51 AM)ReneZ Wrote: The effect of varying page length, which causes an artificial positive correlation, seems to be effectively eliminated by two different methods:
1) the one proposed by @nablator, by subtracting the expected values based on some average
2) by computing the percentages
While they are different, the resulting correlations are quite similar (certainly equivalent), and any remaining correlation is clearly significant.
The word pairs with ch / Sh largely have a ratio of almost 2:1, but the most frequent pair chedy : Shedy is clearly an exception. This is mostly caused by the biological section, which again turns out to be 'different' in yet one more respect. It has always been seen as the most repetitive of all texts in the MS.
You did not mention that you also removed all zero matches. You warned first that this might be a mistake: "Pages with zero matches (for both words) should probably not be removed, as this is one of the possible results." Your warning was correct.
Keep also in mind that percentages are bounded by [0, 1], and the underlying assumption of the Pearson Correlation test is that values are normally distributed; these are manifestly incompatible. Moreover, the same percentage values can stand for different page sizes. Therefore it is hard to interpret correlation values between percentages computed relative to page length, especially since we cannot exclude that the page size has an influence.
(16-11-2019, 08:34 PM)Torsten Wrote: You did not mention that you also removed all zero matches. You warned first that this might be a mistake: "Pages with zero matches (for both words) should probably not be removed, as this is one of the possible results." Your warning was correct.
For the chedy / Shedy pair, there are zero matches in *all* A-language pages, and there are no zero matches, as far as I know, in the B-language pages. The statistics per 'section' of the MS are therefore not affected by this.
(16-11-2019, 08:34 PM)Torsten Wrote: Keep also in mind that percentages are bounded by [0, 1], and the underlying assumption of the Pearson Correlation test is that values are normally distributed;
One can compute the correlation coefficient for any set of data without any assumption on their distribution.
You have done exactly the same. Counts are bounded by [0, N] and are nowhere near normally distributed.
(16-11-2019, 08:34 PM)Torsten Wrote: ... the same percentage values can stand for different page sizes. Therefore it is hard to interpret correlation values between percentages from page length especially since we can not exclude that the page size has any influence.
By comparing the correlation for 'counts' and the correlation for 'percentages', as has been done in this thread, one can see the influence of the page size. This is precisely what has demonstrated that for this particular, and quite interesting, word pair, only the biological section has a significant positive correlation.
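To make the comparison concrete, here is a small sketch of the three quantities discussed in this thread: the correlation of raw counts, of percentages, and of counts minus an expected value. The page data are invented for illustration, and "expected value" is taken here to mean page length times the overall rate of the word, which is only one possible reading of nablator's method.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def percentages(counts, page_lengths):
    """Count divided by the number of tokens on the same page."""
    return [c / n for c, n in zip(counts, page_lengths)]


def residuals(counts, page_lengths):
    """Count minus the count expected from page length alone.

    Assumption: expected count = page length * overall rate of the word.
    """
    rate = sum(counts) / sum(page_lengths)
    return [c - rate * n for c, n in zip(counts, page_lengths)]


# Invented example data, NOT taken from the manuscript.
page_lengths = [100, 200, 300, 400]
chedy = [2, 3, 7, 8]
shedy = [1, 5, 5, 9]

r_counts = pearson(chedy, shedy)
r_pct = pearson(percentages(chedy, page_lengths),
                percentages(shedy, page_lengths))
r_resid = pearson(residuals(chedy, page_lengths),
                  residuals(shedy, page_lengths))
print(r_counts, r_pct, r_resid)
```

In this toy case the raw counts correlate strongly only because both grow with page length; after either normalisation the positive correlation disappears.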
(16-11-2019, 09:39 PM)ReneZ Wrote: One can compute the correlation coefficient for any set of data without any assumption on their distribution.
You have done exactly the same. Counts are bounded as [0, N] and are nowhere similar to being normally distributed.
You obviously didn't understand my comment (see stats.stackexchange.com). Moreover, it is one of the stunning facts about the VMS that the word length distribution matches a binomial distribution almost perfectly.
(16-11-2019, 09:39 PM)ReneZ Wrote: By comparing the correlation for 'counts' and the correlation for 'percentages', as has been done in this thread, one can see the influence of the page size. This is precisely what has demonstrated that for this particular, and quite interesting, word pair, only the biological section has a significant positive correlation.
Exactly, "the effect is stronger on pages containing more text".
(16-11-2019, 09:39 PM)ReneZ Wrote: This is precisely what has demonstrated that for this particular, and quite interesting, word pair, only the biological section has a significant positive correlation.
Keep in mind that Prescott H. Currier used the positive correlation for "final 'dy'" to distinguish between Currier A and B. This also includes <chedy> and <shedy>. Currier wrote in 1976:
"The Herbal Section contains both Language 'A' and 'B.' The principal differences between the two 'languages' in this Section are:
(a) Final 'dy' is very high in Language 'B'; almost non-existent in Language 'A.'
..." (Currier 1976).
Currier described these differences as follows: "Suffice it to say, the differences are obvious and statistically significant."
Moreover, by "reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B.'" (Timm & Schinner 2019, p. 7).
You even mentioned yourself the fact that the distribution of <chedy> and <shedy> allows us to identify specific sections in the MS.
(16-11-2019, 10:33 PM)Torsten Wrote: You obviously didn't understand my comment (see stats.stackexchange.com).
I am afraid that you don't have the slightest idea what I understand or don't understand.
I prefer a more constructive way of discussing.
(16-11-2019, 10:40 PM)ReneZ Wrote: (16-11-2019, 10:33 PM)Torsten Wrote: You obviously didn't understand my comment (see stats.stackexchange.com).
I am afraid that you don't have the slightest idea what I understand or don't understand.
I prefer a more constructive way of discussing.
Please excuse my hasty words, I did not want to sound that mean.
My point is that I find it hard to interpret the outcome of the Pearson Correlation test for percent values. For instance, the correlation coefficient for the percentage values for Quires 13 and 20 together is +0.71, whereas the result for the token frequencies is +0.58. But what does the value of +0.71 mean? The value is even higher than the value for Quire 13 alone.
Pearson's Correlation(percent(chedy[400]@Quire13+20), percent(shedy[360]@Quire13+20)): +0.71 (n= 43 folios,p=4.23E-05)
Pearson's Correlation(chedy[400]@Quire13+20, shedy[360]@Quire13+20): +0.58 (n= 43 folios,p=4.23E-05)
Pearson's Correlation(percent(chedy[210]@Quire13), percent(shedy[247]@Quire13)): +0.68 (n= 20 folios)
Pearson's Correlation(chedy[210]@Quire13, shedy[247]@Quire13): +0.76 (n=20 folios)
Pearson's Correlation(percent(chedy[190]@Quire20), percent(shedy[113]@Quire20)): +0.12 (n=23 folios)
Pearson's Correlation(chedy[190]@Quire20, shedy[113]@Quire20): +0.3 (n=23 folios)
While I understand the usefulness of percentages in removing the spurious correlation induced by varying page-lengths, I second Torsten's question. I am also interested in understanding what can be the meaning of overall correlation being greater than that of the individual sections.
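One purely arithmetical way in which a pooled correlation can exceed the correlation inside each subset is when the subsets sit at different overall levels. The sketch below uses invented numbers for two hypothetical sections; it is not a claim about what actually happens in Quires 13 and 20.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented per-page values for two hypothetical sections.
# Inside each section the two words are uncorrelated,
# but section B has higher values of both words than section A.
section_a_x, section_a_y = [1, 2, 1, 2], [2, 1, 1, 2]
section_b_x, section_b_y = [6, 7, 6, 7], [7, 6, 6, 7]

r_a = pearson(section_a_x, section_a_y)
r_b = pearson(section_b_x, section_b_y)
r_pooled = pearson(section_a_x + section_b_x, section_a_y + section_b_y)
print(r_a, r_b, r_pooled)
```

The pooled coefficient here mostly measures the difference between the two section averages, not any page-by-page relationship within a section.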
I have tried computing page-percentage plots and correlation measures for a few more pairs of ch- / sh- words. The plot I get for chedy / shedy seems to me reasonably close (though not identical) to the one computed by Rene on the basis of Nablator's spreadsheet. I have included {0,0} pages. The red line is y=x.
As always, it is possible I have made errors somewhere.
[attachment=3720]
These are the plots for the next four more frequent ch- words vs their sh- counterparts.
[attachment=3719]
These are correlation values on the whole manuscript and on various subsections. I extracted subsections using the ivtff +I parameter. It seems that what I get is not always identical to the classification used by Rene. For instance, for Cosmo chedy/shedy, I also find matches in 68v3 and 70r2, while 85r2 (having no illustrations) is classified as "Text".
In general, it seems that the overall correlation is rather low. Per-section correlation also tends to be low, but there are a couple of possibly interesting exceptions, e.g. chey/shey in Stars. It seems clear that, with increasingly rarer words, data points tend to cluster on the two axes, with one of the two counts = 0. Could it be that, for words which only have 1 or 2 average occurrences per page, it would be more informative to measure folio or bi-folio percentages?
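A folio-level percentage could be obtained by summing counts and token totals over all pages of a folio before dividing. The following is a rough sketch, assuming page identifiers of the form "f75r", "68v3" or "85r2" (optional "f", folio number, r/v, optional panel number); the helper names and the example numbers are invented. Bi-folio grouping would additionally need the quire structure, which is not modelled here.

```python
import re
from collections import defaultdict

# Assumed page id pattern: optional "f", folio number, r or v, optional panel.
_PAGE_ID = re.compile(r"^f?(\d+)[rv]\d*$")


def folio_of(page_id: str) -> str:
    """Map a page id such as 'f68v3' or '85r2' to its folio, e.g. 'f68'."""
    match = _PAGE_ID.match(page_id)
    if match is None:
        raise ValueError(f"unrecognised page id: {page_id!r}")
    return "f" + match.group(1)


def folio_percentages(page_stats):
    """page_stats maps page id -> (word count, total tokens on the page).

    Returns folio -> summed word count / summed tokens.
    """
    sums = defaultdict(lambda: [0, 0])
    for page_id, (count, tokens) in page_stats.items():
        folio = folio_of(page_id)
        sums[folio][0] += count
        sums[folio][1] += tokens
    return {folio: c / t for folio, (c, t) in sums.items() if t > 0}


# Invented numbers, for illustration only.
example = {"f75r": (3, 300), "f75v": (1, 100), "68v3": (0, 50)}
print(folio_percentages(example))
```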
Section | N.Pages | Xhedy  | Xhol   | Xhey   | Xhor   | Xheey  |
_ALL    | 225     | 0.771  | 0.497  | 0.220  | 0.229  | 0.011  |
HerbalA | 95      | -      | 0.388  | 0.148  | -0.007 | -0.045 |
HerbalB | 32      | 0.414  | 0.220  | -0.102 | -0.183 | 0.010  |
Stars   | 23      | 0.120  | -0.051 | 0.522  | 0.214  | -0.124 |
Bio     | 20      | 0.687  | 0.334  | 0.253  | -0.093 | 0.346  |
Pharma  | 16      | -0.066 | 0.513  | -0.107 | 0.209  | -0.446 |
Cosmo   | 12      | 0.647  | -0.124 | 0.349  | -0.079 | -0.217 |
No problem, let the discussion continue...
I have no satisfactory explanation for the phenomenon, but the artefacts from varying page length are important, and I think sufficiently demonstrated. Any pair of words that occur consistently throughout a meaningful text will show this type of correlation if the text is cut in unequal pieces and the correlation is computed over these pieces.
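The page-length artefact can be reproduced with a simple simulation. In the sketch below two words occur independently at a fixed rate in a random "text" that is cut into pages of very unequal length; all parameters (300 pages, 50 to 1000 tokens per page, a rate of 3% for each word) are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_pages=300, p_a=0.03, p_b=0.03, seed=1):
    """Draw pages of unequal length; each token is word A, word B or 'other'."""
    rng = random.Random(seed)
    lengths, a_counts, b_counts = [], [], []
    for _ in range(n_pages):
        n_tokens = rng.randint(50, 1000)
        a = b = 0
        for _ in range(n_tokens):
            u = rng.random()
            if u < p_a:
                a += 1
            elif u < p_a + p_b:
                b += 1
        lengths.append(n_tokens)
        a_counts.append(a)
        b_counts.append(b)
    return lengths, a_counts, b_counts


lengths, a_counts, b_counts = simulate()
r_counts = pearson(a_counts, b_counts)
r_pct = pearson([a / n for a, n in zip(a_counts, lengths)],
                [b / n for b, n in zip(b_counts, lengths)])
print(f"counts: {r_counts:+.2f}  percentages: {r_pct:+.2f}")
```

With these settings the count correlation should come out clearly positive while the percentage correlation stays near zero, even though the two words are unrelated by construction.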
One can normalise in several different ways. Two have been used here. A third would be to concatenate all the text and then cut it in equal pieces, but this may be less useful, since there is evidently some folio-dependence in the MS text.
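The third option, cutting the concatenated text into equal pieces, might look like the sketch below. The chunk size and the choice to drop an incomplete final chunk are assumptions of this example, and the token list is invented.

```python
def equal_chunks(tokens, size):
    """Cut a token list into consecutive chunks of exactly `size` tokens.

    A shorter leftover at the end is dropped so that all chunks are equal.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def counts_per_chunk(tokens, size, word):
    """Number of occurrences of `word` in each equal-sized chunk."""
    return [chunk.count(word) for chunk in equal_chunks(tokens, size)]


# Invented token sequence, for illustration only.
tokens = ["chedy", "chedy", "ol", "shedy", "daiin", "shedy", "chedy"]
print(counts_per_chunk(tokens, 3, "chedy"))
print(counts_per_chunk(tokens, 3, "shedy"))
```

The two resulting count lists can then be correlated directly, since every chunk has the same length.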
For one thing, this phenomenon shows that trying to turn this text into something meaningful by simple substitution will not work. No language behaves like this. This thread is specifically about the similarity of ch_ and sh_ words, but similar things are happening for other pairs, for example words including k vs. t etc.
(17-11-2019, 02:21 PM)MarcoP Wrote: The plot I get for chedy / shedy seems to me reasonably close (though not identical) to the one computed by Rene on the basis of Nablator's spreadsheet. I have included {0,0} pages. The red line is y=x.
I think that nablator's spreadsheet was based on the Takeshi Takahashi transcription. If you used another one, the difference would be explained.