Davidsch > 16-11-2019, 02:54 PM
Torsten > 16-11-2019, 05:39 PM
(15-11-2019, 12:48 PM)Davidsch Wrote: For example, if we see ok-chor- everywhere and find only 3 instances of ot-chor, we can simply assume those 3 times were a slip of the pen.
(ok-chor and ot-chor have almost an equal count, something like 18 and 20, which is very usual for these ch-sh words)
nablator > 16-11-2019, 06:26 PM
(16-11-2019, 02:54 PM)Davidsch Wrote: words that have ch/sh inside must be split in such a way that a space is inserted before ch/sh, so that each becomes 2 words, where the second word starts with ch/sh
Interesting...
Torsten > 16-11-2019, 08:34 PM
(16-11-2019, 06:51 AM)ReneZ Wrote: The effect of varying page length, which causes an artificial positive correlation, seems to be effectively eliminated by two different methods:
1) the one proposed by @nablator, by subtracting the expected values based on some average
2) by computing the percentages
While they are different, the resulting correlations are quite similar (certainly equivalent), and any remaining correlation is clearly significant.
The word pairs with ch / Sh largely have a ratio of almost 2:1, but the most frequent pair chedy : Shedy is clearly an exception. This is mostly caused by the biological section, which again turns out to be 'different' in yet one more respect. It has always been seen as the most repetitive of all texts in the MS.
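To make the two corrections quoted above concrete, here is a minimal sketch in Python. The page data is invented for illustration and is not taken from the MS. The "expected value" in method 1 is assumed to be the overall rate of the word multiplied by the page length, which may differ from what nablator's spreadsheet actually does.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / sqrt(sxx * syy)

# Invented data, NOT Voynich counts:
# (words on page, count of word A, count of word B)
pages = [(100, 3, 1), (100, 1, 3), (200, 6, 2),
         (200, 2, 6), (400, 12, 4), (400, 4, 12)]
sizes = [n for n, _, _ in pages]
a = [x for _, x, _ in pages]
b = [y for _, _, y in pages]

# Raw counts: both words grow with page length, which pushes r upwards.
r_counts = pearson(a, b)

# Method 1 (assumed reading): subtract the count expected from the
# overall rate of the word times the page length.
rate_a = sum(a) / sum(sizes)
rate_b = sum(b) / sum(sizes)
res_a = [x - rate_a * n for x, n in zip(a, sizes)]
res_b = [y - rate_b * n for y, n in zip(b, sizes)]
r_residuals = pearson(res_a, res_b)

# Method 2: convert counts to a fraction of the page length.
pct_a = [x / n for x, n in zip(a, sizes)]
pct_b = [y / n for y, n in zip(b, sizes)]
r_percent = pearson(pct_a, pct_b)

print(f"counts: {r_counts:.3f}  residuals: {r_residuals:.3f}  "
      f"percentages: {r_percent:.3f}")
```

With these made-up numbers the two words are perfectly anti-correlated once page length is taken out (both corrections give -1), while the raw counts give only about -0.06, because the longer pages lift both counts together.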
ReneZ > 16-11-2019, 09:39 PM
(16-11-2019, 08:34 PM)Torsten Wrote: You did not mention that you also removed all zero matches. You warned first that this might be a mistake: "Pages with zero matches (for both words) should probably not be removed, as this is one of the possible results." Your warning was correct.
(16-11-2019, 08:34 PM)Torsten Wrote: Keep also in mind that percentages are bounded by [0, 1], and the underlying assumption of the Pearson Correlation test is that values are normally distributed;
(16-11-2019, 08:34 PM)Torsten Wrote: ... the same percentage values can stand for different page sizes. Therefore it is hard to interpret correlation values between percentages from page length, especially since we cannot exclude that the page size has any influence.
Torsten > 16-11-2019, 10:33 PM
(16-11-2019, 09:39 PM)ReneZ Wrote: One can compute the correlation coefficient for any set of data without any assumption on their distribution.
(16-11-2019, 09:39 PM)ReneZ Wrote: By comparing the correlation for 'counts' and the correlation for 'percentages', as has been done in this thread, one can see the influence of the page size. This is precisely what has demonstrated that for this particular, and quite interesting, word pair, only the biological section has a significant positive correlation.
(16-11-2019, 09:39 PM)ReneZ Wrote: This is precisely what has demonstrated that for this particular, and quite interesting, word pair, only the biological section has a significant positive correlation.
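On the point about distribution assumptions: the coefficient itself can always be computed, but the usual significance test for it does rely on assumptions about the data. One way to judge significance without assuming normality is a permutation test. This is a different technique from anything used in this thread, shown only as a rough sketch. The function name `permutation_p_value` and the sample data are invented for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson coefficient.

    Shuffles ys to break any pairing with xs and counts how often the
    shuffled |r| is at least as large as the observed |r|.
    """
    rng = random.Random(seed)      # fixed seed so the result is repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Invented, perfectly linear data: the p-value should come out very small.
xs = list(range(12))
ys = [2 * x + 1 for x in xs]
p = permutation_p_value(xs, ys)
print(f"r = {pearson(xs, ys):.3f}, permutation p = {p:.4f}")
```

The same function could be fed per-page percentages or residuals. The bounded range of percentages then matters less, because the reference distribution is built from the data itself.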
ReneZ > 16-11-2019, 10:40 PM
(16-11-2019, 10:33 PM)Torsten Wrote: You obviously didn't understand my comment (see stats.stackexchange.com).
Torsten > 17-11-2019, 10:55 AM
(16-11-2019, 10:40 PM)ReneZ Wrote: (16-11-2019, 10:33 PM)Torsten Wrote: You obviously didn't understand my comment (see stats.stackexchange.com).
I am afraid that you don't have the slightest idea what I understand or don't understand.
I prefer a more constructive way of discussing.
MarcoP > 17-11-2019, 02:21 PM
Section | N.Pages | Xhedy  | Xhol   | Xhey   | Xhor   | Xheey  |
_ALL    | 225     | 0.771  | 0.497  | 0.220  | 0.229  | 0.011  |
HerbalA | 95      | -      | 0.388  | 0.148  | -0.007 | -0.045 |
HerbalB | 32      | 0.414  | 0.220  | -0.102 | -0.183 | 0.010  |
Stars   | 23      | 0.120  | -0.051 | 0.522  | 0.214  | -0.124 |
Bio     | 20      | 0.687  | 0.334  | 0.253  | -0.093 | 0.346  |
Pharma  | 16      | -0.066 | 0.513  | -0.107 | 0.209  | -0.446 |
Cosmo   | 12      | 0.647  | -0.124 | 0.349  | -0.079 | -0.217 |
ReneZ > 17-11-2019, 03:25 PM
(17-11-2019, 02:21 PM)MarcoP Wrote: The plot I get for chedy / shedy seems to me reasonably close (though not identical) to the one computed by Rene on the basis of Nablator's spreadsheet. I have included {0,0} pages. The red line is y=x.
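Since the treatment of {0,0} pages came up several times, here is a small sketch of how much including or dropping them can move the coefficient. The per-page counts are invented and only meant to show the direction of the effect, not to reproduce any value from the table or plots in this thread.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / sqrt(sxx * syy)

# Invented per-page counts (word A, word B); four pages contain neither word.
pairs = [(0, 0)] * 4 + [(3, 1), (1, 3), (2, 2), (1, 2), (2, 1)]

def correlation(pairs, keep_zero_pages):
    """Pearson r over the pages, optionally dropping the {0,0} pages first."""
    kept = [p for p in pairs if keep_zero_pages or p != (0, 0)]
    return pearson([a for a, _ in kept], [b for _, b in kept])

r_with_zeros = correlation(pairs, keep_zero_pages=True)
r_without_zeros = correlation(pairs, keep_zero_pages=False)
print(f"with {{0,0}} pages: {r_with_zeros:.3f}   "
      f"without: {r_without_zeros:.3f}")
```

In this toy case the {0,0} pages act as a cluster at the origin: with them the coefficient is 0.5, without them it is about -0.79. So any comparison of results should state which choice was made.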