Emma May Smith > 08-08-2024, 09:24 AM
(08-08-2024, 06:00 AM)Torsten Wrote: (08-08-2024, 12:25 AM)Emma May Smith Wrote: The z-scores show how unusual the occurrence of a feature is in one position of a distribution.
That's the point. The Z-score is just a statistical measure that quantifies the distance, in standard deviations, between a data point and the mean of a dataset. Therefore Z-scores don't allow conclusions like "[ey] and [edy] are followed by [qo] with the same preference". What the difference between two Z-scores means depends on the distribution of the dataset. In other words, the difference between 3.5 and 3.7 might be that [.qo] occurs 1.5 times more often after [edy.] than after [ey.].
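To make the point about Z-scores and raw counts concrete, here is a minimal sketch. It assumes a simple binomial model for the expected count, which may not be how the Z-scores in this thread were actually computed, and every number in it (the base rate, the token totals, the counts) is invented for illustration. The function name `binomial_z` is hypothetical.

```python
import math


def binomial_z(count, n, p):
    """Z-score of an observed count against a binomial(n, p) expectation.

    Assumption: the expected count is n * p and the standard deviation is
    sqrt(n * p * (1 - p)). This is one common model, not necessarily the
    one used in the thread.
    """
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (count - expected) / sd


# Invented numbers, for illustration only.
base_rate = 0.20             # hypothetical chance that any word is followed by [qo]
count_a, n_a = 130, 500      # hypothetical word type A: 130 of 500 tokens followed by [qo]
count_b, n_b = 195, 790      # hypothetical word type B: 195 of 790 tokens followed by [qo]

z_a = binomial_z(count_a, n_a, base_rate)
z_b = binomial_z(count_b, n_b, base_rate)

print(round(z_a, 2), round(z_b, 2))   # two Z-scores that are close together
print(count_b / count_a)              # while the raw count differs by a factor of 1.5
```

With these made-up inputs the two Z-scores come out within about 0.1 of each other, even though the second count is 1.5 times the first. The scores say how far each count sits from its own expectation, not how the counts compare with each other.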
Koen G > 08-08-2024, 10:20 AM
(08-08-2024, 12:25 AM)Emma May Smith Wrote: (Koen, or anybody with the authority, can I add a bigger file to this post? About 305 kb, but all text so no risk of viruses.)
Torsten > 08-08-2024, 10:34 AM
(08-08-2024, 09:24 AM)Emma May Smith Wrote: I think we're closer on this point than you suggest. So long as we know what kind of pattern/relationship interests us, then z-scores are fine, within their limitations. We could certainly be tighter with how we describe those relationships, but saying that the occurrence of [qo] after [ey] and [edy] is equally unusual still gets us something interesting: the token count in that position is (similarly) raised.
(To note, in the data I'm using, the token counts for [qo] after [ey] and [edy] are 723 and 1388, so an even bigger difference than the one you state. But the likelihoods are 0.31 and 0.39 respectively, and the token count for [qo] immediately before the word is only 60% of the one immediately after. So we can see how the raw token count gives us a somewhat false impression, while the z-score picks out how exceptional this distribution is.)
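The arithmetic behind the quoted figures can be checked with a short sketch. The counts and likelihoods are the ones quoted above. The "implied totals" rest on an assumption that is not stated in the thread: that each likelihood is the count of [qo] after the word type divided by the total tokens of that word type. All variable names are illustrative.

```python
# Figures quoted in the post above.
count_ey, count_edy = 723, 1388              # tokens of [qo] after [ey] and after [edy]
likelihood_ey, likelihood_edy = 0.31, 0.39   # quoted likelihoods

# How much larger is the [edy] figure on each measure?
count_ratio = count_edy / count_ey
likelihood_ratio = likelihood_edy / likelihood_ey

# Assumption (not confirmed in the thread): likelihood = count / total tokens
# of the preceding word type, so the totals can be backed out approximately.
implied_total_ey = count_ey / likelihood_ey
implied_total_edy = count_edy / likelihood_edy

print(round(count_ratio, 2))        # ratio of raw counts
print(round(likelihood_ratio, 2))   # ratio of likelihoods
print(round(implied_total_ey), round(implied_total_edy))
```

The raw counts differ by a factor of roughly 1.9, while the likelihoods differ by a factor of only about 1.3. If the assumption about how the likelihood is defined holds, most of the gap in raw counts comes from there simply being more [edy] tokens than [ey] tokens.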