RE: Analysis of circular text?
Anton > 29-05-2022, 11:51 PM
Well, here are the first results, for the Zipf curve.
I modeled the Zipf curve as count(r) = A/r^B, where r is the rank of a word and count(r) is its number of occurrences.
The parameters of RobGea's two sets are as follows:
Circular text: 2317 words in total, vocabulary size 1100, hapax legomena count 838
Non-circular text: 36376 words in total, vocabulary size 7602, hapax legomena count 5242
Least squares regression (Matlab lsqcurvefit function) yields:
CT: A = 70.3, B = 0.609
Non-CT: A = 1055, B = 0.7
The residuals are non-negligible, but I think that to fit the raw data better one has to exclude the hapax legomena from the sets (?)
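For readers without Matlab, a fit of this kind can be sketched in pure Python. This is not the lsqcurvefit call used above but a stand-in that minimizes the same sum of squared residuals for A/r^B: for a fixed B the best A has a closed form, so only B is searched (golden-section search). The names fit_zipf, b_lo, b_hi and tol are hypothetical, and the demo data is synthetic, not RobGea's sets.

```python
import math

def fit_zipf(counts, b_lo=0.0, b_hi=3.0, tol=1e-10):
    """Least-squares fit of count(r) = A / r**B to word counts.

    counts: occurrence counts of each distinct word (any order).
    Assumes the best B lies in [b_lo, b_hi] and that the squared
    error has a single minimum there (true for clean power-law data,
    not guaranteed for noisy data).
    """
    counts = sorted(counts, reverse=True)      # rank 1 = most frequent word
    ranks = range(1, len(counts) + 1)

    def best_a(b):
        # For fixed B the model is linear in A, so A has a closed form.
        num = sum(c * r ** -b for r, c in zip(ranks, counts))
        den = sum(r ** (-2 * b) for r in ranks)
        return num / den

    def sse(b):
        a = best_a(b)
        return sum((c - a * r ** -b) ** 2 for r, c in zip(ranks, counts))

    # Golden-section search over B.
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = b_lo, b_hi
    while hi - lo > tol:
        m1 = hi - inv_phi * (hi - lo)
        m2 = lo + inv_phi * (hi - lo)
        if sse(m1) < sse(m2):
            hi = m2
        else:
            lo = m1
    b = (lo + hi) / 2
    return best_a(b), b

# Synthetic check: data generated from A = 100, B = 0.8 should be recovered.
demo_counts = [100 * r ** -0.8 for r in range(1, 51)]
a_fit, b_fit = fit_zipf(demo_counts)
print(round(a_fit, 3), round(b_fit, 3))
```

To try the hapax question, one could pass only the counts greater than 1, e.g. fit_zipf([c for c in counts if c > 1]), and compare the resulting A and B with the full fit.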
Interestingly, if we introduce C as A divided by the total word count, then C=0.03 for both sets, and the power coefficient is rather close between the two (0.6 vs 0.7). This suggests that the two sets behave rather similarly from the rank vs count perspective. However, the figures do not suggest a good fit to the Zipf law, where C needs to be somewhat close to 0.1 and the power coefficient needs to be close to 1, if I understand correctly.
(I wonder, in the papers claiming that the VMS fits the Zipf law, what were the assessed parameters of the curve?)
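The normalization is simple arithmetic on the figures quoted above, and can be checked like this (the variable names are just for illustration):

```python
# Fitted A and total word count for each set, as quoted in this post.
sets = {
    "CT": {"A": 70.3, "total_words": 2317},
    "Non-CT": {"A": 1055, "total_words": 36376},
}

# C = A / total word count, i.e. the fitted relative frequency at rank 1.
c_values = {name: s["A"] / s["total_words"] for name, s in sets.items()}

for name, c in c_values.items():
    print(f"{name}: C = {c:.4f}")
# CT: C = 0.0303
# Non-CT: C = 0.0290
```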
But again, if I understand correctly, the Zipf-Mandelbrot law might be a better fit, so I will check that as well in the next step.
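For reference, the Zipf-Mandelbrot form adds a rank offset to the plain Zipf curve. A minimal sketch of the model only (no fit), assuming the usual form A/(r+q)^B; the name q for the offset and the sample value q = 2.0 are assumptions for illustration, not fitted values:

```python
def zipf_mandelbrot(r, a, b, q):
    """Predicted count at rank r under A / (r + q)**B (q is the rank offset)."""
    return a / (r + q) ** b

# With q = 0 the model reduces to the plain Zipf curve A / r**B.
plain = [zipf_mandelbrot(r, 70.3, 0.609, 0.0) for r in (1, 2, 10)]

# With q > 0 the head of the curve is flattened: the first ranks get
# lower predicted counts and the drop from rank 1 to rank 2 is less steep.
shifted = [zipf_mandelbrot(r, 70.3, 0.609, 2.0) for r in (1, 2, 10)]

print(plain)
print(shifted)
```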