(23-05-2022, 08:52 PM)RobGea Wrote: Hi Anton,
2 files of ranked word frequency, 1 for text in Circles, 1 for all text except text in Circles
Thanks! Will get back in this thread.
(23-05-2022, 08:30 PM)R. Sale Wrote: I have no idea what you're doing but go ahead.
As for myself, I'm primarily looking for a possible answer to the question of whether the circular text consists of message chunks or is effectively just a collection of disjoint labels. I'm not sure whether what I'm doing is a good way towards the answer, but let's see. Probably the whole bulk of circular text is a mix of those: in some places it's one way, in some places another. For example, the second-to-outer ring in [link] definitely does not look like a "message". Perhaps such cases can be detected and pre-filtered out of the analysis, but let's begin with just the bulks.
Found this on github:
Natural Language Processing - Create Zipf and Mandelbrot statistics for corpus
The python code has an issue but the R-script does the heavy lifting and produces sensible looking plots.
Wow, will have a look into it. Basically I was going to do interpolation in Matlab, but let's see what this code is capable of.
Well, here are the first results, for the Zipf curve.
I expressed the Zipf curve as A/r^B, where r is rank.
RobGea's sets' parameters are as follows:
Circular text: 2317 words in total, vocabulary size 1100, hapax legomena count 838
Non-circular text: 36376 words in total, vocabulary size 7602, hapax legomena count 5242
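For anyone who wants to reproduce counts of this kind, here is a minimal Python sketch of how total words, vocabulary size and hapax legomena count can be derived from a token list. The function name `rank_frequency_stats` and the toy tokens are my own assumptions for illustration, not RobGea's files or script.

```python
from collections import Counter

def rank_frequency_stats(words):
    """Return (total tokens, vocabulary size, hapax count, ranked counts)."""
    counts = Counter(words)
    # Counts sorted in descending order: index 0 is rank 1, index 1 is rank 2, ...
    ranked = sorted(counts.values(), reverse=True)
    # Hapax legomena are word types that occur exactly once.
    hapax = sum(1 for c in ranked if c == 1)
    return len(words), len(counts), hapax, ranked

# Toy token list, purely illustrative (not taken from any transliteration).
sample = "aa bb aa cc bb aa dd".split()
total, vocab, hapax, ranked = rank_frequency_stats(sample)
print(total, vocab, hapax, ranked)
```

The `ranked` list is the rank vs count data that a Zipf fit would take as input.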
Least squares regression (Matlab lsqcurvefit function) yields:
CT: A = 70.3, B = 0.609
Non-CT: A = 1055, B = 0.7
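The fit above was done with Matlab's lsqcurvefit. As a rough stand-in, here is a pure Python sketch of a least-squares fit of the same model A/r^B. It is not the Matlab code: it uses the fact that for a fixed B the best A has a closed form, and then does a golden-section search over B. The function name `fit_zipf`, the search interval [0, 3] and the assumption of a single minimum in that interval are mine.

```python
import math

def fit_zipf(counts, lo=0.0, hi=3.0, tol=1e-9):
    """Least-squares fit of counts[r-1] ~ A / r**B; returns (A, B)."""
    ranks = range(1, len(counts) + 1)

    def best_a(b):
        # For a fixed B the model is linear in A, so the optimal A is closed-form.
        num = sum(c * r ** -b for r, c in zip(ranks, counts))
        den = sum(r ** (-2 * b) for r in ranks)
        return num / den

    def sse(b):
        # Sum of squared residuals at this B, using the optimal A for it.
        a = best_a(b)
        return sum((c - a * r ** -b) ** 2 for r, c in zip(ranks, counts))

    # Golden-section search over B (assumes one minimum inside [lo, hi]).
    g = (math.sqrt(5) - 1) / 2
    x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
    f1, f2 = sse(x1), sse(x2)
    while hi - lo > tol:
        if f1 < f2:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - g * (hi - lo)
            f1 = sse(x1)
        else:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + g * (hi - lo)
            f2 = sse(x2)
    b = (lo + hi) / 2
    return best_a(b), b

# Synthetic check: exact power-law data should give back its own parameters.
synthetic = [100 / r ** 0.7 for r in range(1, 201)]
a_fit, b_fit = fit_zipf(synthetic)
```

Note that this fits the raw counts, as in the post. Fitting a straight line in log-log space is a different procedure and generally gives different A and B, because it weights the low-frequency tail much more heavily.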
Residuals are non-negligible, but I think that to better fit the raw data one has to exclude hapax legomena from the sets (?)
Interestingly, if we introduce C as A divided by the total word count, then C=0.03 for both sets, and the power coefficient is rather close between the two (0.6 vs 0.7). This suggests that both sets behave rather similarly from the rank vs count perspective. However, the figures do not suggest a good fit to the Zipf law, where C needs to be somewhat close to 0.1 and the power coefficient needs to be close to 1, if I understand correctly.
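The normalization is just A divided by the total word count, using the numbers already given in this post. A quick Python check (the dictionary layout is only for illustration):

```python
# Fitted parameters and total word counts as reported above.
fits = {
    "CT": {"A": 70.3, "B": 0.609, "total": 2317},
    "Non-CT": {"A": 1055, "B": 0.7, "total": 36376},
}

# C = A / total word count, i.e. the fitted curve in relative-frequency units.
normalized = {name: f["A"] / f["total"] for name, f in fits.items()}

for name, c in normalized.items():
    print(f"{name}: C = {c:.4f}")
```

Both values round to 0.03 at two decimal places.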
(I wonder, in the papers claiming the VMS fitting the Zipf law, what were the assessed parameters of the curve?)
But again, if I understand correctly, the Zipf-Mandelbrot law might be a better fit, so I will check that as well in the next step.
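For reference, the Zipf-Mandelbrot law is commonly written as A/(r+q)^B, i.e. the same curve with an extra offset q added to the rank. A minimal Python sketch of the model follows; the symbol q and the value q = 2.0 in the example are my own hypothetical choices, not fitted results.

```python
def zipf_mandelbrot(rank, a, b, q):
    """Predicted count at a given rank: a / (rank + q) ** b."""
    return a / (rank + q) ** b

def zipf(rank, a, b):
    """Plain Zipf curve a / rank ** b, the special case q = 0."""
    return zipf_mandelbrot(rank, a, b, 0.0)

# With q > 0 the curve is flattened at the top ranks compared with plain Zipf.
# q = 2.0 is an arbitrary illustrative value, not a fit to either set.
for r in (1, 2, 10, 100):
    print(r, zipf(r, 70.3, 0.609), zipf_mandelbrot(r, 70.3, 0.609, 2.0))
```

Since q = 0 gives back the plain Zipf curve, a fit with q as a free parameter can never have a larger residual than the two-parameter fit on the same data.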