(16-06-2026, 12:05 PM)bi3mw Wrote: (16-06-2026, 04:39 AM)Jorge_Stolfi Wrote: One could also try to randomly scramble the words of each section.
I gave it a try. Here are the results:
Very interesting!
Heaps's law with exponent 0.5 is said to be a consequence of Zipf's law, provided that the texts being compared are homogeneous -- generated by the same "time-invariant" process, author, etc. The "texts" in this case are the same text truncated after N words, for varying N. Shuffling the words makes those texts homogeneous.
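To make the "truncated after N words" idea concrete, here is a minimal Python sketch. It assumes the section is already available as a list of word tokens; the function names `vocab_growth` and `fit_heaps` are made up for illustration, and the plain log-log least-squares fit is only one possible way to estimate the Heaps parameters, not necessarily how the plots in this thread were fitted.

```python
import math

def vocab_growth(words):
    """V(N) for N = 1..len(words): distinct word types among the first N tokens."""
    seen, curve = set(), []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log N over all N.

    Returns (K, beta) for the model V(N) ~ K * N**beta.
    Assumption: every point gets equal weight, so large N dominates the fit.
    Needs at least two points.
    """
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(v) for v in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(my - beta * mx)
    return K, beta
```

A fitted `beta` near 0.5 would be consistent with the Heaps behaviour described above; a badly chosen `K` or `beta` can make a well-behaved curve look like it deviates from the law.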
So this last experiment seems to say that Herbal and Bio follow Zipf's law and are fairly homogeneous even without shuffling. Perhaps because their bifolios themselves were substantially shuffled before they were bound and numbered. (Again, the deviations from the ideal curve in those two graphs seem to be due to the ideal curve being badly fitted, rather than the actual curve being "non-Heapsian".)
The good fit to Recipes after shuffling, compared to the bad fit before shuffling, implies that its word frequencies do follow Zipf's law, but the section is not homogeneous. There is a part near the beginning where the vocabulary is mostly static, with the same words being used over and over, and few new words. Then the style changes and suddenly there are a lot of new words being introduced.
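The before/after comparison can be sketched as follows. This is only an illustration under stated assumptions: the token list, the helper names (`mean_shuffled_growth`, `max_gap`), the number of shuffles, and the "largest gap" summary are all hypothetical choices, not the procedure actually used for the plots.

```python
import random

def vocab_growth(words):
    """V(N): number of distinct word types among the first N tokens."""
    seen, curve = set(), []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

def mean_shuffled_growth(words, trials=20, seed=0):
    """Average V(N) over several random permutations of the same tokens."""
    rng = random.Random(seed)
    totals = [0] * len(words)
    for _ in range(trials):
        perm = list(words)
        rng.shuffle(perm)
        for i, v in enumerate(vocab_growth(perm)):
            totals[i] += v
    return [t / trials for t in totals]

def max_gap(words, trials=20, seed=0):
    """Largest excess of the shuffled curve over the original, and the N where it occurs."""
    orig = vocab_growth(words)
    shuf = mean_shuffled_growth(words, trials, seed)
    gaps = [s - o for o, s in zip(orig, shuf)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return gaps[best], best + 1

# Toy "non-homogeneous" text: a repetitive first part, then many new words.
toy = ["w%d" % (i % 5) for i in range(200)] + ["x%d" % i for i in range(200)]
gap, where = max_gap(toy)
```

In the toy text the original curve stays flat through the repetitive part and then climbs steeply, while the shuffled curve rises smoothly. The two curves must meet at the end, since shuffling does not change the total number of word types.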
And that is not strange! Even if you don't accept my identification of the Recipes section as the Shennong Bencao, it is worth noting that the latter is organized into sub-sections according to the nature of the drugs. In a typical edition, the first part lists drugs that can be taken indefinitely without harm; today we would call them "dietary supplements" or "tonics". The description of those drugs typically lists the benefits of such extended consumption, and those are rather few and mostly the same for all drugs: strength and stamina, beautiful skin, good eyesight, good memory, long life, etc. The second part has drugs that should be taken only when necessary ("prescription drugs"), and then it gets interesting because there are hundreds of different diseases -- usually a different set for each drug, some rare, some common. Then come drugs that are toxic and should be taken only as a last resort and under close watch by the doctor ("hospital drugs"); but, in terms of vocabulary variation, these recipes are not that different from those in the second part. Whatever the Recipes section is, it may well be organized in a similar way -- and that would explain those TTR plots.
Finally, the original and shuffled TTR plots for Pharma, Cosmo, and Astro indicate that these sections are somewhat non-Zipfian (so that the shuffled plots still deviate from Heaps's law) and also not homogeneous (so that the original plots deviate a lot more than the shuffled ones).
It would be interesting to see the Zipf plots (word type frequency as a function of frequency rank, with the word types sorted by decreasing frequency) for those sections. In log-log scale, a word frequency distribution that follows Zipf's law would give a straight line dipping at 45 degrees, ending with a staircase and then zero. If I am thinking correctly, Pharma, Astro, and Cosmo should have an excess of words with low frequency. Which is what we expect from a text that enumerates the names of a large set of things, like stars...
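A minimal sketch of the data behind such a Zipf plot, again assuming a plain list of word tokens per section. The names `zipf_table` and `hapax_fraction` are invented here, and the share of word types that occur only once is just one rough indicator of an "excess of words with low frequency".

```python
from collections import Counter

def zipf_table(words):
    """List of (rank, frequency) pairs, rank 1 being the most frequent word type."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def hapax_fraction(words):
    """Fraction of word types that occur exactly once in the text."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c == 1) / len(counts)
```

Plotting log(frequency) against log(rank) from `zipf_table` gives the line described above, with the staircase at the end coming from the many word types that share frequencies 3, 2, and 1.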
All the best, --stolfi