The Voynich Ninja

Full Version: TTR Analysis, Observed vs. Reference Curve
(13-06-2026, 03:11 PM)Bernd Wrote: Does sample size matter here?

Yes, as I said, the length of the text does matter. A text of, say, 3,000 words is actually quite short for this type of measurement. This can hardly be compensated for even by measuring in 20-word increments.

(13-06-2026, 03:11 PM)Bernd Wrote: What causes the kink in Recipes?

To be honest, I don't know the answer to that yet. The type/token ratio seems to drop sharply, especially at the beginning.
(13-06-2026, 01:47 PM)bi3mw Wrote: For what it's worth, here are the graphs for each section.


Quite interesting, Thanks!

So it seems that Herbal and Balneological (Bio) do follow Heaps's law L(n) ≈ K n^r quite accurately, with exponent r = 1-b = 0.532 and 0.522, respectively; the "official" value being 0.5, that is, L(n) ≈ K sqrt(n).
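
For reference, the relation r = 1-b can be read off the bounded model y = 1 / (1 + a * x^b) used in the script posted in this thread. This is only a sketch: it assumes n is large enough that a n^b is much greater than 1, and the identification K = 1/a is part of that approximation, not something stated in the thread.

```latex
\begin{align}
\mathrm{TTR}(n) &= \frac{L(n)}{n} = \frac{1}{1 + a\,n^{b}} \\
                &\approx \frac{1}{a}\,n^{-b} \qquad (a\,n^{b} \gg 1) \\
L(n) &= n \cdot \mathrm{TTR}(n) \approx \frac{1}{a}\,n^{1-b} = K\,n^{r},
        \qquad K = \frac{1}{a},\quad r = 1 - b
\end{align}
```

Under that approximation, r = 0.532 and r = 0.522 correspond to fitted values b = 0.468 and b = 0.478.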

Astro and Cosmo presumably deviate from that law because they contain many single-use words (like names of stars) embedded in the text, besides the labels on the figures.

The anomalies in Pharma and Recipes could be explained by them being formulaic (like a collection of short recipes) rather than running prose.

Quote:It is surprising that the recipe section, of all places, exhibits the greatest text diversity (lowest b-value). I would have expected the opposite.

The uneven statistics of Astro, Cosmo, Pharma, and Recipes may be due to their diagrams/recipes being sorted by topic or nature, with some topics having many low-frequency words and others being more repetitive.

All the best, stolfi

PS.  By the way, this analysis would be clearer if the lexicon size L and the number of tokens n were plotted in log-log scale.

"Happiness to an engineer is a straight line on a log-log plot" 
--Ancient Sumerian proverb.
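
The log-log suggestion in the PS could be sketched roughly as follows. This is an illustration, not the script used for the posted graphs: `lexicon_growth` and `loglog_slope` are hypothetical helper names, and the tokenizer is simply copied from the script in this thread. If a section follows Heaps's law, the points (log n, log L) should lie close to a straight line whose slope estimates r.

```python
import math
import re


def tokenize(text):
    # Same tokenizer as the script in this thread: lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


def lexicon_growth(tokens, step=20):
    """Return (n, L) pairs: lexicon size L after the first n tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points


def loglog_slope(points):
    """Least-squares slope and intercept of log(L) against log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(size) for _, size in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```

The (n, L) pairs can be passed to matplotlib's `plt.loglog` to get the plot itself; the slope is the estimate of r and exp(intercept) the estimate of K.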
Here’s how I see it: the difference between the cosmos and recipes.
The cosmos is an explanation, a story, and unchanging.
A recipe is a set of instructions. Words like “cook,” “cut,” “paint,” “mix,” “spread”... are actions.
All these actions are absent in the cosmos, because there’s nothing you can do there—you can only listen.
So it doesn’t surprise me that there’s a difference.
I have revised the measurements in the herbal and recipes sections. Since both sections are over 10,000 words long, the script originally measured them in increments of 100. Now, they have also been measured in increments of 20. However, this has not resulted in any significant change in the results.

Code:
#!/usr/bin/env python3

import sys
import re
import os
import numpy as np
import matplotlib.pyplot as plt


def tokenize(text):
    return re.findall(r"\w+", text.lower())


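def compute_ttr(tokens, step=100):
    """Return token counts and TTR values sampled every `step` tokens."""
    x = []
    y = []
    seen = set()  # running vocabulary, so each prefix is not re-scanned

    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            x.append(i)
            y.append(len(seen) / i)

    # The first sample is skipped: a very short prefix can have TTR = 1,
    # which makes log((1/y) - 1) undefined in fit_bounded().
    return np.array(x[1:]), np.array(y[1:])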


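def fit_bounded(x, y):
    """
    Fit:
        y = 1 / (1 + a * x^b)

    Linearized:
        log((1/y) - 1) = log(a) + b * log(x)
    """
    # Points with y == 1 give z == 0 and log(z) == -inf; leave them out.
    valid = y < 1
    if valid.sum() < 2:
        raise ValueError("Need at least two points with TTR < 1 to fit.")

    z = (1 / y[valid]) - 1

    logx = np.log(x[valid])
    logz = np.log(z)

    # Straight-line fit in log-log space: slope is b, intercept is log(a).
    slope, intercept = np.polyfit(logx, logz, 1)

    b = slope
    a = np.exp(intercept)

    return a, b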


def reference_curve(x, a, b):
    return 1 / (1 + a * x**b)


def main():
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <textfile.txt>")
        sys.exit(1)

    filepath = sys.argv[1]
    filename = os.path.basename(filepath)

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    tokens = tokenize(text)

    if len(tokens) < 200:
        print("Text is too short.")
        sys.exit(1)

    # dynamic step selection
    if len(tokens) < 12000:
        step = 20
    else:
        step = 100

    print(f"Using step size: {step}")

    # compute TTR
    x, y = compute_ttr(tokens, step=step)

    if len(x) < 2:
        print("Not enough data points for fitting.")
        sys.exit(1)

    # fit bounded model
    a, b = fit_bounded(x, y)
    y_fit = reference_curve(x, a, b)

    # output parameters
    print("\nFitted parameters:")
    print(f"  a = {a:.6f}")
    print(f"  b = {b:.6f}")
    print()

    # plot
    plt.figure(figsize=(10, 6))

    plt.plot(
        x,
        y,
        label=f"Observed TTR ({filename})",
        alpha=0.8
    )

    plt.plot(
        x,
        y_fit,
        label=f"Bounded reference curve (a={a:.3f}, b={b:.3f})",
        linewidth=2
    )

    plt.xlabel("Tokens")
    plt.ylabel("Type/Token Ratio (TTR)")
    plt.title("TTR Analysis: Observed vs Bounded Model")

    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.show()


if __name__ == "__main__":
    main()
I noticed that TTR (and the exponent of Heaps) decrease with N. The larger the section, the smaller the expected TTR. Therefore, before concluding, I believe it's worthwhile to separate size from behavior. And also observe the direction. If Recipes is one of the largest sections and still shows the greatest diversity, this is the opposite of what size predicts, a stronger anomaly rather than an artifact.
I think the ideal test would be to subsample each section to the same N (that of the smallest section), calculate the TTR in each subsample, repeat the process a few hundred times, and compare the averages with their dispersion. If Recipes (or the diversity of Stars, or the homogeneity of Herbal/Balneo) still stands out after that, it's a real section effect; if it disappears, it was due to length.
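
A minimal sketch of that size-matched test, under some assumptions of mine: each subsample is a random contiguous window of N tokens (drawing N tokens at random without replacement would be the other obvious choice, but it also destroys word order), and `sections` is a hypothetical dict mapping section names to token lists. All function names are made up for illustration.

```python
import random
import statistics


def ttr(tokens):
    """Type/token ratio of a token list."""
    return len(set(tokens)) / len(tokens)


def subsampled_ttr(tokens, n, repeats=300, rng=None):
    """Mean and standard deviation of TTR over random contiguous windows of n tokens."""
    rng = rng or random.Random(0)
    if n > len(tokens):
        raise ValueError("window is longer than the section")
    values = []
    for _ in range(repeats):
        start = rng.randint(0, len(tokens) - n)  # inclusive on both ends
        values.append(ttr(tokens[start:start + n]))
    return statistics.mean(values), statistics.stdev(values)


def compare_sections(sections, repeats=300, seed=0):
    """sections: dict name -> token list. N is the length of the smallest section."""
    n = min(len(toks) for toks in sections.values())
    rng = random.Random(seed)
    return {
        name: subsampled_ttr(toks, n, repeats, rng)
        for name, toks in sections.items()
    }
```

One caveat: with N equal to the length of the smallest section, that section has only one possible window, so its dispersion comes out as zero. Choosing N somewhat smaller than the smallest section avoids this.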
(16-06-2026, 02:13 AM)petronio Wrote: I noticed that TTR (and the exponent of Heaps) decrease with N. The larger the section, the smaller the expected TTR.

Heaps's law says that the wordtypes-to-tokens ratio (TTR) decreases as the number of tokens N increases. However, the exponent r that determines how fast that happens is supposed to be constant for all N, even as low as a thousand or so.

Thus there is no point in "separating size from behavior": the behavior is how the TTR changes with size.

And that is what I see in @bi3mw's plots.  There are two sections that fit Heaps's law quite well: Herbal and Balneo (Bio).  The TTR plots follow the theoretical curves quite well overall, and almost exactly starting at N=3000 tokens or so.

In fact, the small discrepancies below 3000 and above 8000 seem to be due to bad fitting of the theoretical curve, rather than "non-Heapsiness" of the text.  I bet that the discrepancies would practically vanish if the curve-fitting procedure was given only the data for N > 500 (or if each data point was given a weight proportional to N).  The plots should still use the whole range of N, of course.
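
Both variants of that suggestion (fit only the data above some N, or weight each point proportionally to N) can be sketched as a weighted straight-line fit of the same linearization the script uses, log(1/y - 1) = log(a) + b log(x). This is an untested-on-the-manuscript illustration: `fit_bounded_weighted`, `n_min` and `weight_by_n` are hypothetical names, and note that the weights act on residuals in log space, not on the TTR values themselves.

```python
import math


def fit_bounded_weighted(x, y, n_min=500, weight_by_n=True):
    """
    Fit y = 1 / (1 + a * x**b) via the linearization
        log(1/y - 1) = log(a) + b * log(x)
    using only points with x > n_min and, optionally, weights proportional to x.
    """
    rows = []
    for xi, yi in zip(x, y):
        if xi > n_min and 0 < yi < 1:  # y == 1 would give log(0)
            w = float(xi) if weight_by_n else 1.0
            rows.append((math.log(xi), math.log(1 / yi - 1), w))
    if len(rows) < 2:
        raise ValueError("not enough usable points above n_min")

    # Closed-form weighted least squares for a straight line.
    sw = sum(w for _, _, w in rows)
    mx = sum(w * lx for lx, _, w in rows) / sw
    mz = sum(w * lz for _, lz, w in rows) / sw
    sxx = sum(w * (lx - mx) ** 2 for lx, _, w in rows)
    sxz = sum(w * (lx - mx) * (lz - mz) for lx, lz, w in rows)

    b = sxz / sxx
    a = math.exp(mz - b * mx)
    return a, b
```

The same thing can be done with `np.polyfit` through its `w` argument. Since `w` multiplies the residuals before they are squared, weights proportional to N in the squared-error sum would correspond to `w=np.sqrt(x)`.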

On the other hand, the large deviations of the other sections mean that those texts just do not follow Heaps's law.  Then there is not much point in trying to get them to follow the law for large N.  Astro, Cosmo, and Pharma are too small for that anyway.  As for Recipes, one could try to fit the law to the range 6000-11000, but I don't know what conclusion one could get from the result.

However, an exercise along the lines you propose may be interesting.  One could just run the test for the second half of the Recipes section.  I suspect that the second half will be found to follow Heaps's law more closely than the first half.

(Somewhere in Recipes there seems to be an abrupt change of handwriting in the middle of a paragraph, or even in the middle of a word, a few lines after the anomalous right-justified mid-page "title".  I like to imagine the Author firing the Scribe for that and other blunders, and hiring another one. Then the change in the Heaps plot behavior may be due to the first Scribe making lots of errors that depressed the TTR.)

One could also try to randomly scramble the words of each section.  I expect that it will not make much difference for Herbal and Balneo, but it should smooth out the plot of the other sections, maybe even make them fit the Law.

All the best, --stolfi
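
The scrambling experiment proposed above could look roughly like this. It is a sketch with hypothetical helper names (`ttr_curve`, `shuffled_ttr_curve`, `max_gap`), assuming the section is already available as a list of tokens; the fixed seed is only there to make runs repeatable.

```python
import random


def ttr_curve(tokens, step=20):
    """(n, TTR) after each multiple of `step` tokens, computed incrementally."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen) / i))
    return curve


def shuffled_ttr_curve(tokens, step=20, seed=0):
    """TTR curve of the same tokens in random order (the input is not modified)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return ttr_curve(shuffled, step)


def max_gap(curve_a, curve_b):
    """Largest absolute TTR difference between two curves sampled at the same n."""
    return max(abs(ya - yb) for (_, ya), (_, yb) in zip(curve_a, curve_b))
```

Shuffling keeps the word frequencies and only changes the order, so both curves end at the same TTR. A large `max_gap` between the original and shuffled curves therefore points at the ordering of the text rather than at its frequency distribution.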
(16-06-2026, 04:39 AM)Jorge_Stolfi Wrote: One could also try to randomly scramble the words of each section.

I gave it a try. Here are the results:

[attachment=16055]
(16-06-2026, 12:05 PM)bi3mw Wrote:
(16-06-2026, 04:39 AM)Jorge_Stolfi Wrote: One could also try to randomly scramble the words of each section.
I gave it a try. Here are the results:

Very interesting!

Heaps's law with exponent 0.5 is said to be a consequence of Zipf's law, provided that the texts being compared are homogeneous -- generated by the same "time-invariant" process, author, etc.  The "texts" in this case are the same text truncated after N words, for varying N.  Shuffling the words makes those texts homogeneous.

So this last experiment seems to say that Herbal and Bio follow Zipf's law and are fairly homogeneous even without shuffling.  Perhaps because their bifolios themselves were substantially shuffled before they were bound and numbered.  (Again, the deviations from the ideal curve in those two graphs seem to be due to the ideal curve being badly fitted, rather than the actual curve being "non-Heapsian".)

The good fit to Recipes after shuffling, compared to the bad fit before shuffling, implies that its word frequencies do follow Zipf's law, but the section is not homogeneous.  There is a part near the beginning where the vocabulary is mostly static, with the same words being used over and over, and few new words. Then the style changes and suddenly there are a lot of new words being introduced.

And that is not strange!  Even if you don't accept my identification of the Recipes section as the Shennong Bencao, it is worth noting that the latter is organized into sub-sections according to the nature of the drugs.  In a typical edition, the first part lists drugs that can be taken indefinitely without harm; today we would call them "dietary supplements" or "tonics".  The description of those drugs typically lists the benefits of such extended consumption, and those are rather few and mostly the same for all drugs: strength and stamina, beautiful skin, good eyesight, good memory, long life, etc.  The second part has drugs that should be taken only when necessary ("prescription drugs"), and then it gets interesting because there are hundreds of different diseases -- usually a different set for each drug, some rare, some common.  Then come drugs that are toxic and should be taken only as a last resort and under close watch by the doctor ("hospital drugs"); but, in terms of vocabulary variation, these recipes are not that different from those in the second part.  Whatever the Recipes section is, it may well be organized in a similar way -- and that would explain those TTR plots.

Finally, the original and shuffled TTR plots for Pharma, Cosmo, and Astro indicate that those sections are somewhat non-Zipfian (so that the shuffled plots still deviate from Heaps's law) and also not homogeneous (so that the original plots deviate a lot more than the shuffled ones).

It would be interesting to see the Zipf plots (word type frequency as a function of frequency rank, from most to least frequent) for those sections.  In log-log scale, a word frequency distribution that follows Zipf's law would give a straight line dipping at 45 degrees, ending with a staircase and then zero.  If I am thinking correctly, Pharma, Astro, and Cosmo should have an excess of words with low frequency.  Which is what we expect from a text that enumerates the names of a large set of things, like stars...

All the best, --stolfi
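
For anyone who wants to produce those Zipf plots, here is a rough sketch. The helper names are hypothetical, and the hapax fraction (share of word types that occur exactly once) is my own addition as a crude number for the "excess of low-frequency words"; it is not something proposed in the post.

```python
import math
from collections import Counter


def zipf_points(tokens):
    """(rank, frequency) pairs, rank 1 being the most frequent word type."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def zipf_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def hapax_fraction(tokens):
    """Share of word types that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)
```

A slope near -1 is the 45-degree line mentioned above. The long staircase of rare words at the tail will pull an unweighted fit, so it may be worth fitting only the top ranks and reporting the hapax fraction separately.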
(16-06-2026, 07:16 PM)Jorge_Stolfi Wrote: The good fit to Recipes after shuffling, compared to the bad fit before shuffling, implies that its word frequencies do follow Zipf's law, but the section is not homogeneous.  There is a part near the beginning where the vocabulary is mostly static, with the same words being used over and over, and few new words. Then the style changes and suddenly there are a lot of new words being introduced ...

Thank you for your explanations, especially regarding the “Recipes” section. I think the experiment was worth it just based on the observable behavior in that section alone. So there might indeed be recipes that follow a certain pattern. That’s a plausible answer to @Bernd’s question.
Nice, the shuffling is an elegant way to separate the two: if the curve changes when you shuffle, the deviation is compositional (the section is made of blocks with different vocabulary); if it doesn't, it's the frequency distribution itself. It's complementary to size-matching: one diagnoses the cause and the other keeps the comparison between sections fair. And it makes sense that Recipes is the compositional one, with a sparse opening and a more diverse later part.