The Voynich Ninja

Full Version: TTR Analysis, Observed vs. Reference Curve
Pages: 1 2 3
While experimenting with the type/token ratio (TTR) in the VMS corpus, I compared the original curve with a smoothed ideal curve. With one exception, the curves are almost identical. The “kink” where the original curve rises above the ideal curve can only mean that there are more new words than “usual” in this section of text. Since this deviation occurs only once, it seems quite remarkable. Or am I overinterpreting this observation?

[attachment=16011]
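One way to put a number on such a "kink" is to look at the difference between the observed curve and the reference curve and find where it peaks. The following is only a minimal sketch on synthetic data: the function name `largest_excess` and the artificial bump are assumptions made for illustration, not part of the analysis above.

```python
import numpy as np


def largest_excess(x, y_obs, y_ref):
    """Return (token position, excess) where the observed TTR most exceeds the reference."""
    residual = np.asarray(y_obs) - np.asarray(y_ref)
    i = int(np.argmax(residual))
    return int(x[i]), float(residual[i])


# Synthetic illustration only: a power-law curve with an artificial bump.
x = np.arange(100, 5100, 100)
y_ref = 2.0 * x ** -0.25
y_obs = y_ref.copy()
y_obs[29] += 0.02  # bump placed at x = 3000 tokens

pos, excess = largest_excess(x, y_obs, y_ref)
print(pos, round(excess, 3))
```

Applied to a real TTR curve, the reported position would be the token count at which the text introduces new words fastest relative to the fitted trend.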
The steeper downward slope of the very repetitive Q13 compensates for all the novelty in the Cosmo-Astro-Zodiac vocabulary.
(11-06-2026, 03:29 PM)bi3mw Wrote: While experimenting with the type/token ratio in the VMS corpus, I compared the original curve with a smoothed ideal curve. With one exception, the curves are almost identical. The “kink” above the original curve can only mean that there are more new words than “usual” in this section of text. Since this deviation occurs only once, it is quite remarkable or am I overinterpreting this observation?

Your observation is correct. See Timm 2014: "There are pages where words of a particular series are frequently used and pages where they are rare. Graph 1 also shows that the number of words which are part of the grid decreases for the pages from <f67r1> to <f73v>. These pages belong to the Cosmological section and to the section with Zodiac illustrations. Since the grid contains all words used at least four times, this means that more unique and rare words occur on these pages." (Timm 2014, p. 7f).

[attachment=15990]
(11-06-2026, 03:29 PM)bi3mw Wrote: While experimenting with the type/token ratio in the VMS corpus, I compared the original curve with a smoothed ideal curve.

[link not shown] (which I believe is a consequence of Zipf's law plus the fact that word counts must be integers) says that the number of distinct words (lexemes, word types) in a text of size N is proportional to sqrt(N). Thus the type/token ratio should be approximately K/sqrt(N). Is that what your "smoothed curve" is?

All the best, --stolfi
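The square-root relation described above can be sketched in a few lines: if the number of word types grows as K * sqrt(N), dividing by N gives a TTR of K / sqrt(N). The function name `expected_ttr` and the constant K = 5.0 are arbitrary assumptions chosen for illustration, not values fitted to the VMS.

```python
import math


def expected_ttr(n_tokens, k):
    """TTR predicted when the number of word types grows as k * sqrt(n_tokens)."""
    return k / math.sqrt(n_tokens)


# k = 5.0 is an arbitrary illustrative constant, not a fitted value.
for n in (100, 400, 1600):
    print(n, expected_ttr(n, 5.0))
```

Under this model, each fourfold increase in text length halves the type/token ratio.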
(11-06-2026, 03:29 PM)bi3mw Wrote: The “kink” above the original curve can only mean that there are more new words than “usual” in this section of text.

Are you including labels in the analysis?  It makes no sense to mix running prose with labels.  They are supposed to have very different statistics.

All the best, --stolfi
(11-06-2026, 05:05 PM)Jorge_Stolfi Wrote: Is that what your "smoothed curve" is?

Formula:
[attachment=15991]

Function in Python:
def reference_curve(x, a, b):
    return a * x ** (-b)
   
Call:
    # fitted reference curve
    a, b = fit_power_law(x, y)
    y_fit = reference_curve(x, a, b)
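The body of `fit_power_law` is not shown in the post. One plausible way to implement it, offered here purely as an assumption, is an ordinary least-squares line fit in log-log space, since log(y) = log(a) - b * log(x). The sketch below checks that idea on synthetic data with known parameters.

```python
import numpy as np


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**(-b) in log-log space (assumed implementation)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(intercept), -slope


def reference_curve(x, a, b):
    return a * x ** (-b)


# Check on synthetic data generated with known parameters a = 3.0, b = 0.25.
x = np.arange(100.0, 5100.0, 100.0)
y = reference_curve(x, 3.0, 0.25)

a, b = fit_power_law(x, y)
print(round(a, 3), round(b, 3))
```

A fit in log space weights relative rather than absolute errors, so it may differ somewhat from a direct nonlinear fit on noisy data.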
(11-06-2026, 05:20 PM)bi3mw Wrote: Function in Python:
def reference_curve(x, a, b):
    return a * x ** (-b)

Is b close to 1/2? Doesn't look like it.
(11-06-2026, 05:33 PM)nablator Wrote: Is b close to 1/2?


No, it's much closer to 1/4 than to 1/2. Isn't that okay?
(11-06-2026, 05:46 PM)bi3mw Wrote:
(11-06-2026, 05:33 PM)nablator Wrote: Is b close to 1/2?
No, it's much closer to 1/4 than to 1/2. Isn't that okay?

It is because Voynichese is not English. :)
The 0.4 - 0.6 range is for English.
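To connect the exponents discussed above: if the number of word types grows as N**beta, the TTR falls as N**(beta - 1), so the exponent b in a * x**(-b) equals 1 - beta. The helper name `type_growth_exponent` is hypothetical, and the sketch assumes the pure power-law form of the reference curve quoted earlier in the thread.

```python
def type_growth_exponent(b):
    """Exponent beta in types ~ N**beta implied by a TTR curve ~ N**(-b)."""
    return 1.0 - b


# b = 1/2 corresponds to square-root growth of the vocabulary.
print(type_growth_exponent(0.5))
# b near 1/4, as reported in this thread, implies faster vocabulary growth.
print(type_growth_exponent(0.25))
```

On this reading, a value of b near 1/4 would mean the number of distinct words grows roughly as N**0.75, that is, faster than the square-root rule predicts.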
If anyone wants to check specific sections, here is the code:

Code:
#!/usr/bin/env python3

import sys
import re
import os
import numpy as np
import matplotlib.pyplot as plt


def tokenize(text):
    return re.findall(r"\w+", text.lower())


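def compute_ttr(tokens, step=100):
    x = []
    y = []
    seen = set()  # word types seen so far, updated incrementally

    for i in range(step, len(tokens) + 1, step):
        # add only the newest slice instead of rebuilding the set from tokens[:i]
        seen.update(tokens[i - step:i])

        x.append(i)
        y.append(len(seen) / i)

    # the first sample is dropped, as before
    return np.array(x[1:]), np.array(y[1:])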


def fit_bounded(x, y):
    """
    Fit:
        y = 1 / (1 + a * x^b)

    Linearized:
        log((1/y) - 1) = log(a) + b * log(x)
    """
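    # A prefix with no repeated words has y == 1, so z == 0 and log(z) = -inf.
    # Keep only points with 0 < y < 1 so the linear fit stays finite.
    mask = (y > 0) & (y < 1)
    z = (1 / y[mask]) - 1

    logx = np.log(x[mask])
    logz = np.log(z)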

    slope, intercept = np.polyfit(logx, logz, 1)

    b = slope
    a = np.exp(intercept)

    return a, b


def reference_curve(x, a, b):
    return 1 / (1 + a * x**b)


def main():
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <textfile.txt>")
        sys.exit(1)

    filepath = sys.argv[1]
    filename = os.path.basename(filepath)

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    tokens = tokenize(text)

    if len(tokens) < 200:
        print("Text is too short.")
        sys.exit(1)

    # dynamic step selection
    if len(tokens) < 10000:
        step = 20
    else:
        step = 100

    print(f"Using step size: {step}")

    # compute TTR
    x, y = compute_ttr(tokens, step=step)

    if len(x) < 2:
        print("Not enough data points for fitting.")
        sys.exit(1)

    # fit bounded model
    a, b = fit_bounded(x, y)
    y_fit = reference_curve(x, a, b)

    # output parameters
    print("\nFitted parameters:")
    print(f"  a = {a:.6f}")
    print(f"  b = {b:.6f}")
    print()

    # plot
    plt.figure(figsize=(10, 6))

    plt.plot(
        x,
        y,
        label=f"Observed TTR ({filename})",
        alpha=0.8
    )

    plt.plot(
        x,
        y_fit,
        label=f"Bounded reference curve (a={a:.3f}, b={b:.3f})",
        linewidth=2
    )

    plt.xlabel("Tokens")
    plt.ylabel("Type/Token Ratio (TTR)")
    plt.title("TTR Analysis: Observed vs Bounded Model")

    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.show()


if __name__ == "__main__":
    main()

Edit: I've improved the code.