The Voynich Ninja - TTR Analysis, Observed vs. Reference Curve


While experimenting with the type/token ratio in the VMS corpus, I compared the original curve with a smoothed ideal curve. With one exception, the curves are almost identical. The “kink” above the original curve can only mean that there are more new words than “usual” in this section of text. Since this deviation occurs only once, it is quite remarkable. Or am I overinterpreting this observation?

[attachment=16011]

The steeper downward slope of the very repetitive Q13 compensates for all the novelty in the Cosmo-Astro-Zodiac vocabulary.
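One way to look at this compensation directly, rather than through the cumulative curve, would be to count how many word types appear for the first time in each window of the text. The following is only a minimal sketch: the function name `new_types_per_window` and the default window size are illustrative assumptions, not something used in this thread.

```python
def new_types_per_window(tokens, step=100):
    """Count the word types seen for the first time in each consecutive window.

    `tokens` is a list of words in reading order; `step` is the window size.
    Both names are illustrative, not taken from the thread's script.
    """
    seen = set()
    counts = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + step]
        fresh = set(window) - seen  # types not seen in any earlier window
        counts.append(len(fresh))
        seen |= fresh
    return counts
```

A run of windows with unusually high counts would correspond to a section with a lot of new vocabulary, and a run of low counts to a repetitive section.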

(11-06-2026, 03:29 PM)bi3mw Wrote: While experimenting with the type/token ratio in the VMS corpus, I compared the original curve with a smoothed ideal curve. With one exception, the curves are almost identical. The “kink” above the original curve can only mean that there are more new words than “usual” in this section of text. Since this deviation occurs only once, it is quite remarkable. Or am I overinterpreting this observation?

Your observation is correct. See Timm 2014: "There are pages where words of a particular series are frequently used and pages where they are rare. Graph 1 also shows that the number of words which are part of the grid decreases for the pages from <f67r1> to <f73v>. These pages belong to the Cosmological section and to the section with Zodiac illustrations. Since the grid contains all words used at least four times, this means that more unique and rare words occur on these pages." (Timm 2014, p. 7f).

[attachment=15990]
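The frequency threshold mentioned in the quote can be checked roughly without reconstructing Timm's grid. The sketch below only computes, per page, the share of tokens whose word type occurs fewer than four times in the whole corpus. The input format (a dict from page identifier to token list) and the function name `rare_token_share` are assumptions made for illustration.

```python
from collections import Counter

def rare_token_share(pages, min_count=4):
    """Share of tokens per page whose type is rare in the whole corpus.

    `pages` maps a page id (e.g. "f67r1") to its list of tokens; this input
    format is hypothetical. A type counts as rare if it occurs fewer than
    `min_count` times across all pages.
    """
    freq = Counter(tok for toks in pages.values() for tok in toks)
    shares = {}
    for page, toks in pages.items():
        if not toks:
            shares[page] = 0.0
            continue
        rare = sum(1 for tok in toks if freq[tok] < min_count)
        shares[page] = rare / len(toks)
    return shares
```

Pages with a high share would be the ones where unique and rare words are concentrated.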

(11-06-2026, 03:29 PM)bi3mw Wrote: While experimenting with the type/token ratio in the VMS corpus, I compared the original curve with a smoothed ideal curve.

[link] (which I believe is a consequence of Zipf's law plus the fact that word counts must be integers) says that the number of distinct words (lexemes, word types) in a text of size N is proportional to sqrt(N). Thus the type/token ratio should be approximately K/sqrt(N). Is that what your "smoothed curve" is?

All the best, --stolfi
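The step from "types proportional to sqrt(N)" to "TTR approximately K/sqrt(N)" is just a division by N. A small sketch, written for a general exponent so other values can be tried: the function name and the sample value of K are illustrative assumptions, and `beta=0.5` reproduces the square-root case described above.

```python
import numpy as np

def ttr_from_type_growth(n, k, beta=0.5):
    """TTR implied by a type count of k * n**beta.

    types / tokens = k * n**beta / n = k * n**(beta - 1);
    with beta = 0.5 this is k / sqrt(n).
    """
    n = np.asarray(n, dtype=float)
    return k * n ** (beta - 1.0)

n = np.array([100, 400, 1600, 6400])
print(ttr_from_type_growth(n, k=5.0))  # values: 0.5, 0.25, 0.125, 0.0625
```

Under this form, quadrupling the text length halves the TTR. A fitted TTR exponent b in a * x**(-b) corresponds to a type-growth exponent of 1 - b.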

(11-06-2026, 03:29 PM)bi3mw Wrote: The “kink” above the original curve can only mean that there are more new words than “usual” in this section of text.

Are you including labels in the analysis? It makes no sense to mix running prose with labels. They are supposed to have very different statistics.

All the best, --stolfi

(11-06-2026, 05:05 PM)Jorge_Stolfi Wrote: Is that what your "smoothed curve" is?

Formula:
[attachment=15991]

Function in Python:
def reference_curve(x, a, b):
    return a * x ** (-b)

Call:
# fitted reference curve
a, b = fit_power_law(x, y)
y_fit = reference_curve(x, a, b)
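The helper `fit_power_law` is called above but not shown in the post. One plausible way to write it, assuming an ordinary least-squares fit in log-log space (the actual implementation may well differ), is sketched here:

```python
import numpy as np

def reference_curve(x, a, b):
    # Power-law reference curve, as given in the post.
    return a * x ** (-b)

def fit_power_law(x, y):
    """Fit y = a * x**(-b) by linear regression on log(y) vs. log(x).

    Hypothetical implementation: log(y) = log(a) - b * log(x),
    so the slope of the fitted line is -b and the intercept is log(a).
    Requires x > 0 and y > 0.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(intercept), -slope
```

Note that a fit in log-log space weights relative errors equally, so it is not identical to a direct least-squares fit on the untransformed TTR values.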

(11-06-2026, 05:20 PM)bi3mw Wrote: Function in Python:
def reference_curve(x, a, b):
    return a * x ** (-b)

Is b close to 1/2? It doesn't look like it.

(11-06-2026, 05:33 PM)nablator Wrote: Is b close to 1/2?

No, it's much closer to 1/4 than to 1/2. Isn't that okay?

(11-06-2026, 05:46 PM)bi3mw Wrote:
(11-06-2026, 05:33 PM)nablator Wrote: Is b close to 1/2?
No, it's much closer to 1/4 than to 1/2. Isn't that okay?

It is because Voynichese is not English. :)

An exponent of 0.4 to 0.6 is typical for English.

If anyone wants to check specific sections, here is the code:

Code:
#!/usr/bin/env python3
import sys
import re
import os

import numpy as np
import matplotlib.pyplot as plt


def tokenize(text):
    # Lowercase the text and split it into runs of word characters.
    return re.findall(r"\w+", text.lower())


def compute_ttr(tokens, step=100):
    # Cumulative type/token ratio, sampled every `step` tokens.
    x = []
    y = []
    for i in range(step, len(tokens) + 1, step):
        chunk = tokens[:i]
        types = len(set(chunk))
        x.append(i)
        y.append(types / len(chunk))
    # The first sample point is skipped.
    return np.array(x[1:]), np.array(y[1:])


def fit_bounded(x, y):
    """
    Fit:
        y = 1 / (1 + a * x^b)
    Linearized:
        log((1/y) - 1) = log(a) + b * log(x)
    """
    z = (1 / y) - 1
    # z is 0 wherever every token so far is unique (TTR = 1);
    # log(0) would break the fit, so those points are left out.
    mask = z > 0
    logx = np.log(x[mask])
    logz = np.log(z[mask])
    slope, intercept = np.polyfit(logx, logz, 1)
    b = slope
    a = np.exp(intercept)
    return a, b


def reference_curve(x, a, b):
    return 1 / (1 + a * x**b)


def main():
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <textfile.txt>")
        sys.exit(1)

    filepath = sys.argv[1]
    filename = os.path.basename(filepath)

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    tokens = tokenize(text)
    if len(tokens) < 200:
        print("Text is too short.")
        sys.exit(1)

    # dynamic step selection
    if len(tokens) < 10000:
        step = 20
    else:
        step = 100
    print(f"Using step size: {step}")

    # compute TTR
    x, y = compute_ttr(tokens, step=step)
    if len(x) < 2:
        print("Not enough data points for fitting.")
        sys.exit(1)

    # fit bounded model
    a, b = fit_bounded(x, y)
    y_fit = reference_curve(x, a, b)

    # output parameters
    print("\nFitted parameters:")
    print(f"  a = {a:.6f}")
    print(f"  b = {b:.6f}")
    print()

    # plot
    plt.figure(figsize=(10, 6))
    plt.plot(
        x,
        y,
        label=f"Observed TTR ({filename})",
        alpha=0.8
    )
    plt.plot(
        x,
        y_fit,
        label=f"Bounded reference curve (a={a:.3f}, b={b:.3f})",
        linewidth=2
    )
    plt.xlabel("Tokens")
    plt.ylabel("Type/Token Ratio (TTR)")
    plt.title("TTR Analysis: Observed vs Bounded Model")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()


if __name__ == "__main__":
    main()

Edit: I've improved the code.
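To locate a "kink" numerically instead of by eye, the observed curve can be compared point by point with the fitted one. A minimal sketch, assuming `x`, `y` and `y_fit` are the arrays computed in `main()` of the script above; the function name `largest_positive_deviation` is made up for illustration.

```python
import numpy as np

def largest_positive_deviation(x, y, y_fit):
    """Return (token position, residual) where observed TTR most exceeds the fit.

    x, y, y_fit are equal-length sequences: sample positions, observed TTR,
    and the reference curve evaluated at those positions.
    """
    residuals = np.asarray(y, dtype=float) - np.asarray(y_fit, dtype=float)
    i = int(np.argmax(residuals))
    return x[i], float(residuals[i])
```

The returned position is a token offset, so it still has to be mapped back to a folio or section by hand.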
