bi3mw > 13-06-2026, 03:40 PM
(13-06-2026, 03:11 PM)Bernd Wrote: Does sample size matter here?
(13-06-2026, 03:11 PM)Bernd Wrote: What causes the kink in Recipes?
Jorge_Stolfi > 13-06-2026, 08:34 PM
(13-06-2026, 01:47 PM)bi3mw Wrote: For what it's worth, here are the graphs for each section.
Quote: It is surprising that the recipe section, of all places, exhibits the greatest text diversity (lowest b-value). I would have expected the opposite.
Aga Tentakulus > Yesterday, 09:23 AM
bi3mw > Yesterday, 10:46 AM
#!/usr/bin/env python3
import sys
import re
import os
import numpy as np
import matplotlib.pyplot as plt


def tokenize(text):
    # Lowercase the text and split it into runs of word characters
    return re.findall(r"\w+", text.lower())


def compute_ttr(tokens, step=100):
    # Cumulative type/token ratio, sampled every `step` tokens
    x = []
    y = []
    for i in range(step, len(tokens) + 1, step):
        chunk = tokens[:i]
        types = len(set(chunk))
        x.append(i)
        y.append(types / len(chunk))
    # The first (smallest) sample is dropped
    return np.array(x[1:]), np.array(y[1:])


def fit_bounded(x, y):
    """
    Fit:
        y = 1 / (1 + a * x^b)
    Linearized:
        log((1/y) - 1) = log(a) + b * log(x)
    """
    z = (1 / y) - 1
    # TTR = 1 gives z = 0, whose logarithm is undefined; skip those points
    mask = z > 0
    if mask.sum() < 2:
        raise ValueError("Need at least two points with TTR < 1 to fit.")
    logx = np.log(x[mask])
    logz = np.log(z[mask])
    slope, intercept = np.polyfit(logx, logz, 1)
    b = slope
    a = np.exp(intercept)
    return a, b


def reference_curve(x, a, b):
    return 1 / (1 + a * x**b)


def main():
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <textfile.txt>")
        sys.exit(1)
    filepath = sys.argv[1]
    filename = os.path.basename(filepath)
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    tokens = tokenize(text)
    if len(tokens) < 200:
        print("Text is too short.")
        sys.exit(1)
    # dynamic step selection
    if len(tokens) < 12000:
        step = 20
    else:
        step = 100
    print(f"Using step size: {step}")
    # compute TTR
    x, y = compute_ttr(tokens, step=step)
    if len(x) < 2:
        print("Not enough data points for fitting.")
        sys.exit(1)
    # fit bounded model
    a, b = fit_bounded(x, y)
    y_fit = reference_curve(x, a, b)
    # output parameters
    print("\nFitted parameters:")
    print(f" a = {a:.6f}")
    print(f" b = {b:.6f}")
    print()
    # plot
    plt.figure(figsize=(10, 6))
    plt.plot(
        x,
        y,
        label=f"Observed TTR ({filename})",
        alpha=0.8
    )
    plt.plot(
        x,
        y_fit,
        label=f"Bounded reference curve (a={a:.3f}, b={b:.3f})",
        linewidth=2
    )
    plt.xlabel("Tokens")
    plt.ylabel("Type/Token Ratio (TTR)")
    plt.title("TTR Analysis: Observed vs Bounded Model")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()


if __name__ == "__main__":
    main()
petronio > Today, 02:13 AM
Jorge_Stolfi > Today, 04:39 AM
(Today, 02:13 AM)petronio Wrote: I noticed that TTR (and the exponent of Heaps) decrease with N. The larger the section, the smaller the expected TTR.
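Because TTR falls as N grows, raw TTR values for sections of different length are not directly comparable. A minimal sketch of one common workaround, the mean segmental TTR (the ratio averaged over consecutive windows of a fixed size), follows. The function name `mean_segmental_ttr` and the window size `n` are illustrative choices, not something taken from the script posted in this thread.

```python
def mean_segmental_ttr(tokens, n=1000):
    """Average TTR over consecutive non-overlapping windows of exactly n tokens.

    A trailing partial window is ignored, so every ratio is computed
    on the same number of tokens regardless of the section's length.
    """
    ratios = []
    for start in range(0, len(tokens) - n + 1, n):
        window = tokens[start:start + n]
        ratios.append(len(set(window)) / n)
    if not ratios:
        raise ValueError("text is shorter than one window")
    return sum(ratios) / len(ratios)
```

With the same `n` for every section, a short section and a long one are measured at the same token count, so a difference in the result should reflect the text rather than the sample size.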
Jorge_Stolfi > 2 hours ago
(9 hours ago)bi3mw Wrote: (Today, 04:39 AM)Jorge_Stolfi Wrote: One could also try to randomly scramble the words of each section.
I gave it a try. Here are the results:
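For anyone who wants to repeat the scramble test, here is a minimal sketch. It assumes the same `\w+` tokenization as the script posted above; `ttr_curve` and `shuffled_ttr_curve` are hypothetical helper names, and the fixed seed is only there to make runs repeatable.

```python
import random
import re


def tokenize(text):
    # Same tokenization as the script in this thread
    return re.findall(r"\w+", text.lower())


def ttr_curve(tokens, step=100):
    """Cumulative type/token ratio after every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen) / i))
    return points


def shuffled_ttr_curve(tokens, step=100, seed=0):
    """TTR curve of the same tokens in random order.

    Shuffling keeps every word frequency intact and destroys only the
    order, so any difference from the original curve comes from where
    in the text the words occur.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)  # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return ttr_curve(shuffled, step)
```

Both curves end at the same point when the token count is a multiple of `step`, since the total number of types and tokens is unchanged; only the path to that point differs.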
bi3mw > 1 hour ago
(2 hours ago)Jorge_Stolfi Wrote: The good fit to Recipes after shuffling, compared to the bad fit before shuffling, implies that its word frequencies do follow Zipf's law, but the section is not homogeneous. There is a part near the beginning where the vocabulary is mostly static, with the same words being used over and over, and few new words. Then the style changes and suddenly there are a lot of new words being introduced ...
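One way to check where such a change might happen is to count, window by window, how many words appear for the first time. This is only a sketch: `new_types_per_window` is a hypothetical helper and the window size of 200 tokens is an arbitrary choice.

```python
def new_types_per_window(tokens, window=200):
    """For each consecutive window, count the words not seen earlier in the text."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), window):
        new = 0
        for tok in tokens[start:start + window]:
            if tok not in seen:
                seen.add(tok)
                new += 1
        counts.append(new)
    return counts
```

A run of low counts followed by a jump would be consistent with a static stretch followed by a change of style. In a homogeneous text the counts should instead decline gradually.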