The Voynich Ninja - TTR Analysis, Observed vs. Reference Curve

(11-06-2026, 05:46 PM)bi3mw Wrote:
(11-06-2026, 05:33 PM)nablator Wrote: Is b close to 1/2?
No, it's much closer to 1/4 than to 1/2. Isn't that okay?

Weird. I would expect a different 'a' from English, but still a 'b' close to 1/2, since Voynichese sort of follows Zipf's law.
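
For anyone following the a and b discussion: assuming the reference curve has the form TTR(n) = a * n^(-b) (the thread does not show the exact formula, so this is a guess), Heaps' law V(n) ≈ K * n^beta gives TTR = K * n^(beta - 1), i.e. b = 1 - beta. With beta around 1/2, as is often reported for natural-language texts, b would also come out near 1/2. A minimal sketch of estimating a and b by least squares in log-log space; the function names are made up for illustration and this is not bi3mw's code:

```python
import math

def ttr_curve(tokens, step=100):
    """Return (n, TTR) pairs of the cumulative type-token ratio, sampled every `step` tokens."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen) / i))
    return points

def fit_power_law(points):
    """Least-squares fit of log(TTR) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx              # slope of the log-log line equals -b
    a = math.exp(my - slope * mx)  # intercept of the log-log line equals log(a)
    return a, -slope
```

Usage would be along the lines of a, b = fit_power_law(ttr_curve(tokens)), where tokens is the word list of the transliteration. Note that a log-log fit weights all sampled points equally, so the choice of step and of the smallest n can shift b noticeably.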

Would you consider trying that code on other texts? I can provide some if you wish.

All the best, --stolfi

(11-06-2026, 07:31 PM)Jorge_Stolfi Wrote: Would you consider trying that code on other texts? I can provide some if you wish.

If you have a specific text in mind, just send it over.

Here is Culpeper for comparison:
[attachment=16009]

(11-06-2026, 07:42 PM)bi3mw Wrote: The formula used has no upper bound (for small values of n, it can theoretically be slightly greater than 1).

TTR > 1 means more types than tokens... shouldn't be possible.
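
To spell the point out: the observed TTR is types/tokens and can never exceed 1, but an uncapped power law a * n^(-b) does exceed 1 whenever n < a^(1/b). A minimal sketch, assuming the reference curve is a power law of that form (the actual formula is not shown in this thread); both function names are hypothetical:

```python
def reference_ttr(n, a, b):
    """Power-law reference curve capped at 1 (assumed form, not the formula from post #10)."""
    return min(1.0, a * n ** (-b))

def crossover(a, b):
    """Token count below which the uncapped curve a * n**(-b) would exceed 1."""
    return a ** (1.0 / b)
```

For example, with a = 4 and b = 1/2 the uncapped curve is above 1 for every n below 16, which is why only the start of the graph would be affected.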

(12-06-2026, 09:03 AM)nablator Wrote: TTR > 1 means more types than tokens... shouldn't be possible

I fixed the formula. Now the graph is always correct. The code in post #10 has been updated.
[attachment=16010]

The values for b are now higher as well. However, the initial observation remains the same.

Your kink is real — I ran something along these lines and can put numbers on it.

Section-level null model: shuffle section labels across folios while preserving the Currier A/B split, then size-match each pseudo-section (200 iterations). Two sections sit at the repetitive extreme: Herbal is far more repetitive than its size and dialect predict (windowed TTR about 14 sigma below the null mean), and Balneological is close behind (about 10 sigma below). The Stars pages go the other way: hapax share about 4.4 sigma above expectation.
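
For readers who want to try something similar, here is a rough sketch of a null model of this kind. It is not the code behind the numbers above: it uses plain TTR on a truncated sample instead of windowed TTR or hapax share, and the data layout (a list of (section, currier, tokens) tuples), the n_match parameter and all function names are assumptions made for illustration.

```python
import random
import statistics

def ttr(tokens):
    """Plain type-token ratio of a token list."""
    return len(set(tokens)) / len(tokens)

def section_z(folios, section, n_match, iterations=200, seed=0):
    """z-score of one section's TTR against a label-shuffle null.

    folios:  list of (section, currier, tokens) tuples, one per folio.
    n_match: common token count used to size-match real and pseudo-sections;
             it should not exceed the real section's token count.
    """
    rng = random.Random(seed)
    real = [t for sec, _, toks in folios if sec == section for t in toks]
    observed = ttr(real[:n_match])

    # Group folio indices by Currier language so labels move only within A or within B.
    groups = {}
    for i, (_, cur, _) in enumerate(folios):
        groups.setdefault(cur, []).append(i)

    null = []
    for _ in range(iterations):
        picked = []
        for idx in groups.values():
            # Permuting labels inside a group hands the target label to a
            # random subset of the same size as the real one.
            k = sum(1 for i in idx if folios[i][0] == section)
            picked.extend(rng.sample(idx, k))
        rng.shuffle(picked)  # so that truncation does not favour early folios
        pseudo = [t for i in picked for t in folios[i][2]]
        if len(pseudo) >= n_match:
            null.append(ttr(pseudo[:n_match]))

    if len(null) < 2:
        raise ValueError("too few usable pseudo-sections for this n_match")
    mu, sd = statistics.mean(null), statistics.stdev(null)
    if sd == 0:
        raise ValueError("null distribution has zero spread")
    return (observed - mu) / sd
```

A negative z means the section is more repetitive than same-dialect folios of matched size, a positive z means it is more diverse. Pseudo-sections that come out shorter than n_match are simply skipped here, which is one of several possible size-matching choices.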

@Jorge_Stolfi: I checked the label question. Rerunning prose-only (dropping all label/circle/radial loci, about 9% of tokens) changes essentially nothing: Herbal -13.5 sigma, Balneo -10.7, Stars +4.3. It's a property of the running text.

One caveat for your kink specifically: the Astro and Zodiac folios carry no Currier assignments at all, so they can't enter this kind of null. The testable Cosmo remainder is positive but not significant. What is directly confirmed is nablator's compensation point — the Q13 dip is the -10 sigma Balneological deficit.

Interesting side result: boundary-structure metrics that separate Currier A from B cleanly are nearly blind to sections. Vocabulary diversity and boundary structure look like two independent layers of organization.

(12-06-2026, 03:44 PM)petronio Wrote: @Jorge_Stolfi: I checked the label question. Rerunning prose-only (dropping all label/circle/radial loci, about 9% of tokens) changes essentially nothing: Herbal -13.5 sigma, Balneo -10.7, Stars +4.3. It's a property of the running text.

But is the kink still there when you remove the labels?

All the best, --stolfi

I checked this with more than one methodology — and tried to break it, too. Mostly it held: the excess in the Cosmo/Astro stretch survives prose-only (max deviation +0.031 vs +0.036 with labels). The Zodiac part doesn't — those folios are almost all labels, so in a prose-only stream they barely register.

So it's real text behavior, not a label artifact. Where exactly the kink sits is another matter — cumulative curves drag their whole history along, which blurs the location. I get a much cleaner separation with a different setup and heavier tests, which I've been running in parallel.
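
To illustrate why the cumulative curve blurs location: a sliding-window TTR only looks at the last few hundred tokens, so a change in the text shows up where it happens. This is a generic sketch, not the "different setup" mentioned above; the window length and the function names are arbitrary choices.

```python
from collections import Counter

def cumulative_ttr(tokens):
    """TTR of the text from the start up to each position."""
    seen, out = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        out.append(len(seen) / i)
    return out

def windowed_ttr(tokens, window=500):
    """TTR inside a sliding window; one value per window position."""
    if len(tokens) < window:
        return []
    counts = Counter(tokens[:window])
    out = [len(counts) / window]
    for i in range(window, len(tokens)):
        old = tokens[i - window]       # token leaving the window
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]            # keep len(counts) equal to the number of types
        counts[tokens[i]] += 1         # token entering the window
        out.append(len(counts) / window)
    return out
```

On a toy sequence that switches from one repeated word to all-new words, the windowed value climbs to 1.0 as soon as the window has moved past the repetitive part, while the cumulative value is still held down by everything that came before.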

I used ivtt to remove the labels and plotted the graph again. There is no noticeable difference in the output.

For what it's worth, here are the graphs for each section. Of course, their significance is limited given the small amount of text in each section. However, the fluctuations from section to section are quite clear. It is surprising that the recipe section, of all places, exhibits the greatest text diversity (lowest b-value). I would have expected the opposite.

[attachment=16044]

Does sample size matter here?
Herbal and Recipes have over 10K tokens whereas Astro and Cosmo fall short of 3K.
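
On the sample-size question: raw TTR drops as a text gets longer simply because common words keep repeating, so a 10K-token section and a 3K-token section are not directly comparable. One common workaround is to average TTR over equal-length chunks. A minimal sketch; the chunk length of 2500 is an arbitrary value picked here so that a section just under 3K tokens still yields one chunk, and the function name is hypothetical:

```python
def chunked_ttr(tokens, chunk=2500):
    """Mean TTR over consecutive non-overlapping chunks of `chunk` tokens.

    A trailing partial chunk is ignored; returns None if the text is
    shorter than one chunk.
    """
    ttrs = [
        len(set(tokens[i:i + chunk])) / chunk
        for i in range(0, len(tokens) - chunk + 1, chunk)
    ]
    return sum(ttrs) / len(ttrs) if ttrs else None
```

With only one or two chunks per short section the estimate is still noisy, so this makes sections comparable in level but does not remove the small-sample uncertainty.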

What causes the kink in Recipes?
