The Voynich Ninja

Full Version: Mapping Voynich connections through rare tokens
Most of my initial Voynich analysis focused on the most common tokens (I suppose it is the easiest way to start analysing the text). But this doesn't give us much information. So I started wondering if the opposite approach might actually be more informative.

Instead of looking at the global vocabulary, I analysed rare and semi-rare tokens. Not hapax legomena, since many of those could just be scribal noise or transliteration errors, but tokens appearing only a few times across the manuscript. My reasoning was simple: rarer tokens are potentially more specific, and therefore easier to trace between folios (and they may contain fewer errors, since they are repeated in the MS).

I built page-to-page networks based on shared rare tokens. The result was surprisingly structured. Most pages remain weakly connected, but a few behave like hubs that link otherwise distant lexical communities.
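A minimal sketch of how such a page-to-page network could be built. This is not the code behind the results in this thread: the input format (a dict of folio id to token list) and the function names are hypothetical, and the 2 to 10 frequency window is the one given further down the thread.

```python
from collections import Counter, defaultdict
from itertools import combinations

def rare_token_network(pages, min_count=2, max_count=10):
    """pages: dict folio id -> list of tokens (hypothetical input format).
    Returns dict (folio_a, folio_b) -> set of rare tokens both folios contain."""
    totals = Counter(tok for toks in pages.values() for tok in toks)
    rare = {tok for tok, n in totals.items() if min_count <= n <= max_count}

    # Index each rare token by the folios where it occurs.
    folios_by_token = defaultdict(set)
    for folio, toks in pages.items():
        for tok in toks:
            if tok in rare:
                folios_by_token[tok].add(folio)

    # Every pair of folios sharing a rare token gets an edge carrying that token.
    edges = defaultdict(set)
    for tok, folios in folios_by_token.items():
        for a, b in combinations(sorted(folios), 2):
            edges[(a, b)].add(tok)
    return dict(edges)

def hub_scores(edges):
    """Weighted degree: number of shared rare tokens summed over a folio's edges."""
    score = Counter()
    for (a, b), toks in edges.items():
        score[a] += len(toks)
        score[b] += len(toks)
    return score
```

Because each edge keeps the set of tokens that created it, a hub can always be traced back to the specific tokens and folios involved.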

The strongest case was f86v.

This is especially interesting because f86v is a pure text folio. It has no obvious visual structure like zodiac diagrams or herbal labels. Yet it repeatedly emerges as one of the most connected pages in the manuscript when analysing rare-token overlap.

What caught my attention is that f86v does not connect strongly to just one section. It shares selective vocabulary with multiple areas of the manuscript, especially Marginal stars, but also Biological, Herbal, Cosmological and others.

Another thing that stood out is the behaviour of the Marginal stars section itself. Several folios from this section repeatedly emerge as hubs or dense local connectors. This suggests that the section may contain a relatively coherent but highly reused layer of vocabulary, possibly acting as a bridge between otherwise more isolated parts of the manuscript.

Main rare-token hubs detected in the network:

Folio   Section          Observed behaviour
f86v    Text-only        strongest transversal hub
fRos    Cosmological     cross-section connector
f111r   Marginal stars   dense lexical hub
f113r   Marginal stars   dense lexical hub
f115v   Marginal stars   highly connected
f108v   Marginal stars   bridge-like behaviour
f72v    Zodiac           unexpectedly connected
f67r    Astronomical     distant lexical links
f89r    Pharmaceutical   cross-section overlap
f76v    Biological       emerges at relaxed thresholds

The heatmap below shows one example. Each colored signal corresponds to links between f86v and different manuscript sections using only semi-rare tokens. The x-axis follows token order inside the folio itself.

[attachment=15874]

The important point is that the signals are sparse and selective. The page does look like a connector between different lexical neighborhoods (maybe references to text within the MS?).

The Rosettes foldout also behaves in a similarly unusual way. It does not dominate the network as strongly as f86v, but it repeatedly appears as a transversal connector between distant parts of the manuscript.

I am not claiming this proves that f86v is a summary page or an index. That would be too speculative. But I do think these results support a weaker idea: some folios seem to reuse selective vocabulary drawn from multiple textual communities across the manuscript.

Maybe the most common tokens tell us how the text is generated, while the less common ones tell us how the manuscript is organized. I also think the existence of hubs is difficult to explain with a fully random distribution of rare tokens. If these tokens were placed randomly across the manuscript, I would not expect them to accumulate repeatedly around specific folios acting as lexical connectors between otherwise distant pages.

I think this opens a new way of checking for relationships between folios and sections.
This goes one step further than my previous rare-token network analysis. The following results make me think that the MS may have a meaning.

Previously I focused mostly on page-to-page connections using semi-rare tokens. This time I wanted to check whether those connections were actually meaningful, or if they were simply an artifact of section size. After all, large sections like Herbal naturally contain much more text, so they also have many more opportunities to share tokens with other parts of the manuscript.

So I built a null model. I used rare and semi-rare tokens appearing between 2 and 10 times across the manuscript and compared the real section-to-section connections against randomized versions of the same data. The null model preserves how many times each token appears and how many pages it occupies, but randomizes where those appearances occur. In other words, it estimates what level of cross-section overlap we would expect simply by chance.
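One way such a null model could be sketched is shown below. It is an illustration under stated assumptions, not the actual implementation: each token keeps the number of folios it occupies, the folios themselves are redrawn at random with probability proportional to page length (an assumption; the post does not say how placement was randomized), and within-page multiplicities are ignored. All function and parameter names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def section_overlap(token_pages, section_of):
    """token_pages: token -> set of folios; section_of: folio -> section name.
    Counts, per section pair, the tokens present in both sections.
    A self-pair (A, A) counts tokens found on at least two folios of A."""
    counts = Counter()
    for folios in token_pages.values():
        per_section = Counter(section_of[f] for f in folios)
        for a, b in combinations(sorted(per_section), 2):
            counts[(a, b)] += 1
        for s, n in per_section.items():
            if n >= 2:
                counts[(s, s)] += 1
    return counts

def weighted_sample(folios, weights, k, rng):
    """Draw k distinct folios, favouring long pages (Efraimidis-Spirakis keys)."""
    keyed = sorted(folios, key=lambda f: rng.random() ** (1.0 / weights[f]),
                   reverse=True)
    return set(keyed[:k])

def null_expectation(token_pages, section_of, page_len, runs=1000, seed=0):
    """Mean overlap per section pair over `runs` randomized placements.
    page_len: folio -> token count (must be > 0 for every folio)."""
    rng = random.Random(seed)
    folios = sorted(section_of)
    total = Counter()
    for _ in range(runs):
        shuffled = {tok: weighted_sample(folios, page_len, len(fs), rng)
                    for tok, fs in token_pages.items()}
        total.update(section_overlap(shuffled, section_of))
    return {pair: n / runs for pair, n in total.items()}
```

Comparing `section_overlap` on the real data with `null_expectation` gives the observed and expected values for each section pair.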

This is the raw heatmap of connections.

[attachment=15894]

The graph below shows the raw shared-token structure between sections using semi-rare vocabulary (max rare frequency = 10).

[attachment=15893]

The heatmap and raw graph look interesting. Herbal and Marginal stars are connected to almost every other section, and most strongly to each other. This initially seemed significant, but after normalization some of those links almost completely disappeared. The reason is simple: Herbal is huge. Given the amount of text in that section, a high number of random overlaps is statistically expected.

What survived the null model was much more interesting.

The strongest surviving connections are not simply the largest sections. Instead, the Marginal stars section repeatedly shows much stronger lexical overlap than expected with several different areas of the manuscript, especially Biological, Text-only and Cosmological folios. This is important because the effect remains even after correcting for text volume and token frequency distribution.

After applying the null model, the structure changes considerably. Edge labels show observed shared rare tokens versus expected random overlap.

[attachment=15892]
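As a hedged illustration of how an observed count can be compared against the randomized runs, the sketch below computes the expected overlap, the observed/expected ratio and an empirical one-sided p-value for a single section pair. The function name and the choice of statistics are assumptions; the thread only states that edges are labelled with observed versus expected overlap.

```python
def enrichment(observed, null_counts):
    """observed: shared-token count for one section pair in the real data.
    null_counts: counts for the same pair from the randomized runs."""
    expected = sum(null_counts) / len(null_counts)
    ratio = observed / expected if expected > 0 else float("inf")
    # Empirical one-sided p-value with the usual +1 correction, so that
    # a finite number of runs never reports exactly zero.
    exceed = sum(1 for n in null_counts if n >= observed)
    p_value = (exceed + 1) / (len(null_counts) + 1)
    return expected, ratio, p_value
```

A ratio close to 1 corresponds to a link that is explained by section size alone; a ratio well above 1 with a small p-value is the kind of edge that survives normalization.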

What caught my attention is that Marginal stars still behaves like an anomalous connector section even after normalization. Meanwhile some visually related sections, like Astronomical and Cosmological, are not nearly as dominant as one might intuitively expect.

I think this potentially changes the interpretation of these folios. Instead of just being isolated thematic sections, some pages may actually function as lexical bridges or references between different textual areas of the manuscript.

The large hubs themselves are also interesting. The strongest hubs remain mostly concentrated in Marginal stars and Text-only folios:

f111r, f111v, f108v, f115v, f86v5, f86v6

What I find difficult to explain under a completely meaningless-text hypothesis is not just the existence of repeated tokens, but the existence of selective cross-sectional structure after normalization. It would be relatively easy to generate pseudo-language with local repetition. It is much harder to accidentally generate pages that repeatedly behave as connectors between specific textual communities while still preserving coherent statistical structure.

Of course this still does not prove semantic meaning. Shared vocabulary could also emerge from formulaic writing, structural markers, categories, references, or repeated textual functions rather than direct semantic content. But I do think the results are becoming harder to reconcile with a purely random distribution of semi-rare tokens.

One possible interpretation is that some sections may contain more referential or organizational language than others. If so, pages from Marginal stars and some Text-only folios could potentially be pointing toward recipes, plants, biological processes, or other content elsewhere in the manuscript. Even if the actual meaning remains unknown, the internal organization of the text may already be partially visible through these rare-token networks.
(01-06-2026, 12:55 PM)quimqu Wrote: rare and semi-rare tokens

I am interested in this. But how do you define rare and semi-rare? Is it words that occur only 2,3,4... times in the manuscript? What is the bar for a word to no longer be rare?
(02-06-2026, 11:03 AM)dashstofsk Wrote:
(01-06-2026, 12:55 PM)quimqu Wrote: rare and semi-rare tokens

I am interested in this. But how do you define rare and semi-rare? Is it words that occur only 2,3,4... times in the manuscript? What is the bar for a word to no longer be rare?

I took tokens that appear between 2 and 10 times across the whole MS. Hapax legomena do not give any information about links, and very frequent tokens only add noise. The maximum of 10 is arbitrary. I could have chosen 4, 10, 20...
I don't understand the numbers. What does the number 84 mean that you have for Herbal-Herbal ?
(02-06-2026, 11:54 AM)dashstofsk Wrote: I don't understand the numbers. What does the number 84 mean that you have for Herbal-Herbal ?

84 is the number of rare tokens shared within Herbal pages (shown by the looping arrow).
(01-06-2026, 12:55 PM)quimqu Wrote: So I started wondering if the opposite approach might actually be more informative.

Or you could use a balanced approach where rare terms are upweighted to reflect their relative importance.

Quote:In addition to creating the A matrix as described, which uses straight TF values, two weighting schemes are also employed to modify the values contained in A. The two schemes applied are Term Frequency-Inverse Document Frequency (TF-IDF) and Log-Entropy (LE).
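A rough sketch of the two weighting schemes named in the quote, using their standard textbook definitions: TF-IDF as tf × log(N/df), and Log-Entropy as log(1 + tf) multiplied by a global weight 1 + Σ p·log(p) / log(N), where p is a token's share of its occurrences on a given page. The input format and function names are hypothetical, and this is not taken from the quoted work.

```python
import math
from collections import Counter, defaultdict

def tf_idf(pages):
    """pages: dict folio -> list of tokens. Returns folio -> {token: weight}."""
    n = len(pages)
    tf = {folio: Counter(toks) for folio, toks in pages.items()}
    df = Counter(tok for counts in tf.values() for tok in counts)
    return {folio: {tok: c * math.log(n / df[tok]) for tok, c in counts.items()}
            for folio, counts in tf.items()}

def log_entropy(pages):
    """Log-Entropy weights; needs at least two pages so that log(n) is non-zero."""
    n = len(pages)
    tf = {folio: Counter(toks) for folio, toks in pages.items()}
    gf = Counter()  # global frequency of each token
    for counts in tf.values():
        gf.update(counts)
    plogp = defaultdict(float)  # sum of p * log(p) per token, always <= 0
    for counts in tf.values():
        for tok, c in counts.items():
            p = c / gf[tok]
            plogp[tok] += p * math.log(p)
    glob = {tok: 1 + plogp[tok] / math.log(n) for tok in gf}
    return {folio: {tok: math.log(1 + c) * glob[tok] for tok, c in counts.items()}
            for folio, counts in tf.items()}
```

Under both schemes a token spread evenly over every page gets a weight near zero, while a token concentrated on one page keeps its full weight, which is the "rare terms are upweighted" behaviour described above.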
So out of all the rare words ( of which I count 1963 ) 84 fall exclusively within the Herbal pages? Also 82 ( Herbal-Stars ) fall exclusively within both the Herbal and Stars pages? Similarly, 5 exclusively within the Text pages? Or have I got this wrong?

But also what is your objective? Is it to determine whether rare words are distributed randomly or are localised within sections or by illustration type?

It seems to me that localised is going to be adequately explained under both the meaningful and meaningless hypotheses.
(01-06-2026, 12:55 PM)quimqu Wrote: I think this opens a new way of checking for relationships between folios and sections.

You piqued my curiosity. In the work I've done it's very apparent that the scribes had different methods and word shapes (broadly, Currier A and B). So I decided to see how your low-count theory works across scribes. I looked at hapax tokens that are unique to each scribe and then asked how those tokens were shared with other scribes. This would match your low-count idea without throwing hapax out as noise.

What I immediately noticed was the slow decline in similarity from scribe 1, which is Currier A, compared to the others. Scribes 2-5 compared to each other look mostly flat.

Hapax per scribe:

Scribe   Scribe-local hapax
1        1338
2        1261
3        525
4        298
5        210

Hapax matches to other scribe with Jaccard Overlap

Pair   Matches   Jaccard Overlap
1–2    105       4.21%
1–3    55        3.04%
1–4    37        2.31%
1–5    22        1.44%
2–3    80        4.69%
2–4    29        1.89%
2–5    31        2.15%
3–4    22        2.74%
3–5    29        4.11%
4–5    18        3.67%
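For readers who want to check the arithmetic: Jaccard overlap is the number of shared items divided by the size of the union. The sketch below (function names are illustrative) reproduces the 1–2 figure from the posted counts, assuming the overlap was computed on the two scribe-local hapax sets of 1338 and 1261 tokens with 105 matches.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_from_counts(size_a, size_b, shared):
    """Same quantity from set sizes, since |A ∪ B| = |A| + |B| - |A ∩ B|."""
    return shared / (size_a + size_b - shared)

# Scribes 1 and 2: 105 / (1338 + 1261 - 105) = 105 / 2494
print(f"{jaccard_from_counts(1338, 1261, 105):.2%}")  # prints 4.21%
```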

Hapax matches to other scribes with Jaccard Overlap sorted by overlap

Rank   Pair   Jaccard Overlap
1      2–3    4.69%
2      1–2    4.21%
3      3–5    4.11%
4      4–5    3.67%
5      1–3    3.04%
6      3–4    2.74%
7      1–4    2.31%
8      2–5    2.15%
9      2–4    1.89%
10     1–5    1.44%

Some examples:

Token      Scribes
cheockhy   1, 2, 3, 4
odor       2, 3, 4, 5
ofaiin     1, 2, 3, 5
okody      1, 2, 4, 5

And here's another interesting tidbit. If I strip the gallows from those hapax tokens, that decline flattens out.

Hapax matches to other scribes with Jaccard Overlap with gallows removed

Pair   Matches   Jaccard Overlap
1–2    35        3.80%
1–3    21        3.05%
1–4    19        3.11%
1–5    16        2.74%
2–3    32        4.93%
2–4    22        3.79%
2–5    23        4.18%
3–4    8         2.31%
3–5    15        4.82%
4–5    10        4.22%
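A small sketch of what stripping the gallows could look like. It assumes an EVA transliteration in which the gallows are the glyphs k, t, p and f; how benched gallows were handled in the figures above is not stated, so here the gallows letter is simply deleted and the bench is left in place. The function names are hypothetical.

```python
# EVA gallows glyphs (assumption: k, t, p, f; benches such as "ckh" keep "ch").
GALLOWS = str.maketrans("", "", "ktpf")

def strip_gallows(token):
    """Remove every gallows glyph from a single EVA token."""
    return token.translate(GALLOWS)

def strip_gallows_set(tokens):
    """Apply the stripping to a set of tokens, dropping any that become empty."""
    return {s for s in (strip_gallows(t) for t in tokens) if s}
```

Note that distinct tokens can collapse into the same stripped form (for example "okody" and "otody" both become "oody"), so set sizes and match counts change after stripping.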

2-10 count word matches to other scribes with Jaccard Overlap

Rank   Pair   Shared Tokens   Jaccard Overlap
1      3–5    40              15.87%
2      2–3    81              12.27%
3      1–2    107             11.33%
4      4–5    15              10.07%
5      1–3    61              8.96%
6      3–4    22              7.80%
7      1–4    38              6.60%
8      2–4    32              5.51%
9      1–5    21              3.61%
10     2–5    18              3.09%

Checking all unique words instead of low-count words didn't affect that table much. But through every test except hapax stripping, you can see the drop in similarity between scribe 1 and all the later scribes.

Rank   Pair   Shared Tokens   Jaccard Overlap
1      2–3    380             16.51%
2      3–5    145             16.00%
3      1–2    520             15.42%
4      1–3    292             11.95%
5      2–5    199             9.91%
6      4–5    61              9.84%
7      3–4    101             9.59%
8      1–4    192             8.83%
9      2–4    168             7.84%
10     1–5    139             6.54%
(02-06-2026, 01:24 PM)dashstofsk Wrote: So out of all the rare words ( of which I count 1963 ) 84 fall exclusively within the Herbal pages? Also 82 ( Herbal-Stars ) fall exclusively within both the Herbal and Stars pages? Similarly, 5 exclusively within the Text pages? Or have I got this wrong?

But also what is your objective? Is it to determine whether rare words are distributed randomly or are localised within sections or by illustration type?

It seems to me that localised is going to be adequately explained under both the meaningful and meaningless hypotheses.

After the null normalization most Herbal connections basically disappear. That does not mean Herbal has no shared tokens. It means the amount of overlap is close to what we would statistically expect simply because Herbal is such a large section with lots of text and lots of opportunities for random overlap.

So the graph edges are not “exclusive words between two sections”. An edge just means that pages from those two sections share a certain number of semi-rare tokens. Some of those same tokens may appear in other sections as well.

What becomes interesting is when a connection is much stronger than the null expectation. That is what happens mainly with Marginal stars and some Text-only folios. Those sections still keep unusually strong links after normalization, while Herbal mostly falls back to expected levels.

And yes, my original motivation was basically this: hapax legomena are too isolated to build relationships, while the very common tokens are almost everywhere and probably dominated by the generative structure of the text itself. Semi-rare tokens seemed like a possible middle ground.

If some of those tokens correspond to more specific content, maybe plant names, ingredients, procedures, references or categories, then they could appear selectively between related folios. That was the initial intuition.

I also have the full token-by-token and page-by-page detail behind the graphs, so this is not only section-level statistics. I can trace exactly which rare tokens create each connection and between which folios. I think that part is probably more important than the global graph itself.