The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
Introduction

Previous studies have examined the topics of the Voynich manuscript by looking at the distribution of words or patterns across its pages. However, to the best of my knowledge, there has not yet been a fully automated topic modelling analysis that compares multiple algorithms.

In this work, I present an automated page-by-page topic analysis of the Voynich manuscript using three different models:
  • LDA (Latent Dirichlet Allocation) – which finds 5 topics
  • BERTopic – which finds 5 topics
  • NMF (Non-negative Matrix Factorization) – which finds 3 topics

The goal is to see how each model clusters the pages, whether the resulting patterns align with the manuscript's illustrated sections (Botanical, Astronomical, Biological, Cosmological, Pharmacological, Recipes), and whether certain topics dominate particular sections.

METHODOLOGY: HOW THE MODELS DETECT TOPICS

LDA (Latent Dirichlet Allocation)

LDA treats each page as a bag of words and assumes:
  • Each page is a mixture of topics (in different proportions)
  • Each topic is a distribution of words
Through repeated statistical assignments, LDA discovers which words tend to appear together and groups them into topics. Pages are then assigned the topic (or mix of topics) that best matches their word patterns.
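As a rough illustration (not the exact code used here), a minimal gensim sketch of this pipeline; the pages list and its tokens are hypothetical placeholders:

Code:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical input: one list of EVA word tokens per page
pages = [["daiin", "chol", "shey"], ["qokeedy", "okaiin", "shedy"]]

dictionary = Dictionary(pages)                   # token -> id mapping
corpus = [dictionary.doc2bow(p) for p in pages]  # bag of words per page

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=5, passes=20, random_state=42)

# The dominant topic of a page is the highest-probability topic in its mixture
for i, bow in enumerate(corpus):
    topic, prob = max(lda.get_document_topics(bow), key=lambda t: t[1])
    print(f"page {i + 1}: topic {topic} ({prob:.2f})")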

BERTopic

BERTopic uses transformer-based embeddings (BERT) to represent each page as a high-dimensional vector capturing semantic similarity. It then applies dimensionality reduction (UMAP) and clustering (HDBSCAN) to group similar pages. Finally, it extracts the most representative words for each cluster to define topics. This allows for more nuanced grouping, even with subtle vocabulary differences.
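A minimal sketch with the bertopic package (hypothetical documents; note that BERTopic embeds the text with a sentence-transformers model by default, and how well that works on Voynich transliterations is an open question):

Code:
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Toy stand-in for the ~220 folio transcriptions (real pages needed in practice)
docs = ["daiin chol shey", "qokeedy okaiin shedy"] * 20

umap_model = UMAP(n_neighbors=15, n_components=5, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=5)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)  # one topic label per page

print(topic_model.get_topic_info())  # clusters with their representative words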

NMF (Non-negative Matrix Factorization)

NMF uses a term-frequency matrix (TF-IDF weighted) and factorizes it into two smaller matrices:
  • One representing topics as weighted combinations of words
  • One representing pages as weighted combinations of topics
Because all values are non-negative, each page's topic weights are easy to interpret; the dominant topic for a page is the one with the highest weight.
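A minimal scikit-learn sketch of this factorization (hypothetical documents):

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical input: one space-separated string of EVA words per page
docs = ["daiin chol shey", "qokeedy okaiin shedy"]

X = TfidfVectorizer().fit_transform(docs)  # pages x terms, TF-IDF weighted

nmf = NMF(n_components=3, random_state=42)
W = nmf.fit_transform(X)   # pages  x topics: topic weights per page
H = nmf.components_        # topics x terms: word weights per topic

dominant_topic = W.argmax(axis=1)  # highest-weight topic for each page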

RESULTS

Each model produces two complementary visualizations:

Timeline Plot (Top)
  • Horizontal axis (X) = Ordered folios of the Voynich manuscript.
  • Vertical axis (Y) = Dominant topic assigned to each folio (numbered according to the model).
  • Color = Illustrated section of the folio (Botanical, Astronomical, etc.).
  • Marker shape = Topic number.

How to interpret: Clusters of the same marker within the same color band indicate topic consistency within a section. Sudden changes of marker shape within a section may suggest internal variation or a shift in topic.
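For reference, a minimal matplotlib sketch of how such a timeline plot can be built; the DataFrame and all its values are hypothetical:

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-folio results in reading order
df = pd.DataFrame({"topic": [2, 2, 1, 3],
                   "section": ["Botanical", "Botanical", "Biological", "Recipes"]})

colors = {s: c for s, c in zip(df["section"].unique(), plt.cm.tab10.colors)}
markers = ["o", "s", "D", "^", "v"]  # one shape per topic number

for i, row in df.iterrows():
    plt.scatter(i, row["topic"], color=colors[row["section"]],
                marker=markers[(row["topic"] - 1) % len(markers)])
plt.xlabel("folio (reading order)")
plt.ylabel("dominant topic")
plt.show()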

Heatmap (Bottom)
  • Rows (Y) = Illustrated sections of the manuscript.
  • Columns (X) = Detected topics from the model.
  • Cell value (and color) = Proportion of pages in that section assigned to each topic (normalized so each row sums to 1).

How to interpret: A bright yellow cell (value near 1) means that almost all folios in that section belong to a single topic → high homogeneity. A row with several colored cells means that section contains multiple topics → possible internal diversity or mixed content.
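A minimal pandas sketch of how such a row-normalized heatmap can be computed (hypothetical data; seaborn is assumed for the display):

Code:
import pandas as pd
import seaborn as sns

# Hypothetical per-page results: illustrated section and dominant topic
df = pd.DataFrame({"section": ["Botanical", "Botanical", "Biological", "Recipes"],
                   "topic":   [2, 1, 1, 3]})

# Row-normalized section x topic proportions (each row sums to 1)
heat = pd.crosstab(df["section"], df["topic"], normalize="index")

sns.heatmap(heat, annot=True, cmap="viridis")  # bright cells = homogeneous rows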

Note 1: topic numbers 1, 2, 3, 4, 5 do not refer to the same topics across models; each number is just a label for a topic within one model.
Note 2: "ordered folios" in the timeline diagram should be read as "pages". E.g. page 48 is f25v and page 49 is f26r (as page 1 is f1r).

LDA (5 Topics)

[Image: st8uiUT.png]
  • Botanical and Pharmacological sections include all 5 topics, suggesting vocabulary variety and perhaps multiple subthemes.
  • Astronomical section covers 4 topics (all except Topic 4; Topic 5 appears on only one page).
  • Biological and Cosmological sections are entirely assigned to Topic 1 – extremely homogeneous.
  • Recipes section is mostly Topic 1, with some pages in Topic 3.

BERTopic (5 Topics)

[Image: PXyXvjz.png]
  • Botanical is dominated by Topic 2 (but touches all other topics to some degree).
  • Astronomical is mostly Topic 4 with some Topic 1.
  • Biological is entirely Topic 3.
  • Cosmological uses Topics 3 and 1.
  • Pharmacological touches all 5 topics.
  • Recipes uses all topics except Topic 2 (striking, since Topic 2 dominates Botanical and Pharmacological) and leans toward Topics 3 and 1.

NMF (3 Topics)

[Image: Ay7kGky.png]
  • Botanical spans all 3 topics, but is dominated by Topic 2 up to around page 48 (f25v) before alternating among the three.
  • Astronomical is entirely Topic 3.
  • Biological is entirely Topic 1.
  • Cosmological is entirely Topic 3 (like Astronomical).
  • Pharmacological alternates between Topics 2 and 3.
  • Recipes alternates between Topics 1 and 3.

MY THOUGHTS

Across all three models, the Biological and Cosmological sections appear linguistically homogeneous (each model consistently assigns a single dominant topic to them, with at most two topics for the Cosmological section in the BERTopic model). This could reflect genuine stylistic uniformity or simply the models’ sensitivity to repeated patterns in the text. But what if the Cosmological section is in fact closely linked to the Biological section?

Botanical and Pharmacological sections consistently appear more heterogeneous:
  • LDA and BERTopic detect a wider spread of topics here, possibly due to multiple subsections or thematic variation within the illustrations.
  • Recipes are particularly interesting: they often share topics with Botanical or Pharmacological sections in LDA/BERTopic, but show different topic distributions in NMF.

A striking observation in BERTopic:
  • Topic 2 dominates Botanical and Pharmacological, but is absent from Recipes.
  • This might suggest a shift in terminology or a distinct textual purpose for the Recipes section despite visual similarity to Pharmacological folios.

In NMF:
  • Topic 3 covers Astronomical and Cosmological sections entirely.
  • This may mean that NMF sees these two illustrated sections as linguistically similar — perhaps due to formulaic text or repeated glyph patterns.

I would like to read your comments.

Thank you
OK, this looks interesting.

How long did this take you?
(31-07-2025, 12:26 PM)ReneZ Wrote: How long did this take you?

This took me a couple of days of work. But the program runs in about 2 minutes for all three models, one after the other.
Really interesting work. I recommend looking under the hood at how LDA is defining "topics" with specific word types. In my own dalliances with LDA, I have found that LDA tends to create one "topic" that's defined by many of the globally commonest word types within the VMS and then assigns a lot of folios to this one topic. Something like that could be why according to LDA, the biological and cosmological sections are the same topic, while NMF says the two sections cleanly separate into different topics.
(31-07-2025, 02:23 PM)magnesium Wrote: Really interesting work. I recommend looking under the hood at how LDA is defining "topics" with specific word types. In my own dalliances with LDA, I have found that LDA tends to create one "topic" that's defined by many of the globally commonest word types within the VMS and then assigns a lot of folios to this one topic. Something like that could be why according to LDA, the biological and cosmological sections are the same topic, while NMF says the two sections cleanly separate into different topics.

Thanks for the suggestion! I actually followed your advice and removed the most common words. I chose those appearing in more than 60% of folios: ['dy', 'dar', 'daiin', 's', 'or', 'chol', 'y']
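For reference, that filtering step sketched with hypothetical names (pages being a list of per-page token lists, as in the sketches above):

Code:
from collections import Counter

# Drop words that occur on more than 60% of folios
doc_freq = Counter(w for page in pages for w in set(page))
too_common = {w for w, df in doc_freq.items() if df > 0.6 * len(pages)}

pages_clean = [[w for w in page if w not in too_common] for page in pages]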

After cleaning, I re-ran all three models:

[Image: nTIG91a.png]


LDA now detects 9 topics but still clusters the Cosmological and Biological sections mostly together.


[Image: pbnFyAl.png]


BERTopic now detects 4 topics.


[Image: ze4W2ef.png]


NMF now gives 4 topics (chosen with the elbow method, which is not very clear-cut):

[Image: Tv54pf2.png]
(31-07-2025, 03:12 PM)quimqu Wrote: LDA now detects 9 topics but still clusters the Cosmological and Biological sections mostly together.

Isn't the number of topics a parameter of LDA? Which approach/algorithm/implementation do you use for deciding how many topics there are?

Quote:Selecting the number of topics in Latent Dirichlet Allocation (LDA) models is considered to be a difficult task, for which various approaches have been proposed.
(31-07-2025, 03:35 PM)nablator Wrote: Isn't the number of topics a parameter of LDA? Which approach/algorithm/implementation do you use for deciding how many topics there are?

I iterate through different topic numbers and calculate the coherence score for each, then take the maximum. In the last case (without the repetitive words) I got these scores:

[Image: sv0H42X.png]

I use this function, which uses gensim's LdaModel and CoherenceModel:

Code:
# 4. Coherence computation function
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def compute_coherence(dictionary, corpus, texts, start=2, limit=15, step=1):
    """Fit an LdaModel for each candidate topic count and score it with c_v coherence."""
    coherence_values = []
    model_list = []

    for num_topics in range(start, limit + 1, step):
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=20,
            alpha='auto'
        )
        model_list.append(model)

        coherencemodel = CoherenceModel(
            model=model,
            texts=texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values
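A hypothetical call that picks the topic count with the highest coherence (assuming dictionary, corpus and texts are already built):

Code:
models, scores = compute_coherence(dictionary, corpus, texts, start=2, limit=15)
best_index = scores.index(max(scores))
print(f"best num_topics = {2 + best_index}")  # offset by start=2, step=1
best_model = models[best_index]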
Very interesting!

Many years ago I did something vaguely similar, but with clustering in an explicit space rather than the black boxes that I presume are those models you used.

Basically the idea was (see the sketch after this list):
  1. Pick a list of the N most common words in the manuscript
  2. Compute their frequencies on each page, as a point in N-space
  3. Choose three orthogonal axes X,Y,Z in N-space
  4. Project the points on the planes XY, XZ, YZ, colored by section
  5. Look at the plots and mumble something
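A rough numpy sketch of steps 1-4 (hypothetical data; taking the principal axes as X, Y, Z is just one possible choice, not necessarily the one used back then):

Code:
import numpy as np
from collections import Counter

# Hypothetical input: per-page token lists, in presumed reading order
pages = [["daiin", "chol", "shey"], ["qokeedy", "okaiin", "shedy"]]
N = 50

# 1. The N most common words in the whole manuscript
counts = Counter(w for page in pages for w in page)
vocab = [w for w, _ in counts.most_common(N)]

# 2. Per-page relative frequencies of those words, as points in N-space
def freq_vector(page):
    c, total = Counter(page), len(page)
    return np.array([c[w] / total for w in vocab])

points = np.array([freq_vector(p) for p in pages])

# 3-4. Choose orthogonal axes and project the points onto them
centered = points - points.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ Vt[:2].T  # coordinates in the XY plane; XZ, YZ analogous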
Here are some plots, using N = 50 

  [Images: scatter plots of the pages projected on the XY, XZ, and YZ planes, colored by section]

Here each dot is a page, and the lines connect the pages in presumed reading order. Lines are omitted when there are missing pages in the manuscript, and when the section changes from one page to the next.

There are more details in my original write-up. Beware that I haven't looked at this work in 25 years, and the data was the old interlinear transcription, which had rather variable quality.

My conclusions: 
  1. Each section is fairly homogeneous in this metric, except that herbal-A is very different from herbal-B
  2. Sections can be almost well-separated in this N-space.
  3. Yet the difference between herbal-A and herbal-B is comparable to that between any other two sections.
  4. Indeed there seems to be a progressive variation from section to section, with herbal-A at one end, herbal-B at the other end, and the other sections wandering about between the two.
  5. Within each section the pages vary "randomly" around the section's centroid, without a clear gradient.
  6. It is not clear whether the pages within each section are really independent, or each page has some tendency to stay close to the adjacent pages.  I would expect the latter to be the case in the Biological section, maybe not in the other sections.
All the best, --jorge
(31-07-2025, 06:13 PM)Jorge_Stolfi Wrote: Very interesting!

Thank you Jorge!

Your experiment's results were also very interesting. I think there are some important connections with mine.

I will keep working with my "black boxes" ;) and update the results. Now I am working with a smaller corpus: I removed the high-frequency words, the hapax legomena (words that appear only once), and the frequent words that appear in all sections, leaving a final dictionary of 2459 words. This gives quite interesting results. I'll keep you updated!
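For reference, a minimal sketch of the hapax filter (hypothetical names, building on the pages_clean list from the earlier sketch):

Code:
from collections import Counter

# Drop hapax legomena: words that appear only once in the whole corpus
total_counts = Counter(w for page in pages_clean for w in page)
hapax = {w for w, n in total_counts.items() if n == 1}

pages_small = [[w for w in page if w not in hapax] for page in pages_clean]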

Thanks again