31-07-2025, 10:46 AM
Introduction
Previous studies have examined the topics of the Voynich manuscript by looking at the distribution of words or patterns across its pages. However, to the best of my knowledge, there has not yet been a fully automated topic modelling analysis that compares multiple algorithms.
In this work, I present an automated page-by-page topic analysis of the Voynich manuscript using three different models:
- LDA (Latent Dirichlet Allocation) – which finds 5 topics
- BERTopic – which finds 5 topics
- NMF (Non-negative Matrix Factorization) – which finds 3 topics
The goal is to see how each model clusters the pages, whether patterns align with the manuscript’s illustrated sections (Botanical, Astronomical, Biological, Cosmological, Pharmacological, Recipes), and to observe if there are topics that dominate certain sections.
METHODOLOGY: HOW THE MODELS DETECT TOPICS
LDA (Latent Dirichlet Allocation)
LDA treats each page as a bag of words and assumes:
- Each page is a mixture of topics (in different proportions)
- Each topic is a distribution of words
By repeatedly reassigning words to topics and updating its estimates, LDA discovers which words tend to appear together and groups them into topics. Each page is then described by the topic (or mix of topics) that best matches its word patterns.
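To make the reassignment idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the code behind the plots in this post: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four toy pages (a handful of EVA-style tokens) are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(pages, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on tokenised pages (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for page in pages for w in page})
    page_topic = [[0] * n_topics for _ in pages]        # words per topic on each page
    topic_word = [Counter() for _ in range(n_topics)]   # word counts inside each topic
    topic_total = [0] * n_topics                        # total words per topic

    # Start by giving every word token a random topic.
    z = []
    for d, page in enumerate(pages):
        z.append([])
        for w in page:
            k = rng.randrange(n_topics)
            z[d].append(k)
            page_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, page in enumerate(pages):
            for i, w in enumerate(page):
                # Take the token out of its current topic.
                k = z[d][i]
                page_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how common it is on this page and
                # how strongly it is associated with this word.
                weights = [
                    (page_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                page_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each page (each row sums to 1).
    mixtures = []
    for d, page in enumerate(pages):
        denom = len(page) + n_topics * alpha
        mixtures.append([(page_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return mixtures, topic_word


# Toy pages, invented for illustration only.
toy_pages = [
    ["qokeedy", "shedy", "qokeedy", "daiin"],
    ["shedy", "qokeedy", "daiin"],
    ["chol", "chor", "daiin"],
    ["chor", "chol", "chol", "daiin"],
]
mixtures, topic_word = lda_gibbs(toy_pages, n_topics=2)
dominant = [max(range(2), key=lambda t: m[t]) for m in mixtures]
print(dominant)
```

In practice a library implementation would be the more sensible choice; the point of the sketch is only to show that "each page is a mixture of topics" falls out of counting how many of a page's words end up in each topic.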
BERTopic
BERTopic uses transformer-based embeddings (BERT-style sentence embeddings) to represent each page as a high-dimensional vector capturing semantic similarity. It then applies dimensionality reduction (UMAP) and clustering (HDBSCAN) to group similar pages. Finally, it extracts the most representative words for each cluster, using a class-based TF-IDF score, to define topics. This allows for more nuanced grouping, even with subtle vocabulary differences.
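The last step, picking representative words per cluster, can be sketched without any of the heavy machinery. The snippet below assumes the cluster labels already exist (in BERTopic they would come from the embedding, UMAP and HDBSCAN steps, which are not reproduced here) and scores words with a simplified class-based TF-IDF: frequency within the cluster times log(1 + average words per cluster / total frequency of the word). The function name `c_tf_idf`, the toy pages and the labels are assumptions, and BERTopic's own implementation may differ in detail.

```python
import math
from collections import Counter


def c_tf_idf(pages, labels, top_n=2):
    """Rank words per cluster with a simplified class-based TF-IDF."""
    # Merge all pages of a cluster into one bag of words.
    cluster_counts = {}
    for page, label in zip(pages, labels):
        cluster_counts.setdefault(label, Counter()).update(page)

    # Total frequency of every word across all clusters.
    total_per_word = Counter()
    for counts in cluster_counts.values():
        total_per_word.update(counts)
    avg_words = sum(total_per_word.values()) / len(cluster_counts)

    top_words = {}
    for label, counts in cluster_counts.items():
        size = sum(counts.values())
        scores = {
            w: (n / size) * math.log(1 + avg_words / total_per_word[w])
            for w, n in counts.items()
        }
        # Highest score first; ties broken alphabetically.
        top_words[label] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return top_words


# Toy pages and cluster labels, invented for illustration only.
toy_pages = [
    ["qokeedy", "shedy", "qokeedy", "daiin"],
    ["shedy", "qokeedy", "daiin"],
    ["chol", "chor", "daiin"],
    ["chor", "chol", "chol", "daiin"],
]
toy_labels = [0, 0, 1, 1]
top_words = c_tf_idf(toy_pages, toy_labels)
print(top_words)
```

Note how "daiin" occurs as often as "shedy" or "chor" inside each toy cluster but is ranked below them, because it appears in both clusters and is therefore less distinctive.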
NMF (Non-negative Matrix Factorization)
NMF uses a term-frequency matrix (TF-IDF weighted) and factorizes it into two smaller matrices:
- One representing topics as weighted combinations of words
- One representing pages as weighted combinations of topics
Because all values are non-negative, each page’s topic weights are easy to interpret. The dominant topic for a page is the one with the highest weight.
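As a rough illustration of the factorization, the sketch below builds a small TF-IDF matrix by hand and factorizes it with the classic multiplicative update rules, using NumPy only. This is a sketch under stated assumptions, not the pipeline used for the plots: the helper names `tfidf_matrix` and `nmf`, the particular IDF variant, the iteration count and the toy pages are all my own choices, and a library routine such as scikit-learn's `NMF` would normally be used instead.

```python
import numpy as np


def tfidf_matrix(pages):
    """Build a pages x words TF-IDF matrix from tokenised pages."""
    vocab = sorted({w for page in pages for w in page})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(pages), len(vocab)))
    for i, page in enumerate(pages):
        for w in page:
            counts[i, index[w]] += 1
    doc_freq = (counts > 0).sum(axis=0)             # pages containing each word
    idf = np.log(len(pages) / doc_freq) + 1.0       # one common IDF variant
    return counts * idf, vocab


def nmf(V, n_topics, n_iter=300, seed=0):
    """Factorize V ~ W @ H with non-negative W (pages x topics), H (topics x words)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics)) + 0.1
    H = rng.random((n_topics, V.shape[1])) + 0.1
    eps = 1e-9  # avoids division by zero
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy pages, invented for illustration only.
toy_pages = [
    ["qokeedy", "shedy", "qokeedy", "daiin"],
    ["shedy", "qokeedy", "daiin"],
    ["chol", "chor", "daiin"],
    ["chor", "chol", "chol", "daiin"],
]
V, vocab = tfidf_matrix(toy_pages)
W, H = nmf(V, n_topics=2)
dominant = W.argmax(axis=1)  # dominant topic = highest weight in each page's row
print(dominant)
```

Here `W` plays the role of "pages as weighted combinations of topics" and `H` the role of "topics as weighted combinations of words".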
RESULTS
Each model produces two complementary visualizations:
Timeline Plot (Top)
- Horizontal axis (X) = Ordered folios of the Voynich manuscript.
- Vertical axis (Y) = Dominant topic assigned to each folio (numbered according to the model).
- Color = Illustrated section of the folio (Botanical, Astronomical, etc.).
- Marker shape = Topic number.
How to interpret: Clusters of the same marker in the same color band indicate topic consistency within a section. Sudden changes of marker shape within a section may suggest internal variation or a shift of topic.
Heatmap (Bottom)
- Rows (Y) = Illustrated sections of the manuscript.
- Columns (X) = Detected topics from the model.
- Cell value (and color) = Proportion of pages in that section assigned to each topic (normalized so each row sums to 1).
How to interpret: A bright yellow cell (value near 1) means that almost all folios in that section belong to a single topic → high homogeneity. A row with several colored cells means that section contains multiple topics → possible internal diversity or mixed content.
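For readers who want to rebuild the heatmap values from their own topic assignments, the row normalization can be sketched as follows. The function name `section_topic_proportions` and the toy section and topic lists are assumptions for illustration, not data from the plots.

```python
from collections import Counter, defaultdict


def section_topic_proportions(sections, topics):
    """Return {section: {topic: share of that section's pages}}; each row sums to 1."""
    counts = defaultdict(Counter)
    for section, topic in zip(sections, topics):
        counts[section][topic] += 1
    all_topics = sorted(set(topics))
    table = {}
    for section, topic_counts in counts.items():
        total = sum(topic_counts.values())
        table[section] = {t: topic_counts[t] / total for t in all_topics}
    return table


# Toy assignments, invented for illustration only: one entry per page.
toy_sections = ["Botanical"] * 4 + ["Biological"] * 2
toy_topics = [2, 2, 1, 3, 1, 1]
table = section_topic_proportions(toy_sections, toy_topics)
for section, row in table.items():
    print(section, row)
```

In this toy example the Biological row has a single cell equal to 1 (fully homogeneous), while the Botanical row is spread over three topics.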
Note 1: Topic numbers (1, 2, 3, 4, 5) are only labels within a single model. Topic 1 in one model is not the same topic as Topic 1 in another.
Note 2: The "ordered folios" on the timeline axis should be read as page numbers counted from f1r (page 1). E.g. page 48 is f25v, and page 49 is the folio side that follows it.
LDA (5 Topics)
![[Image: st8uiUT.png]](https://i.imgur.com/st8uiUT.png)
- Botanical and Pharmacological sections include all 5 topics, suggesting vocabulary variety and perhaps multiple subthemes.
- Astronomical section covers 4 topics (all except Topic 4; Topic 5 appears on only one page).
- Biological and Cosmological sections are entirely assigned to Topic 1 – extremely homogeneous.
- Recipes section is mostly Topic 1, with some pages in Topic 3.
BERTopic (5 Topics)
![[Image: PXyXvjz.png]](https://i.imgur.com/PXyXvjz.png)
- Botanical is dominated by Topic 2 (but touches all other topics to some degree).
- Astronomical is mostly Topic 4 with some Topic 1.
- Biological is entirely Topic 3.
- Cosmological uses Topics 3 and 1.
- Pharmacological touches all 5 topics.
- Recipes uses all topics except Topic 2 (striking, since Topic 2 dominates Botanical and Pharmacological) and leans toward Topics 3 and 1.
NMF (3 Topics)
![[Image: Ay7kGky.png]](https://i.imgur.com/Ay7kGky.png)
- Botanical spans all 3 topics, but is dominated by Topic 2 up to around page 48 (f25v, see Note 2) before alternating among the three.
- Astronomical is entirely Topic 3.
- Biological is entirely Topic 1.
- Cosmological is entirely Topic 3 (like Astronomical).
- Pharmacological alternates between Topics 2 and 3.
- Recipes alternates between Topics 1 and 3.
MY THOUGHTS
Across all three models, the Biological and Cosmological sections appear linguistically homogeneous (each model consistently assigns a single dominant topic to them, with at most two topics for the Cosmological section in the BERTopic model). This could reflect genuine stylistic uniformity or simply the models’ sensitivity to repeated patterns in the text. But what if the Cosmological section is in fact closely linked to the Biological section?
Botanical and Pharmacological sections consistently appear more heterogeneous:
- LDA and BERTopic detect a wider spread of topics here, possibly due to multiple subsections or thematic variation within the illustrations.
- Recipes are particularly interesting: they often share topics with Botanical or Pharmacological sections in LDA/BERTopic, but show different topic distributions in NMF.
A striking observation in BERTopic:
- Topic 2 dominates Botanical and Pharmacological, but is absent from Recipes.
- This might suggest a shift in terminology or a distinct textual purpose for the Recipes section despite visual similarity to Pharmacological folios.
In NMF:
- Topic 3 covers Astronomical and Cosmological sections entirely.
- This may mean that NMF sees these two illustrated sections as linguistically similar, perhaps due to formulaic text or repeated glyph patterns.
I would like to read your comments.
Thank you