Automated Topic Analysis of the Voynich Manuscript - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Automated Topic Analysis of the Voynich Manuscript (/thread-4834.html)

Automated Topic Analysis of the Voynich Manuscript - quimqu - 31-07-2025

Introduction

Previous studies have examined the topics of the Voynich manuscript by looking at the distribution of words or patterns across its pages. However, to the best of my knowledge, there has not yet been a fully automated topic modelling analysis that compares multiple algorithms. In this work, I present an automated page-by-page topic analysis of the Voynich manuscript using three different models: LDA, BERTopic, and NMF.

The goal is to see how each model clusters the pages, whether the clusters align with the manuscript’s illustrated sections (Botanical, Astronomical, Biological, Cosmological, Pharmacological, Recipes), and whether certain topics dominate certain sections.

METHODOLOGY: HOW THE MODELS DETECT TOPICS

LDA (Latent Dirichlet Allocation)

LDA treats each page as a bag of words and assumes that each page is a mixture of topics and that each topic is a probability distribution over words.
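
A minimal sketch of LDA's generative view of a page, for illustration only. It is not the code used for this analysis: the two topics and their probabilities are made up, and the vocabulary simply reuses a few words quoted in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and two invented topics (each row sums to 1).
vocab = ["daiin", "chol", "dar", "dy", "or", "y"]
topic_word = np.array([
    [0.40, 0.05, 0.05, 0.30, 0.15, 0.05],  # invented topic 0
    [0.05, 0.40, 0.35, 0.05, 0.05, 0.10],  # invented topic 1
])

def generate_page(n_words, alpha=(0.5, 0.5)):
    """Sample one page under LDA's generative assumptions."""
    theta = rng.dirichlet(alpha)  # page-level mixture over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)          # pick a topic for this token
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, page = generate_page(20)
print(np.round(theta, 2), page)
```

Fitting LDA runs this process in reverse: given only the pages, it estimates the per-page mixtures and the per-topic word distributions that best explain the observed word counts.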

BERTopic

BERTopic uses transformer-based embeddings (BERT) to represent each page as a high-dimensional vector capturing semantic similarity. It then applies dimensionality reduction (UMAP) and clustering (HDBSCAN) to group similar pages. Finally, it extracts the most representative words for each cluster to define topics. This allows for more nuanced grouping, even with subtle vocabulary differences.

NMF (Non-negative Matrix Factorization)

NMF uses a term-frequency matrix (TF-IDF weighted) and factorizes it into two smaller non-negative matrices: a page-topic matrix that gives the weight of each topic on each page, and a topic-term matrix that gives the weight of each word in each topic.
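
As a rough illustration of the NMF factorization, the sketch below factorizes a small invented page-term matrix V into W (pages x topics) and H (topics x terms) using the classic multiplicative update rules. This is an assumption-laden toy, not the implementation used for the plots in this thread; in practice a library routine such as scikit-learn's NMF would be the usual choice. The function name `nmf` and the matrix values are hypothetical.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative V (pages x terms) as W @ H using
    multiplicative updates for the squared-error (Frobenius) loss."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps  # page-topic weights
    H = rng.random((k, m)) + eps  # topic-term weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Invented weighted page-term matrix: 4 pages x 5 terms.
V = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.0, 0.0, 0.8, 0.9, 0.1],
])

W, H = nmf(V, k=2)
dominant_topic = W.argmax(axis=1)  # hard assignment: strongest topic per page
print(dominant_topic)
```

Taking the argmax of each row of W is one simple way to turn the soft topic weights into a single topic label per page, which is what a timeline plot with one marker per page needs.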

RESULTS

Each model produces two complementary visualizations:

Timeline Plot (Top)

How to interpret: Clusters of the same marker in the same color band indicate topic consistency within a section. Sudden changes of marker shape within a section may suggest internal variation or a change of topic.

Heatmap (Bottom)

How to interpret: A bright yellow cell (value near 1) means that almost all folios in that section belong to a single topic → high homogeneity. A row with several colored cells means that the section contains multiple topics → possible internal diversity or mixed content.

Note 1: The topic numbers 1, 2, 3, 4, 5 do not refer to the same topics across models. Each number is just a label for a topic within one model.

Note 2: The ordered folios in the timeline diagram should be read as "pages". E.g. page 48 should be f25v and page 49 should be [linked folio] (as page 1 is f1r).

LDA (5 Topics)

[Figure: LDA timeline plot (top) and heatmap (bottom)]

BERTopic (5 Topics)

[Figure: BERTopic timeline plot (top) and heatmap (bottom)]

NMF (3 Topics)

[Figure: NMF timeline plot (top) and heatmap (bottom)]
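
The heatmap values can be reproduced from per-page labels along the following lines. This sketch assumes that each cell is the fraction of a section's pages assigned to a given topic, which is how the interpretation notes above read; the function name `section_topic_shares` and the example labels are invented for illustration.

```python
from collections import Counter, defaultdict

def section_topic_shares(sections, topics):
    """For each section, return the fraction of its pages assigned to each topic."""
    counts = defaultdict(Counter)
    for section, topic in zip(sections, topics):
        counts[section][topic] += 1
    return {
        section: {t: n / sum(c.values()) for t, n in sorted(c.items())}
        for section, c in counts.items()
    }

# Invented example: one section label and one dominant topic per page.
sections = ["Botanical", "Botanical", "Botanical", "Botanical",
            "Biological", "Biological"]
topics = [1, 2, 1, 3, 4, 4]

shares = section_topic_shares(sections, topics)
for section, row in shares.items():
    print(section, row)
```

In this toy input the Biological row has a single cell with value 1.0 (fully homogeneous), while the Botanical row is spread over three topics.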

MY THOUGHTS

Across all three models, the Biological and Cosmological sections appear linguistically homogeneous: each model consistently assigns a single dominant topic to them, with at most two topics for the Cosmological section in the BERTopic model. This could reflect genuine stylistic uniformity or simply the models’ sensitivity to repeated patterns in the text. But what if the Cosmological section is in fact closely linked to the Biological section?

Botanical and Pharmacological sections consistently appear more heterogeneous:
A striking observation in BERTopic:
In NMF:
I would like to read your comments. Thank you

RE: Automated Topic Analysis of the Voynich Manuscript - ReneZ - 31-07-2025

OK, this looks interesting. How long did this take you?

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 31-07-2025

(31-07-2025, 12:26 PM)ReneZ Wrote: How long did this take you?

This took me a couple of days of work. But the program runs in about 2 minutes for all three models, one after the other.

RE: Automated Topic Analysis of the Voynich Manuscript - magnesium - 31-07-2025

Really interesting work. I recommend looking under the hood at how LDA is defining "topics" with specific word types. In my own dalliances with LDA, I have found that LDA tends to create one "topic" that's defined by many of the globally commonest word types within the VMS and then assigns a lot of folios to this one topic. Something like that could be why according to LDA, the biological and cosmological sections are the same topic, while NMF says the two sections cleanly separate into different topics.

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 31-07-2025

(31-07-2025, 02:23 PM)magnesium Wrote: Really interesting work. I recommend looking under the hood at how LDA is defining "topics" with specific word types. In my own dalliances with LDA, I have found that LDA tends to create one "topic" that's defined by many of the globally commonest word types within the VMS and then assigns a lot of folios to this one topic. Something like that could be why according to LDA, the biological and cosmological sections are the same topic, while NMF says the two sections cleanly separate into different topics.

Thanks for the suggestion! I actually followed your advice and removed the most common words.
I chose those appearing in more than 60% of folios: ['dy', 'dar', 'daiin', 's', 'or', 'chol', 'y']

After cleaning, I re-ran all three models:

[figure]

LDA now detects 9 topics but still clusters cosmological and biological almost together.

[figure]

BERTopic now detects 4 topics.

[figure]

NMF now gives 4 topics (elbow method, not very clear):

[figure]

RE: Automated Topic Analysis of the Voynich Manuscript - nablator - 31-07-2025

(31-07-2025, 03:12 PM)quimqu Wrote: LDA now detects 9 topics but still clusters cosmological and biological almost together.

Isn't the number of topics a parameter of LDA? Which approach/algorithm/implementation do you use for deciding how many topics there are?

Quote: Selecting the number of topics in Latent Dirichlet Allocation (LDA) models is considered to be a difficult task, for which various approaches have been proposed. [link]

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 31-07-2025

(31-07-2025, 03:35 PM)nablator Wrote: Isn't the number of topics a parameter of LDA? Which approach/algorithm/implementation do you use for deciding how many topics there are?

I iterate through different topic numbers and calculate the Coherence Score, then I take the maximum. In the last case (without repetitive words) I got these scores:

[figure]

I use this function, which uses gensim LdaModel and CoherenceModel:

Code:
# 4. Coherence computation function

RE: Automated Topic Analysis of the Voynich Manuscript - Jorge_Stolfi - 31-07-2025

Very interesting! Many years ago I did something vaguely similar, but with clustering in an explicit space rather than the black boxes that I presume are those models you used. Basically the idea was
[link] [link] [link]

Here each dot is a page, and the lines connect the pages in presumed reading order. Lines are omitted when there are missing pages in the manuscript, and when the section changes from one page to the next. There are more details in the [link]. Beware that I haven't looked at this work in 25 years, and the data was the old interlinear transcription which had rather variable quality.

My conclusions:

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 31-07-2025

(31-07-2025, 06:13 PM)Jorge_Stolfi Wrote: Very interesting!

Thank you Jorge! Your experiment results were also very interesting. I think there are some important connections with mine. I will keep working with my "black boxes".

Thanks again

RE: Automated Topic Analysis of the Voynich Manuscript - GreyCat - 03-08-2025

That's really interesting - brings up those ideas of herbal astrology like in Culpepper.
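
For reference, the common-word filtering step described earlier in the thread (dropping words that appear in more than 60% of folios) could be sketched as below. This is a guess at the procedure, not quimqu's actual code: the function names `frequent_words` and `remove_words` are hypothetical, and the toy pages are invented.

```python
def frequent_words(pages, max_share=0.6):
    """Return the words that occur on more than max_share of the pages."""
    doc_freq = {}
    for tokens in pages:
        for word in set(tokens):  # count each word once per page
            doc_freq[word] = doc_freq.get(word, 0) + 1
    limit = max_share * len(pages)
    return {w for w, df in doc_freq.items() if df > limit}

def remove_words(pages, stop):
    """Drop every token that is in the stop set, page by page."""
    return [[w for w in tokens if w not in stop] for tokens in pages]

# Invented toy pages; real input would be the tokenized transcription.
pages = [
    ["daiin", "chol", "dar"],
    ["daiin", "or", "dy"],
    ["daiin", "chol", "y"],
    ["dar", "chol", "s"],
]

stop = frequent_words(pages)
cleaned = remove_words(pages, stop)
print(sorted(stop))
print(cleaned)
```

The threshold is on document frequency (the share of pages containing a word) rather than on raw token counts, so a word that is repeated many times on a few pages is not removed.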