The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
This is extremely cool! Is your code available? I'd love to see how you're doing it.
(03-09-2025, 03:22 AM)synapsomorphy Wrote: This is extremely cool! Is your code available? I'd love to see how you're doing it.

Let me clean my code a bit and I'll make it public.
OK, to go deeper into the automated topic analysis, I have compared the outputs of LDA, BERTopic and NMF. All three are different models for finding topics in texts (LDA: probabilistic model of word co-occurrence; NMF: matrix factorization of TF-IDF; BERTopic: clustering of semantic embeddings).

I have capped the maximum number of topics at 10 for better visualization. Here are the results:

LDA
[Image: axcJsah.gif]

BERTopic
[Image: 0G5ZKmt.gif]

NMF
[Image: FbDNZ0Z.gif]

All three models find similar topic distributions. Of course, topic modelling is partially based on words and how often they appear, but the three models go beyond simple word counting:

LDA (Latent Dirichlet Allocation): LDA looks for bundles of word-forms that tend to appear together in the same passages. A topic is not simply the most frequent words, but a cluster of co-occurring tokens. For example, if certain Voynich forms regularly show up in the same folios, LDA groups them into the same hidden theme.

NMF (Non-Negative Matrix Factorization): NMF does not just count either. It highlights which word-forms are distinctive for a passage compared to the rest of the manuscript. Very frequent tokens are down-weighted, while more characteristic forms stand out. Each topic is then built from these distinctive distributions, showing what makes certain sections look different from others.

BERTopic: For natural languages, BERTopic uses language models to capture semantic meaning. With the Voynich text, we do not know meanings, so it cannot recover semantics. What it can do is use its embeddings to detect structural or distributional similarities across passages. Two folios might be grouped together even if they do not share identical tokens, because their patterns of word-forms and symbol sequences resemble each other. In this way, BERTopic functions more as a pattern recognizer than a semantic model for Voynich.
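
To make the pipeline concrete, here is a minimal sketch of how the LDA and NMF passes could be set up with scikit-learn (LDA on raw counts, NMF on TF-IDF); a BERTopic pass would feed the same strings to the bertopic package instead. The paragraph strings and topic count below are toy placeholders rather than the real setup:

Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# Toy stand-ins for EVA paragraphs (a real run would load the transcription).
paragraphs = [
    "daiin shol chol chor cthy",
    "qokeedy qokedy shedy okedy",
    "otedy qokain shedy qokeedy",
    "chol shol daiin cthol dain",
]
N_TOPICS = 2  # capped at 10 in the post; toy data needs fewer

# LDA: probabilistic model over raw co-occurrence counts.
counts = CountVectorizer().fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda_mix = lda.fit_transform(counts)        # per-paragraph topic mixtures

# NMF: factorizes TF-IDF, so near-ubiquitous forms are down-weighted.
tfidf = TfidfVectorizer().fit_transform(paragraphs)
nmf = NMF(n_components=N_TOPICS, random_state=0)
nmf_mix = nmf.fit_transform(tfidf)         # non-negative topic loadings

print(lda_mix.round(2))
print(nmf_mix.round(2))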
(03-09-2025, 03:22 AM)synapsomorphy Wrote: This is extremely cool! Is your code available? I'd love to see how you're doing it.

Hello,

you can find my public Notebook [link]. Note that it is prepared for Kaggle, and I have included instructions about the dataset to be used.
Wow quimqu, this is fascinating work. If you publish this formally, I would also include not only your analysis of Torsten Timm’s auto-generated VMS substitute, but also a control group using meaningful text of a comparable length. A novel involving 3+ story arcs which alternate by chapter might be a good choice. A stream-of-consciousness novel like William Faulkner’s The Sound and the Fury would be an interesting specimen. I’m sure your bots could easily distinguish which story arc, or which period in time in the main character’s mind, a page or paragraph belonged to.

This is similar technology to what companies like TurnItIn.com use for detecting plagiarism in academic essays, no? I remember reading that bots like these are vindicating long-held suspicions that parts of ancient works are later insertions by other authors, and confirming just how common pseudo-attribution was in the olden days. Simply put, textual analysis is getting good at determining, with a high degree of confidence, which pieces of writing were, and were not, composed by the same author. This has some privacy-eroding implications that scare me a bit. I can use a VPN and the Tor browser every time I blog about something seditious or controversial, but that’ll be little help to me in a world where the prosecution can call an expert witness like you with a bot like yours to testify, and the bot says the problematic writings have a 95% confidence match with things written in my name.

I digress. Textual analysis and computer language modeling are not my area of expertise. But as a word and language nerd, they fascinate me to no end.

quimqu, have you looked into any of the efforts to restore the original order of the VMS folios? Wladimir would be your man to talk to about this, along with the great Koen G, of course. It’s well-established that the VMS has been taken apart and rebound at least once, by people who clearly could not read it. For example, there are at least two bifolios of Herbal B misbound into the first 7 quires, which are otherwise all Herbal A. I’d be interested to see your experiment rerun, with the extant folios in the best estimate available for their original order.
(03-09-2025, 10:04 PM)RenegadeHealer Wrote: quimqu, have you looked into any of the efforts to restore the original order of the VMS folios?

Thank you so much for your encouraging words! Yes, I do plan to publish the study once it’s more complete—I still have quite a few threads to pull on. The idea of experimenting with the folio reordering is really intriguing, and I’ll definitely look into how I could approach it. Thanks a lot for the suggestion!
I’ve been running paragraph-level topic models on the text. The first passes (NMF/LDA) made it clear that a lot of variance is explained by style: Currier language A/B and scribal hands. However, the plots also showed sharp colour changes exactly at section boundaries (e.g., Herbal → Pharma), suggesting that genuine themes coexist with stylistic effects.

To separate the two, I cleaned the text (removed single-character tokens and a small, empirically derived list of near-ubiquitous “stop-forms”) and switched to DMR (Dirichlet-Multinomial Regression). In DMR the topic word distributions are global, while the topic mixture for each paragraph depends on covariates (here: language A/B, writing hand, Currier hand). In practice this lets style influence how much of each topic appears without redefining the topics themselves. I compared several values of K (the number of topics) and selected K=6 (highest AMI between topics and editorial sections). Here are the topic distributions across the folios, separated by section, language, writing hand and Currier hand:

[Image: v6tNPxL.png]
[Image: LucX0za.png]
[Image: h4ve8z5.png]
[Image: 1Djm1ok.png]

The six topics align as follows:
  • T0 – Marginal stars: qok-… with -ain/-ey; dominant on the marginal-stars pages.
  • T1 – Astronomical/Zodiac: frequent -iin/-ar/-ey; peaks in Astronomical and Zodiac.
  • T2 – Pharmaceutical: ch/che- + -ol/-or (e.g., chol, cheol, cheor), plus qok+-ol; strong in Pharma.
  • T3 – Text/Cosmological: general high-frequency forms (ar, or, aiin…); Text-only and parts of Cosmological.
  • T4 – Biological (balneological): distinctive -edy/-dy with qo-/qoke-/oke-; the balneological block.
  • T5 – Herbal: ch/sh + o/l patterns; Herbal pages, especially in Language A.

[Image: D01thGY.png]

Distribution. Heatmaps and stacked timelines show near-pure blocks per section: Zodiac/Astro→T1, Pharma→T2, Biological→T4, Herbal→T5, Marginal→T0, Text/Cosmo→T3. Where sections change, the dominant topic usually flips at the same boundary.

[Image: Zs1DuKt.png]

Language and hands. Both still influence prevalence. Language A leans Herbal (T5); Language B spreads over T0/T1/T3/T4. Some hands show long runs dominated by one topic, yet topic shifts track section changes. This points to thematic structure that is not merely scribal style.


[Image: 4wnRjm9.png]
A cautious reading is that this patterning is inconsistent with pure gibberish: the DMR analysis (K=6) uncovers stable, section-aligned topics whose vocabularies persist across scribes and languages, suggesting the text carries systematic content rather than random noise.

I would appreciate any comment.
Just to clarify: in your analysis you take the parags (only, excluding labels and titles) of each page, do some cleanup, compute the token frequency distribution in those parags for each page, and then compare the distributions of different pages to identify "topics", by various criteria.  Is that so?

You may be interested in this python code that I wrote to deal with uncertain word spaces.   Basically the script assumes that each comma is either a word space (like a period) or no space with some probabilities P and 1-P, say 50-50.  Then it enumerates the possibilities and outputs all possible words with a fractional count.  

For instance, in the line Che,y,ky.Cho.dal,dar, it would output the following counts
  • 0.50 Che
  • 0.25 Chey
  • 0.25 y
  • 0.25 yky
  • 0.25 Cheyky
  • 0.50 ky
  • 1.00 Cho
  • 0.50 dal
  • 0.50 daldar
  • 0.50 dar
Note that the output lines that contain a given input glyph must add to 1.
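
A minimal Python sketch of that enumeration (illustrative only; the function name and defaults are placeholders, not the actual script):

Code:
from itertools import product
from collections import defaultdict

def fractional_counts(line, p_space=0.5):
    """Fractional word counts under uncertain commas.

    '.' is a certain word break; ',' is a break with probability p_space
    and no break with probability 1 - p_space.
    """
    counts = defaultdict(float)
    for segment in line.split("."):
        parts = segment.split(",")
        # One independent split/join choice per comma in this segment.
        for choice in product((True, False), repeat=len(parts) - 1):
            weight, words = 1.0, [parts[0]]
            for part, split in zip(parts[1:], choice):
                weight *= p_space if split else 1.0 - p_space
                if split:
                    words.append(part)
                else:
                    words[-1] += part
            for w in words:
                counts[w] += weight
    return dict(counts)

# fractional_counts("Che,y,ky.Cho.dal,dar") reproduces the counts above:
# Che 0.50, Chey 0.25, y 0.25, yky 0.25, Cheyky 0.25, ky 0.50,
# Cho 1.00, dal 0.50, daldar 0.50, dar 0.50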

This is still far from the "correct" freq distr, since it cannot account for uncertain spaces that were omitted in the transcription, or transcribed with ".".  But hopefully it is better than treating all commas as periods (P=1), or deleting all commas (P=0).

Beware that my "IVTFF" files are not yet quite legal IVTFF (as defined by Rene), so you may need to do some adjusting in order to read them.

The same principle should be used when computing word pair frequencies.
Namely, from that line we should get
  • 0.250 Che.y
  • 0.250 Che.yky
  • 0.250 Chey.ky
  • 0.250 Cheyky.Cho
  • 0.250 y.ky
  • 0.250 yky.Cho
  • 0.500 ky.Cho
  • 0.500 Cho.dal
  • 0.500 Cho.daldar
  • 0.500 dal.dar
Here the lines that contain each input possible-word must add to 2; except the first and last possible-word, which must add to 1.

But my code does not do this yet.
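
The pair version could extend the same enumeration along these lines (again only an illustrative sketch, not existing code). Since pairs can straddle the certain "." breaks, the comma choices must be enumerated line-wide:

Code:
import re
from itertools import product
from collections import defaultdict

def fractional_pair_counts(line, p_space=0.5):
    """Fractional counts of adjacent word pairs under uncertain commas."""
    tokens = re.split(r"([.,])", line)     # keep the separators
    parts, seps = tokens[0::2], tokens[1::2]
    commas = [i for i, s in enumerate(seps) if s == ","]
    counts = defaultdict(float)
    # Enumerate split/join choices for all commas in the line at once.
    for choice in product((True, False), repeat=len(commas)):
        split_at = dict(zip(commas, choice))
        weight, words = 1.0, [parts[0]]
        for i, s in enumerate(seps):
            if s == "." or split_at[i]:    # certain break, or comma as space
                if s == ",":
                    weight *= p_space
                words.append(parts[i + 1])
            else:                          # comma as no space: glue words
                weight *= 1.0 - p_space
                words[-1] += parts[i + 1]
        for a, b in zip(words, words[1:]):
            counts[a + "." + b] += weight
    return dict(counts)

# fractional_pair_counts("Che,y,ky.Cho.dal,dar") reproduces the table above,
# e.g. ky.Cho -> 0.50 and Che.y -> 0.25.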

All the best. --jorge
(21-09-2025, 08:44 PM)Jorge_Stolfi Wrote: Just to clarify: in your analysis you take the parags (only, excluding labels and titles) of each page, do some cleanup, compute the token frequency distribution in those parags for each page, and then compare the distributions of different pages to identify "topics", by various criteria. Is that so?

Well, not exactly. Let me try to explain.

A topic model tries to compress a corpus into a small set of recurring word patterns and to say, for each text unit, how much of each pattern is present. In my case the text unit is a “paragraph”: the MS paragraphs plus the lone lines outside them, and I have also merged labels and very short lines into their nearest text. My goal was to have short groups of words on which to compare topics. Then I turn each paragraph into a bag-of-words after cleaning (I remove single-grapheme tokens and a short empirical list of near-ubiquitous, non-discriminative forms; commas in the EVA runs I showed were treated as spaces, but that is a switch I can flip; I will also take a look at your script to see how I can handle this, as I am very aware of the commas issue that you have explained several times).
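
Roughly, that cleanup step amounts to something like this (the stop-form list here is a made-up placeholder, not the real empirically derived one):

Code:
STOP_FORMS = {"daiin", "ol", "aiin", "chedy"}  # placeholder, not the real list

def clean_paragraph(text, comma_is_space=True):
    """Turn one EVA paragraph into a cleaned bag-of-words token list."""
    if comma_is_space:
        text = text.replace(",", ".")      # treat uncertain spaces as spaces
    tokens = [t for t in text.split(".") if t]
    # Drop single-grapheme tokens and near-ubiquitous stop-forms.
    return [t for t in tokens if len(t) > 1 and t not in STOP_FORMS]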

The model learns two things: a set of word distributions (the topics themselves) and, for each paragraph, a vector of proportions telling how those topics are mixed. You choose K, the number of topics; too small and themes are merged, too large and they fragment.

Plain LDA or NMF will happily use topics to capture whatever explains variance, including orthography or scribal habits. That's why I saw topics aligning with hands and language, but they also seemed to align with sections. I needed a way to "clean" the scribal habits out of them. DMR changes the default in a simple way that matters here: the topic vocabularies are shared by everyone, but the prior over a paragraph’s topic proportions depends on its metadata (language A/B, writing hand, Currier hand) through a small regression. In practice, style doesn’t redefine the topics; it mainly nudges how much of each topic a paragraph tends to use. That’s why, after cleaning, DMR yields topics that line up with sections, while language and hands show up as differences in prevalence rather than different word lists.

Each paragraph is modeled as a mixture of K topics. In standard LDA, the prior over those mixtures is the same for every paragraph. In DMR, that prior depends on the paragraph’s metadata (language A/B, writing hand, Currier hand), while the topic word distributions themselves stay shared across the corpus. I chose K by checking which value made the inferred topics align best with the manuscript sections (using AMI and the heatmaps/timelines); on this data, K=6 worked best.
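
As a sketch of that selection loop, assuming tomotopy (one library that implements DMR) and scikit-learn's AMI; the paragraphs, metadata strings and section labels below are toy placeholders:

Code:
import numpy as np
import tomotopy as tp                      # provides a DMR implementation
from sklearn.metrics import adjusted_mutual_info_score

# Toy placeholders: token lists, one metadata string per paragraph
# (language + hand), and the editorial section labels to validate against.
paragraphs = [["qokeedy", "shedy", "okedy"], ["daiin", "chol", "shol"]] * 30
meta       = ["B_hand2", "A_hand1"] * 30
sections   = ["Biological", "Herbal"] * 30

best = None
for k in range(4, 11):                     # candidate topic counts
    mdl = tp.DMRModel(k=k, seed=42)
    for words, m in zip(paragraphs, meta):
        mdl.add_doc(words, metadata=m)     # covariates shape the prior only
    mdl.train(500)
    dominant = [int(np.argmax(d.get_topic_dist())) for d in mdl.docs]
    ami = adjusted_mutual_info_score(sections, dominant)
    if best is None or ami > best[1]:
        best = (k, ami)

print("best K by AMI:", best)              # K=6 won on the real data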
(21-09-2025, 09:28 PM)quimqu Wrote: Then I turn each paragraph into a bag-of-words

Ah, ok - so your unit is the parag, not the page.

(21-09-2025, 09:28 PM)quimqu Wrote: The model learns two things: a set of word distributions (the topics themselves) and, for each paragraph, a vector of proportions telling how those topics are mixed. You choose K, the number of topics; too small and themes are merged, too large and they fragment.

So you are trying several K-clustering algorithms on the same set of data points, where each point is a bag-of-words -- essentially an N-vector where N is the number of distinct word types.  Where the algorithms do not assign each parag to a single cluster (topic), but give it a "belonging" or "mixing" score for each cluster.  Is that correct?

Can we interpret those scores as Bayesian probabilities of the parag belonging to each topic?

All the best, --jorge