I ran two types of topic models, NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation), on the Voynich text. I did this twice: first using paragraphs as the unit of analysis, and then using the folios (front and back) as units.
Both models clearly show a peak at K = 2, meaning that the text naturally splits into two distinct writing styles or “languages”.
[attachment=11709]
These models don’t actually understand the language; they simply look at which words tend to appear together and group them into underlying patterns or topics that best explain the data.
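To make "which words tend to appear together" concrete, here is a minimal pure-Python sketch of the general idea, not the code behind the results above: it builds a word-count matrix from a few toy documents and factors it with a basic NMF (Lee-Seung multiplicative updates). The function names, the toy documents and the parameters `iters`, `seed` and `eps` are all assumptions made for illustration; a real run would more likely use a library such as scikit-learn, which provides both NMF and LDA.

```python
import random
from collections import Counter


def doc_term_matrix(docs):
    """Count how often each word occurs in each document (bag of words)."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for word, count in Counter(doc.split()).items():
            row[index[word]] = float(count)
        rows.append(row)
    return rows, vocab


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into W (docs x k) and H (k x words), all >= 0,
    using multiplicative updates that reduce the squared error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def error(v, w, h):
    """Squared reconstruction error between V and W*H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))


if __name__ == "__main__":
    # Toy documents (hypothetical, not real folio text).
    docs = ["daiin chol chor", "chol daiin daiin",
            "qokeedy shedy qokedy", "shedy qokeedy qokeedy"]
    v, vocab = doc_term_matrix(docs)
    w, h = nmf(v, k=2)
    for doc, weights in zip(docs, w):
        print(doc, "->", [round(x, 2) for x in weights])
```

Each row of `w` gives the topic weights of one document and each row of `h` gives the word weights of one topic. Nothing in the procedure knows what a word means; it only sees counts.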
When we compare the model’s results with the known Currier A and Currier B classifications, the match is very strong, but not perfect. Below is a confusion matrix showing how well the model’s inferred “languages” align with Currier’s.
[attachment=11710]
I’ve also included a list of the few folios where the classification doesn’t match.
[attachment=11711]
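For anyone who wants to reproduce this kind of comparison, here is a small sketch of how a confusion matrix and a mismatch list could be computed. One detail matters: topic indices are arbitrary (topic 0 is not automatically "A"), so the topics have to be aligned with Currier's labels first. The function names, folio names and label lists below are invented for illustration and are not the actual data; the sketch also assumes there are as many topics as Currier labels.

```python
from collections import Counter
from itertools import permutations


def confusion(currier, inferred):
    """Count (Currier label, inferred topic) pairs, one per folio."""
    return Counter(zip(currier, inferred))


def best_mapping(currier, inferred):
    """Try every topic -> label assignment and keep the one that agrees
    with Currier on the most folios. Returns (mapping, agreement rate)."""
    labels = sorted(set(currier))
    topics = sorted(set(inferred))
    counts = confusion(currier, inferred)
    best, best_hits = {}, -1
    for perm in permutations(labels, len(topics)):
        hits = sum(counts[(label, topic)] for topic, label in zip(topics, perm))
        if hits > best_hits:
            best, best_hits = dict(zip(topics, perm)), hits
    return best, best_hits / len(currier)


def mismatches(names, currier, inferred, mapping):
    """Names of the folios whose mapped topic differs from Currier's label."""
    return [n for n, c, t in zip(names, currier, inferred) if mapping[t] != c]


if __name__ == "__main__":
    # Hypothetical labels for six made-up folios.
    names = ["folio_1", "folio_2", "folio_3", "folio_4", "folio_5", "folio_6"]
    currier = ["A", "A", "A", "B", "B", "B"]
    inferred = [1, 1, 0, 0, 0, 0]
    mapping, agreement = best_mapping(currier, inferred)
    print(mapping, agreement)
    print(mismatches(names, currier, inferred, mapping))
```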
At the paragraph level, the number of mismatches increases slightly (which makes sense, since a single folio can mix both styles), but for now let’s focus on the folio-level agreement.
I also built a visualization tool that highlights which words contribute most to each topic or language. Each topic is shown in a different color, and the intensity of the color reflects how strongly that word belongs to that topic. Words that are very characteristic of one style appear in bright, saturated colors, while neutral words appear in lighter tones. This makes it possible to see the stylistic contrast within a single paragraph or folio at a glance.
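The colouring idea can be sketched in a few lines. This is only an illustration of the principle, not the tool itself: the RGB values, the `strength` formula (0 for a word shared equally between topics, 1 for a word exclusive to one topic) and the word weights are all assumptions.

```python
from html import escape

# Arbitrary RGB choices for illustration: blue for A, pink for B.
COLORS = {"A": (0, 90, 255), "B": (255, 60, 160)}


def word_style(weights):
    """weights: dict topic -> non-negative weight of one word.
    Returns (dominant topic, strength in [0, 1])."""
    total = sum(weights.values())
    if total == 0:
        return None, 0.0
    top = max(weights, key=weights.get)
    k = len(weights)
    if k < 2:
        return top, 1.0
    share = weights[top] / total
    # Rescale so an even split gives 0 and an exclusive word gives 1.
    return top, (share - 1 / k) / (1 - 1 / k)


def render(words, topic_word):
    """Wrap each word in an HTML span whose opacity encodes the strength."""
    spans = []
    for word in words:
        topic, strength = word_style(topic_word.get(word, {}))
        if topic is None:
            spans.append(escape(word))  # unknown word: no colour
            continue
        r, g, b = COLORS[topic]
        spans.append(
            f'<span style="background: rgba({r},{g},{b},{strength:.2f})">'
            f"{escape(word)}</span>"
        )
    return " ".join(spans)


if __name__ == "__main__":
    # Hypothetical word weights, not taken from the fitted models.
    topic_word = {
        "daiin": {"A": 3.0, "B": 1.0},
        "qokeedy": {"A": 0.0, "B": 2.0},
        "ol": {"A": 1.0, "B": 1.0},
    }
    print(render(["daiin", "ol", "qokeedy", "xyz"], topic_word))
```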
Here are a few examples. I think this way of visualizing the text makes the two “languages” of the Voynich much easier to grasp intuitively.
For example:
[attachment=11716]
You can see that many words are coloured in blue, meaning they carry weight in language A. The few words in pink (language B) have almost no weight, so the folio is classified as language A.
Let's check one of the folios where the detected language differs from the one Currier indicated:
[attachment=11713]
As you can see, although Currier labelled this folio as B, the model assigns it to A. The decision is not clear-cut, though: language B also carries some weight, even if language A comes out as the winner.
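One way to picture a "winner with some doubt" is to look at the margin between the two languages. The sketch below is a simplification and an assumption on my part: it scores a folio by summing per-word topic weights, whereas a fitted NMF or LDA model gives document-topic weights directly. The function names and the weights are hypothetical.

```python
def folio_scores(words, topic_word):
    """Sum the topic weights of all words in a folio and normalise to 1."""
    totals = {}
    for word in words:
        for topic, weight in topic_word.get(word, {}).items():
            totals[topic] = totals.get(topic, 0.0) + weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {topic: value / grand_total for topic, value in totals.items()}


def classify(words, topic_word):
    """Return (winning topic, margin over the runner-up).
    A small margin means the folio is a borderline case."""
    scores = folio_scores(words, topic_word)
    if not scores:
        return None, 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, top - runner_up


if __name__ == "__main__":
    # Hypothetical word weights, not taken from the fitted models.
    topic_word = {
        "daiin": {"A": 0.9, "B": 0.1},
        "chol": {"A": 0.8, "B": 0.2},
        "qokeedy": {"A": 0.0, "B": 1.0},
    }
    print(classify(["daiin", "chol", "qokeedy"], topic_word))
```

A folio like the one above would be reported as A, but with a modest margin, which is the kind of case worth checking by eye.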
Let's check a balneological folio (full of qokeedy):
[attachment=11714]
You can see it is full of pink words, indicating it is in language B, as Currier defined it.
I hope you find this interesting. If you have any questions or comments, feel free to ask.