The Voynich Ninja

Full Version: Why is there even a Voynich B?
The thing that makes PCA based cluster analysis (in particular) of sub-dialects tricky is that

* on the one hand we don't want to beg the question by bringing too much prior knowledge to bear in doing the analysis, while

* on the other hand, we *know* that the A-B split is the dominant source of overall variance, meaning that if you include Q13 (in particular) and the Herbal B folios (to a lesser extent) then PCA will be forced to pick a dominant eigenvector that captures as much of that variance as possible, forcing all subsequent eigenfeatures to be orthogonal to that axis -- which is not necessarily going to produce good axes to tease apart other substructure.

While I've gotten around to running PCA-based k-means clustering on the running text of all the pages in the Mss. to compare with Rene's clustering results, I haven't had a chance to run it excluding Q13 or Q13 + Herbal B pages. That's definitely in my queue, to see whether that increases the separation between the A-language "dialects".
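For concreteness, here is a minimal sketch of that kind of PCA + k-means pipeline (not the actual runs; the input file, its columns, and the Q13 page list are placeholder assumptions):

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical input: rows = pages, columns = word-type frequencies.
freqs = pd.read_csv("page_word_freqs.csv", index_col="page")

# Optionally drop Q13 (and/or Herbal B) pages before fitting, so that the A-B
# variance does not dominate the leading eigenvector.  Placeholder page list.
q13_pages = ["f75r", "f75v"]
freqs = freqs.drop(index=q13_pages, errors="ignore")

X = StandardScaler().fit_transform(freqs.values)
pcs = PCA(n_components=5).fit_transform(X)      # keep a few leading components
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)

for page, lab in zip(freqs.index, labels):
    print(page, lab)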
(10-10-2025, 09:26 PM)kckluge Wrote: The thing that makes PCA based cluster analysis (in particular) of sub-dialects tricky is 

As there is no prior mention of PCA cluster analysis in the post, I imagine you think I did cluster analysis, but that is not the case. My analysis is topic analysis, using Latent Dirichlet Allocation, which uses neither PCA nor k-means.

LDA is a probabilistic model used to discover hidden topics in a collection of documents. It is language agnostic and assumes that each document is composed of several topics (K), and each topic is a distribution over words. By analyzing the patterns of word co-occurrence across documents, LDA infers which topics are present and how strongly each document is associated with them. The model is unsupervised and typically estimated using Bayesian inference methods such as variational inference or Gibbs sampling. Its main goal is to uncover the underlying thematic structure of text corpora. In my case, the documents were the paragraphs.
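As a rough illustration of that setup (an assumed sketch, not the exact code; the example paragraphs are placeholders):

Code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder input: one string per paragraph of the transcription.
paragraphs = ["daiin chol chor cthy", "qokeedy qokain shedy chedy", "okaiin daiin sho"]

counts = CountVectorizer(token_pattern=r"\S+").fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)   # K = 3 topics
doc_topics = lda.fit_transform(counts)   # per-paragraph topic proportions (rows sum to ~1)
print(doc_topics.round(2))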
(10-10-2025, 09:37 PM)quimqu Wrote:
(10-10-2025, 09:26 PM)kckluge Wrote: The thing that makes PCA based cluster analysis (in particular) of sub-dialects tricky is 

As there is no prior mention of PCA cluster analysis in the post, I imagine you think I did cluster analysis, but that is not the case. My analysis is topic analysis, using Latent Dirichlet Allocation, which uses neither PCA nor k-means.

LDA is a probabilistic model used to discover hidden topics in a collection of documents. It is language agnostic and assumes that each document is composed of several topics (K), and each topic is a distribution over words. By analyzing the patterns of word co-occurrence across documents, LDA infers which topics are present and how strongly each document is associated with them. The model is unsupervised and typically estimated using Bayesian inference methods such as variational inference or Gibbs sampling. Its main goal is to uncover the underlying thematic structure of text corpora. In my case, the documents were the paragraphs.

Sorry for the confusion, I was thinking of Rene's PCA-ish cluster analysis @ You are not allowed to view links. Register or Login to view., which is what I (incorrectly) thought he had been referring to earlier in the thread.

As for LDA -- is it "language agnostic" in the sense that it doesn't require that the documents are in a common language, or only in the sense that it doesn't require knowing the (common) language? 

If it's only agnostic in the second sense, I have heartburn about assuming that the Voynich text is homogeneous in that way. The most common words in a given language are typically dominated by function words rather than content words, but if you compare the 10 most frequent words in the set of Herbal A pages vs. the set of Herbal B pages in the herbal quires at the front of the mss (apologies for the Currier): 

Rank:   1    2     3    4   5   6    7    8    9       10
HerbB:  8AM  SC89  OR   AR  AM  8AR  89   S89  4OFC89  ZC89
HerbA:  8AM  SOE   SOR  89  S9  2    ZOE  Q9   8AN     ZO

Given that those are all similarly formatted one-big-picture-of-a-plant pages, and given the minimal overlap (2 of 10) between those two sets of words, I have a lot of trouble buying that they reflect a common underlying language/cipher (key)/whatever.
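For reference, the comparison above can be redone with something as simple as the following (the file names and plain whitespace tokenization are placeholder assumptions; any transliteration alphabet will do):

Code:
from collections import Counter

def top_words(path, k=10):
    words = open(path, encoding="utf-8").read().split()
    return [w for w, _ in Counter(words).most_common(k)]

herb_a = top_words("herbal_A.txt")   # placeholder file names
herb_b = top_words("herbal_B.txt")
print("Herbal A:", herb_a)
print("Herbal B:", herb_b)
print("Overlap:", set(herb_a) & set(herb_b))   # 2 of 10 in the Currier counts above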
I can’t really tell if LDA is detecting topics, dialects, writing styles, or even distinct languages — but it definitely captures the major split between Herbal A and Herbal B, among other differences. Distinguishing whether that reflects stylistic, dialectal, or linguistic variation is much less straightforward.
(11-10-2025, 08:37 AM)quimqu Wrote: Distinguishing whether that reflects stylistic, dialectal, or linguistic variation is much less straightforward.

Perhaps one can answer that question by looking at frequencies of pairs of words in consecutive positions.  Preferably excluding the first and last words of each line, and the first line of each parag.

If the A/B difference is due to spelling or dialect difference, there should be a roughly 1-1 mapping W -> m(W) between word types.  Then it should be possible to match most of the most common word pairs in a way that is consistent with that mapping.

That is, if  "W1 W2" is a common pair in language A, then "m(W1) m(W2)" should be common in language B.  

Since daiin is the most common word in both A and B, we can start by guessing that m(daiin) = daiin.  Then we should look for the most common word pairs "daiin W" and "W daiin", and see whether we can guess m(W) for some of those words.
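A small sketch of that probe (the file names are placeholders, and the token lists are assumed to have line-edge words and first lines of paragraphs already removed):

Code:
from collections import Counter

# Placeholder inputs: one token list per "language".
tokens_a = open("herbal_A.txt", encoding="utf-8").read().split()
tokens_b = open("herbal_B.txt", encoding="utf-8").read().split()

def pair_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def neighbours(pairs, word, k=10):
    after  = Counter({b: n for (a, b), n in pairs.items() if a == word})
    before = Counter({a: n for (a, b), n in pairs.items() if b == word})
    return after.most_common(k), before.most_common(k)

# Seed guess m(daiin) = daiin; compare the most common "daiin W" and
# "W daiin" pairs in A and B to look for candidate matches m(W).
print(neighbours(pair_counts(tokens_a), "daiin"))
print(neighbours(pair_counts(tokens_b), "daiin"))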

However, it is important to test this method with two texts in the same known language, with the same subject but different dialects and spellings.  If it does not work on such a test example, then we should try to understand why...

You are not allowed to view links. Register or Login to view. are four texts that could be used for that purpose.  The files are in UTF-8 encoding. They are non-overlapping extracts from the same novel, with three different spelling systems and translated into a different but closely related language.  Parags are separated by blank lines.  More details are in You are not allowed to view links. Register or Login to view.. For better comparison with Voynichese, you may want to ignore all lines starting with "#", delete the string "\emph", map everything to lower case, map all letters with diacritics to ASCII letters without them, and replace all punctuation by spaces.  You may also want to delete all parags that are too short (mostly dialog lines).
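A sketch of those clean-up steps (the 5-word cutoff for short paragraphs is an arbitrary choice):

Code:
import re
import unicodedata

def clean_paragraphs(path, min_words=5):
    text = open(path, encoding="utf-8").read()
    text = "\n".join(l for l in text.splitlines() if not l.startswith("#"))  # drop "#" lines
    text = text.replace("\\emph", "").lower()
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))          # strip diacritics
    text = re.sub(r"[^\w\s]", " ", text)                                     # punctuation -> spaces
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text)]             # blank-line separated
    return [" ".join(p) for p in paragraphs if len(p) >= min_words]          # drop short parags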

If this method works with such control texts but not with the VMS Herbal A/B,  then either we are dealing with two different languages (or very different "dialects"); or A and B were copied from two distinct sources with very different  "formulas".   Compare for example a typical entry from You are not allowed to view links. Register or Login to view. with You are not allowed to view links. Register or Login to view.:
  • Herba Bortines. For a twisted mouth due to some ailment. Take the leaves of this herb, cook them with wine, and apply the poultice for thirty days; it will restore the mouth to its normal state, and it is proven. Also, when cooked with red wine in the form of a poultice and applied to a cold gout for fifteen days, it heals the gout, it is proven. It is gathered in May. It grows in wild mountains and cold places.
  • Sium aquaticum is a little shrub which is found in the water — upright, fat, with broad leaves similar to hipposelinum, yet somewhat smaller and aromatic — which is eaten (either boiled or raw) to break stones [kidney, bladder] and discharge them. Eaten they also induce the movement of urine, are abortifacient, expel the menstrual flow, and are good for dysentery. (Crateuas speaks of it thus: it is a herb like a shrub, little, with round leaves, bigger than black mint, similar to eruca). It is also called anagallis aquatica, schoenos aromatica, as well as a sort of juncus odoratus, darenion, or laver

Apart from the two entries being about different plants (one imaginary, one real) with different uses, note the very different syntactic structure and common phrases.  The first one has more imperative and future mood ("take the leaves", "it will restore", etc.) whereas the second prefers descriptive mood ("is eaten", "they also induce", etc.)   If such differences were to persist for many entries, they would probably result in "language" differences like VMS A and B.

All the best, --jorge
(11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: Perhaps one can answer that question by looking at frequencies of pairs of words in consecutive positions.  Preferably excluding the first and last words of each line, and the first line of each parag.

If the A/B difference is due to spelling or dialect difference, there should be a roughly 1-1 mapping W -> m(W) between word types.  Then it should be possible to match most of the most common word pairs in a way that is consistent with that mapping.

That is, if  "W1 W2" is a common pair in language A, then "m(W1) m(W2)" should be common in language B.  

Since daiin is the most common word in both A and B, we can start by guessing that m(daiin) = daiin.  Then we should look for the most common word pairs "daiin W" and "W daiin", and see whether we can guess m(W) for some of those words.

You seem to read my mind :-) It is exactly what I am doing today. Briefly, I detected the main "languages" (the topic modeling detects 3 different languages, roughly Currier A and B and a third one). Then I filter by those main topics (or languages, dialects, whatever you want to name them) and run new topic detections, but this time allowing bigrams. The model detects a total of 8 subtopics.

Then, once I have those subtopics, I check what happens to the paragraphs if I sort them by subtopic, without knowing their main topics. About 77% are labelled with the same subtopic and main topic as in the first round. The rest are subtopics that I consider cross-language topics.
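For what it is worth, a very condensed sketch of that kind of two-round check (a simplified reading, not the exact procedure: a single bigram-aware run over all paragraphs instead of one per main topic, and placeholder inputs):

Code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = ["daiin chol chor", "qokeedy qokain shedy", "okaiin daiin cthy"]  # placeholders

# Round 1: unigram LDA -> main "language" per paragraph.
main_counts = CountVectorizer(token_pattern=r"\S+").fit_transform(paragraphs)
main_lda = LatentDirichletAllocation(n_components=3, random_state=0)
main_topic = main_lda.fit_transform(main_counts).argmax(axis=1)

# Round 2: LDA with unigrams + bigrams -> subtopics.
sub_counts = CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)).fit_transform(paragraphs)
sub_lda = LatentDirichletAllocation(n_components=8, random_state=0)
sub_topic = sub_lda.fit_transform(sub_counts).argmax(axis=1)

# How often does a subtopic stay inside a single main topic?  Subtopics that
# straddle main topics would be the "cross-language" ones.
for s in np.unique(sub_topic):
    mains, counts = np.unique(main_topic[sub_topic == s], return_counts=True)
    print(f"subtopic {s}: main-topic purity {counts.max() / counts.sum():.0%}")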

I will post results in my thread about automatic topic detection.
(11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view. are four texts that could be used for that purpose.  The files are in UTF-8 encoding. They are non-overlapping extracts from the same novel, with three different spelling systems and translated into a different but closely related language.  Parags are separated by blank lines.  More details are in You are not allowed to view links. Register or Login to view.. For better comparison with Voynichese, you may want to ignore all lines starting with "#", delete the string "\emph", map everything to lower case, map all letters with diacritics to ASCII letters without them, and replace all punctuation by spaces.  You may also want to delete all parags that are too short (mostly dialog lines).

I will give it a try and let you know.
(11-10-2025, 04:28 PM)quimqu Wrote:
(11-10-2025, 04:13 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view. are four texts that could be used for that purpose.  The files are in UTF-8 encoding. They are non-overlapping extracts from the same novel, with three different spelling systems and translated into a different but closely related language.  Parags are separated by blank lines.  More details are in You are not allowed to view links. Register or Login to view.. For better comparison with Voynichese, you may want to ignore all lines starting with "#", delete the string "\emph", map everything to lower case, map all letters with diacritics to ASCII letters without them, and replace all punctuation by spaces.  You may also want to delete all parags that are too short (mostly dialog lines).



I will give it a try and let you know.

Hi Jorge,
I ran NMF topic modeling on the four Dom Casmurro files (DC1–DC4). Each file corresponds to a different version of the same novel:
  • DC1 is the original 1899 Portuguese spelling,
  • DC2 is the modern 1999 spelling,
  • DC3 is a phonetic respelling of Portuguese,
  • DC4 is the modern Spanish translation.
The NMF results show a very strong match between the learned topics and the file sections, with the best performance at K = 3 (AMI = 0.81, p ≈ 0). This means that the model clearly identifies three main linguistic clusters across the corpus. Given the setup, that outcome makes perfect sense: the two Portuguese spellings are grouped together, the phonetic respelling is separated out, and the Spanish text stands apart as its own distinct group.

As the number of topics increases, the AMI drops steadily, showing that three components are enough to capture the main differences between these texts. In other words, NMF is not detecting narrative themes but rather the orthographic and linguistic variation among the Portuguese and Spanish versions of Dom Casmurro. You can see here a comparison of two models, NMF and LDA. NMF clearly detects the three different languages:

[attachment=11652]

If we color each paragraph according to its assigned topic, we see the following (paragraphs ordered DC1, DC2, DC3 and DC4 from top to bottom):

[attachment=11653]

You can see that a small part of the paragraphs is given a wrong topic, but I assume this can happen due to the proximity of the three languages.

Finally, here you can see the top weighted words per topic that the model has found:

=== Main topic 0 ===
Dialect (main topic words):
nao, ci, os, la, um, di, mi, nau, con, un, el, do, com, si, en, em, ao, que, las, lhe, los, us, du, para, lhi, pra, olhos, in, as, pero, me, mas, mae, su, au, disse, os olhos, dos, menti, voce

=== Main topic 1 ===
Dialect (main topic words):
ci, di, nau, de, que, me, us, du, pra, lhi, la, in, nao, si, mais, se, no, au, menti, comu, os, um, eli, nu, el, por, para, min, mai, mas, pur, en, como, joze, padri, do, com, veis, ci nau, tamben

=== Main topic 2 ===
Dialect (main topic words):
la, el, en, los, las, no, de, lo, pero, su, una, del, de la, le, al, es, sus, en la, con, por, madre, fue, en el, de mi, yo, mi madre, que, ojos, de los, ci, solo, habia, ni, tenia, nao, un, mi, despues, sin, lo que


Note that I deleted the paragraphs with fewer than 5 words.
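For reference, a minimal sketch of this kind of NMF-versus-file-labels check (the file names, paragraph splitting, and 5-word cutoff are assumptions taken from the description above, not the exact script):

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics import adjusted_mutual_info_score

docs, labels = [], []
for name in ("DC1", "DC2", "DC3", "DC4"):            # placeholder file names
    for par in open(f"{name}.txt", encoding="utf-8").read().split("\n\n"):
        if len(par.split()) >= 5:                     # drop paragraphs with fewer than 5 words
            docs.append(par)
            labels.append(name)

vectorizer = TfidfVectorizer()
nmf = NMF(n_components=3, random_state=0)             # K = 3 components
topics = nmf.fit_transform(vectorizer.fit_transform(docs)).argmax(axis=1)
print("AMI:", adjusted_mutual_info_score(labels, topics))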
(12-10-2025, 11:17 AM)quimqu Wrote: Hi Jorge, I ran NMF topic modeling on the four Dom Casmurro files (DC1–DC4). ... The NMF results show a very strong match between the learned topics and the file sections, with the best performance at K = 3 (AMI = 0.81, p ≈ 0). This means that the model clearly identifies three main linguistic clusters across the corpus. Given the setup, that outcome makes perfect sense: the two Portuguese spellings are grouped together, the phonetic respelling is separated out, and the Spanish text stands apart as its own distinct group.

Very good!

Quote:You can see that a small part of the paragraphs is given a wrong topic, but I assume this can happen due to the proximity of the three languages.

This is something to keep in mind when analyzing the VMS "languages".  Does it mean that they are farther apart than Spanish and Portuguese?

Quote:Finally, here you can see the top weighted words per topic that the model has found:

=== Main topic 0 ===
Dialect (main topic words):
nao, ci, os, la, um, di, mi, nau, con, un, el, do, com, si, en, em, ao, que, las, lhe, los, us, du, para, lhi, pra, olhos, in, as, pero, me, mas, mae, su, au, disse, os olhos, dos, menti, voce

=== Main topic 1 ===
Dialect (main topic words):
ci, di, nau, de, que, me, us, du, pra, lhi, la, in, nao, si, mais, se, no, au, menti, comu, os, um, eli, nu, el, por, para, min, mai, mas, pur, en, como, joze, padri, do, com, veis, ci nau, tamben

=== Main topic 2 ===
Dialect (main topic words):
la, el, en, los, las, no, de, lo, pero, su, una, del, de la, le, al, es, sus, en la, con, por, madre, fue, en el, de mi, yo, mi madre, que, ojos, de los, ci, solo, habia, ni, tenia, nao, un, mi, despues, sin, lo que

This is bizarre.  The words "nao" ("não")  and "os" (for instance) should be strictly Portuguese in 1899 or 1999 spellings (DC1 and DC2), "nau" ("nãu") and "ci" strictly from the phonetic spelling (DC3), and "los" and "pero" strictly from Spanish (DC4).  But they appear across those three topics.  Although topic 2 seems to be mostly Spanish, while topics 0 and 1 are mostly a mix of official and phonetic Portuguese words.

The phonetic Portuguese spelling used in DC3 retains the word spaces of the original (which should be almost precisely the same in the 1899 and 1999 spellings), except for the adverbial suffix "-mente", which in the phonetic version has been turned into a separate word "mênti" (justified by the stress pattern).  It is possible (just possible) that VMS A and B differ in the splitting of some words, besides other things.

All the best, --jorge
(12-10-2025, 11:54 AM)Jorge_Stolfi Wrote: This is bizarre.  The words "nao" ("não")  and "os" (for instance) should be strictly Portuguese in 1899 or 1999 spellings (DC1 and DC2), "nau" ("nãu") and "ci" strictly from the phonetic spelling (DC3), and "los" and "pero" strictly from Spanish (DC4).  But they appear across those three topics.  Although topic 2 seems to be mostly Spanish, while topics 0 and 1 are mostly a mix of official and phonetic Portuguese words.

The reason some words appear in the wrong topic is that NMF doesn’t know which section each word comes from. It only looks at word frequencies and how words tend to occur together across the whole dataset. Because of that, topics are not exclusive. The model can assign a word to more than one topic if it helps to minimize the overall reconstruction error.
Frequent or function words, like “os”, “la”, or “que”, behave very similarly across all texts, so NMF spreads them over several topics. Also, because the model uses only three components for four writing systems, some words that are similar in spelling or usage (like “nao” and “nau”) end up mixed between Portuguese and phonetic topics.
NMF doesn't make linguistic judgments. It just finds statistical patterns, and some overlap between topics is normal (it is language agnostic).
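One way to see that spreading directly is to look at the word-topic weight matrix itself; a small sketch, assuming "nmf" and "vectorizer" are the fitted NMF model and TfidfVectorizer from a run like the one sketched earlier in the thread:

Code:
def topic_shares(word, nmf, vectorizer):
    # Rows of components_ are topics, columns are vocabulary entries;
    # normalize one word's column to get its share per topic.
    vocab = vectorizer.vocabulary_
    if word not in vocab:
        return None
    col = nmf.components_[:, vocab[word]]
    return (col / col.sum()).round(2)

for w in ("nao", "nau", "los", "que"):
    print(w, topic_shares(w, nmf, vectorizer))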