Why is there even a Voynich B?

RE: Why is there even a Voynich B?

Jorge_Stolfi > 12-10-2025, 01:44 PM

(12-10-2025, 12:04 PM)quimqu Wrote: Frequent or function words, like “os”, “la”, or “que”, behave very similarly across all texts, so NMF spreads them over several topics.

Not sure I understand this point. But the words that I listed in my previous post occur exclusively in only one of the three "languages" (DC1+DC2, DC3, and DC4).

I suspect that the problem is that NMF had a large rate of mis-classification, so that (say) 10-20% of the DC4 parags were classified as topic 0 or 1, instead of topic 2. Is this the case?

Could you please list a couple of parags from DC3 that were so mis-classified? Maybe they are neither Portuguese nor Spanish, but just "Iberian"...

All the best, --jorge
RE: Why is there even a Voynich B?

quimqu > 12-10-2025, 07:04 PM

(12-10-2025, 01:44 PM)Jorge_Stolfi Wrote: Not sure I understand this point. But the words that I listed in my previous post occur exclusively in only one of the three "languages" (DC1+DC2, DC3, and DC4).

I suspect that the problem is that NMF had a large rate of mis-classification, so that (say) 10-20% of the DC4 parags were classified as topic 0 or 1, instead of topic 2.  Is this the case?

Could you please list a couple of parags from DC3 that were so mis-classified? Maybe they are neither Portuguese nor Spanish, but just "Iberian"...

All the best, --jorge

Those words only occur in one of the three language variants, that’s correct. The issue is that NMF doesn’t know that. It doesn’t look at which file a word comes from; it only sees a big table of word counts. When it decomposes that table, each topic is a weighted combination of all the words, and every word gets a weight (possibly a tiny one) in every topic, even if that word only appears in one language.

So “nao” might only appear in DC1–DC2, but because NMF works with continuous values, it can still give “nao” a small non-zero weight in another topic if that helps to approximate the overall structure of the data. This doesn’t mean the model thinks “nao” appears elsewhere; it just reflects how the factorization distributes weight across components to approximate the count table as closely as it can.

NMF doesn’t make hard assignments like “this word belongs only to topic 1.” It produces soft, overlapping topics, so a word that is exclusive in reality can still have small weights in other topics for mathematical reasons.
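To make the "big table of word counts" point concrete, here is a minimal sketch of NMF with the standard multiplicative updates, written in plain Python. It is not the code used for the results in this thread; the vocabulary, the counts in `V`, the number of topics and the helper names are all made up for illustration. In the toy table, "nao" occurs only in the first three rows and "no" only in the last three, yet the algorithm never sees that grouping: it only sees the numbers.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Sum of squared differences between V and its reconstruction W @ H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor V (paragraphs x words) as W @ H with non-negative entries.

    W is paragraphs x topics, H is topics x words. Uses the classic
    multiplicative updates for the squared-error objective.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element by element
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element by element
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

# Hypothetical toy counts: rows are paragraphs, columns follow `vocab`.
vocab = ["nao", "que", "de", "no"]
V = [
    [3, 2, 1, 0],
    [2, 1, 2, 0],
    [4, 2, 2, 0],
    [0, 2, 1, 3],
    [0, 1, 2, 2],
    [0, 2, 2, 4],
]

W, H = nmf(V, k=2)
for t, row in enumerate(H):
    # Each topic is a weight vector over the whole vocabulary.
    print("topic", t, {w: round(x, 3) for w, x in zip(vocab, row)})
for i, row in enumerate(W):
    # A "hard" label is only obtained afterwards, by taking the largest weight.
    print("paragraph", i, "dominant topic:", row.index(max(row)))
```

Printing `H` shows one weight per word per topic; whether an exclusive word ends up with exactly zero or a small positive weight in the "other" topic depends on the data and the initialization.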

In fact, that's why I find it so interesting for the Voynich topic/language analysis: we don't know what the text means (if it means anything), but we can still detect different constructions in it.

First, let me attach the same topic assignment plot from my previous post, this time laid out horizontally, so you can see that not many paragraphs are wrong. There are only 4 Portuguese paragraphs that are labelled as Spanish and one Spanish paragraph that is labelled as Portuguese. All the phonetic Portuguese paragraphs were labelled correctly.
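For anyone who wants to reproduce this kind of count, here is a rough sketch of how it could be done once each paragraph has a dominant topic. The labels and topic numbers below are invented toy values (they are not the data behind the plot), and the function names are hypothetical. Each topic is first mapped to the language that is most common among its paragraphs, and then the mismatches are counted.

```python
from collections import Counter

def topic_to_language(true_labels, topics):
    """Map each topic index to the most common true label among its paragraphs."""
    by_topic = {}
    for lang, topic in zip(true_labels, topics):
        by_topic.setdefault(topic, Counter())[lang] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in by_topic.items()}

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Toy example, NOT the real corpus: PT = Portuguese, ES = Spanish,
# PH = phonetic Portuguese; `topics` is the dominant topic per paragraph.
true_labels = ["PT", "PT", "PT", "PT", "ES", "ES", "ES", "PH", "PH"]
topics = [0, 0, 0, 1, 1, 1, 0, 2, 2]

mapping = topic_to_language(true_labels, topics)
predicted = [mapping[t] for t in topics]
wrong = [i for i, (t, p) in enumerate(zip(true_labels, predicted)) if t != p]

print(mapping)                                   # topic index -> language
print(confusion_counts(true_labels, predicted))  # (true, predicted) -> count
print("mis-labelled paragraph indices:", wrong)
```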


Here are some lines that were assigned to the wrong language. I have marked the most heavily weighted words with colors for the topics in the paragraph. You will see words in up to three colors (a tie), in two colors, in a single color, or with no color at all (meaning the model does not take them into account).

If we go to the first sentence in Portuguese that is labelled as Spanish, we have:

Most words carry weight in several topics but, specifically, "cosme", "tio", "tio cosme" and "como" make the Spanish topic weigh more.

The second one is:

In this case it is almost a tie, but the words weigh a bit more in the Spanish topic.

If we go to the Spanish paragraph labelled as Portuguese:

In this case, as in the previous one, the words weigh more for the Portuguese topic.
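A simplified way to picture how a few words can tip a short paragraph is to add up the per-topic weights of its words. This is only a sketch: the weights below are invented, the topic order (0 = Portuguese, 1 = Spanish, 2 = phonetic Portuguese) is an assumption, and a real NMF fits the paragraph weights as part of the factorization rather than by a plain sum.

```python
# Hypothetical word -> [weight in topic 0, topic 1, topic 2].
weights = {
    "tio":   [0.25, 0.5, 0.0],
    "cosme": [0.125, 0.5, 0.0],
    "como":  [0.25, 0.5, 0.125],
    "nao":   [0.75, 0.0, 0.0],
}

def topic_scores(tokens, weights, n_topics=3):
    """Sum the topic weights of every token; unknown words contribute nothing."""
    scores = [0.0] * n_topics
    for tok in tokens:
        for t, w in enumerate(weights.get(tok, [0.0] * n_topics)):
            scores[t] += w
    return scores

def colored_topics(word, weights, threshold=0.1):
    """Topics in which a word has notable weight (one 'color' per topic)."""
    return [t for t, w in enumerate(weights.get(word, [])) if w >= threshold]

tokens = "o tio cosme como nao".split()  # made-up token list
scores = topic_scores(tokens, weights)
print(scores)                            # per-topic totals
print("dominant topic:", scores.index(max(scores)))
for tok in tokens:
    print(tok, colored_topics(tok, weights))
```

With these made-up numbers the Spanish topic wins by a narrow margin even though "nao" pulls strongly toward Portuguese, which is the kind of near-tie seen in short paragraphs.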

In general, you can see that NMF has found the languages almost perfectly. I think you can now better understand my results about the Voynich. If you have suggestions, doubts, or anything else that helps us advance on this topic, please feel free to ask.
RE: Why is there even a Voynich B?

Jorge_Stolfi > 12-10-2025, 08:44 PM

Thanks for the detailed reply!

So, what can we conclude by comparing these results to your previous one of the VMS? Does NMF find the difference between Voynichese A and B greater or smaller than that between Spanish and Portuguese? Or that between official and phonetic Portuguese? I mean, in terms of number of apparently mis-classified pages?

Those ambiguous parags that you listed are barely above the 5 word cut-off. I suppose that the NMF is more accurate when it works with bigger chunks of text. You analyzed the VMS at both the page and parag level, right? But the VMS parags are still rather large. How would the NMF perform if you deleted from the PT+ES dataset all parags with fewer than (say) 30 words? Would it get a perfect score?
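The check suggested here could be sketched roughly as follows. The records, field names and function name are hypothetical, and the toy paragraphs are placeholders that say nothing about what the real PT+ES data would show; in a real run the topic model would presumably also be refitted on the filtered set rather than only rescored.

```python
def accuracy_by_min_length(paragraphs, min_words):
    """Keep paragraphs with at least `min_words` words and score the labels.

    Returns (accuracy, number_kept); accuracy is None if nothing is kept.
    """
    kept = [p for p in paragraphs if len(p["text"].split()) >= min_words]
    if not kept:
        return None, 0
    correct = sum(p["true"] == p["pred"] for p in kept)
    return correct / len(kept), len(kept)

# Placeholder paragraphs: "w " repeated n times stands for n words.
paragraphs = [
    {"text": "w " * 6,  "true": "PT", "pred": "ES"},
    {"text": "w " * 7,  "true": "ES", "pred": "ES"},
    {"text": "w " * 35, "true": "PT", "pred": "PT"},
    {"text": "w " * 40, "true": "ES", "pred": "ES"},
]

for cutoff in (5, 30):
    acc, n = accuracy_by_min_length(paragraphs, cutoff)
    print("cut-off", cutoff, "paragraphs kept", n, "accuracy", acc)
```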

All the best, --jorge
RE: Why is there even a Voynich B?

quimqu > 12-10-2025, 09:48 PM

I will post news about the topics this week (hopefully tomorrow) in my thread. The topic models detect 3 different languages/dialects. I have been working also with subtopics of those topics (but this is a bit tricky).