(12-10-2025, 01:44 PM)Jorge_Stolfi Wrote: Not sure I understand this point. But the words that I listed in my previous post occur exclusively in only one of the three "languages" (DC1+DC2, DC3, and DC4).
I suspect that the problem is that NMF had a large rate of mis-classification, so that (say) 10-20% of the DC4 parags were classified as topic 0 or 1, instead of topic 2. Is this the case?
Could you please list a couple of parags from DC3 that were so mis-classified? Maybe they are neither Portuguese nor Spanish, but just "Iberian"...
All the best, --jorge
Those words only occur in one of the three language variants, that's correct. The issue is that NMF doesn't know that. It doesn't look at which file a word comes from; it only sees a big table of word counts. When it decomposes that table, each topic is a weighted combination of all the words, and any word can receive some weight in any topic, even if that word only appears in one language.
So "nao" might only appear in DC1–DC2, but because NMF works with continuous values, it can still give "nao" a small non-zero weight in another topic if that helps to approximate the overall structure of the data. This doesn't mean the model thinks "nao" appears elsewhere. It only reflects how the factorization distributes the approximation of the count table across components.
NMF doesn’t make hard assignments like “this word belongs only to topic 1.” It produces soft, overlapping topics, so a word that is exclusive in reality can still have small weights in other topics for mathematical reasons.
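To make the point above concrete, here is a minimal sketch of NMF on a tiny word-count table. It is a toy pure-Python implementation using multiplicative updates, not the code or library behind the results in this thread. The vocabulary, the counts, and the helper names (`matmul`, `frob_error`, `nmf`) are all made up for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frob_error(V, W, H):
    """Squared Frobenius distance between V and the product W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Toy NMF: V (paragraphs x words) ~ W (paragraphs x k) * H (k x words)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start; multiplicative updates keep entries >= 0.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update topic-word weights: H <- H * (W^T V) / (W^T W H)
        WH = matmul(W, H)
        for a in range(k):
            for j in range(m):
                num = sum(W[i][a] * V[i][j] for i in range(n))
                den = sum(W[i][a] * WH[i][j] for i in range(n)) + eps
                H[a][j] *= num / den
        # Update paragraph-topic weights: W <- W * (V H^T) / (W H H^T)
        WH = matmul(W, H)
        for i in range(n):
            for a in range(k):
                num = sum(V[i][j] * H[a][j] for j in range(m))
                den = sum(WH[i][j] * H[a][j] for j in range(m)) + eps
                W[i][a] *= num / den
    return W, H

# Hypothetical counts: rows are paragraphs, columns are words.
vocab = ["nao", "muito", "como", "no", "mucho"]
V = [
    [3, 2, 1, 0, 0],  # Portuguese-like paragraph
    [2, 3, 1, 0, 0],  # Portuguese-like paragraph
    [0, 0, 1, 3, 2],  # Spanish-like paragraph
    [0, 0, 2, 2, 3],  # Spanish-like paragraph
]

W, H = nmf(V, k=2)

# Each word gets one continuous weight per topic; none is a hard yes/no label.
for word, weights in zip(vocab, zip(*H)):
    print(word, [round(w, 4) for w in weights])
```

The model never sees which language a row came from. It only sees the counts, and each word ends up with one continuous weight per topic rather than a single hard label.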
In fact, that's why I find it so interesting for the Voynich topic/language analysis: we don't know what the text means (if it means anything), but we can still detect different constructions in it.
First, let me attach the same topic assignment plot from my previous post, this time laid out horizontally, so you can see that not many paragraphs are actually wrong. Only 4 Portuguese paragraphs are labelled as Spanish, and one Spanish paragraph is labelled as Portuguese. All the phonetic Portuguese paragraphs were labelled correctly.
Here are some lines that were assigned to the wrong language. I have colored the most heavily weighted words according to the topics in which they carry weight. You will see words in up to three colors (a tie), in two colors, in a single color, or in no color at all (meaning the model does not take them into account).
If we go to the first sentence in Portuguese that is labelled as Spanish, we have:
Most words carry weight in several topics, but specifically "cosme", "tio", "tio cosme" and "como" make the Spanish topic weigh more.
The second one is:
In this case it is almost a tie, but the words weigh slightly more in the Spanish topic.
If we go to the Spanish paragraph labelled as Portuguese:
In this case, as in the previous one, the words weigh more in the Portuguese topic.
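As a rough illustration of how a handful of words can tip a paragraph toward the wrong topic, the sketch below sums per-topic word weights over one paragraph and lists which topics would "color" each word. This is a simplified proxy, not how NMF actually fits paragraph weights. The weights are invented numbers (not the model's real values), and `topic_scores`, `word_colors`, and the `threshold` parameter are hypothetical names.

```python
def topic_scores(tokens, topic_word_weights):
    """Sum, per topic, the weights of the tokens in one paragraph."""
    scores = {topic: 0.0 for topic in topic_word_weights}
    for tok in tokens:
        for topic, weights in topic_word_weights.items():
            scores[topic] += weights.get(tok, 0.0)
    return scores

def word_colors(tokens, topic_word_weights, threshold=0.1):
    """For each distinct token, list the topics whose weight exceeds threshold."""
    return {
        tok: [t for t, w in topic_word_weights.items() if w.get(tok, 0.0) > threshold]
        for tok in dict.fromkeys(tokens)  # keeps first-seen order, drops repeats
    }

# Invented weights, for illustration only.
weights = {
    "portuguese": {"tio": 0.2, "como": 0.3, "que": 0.4},
    "spanish": {"tio": 0.5, "cosme": 0.6, "como": 0.4, "que": 0.4},
}
tokens = ["tio", "cosme", "como", "que", "xyz"]

scores = topic_scores(tokens, weights)
colors = word_colors(tokens, weights)
print(scores)
print(colors)
print("assigned topic:", max(scores, key=scores.get))
```

With these made-up numbers, "cosme" carries weight only in the Spanish topic and "tio" and "como" lean Spanish, so the paragraph as a whole is assigned to Spanish even though most of its words carry weight in both topics. A word with no weight above the threshold, like "xyz" here, gets no color.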
In general, you can see that NMF has separated the languages almost perfectly. I think you can now better understand the results I posted about the Voynich. If you have suggestions, questions, or anything else that helps us move this topic forward, please feel free to ask.