21-10-2025, 07:44 AM
(20-10-2025, 10:24 AM)quimqu Wrote:
(20-10-2025, 06:29 AM)MarcoP Wrote: EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are more noisy and unreliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.
Even though I agree in general terms, I don’t fully agree regarding topic modelling. Topic modelling is intended to be applied at the sentence level. The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long. I suppose (though I can’t confirm it) that the longer paragraphs should contain internal sentences, but since there is no punctuation, we can’t recognize them yet.
Hi quimqu, I am afraid that what can be said of topic modelling in general cannot always be said of its application to Voynichese.
For instance, Blei et al. mention a data set that is orders of magnitude larger than the Voynich corpus:
Quote: Our data are 16,000 documents from a subset of the TREC AP corpus (Harman, 1992). After removing a standard list of stop words, we used the EM algorithm described in Section 5.3 to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model.
I am sure that, with a large enough corpus written in a consistent language, single words like “ball” and “knife” can distinguish between “sports” and “crimes”. But Voynich sections are so tiny and the language is so volatile that we cannot even guess whether Voynich words encode language words, let alone which words (if any) are stop words, i.e. function words. So we are unable to remove stop words, and instead of “ball” and “knife” we might be classifying based on “these” and “which” (in the very optimistic scenario that single Voynich words do correspond to single plain-text words).
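As a toy illustration of the stop-word problem, here is a minimal Python sketch. It assumes two made-up English sentences (not Voynich data) and a hypothetical helper, `top_distinctive`, that ranks words by the difference in their relative frequency between two texts. It is only meant to show how function words can dominate a comparison when they cannot be filtered out, not how any real topic model works.

```python
from collections import Counter

def relative_freqs(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def top_distinctive(text_a, text_b, stop_words=frozenset(), k=2):
    """Return the k words whose relative frequency differs most between
    the two texts (ties broken alphabetically). Hypothetical helper."""
    a = [w for w in text_a.split() if w not in stop_words]
    b = [w for w in text_b.split() if w not in stop_words]
    fa, fb = relative_freqs(a), relative_freqs(b)
    vocab = set(fa) | set(fb)
    score = {w: abs(fa.get(w, 0.0) - fb.get(w, 0.0)) for w in vocab}
    return sorted(vocab, key=lambda w: (-score[w], w))[:k]

# Invented example sentences, chosen only for illustration.
SPORTS = "the player kicked the ball and the ball hit the post"
CRIME = "a thief took a knife and a guard saw a knife"

# Without a stop list, function words separate the two texts best.
print(top_distinctive(SPORTS, CRIME))  # ['a', 'the']

# With a known stop list, content words come to the top.
print(top_distinctive(SPORTS, CRIME, stop_words={"the", "a", "and"}))  # ['ball', 'knife']
```

For Voynichese the second call is exactly the one we cannot make, because we have no way to build the stop list.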
But, more importantly, we don’t know if the differences we see are due to different topics, different scribes, different languages or dialects, different cipher mechanisms such as different usage of nulls, different spelling, different preferences in spontaneous gibberish, etc. We might even be classifying based on “potato” and “potahto”.
Personally, I believe that the behavior of the ‘ed’ bigram shown in [link] cannot be explained by different topics, so I feel sure that different topics can at most be a minor component of the Voynich language shift. Scribes alone also cannot be “the” reason, since scribe 1 uses two different languages in his Herbal and Pharma pages (see the ‘eo’ bigram); as Jorge says, it’s likely a mixture of factors. If the text is meaningful, we can be sure that topics will play some role in its composition (that’s basically what “meaningful” means), but such differences appear to be obscured by other phenomena that we cannot understand.
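For anyone who wants to check bigram behavior on their own transliteration, a minimal Python sketch of the kind of measurement involved could look like this. The function name `bigram_share` is hypothetical, and the word lists are hand-picked EVA-style placeholders, not counts taken from any page or transliteration file.

```python
def bigram_share(words, bigram):
    """Fraction of word tokens that contain `bigram` at least once."""
    if not words:
        return 0.0
    return sum(bigram in w for w in words) / len(words)

# Placeholder token lists (assumption: EVA-style spellings chosen by hand).
section_x = ["chol", "chor", "cheol", "daiin"]
section_y = ["chedy", "shedy", "qokeedy", "daiin"]

for name, words in (("x", section_x), ("y", section_y)):
    print(name, bigram_share(words, "ed"), bigram_share(words, "eo"))
```

Running the same function per section, per scribe, and per quire on real data would show whether a bigram tracks the supposed topic or some other variable.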