OK, to go deeper into the automatic analysis of the code, I have compared the outputs of LDA, BERTopic and NMF. All three are different models for finding topics in texts (LDA: probabilistic model of word co-occurrence; NMF: matrix factorization of TF-IDF; BERTopic: clustering of semantic embeddings).
I have capped the maximum number of topics at 10 for better visualization. Here are the results:
LDA
BERTopic
NMF
All three models find a similar topic distribution. Of course topic modelling is partly based on words and how often they appear, but the three models go beyond simple word counting:
LDA (Latent Dirichlet Allocation): LDA looks for bundles of word-forms that tend to appear together in the same passages. A topic is not simply the most frequent words, but a cluster of co-occurring tokens. For example, if certain Voynich forms regularly show up in the same folios, LDA groups them into the same hidden theme.
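As an illustration, a minimal sketch of this kind of LDA run (not the exact script behind the figures above) could look like the following, assuming a hypothetical documents list with one whitespace-separated string of Voynich word-forms per folio or passage:

# Minimal LDA sketch. "documents" is assumed: one string of
# transliterated word-forms per folio/passage, separated by spaces.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(token_pattern=r"\S+")   # keep Voynich forms intact, no language-specific tokenization
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)               # folio-by-topic proportions

# Top word-forms per topic: the co-occurring clusters described above
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {k}: {' '.join(top)}")

Each row of doc_topics is the topic mixture of one folio, which is the kind of distribution summarized in the plots above.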
NMF (Non-Negative Matrix Factorization): NMF does not just count either. It highlights which word-forms are distinctive for a passage compared to the rest of the manuscript. Very frequent tokens are down-weighted, while more characteristic forms stand out. Each topic is then built from these distinctive distributions, showing what makes certain sections look different from others.
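The matching NMF sketch differs mainly in the weighting step: TF-IDF instead of raw counts, which is exactly where the down-weighting of very frequent word-forms comes from. Again a sketch, using the same hypothetical documents list:

# Minimal NMF sketch on TF-IDF weights rather than raw counts
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(token_pattern=r"\S+")
X = tfidf.fit_transform(documents)                   # same assumed documents list as above

nmf = NMF(n_components=10, init="nndsvd", random_state=0)
doc_topics = nmf.fit_transform(X)                    # folio-by-topic loadings

terms = tfidf.get_feature_names_out()
for k, comp in enumerate(nmf.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:10]]
    print(f"Topic {k}: {' '.join(top)}")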
BERTopic: In natural languages, BERTopic uses language models to capture semantic meaning. With Voynich, we do not know meanings, so it cannot recover semantics. What it can do is use its embeddings to detect structural or distributional similarities across passages. Two folios might be grouped together even if they do not share identical tokens, because their patterns of word-forms and symbol sequences resemble each other. In this way, BERTopic functions more as a pattern recognizer than a semantic model for Voynich.
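For completeness, a sketch with the bertopic package, capped at 10 topics like the other two runs. The embedding model here is simply the library default (a sentence-transformers model); applied to Voynich word-forms it can only pick up distributional and character-level similarity, not meaning:

# Minimal BERTopic sketch; nr_topics caps the final number of topics at 10
from bertopic import BERTopic

topic_model = BERTopic(nr_topics=10)
topics, probs = topic_model.fit_transform(documents)  # same assumed documents list

print(topic_model.get_topic_info())                   # one row per topic
print(topic_model.get_topic(0))                        # top word-forms for topic 0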