03-09-2025, 12:03 PM
OK, to dig deeper into the automatic text analysis, I have compared the outputs of LDA, BERTopic and NMF. All three are different models for finding topics in texts (LDA: probabilistic model of word co-occurrence; NMF: matrix factorization of TF-IDF; BERTopic: clustering of semantic embeddings).
I have capped the maximum number of topics at 10 for better visualization. Here are the results:
LDA
![[Image: axcJsah.gif]](https://i.imgur.com/axcJsah.gif)
BERTopic
![[Image: 0G5ZKmt.gif]](https://i.imgur.com/0G5ZKmt.gif)
NMF
![[Image: FbDNZ0Z.gif]](https://i.imgur.com/FbDNZ0Z.gif)
All three models find a similar topic distribution. Topic modelling is, of course, partially based on words and how often they appear, but all three models go beyond simple word counting:
LDA (Latent Dirichlet Allocation): LDA looks for bundles of word-forms that tend to appear together in the same passages. A topic is not simply the most frequent words, but a cluster of co-occurring tokens. For example, if certain Voynich word-forms regularly show up on the same folios, LDA groups them into the same hidden theme.
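Roughly, the LDA step looks like this (a minimal sketch with scikit-learn, assuming a hypothetical `folio_texts` list holding one string of EVA word-forms per folio; the exact preprocessing in my notebook may differ):

```python
# Minimal LDA sketch: folios as documents, EVA word-forms as tokens.
# `folio_texts` is a hypothetical list of strings, one per folio.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(token_pattern=r"\S+")  # keep EVA word-forms intact
counts = vectorizer.fit_transform(folio_texts)      # folio x word-form count matrix

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(counts)              # folio x topic proportions

# A topic is a cluster of co-occurring tokens, not just the most frequent words:
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[-8:][::-1]]
    print(f"Topic {k}: {' '.join(top)}")
```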
NMF (Non-Negative Matrix Factorization): NMF does not just count either. It highlights which word-forms are distinctive for a passage compared to the rest of the manuscript. Very frequent tokens are down-weighted, while more characteristic forms stand out. Each topic is then built from these distinctive distributions, showing what makes certain sections look different from others.
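The NMF variant only changes the weighting: TF-IDF instead of raw counts, so ubiquitous tokens are down-weighted and distinctive forms drive the topics (again a sketch under the same hypothetical `folio_texts`):

```python
# NMF sketch: factorize the TF-IDF matrix into folio-topic and topic-token parts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(token_pattern=r"\S+")  # TF-IDF down-weights ubiquitous forms
weights = tfidf.fit_transform(folio_texts)

nmf = NMF(n_components=10, random_state=42)
folio_topics = nmf.fit_transform(weights)      # folio x topic weights
topic_tokens = nmf.components_                 # topic x word-form weights

vocab = tfidf.get_feature_names_out()
for k, row in enumerate(topic_tokens):
    top = [vocab[i] for i in row.argsort()[-8:][::-1]]
    print(f"Topic {k}: {' '.join(top)}")
```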
BERTopic: For natural languages, BERTopic uses language models to capture semantic meaning. With the Voynich text, we do not know the meanings, so it cannot recover semantics. What it can do is use its embeddings to detect structural or distributional similarities across passages. Two folios might be grouped together even if they do not share identical tokens, because their patterns of word-forms and symbol sequences resemble each other. In this way, BERTopic functions more as a pattern recognizer than as a semantic model for the Voynich.
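The BERTopic run is the shortest to set up, since the library bundles embedding, dimensionality reduction and clustering (a sketch using the library defaults; it assumes the default sentence-transformer embedding model, which of course carries no Voynich semantics, only distributional pattern similarity):

```python
# BERTopic sketch: embed the folios, cluster them, and cap the result at 10 topics.
from bertopic import BERTopic

topic_model = BERTopic(nr_topics=10)              # merge clusters down to 10 topics
topics, probs = topic_model.fit_transform(folio_texts)

print(topic_model.get_topic_info())               # one row per discovered topic
```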
03-09-2025, 04:21 PM
(03-09-2025, 03:22 AM)synapsomorphy Wrote: This is extremely cool! Is your code available? I'd love to see how you're doing it.
Hello,
you can find my public Notebook [link]. Note that it is prepared for Kaggle, and I have included instructions about the dataset to be used.
03-09-2025, 10:04 PM
Wow quimqu, this is fascinating work. If you publish this formally, I would also include not only your analysis of Torsten Timm’s auto-generated VMS substitute, but also a control group using meaningful text of a comparable length. A novel involving 3+ story arcs which alternate by chapter might be a good choice. A stream-of-consciousness novel like William Faulkner’s The Sound and the Fury would be an interesting specimen. I’m sure your bots could easily distinguish which story arc, or which period in time in the main character’s mind, a page or paragraph belonged to.
This is a similar technology to what companies like TurnItIn.com use for detecting plagiarism in academic essays, no? I remember reading that bots like these are vindicating long-held suspicions that parts of ancient works are later insertions by other authors, and confirming just how common pseudo-attribution was in the olden days. Simply put, textual analysis is getting good at determining, with a high degree of confidence, which pieces of writing were, and were not, composed by the same author. This has some privacy-eroding implications that scare me a bit. I can use a VPN and the Tor browser every time I blog about something seditious or controversial, but that'll be little help to me in a world where the prosecution can call an expert witness like you with a bot like yours to testify, and the bot says the problematic writings have a 95% confidence match with things written in my name.
I digress. Textual analysis and computer language modeling are not my area of expertise. But as a word and language nerd, they fascinate me to no end.
quimqu, have you looked into any of the efforts to restore the original order of the VMS folios? Wladimir would be your man to talk to about this, along with the great Koen G, of course. It's well-established that the VMS has been taken apart and rebound at least once, by people who clearly could not read it. For example, there are at least two bifolios of Herbal B misbound into the first 7 quires, which are otherwise all Herbal A. I'd be interested to see your experiment rerun, with the extant folios in the best estimate available for their original order.
04-09-2025, 03:36 PM
(03-09-2025, 10:04 PM)RenegadeHealer Wrote: quimqu, have you looked into any of the efforts to restore the original order of the VMS folios?
Thank you so much for your encouraging words! Yes, I do plan to publish the study once it is more complete; I still have quite a few threads to pull on. The idea of experimenting with the folio reordering is really intriguing, and I'll definitely look into how I could approach it. Thanks a lot for the suggestion!