Automated Topic Analysis of the Voynich Manuscript - Printable Version +- The Voynich Ninja (https://www.voynich.ninja) +-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html) +--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html) +--- Thread: Automated Topic Analysis of the Voynich Manuscript (/thread-4834.html) |
RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 04-08-2025

(03-08-2025, 10:20 AM) GreyCat wrote: That's really interesting - brings up those ideas of herbal astrology like in Culpepper.

Sorry, what do you mean by Culpepper? Thank you.

RE: Automated Topic Analysis of the Voynich Manuscript - RobGea - 04-08-2025

Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer. His book "The English Physitian" (1652), later "Complete Herbal" (1653), is a source of pharmaceutical and herbal lore of the time. (From Wikipedia.)

"The Complete Herbal" by Nicholas Culpeper is a historical medicinal guide written in the mid-17th century. This work combines herbalism, astrology, and early medical practices. (From Project Gutenberg.)

TL;DR: Culpeper was an astrological botanist and his book "The Complete Herbal" was widely read.

RE: Automated Topic Analysis of the Voynich Manuscript - Jorge_Stolfi - 04-08-2025

(04-08-2025, 03:17 PM) RobGea wrote: Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer.

Here is [link not shown], which should be better suited to statistical analysis than the raw Project Gutenberg text. The file is in ISO Latin-1 encoding, but the text itself is mostly ASCII, except that the period of abbreviation is replaced by '°' to distinguish it from the sentence period. For most analyses you can consider only the lines that start with 'a' (words) and 's' (symbols, namely numbers and '&' for 'and'). If you need the punctuation too, each sign is on a separate line that starts with 'p'.
There are a couple of lines of Latin and English verses; they are marked off by '# @begin {ltv}' etc. Let me know if you need to remove them and can't figure out the markup.

    # Nicholas Culpeper, "The English Physitian" ("Culpeper's Herbal")
    # Last edited on 2016-05-09 22:07:03 by stolfilocal
    # From a Yale electronic edition.
    #
    # @chars null {}
    # @chars blank {_}
    # @chars alpha {ABCDEFGHIJKLMNOPQRSTUVWXYZ}
    # @chars alpha {abcdefghijklmnopqrstuvwxyz}
    # @chars alpha {'~°}
    # @chars symbol {0123456789&*}
    # @chars punct {.,!?():;-«»÷=}
    #
    # SOURCE AND CREDITS
    #
    # This is the full text of "Culpeper's Herbal", actually
    # "The English Physitian", a long-popular handbook of herbal
    # medicine by Nicholas Culpeper's (1616-1654),
    #
    # The source for this file was an electronic version prepared
    # by Richard Siderits, M.D. Yale University, and adapted
    # to HTML by Toby Appel. The file was fetched on 2001-01-20 from
    # [link not shown]
    # From the printed book's library catalog:
    #
    # Culpeper, Nicholas, 1616-1654.
    # "The English physitian: or an astrologo-physical
    # discourse of the vulgar herbs of this nation"
    # London : Peter Cole, 1652.
    # 8 p.l., 255 p. (i.e. 159 p.), [5] p., front. (port.)
    # Pages numbered 1-92, 189-255.
    #
    # From Richard Siderits's introduction:
    #
    # Nicholas Culpeper, a legendary figure in the field of herbal
    # medicine and author of /The English Physitian/, transcribed
    # within, was a man of mystery and glory - a revolutionary who
    # taxed the hierarchal politicos, challenged the procedures and
    # policies of the clergy and championed the wonderings of common
    # folk, much to the chagrin of the established pedantists.
    #
    # Within this manuscript, the reader will find the wit, intellect,
    # ethic and conviction of a man maligned by his colleagues and
    # much respected by his community. Culpeper worked to bring
    # medicinal treatments from the mysterious to the comprehensible.
    # His philosophy was to teach the common folk to minister to
    # themselves by providing them with the tools and knowledge for
    # self health. His mind and ambition was to reform the whole
    # system of medicine by being an innovative questioner paving the
    # way for new thoughts and principles contrary to established
    # traditions.
    #
    # A man of and for the common people, Culpeper wrote with a
    # personal style revealing his insights as well as his struggles.
    # Culpeper's writing tends to be comprehensive and exhaustive in
    # its approach to reconciling astrology and medicine.
    #
    # LANGUAGE, SPELLING AND ENCODING
    #
    # The language is mostly English prose in ASCII encoding, with
    # original (hence not very consistent) spelling, punctuation and
    # capitalization; except that "~" is used for hyphen,
    # to distinguish it from punctuation dashes, and "°" for
    # period of abbreviation, to distinguish it from final stop.
    #
    # There are some tables, Latin phrases, and English verses
    # scattered through some sections; these inserts have been marked
    # (see below).
    #
    # Indented text is marked with "{»}" comments. Significant line
    # breaks or ends (in tables, indices, verse, etc.) are marked with "÷", and
    # paragraph breaks with "=".
    #
    ...
    # @section 1 {opn}
    $ {opn}
    # @section 2 {tpg}
    $ {opn}{tpg}
    # @section 3 {tt}
    $ {opn}{tpg}{tt}
    @ 180
    a THE
    @ 181
    a ENGLISH
    a PHYSITIAN
    p :
    p ÷
    # @section 3 {tt1}
    $ {opn}{tpg}{tt1}
    @ 183
    a OR
    p ÷
    @ 184
    a An
    a Astrologo~Physical
    a Discourse
    a of
    a the
    a Vulgar
    @ 185
    a Herbs
    a of
    a this
    a Nation
    p .
    @ 186
    p =
    # @section 3 {tx}
    $ {opn}{tpg}{tx}
    @ 188
    a Being
    a a
    a Compleat
    a Method
    a of
    a Physick
    p ,
    a whereby
    a a
    a man
    @ 189
    a may
    a preserve
    a his
    a Body
    a in
    a Health
    p ;
    a or
    a cure
    a himself
    p ,
    a being
    @ 190
    a sick
    p ,
    a for
    a three
    a pence
    a charge
    p ,
    a with
    ...
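For anyone who wants to try the file, the layout described above could be read with something like the following Python sketch. It assumes one token per line, with the type letter first, then a space, then the token; that line layout is inferred from the description and the excerpt, not confirmed against the file itself. The name `read_tokens` and the sample lines are illustrative only.

```python
from typing import Iterable, List


def read_tokens(lines: Iterable[str], keep_punct: bool = False) -> List[str]:
    """Collect word ('a') and symbol ('s') tokens, optionally punctuation ('p').

    Assumed layout: "<type letter> <token>" on each line. Lines starting
    with '#', '$' or '@' (comments, section tags, line locators) are skipped
    because their type letter is not in the wanted set.
    """
    wanted = {"a", "s"}
    if keep_punct:
        wanted.add("p")
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if len(line) < 3 or line[1] != " ":
            continue  # too short, or not in "<letter> <token>" form
        kind, token = line[0], line[2:].strip()
        if kind in wanted and token:
            tokens.append(token)
    return tokens


# Sample lines rebuilt from the excerpt in the post (layout is an assumption).
sample = [
    "# @section 3 {tt}",
    "$ {opn}{tpg}{tt}",
    "@ 180",
    "a THE",
    "@ 181",
    "a ENGLISH",
    "a PHYSITIAN",
    "p :",
    "p ÷",
    "@ 188",
    "a Being",
    "a a",
    "a Compleat",
    "a Method",
]

words = [t.lower() for t in read_tokens(sample)]
print(words)
```

For the real file, the lines would come from something like `open(path, encoding="latin-1")`, where `path` is wherever the download was saved.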
RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 05-08-2025

(04-08-2025, 09:20 PM) Jorge_Stolfi wrote: (04-08-2025, 03:17 PM) RobGea wrote: Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer. Here is [link not shown], which should be better suited to statistical analysis than the raw Project Gutenberg text.

Hello Jorge and Rob, thank you for introducing me to Culpeper's "The English Physitian" (also named "Complete Herbal"). I ran the same experiments as I did with the Voynich MS, that is, LDA, BERTopic and NMF models on the text of Culpeper's book. I first cleaned the data and labelled it by section, then ran the topic models. Each dot is a text section. I cannot label the dots by page, as I don't have the page numbers in the txt file.

Here is a summary of the topics found by each model, their distribution and the most important words per topic:

LDA:

[Figure: LDA topic distribution]

You can see the word clouds for each topic (a word is bigger if it has more importance in the topic):

[Figure: LDA word clouds]

Here is a summary of what the most important words of each topic tell us about it:

Topic 0 — General theory, method, and practice Key words: herbs, reason, tree, way, shall, make, keep, time
Topic 1 — Virtues / medicinal uses Key words: good, herb, helpeth, water, juyce, pains, wine, decoction, ulcers, applied
Topic 2 — Habitat / growing conditions Key words: places, groweth, land, gardens, fields, wild, sides
Topic 3 — Morphological description Key words: leavs, smal, seed, long, green, flowers, root, colour, stalks, white
BERTopic:

[Figure: BERTopic topic distribution]

The word clouds are:

[Figure: BERTopic word clouds]

Summary of the topics:

Topic 0 — Virtues / medicinal uses Key words: good, herb, helpeth, water, juyce, pains, wine, decoction
Topic 2 — Morphological description Key words: smal, leavs, long, flowers, colour, green, round
Topic 1 — Habitat / place Key words: groweth, places, land, gardens, fields, woods
Topics 3, 4, 5, 7 — Time / flowering and harvest
Topic 6 — General introductory remarks Key words: description, known, well, need, write, trouble, sith
NMF:

[Figure: NMF topic distribution]

The word clouds are:

[Figure: NMF word clouds]

The topic descriptions are:

NMF Topic 0 — Morphological description Key words: smal, leavs, long, flowers, root, colour, white, branches
NMF Topics 1 & 5 — Flowering and seasonal timing
NMF Topic 3 — Habitat / place Key words: groweth, places, land, gardens, fields, woods, moist, meadows, hedg, wild
NMF Topic 2 — Virtues / medicinal uses Key words: also, herb, good, helpeth, ulcers, wounds, water, decoction, pains, wine
NMF Topic 4 — Introductory / general remarks Key words: description, known, well, need, garden, vertues, followeth, every
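To make the NMF step a bit more concrete, here is a small self-contained Python sketch. It is not the code behind the results above (the post does not say which library or settings were used); it factorises a section-by-word count matrix with plain Lee-Seung multiplicative updates. The toy sections are made up from a few of the key words listed above, and the names `nmf`, `matmul`, `transpose` and `frob_error` are illustrative.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius distance between V and the product W @ H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


# Made-up toy "sections"; the real input would be the cleaned book sections.
sections = [
    "groweth places gardens fields wild",
    "groweth fields land gardens",
    "leavs smal flowers root green",
    "flowers leavs white root smal",
]
vocab = sorted({w for s in sections for w in s.split()})
V = [[s.split().count(w) for w in vocab] for s in sections]

W, H = nmf(V, k=2)
for t, row in enumerate(H):
    top = sorted(zip(row, vocab), reverse=True)[:4]
    print(f"Topic {t}:", ", ".join(w for _, w in top))
```

Each row of `H` holds one topic's word weights, which is where key-word lists like the ones above come from; each row of `W` says how strongly a section loads on each topic.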
RE: Automated Topic Analysis of the Voynich Manuscript - bi3mw - 05-08-2025

Deleted

RE: Automated Topic Analysis of the Voynich Manuscript - Jorge_Stolfi - 06-08-2025

Interesting! But it seems that, with all methods, the "topics" are based on the words that are most common overall, which, not coincidentally, are those with the least information. The really interesting topics would be about the conditions that the plants are supposed to treat.

Another unfortunate thing is that the plants in the herbal section are arranged alphabetically, rather than by similarity of purpose. Thus any topic will be uniformly scattered through the whole herbal section, rather than lumped in interesting ways...

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 06-08-2025

(06-08-2025, 01:03 AM) Jorge_Stolfi wrote: Interesting!

Yes, exactly! That’s the challenge. With Culpeper, we know what’s important and what’s filler, so the topics end up reflecting structure or common language, but not always the really interesting stuff like the conditions treated. And since the plants are ordered alphabetically, the useful groupings are scattered.

With the Voynich, we don’t even have that baseline. We don’t know which words are meaningful, which ones are structural, or even if there’s a true "meaning" at all in the usual sense. That makes it hard to distinguish signal from noise.

Still, the Culpeper exercise is helpful, I think, as it shows that topic models tend to pick up structure and repetition, not necessarily themes. I think (tell me if I am wrong) that we can see the "same" thing in the Voynich, where the topics cluster by section type. I think this tells us there’s a consistent underlying structure, even if we can’t read it yet. It’s not a translation, but it’s a start...

RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 06-08-2025

Here I update the modelling with the Voynich MS. I don't know if this will be of interest, but I think it gives some things to think about.

What I have done:
These are the results.

LDA

The LDA model reaches its best coherence with just 3 topics, but that is too general, so I took the peak at 9 topics to work with:

[Figure: LDA coherence by number of topics]

You can see in this plot that some of the topics overlap by about 30% of their radius on the distance map, but I think it is worth a try:

[Figure: LDA intertopic distance map]

This is the distribution of the topics per paragraph:

[Figure: LDA topic distribution per paragraph]

And these are the word clouds per topic:

[Figure: LDA word clouds per topic]

And the topic distribution by section:

[Figure: LDA topic distribution by section]

BERTopic

BERTopic determines the number of topics automatically. Noise goes to topic -1. After training, BERTopic also finds 9 topics (plus the noise topic -1). Here is the distribution:

[Figure: BERTopic topic distribution]

And here are the word clouds:

[Figure: BERTopic word clouds]

And the topic distribution by section:

[Figure: BERTopic topic distribution by section]

NMF

Automated elbow detection for NMF finds 6 topics as the optimum:

[Figure: NMF elbow plot]

The distribution across the paragraphs is as follows:

[Figure: NMF topic distribution per paragraph]

The word clouds:

[Figure: NMF word clouds]

And the distribution by section:

[Figure: NMF topic distribution by section]

So, if you need any information about my work, feel free to contact me.

Bonus

I am just starting to analyse the results, but I have already tried one thing. I know ChatGPT is not a friend here, but it is a good tool for summarizing results. So I gave it the most heavily weighted words per topic and per model, and the models' distributions, and asked it which correlations or other information it could find. Please don’t dismiss or demonize my work simply because I consulted ChatGPT for ideas.

Here is the answer, which may also be interesting:

1️⃣ Distribution of topics across sections (the heatmap-like proportions) LDA
2️⃣ Topic keywords (high-level interpretation) LDA
3️⃣ General observations
4️⃣ Hypotheses
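The post mentions automated elbow detection for choosing the number of NMF topics but does not say how it was done. One common heuristic, given here only as a sketch, picks the point of the error curve that lies farthest below the straight line joining its two ends. The function name `find_elbow` and the error values are invented for illustration; they are not results from the Voynich run.

```python
def find_elbow(ks, errors):
    """Return the k whose point lies farthest below the chord of the curve.

    Assumes `ks` is increasing and `errors` decreases overall from the
    first to the last value (e.g. reconstruction error per topic count).
    """
    if len(ks) != len(errors) or len(ks) < 3:
        raise ValueError("need at least three (k, error) pairs")
    if errors[0] <= errors[-1]:
        raise ValueError("expected an overall decreasing error curve")
    # Rescale both axes to [0, 1] so neither dominates the distance.
    xs = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    ys = [(e - errors[-1]) / (errors[0] - errors[-1]) for e in errors]
    # After rescaling, the chord runs from (0, 1) to (1, 0); the distance
    # of a point below it is proportional to 1 - x - y.
    gaps = [1.0 - x - y for x, y in zip(xs, ys)]
    best = max(range(len(ks)), key=lambda i: gaps[i])
    return ks[best]


# Invented example values: the error drops quickly at first, then flattens.
ks = [2, 3, 4, 5, 6, 7, 8]
errors = [90.0, 60.0, 40.0, 36.0, 33.0, 31.0, 30.0]
print(find_elbow(ks, errors))
```

A heuristic like this only marks where the curve bends; whether that number of topics is meaningful still has to be judged from the topics themselves.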
RE: Automated Topic Analysis of the Voynich Manuscript - RobGea - 06-08-2025

Nevermind

RE: Automated Topic Analysis of the Voynich Manuscript - Aga Tentakulus - 08-08-2025

I think it's definitely worth a try. As long as only plants are compared. Or the possible recipe section. Of course, you shouldn't mix them.