04-08-2025, 02:57 PM
04-08-2025, 02:57 PM
04-08-2025, 03:17 PM
Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer.
His book "The English Physitian" (1652), later "Complete Herbal" (1653), is a source of pharmaceutical and herbal lore of the time.
(From Wikipedia)
"The Complete Herbal" by Nicholas Culpeper is a historical medicinal guide written in the mid-17th century. This work combines herbalism, astrology, and early medical practices.
(From Project Gutenberg)
TL;DR Culpeper was an astrological botanist and his book "The Complete Herbal" was widely read.
04-08-2025, 09:20 PM
(04-08-2025, 03:17 PM)RobGea Wrote: Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer.
His book "The English Physitian,1652", ( later "Complete Herbal, 1653" ) is a source of pharmaceutical and herbal lore of the time.
Here is a file (link not preserved) that should be better suited to statistical analysis than the raw Project Gutenberg text.
The file is in ISO Latin-1 encoding, but the text itself is mostly ASCII, except that the period of abbreviation is replaced by '°' to distinguish it from the sentence-ending period. For most analyses you can consider only the lines that start with 'a' (words) and 's' (symbols, namely numbers and '&' for 'and'). If you need the punctuation too, each sign is on a separate line that starts with 'p'. There are a couple of lines of Latin and English verse; they are marked off by '# @begin {ltv}' etc. Let me know if you need to remove them and can't figure out the markup.
# # Nicholas Culpeper, "The English Physitian" ("Culpeper's Herbal")
# # Last edited on 2016-05-09 22:07:03 by stolfilocal
# # From a Yale electronic edition.
# #
# @chars null {}
# @chars blank {_}
# @chars alpha {ABCDEFGHIJKLMNOPQRSTUVWXYZ}
# @chars alpha {abcdefghijklmnopqrstuvwxyz}
# @chars alpha {'~°}
# @chars symbol {0123456789&*}
# @chars punct {.,!?():;-«»÷=}
# #
# # SOURCE AND CREDITS
# #
# # This is the full text of "Culpeper's Herbal", actually
# # "The English Physitian", a long-popular handbook of herbal
# # medicine by Nicholas Culpeper's (1616-1654),
# #
# # The source for this file was an electronic version prepared
# # by Richard Siderits, M.D. Yale University, and adapted
# # to HTML by Toby Appel. The file was fetched on 2001-01-20 from
# # (link not preserved)
# # From the printed book's library catalog:
# #
# # Culpeper, Nicholas, 1616-1654.
# # "The English physitian: or an astrologo-physical
# # discourse of the vulgar herbs of this nation"
# # London : Peter Cole, 1652.
# # 8 p.l., 255 p. (i.e. 159 p.), [5] p., front. (port.)
# # Pages numbered 1-92, 189-255.
# #
# # From Richard Siderits's introduction:
# #
# # Nicholas Culpeper, a legendary figure in the field of herbal
# # medicine and author of /The English Physitian/, transcribed
# # within, was a man of mystery and glory - a revolutionary who
# # taxed the hierarchal politicos, challenged the procedures and
# # policies of the clergy and championed the wonderings of common
# # folk, much to the chagrin of the established pedantists.
# #
# # Within this manuscript, the reader will find the wit, intellect,
# # ethic and conviction of a man maligned by his colleagues and
# # much respected by his community. Culpeper worked to bring
# # medicinal treatments from the mysterious to the comprehensible.
# # His philosophy was to teach the common folk to minister to
# # themselves by providing them with the tools and knowledge for
# # self health. His mind and ambition was to reform the whole
# # system of medicine by being an innovative questioner paving the
# # way for new thoughts and principles contrary to established
# # traditions.
# #
# # A man of and for the common people, Culpeper wrote with a
# # personal style revealing his insights as well as his struggles.
# # Culpeper's writing tends to be comprehensive and exhaustive in
# # its approach to reconciling astrology and medicine.
# #
# # LANGUAGE, SPELLING AND ENCODING
# #
# # The language is mostly English prose in ASCII encoding, with
# # original (hence not very consistent) spelling, punctuation and
# # capitalization; except that "~" is used for hyphen,
# # to distinguish it from punctuation dashes, and "°" for
# # period of abbreviation, to distinguish it from final stop.
# #
# # There are some tables, Latin phrases, and English verses
# # scattered through some sections; these inserts have been marked
# # (see below).
# #
# # Indented text is marked with "{»}" comments. Significant line
# # breaks or ends (in tables, indices, verse, etc.) are marked with "÷", and
# # paragraph breaks with "=".
# #
...
# @section 1 {opn}
$ {opn}
# @section 2 {tpg}
$ {opn}{tpg}
# @section 3 {tt}
$ {opn}{tpg}{tt}
@ 180
a THE
@ 181
a ENGLISH
a PHYSITIAN
p :
p ÷
# @section 3 {tt1}
$ {opn}{tpg}{tt1}
@ 183
a OR
p ÷
@ 184
a An
a Astrologo~Physical
a Discourse
a of
a the
a Vulgar
@ 185
a Herbs
a of
a this
a Nation
p .
@ 186
p =
# @section 3 {tx}
$ {opn}{tpg}{tx}
@ 188
a Being
a a
a Compleat
a Method
a of
a Physick
p ,
a whereby
a a
a man
@ 189
a may
a preserve
a his
a Body
a in
a Health
p ;
a or
a cure
a himself
p ,
a being
@ 190
a sick
p ,
a for
a three
a pence
a charge
p ,
a with
...
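The layout described above (one token per line, with the first character giving the line type) can be read with a few lines of Python. This is only a minimal sketch based on the description and the sample shown here: the function name `read_tokens` and the inline sample are made up for illustration, and the handling of the `# @begin {ltv}` markup is not covered.

```python
from collections import Counter

def read_tokens(lines, kinds=("a", "s")):
    """Yield (kind, token) for every token line whose kind is requested.

    Assumed layout, taken from the description and sample above:
    'a WORD' (word), 's SYMBOL' (number or '&'), 'p SIGN' (punctuation),
    '#' comment/markup, '$' section path, '@' source line number.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if len(line) < 3 or line[1] != " ":
            continue  # blank lines, '...' elisions, anything malformed
        kind, token = line[0], line[2:].strip()
        if kind in kinds and token:
            yield kind, token

# A few lines in the style of the sample above (shortened for illustration).
sample = """# @section 3 {tt}
$ {opn}{tpg}{tt}
@ 180
a THE
@ 181
a ENGLISH
a PHYSITIAN
p :
p ÷
a of
a the
a Vulgar
a Herbs
a of
p .
""".splitlines()

# Words only, lower-cased, ready for frequency counts.
words = [tok.lower() for kind, tok in read_tokens(sample)]
counts = Counter(words)
print(words)
print(counts["the"], counts["of"])
```

For the real file, the same function could be fed from `open(path, encoding="iso-8859-1")`, where `path` is wherever the file was saved; pass `kinds=("a", "s", "p")` to keep the punctuation lines as well.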
05-08-2025, 09:18 PM
(04-08-2025, 09:20 PM)Jorge_Stolfi Wrote: (04-08-2025, 03:17 PM)RobGea Wrote: Nicholas Culpeper (18 October 1616 – 10 January 1654) was an English botanist, herbalist, physician and astrologer.
His book "The English Physitian,1652", ( later "Complete Herbal, 1653" ) is a source of pharmaceutical and herbal lore of the time.
Here is a file (link not preserved) that should be better suited to statistical analysis than the raw Project Gutenberg text.
Hello Jorge and Rob,
thank you for introducing me to Culpeper's "The English Physitian" (also named "Complete Herbal"). I ran the same experiments as I did with the Voynich MS, that is, LDA, BERTopic and NMF models on the text of Culpeper's book. I first cleaned the data and labelled it by section, then ran the topic models. Each dot is a text section. I cannot label the sections by page, as the txt file does not contain page numbers.
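To make the idea of a topic model concrete, here is a toy, pure-Python NMF using the standard multiplicative updates. It is not the code used for the plots in this post (which library and settings were used is not stated); the four "sections", the function name `nmf` and all parameters are invented for illustration. A sections-by-words count matrix V is approximated by W·H: each row of H is a topic (word weights), and each row of W says how much of each topic a section contains.

```python
import random

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x words) into W (docs x k) and
    H (k x words) with multiplicative updates for the squared-error loss."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]

    def matmul(A, B):
        Bt = list(zip(*B))
        return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

    def transpose(A):
        return [list(r) for r in zip(*A)]

    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# Toy "sections": two description-like and two place-like texts (invented).
sections = [
    "leavs smal green flowers root white",
    "leavs long green stalks flowers seed",
    "groweth gardens fields places wild",
    "groweth places fields land gardens",
]
vocab = sorted({w for s in sections for w in s.split()})
V = [[s.split().count(w) for w in vocab] for s in sections]

W, H = nmf(V, k=2)

# Most heavily weighted words per topic.
for t, row in enumerate(H):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("topic", t, [w for _, w in top])

# Dominant topic per section (the kind of label a per-section plot would use).
dominant = [max(range(2), key=lambda t: W[d][t]) for d in range(len(sections))]
print(dominant)
```

On a real corpus a library implementation would normally be used instead; the point here is only that the "topics" are whatever word groups best reconstruct the count matrix, which is why repeated section vocabulary is picked up so readily.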
Here is a summary of the topics found by each model, its distribution and the most important words per topic:
LDA:
![[Image: DeQXDhv.png]](https://i.imgur.com/DeQXDhv.png)
You can see the word clouds for each topic (where a word is bigger if it has more importance in the topic):
![[Image: IpnhiPS.png]](https://i.imgur.com/IpnhiPS.png)
Here is a summary of what the most important words of each topic tell us about it:
Topic 0 — General theory, method, and practice
Key words: herbs, reason, tree, way, shall, make, keep, time
- This reads like preface / general instructions: reasoning about herbs, general use, “ways” of preparation, timing, keeping herbs.
- These could be Prologue or introductory notes about how and why to use herbs.
Topic 1 — Virtues / medicinal uses
Key words: good, herb, helpeth, water, juyce, pains, wine, decoction, ulcers, applied
- This is remedies: describes health benefits (helpeth, good for pains) and preparations (water, wine, decoction, applied).
- It should be the “Virtues” section where the medicinal value of each herb is explained.
Topic 2 — Habitat / growing conditions
Key words: places, groweth, land, gardens, fields, wild, sides
- This is the “Place” section: where the plant is found or grown (wild, fields, gardens, land).
- It also includes words like known, description, well which might introduce local familiarity (“well known in every garden”).
Topic 3 — Morphological description
Key words: leavs, smal, seed, long, green, flowers, root, colour, stalks, white
- This is a description of the plant’s physical characteristics: leaves, seeds, flowers, roots, color.
- Matches the “Description” section of each entry.
BERTopic:
![[Image: WJHmLtF.png]](https://i.imgur.com/WJHmLtF.png)
The word clouds are:
![[Image: 3U5MGjX.png]](https://i.imgur.com/3U5MGjX.png)
Summary of the topics:
Topic 0 — Virtues / medicinal uses
Key words: good, herb, helpeth, water, juyce, pains, wine, decoction
- This is the “Virtues” section: medicinal benefits of the plant, how it helps ailments, and how it’s prepared (water, wine, decoction, juyce).
- Almost identical to what we saw for LDA Topic 1 earlier.
Topic 2 — Morphological description
Key words: smal, leavs, long, flowers, colour, green, round
- This is the “Description” section: form, size, and color of leaves, flowers, roots, and other parts.
- A direct match to the physical description part of each herb entry.
Topic 1 — Habitat / place
Key words: groweth, places, land, gardens, fields, woods
- This is the “Place” section: where the herb grows (gardens, fields, wild places, woods).
- Tells you where to find or cultivate the plant.
Topics 3, 4, 5, 7 — Time / flowering and harvest
- Topic 3: flower, july, ripe, seed, august, june — summer flowering/fruiting
- Topic 4: flowreth, ripe, july, seed, june, august — flowering + seed production
- Topic 5: flowreth, july, june, august, beginning, months — start and duration of flowering
- Topic 7: seedeth, flowreth, august, july, june — seed maturation
Topic 6 — General introductory remarks
Key words: description, known, well, need, write, trouble, sith
- These are introductory phrases like “This herb is well known, its description is as follows…” or general commentary.
- This fits the opening lines of an entry before moving to the detailed description.
NMF:
![[Image: 4HJR9Di.png]](https://i.imgur.com/4HJR9Di.png)
The word clouds are:
![[Image: 1WTif90.png]](https://i.imgur.com/1WTif90.png)
The topic descriptions are:
NMF Topic 0 — Morphological description
Key words: smal, leavs, long, flowers, root, colour, white, branches
- This is clearly the Description section (appearance of the plant).
- It talks about physical attributes: size, leaf shape, flower color, roots.
- Voynich parallel: The text that would accompany the plant drawings.
NMF Topics 1 & 5 — Flowering and seasonal timing
- Topic 1: july, flowreth, june, august, ripe, seed — flowering/fruiting schedule, mainly summer months.
- Topic 5: flower, summer, seed, ripe, end, august, may, april, winter — seasonal cycle with a broader time span (spring → summer → harvest).
NMF Topic 3 — Habitat / place
Key words: groweth, places, land, gardens, fields, woods, moist, meadows, hedg, wild
- This is the Place section.
- Lists locations where the plant grows (gardens, wild places, meadows, hedgerows, waste ground).
NMF Topic 2 — Virtues / medicinal uses
Key words: also, herb, good, helpeth, ulcers, wounds, water, decoction, pains, wine
- This is the Virtues section.
- Discusses medicinal benefits (help for ulcers, wounds, pains) and preparation methods (wine, decoction, boyled, drunk).
NMF Topic 4 — Introductory / general remarks
Key words: description, known, well, need, garden, vertues, followeth, every
- This looks like introductory sentences that precede the Description section.
- Phrases like “This herb is well known… the description followeth…”
- It’s almost the template line that starts many herb entries.
05-08-2025, 11:10 PM
Deleted
06-08-2025, 01:03 AM
Interesting!
But it seems that, with all methods, the "topics" are based on the words that are most common overall, which, not coincidentally, are those with least information. The really interesting topics would be about the conditions that the plants are supposed to treat.
And another unfortunate thing is that the plants in the herbal section are arranged alphabetically, rather than by similarity of purpose. Thus any topic will be uniformly scattered through the whole herbal section, rather than lumped in interesting ways...
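One common way to act on the first point is to drop words that occur in most sections before fitting any model, so that the rarer, condition-specific words carry the weight. The following is only a sketch of that idea with invented toy sentences; the function name `drop_ubiquitous` and the 50% threshold are assumptions, not something used in this thread.

```python
from collections import Counter

def drop_ubiquitous(docs, max_df=0.5):
    """Remove words that occur in more than max_df of the documents."""
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    limit = max_df * len(docs)
    return [[w for w in doc if df[w] <= limit] for doc in docs]

docs = [
    "the herb is good for ulcers and wounds".split(),
    "the herb is good for the stone".split(),
    "the herb helpeth agues and the stone".split(),
    "the juyce is good for wounds".split(),
]
filtered = drop_ubiquitous(docs, max_df=0.5)
for doc in filtered:
    print(doc)
```

What survives is mostly the names of ailments and preparations, which is closer to the kind of vocabulary the "interesting" topics would need.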
06-08-2025, 08:51 AM
(06-08-2025, 01:03 AM)Jorge_Stolfi Wrote: Interesting!
But it seems that, with all methods, the "topics" are based on the words that are most common overall, which, not coincidentally, are those with least information. The really interesting topics would be about the conditions that the plants are supposed to treat.
And another unfortunate thing is that the plants in the herbal section are arranged alphabetically, rather than by similarity of purpose. Thus any topic will be uniformly scattered through the whole herbal section, rather than lumped in interesting ways...
Yes, exactly! That’s the challenge.
With Culpeper, we know what’s important and what’s filler, so we can see that the topics end up reflecting structure or common language, but not always the really interesting stuff like the conditions treated. And since the plants are ordered alphabetically, the useful groupings are scattered.
With the Voynich, we don’t even have that baseline. We don’t know which words are meaningful, which ones are structural, or even if there’s a true "meaning" at all in the usual sense. That makes it hard to distinguish signal from noise.
Still, I think the Culpeper exercise is helpful, as it shows that topic models tend to pick up structure and repetition, not necessarily themes. I would say (tell me if I am wrong) that we can see the "same" thing in the Voynich, where the topics cluster by section type. I think it tells us there’s a consistent underlying structure, even if we can’t read it yet.
It’s not a translation, but it’s a start...
06-08-2025, 10:38 AM
Here is an update of the modelling with the Voynich MS. I don't know if this will be of interest, but I think it gives some things to think about.
What I have done:
- Created a list of paragraphs: text starting with ¡ and ending with ., plus the "lists" of words, such as those in the cosmological roses, which are also treated as paragraphs.
- Removed the most frequently used words, assuming that they are common words (who knows?). The words removed are: ['aiin', 'ar', 'chedy', 'chey', 'chol', 'daiin', 'dar', 'dy', 'ol', 'or', 'qokeey', 's', 'shedy', 'shey', 'y']
- Removed the single-word paragraphs.
- Ran LDA, BERTopic and NMF models.
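The filtering steps above can be sketched as follows. This is an illustration, not the actual script: it assumes the removed words were simply the most frequent ones by raw count (the list above has 15 entries, hence the default), and that single-word paragraphs are dropped after the frequent words have been removed. The function name `preprocess` and the toy paragraphs are made up.

```python
from collections import Counter

def preprocess(paragraphs, n_stop=15):
    """Drop the n_stop most frequent words, then drop paragraphs left with
    fewer than two words. Returns (kept_paragraphs, removed_words)."""
    freq = Counter(w for p in paragraphs for w in p)
    stop = {w for w, _ in freq.most_common(n_stop)}
    kept = [[w for w in p if w not in stop] for p in paragraphs]
    kept = [p for p in kept if len(p) > 1]
    return kept, sorted(stop)

# Toy paragraphs built from a few Voynich-style tokens (invented sequences).
paragraphs = [
    "daiin chor daiin shol ol".split(),
    "ol daiin qokain ol".split(),
    "otedy daiin".split(),
    "ol".split(),
]
kept, removed = preprocess(paragraphs, n_stop=2)
print(removed)
print(kept)
```

Note how aggressive the combination is on short paragraphs: three of the four toy paragraphs disappear once the two most frequent words are gone, so the order of the two filtering steps matters for labels and other short texts.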
These are the results.
LDA
The LDA model reaches its best coherence with just 3 topics, but that is too general, so I took the next coherence peak, at 9 topics, to work with:
![[Image: uKObvS1.png]](https://i.imgur.com/uKObvS1.png)
You can see in this plot that some of the topics overlap by about 30% of their radius on the distance map, but I think it is worth a try:
![[Image: Iy74uAV.png]](https://i.imgur.com/Iy74uAV.png)
This is the distribution of the topics per paragraph:
![[Image: BG1af5i.png]](https://i.imgur.com/BG1af5i.png)
And these are the word clouds per topic:
![[Image: t6IrJ93.png]](https://i.imgur.com/t6IrJ93.png)
And the topic distribution by section:
![[Image: SW3Onuu.png]](https://i.imgur.com/SW3Onuu.png)
BERTopic
BERTopic determines the number of topics automatically and assigns noise to topic -1. After training, BERTopic also finds 9 topics (plus the noise topic -1). Here is the distribution:
![[Image: RTU52LU.png]](https://i.imgur.com/RTU52LU.png)
And here the word clouds:
![[Image: IbqZk3U.png]](https://i.imgur.com/IbqZk3U.png)
And the topic distribution by section:
![[Image: g0uWdhB.png]](https://i.imgur.com/g0uWdhB.png)
NMF
Automated elbow detection for NMF finds 6 topics as the optimum:
![[Image: jAICkeb.png]](https://i.imgur.com/jAICkeb.png)
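The post does not say which elbow criterion was used. One common heuristic, sketched here with made-up error values, picks the number of topics whose point on the error curve lies farthest from the straight line joining the first and last points. The function name `elbow` and the numbers are assumptions for illustration only.

```python
def elbow(ks, errors):
    """Return the k whose point lies farthest from the straight line joining
    the first and last points of the (k, error) curve."""
    x0, y0, x1, y1 = ks[0], errors[0], ks[-1], errors[-1]
    # Rescale so the curve starts at (0, 1) and ends at (1, 0).
    xs = [(k - x0) / (x1 - x0) for k in ks]
    ys = [(e - y1) / (y0 - y1) for e in errors]
    # The chord is then x + y = 1; |x + y - 1| is proportional to the
    # distance from it, which is all that is needed to find the maximum.
    dist = [abs(x + y - 1) for x, y in zip(xs, ys)]
    return ks[dist.index(max(dist))]

ks = [2, 3, 4, 5, 6, 7, 8, 9, 10]
# Hypothetical reconstruction errors, NOT the values behind the plot above.
errors = [100.0, 80.0, 64.0, 52.0, 42.0, 40.0, 38.5, 37.0, 36.0]
print(elbow(ks, errors))
```

The sketch assumes the error actually decreases between the first and last k; a flat curve would need a separate check before dividing.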
The distribution across the paragraphs is as follows:
![[Image: fK4jaEF.png]](https://i.imgur.com/fK4jaEF.png)
The word clouds:
![[Image: xzJ4sta.png]](https://i.imgur.com/xzJ4sta.png)
And the distribution by section:
![[Image: 8Cn4xM6.png]](https://i.imgur.com/8Cn4xM6.png)
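A topic-by-section table like the ones shown for each model can be computed from two parallel lists: the section label of each paragraph and its dominant topic. The sketch below assumes the proportions are normalised within each section (each section's shares sum to 1); whether the plots above use that or per-topic normalisation is not stated. The function name and the toy labels are invented.

```python
from collections import Counter, defaultdict

def topic_share_by_section(sections, topics):
    """For each section, return the share of its paragraphs per topic."""
    counts = defaultdict(Counter)
    for sec, top in zip(sections, topics):
        counts[sec][top] += 1
    return {
        sec: {t: n / sum(c.values()) for t, n in sorted(c.items())}
        for sec, c in counts.items()
    }

# Toy input: one entry per paragraph (invented labels, not the real data).
sections = ["Botanical"] * 4 + ["Biological"] * 2
topics = [1, 1, 1, 0, 3, 1]

shares = topic_share_by_section(sections, topics)
for sec, row in shares.items():
    print(sec, row)
```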
So, if you need any information about my work, feel free to contact me.
Bonus
I am just starting to analyse the results, but I did try one thing. I know ChatGPT is not a friend here, but it is a good tool for summarizing results. So I gave it the most heavily weighted words per topic and per model, along with the models' topic distributions, and asked it which correlations or other information it could find. Please don’t dismiss or demonize my work simply because I consulted ChatGPT for ideas.

Here is the answer, which may also be interesting:
1️⃣ Distribution of topics across sections (the heatmap-like proportions)
LDA
- LDA topics are spread fairly evenly across sections; no single topic is strongly concentrated in one section.
- Topic 2 stands out:
  - Biological has a very high proportion (0.52)
  - That suggests some vocabulary in Topic 2 is unique to that section (possibly labels, patterns, or specialized words).
- Botanical is broadly distributed across many topics, which is expected given it’s the largest section.
- Astronomical and Cosmological are not dominated by a single topic — they are mixed.
BERTopic
- Large share of paragraphs labeled -1 (outliers/no clear topic) especially in Cosmological (0.46), Pharmacological (0.50), Recipes (0.57).
→ Suggests these sections have less internally consistent clusters or use more varied vocabulary.
- Topic 0 is Botanical-heavy (0.59).
- Topic 2 is Biological-heavy (0.23).
- Overall: BERTopic seems to be splitting Botanical and Biological into distinct clusters, but struggles to model Cosmological/Recipes.
NMF
- Much stronger specialization than LDA:
  - Topic 1 is almost entirely Botanical (0.65).
  - Topic 3 is largely Biological (0.46).
  - Topic 2 is Cosmological-heavy (0.67).
  - Topic 5 is Pharmacological-heavy (0.65).
- This is a clear separation: NMF aligns topics with manuscript sections much more than LDA or BERTopic.
2️⃣ Topic keywords (high-level interpretation)
LDA
- Top words dominated by common Voynich stems: qokeedy, qokedy, chor, otedy, dal, dain...
- These are global frequent tokens, which is why LDA distributes topics evenly across sections.
- Topics 0–8 look more like statistical clusters of frequent syllable-like units than semantic groups.
- Hypothesis: LDA in Voynich case mostly groups by script/word-form patterns (maybe scribal habits, line positions, gallows usage) rather than semantic categories.
BERTopic
- Topics 0–8 clearly group orthographically similar word families:
  - Topic 0: chor, chy, sho, cthy...
  - Topic 1: qokaiin, qokain, okaiin...
  - Topic 3: otoldy, okol, okolo...
  - Topic 5: qokchedy, qopchedy...
- This is strong evidence BERTopic clusters by word-form morphology — essentially “roots” or repeating glyph patterns.
- These clusters sometimes align with sections (Topic 0 heavily Botanical, etc.).
NMF
- The clearest segmentation:
  - Topic 1: chor, chy, cthy, shol
    → matches clusters in Herbal/plant section (Botanical).
  - Topic 2: okaiin, okar, qokaiin...
    → occurs in sections with lots of aiin endings (Cosmological + Recipes).
  - Topic 3: qokain, qokal, qol
    → concentrated in Biological section.
  - Topic 5: cheol, okeol, sheol
    → concentrated in Pharmacological.
- Hypothesis: NMF is isolating morphological families tied to certain sections — possibly reflecting differences in vocabulary or “Currier hands”.
3️⃣ General observations
- Alphabetical arrangement in Culpeper caused topics to scatter across the section; Voynich doesn’t have alphabetical ordering so topic clustering should reflect functional segmentation. We see some of that in NMF.
- LDA here is limited — its topics align weakly with sections, suggesting it mainly captures high-frequency tokens.
- BERTopic finds morphological families but leaves many outliers (-1) — possibly because Voynich word distribution is flatter than typical text corpora.
- NMF shows the strongest correspondence between topics and sections, suggesting:
  - Different sections use distinct “lexical sets” (whether actual lexicon or scribal variants).
  - Voynich sections may have semi-autonomous vocabularies.
4️⃣ Hypotheses
- Different scribes or functional registers
  - The alignment of NMF topics to sections suggests either different scribes with different preferred glyph sequences, or different formulaic content by section.
- Section-specific vocabulary
  - Biological vs. Botanical vs. Pharmacological differences may indicate thematic sets of words (real or pseudo-lexical).
- Cosmological as an anomaly
  - Cosmological sections cluster together in NMF (Topic 2) despite being short; could imply repeated formulae or unique word-shapes.
- Recipes section is diffuse
  - BERTopic and LDA both show Recipes as not strongly tied to one topic; perhaps because of greater lexical diversity or mixing of vocabulary sets.
- Pharmacological section is specialized
  - NMF Topic 5 dominance suggests a stable repeated vocabulary unique to this section.
06-08-2025, 01:34 PM
Nevermind
08-08-2025, 07:39 AM
I think it's definitely worth a try.
As long as only plants are compared. Or the possible recipe section.
Of course, you shouldn't mix them.