I haven't seen much on this forum discussing term frequency across topics. Some examples I did find were the "interesting Vwords" series by -JKP- and, obviously, Topic Modeling in the Voynich Manuscript by Sterneck et al.
This could very well be because the results are already common knowledge or because the data is functionally meaningless, but either way I hope posting this will at least save someone the potentially wasted effort of doing it themselves.
While the VMS has intrigued me for some time, my background is in GIS/cartography, so while data science may be familiar to me, applying it to text and language is not. My amateurishness will be very apparent, and I apologize in advance. If you have a moment to spare some criticism for my process or for misusing a term, it would be appreciated. While I am naturally inclined to just play with my data and not post anything, making mistakes and being called out on them is the fastest way I know to learn.
My goal was to quantify how different each topic is from the others, and to build profiles for each 'vord' which could shed light on its potential meanings. For example, if a term is used at a much higher frequency in herbal than in astrological, we could presume the vord's meaning is more relevant to herbal than to astrological. A term which appears at similar frequency in 2 topics but not in the other 3 might indicate a different meaning than if that vord was equally frequent across all topics, or highly frequent in only one, and so on.
Process
I tried to follow some of the same initial methodology as Sterneck et al. I also used the Takahashi EVA transcription as corrected by Zandbergen and Stolfi, and sectioned the VMS into 5 topics: 1. herbal, 2. astrological, 3. balneological, 4. pharmaceutical, and 5. starred. Each topic was then considered a separate document.
Where I differed was that I assumed text-only pages go with the closest illustrations, and I didn't analyze anything by hand or scribe. These are places where I could improve if this has any value. Additionally, as I was not looking to measure importance, I used a simple term frequency percent.
Code:
tf = (raw count / total terms in document) * 100
Percent was used here in order to scale the outputs to be more human readable.
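For anyone who wants to reproduce the formula above, here is a minimal Python sketch of it. The function name and the toy tokens are my own placeholders for illustration; they are not the actual script or counts used for this post.

```python
from collections import Counter

def term_frequency_percent(tokens):
    """Return {term: percent of all tokens in this document}."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total * 100 for term, count in counts.items()}

# Toy tokens only, not real counts from the transcription.
toy_topic = "daiin ol daiin chedy".split()
tf = term_frequency_percent(toy_topic)
print(tf["daiin"])  # 2 of 4 tokens -> 50.0
```

Each topic would be passed in as one flat list of vords, since each topic is treated as a single document.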
Each term's frequency was calculated per topic, and then the topics' frequencies were subtracted from each other pairwise to get the absolute difference. Afterwards, for visualization purposes, the results were z-score normalized: frequencies across topics, and differences across all differences within that topic. Differences were then totaled in order to give an overall idea of how "different" each topic is. This process was repeated with a frequency cut-off of 2, just to see what impact unique terms had on the metrics, if any.
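The differencing and normalization steps could be sketched roughly as below. This is one reading of the description under some assumptions: a term missing from a topic counts as 0, the z-score uses the population standard deviation, and the function names and percentages are placeholders rather than values from the spreadsheet.

```python
from itertools import combinations
from statistics import mean, pstdev

def pairwise_abs_diff(tf_by_topic):
    """Map each topic pair to {term: |tf_a - tf_b|}; missing terms count as 0."""
    diffs = {}
    for a, b in combinations(sorted(tf_by_topic), 2):
        terms = set(tf_by_topic[a]) | set(tf_by_topic[b])
        diffs[(a, b)] = {
            t: abs(tf_by_topic[a].get(t, 0.0) - tf_by_topic[b].get(t, 0.0))
            for t in terms
        }
    return diffs

def z_scores(values):
    """Population z-scores; returns zeros when there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Placeholder percentages, not taken from the attached spreadsheet.
tf_by_topic = {
    "herbal": {"daiin": 4.0, "ol": 1.0},
    "balneological": {"daiin": 1.0, "ol": 3.0, "qokam": 2.0},
}
diffs = pairwise_abs_diff(tf_by_topic)
# Total difference per topic pair: a rough "how different" score.
totals = {pair: sum(d.values()) for pair, d in diffs.items()}
print(totals)
```

With all 5 topics this yields 10 topic pairs; summing the pair totals that involve a given topic gives one overall number per topic.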
Results
Topic Totals
Topic profiles top 40 terms.
Spreadsheet of all terms included as attachment: TOPIC_FREQUENCY.xlsx
Observations
Differences
Assuming the contents were indeed related to the illustrations, I had gone into this process expecting herbal to be least like astrological and most like pharmaceutical. While pharmaceutical does have less difference when compared to herbal, I wouldn't say there is strong evidence that this is because the content is similar, as herbal seems to be fairly similar to nearly all the other topics. This included astrological, which surprised me.
I was also not expecting balneological to be so different from everything else except the starred topic. These two were quite similar. Like herbal, the starred 'recipes' were similar to all the other topics, but curiously not pharmaceutical. I wonder if there is any evidence here that the starred section originally followed balneological.
For the most part, filtering unique words did not change the overall patterning, just exaggerated it. The one exception is that removing the unique words makes astrological appear more similar to pharmaceutical. To me this speaks to the text actually reflecting differing topics rather than gibberish, as I would have expected removing unique words from gibberish to make the topics universally more similar.
Observations - Term Patterns
While it might be interesting to look at individual vords, to save this from being a huge block of text I will only illustrate some of the potential uses.
As an example, here are three comparisons. The first two are similar in structure, but by looking at their raw frequencies/histograms alone we might not be able to tell much from them. However, by creating topic frequency profiles we may be able to learn a bit more. For instance, cheol and sheol, as well as cheey and sheey, have fairly similar profiles, whereas daiin and ol have very different profiles. Could this indicate that C and S are prefixes which alter the meaning, but not significantly? As an analogy, consider purpose vs. re-purpose: both words have a similar structure and a related meaning, but one would still be used more often in certain discussions. Of course I don't know, but it may be a different way to look at terms.
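One way to put a number on "similar profile" would be a correlation between two vords' per-topic frequencies. This is not something I did for the charts above, just a sketch of how it might work; the Pearson correlation is my own choice of measure, and the three profiles are invented placeholders, not the real values for any vord.

```python
from math import sqrt

TOPICS = ["herbal", "astrological", "balneological", "pharmaceutical", "starred"]

def profile_similarity(p, q):
    """Pearson correlation between two per-topic frequency profiles."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((x - mp) * (y - mq) for x, y in zip(p, q))
    sp = sqrt(sum((x - mp) ** 2 for x in p))
    sq = sqrt(sum((y - mq) ** 2 for y in q))
    if sp == 0 or sq == 0:
        return 0.0  # flat profile: correlation undefined, treated as 0 here
    return cov / (sp * sq)

# Placeholder profiles (percent per topic, in TOPICS order), invented for illustration.
vord_a = [0.2, 0.1, 0.6, 0.3, 0.5]
vord_b = [0.4, 0.2, 1.2, 0.6, 1.0]  # same shape as vord_a, twice as frequent
vord_c = [0.6, 0.5, 0.1, 0.3, 0.2]  # roughly the opposite shape

sim_ab = profile_similarity(vord_a, vord_b)
sim_ac = profile_similarity(vord_a, vord_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

A value near 1 means the two vords rise and fall in the same topics regardless of how common each is overall, which is the kind of similarity the cheol/sheol comparison is about; a negative value means they favor opposite topics.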
"Reading" the VMS
With some effort it may be possible to "read" passages of the VMS by looking at term topic profiles. What I present here is more a proof of concept than anything else. If any of this has any value, this process could be improved by better understanding and better parsing of the profiles, as well as potentially by classifying differences from std into high, median, and low categories.
<f106r.7,+P0> olched.qoiin.ychedy.qokam.sheol.qokor.cheees<$>
This becomes:
vord weighted towards balneological | vord unique to starred | vord weighted towards astrological and away from pharmaceutical | vord weighted away from pharmaceutical | vord weighted towards pharmaceutical and away from herbal | vord weighted against astrological | unique starred word
It might then be possible to assign potential words to these weights by building profiles for known works that cover similar topics (assuming the contents are even related to the illustrations), so that a program could pick words based on similarly matching topic difference/frequency profiles. For instance, the above could become:
water. stirred. night. running. inside. dirt. abracadabra.
Obviously, I'm pulling these words from thin air to illustrate the concept, and I doubt word tables could be built for the balneological or starred sections, but maybe for herbal, pharmaceutical, and astrological. It's a stretch for sure.
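The labelling step used on the f106r.7 line above (turning each vord into a "weighted towards / away from" description) could be automated along these lines. I did the labels above by eye, so this is only a sketch: the cut-off of 1 standard deviation, the function name, and the example profiles are all assumptions made for illustration.

```python
from statistics import mean, pstdev

TOPICS = ["herbal", "astrological", "balneological", "pharmaceutical", "starred"]

def describe_vord(profile, threshold=1.0):
    """Turn a per-topic frequency profile into a short text label."""
    present = [t for t, f in zip(TOPICS, profile) if f > 0]
    if len(present) == 1:
        return f"unique to {present[0]}"
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return "even across topics"
    z = [(f - mu) / sigma for f in profile]
    towards = [t for t, s in zip(TOPICS, z) if s >= threshold]
    away = [t for t, s in zip(TOPICS, z) if s <= -threshold]
    parts = []
    if towards:
        parts.append("weighted towards " + " and ".join(towards))
    if away:
        parts.append("weighted away from " + " and ".join(away))
    return " and ".join(parts) if parts else "no strong weighting"

# Placeholder profiles in TOPICS order, not real values for any vord.
print(describe_vord([0.0, 0.0, 0.0, 0.0, 0.3]))
print(describe_vord([0.1, 0.1, 0.9, 0.1, 0.1]))
```

Splitting the z-scores into more bands (for example high, middle, and low) would only need extra thresholds in the same function.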
Improvements?
Assuming any of this has value, I have written some questions I have been asking myself.
Is it important to differentiate by hand or scribe? Should all-text pages be separated into their own category or removed entirely?
Should I build and test a control?
Can I use this to build similar profiles for character n-grams to shed light on whether spaces are legit? For example, if ok and aiin have the same profile, it might indicate they are part of the same term (okaiin), but if they differ greatly it might indicate they are separate words.
Can I use word bi-grams and tri-grams to shed more light on their meanings? For example, how often does daiin ol show up in herbal vs balneological?
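For the bi-gram question, a first pass could reuse the same percent calculation on adjacent word pairs. This is an untested idea rather than something I have run on the real data; it assumes words are separated by '.' as in the transcription line quoted earlier, counts pairs only within a line, and uses made-up toy lines.

```python
from collections import Counter

def bigram_percent(lines):
    """Percent frequency of adjacent word pairs, counted within each line."""
    counts = Counter()
    for line in lines:
        words = line.split(".")
        counts.update(zip(words, words[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bg: c / total * 100 for bg, c in counts.items()}

# Toy lines using '.' word separators; not real transcription lines.
toy_topics = {
    "herbal": ["daiin.ol.chedy", "ol.daiin.ol"],
    "balneological": ["qokam.ol.daiin", "shedy.qokam.ol"],
}
for topic, lines in toy_topics.items():
    pct = bigram_percent(lines)
    print(topic, pct.get(("daiin", "ol"), 0.0))
```

Tri-grams would work the same way with `zip(words, words[1:], words[2:])`, and the resulting per-topic percentages could be fed through the same difference and profile steps as single vords.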