How Different is Each Topic Each Other - Term Frequency Across Topic

A.Wilmarth > 30-07-2024, 03:45 AM

I haven't seen much on this forum discussing term frequency across topics. Some examples I did find were the "interesting Vwords" series by -JKP- and, obviously, Topic Modeling in the Voynich Manuscript by Sterneck et al.

This could very well be because the results are already common knowledge or because the data is functionally meaningless, but either way at the very least I hope posting this will save someone the potentially wasted effort of doing it themselves.

While the VMS has intrigued me for some time, my background is in GIS/cartography, so while data science may be familiar to me, applying it to text and language is not. My amateurishness will be very apparent. I apologize in advance. If you have a moment to spare some criticism for my process or for misusing a term, it would be appreciated. While I am naturally inclined to just play with my data and not post anything, making mistakes and being called out on them is the fastest way I know to learn.

My goal was to quantify how different each topic is from the others, and to build profiles for each 'vord' which could shed light on its potential meanings. For example, if a term is used at a much higher frequency in herbal than astrological, we could presume the vord's meaning is more relevant to herbal than astrological. A term which appears at a similar frequency in 2 topics but not in the other 3 might indicate a different meaning than if that vord was equally frequent across all topics, or only highly frequent in one, and so on.

Process

I initially tried to follow some of the same methodology as Sterneck et al. I also used the Takahashi EVA transcription as corrected by Zandbergen and Stolfi, and sectioned the VMS into 5 topics: 1. herbal, 2. astrological, 3. balneological, 4. pharmaceutical, and 5. starred. Each topic was then considered a separate document.

Where I differed was that I assumed full-text pages go with the closest illustrations, and I didn't break anything down by hand or scribe. These are places where I could improve if this has any value. Additionally, as I was not looking to measure importance, I used a simple term frequency percent.
Code:
tf = (raw count / total terms in document) * 100
Percent was used here in order to scale the outputs to be more human readable.
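
For anyone who wants to reproduce the calculation, here is a minimal Python sketch of the formula above. The function name tf_percent and the toy token list are placeholders made up for illustration, not the actual script or transcription data; it assumes each topic has already been split into a flat list of vords.

```python
from collections import Counter


def tf_percent(tokens):
    """Return {term: share of all tokens in the document, as a percent}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: 100.0 * n / total for term, n in counts.items()}


# Toy "document" of EVA-style tokens, invented for the example.
sample = "daiin ol daiin chedy".split()
profile = tf_percent(sample)
```

Running tf_percent once per topic gives one frequency table per "document", which is the input to the comparison step.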

Each term's frequency was calculated per topic, and then the frequencies for each pair of topics were subtracted from each other to get the absolute difference. Afterwards, for visualization purposes, the results were z-score normalized: frequencies across topics, and differences across all differences within that topic. Differences were then totaled in order to give an overall idea of how "different" each topic is. This process was repeated with a 2 frequency cut-off just to see what impact unique terms had on the metrics, if any.
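
The comparison steps can be sketched roughly as follows. This is one possible reading of the description, not the original program: all function names are hypothetical, the z-score uses the population standard deviation (the post does not say which was used), and drop_rare interprets the "2 frequency cut-off" as keeping terms with a raw count of at least 2. The numbers in toy_tf are invented.

```python
from itertools import combinations
from statistics import mean, pstdev


def pairwise_abs_diff(tf_by_topic):
    """tf_by_topic: {topic: {term: tf percent}}.
    Returns {(topic_a, topic_b): {term: |tf_a - tf_b|}} for every topic pair."""
    vocab = set()
    for tf in tf_by_topic.values():
        vocab.update(tf)
    diffs = {}
    for a, b in combinations(tf_by_topic, 2):
        diffs[(a, b)] = {
            term: abs(tf_by_topic[a].get(term, 0.0) - tf_by_topic[b].get(term, 0.0))
            for term in vocab
        }
    return diffs


def total_difference(diffs):
    """Sum the per-term differences to get one number per topic pair."""
    return {pair: sum(per_term.values()) for pair, per_term in diffs.items()}


def z_scores(values):
    """Z-score normalize a list of numbers (population std, an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def drop_rare(counts, min_count=2):
    """Assumed meaning of the cut-off: keep terms seen at least min_count times."""
    return {term: n for term, n in counts.items() if n >= min_count}


# Invented toy frequencies for two topics, only to show the data shapes.
toy_tf = {
    "herbal": {"daiin": 50.0, "ol": 50.0},
    "starred": {"daiin": 25.0, "qokam": 75.0},
}
toy_diffs = pairwise_abs_diff(toy_tf)
toy_totals = total_difference(toy_diffs)
```

A term missing from a topic is treated as frequency 0 there, so unique terms contribute their full frequency to the difference.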

RESULTS

Topic Totals


Topic profiles top 40 terms.



Spreadsheet of all terms included as attachment.

  TOPIC_FREQUENCY.xlsx (Size: 1.25 MB / Downloads: 17)

Observations

Differences

Assuming the contents are indeed related to the illustrations, I went into this process expecting herbal to be least like astrological and most like pharmaceutical. While pharmaceutical does have less difference when compared to herbal, I wouldn't say there is strong evidence this is because the content is similar, as herbal seems to be fairly similar to nearly all the other topics. This included astrological, which surprised me.

I was also not expecting balneological to be so different from everything else, except the starred topic. These two were quite similar. Like herbal, the starred 'recipes' were similar to all the other topics, but curiously not pharmaceutical. I wonder if there is any evidence here that the starred section originally followed balneological.

For the most part, filtering unique words did not change the overall pattern, just exaggerated it. The one exception is that removing the unique words makes astrological appear more similar to pharmaceutical. To me this speaks to the text actually reflecting differing topics rather than gibberish, as I would have expected removing unique words to make the text universally more similar.

Observations - Term Patterns

While it might be interesting to look at individual vords, to keep this from becoming a huge block of text I will just illustrate some of the potential uses.



As an example, here are three comparisons. The first two are similar in structure, but by looking at their raw frequencies/histograms alone we might not be able to tell much from them. However, by creating topic frequency profiles we may be able to learn a bit more. For instance, cheol and sheol, as well as cheey and sheey, have fairly similar profiles, whereas daiin and ol have very different profiles. Could this indicate that C and S are prefixes which alter the meaning, but not significantly? As an analogy, consider purpose vs re-purpose: both words have a similar structure and a related meaning, but one would still be used more often in certain discussions. Of course I don't know, but it may be a different way to look at terms.

"Reading" the VMS

With some effort it may be possible to "read" passages of the VMS by looking at term topic profiles. What I present here is more a proof of concept than anything else. If any of this has any value, this process could be improved by better understanding and better parsing of the profiles, as well as potentially by classifying differences from std into high, median, and low categories.



<f106r.7,+P0> olched.qoiin.ychedy.qokam.sheol.qokor.cheees<$>

This becomes:

vord weighted towards balneological | vord unique to starred | vord weighted towards astrological and away from pharmaceutical | vord weighted away from pharmaceutical | vord weighted towards pharmaceutical and away from herbal | vord weighted against astrological | unique starred word
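
Labels like these could in principle be generated automatically from each vord's profile. The sketch below is only one way to do it: describe_vord, the thresholds of +1 and -1 standard deviations, and the example numbers are all assumptions made for illustration, not values from the spreadsheet.

```python
def describe_vord(z_by_topic, count_by_topic, hi=1.0, lo=-1.0):
    """Turn a vord's per-topic z-scores into a short text label.

    z_by_topic: {topic: z-score of the vord's frequency in that topic}
    count_by_topic: {topic: raw count}, used to detect single-topic vords
    hi / lo: assumed thresholds for "towards" and "away from"
    """
    present = [topic for topic, n in count_by_topic.items() if n > 0]
    if len(present) == 1:
        return f"vord unique to {present[0]}"
    towards = [topic for topic, z in z_by_topic.items() if z >= hi]
    away = [topic for topic, z in z_by_topic.items() if z <= lo]
    parts = []
    if towards:
        parts.append("towards " + " and ".join(towards))
    if away:
        parts.append("away from " + " and ".join(away))
    if not parts:
        return "vord with no strong topic weighting"
    return "vord weighted " + " and ".join(parts)


# Invented z-scores and counts, chosen only to demonstrate the output format.
example_z = {
    "herbal": -1.3,
    "astrological": 0.1,
    "balneological": 0.2,
    "pharmaceutical": 1.4,
    "starred": -0.4,
}
example_counts = {topic: 5 for topic in example_z}
label = describe_vord(example_z, example_counts)
```

Mapping describe_vord over the vords of a transcription line and joining the results with " | " would give a string in the same shape as the one above.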

It might then be possible to assign potential words to these weights by building profiles for known works that cover similar topics (assuming the contents are even related to the illustrations), so that a program could pick words with a similarly matching topic difference/frequency profile. For instance the above could become:

water. stirred. night. running. inside. dirt. abracadabra.

Obviously, I'm pulling these words from thin air to illustrate the concept, and I doubt word tables could be built for the balneological or starred sections, but maybe for herbal, pharmaceutical, and astrological. It's a stretch for sure.

Improvements?

Assuming any of this has value, I have written some questions I have been asking myself.

Is it important to differentiate by hand or scribe? Should all-text pages be separated into their own category or removed entirely?

Should I build and test a control?

Can I use this to build similar profiles for character n-grams to shed light on whether spaces are legit? For example, if ok and aiin have the same profile, it might indicate they are part of the same term (okaiin), but if they differ greatly it might indicate they are separate words.
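
A rough sketch of how that n-gram comparison might look is below. Everything in it is an assumption: the rate is counted per 100 vords, two profiles are compared by cosine similarity (which looks only at the shape of a profile, not its overall size), and the toy tokens are invented. Other distance measures could be equally reasonable.

```python
import math


def ngram_rate(tokens, gram):
    """Occurrences of a character n-gram per 100 tokens (non-overlapping matches)."""
    if not tokens:
        return 0.0
    return 100.0 * sum(token.count(gram) for token in tokens) / len(tokens)


def ngram_profile(tokens_by_topic, gram):
    """One rate per topic, in the order the topics appear in the dict."""
    return [ngram_rate(tokens, gram) for tokens in tokens_by_topic.values()]


def cosine(u, v):
    """Cosine similarity of two equal-length profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented tokens for two topics, only to show the calls.
toy_topics = {
    "herbal": ["okaiin", "daiin"],
    "starred": ["okaiin", "okaiin", "ol"],
}
similarity = cosine(ngram_profile(toy_topics, "ok"), ngram_profile(toy_topics, "aiin"))
```

A similarity near 1 would mean the two n-grams rise and fall together across topics, which is the situation described above for ok and aiin belonging to one term.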

Can I use word bi-grams and tri-grams to shed more light on their meanings? For example, how often does daiin ol show up in herbal vs balneological?
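
Counting word bi-grams per topic could be sketched like this. It assumes each topic is stored as a list of transcription lines, each line already split into vords, and that a bi-gram should not cross a line break; both are choices made for the example, and the toy lines are invented.

```python
from collections import Counter


def bigram_counts(lines):
    """lines: list of token lists, one per transcription line.
    Returns a Counter of adjacent vord pairs within each line."""
    counts = Counter()
    for tokens in lines:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def bigram_percent(lines, pair):
    """Share of all bi-grams in the topic taken up by one pair, as a percent."""
    counts = bigram_counts(lines)
    total = sum(counts.values())
    return 100.0 * counts[pair] / total if total else 0.0


# Invented lines for one topic, just to show the calls.
toy_lines = [
    ["daiin", "ol", "chedy"],
    ["qokam", "daiin", "ol"],
]
toy_counts = bigram_counts(toy_lines)
toy_share = bigram_percent(toy_lines, ("daiin", "ol"))
```

The per-topic percentages could then be fed through the same difference and z-score steps used for single vords.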
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

Koen G > 30-07-2024, 08:03 AM

Welcome to the forum!

This isn't my area of expertise, and I'm sure there will be others who can offer better insights. It does look to me though that similarities between sections may be due to Currier languages, so separating by Currier dialect may be necessary.
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

HermesRevived > 30-07-2024, 12:33 PM

This is a useful study precisely because it ignores Currier languages and scribes. Compiling a vocabulary by topic is very useful. I especially like weighting vords as you suggest:

"vord weighted towards balneological | vord unique to starred | vord weighted towards astrological and away from pharmaceutical | vord weighted away from pharmaceutical | vord weighted towards pharmaceutical and away from herbal | vord weighted against astrological | unique starred word"

That could be usefully developed I think. Thanks for sharing your research. I think it is worth going through carefully and teasing out the implications.
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

Emma May Smith > 30-07-2024, 03:58 PM

I welcome this kind of study and absolutely think it's valuable. I also think that there's a lot you can do with this data.

As suggested, breaking down the text by Currier language and by hand might add some further insight/clarity to the data, especially comparing between the two Herbals.

Using parts of words would be highly interesting. We tend to consider the structure of words quite rigid, though we acknowledge the difference in words across sections. So being able to show if/how word structure varies takes the analysis one step deeper. Some bigrams are already known to vary and there may be studies you can build upon.

(Also, will you be attending the Voynich Day talks this Sunday?)
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

A.Wilmarth > 30-07-2024, 11:16 PM

I appreciate the support and guidance on how I can take this study further, thank you all.

It seems it may be useful, at the very least, to run the program by Currier language and see whether and how this alters the results. I already have things set up to look at character n-grams. Looking for ways to confirm spaces and identify places where vords had been compressed for space was my original intent; it just spiraled out from there.

@Emma. I believe I missed the deadline for the talks this Sunday, but hopefully someone posts a recap or video. It will be exciting to see what is discussed.
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

ReneZ > 31-07-2024, 12:57 AM

The best known example (to me) of a study of topic words is:

Montemurro, Marcelo and Damian Zanette: Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis, PLoS ONE 8(6): e66344. doi:10.1371/journal.pone.0066344 (2013)
RE: How Different is Each Topic Each Other - Term Frequency Across Topic

RobGea > 31-07-2024, 04:28 PM

Nice work.

There is a slightly related thread here: "Categorizing the text-only pages".