(20-12-2025, 04:10 PM)oshfdk Wrote: then I see no way to meaningfully interpret the results of PCA at all.
First, there is nothing magic about PCA. The goal is to find a projection of the N-dimensional raw data space (the frequencies of N chosen digraphs or words on each page) onto a 2- or 3-dimensional space, so that we can visualize the clustering of pages by section, language, or whatever. PCA gives the 2 or 3 axes in N-space along which the cloud of points has maximum spread; but those are not necessarily the best axes for the purpose.
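To make that last point concrete, here is a minimal sketch on synthetic data (not Voynich data). It assumes two clusters of "pages" in a 3-feature space, with a large spread along one feature that both clusters share and a modest shift along another. The helper names `pca_axes` and `separation`, and the use of the line through the two centroids as the alternative axis, are choices made for this illustration only.

```python
import numpy as np

def pca_axes(X, k=2):
    """Return the k unit directions of maximum variance (as rows) for X (pages x features)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def separation(X, labels, axis):
    """Distance between the two cluster means along `axis`, in units of pooled std."""
    p = X @ axis
    a, b = p[labels == 0], p[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

# Synthetic data: feature 0 has a large spread common to both clusters,
# feature 1 carries the actual difference between the clusters.
rng = np.random.default_rng(0)
n = 500
labels = np.repeat([0, 1], n)
X = rng.normal(size=(2 * n, 3)) * np.array([3.0, 1.0, 1.0])
X[labels == 1, 1] += 4.0

pc1 = pca_axes(X, k=1)[0]

# Alternative axis (an assumption of this sketch): the line joining the two centroids.
diff = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
centroid_axis = diff / np.linalg.norm(diff)

sep_pca = separation(X, labels, pc1)
sep_centroid = separation(X, labels, centroid_axis)
print("separation along PC1:          ", round(sep_pca, 2))
print("separation along centroid axis:", round(sep_centroid, 2))
```

Here the first principal axis follows the shared spread along feature 0 and barely separates the clusters, while the centroid axis separates them clearly. With real page data the contrast may be less stark.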
Back in the mailing-list days I did such an analysis using the frequencies of 50 chosen words, or the frequencies of "elements" like a, y, iin, Ch, Che, etc. I picked the projection axes specifically to maximize the distance between the clouds of certain sections. The details are written up on a separate page. The results are broadly consistent with the posts above. Namely:
* Each section (counting Herbal-A and Herbal-B as separate sections) produces a relatively compact cluster that is visibly distinct from other clusters. Thus the A/B split is merely the most dramatic difference, but similar distinctions exist between sections within each language.
* It is not surprising that word frequencies are different in each section, even for the most common words, which may or may not be "function" words like "much", "is", "find", "good"; not to mention "content" words like "herb", "star", "blood", etc.
* If word frequencies change, digraph frequencies will change too, since they are determined by the digraphs that appear in the most common words. As I mentioned before, "rb" is probably much more frequent in an herbal text (Latin or English) than in a text about astrology. (Unless the latter talks a lot about "orbits"...)
* What is surprising is that (IIUC) Herbal-A differs from Herbal-B noticeably more than either differs from Bio or Stars. Thus, besides the difference of topic, we indeed have a difference of language or spelling (or encryption). Maybe the texts of Herbal-A and Herbal-B were taken from sources in two different dialects, or two very similar languages. Like Northwest Lower West Bavarian and Southwest Lower West Bavarian...
* Chances of getting a useful insight from these analyses will improve if one uses fewer data so as to reduce the number of factors that affect them. Like comparing Herbal-A and Herbal-B only, thus hopefully eliminating the "topic" factor. Then maybe one can figure out whether the difference is a change of spelling, or something else. Once one gets some insight on Herbal-A vs. Herbal-B, one can then consider what is happening in other sections.
* And the chances of obtaining useful insights will improve a lot if one uses the good old "scientific method": make a hypothesis, then devise the simplest and most effective way to test for it, and do that.
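Two of the points above can be tried on toy data. The sketch below first derives digraph frequencies from a word-count table, to show how "rb" follows the word "herb". It then runs a simple permutation test of the hypothesis "this digraph's frequency differs between two groups of pages". All word counts and per-page values are invented for illustration, and `digraph_freqs` and `permutation_p_value` are hypothetical helper names. The permutation test is just one simple choice of test, not something used in the analyses discussed here.

```python
from collections import Counter
import random

def digraph_freqs(word_counts):
    """Relative digraph frequencies implied by a word -> count table."""
    counts = Counter()
    for word, n in word_counts.items():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += n
    total = sum(counts.values())
    return {dg: c / total for dg, c in counts.items()}

# Toy word counts, invented for illustration only.
herbal = {"herb": 10, "root": 6, "is": 8, "good": 4}
astro = {"star": 10, "moon": 6, "is": 8, "good": 4}

rb_herbal = digraph_freqs(herbal).get("rb", 0.0)
rb_astro = digraph_freqs(astro).get("rb", 0.0)
print("rb frequency, herbal toy text:", round(rb_herbal, 3))
print("rb frequency, astro toy text: ", round(rb_astro, 3))

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Invented per-page frequencies of one digraph in two groups of pages.
group_a = [0.10, 0.12, 0.11, 0.13, 0.12, 0.10]
group_b = [0.02, 0.03, 0.01, 0.02, 0.03, 0.02]
p = permutation_p_value(group_a, group_b)
print("permutation p-value:", round(p, 4))
```

A small p-value only says the two groups differ on that one feature. It does not say whether the cause is topic, spelling, or something else, which is why restricting the comparison to two sections with the same topic matters.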
All the best, --stolfi