23-12-2025, 06:55 AM
(22-12-2025, 08:06 PM)oshfdk Wrote:
(22-12-2025, 07:49 PM)Jorge_Stolfi Wrote:
* And the chances of obtaining useful insights will improve a lot if one uses the good old "scientific method": make an hypothesis, then devise the simplest and most effective way to test for it, and do that.
I think this is the missing piece for me in all this discussion about PCA, Currier languages, etc. Without a hypothesis and clearly stated assumptions PCA cannot be an argument for or against anything. PCA results are just a bunch of numbers with zero explanatory power by themselves. What is more, I think a popular approach to the manuscript where someone would just run a bunch of statistical tests and then try finding an explanation for the results is akin to divination, with graphs and charts replacing the crystal ball and tarot cards, but essentially it's the same attempt of finding pieces of a coherent story in random data and then creating a plausibly looking narrative based on those.
The purpose of PCA isn't to "explain" anything. It's a data visualization tool. The purpose of PCA is to figure out how to optimally position a hyperdimensional flashlight relative to an N-dimensional point cloud so that the shadow the point cloud casts on a 2-D plane (or, by analogy, 3-D volume) captures the maximal amount of the variance in the data. It's like an X-ray -- an X-ray is a projection of the structure of a 3-D body onto a 2-D image in the same way that a PCA plot is a projection of the structure of an N-D point cloud onto a 2-D plane or 3-D volume. PCA plots don't "explain" anything any more than X-ray images "explain" anything -- they have to be interpreted.
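To make the flashlight picture concrete, here is a minimal sketch using only the Python standard library. It finds the first principal axis of a made-up 3-D point cloud by power iteration on the covariance matrix, then compares the variance of the shadow along that axis with the variance along each coordinate axis. The toy cloud and all function names are my own illustration; in practice you would call a library routine rather than hand-roll this.

```python
import random

def center(points):
    """Subtract the per-coordinate mean from every point."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    return [[p[j] - means[j] for j in range(dim)] for p in points]

def covariance(centered):
    """Sample covariance matrix of already-centered points."""
    n, dim = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def leading_axis(cov, iterations=100):
    """First principal axis (unit vector) by power iteration."""
    dim = len(cov)
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def variance_along(centered, axis):
    """Variance of the 1-D shadow cast onto the given unit axis."""
    shadow = [sum(a * b for a, b in zip(row, axis)) for row in centered]
    return sum(x * x for x in shadow) / (len(shadow) - 1)

# Toy cloud (invented data): stretched along the (1, 1, 0) diagonal.
rng = random.Random(0)
cloud = []
for _ in range(200):
    t = rng.gauss(0, 3)
    cloud.append([t + rng.gauss(0, 0.3), t + rng.gauss(0, 0.3), rng.gauss(0, 0.3)])

centered = center(cloud)
axis = leading_axis(covariance(centered))
pc1_variance = variance_along(centered, axis)
axis_variances = [variance_along(centered, e)
                  for e in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
print("first principal axis:", [round(x, 3) for x in axis])
print("variance along it:", round(pc1_variance, 2))
print("variance along x, y, z:", [round(v, 2) for v in axis_variances])
```

The axis it finds points along the diagonal the cloud was stretched on, and the shadow along that axis has more variance than the shadow along any single coordinate. That is all "optimal" means here. The output is still just a picture that has to be interpreted.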
I strongly suspect that when you complain about "PCA" here and in your previous posts #19 and #21 you aren't really complaining about PCA as such. In fact, I strongly suspect that when you complain about "PCA" you aren't even really complaining about applying PCA to the relative frequency distribution of the 40 most-common glyph bigrams (or in Rene's analysis, all the glyph bigrams) in the text. I strongly suspect that what you're really complaining about is analyzing the text in terms of relative bigram frequency behavior at all. I strongly suspect that because it would explain why you and Rafal are talking past each other in #19, #20, and #21, and why you and Stolfi are talking past each other in #39 and #40.
No one is suggesting that it's not a good idea to interrogate the data to test specific hypotheses. That doesn't make exploratory data analysis useless. You don't do exploratory data analysis to "explain" the data; you do exploratory data analysis to look for structure in the data. If you find structure in the data, then that structure (hopefully) usefully constrains the set of hypotheses it makes sense to test. For instance, if you want to look for potential cribs by comparing word usage in the Pharma pages and the Herbal pages, it probably makes sense to confine your analysis to the Herbal A pages that cluster (statistically and physically) with the Pharma pages. That's not something you'd know without doing the exploratory data analysis first.
As for whether there is value in analyzing the distribution of relative glyph bigram frequencies:
Here's a clearly stated assumption: The Currier transcription alphabet imposes an appropriate set of equivalence classes on the collections of ink strokes drawn on the vellum of MS 408.
Here's a hypothesis: The pages of MS 408 containing a big picture of a single plant have text drawn from a single underlying distribution in the relative frequency space of the 40 most common glyph bigrams in the text, whatever the unknown process generating that distribution may be.
Since my ability to visualize 40-D space isn't any better than anyone else's, I'll use PCA to optimally (in a specific sense) linearly project the data onto a 2-D plane. When we do that, what do we see? There are two clearly defined linearly separable clumps of points in the 2-D projection, which also means those two clumps of points are clearly defined and linearly separable in the full 40-D space. So that hypothesis is pretty clearly falsified.
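The step from "separable in the 2-D picture" to "separable in the full space" can be checked mechanically. If a line with normal w splits the projected points, then the hyperplane with normal P^T w splits the original points in exactly the same way, because w . (P x) = (P^T w) . x. The sketch below uses a made-up 2 x 4 projection matrix and made-up 4-D points, none of it manuscript data. PCA also centers the data first, which only shifts the offset c and does not change the argument.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(P, x):
    """Apply a linear projection given as a list of row vectors."""
    return [dot(row, x) for row in P]

def lift_normal(P, w):
    """Full-space hyperplane normal P^T w matching the line normal w."""
    dim = len(P[0])
    return [sum(w[k] * P[k][j] for k in range(len(P))) for j in range(dim)]

# Hypothetical 2 x 4 projection with orthonormal rows (stand-in for two PCA axes).
P = [[0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, 0.5, -0.5]]

# Invented 4-D relative-frequency vectors forming two clumps.
clump_a = [[0.40, 0.30, 0.20, 0.10], [0.45, 0.25, 0.20, 0.10]]
clump_b = [[0.10, 0.20, 0.30, 0.40], [0.10, 0.25, 0.25, 0.40]]

# A separating line in the 2-D projection: w . y > c.
w, c = [1.0, 0.0], 0.0
full_normal = lift_normal(P, w)

points = clump_a + clump_b
side_2d = [dot(w, project(P, x)) > c for x in points]
side_full = [dot(full_normal, x) > c for x in points]
print(side_2d)
print(side_full)
```

Both lists agree point for point, so a clean split in the projection is a clean split in the full space. The converse does not hold: clumps that overlap in a 2-D shadow may still be separable in the full space.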
You don't appear to find that result very interesting. I do, because it seems to me that if someone wants to try to decipher or translate the text it would be really, really useful to know up front if it was a smart idea to treat the entire text as a single homogeneous bucket of whatever-it-is.
Here's another hypothesis: There is no correlation between the scribal hand identified by Lisa Fagin-Davis as writing a given page with a big picture of a single plant and which of those two underlying distributions in relative bigram frequency space the text on that page belongs to. If we look at the data, what does that tell us? All the pages in one clump are written by her Scribe 1, and all the pages in the other are written by her Scribes 2, 3, and 5. So that hypothesis is pretty clearly falsified. That's something any theory about the history of the object should account for. It may not "explain" the history, but it certainly constrains it.
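Checking this kind of hypothesis is just a cross-tabulation of clump membership against scribe attribution. Here is a minimal sketch, assuming you already have a page-to-clump mapping and a page-to-scribe mapping. The page names and labels below are placeholders that only mimic the pattern described above; they are not real folio assignments.

```python
from collections import Counter

def cross_tabulate(clump_by_page, scribe_by_page):
    """Count pages for every (clump, scribe) combination."""
    table = Counter()
    for page, clump in clump_by_page.items():
        table[(clump, scribe_by_page[page])] += 1
    return table

def scribes_per_clump(table):
    """Set of scribes that appear in each clump."""
    result = {}
    for (clump, scribe), _count in table.items():
        result.setdefault(clump, set()).add(scribe)
    return result

# Placeholder labels (hypothetical, for illustration only).
clump_by_page = {"page01": "A", "page02": "A",
                 "page03": "B", "page04": "B", "page05": "B"}
scribe_by_page = {"page01": 1, "page02": 1,
                  "page03": 2, "page04": 3, "page05": 5}

table = cross_tabulate(clump_by_page, scribe_by_page)
scribes = scribes_per_clump(table)
shared = scribes["A"] & scribes["B"]
for (clump, scribe), count in sorted(table.items()):
    print(f"clump {clump}, scribe {scribe}: {count} page(s)")
print("scribes appearing in both clumps:", shared)
```

If no scribe shows up in both clumps, "no correlation" is hard to defend. With real page counts you would want a proper test of independence on the table rather than eyeballing it.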
Here's another hypothesis: the quantitative differences in relative bigram frequency between the text on the pages with a big picture of a plant on them and the text on the pages with pictures of strange women lying in ponds (but not distributing swords)* are larger than the typical quantitative differences in relative bigram frequencies between texts on different subjects in a common natural language. That is also a perfectly reasonable and perfectly testable hypothesis, although as far as I know no one has tested it. I really wish someone would, as it would put to bed one way or another a whole bunch of hand-wavey arguments about the text.
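For anyone who wants to try it, the test could be set up roughly like this: build a relative bigram frequency profile for each text, measure the distance between the two manuscript sections, and see where that distance falls among the distances between natural-language texts on different subjects. Everything specific below is my assumption: counting bigrams within words only, normalizing over a fixed bigram vocabulary, and using total variation distance. The strings are toy data, not transcriptions.

```python
from collections import Counter
from itertools import combinations

def bigram_profile(text, vocabulary):
    """Relative frequencies of the chosen bigrams, counted within words."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts[bg] for bg in vocabulary)
    return [counts[bg] / total if total else 0.0 for bg in vocabulary]

def distance(p, q):
    """Total variation distance between two profiles (an assumed metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def fraction_below(observed, baseline):
    """Share of baseline distances that are smaller than the observed one."""
    return sum(d < observed for d in baseline) / len(baseline)

# Toy stand-ins: two "sections" and three "natural-language texts".
vocabulary = ["ab", "ba", "cd", "dc"]
section_x = "abab baba abba"
section_y = "cdcd dcdc cddc"
baseline_texts = ["abab cdcd", "abba dccd", "baba dcdc"]

observed = distance(bigram_profile(section_x, vocabulary),
                    bigram_profile(section_y, vocabulary))
baseline_profiles = [bigram_profile(t, vocabulary) for t in baseline_texts]
baseline = [distance(p, q) for p, q in combinations(baseline_profiles, 2)]
share = fraction_below(observed, baseline)
print("observed section-to-section distance:", observed)
print("baseline distances:", [round(d, 3) for d in baseline])
print("share of baseline distances below observed:", share)
```

The toy numbers say nothing about the manuscript. The point is only that the comparison is a few dozen lines once you have the transcription and a reasonable baseline corpus with texts of comparable length.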
So I'm very comfortable defending the utility of analyses in relative bigram frequency space.
(* If you don't get the reference, you should watch Monty Python and the Holy Grail.)