(23-12-2025, 11:08 AM)dashstofsk Wrote: Am I right in understanding that these PCA plots are computed just on character pair frequencies?
Yes.
Quote:And that these plots are being used to judge how A-like or B-like certain pages are?
Not quite. Those plots are being used to judge how well whatever structure is visible when projecting the 40-D most-frequent-bigram (non-label) page data down onto the first 2 or 3 axes found by PCA corresponds with Currier's language assignments for the pages, and behold:
[attachment=13197]
Those two big separable clumps with a clear gap between them in the 3-D PCA projection are also two big separable clumps with a clear gap between them in the full 40-D space (because of how linear algebra works), and that structure would be there even if Prescott Currier had been hit by a bus on his way to the NSA meeting and never gave his talk.
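For anyone who wants to reproduce the general idea, here's a minimal sketch of the pipeline as described above -- per-page relative frequencies of the top-N bigrams, projected onto the first principal axes via SVD. The page texts are made-up stand-ins, not actual transcriptions, and N is tiny to match:

```python
from collections import Counter

import numpy as np

# Toy stand-ins for page transcriptions -- the real input would be each
# page's non-label text in some transcription alphabet.
pages = {
    "pageA1": "daiin chol chor cthy dain shol",
    "pageA2": "chol daiin cthor dar shor dain",
    "pageB1": "qokeedy qokedy chedy shedy qokain",
    "pageB2": "chedy qokeedy shedy qokain okedy",
}

def bigrams(text):
    """Within-word character bigrams (spaces break words)."""
    return [w[i:i + 2] for w in text.split() for i in range(len(w) - 1)]

# The N most frequent bigrams over the whole corpus define the feature
# space (40 in the post; 10 here because the toy corpus is tiny).
N = 10
overall = Counter(bg for t in pages.values() for bg in bigrams(t))
vocab = [bg for bg, _ in overall.most_common(N)]

# One row per page: relative frequency of each vocabulary bigram.
rows = []
for text in pages.values():
    c = Counter(bigrams(text))
    total = sum(c[bg] for bg in vocab) or 1
    rows.append([c[bg] / total for bg in vocab])
X = np.array(rows)

# PCA via SVD of the mean-centered matrix; keep the first 3 axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:3].T
print(proj.shape)  # (4, 3)
```

Even on this toy input the two A-flavored rows land near each other and away from the two B-flavored rows, for the same linear-algebra reason given above.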
As it happens, with the exception of either two or three of the 225 pages with some amount of non-label text on them, the A-language pages coincide with the clump on the left in that image and the B-language pages coincide with the clump on the right. A couple of points don't land in the expected cluster given their supposed Currier language label, but there is nothing in that plot to indicate distributions with significantly overlapping tails -- there is no continuous Bayesian likelihood of being "A-like" or "B-like", just the binary "in the clump on the left" or "in the clump on the right."
The exceptions, as discussed previously, are 1) the Scorpio page in the zodiac, and 2) either one or two of the three pages with text on the Scribe 3-written f58/65 bifolio that Currier(? D'Imperio?) classified as A-language, depending on whether you think the original language label for those pages was in error. It also has to be noted that there may well be a separating hyperplane in the 40-D space that correctly classifies all the pages, but that's a question that can only be answered with a different set of tools.
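On the separating-hyperplane question, here's a sketch of one such tool: the perceptron, which is guaranteed to converge to a separating hyperplane whenever one exists. The data below is synthetic (two well-separated 40-D Gaussian clouds standing in for the bigram vectors), just to show the mechanics:

```python
import numpy as np

def find_separating_hyperplane(X, y, epochs=1000, lr=0.1):
    """Perceptron training loop: if the labeled points are linearly
    separable it converges to (w, b) with sign(X @ w + b) == y for every
    point; otherwise it gives up after the epoch budget and returns None."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the plane)
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None

# Synthetic stand-in for 40-D bigram feature vectors with A/B labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+0.5, 0.1, (20, 40)),   # "language A" cloud
               rng.normal(-0.5, 0.1, (20, 40))])  # "language B" cloud
y = np.array([+1] * 20 + [-1] * 20)
result = find_separating_hyperplane(X, y)
print("separable" if result is not None else "not separated")
```

A soft-margin linear SVM would be the more robust choice on real data (it degrades gracefully when the classes aren't quite separable), but the perceptron is the shortest honest answer to "is there a separating hyperplane in the full space?"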
Quote:It seems to me that this might lead to wrong conclusions.
[...]
Surely you need to include some such additional measures alongside character pairs when deciding if groups of pages are related.
I'm going to respond with an imperfect analogy:
Suppose there's a covered cage in front of you containing a bunch of adult animals, and all you're told for each animal is (what % of surface area is covered with black hair or fur, what % of surface area is covered with white hair or fur). Looking at the data, you see there are two clearly defined clusters, one centered around (50%, 50%) and one centered close to (100%, 0%). The cover is pulled off the cage, and there are zebras and black bears in it, and unsurprisingly they match the two clusters and are fully separable in the feature space given.
Then I come along and say, "wait a minute, there are a whole host of other features you could measure those animals along -- weight, say, or nose-to-tail length." You add those features into the mix^1, do PCA, and all of a sudden discover that you can't separate zebras from black bears^2 when looking at the plot of the two dominant axes found by PCA. Is that because zebras can't be distinguished from black bears, or because you've added features that are inappropriate for separating them, and PCA still has to capture the overall variance those variables contribute?
Fn 1: correcting for issues relating to the relative scaling of the features -- I said it was an imperfect analogy
Fn 2: actually, according to Google's AI Overview the weight ranges for adult zebras and adult black bears don't really overlap (to my surprise, zebras are bigger) -- I said it was an imperfect analogy...
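The scaling issue footnote 1 waves at can be demonstrated directly. With a made-up dataset -- two tight, class-separating "fur color" features plus one large-scale feature carrying no class signal (the numbers are invented, not real zebra/bear weights) -- the first principal axis simply chases the big-variance column until you standardize:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two features that separate the classes cleanly, like the analogy's
# (% black fur, % white fur): clusters at (0.5, 0.5) and (0.98, 0.01).
color = np.vstack([rng.normal([0.50, 0.50], 0.03, (n, 2)),
                   rng.normal([0.98, 0.01], 0.03, (n, 2))])
# One added feature on a much larger numeric scale with no class signal
# here (made-up numbers).
weight = rng.normal(250.0, 80.0, (2 * n, 1))
X = np.hstack([color, weight])

def top_axis(M):
    """First principal axis (unit vector) of the mean-centered data."""
    Mc = M - M.mean(axis=0)
    return np.linalg.svd(Mc, full_matrices=False)[2][0]

# Unscaled, PCA's first axis locks onto the big-variance weight column...
print(np.abs(top_axis(X)).argmax())   # -> 2 (the weight column)

# ...after per-feature standardization the class-separating features
# dominate the first axis again.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.abs(top_axis(Xs)).argmax())  # -> 0 or 1 (a fur-color column)
```

The same caveat applies to mixing bigram frequencies with features on different natural scales: whether the interesting structure survives into the top PCA axes depends on how the features are normalized, not just on what they measure.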
I don't want you to think I just dismissed the plots you posted -- I didn't, and in fact I put about 10 hours into extending my feature generation code (in ways I had been meaning to do anyways), running tests, and looking at plots. There's a limit to how informed an opinion I can offer without doing a true replication, but I'll make a few observations:
* Extending the number of most-frequent bigrams from the top 40 to top 100 (at which point that hits ~96% of the bigrams without a non-Currier "weirdo" glyph in them), here's what I get for the results of applying PCA to Quires 8, 13 & 20. I haven't labeled the individual pages, but the f58/65 bifolio pages can be picked out from the f57r/f66 pages because they're labeled Herbal A rather than Herbal B:
[attachment=13195] [attachment=13196]
There are some differences, but that may be due to your using a different transcription alphabet (you didn't specify). Just looking at those plots, you can see how treacherous it is to judge where the herbal pages land relative to the Q20 pages based on the 2-D plot of the first two axes found by PCA -- looking at the 3-D plot, 5 of the 6 herbal pages land below the Quire 20 cloud (and the one that lands at the same vertical level as the Quire 20 points in the side view of the cloud is actually one of the f58 pages). The two pages you linked are only "in the middle of the quire 20 domain of points" in the 2-D projection.
* What you call "bridge pairs" are probably the most interesting of the additional features you included in the data your second plot was generated from, because all the other features relate to individual words. The 50 most common bridge pairs (over the whole mss, ignoring uncertain spaces, treating end-of-line/paragraph and drawing interruptions as spaces) cover 85% of the spaces in the mss where there isn't a non-Currier "weirdo" on either side. Adding those to the feature space produces the following:
[attachment=13198] [attachment=13199]
If anything, that pulls the set of five Herbal B pages even further from the Q20 pages while leaving the remaining one roughly where it was.
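For concreteness, here's how bridge pairs under that definition -- the glyph on each side of a word-separating space -- can be pulled from a transcription line (the line shown is just an illustration; real input would also need the uncertain-space and drawing-interruption handling described above):

```python
def bridge_pairs(line):
    """The last glyph of each word joined with the first glyph of the
    next word, i.e. the character pair straddling each space."""
    words = line.split()
    return [a[-1] + b[0] for a, b in zip(words, words[1:])]

print(bridge_pairs("daiin chol shol qokeedy"))  # ['nc', 'ls', 'lq']
```

Feeding each line separately into this function automatically treats end-of-line as a break rather than a space, which matches the convention stated above.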
* It's unlikely that any shift seen in your second plot (assuming it's not an artifact of the specific POV the cloud is being projected from) is a result of including relative word frequencies alongside the bigram frequencies. If you look at Rene's word-frequency-based analysis, you'll see he splits the B language into what he calls "Herbal-B", "Stars-B (low correlation with Bio-B)", "Stars-Bio (high correlation with Bio-B)," and "Biological-B." That seems pretty well aligned with the structure visible in the bigrams-only and bigrams+bridge-pairs plots (albeit I didn't look at actual page labels to verify).
* You mentioned including relative frequencies of prefixes & suffixes, but didn't describe how they were defined or which ones were included. I have the infrastructure to throw relative frequencies of word-initial and word-final bigrams in as features, but without knowing how that correlates with what you were doing there's not a lot of value in adding it into the mix here.
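For reference, here's what I mean by word-initial and word-final bigram features -- one possible concrete reading of "prefix/suffix", which may or may not match your definition. The input line is made up:

```python
from collections import Counter

def affix_bigrams(text):
    """Relative frequencies of word-initial and word-final bigrams,
    over words long enough to have both."""
    words = [w for w in text.split() if len(w) >= 2]
    n = len(words) or 1
    initials = Counter(w[:2] for w in words)
    finals = Counter(w[-2:] for w in words)
    return ({k: v / n for k, v in initials.items()},
            {k: v / n for k, v in finals.items()})

ini, fin = affix_bigrams("qokeedy chedy qokain daiin")
print(ini["qo"], fin["dy"])  # 0.5 0.5
```

If your prefixes/suffixes are longer or morphologically motivated rather than fixed-length, the resulting feature space would obviously differ, which is why I'd want the definitions before comparing plots.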