(19-02-2026, 12:42 AM)ReneZ Wrote: (18-02-2026, 07:53 PM)kckluge Wrote: Also, any posited gradual transition from the A language dialect pages to the B language dialect pages requires interpreting the label text of the Zodiac folios -- and only the label text -- as the stepping stones between the two (if you didn't follow the thread, see the discussion in [link unavailable]). The non-label text on the Zodiac pages (with one exception) falls firmly on the A language side of what I've come to think of as "Currier gulch" in the bigram distribution space.
I intended to follow up on this, but haven't gotten round to it yet.
The short summary is that I do not consider the tiny gap that one can see if one removes the labels to be a barrier between two distinct groups.
"Tiny" is a bit of a semantic bludgeon word, but you're entitled to your subjective judgement. All I can do is (1) point people to the view of the 3-D point cloud at [link unavailable] to judge for themselves whether it is "tiny", while (2) reminding you that there is substructure within the A and B language clouds which makes judging how "tiny" it is relative to the width of the overall A & B clouds somewhat misleading.
Quote:First of all I do not see a reason to ignore the labels.
Neither do I. I do, however, see a reason to exercise a certain degree of caution in lumping them in with non-label text, because their statistics don't look like the statistics of the non-label text on the same pages. Using the Zodiac pages from ZL_ivtff_1b.txt, converting to Currier, and ignoring uncertain spaces, the 40 most common within-word bigrams are:
OP AE CO CC OF PC AR FC 89 OE C9 SC AM O8 8A PA EA FA E9 C8 ZC SO RA OR O2 C2 AJ PO 2A CA 9F OC OB CF E8 FO AT 9P EO ZO
For the EVA fans out there, those correspond to:
ot al eo ee ok te ar ke dy ol ey che aiin od da ta la ka ly ed she cho ra or os es am to sa ea yk oe op ek ld ko air yt lo sho
Those account for 85% of the (Currier) text on the pages (not counting bigrams that include "weirdo" glyphs).
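For anyone who wants to reproduce this kind of count, here is a minimal sketch of within-word bigram counting and top-k coverage. It is not the code used for the numbers above. It assumes a transliteration with one character per glyph (as in Currier), and the function names and toy words are made up for illustration.

```python
from collections import Counter

def within_word_bigrams(words):
    """Count adjacent glyph pairs inside each word (never across a space)."""
    counts = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            counts[a + b] += 1
    return counts

def top_k_coverage(counts, k):
    """Return the k most common bigrams and the fraction of bigram tokens they cover."""
    total = sum(counts.values())
    top = counts.most_common(k)
    covered = sum(n for _, n in top)
    return [bigram for bigram, _ in top], (covered / total if total else 0.0)

# Hypothetical toy input, not real transliteration data.
words = "OPAE 8AM OPAE SC89".split()
counts = within_word_bigrams(words)
top, coverage = top_k_coverage(counts, 2)
print(top, round(coverage, 3))
```

With the real page text, calling `top_k_coverage(counts, 40)` would give the feature list and the coverage figure, subject to how "weirdo" glyphs and uncertain spaces are filtered beforehand.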
Plotting the top three axes returned by PCA applied to their relative frequencies on a given page, here's how the label and non-label text on the Zodiac pages compare:
I think it's fair to say the plot is more than suggestive that Zodiac page label text is not sampling from the same distribution as non-label Zodiac page text.
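The projection step can be sketched as follows. This is one possible way to do it, assuming plain PCA via SVD on mean-centred relative frequencies with NumPy, and is not necessarily how the plots in this post were produced. The page counts, the `"other"` key standing in for all non-feature bigrams, and the function names are hypothetical.

```python
import numpy as np

def relative_frequencies(page_counts, features):
    """Rows are pages, columns are bigram features.

    Each count is divided by the total number of bigrams on that page
    (all bigrams, not only the selected features).
    """
    X = np.array([[c.get(f, 0) for f in features] for c in page_counts], dtype=float)
    totals = np.array([sum(c.values()) for c in page_counts], dtype=float)
    return X / totals[:, None]

def pca_project(X, n_components=3):
    """Project rows of X onto the top principal axes of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical counts for four "pages"; real input would have 40 features.
features = ["OP", "AE", "89"]
pages = [
    {"OP": 12, "AE": 9, "89": 2, "other": 27},
    {"OP": 10, "AE": 11, "89": 3, "other": 26},
    {"OP": 3, "AE": 4, "89": 14, "other": 29},
    {"OP": 2, "AE": 5, "89": 12, "other": 31},
]
coords = pca_project(relative_frequencies(pages, features))
print(coords.round(3))
```

Each row of `coords` is one page (or one label/non-label subset of a page) in the three-axis view. The sign of each axis is arbitrary, so only relative positions matter.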
*CORRECTION*: after I originally posted this I realized there was a subtle potential source of bias in what was shown here -- the relative frequency of bigram XY is computed using the total number of bigrams on the page, which currently includes bigrams containing a space if spaces have been left in the text. As a result, if you had a page of nothing but labels and a page of running text with exactly the same words, the relative frequency of XY would look lower on the page with running text. For quicker turnaround, rather than add logic to my frequency code to exclude bigrams containing a space from the counts, I just reran the analysis described above with the same set of bigram features but with spaces removed from the text. Here's what those results look like:
The labels still don't look like they're sampling from the same distribution.
End of the correction...
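To make the bias described in the correction concrete, here is a small toy demonstration. It shows the denominator effect and the alternative fix of excluding space-containing bigrams, which is not the fix actually used in the rerun (that one removed spaces from the text). The words and function name are hypothetical.

```python
def rel_freq(lines, target, include_space_bigrams):
    """Relative frequency of `target` among adjacent character pairs.

    Pairs are formed within each line only; a label is treated as its own line.
    """
    pairs = []
    for line in lines:
        pairs.extend(line[i:i + 2] for i in range(len(line) - 1))
    if not include_space_bigrams:
        pairs = [p for p in pairs if " " not in p]
    return pairs.count(target) / len(pairs)

words = ["OPAE", "OPAM", "8AR"]    # hypothetical words
label_page = words                 # three separate labels, no spaces
running_page = [" ".join(words)]   # the same words as one line of running text

print(rel_freq(label_page, "OP", True))     # 2 of 8 pairs
print(rel_freq(running_page, "OP", True))   # 2 of 12 pairs: looks lower
print(rel_freq(running_page, "OP", False))  # 2 of 8 pairs again
```

Note that stripping spaces instead of excluding them turns each word boundary into a cross-word bigram (for example the end of OPAE and the start of OPAM would give EO), so the two fixes are not numerically identical even though both remove the denominator mismatch.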
Quote:Secondly, the gap is much, much smaller than the size of the cloud.
Without going into an image editor and using some sort of measuring tool, I'd say the gap looks roughly somewhere between a quarter and a third of the width of the overall A page cloud in that view.
Quote:Thirdly, I remember seeing points in the same colour on both sides of the gap.
One exception is Zodiac page f73v; the other two exceptions are [link unavailable] & f65v. f58 & f65 are the halves of a bifolio that Lisa Fagin Davis identifies as by Scribe 3; [link unavailable] has a plant drawing with no text other than a 2-3 word label. Traditionally those bifolio pages have been labelled as A Language, which would make this the only known non-Scribe 1 Herbal A bifolio. It is plausible that the f58 & f65 bifolio pages are B language pages with atypical relative frequencies of the small number of key bigrams used to make the initial A/B classification by Currier (in which case [link unavailable] becomes the outlier grouped with the Herbal A pages).
So, depending on where you come down on the "what is going on with the f58/f65 bifolio?" question, there are either 2 or 3 pages on the wrong side of the A/B split.
Quote:All this is made difficult by the inherent noise in the data (statistics based on relatively small sample sizes).
In the case of the herbal pages (for example), absolutely -- they're going to have higher variance just due to the smaller number of bigrams on a page relative to the balneological or starred paragraph pages.
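As a rough guide to how much noisier a short page is, here is a sketch using the binomial standard error of a relative frequency. It assumes bigrams are independent draws, which real text is not, so treat it as an order-of-magnitude estimate only. The frequency and page sizes are hypothetical.

```python
import math

def rel_freq_standard_error(p, n):
    """Binomial standard error of a relative frequency p estimated from n bigrams."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical numbers: a bigram with true relative frequency 5%,
# on a short page of 200 bigrams versus a long page of 3200 bigrams.
se_small = rel_freq_standard_error(0.05, 200)
se_large = rel_freq_standard_error(0.05, 3200)
print(se_small, se_large, se_small / se_large)
```

Under this model the spread scales as one over the square root of the page length, so a page with 16 times fewer bigrams has about 4 times the standard error on each feature.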