The Voynich Ninja - "The Currier languages revisited" revisited


(24-12-2025, 03:13 PM)Jorge_Stolfi Wrote: Map all instances of k t p f K T P F to f. (Why not t? Because [link] would become t17v... )

You can use bitrans, which would know not to change the metadata in the file.

(25-12-2025, 12:11 AM)ReneZ Wrote: You can use bitrans, which would know not to change the metadata in the file.

Yes, of course. Or use three lines of script instead of one line. :D

All the best, and Merry Christmas, --stolfi
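For readers who want the script route, here is a minimal Python sketch of the gallows mapping discussed above. It assumes (this is not stated in the thread) that all metadata, such as locus identifiers like f17v and inline tags, sits inside angle brackets and that comment lines start with '#'. The name map_gallows is hypothetical, and this is not a description of how bitrans works internally.

```python
import re

# k t p K T P F all become f; f itself is already the target.
GALLOWS_TO_F = str.maketrans("ktpKTPF", "fffffff")
# Assumed metadata syntax: anything enclosed in <...>.
TAG = re.compile(r"(<[^>]*>)")

def map_gallows(line: str) -> str:
    """Map gallows to f in the text only, leaving <...> tags untouched."""
    if line.startswith("#"):  # assumed comment-line convention
        return line
    parts = TAG.split(line)   # odd indices are the captured tags
    return "".join(
        part if i % 2 else part.translate(GALLOWS_TO_F)
        for i, part in enumerate(parts)
    )

print(map_gallows("<f17v.1,@P0>fachys.ykal.ar<->otaiin"))
```

Splitting on the tags rather than translating the whole line is what keeps the 'P' in a tag such as `@P0` from being rewritten.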

(23-12-2025, 11:08 AM)dashstofsk Wrote: Am I right in understanding that these PCA plots are computed just on character pair frequencies?

Yes.

Quote:And that these plots are being used to judge how A-like or B-like certain pages are?

Not quite. Those plots are being used to judge how well whatever structure is visible in the pages when projecting the 40-D most-frequent bigram (non-label) data down onto the first 2 or 3 axes found by PCA corresponds with Currier's assignment of pages, and behold:

[attachment=13197]

Those two big separable clumps with a clear gap between them in the 3-D PCA projection are also two big separable clumps with a clear gap between them in the full 40-D space (because of how linear algebra works), and that structure would be there even if Prescott Currier had been hit by a bus on his way to the NSA meeting and never given his talk.
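The "how linear algebra works" step can be spelled out as follows. The symbols are notation introduced here, not taken from the thread: W holds the three PCA axes as orthonormal columns, x and y are any two 40-D page vectors, and (a, b) defines a plane in the 3-D projection. The centering that PCA applies cancels in the difference and only shifts the offset b.

```latex
\[
\lVert W^{\top}x - W^{\top}y \rVert_2
  = \lVert W^{\top}(x-y) \rVert_2
  \le \lVert x-y \rVert_2,
\qquad W \in \mathbb{R}^{40\times 3},\quad W^{\top}W = I_3 .
\]
\[
a^{\top}\!\left(W^{\top}x\right) > b
  \iff (Wa)^{\top}x > b .
\]
```

The first line says a projection can only shrink distances, so a gap seen in the projection is at least as wide in the full space. The second says any plane that splits the projected points is also a hyperplane that splits the original ones.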

As it happens, with the exception of either two or three of the 225 pages that have some amount of non-label text on them, the A language pages coincide with the clump on the left in that image, and the B language pages coincide with the clump on the right. There are a couple of points that don't land in the expected cluster given their supposed Currier language label, but there is nothing in that plot to indicate distributions with significantly overlapping tails -- there is no continuous Bayesian likelihood "A-like" or "B-like" value, there's binary "in the clump on the left" or "in the clump on the right."

The exceptions, as discussed previously, are 1) the Scorpio page in the zodiac, and 2) either one or two of the three pages with text on the Scribe 3-written f58/65 bifolio that Currier(? D'Imperio?) classified as A-language, depending on whether you think the original language label for those pages was in error. It also has to be noted that there may well be a separating hyperplane in the 40-D space that correctly classifies all the pages, but that's a question that can only be answered with a different set of tools.

Quote:It seems to me that this might lead to wrong conclusions.

[...]

Surely you need to include some such additional measures alongside character pairs when deciding if groups of pages are related.

I'm going to respond with an imperfect analogy:

Suppose there's a covered cage in front of you containing a bunch of adult animals, and all you're told for each animal is (what % of surface area is covered with black hair or fur, what % of surface area is covered with white hair or fur). Looking at the data, you see there are two clearly defined clusters, one centered around (50%, 50%) and one centered close to (100%, 0%). The cover is pulled off the cage, and there are zebras and black bears in it, and unsurprisingly they match the two clusters and are fully separable in the feature space given.

Then I come along and say, "wait a minute, there are a whole host of other features you could measure those animals along -- weight, say, or nose-to-tail length." You add those features into the mix^1, do PCA, and all of a sudden discover that you can't separate zebras from black bears^2 when looking at the plot of the two dominant axes found by PCA. Is that because zebras can't be distinguished from black bears, or is that because you've added inappropriate features to separate them into the mix but PCA still has to capture the overall variance contributed by those variables?

Fn 1: correcting for issues relating to the relative scaling of the features -- I said it was an imperfect analogy
Fn 2: actually, according to Google's AI Overview the weight ranges for adult zebras and adult black bears don't really overlap (to my surprise, zebras are bigger) -- I said it was an imperfect analogy...

I don't want you to think I just dismissed the plots you posted -- I didn't, and in fact I put about 10 hours into extending my feature generation code (in ways I had been meaning to do anyways), running tests, and looking at plots. There's a limit to how informed an opinion I can offer without doing a true replication, but I'll make a few observations:

* Extending the number of most-frequent bigrams from the top 40 to top 100 (at which point that hits ~96% of the bigrams without a non-Currier "weirdo" glyph in them), here's what I get for the results of applying PCA to Quires 8, 13 & 20. I haven't labeled the individual pages, but the f58/65 bifolio pages can be picked out from the f57r/f66 pages because they're labeled Herbal A rather than Herbal B:

[attachment=13195][attachment=13196]

There are some differences, but that may be due to your using a different transcription alphabet (you didn't specify). Just looking at those plots, you can see how treacherous it is to judge where the herbal pages land relative to the Q20 pages based on the 2-D plot of the 1st two axes found by PCA -- looking at the 3-D plot, 5 of the 6 Herbal pages land below the Quire 20 cloud (and the one that lands vertically with the Quire 20 points in the side view of the cloud is actually one of the f58 pages). "[link] and [link]" are only "in the middle of the quire 20 domain of points" in the 2-D projection.
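As a rough illustration of the "top N bigrams and their coverage" step mentioned above, here is a minimal sketch. It is not the code used for the plots: the function names are hypothetical, each glyph is assumed to be a single character, bigrams are counted within words only, and the filtering of labels and "weirdo" glyphs is left out.

```python
from collections import Counter

def bigram_counts(words):
    """Count within-word character bigrams for a list of words."""
    counts = Counter()
    for w in words:
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts

def top_bigrams(pages, n):
    """Return the n most frequent bigrams over all pages and the
    fraction of all bigram tokens they cover."""
    total = Counter()
    for words in pages.values():
        total.update(bigram_counts(words))
    top = [bg for bg, _ in total.most_common(n)]
    covered = sum(total[bg] for bg in top)
    return top, covered / sum(total.values())

def page_features(words, top):
    """Relative frequency of each top bigram on one page
    (normalised by all bigrams on the page, an assumption)."""
    counts = bigram_counts(words)
    n = sum(counts.values())
    return [counts[bg] / n if n else 0.0 for bg in top]

pages = {"p1": ["daiin", "chol"], "p2": ["qokeedy", "daiin"]}
top, coverage = top_bigrams(pages, 4)
print(top, round(coverage, 3))
```

Each page then becomes one row of length N, which is the kind of matrix a PCA would be run on.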

* What you call "bridge pairs" are probably the most interesting of the additional features you included in the data your second plot was generated from because all the other features relate to individual words. The 50 most common bridge pairs (over the whole mss, ignoring uncertain spaces, treating end-of-line/paragraph and drawing interruptions as spaces) cover 85% of the spaces in the mss where there isn't a non-Currier "weirdo" on either side. Adding those to the feature space produces the following:

[attachment=13198][attachment=13199]

If anything, that pulls the set of five Herbal B pages even further from the Q20 pages while leaving the remaining one roughly where it was.
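For concreteness, a bridge pair as used above can be sketched as the last glyph of one word joined to the first glyph of the next. The snippet below is an illustration under stated assumptions, not the feature code behind the plots: glyphs are single characters, line ends count as ordinary spaces, and the handling of uncertain spaces and "weirdo" glyphs is omitted. The function names are hypothetical.

```python
from collections import Counter

def bridge_pairs(lines):
    """Count (last glyph, first glyph) pairs across every word gap.
    `lines` is a list of lines, each a list of words; a line end is
    treated as a space, so pairs also span consecutive lines."""
    words = [w for line in lines for w in line if w]
    return Counter(a[-1] + b[0] for a, b in zip(words, words[1:]))

def coverage(counts, n):
    """Fraction of all word gaps covered by the n most common pairs."""
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(n)) / total if total else 0.0

counts = bridge_pairs([["daiin", "chol"], ["qokeedy"]])
print(counts, coverage(counts, 1))
```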

* It's unlikely any shift seen in your second plot (assuming it's not an artifact of the specific POV the cloud is being projected relative to) is a result of adding relative word frequencies to the bigram frequencies. If you look at Rene's word frequency based analysis at [link] you'll see he splits the B language into what he calls "Herbal-B", "Stars-B (low correlation with Bio-B)", "Stars-Bio (high correlation with Bio-B)," and "Biological-B." That seems pretty well aligned with the structure visible in the bigrams-only and bigrams+bridge pairs plots (albeit I didn't look at actual page labels to verify).

* You mentioned including relative frequencies of prefixes & suffixes, but didn't describe how they were defined or which ones were included. I have the infrastructure to throw relative frequencies of word-initial and word-final bigrams in as features, but without knowing how that correlates with what you were doing there's not a lot of value in adding it into the mix here.

(28-12-2025, 09:30 AM)kckluge Wrote: there is no continuous Bayesian likelihood "A-like" or "B-like" value, there's binary "in the clump on the left" or "in the clump on the right."

Well, that is a rather binary way to put it. :D

Another way to describe it is that the pages of each section (counting Herbal-A and Herbal-B as two sections) comprise a broad cloud of points. We can approximate each of these clouds by an N-dimensional Gaussian distribution, each with its own mean, eigenvectors, and eigenvalues. Then we find that, when projected along the main overall PCA axes, some of the clouds have substantial overlap, while some are well separated by several standard deviations. Specifically, there are two main classes of sections, A and B, such that each A cloud is well-separated from every B cloud.

That is important information, but one should also note that the clouds in each set are clearly distinct as well -- just not as distinct as the A-clouds from B-clouds. Thus the A-B "language" distinction is only a matter of degree; not necessarily a "fundamentally different difference". Not necessarily more significant than the Bio-Starred difference.

Moreover, as I explained before, if two clouds are well-separated, that will not necessarily show up in the PCA projection. Even if the PCA is run on those two clouds only. PCA finds the directions along which the set of all points has maximum spread; not the directions along which it can be split into separate clusters.
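A toy 2-D example makes this point concrete. The data below are invented for illustration (two flat, elongated clouds one unit apart), and the helper names are hypothetical; the principal axis is found with the closed-form angle for a 2x2 covariance matrix rather than a library call.

```python
import math

def principal_axes(points):
    """Return (pc1, pc2) unit vectors for 2-D points, using the
    closed-form eigenvectors of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of max variance
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    return pc1, pc2

def project(points, axis):
    """Coordinate of each point along a unit axis."""
    return [x * axis[0] + y * axis[1] for x, y in points]

# Two clouds, long in x, cleanly separated in y.
cloud_a = [(x, 0.5) for x in range(-10, 11)]
cloud_b = [(x, -0.5) for x in range(-10, 11)]
pc1, pc2 = principal_axes(cloud_a + cloud_b)

print("PC1:", pc1)
print("A on PC1:", min(project(cloud_a, pc1)), max(project(cloud_a, pc1)))
print("B on PC1:", min(project(cloud_b, pc1)), max(project(cloud_b, pc1)))
```

Here the first principal axis runs along the long direction, where the two clouds overlap completely; the split is only visible along the second, low-variance axis.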

Thus the question remains of how well separated are (say) Bio and Starred, or Pharma and Herbal-A.

Another point that should be stressed is that one should use statistics of words, not letters and digraphs. The frequency of a letter or digraph is determined by the common words that contain it. Thus letter and digraph statistics are arbitrary projections of word statistics. Analyzing languages by their digraph statistics is like studying zoology by considering only the colors of the animals' fur. Sure, it will show that zebras are very different from black bears and ravens, but...

All the best, --stolfi

(28-12-2025, 09:30 AM)kckluge Wrote: your using a different transcription alphabet

I use the GC transliteration for my analysis work. 101-C characters I convert to ee. I only used paragraph text, including 'Pb' text. Otherwise no labels, radial or circular text.

(28-12-2025, 09:30 AM)kckluge Wrote: relative frequencies of prefixes & suffixes, but didn't describe how they were defined

The prefixes I used were the most frequent prefixes to words on the pages labelled as language B. A word such as okeedy would contribute 5 prefixes to the list: o, ok, oke, okee, okeed. There is no harm in adding the longer strings since they can be expected to appear low in the frequency list. Similarly, suffixes to contribute to the list would be y, dy, edy, eedy, keedy.
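The definition above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the poster's actual code; the function names are hypothetical and each character is treated as one glyph.

```python
from collections import Counter

def prefixes(word):
    """All proper prefixes, shortest first: okeedy -> o, ok, ..., okeed."""
    return [word[:i] for i in range(1, len(word))]

def suffixes(word):
    """All proper suffixes, shortest first: okeedy -> y, dy, ..., keedy."""
    return [word[i:] for i in range(len(word) - 1, 0, -1)]

def most_frequent_affixes(words, n):
    """The n most frequent prefixes and suffixes over a list of word tokens."""
    pre, suf = Counter(), Counter()
    for w in words:
        pre.update(prefixes(w))
        suf.update(suffixes(w))
    return pre.most_common(n), suf.most_common(n)

print(prefixes("okeedy"))
print(suffixes("okeedy"))
```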

Other measures that might be useful to try: frequency of long words (6 or 7 GC-101 characters), frequency of words containing e or some other character, ratios of t to k, ratios of iin to in.

In my opinion the PCA plots do have some limitations. I just wanted to highlight that they could lead to wrong conclusions. Correlation maps such as the ones I gave on [link] are more useful to me to visualise how closely pages relate to each other.

(28-12-2025, 05:05 PM)dashstofsk Wrote: I use the GC transliteration for my analysis work. [...] Correlation maps such as the ones I gave on [link] are more useful to me to visualise how closely pages relate to each other.

You're absolutely right that heat maps are a better way to visualize the structure of pairwise distances in a higher dimensional space. The split in the starred paragraph section pops out nicely in the plots there, with one set of pages looking fairly Quire-13-like.
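As a sketch of what sits behind such a heat map, the snippet below computes a page-by-page Pearson correlation matrix from per-page feature vectors. How the maps referred to above were actually computed is not stated in the thread, so the choice of Pearson correlation, the function names, and the toy feature vectors are all assumptions.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length feature vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def correlation_matrix(features):
    """Return page names and the square matrix of pairwise correlations."""
    names = list(features)
    matrix = [[pearson(features[a], features[b]) for b in names] for a in names]
    return names, matrix

# Invented feature vectors, for illustration only.
features = {"p1": [1, 2, 3], "p2": [2, 4, 6], "p3": [3, 2, 1]}
names, matrix = correlation_matrix(features)
for name, row in zip(names, matrix):
    print(name, [round(r, 2) for r in row])
```

Unlike a 2-D or 3-D PCA projection, every pairwise relation is kept in the matrix, which is why blocks of similar pages tend to stand out when it is drawn as a heat map.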

You may want to do more preprocessing of the v101 transcription -- Glen's intent in having (for example) '9' and '(' variants for EVA 'y', or '7' and '8' for EVA 'd' (or, by my count, 6 variants of EVA 'r') wasn't to claim those were actually different glyphs, he was just giving you the option.

(29-12-2025, 07:56 AM)kckluge Wrote: You may want to do more preprocessing of the v101 transcription -- Glen's intent in having (for example) '9' and '(' variants for EVA 'y', or '7' and '8' for EVA 'd' (or, by my count, 6 variants of EVA 'r') wasn't to claim those were actually different glyphs, he was just giving you the option.

I already do that. Here are all my conversions.

101 conversions

(29-12-2025, 10:21 AM)dashstofsk Wrote: I already do that. Here are all my conversions.

There's a lot more. I have 12 variants of Sh, 7 s, 5 d in my conversion table.

(29-12-2025, 01:25 PM)nablator Wrote: There's a lot more.

Some of those characters are very rare and they probably will not contribute much to our understanding if left as they are.

(18-02-2026, 07:53 PM)kckluge Wrote: Also, any posited gradual transition from the A language dialect pages to the B language dialect pages requires interpreting the label text of the Zodiac folios -- and only the label text -- as the stepping stones between the two (if you didn't follow the thread, see the discussion in [link]). The non-label text on the Zodiac pages (with one exception) falls firmly on the A language side of what I've come to think of as "Currier gulch" in the bigram distribution space.

I intended to follow up on this, but haven't gotten round to it yet.
The short summary is that I do not consider the tiny gap that one can see if one removes the labels to be a barrier between two distinct groups.

First of all I do not see a reason to ignore the labels.
Secondly, the gap is much, much smaller than the size of the cloud.
Thirdly, I remember seeing points in the same colour on both sides of the gap.
All this is made difficult by the inherent noise in the data (statistics based on relatively small sample sizes).

Hopefully I can find the time and the priority to clarify this.
