"The Currier languages revisited" revisited - Printable Version +- The Voynich Ninja (https://www.voynich.ninja) +-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html) +--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html) +--- Thread: "The Currier languages revisited" revisited (/thread-5153.html) |
"The Currier languages revisited" revisited - kckluge - 19-12-2025 One page on Rene's site (You are not allowed to view links. Register or Login to view.) does a bigram frequency level analysis of the pages in the manuscript. This post is specifically addressing the section starting with "Language characteristics". The analysis on that part of the page: * uses Rene's CUVA alphabet to deal with EVA's oversegementation of the glyphs (You are not allowed to view links. Register or Login to view.) * removes uncertain spaces from the transcription, but leaves other spaces * only looks at bigrams within words, not bigrams straddling spaces * starts with a feature space corresponding to the relative frequencies of all 355 CUVA bigrams that occur, then does a dimensionality reduction similar (but not identical) to Principle Components Analysis (PCA -- You are not allowed to view links. Register or Login to view. describes PCA) Plots are shown for the dominant vector vs. the 2nd through 4th vectors found by his method. On the basis of those plots he concludes, "When Currier identified his languages A and B, he did this on the basis of the different statistics of the initial herbal pages in the MS, which are identified by the red ('A') and dark blue ('B') crosses. It is clear that these have distinct properties - the clouds do not overlap. He also checked the other pages, and noted more variations, but his criteria for distinguishing the languages did not allow him to see that the overall statistics demonstrate that there is a continuum, and the other (not herbal) pages actually 'bridge the gap'." It is important to be careful about drawing conclusions from linear projections of higher dimensional data onto lower dimensional spaces. 
If two clumps of points are separable in the lower dimensional projection then they are also separable in the full dimensional space, but the inverse is not true -- two clumps of points that overlap in some projection do not necessarily overlap in the full space.

To examine Rene's conclusion I performed a variation of the analysis described above:

* the Currier alphabet is used rather than CUVA, translated from the ZL_ivtff_1b.txt EVA transcription (when multiple proposed readings are given for a glyph, the first option is used)
* uncertain spaces are removed, as in the original experiment
* only lines corresponding to running paragraph and "circular" text are used -- no radial text from diagrams and no labels
* only the 40 most common bigrams are used -- in Currier these are:
  89 OF OE 4O CC C8 SC 8A C9 AM FC OP CO AR FA AE OR ZC SO O8 PC AN PA EF FS ZO PS S9 ES RA S8 9F AJ BS F9 FO PO 2A 9P EO
  which correspond to EVA:
  dy ok ol qo ee ed che da ey aiin ke ot eo ar ka al or she cho od te ain ta lk kch sho tch chy lch ra chd yk am pch ky ko to sa yt lo
* bigrams including spaces (with end-of-line, end-of-paragraph, and plant drawing gaps counted as spaces) are included in the total bigram count for a page when computing relative bigram frequencies for the page

The 40 Currier bigrams listed above cover 83% of the bigrams that don't include a space or an untranslatable/transcribed non-Currier "weirdo".

Applying PCA, the first two dimensions found capture 48% of the covariance in the 40-D data. The resulting plot is:

With the exception of three pages, the Herbal B, Bio, Starred paragraph, and Rose foldout pages fall together in one cluster and the Herbal A, Astro, Zodiac, and Pharma pages fall together in another cluster, separated by a clear diagonal gap. One exception is Zodiac page f73v; the other two exceptions are [link omitted] & f65v.
f58 & f65 are the halves of a bifolio that Lisa Fagin Davis identifies as by Scribe 3; [link omitted] has a plant drawing with no text other than a 2-3 word label. Traditionally those bifolio pages have been labelled as A language, which would make this the only known non-Scribe 1 Herbal A bifolio. It is plausible that the f58 & f65 bifolio pages are B language pages with atypical relative frequencies of the small number of key bigrams Currier used to make the initial A/B classification (in which case [link omitted] becomes the outlier grouped with the Herbal A pages).

The differences between the analyses are:

* use of CUVA vs Currier
* inclusion of radial and label text elements vs only running paragraph and "circular" diagram text
* starts with a 355-D space (all bigram frequencies) vs a 40-D space (only the most common, corresponding to 83% of the glyph bigram pairs in the text)
* dimensionality reduction using a heuristic PCA-like method rather than PCA

The lack of clear separation between the A and B languages in Rene's plots is most likely due to a combination of very low frequency bigrams adding noise to the data and a suboptimal choice of basis vectors by his dimensionality reduction method.

RE: "The Currier languages revisited" revisited - Koen G - 19-12-2025

Very interesting and comprehensible analysis, thank you. Quire 8 pages misbehaving comes as no surprise. I wonder if the cloud of Zodiac pages can be seen as somewhat bridging the gap? Maybe it behaves differently because it's purely circular text? Likewise, which pages of Pharma and Stars are part of the Herbal A cloud? Might this also have something to do with the ratio of paragraph text vs. labels/circular text?

RE: "The Currier languages revisited" revisited - MarcoP - 19-12-2025

Quote: dimensionality reduction using a heuristic PCA-like method rather than PCA

Why not use PCA?
That would make for a more meaningful comparison with Rene's results.

RE: "The Currier languages revisited" revisited - kckluge - 19-12-2025

(19-12-2025, 01:19 PM) MarcoP Wrote:
Quote: dimensionality reduction using a heuristic PCA-like method rather than PCA

I was ("Applying PCA, the first two dimensions found capture 48% of the covariance in the 40-D data."). I should probably have done that as a table to avoid ambiguity. What Rene did is described on his page as follows:

Quote: The following procedure will not necessarily find this maximum, but it will find something near to the maximum.

RE: "The Currier languages revisited" revisited - kckluge - 20-12-2025

(19-12-2025, 10:55 AM) Koen G Wrote: Very interesting and comprehensible analysis, thank you. Quire 8 pages misbehaving comes as no surprise.

If only running paragraph text is used with the same 40 bigrams, the results are similar with a couple of caveats: f70r2 is the only Zodiac page in the sample; several of the Astro/Cosmological pages drop out due to lack of running paragraph text; and the text sample sizes on the diagram pages get smaller (and therefore noisier from a variance perspective). Here these two axes capture a hair over 50% of the overall covariance:

The f58 & f65 pages still fall in the B cluster (unsurprisingly). The lone Zodiac page is on the border between the A & B clusters, and one of the Rose foldout pages drifts towards the big Pharma group, but that's as likely to be due to the smaller text sample on those pages after circular text is excluded as anything else.
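
The pipeline described in the opening post (per-page relative frequencies of the most common within-word bigrams, followed by PCA) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code actually used for the plots in this thread: the `pages` dict, the toy words, and all function names are hypothetical, each page is assumed to be already tokenised into Currier-alphabet words, and every word boundary is counted as exactly one space when forming the total bigram count.

```python
from collections import Counter

import numpy as np


def page_bigram_counts(words):
    """Return (within-word bigram counts, total bigram count for the page).

    Assumption: the total counts every adjacent character pair in the page
    text, including pairs that involve a space, with one space per word gap.
    """
    within = Counter()
    for w in words:
        within.update(a + b for a, b in zip(w, w[1:]))
    text = " ".join(words)
    total = max(len(text) - 1, 1)  # avoid division by zero on tiny pages
    return within, total


def frequency_matrix(pages, top_k=40):
    """Build a pages x top_k matrix of relative bigram frequencies."""
    counts = {name: page_bigram_counts(words) for name, words in pages.items()}
    overall = Counter()
    for within, _total in counts.values():
        overall.update(within)
    top = [bg for bg, _n in overall.most_common(top_k)]
    names = sorted(counts)
    X = np.array([[counts[n][0][bg] / counts[n][1] for bg in top] for n in names])
    return names, top, X


def pca(X, n_components=2):
    """Plain PCA via SVD of the mean-centred data.

    Returns the projected scores and the fraction of total variance
    captured by the first n_components axes.
    """
    Xc = X - X.mean(axis=0)
    _U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2
    total = var.sum()
    explained = float(var[:n_components].sum() / total) if total > 0 else 0.0
    return Xc @ Vt[:n_components].T, explained


# Toy input only: these "pages" and words are invented for illustration.
pages = {
    "pageA1": ["8AM", "SCOE", "OPAR"],
    "pageA2": ["SCOR", "8AM", "OPAE"],
    "pageB1": ["4OFC89", "OFC89", "SC89"],
    "pageB2": ["4OFCC89", "ZC89", "OE"],
}
names, top, X = frequency_matrix(pages, top_k=40)
scores, explained = pca(X, n_components=2)
```

The `explained` value plays the role of the "first two dimensions capture N%" figures quoted in the posts; with real data the rows of `scores` would be plotted and coloured by section.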

With respect to the relationship between the Herbal A, Pharma, Zodiac, and Astro (circular diagram) pages, using the same 40 bigrams on running paragraph and circular text but doing PCA on just those data points yields the following plot (first two dimensions capture ~43.5% of the total covariance of that set of points):

While there is some overlap, it looks like the zodiac & circular diagram pages are in a different "dialect" than the pharma & pharma-adjacent herbal pages. That's why it's so important to be careful about interpreting lower dimensional projections -- they look like they fall on top of each other when you use the first two axes returned by PCA over all the pages, but that's an artifact of PCA being forced to find directions that capture the main A/B split.

Doing the same thing for the B language pages (the first two axes here capture ~48.3% of the B page covariance), the Bio pages fall in a group towards the upper right; the Herbal B pages (albeit with a few stragglers), the Rose foldout panels, and a chunk of the Starred paragraph pages fall in a group towards center left; and another chunk of the Starred paragraph pages falls in between the two (presumably reflecting the split between Starred paragraph pages Rene found on the basis of differences in most common word vocabulary):

The same caveat about caution in interpreting overlaps in a lower dimensional projection applies -- pairwise comparisons of Herbal B, Bio, Rose foldout, and Starred paragraph sets of pages would probably give a more reliable sense of the substructure in the B language "dialects".

With respect to "if the cloud of Zodiac pages can be seen as somewhat bridging the gap", the point of the analysis was to test (and push back on in a data driven way) the whole notion that there is some smooth gradual transition between the A language pages and the B language pages.
That's just not what the relative bigram frequency data appears to be saying. While they are probably (almost certainly) instances of some common underlying scheme, they appear to reflect separate discrete versions of that scheme.

An interesting question is what the PCA plots would look like if tested on the 40 most common bigrams in comparably sized "pages" of mixed (say) English, French, and Spanish. I suspect the scatter within a given language would be lower and rather more Gaussian looking than the results for the Voynich pages.

RE: "The Currier languages revisited" revisited - ReneZ - 20-12-2025

Thanks Karl, for this additional analysis. I find these types of cross-verification very useful, and one of the main purposes of having standardised formats is precisely to allow this. This was a major component of some of the work in my professional life, and the parallel processing of the same or similar data by several independent groups was one of the main reasons for the spectacular improvement in the results that could be achieved over a few decades. [ Unfortunately, that is entirely off-topic here :-) ]

Anyway, there is quite a lot that can be said about the commonalities and differences, and for much of it, additional experiments would have to be run. That is something I am not set up to do on short notice, and will have to stay a bit further down on a list..... So, the following are some educated guesses.

1. Currier vs. Cuva: my guess is that this is not likely to cause any significant difference to the result, perhaps not even visible. This could be tried out exactly. For this, note that the results on the linked page were done in 1999 or so, but I repeated them before the 2022 conference, using somewhat different input data, and a slightly different definition of Cuva. I also tried 'real' PCA vs. my own invented logic. All this made hardly any difference to the result.
Unfortunately, and as chance will have it, I recently discontinued the page that had these results. I can temporarily bring it back if there is an interest - it may have formatting issues due to changes I made to the overall site formatting.

2. Using only the most common 40 bigrams vs all: here I also believe that this is not likely to cause any significant or even visible difference to the result. To test this would be slightly more complicated.

3. Not using the text on circles and labels: I strongly suspect that this is the main reason for the visible differences. I especially suspect the circular texts. It may be worth trying this out. I showed on the original page that leaving out the foldout pages creates a major gap in the whole cluster. Now of course this is not the same thing, but still almost all circular text is on foldout pages, so there is some hint of a relationship.

(19-12-2025, 04:05 AM) kckluge Wrote: It is important to be careful about drawing conclusions from linear projections of higher dimensional data onto lower dimensional spaces. If two clumps of points are separable in the lower dimensional projection then they are also separable in the full dimensional space, but the inverse is not true -- two clumps of points that overlap in some projection do not necessarily overlap in the full space.

I fully agree with this, but I would argue as follows: The projections I used to map the higher-dimensional space onto individual planes were chosen such that the extent (maximum minus minimum) of each dimension is sorted from high to low. The size of the cloud along each later dimension becomes uniformly smaller. Chances that there is a clean separation in a later dimension become smaller and smaller. Of course, reality is more complicated.
A separation does not have to be along a plane, and one should really be able to rotate the cloud freely to look at it from more angles to check this. However, I see no indication in the plots I have made that there is a separation.

Finally, there is also this point: while Currier posited two different languages, intermediate forms (which he had not seen or at least not mentioned) exist. However, the amount of text of these intermediate forms is clearly less than both the 'clearly' A and 'clearly' B material. This also affects the appearance of the point clouds.

It is interesting to see that you seem to have three clouds, i.e. not just A and B. The overall 'boomerang' shape is also there. I think that this should be attributed to the fact that all coordinates are values between 0 and 1, where most of them are close to 0, and some stretch out along one of the axes. If the point cloud were not projected along Eigenvectors, but along the base vectors, they are all dominated by points along the two axes.

RE: "The Currier languages revisited" revisited - kckluge - 20-12-2025

(20-12-2025, 02:21 AM) ReneZ Wrote: 3. Not using the text on circles and labels: I strongly suspect that this is the main reason for the visible differences. I especially suspect the circular texts. It may be worth to try this out.

To clarify, the experiment in the thread starter used line types

* P -- lines in running paragraphs
* C -- text on the circumference of circles in the circular diagram type pages

but not

* R -- radial text in circles in the circular diagram type pages
* L -- labels

Running the bigram frequency stats on all lines of text results in this list of the 40 most frequent bigrams (while there is some reordering of relative frequency, it looks like it mostly overlaps the list for just P & C lines):

Currier: 89 OF OE CC 4O C8 SC 8A C9 FC OP AM AR CO AE FA OR SO ZC O8 PC AN PA EF FS S9 PS ZO RA ES 9F S8 AJ FO PO F9 BS 2A 9P EO

Translated into EVA: dy ok ol ee qo ed che da ey ke ot aiin ar eo al ka or cho she od te ain ta lk kch chy tch sho ra lch yk chd am ko to ky pch sa yt lo

Looking at the effects of adding R and L lines of text into the data makes clear that it's the inclusion of labels (especially the Zodiac folio labels) that explains the difference. Here's the plot with "R" lines added to "P" & "C" lines -- there is minimal difference from the result without the "R" lines:

When the labels ("L" lines in the transcription file) are added in, it shifts the relative frequencies (as well as changing the PCA axes used for projection), producing a plot that looks very much like yours (f116v winds up being a big outlier on the plot's Y value, so I've clipped the range to keep it from reducing detail):

The labels in the zodiac pages are clearly the cause of the "bridge" in your original results. Even with the way they influence the PCA, if you leave them out of the plot the A & B language pages still split (with the exception of three Herbal A pages), just with a narrower gap (in that specific projection):

So that opens the whole "labelese" vs. running text issue.
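
The line-type comparison above (P and C lines versus adding R or L lines) amounts to recomputing the ranked bigram list on different subsets of lines. A minimal sketch follows, assuming the transcription has already been parsed into hypothetical `(page, line_type, words)` records; the record layout, the toy words, and the function name are illustrative assumptions, and the relative frequency here is taken over within-word bigrams only, which may differ from the denominator used for the tables in this thread.

```python
from collections import Counter


def ranked_bigrams(records, line_types, k=20):
    """Rank within-word bigrams over the lines whose type is in line_types.

    records: iterable of (page, line_type, words), line_type in "P", "C", "R", "L".
    Returns a list of (bigram, relative frequency), most frequent first.
    """
    counts = Counter()
    for _page, line_type, words in records:
        if line_type not in line_types:
            continue
        for w in words:
            counts.update(a + b for a, b in zip(w, w[1:]))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(bg, n / total) for bg, n in counts.most_common(k)]


# Toy records only: pages, line types and words are invented for illustration.
records = [
    ("page1", "P", ["OPCC9", "SCOE"]),
    ("page1", "C", ["OPCO", "PCC9"]),
    ("page1", "L", ["OPAE", "OFAR"]),
    ("page1", "R", ["OFAE"]),
]
running = ranked_bigrams(records, {"P", "C"})
labels = ranked_bigrams(records, {"L"})

# Bigrams that appear in the running-text ranking but not in the label ranking.
only_in_running = {bg for bg, _f in running} - {bg for bg, _f in labels}
```

Comparing the two rankings side by side is the same kind of check as the "SC" and "C9" examples for the zodiac pages: a bigram that is common in running text but rare or absent in labels shows up in `only_in_running` or with a much lower relative frequency.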

Looking at the Zodiac diagrams (f70v2 - f73v), here are the 20 most frequent bigrams in the labels:

kgram:  AE     OP     OF     AR     CC     CO     89     PA     FC     OE
Rank:   1      2      3      4      5      6      7      8      9      10
REFreq: 0.0741 0.0599 0.0537 0.0476 0.0435 0.0422 0.0388 0.0320 0.0299 0.0286

kgram:  FA     EA     PC     O8     RA     E9     AM     AJ     8A     C9
Rank:   11     12     13     14     15     16     17     18     19     20
REFreq: 0.0252 0.0245 0.0231 0.0218 0.0211 0.0204 0.0197 0.0170 0.0143 0.0136

Here are the 20 most frequent bigrams for the non-"L" lines on those pages:

kgram:  OP     CO     CC     PC     AE     OF     C9     FC     OE     SC
Rank:   1      2      3      4      5      6      7      8      9      10
REFreq: 0.0698 0.0604 0.0589 0.0546 0.0529 0.0425 0.0391 0.0354 0.0345 0.0339

kgram:  AR     89     AM     8A     O8     ZC     C8     SO     EA     FA
Rank:   11     12     13     14     15     16     17     18     19     20
REFreq: 0.0319 0.0313 0.0282 0.0256 0.0227 0.0175 0.0172 0.0167 0.0167 0.0138

To pick just a couple of examples,

* "SC" is the 10th most frequent glyph bigram on non-label lines at 3.39%, but it doesn't even make the top 20 for labels
* "C9" is 3.91% of glyph bigrams for non-label lines of text, 1.36% for labels
* "AR" is 3.19% for non-label lines, 4.76% for labels
* "CO" is 6.04% for non-label lines, 4.22% for labels

Again, this is all within the zodiac pages. I think that makes a case that labelese and running text are different animals.

Quote: Finally, there is also this point: while Currier posited two different languages, intermediate forms (which he had not seen or at least not mentioned) exist. However, the amount of text of these intermediate forms is clearly less than both the 'clearly' A and 'clearly' B material. This also affects the appearance of the point clouds.

There are clearly "dialects" within the A & B languages, but I don't think the plots without labels included in my earlier posts in the thread support claims of intermediate forms between the two.
There are two separable sets of pages (with ~3 exceptions as discussed in an earlier post), with the Herbal A, Pharma, Zodiac, and Astro pages falling on one side of the wall and the Herbal B, Bio, Rose foldout, and Starred paragraph pages falling on the other side. The Zodiac labels appear to be intermediate between them in a statistical sense, but given that they differ from the non-label text on the same pages I'm not sure I buy Zodiac labelese as an intermediate form between A language running text and B language running text. Your mileage may differ.

RE: "The Currier languages revisited" revisited - ReneZ - 20-12-2025

Sorry for misreading your opening post w.r.t. the circular texts. Indeed, between the labels and the radial text, the former are more likely to explain the difference. The radial texts are a small group that can hardly influence the statistics. I am not even sure that they are fundamentally different from labels.

That the zodiac labels mostly fill the gap confirms the observation that the most distinctive bigram, Eva 'dy' or Currier '89', gradually appears in the course of these pages. It remains a small corpus, so one can argue whether there is a continuous change or not. If one restricts the plot to Herbal-A, Pharma, Herbal-B, Bio and Stars, one gets two disconnected clusters. There is a progression: Pharma -> Astro/Cosmo -> Zodiac that forms a (full or partial) bridge, the choice between full and partial being somewhat subjective, as the data is rather sparse.

RE: "The Currier languages revisited" revisited - MarcoP - 20-12-2025

That's very interesting. If I understand correctly, an implication is that Currier A (as defined by the left-side cluster in the plots) is coincident with Scribe 1 plus Scribe 4. So the A/B split can be entirely explained by the different scribes (but for those three herbal pages)? Within the two main clusters, individual Scribes overlap significantly, I think?
I would be curious to see a version of the plot colored by Scribe/Hand.

As Rene said, the leftmost cluster appears to be made of two sub-clusters. The top one is mostly Pharma (Scribe 1) and Astro (Scribe 4) pages. So this finer division doesn't appear to depend on just subject or scribe differences.

RE: "The Currier languages revisited" revisited - kckluge - 20-12-2025

(20-12-2025, 08:35 AM) MarcoP Wrote: If I understand correctly, an implication is that Currier A (as defined by the left-side cluster in the plots) is coincident with Scribe 1 plus Scribe 4. So the A/B split can be entirely explained by the different scribes (but for those three herbal pages)?

With the caveat that Scribe 4 also does one side of the big Rose foldout and those pages all cluster on the B language side with the text on Scribe 2's nine-rosette diagram, yes.
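
The projection caveat raised in the opening post (overlap in a low-dimensional projection does not imply overlap in the full space) can be illustrated with synthetic data. The sketch below is an assumption-laden toy, unrelated to the manuscript data: two groups share large spread along two axes and are cleanly separated along a third, low-variance axis, so the first two PCA axes show them overlapping. The angle scan is only a coarse numerical check for a separating direction in the plane, not an exact linear separability test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two groups with identical, large spread in the first two coordinates ...
shared = rng.normal(scale=[5.0, 3.0], size=(2 * n, 2))
# ... and a clean gap along a third coordinate with much smaller variance.
z = np.concatenate([rng.uniform(0.2, 0.4, n), rng.uniform(-0.4, -0.2, n)])
X = np.column_stack([shared, z])
labels = np.array([0] * n + [1] * n)

# PCA via SVD; keep the two highest-variance axes.
Xc = X - X.mean(axis=0)
_U, _S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T


def separated_along(values, labels):
    """True if the two groups occupy disjoint ranges along one direction."""
    a, b = values[labels == 0], values[labels == 1]
    return bool(a.min() > b.max() or b.min() > a.max())


def any_separating_direction(points_2d, labels, n_angles=360):
    """Scan directions in the plane for one that separates the groups."""
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        d = np.array([np.cos(theta), np.sin(theta)])
        if separated_along(points_2d @ d, labels):
            return True
    return False


full_space_separable = separated_along(X[:, 2], labels)
projection_separable = any_separating_direction(proj, labels)
```

Here `full_space_separable` is true while `projection_separable` is false, because PCA ranks axes by variance and the separating direction carries very little of it. The reverse implication still holds: a gap visible in any linear projection is also a gap in the full space.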