One page on Rene's site analyses the pages of the manuscript at the level of bigram frequencies. This post specifically addresses the section starting with "Language characteristics". The analysis on that part of the page:
* uses Rene's CUVA alphabet to deal with EVA's oversegmentation of the glyphs
* removes uncertain spaces from the transcription, but leaves other spaces
* only looks at bigrams within words, not bigrams straddling spaces
* starts with a feature space corresponding to the relative frequencies of all 355 CUVA bigrams that occur, then does a dimensionality reduction similar (but not identical) to Principal Component Analysis (PCA)
Plots are shown for the dominant vector vs. the 2nd through 4th vectors found by his method. On the basis of those plots he concludes, "When Currier identified his languages A and B, he did this on the basis of the different statistics of the initial herbal pages in the MS, which are identified by the red ('A') and dark blue ('B') crosses. It is clear that these have distinct properties - the clouds do not overlap. He also checked the other pages, and noted more variations, but his criteria for distinguishing the languages did not allow him to see that the overall statistics demonstrate that there is a continuum, and the other (not herbal) pages actually 'bridge the gap'."
It is important to be careful about drawing conclusions from linear projections of higher-dimensional data onto lower-dimensional spaces. If two clumps of points are separable in the lower-dimensional projection then they are also separable in the full-dimensional space, but the converse is not true -- two clumps of points that overlap in some projection do not necessarily overlap in the full space.
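A toy illustration of that point, not taken from either analysis: two made-up 2-D clusters whose projections onto the x axis overlap completely, yet which are cleanly separated along a diagonal direction. The cluster coordinates and helper names are hypothetical.

```python
def project(points, direction):
    """Scalar projection of each 2-D point onto a (non-normalised) direction."""
    dx, dy = direction
    return [x * dx + y * dy for x, y in points]

def ranges_overlap(a, b):
    """True if the intervals [min(a), max(a)] and [min(b), max(b)] intersect."""
    return max(min(a), min(b)) <= min(max(a), max(b))

# Two parallel diagonal stripes (made-up data).
cluster_a = [(float(x), x + 1.0) for x in range(5)]
cluster_b = [(float(x), x - 1.0) for x in range(5)]

# Projected onto the x axis, the clusters cover the same interval.
overlap_on_x = ranges_overlap(project(cluster_a, (1, 0)),
                              project(cluster_b, (1, 0)))

# Projected onto the direction perpendicular to the stripes, they separate.
overlap_on_diag = ranges_overlap(project(cluster_a, (-1, 1)),
                                 project(cluster_b, (-1, 1)))

print(overlap_on_x, overlap_on_diag)
```

An overlap seen in one particular 2-D plot therefore says nothing by itself about whether the groups overlap in the full feature space.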
To examine Rene's conclusion I performed a variation of the analysis described above:
* the Currier alphabet is used rather than CUVA, translated from the ZL_ivtff_1b.txt EVA transcription (when multiple proposed readings are given for a glyph, the first option is used)
* uncertain spaces are removed as per the original experiment
* only lines corresponding to running paragraph and "circular" text are used -- no radial text from diagrams or labels
* only the 40 most common bigrams are used -- in Currier these are:
89 OF OE 4O CC C8 SC 8A C9 AM FC OP CO AR FA AE OR ZC SO O8 PC AN PA EF FS ZO PS S9 ES RA S8 9F AJ BS F9 FO PO 2A 9P EO
which correspond to EVA
dy ok ol qo ee ed che da ey aiin ke ot eo ar ka al or she cho od te ain ta lk kch sho tch chy lch ra chd yk am pch ky ko to sa yt lo
* bigrams including spaces (with end-of-line, end-of-paragraph, and plant drawing gaps counted as spaces) are included in the total bigram count for a page when computing relative bigram frequencies for the page
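The feature extraction described in the list above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the code actually used: the function names and the toy pages are hypothetical, the input is assumed to be already converted to Currier with uncertain spaces removed, and every line break is simply treated as a space.

```python
from collections import Counter

SPACE = " "

def page_text(lines):
    """Join a page's lines, treating each line break as a space (assumption)."""
    return SPACE.join(SPACE.join(line.split()) for line in lines)

def bigram_counts(text):
    """Count every adjacent character pair, including pairs containing a space."""
    return Counter(a + b for a, b in zip(text, text[1:]))

def most_common_bigrams(pages, n=40):
    """The n most frequent within-word bigrams over all pages."""
    totals = Counter()
    for lines in pages.values():
        totals.update(bigram_counts(page_text(lines)))
    within_word = Counter({bg: c for bg, c in totals.items() if SPACE not in bg})
    return [bg for bg, _ in within_word.most_common(n)]

def page_features(lines, bigrams):
    """Relative frequency of each chosen bigram on one page.

    The denominator counts all bigrams on the page, including those
    that contain a space, as described in the list above.
    """
    counts = bigram_counts(page_text(lines))
    total = sum(counts.values())
    return [counts[bg] / total if total else 0.0 for bg in bigrams]

# Made-up toy pages, only to show the shape of the data.
pages = {
    "page1": ["89 4OF89", "OE 89"],
    "page2": ["4OE 4OF", "SC89"],
}
top = most_common_bigrams(pages, n=2)
features = {name: page_features(lines, top) for name, lines in pages.items()}
```

Each page then becomes one row of a pages-by-bigrams matrix, which is the input to the dimensionality reduction.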
The 40 Currier bigrams listed above cover 83% of the bigrams that don't include a space or an untranslatable/transcribed non-Currier "weirdo". Applying PCA, the first two dimensions found capture 48% of the variance in the 40-D data. The resulting plot is:
With the exception of three pages, the Herbal B, Bio, Starred paragraph, and Rose foldout pages fall together in one cluster and the Herbal A, Astro, Zodiac, and Pharma pages fall together in another cluster, separated by a clear diagonal gap.
One exception is Zodiac page f73v; the other two exceptions are [page link not shown] & f65v. f58 & f65 are the halves of a bifolio that Lisa Fagin-Davis identifies as by Scribe 3; [page link not shown] has a plant drawing with no text other than a 2-3 word label. Traditionally those bifolio pages have been labelled as A Language, which would make this the only known non-Scribe 1 Herbal A bifolio. It is plausible that the f58 & f65 bifolio pages are B language pages with atypical relative frequencies of the small number of key bigrams used to make the initial A/B classification by Currier (in which case [page link not shown] becomes the outlier grouped with the Herbal A pages).
The differences between the analyses are:
* use of CUVA vs Currier
* inclusion of radial and label text elements vs only running paragraph and "circular" diagram text
* a 355-D starting space (all bigram frequencies) vs a 40-D space (only the most common bigrams, covering 83% of the glyph bigrams in the text)
* dimensionality reduction using a heuristic PCA-like method vs PCA
The lack of clear separation between the A and B languages in Rene's plots is most likely due to a combination of two things: very low frequency bigrams adding noise to the data, and a suboptimal choice of basis vectors by his dimensionality reduction method.
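For anyone wanting to experiment with the PCA step and the "fraction of variance captured" figure, here is a minimal dependency-free sketch. It is an assumption-laden stand-in, not the implementation used for the plot: the principal components are found by power iteration with deflation, all function names are hypothetical, and the four rows of toy data are made up. With real data each row would be one page's 40 relative bigram frequencies.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(mat, v):
    return [sum(m * x for m, x in zip(row, v)) for row in mat]

def top_eigenpair(mat, iters=500):
    """Largest eigenvalue/eigenvector of a symmetric PSD matrix (power iteration)."""
    v = [1.0 / (i + 1) for i in range(len(mat))]  # arbitrary start vector
    for _ in range(iters):
        w = mat_vec(mat, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # no variance left in any direction
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(mat, v)))
    return eigenvalue, v

def pca(rows, k=2):
    """Return (scores, ratios): projections onto the first k components and
    the fraction of total variance each captures. Assumes the data are not
    all identical (total variance > 0)."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        components.append(v)
        ratios.append(lam / total_variance)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(a * b for a, b in zip(r, c)) for c in components]
              for r in centered]
    return scores, ratios

# Made-up 3-D toy data: most variance along the first axis.
rows = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 1.0, 0.0]]
scores, ratios = pca(rows, k=2)
print(ratios)
```

Plotting the two columns of `scores` against each other gives the kind of 2-D view discussed above, and `sum(ratios)` is the share of variance that view retains. In practice a library routine (for example an SVD of the centred matrix) would be the more usual choice.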