KMeans analysis of Currier A, B, C

The Voynich Ninja (https://www.voynich.ninja), thread /thread-4483.html

KMeans analysis of Currier A, B, C - Addsamuels - 06-02-2025

I hope these links are useful
[link]
[link]

Generated using ChatGPT

Note the 'transitioning' [link] and [link]
(Ignore [link] since there's not enough writing.)

ChatGPT says: What Do PCA Component 1 and PCA Component 2 Mean?
PCA (Principal Component Analysis) is a technique used to reduce the dimensionality of data while keeping as much important information as possible.
1. Why Do We Use PCA in This Case?
Your text data is transformed into a high-dimensional numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency). Each folio is represented as a vector with many dimensions (one for each unique word). Since we can't easily visualize such high-dimensional data, we use PCA to compress it into just two principal components (PCA Component 1 & PCA Component 2), which capture the two largest directions of variance. How much of the total variance they retain depends on the data.
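The script behind the plots is not shown in the thread, so the following is only a minimal sketch of the TF-IDF and PCA step described above. It uses NumPy directly rather than any particular library pipeline. The token lists are invented for illustration (they are not real transcription data), the names `folios` and `tfidf_row` are hypothetical, and the smoothed IDF formula is one common variant chosen as an assumption.

```python
import math
from collections import Counter

import numpy as np

# Toy "folios": invented token lists for illustration, NOT real transcription data.
folios = {
    "toy_A1": "daiin chor chol daiin cthy sho".split(),
    "toy_A2": "chol chor daiin chy daiin sho".split(),
    "toy_B1": "shedy chedy qokeedy qokain ol shedy".split(),
    "toy_B2": "qokedy chedy shedy qokain qol ol".split(),
}

vocab = sorted({w for words in folios.values() for w in words})
n_docs = len(folios)

# Document frequency: in how many folios each word occurs.
df = Counter(w for words in folios.values() for w in set(words))


def tfidf_row(words):
    """TF-IDF vector for one folio (smoothed IDF, assumed variant), L2-normalised."""
    counts = Counter(words)
    row = np.array(
        [counts[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    )
    return row / np.linalg.norm(row)


X = np.vstack([tfidf_row(words) for words in folios.values()])

# PCA via SVD of the mean-centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :2] * S[:2]               # folio positions on PCA 1 and PCA 2
explained = (S ** 2) / np.sum(S ** 2)   # share of variance per component

for name, (x, y) in zip(folios, coords):
    print(f"{name}: PCA1={x:+.3f} PCA2={y:+.3f}")
print("variance share of first two components:", explained[:2].sum())
```

The sign of each component is arbitrary, so only relative positions matter. Printing `explained[:2].sum()` is a quick way to see how much of the variance a 2D plot actually shows.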
2. What Are PCA Components?

PCA Component 1 (X-axis in the plot) → The most important direction of variance in the data.
PCA Component 2 (Y-axis in the plot) → The second most important direction of variance (uncorrelated with Component 1).

These two components are new axes in a transformed space, where:

Similar folios (in terms of text content) are closer together.
Different folios are farther apart.

3. What Does This Mean in the 3D Chart?

X (PCA Component 1): Main variation in folio text.
Y (PCA Component 2): Second major variation in folio text.
Z (Depth): Position of the folio within the document (line number).

This lets us see clusters of similar folios while keeping it visually interpretable.
Would you like to extract the most important words contributing to PCA 1 and PCA 2? That would tell you which words are driving the differences!
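The thread title mentions KMeans, but the clustering step itself is not shown. As a rough sketch only: the block below runs plain Lloyd's algorithm on 2D points. It assumes the clustering was done on the PCA coordinates (it may equally have been run on the full TF-IDF matrix), and the points, the `kmeans` helper and the choice of k=2 are all invented for illustration.

```python
import numpy as np

# Invented 2D points standing in for folio positions on PCA 1 / PCA 2.
points = np.array([
    [-0.60, 0.10], [-0.55, 0.05], [-0.50, 0.12],   # hypothetical "A-like" group
    [0.50, -0.10], [0.58, -0.02], [0.62, -0.15],   # hypothetical "B-like" group
])


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centre: shape (n_points, k).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old centre if a cluster happens to be empty.
        centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
    return labels, centres


labels, centres = kmeans(points, k=2)
print(labels)
```

Cluster numbers are arbitrary, so comparing them with Currier A/B means checking which folios share a label, not which label they got.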

RE: KMeans analysis of Currier A, B, C - RobGea - 07-02-2025

Never mind

RE: KMeans analysis of Currier A, B, C - RobGea - 07-02-2025

As a hopefully helpful reference, check these out.
A Principal Component Analysis of the Voynich Manuscript Pages -> Donald Fisk
[link]

Voynich Manuscript - Basic Analyses -> Sarah Goslee
[link]

RE: KMeans analysis of Currier A, B, C - MarcoP - 07-02-2025

I like the 3d visualization!
What are the colors and shapes in the 2d plot? EDIT: I now understand that the colors are the clusters and the shapes the Currier classification as A (circle), B (triangle), C/undecided (square). If so, it's clear that PCA1 is basically enough to separate A from B. I don't remember if we were ever able to identify an intuitive meaning for PCA2.

At the bottom of this comment, a similar result (same V-shape) is shown for selected bigrams, rather than the entire set of unique words.

[link]

Colors detailed here: [link]

It's interesting that bigrams show Pharma (yellow) as a partially separated cluster.
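For readers who want to try the bigram variant mentioned above, here is a hedged sketch of turning a folio's words into a bigram profile. The selection of bigrams used in the linked comment is not given here, so `selected` is a hypothetical choice, the token lists are invented, and `bigram_profile` is an illustrative helper name. Bigrams are counted over transliteration characters within words.

```python
from collections import Counter


def bigram_profile(words, selected):
    """Relative frequency of each selected character bigram in a list of words."""
    counts = Counter(
        word[i:i + 2] for word in words for i in range(len(word) - 1)
    )
    total = sum(counts.values())
    return [counts[bg] / total if total else 0.0 for bg in selected]


# Hypothetical selection and invented token lists, for illustration only.
selected = ["ed", "ho", "ok", "ai"]
toy_b = "shedy chedy qokeedy qokain".split()
toy_a = "daiin chor chol sho".split()

profile_b = bigram_profile(toy_b, selected)
profile_a = bigram_profile(toy_a, selected)
print(profile_b)
print(profile_a)
```

Each profile is a short fixed-length vector, so the same PCA step can be applied to a matrix of such rows instead of the much wider word-based TF-IDF matrix.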

RE: KMeans analysis of Currier A, B, C - dashstofsk - 07-02-2025

There is another plausible explanation for the cluster plots in [link] and for the observation 'the text correlates more closely with the illustrations than would be expected by chance'.

The explanation is very simple. I can envisage the following scenario.

The authors of the manuscript invented a private alphabet, together with a method for forming the text, with the intention of creating an invented work. The 'method' for creating the text had no formal standard, so the individual authors could apply it using their own interpretation. Of the principal authors, hand 1's method gave language A, and hands 2 and 3 gave language B. This explains the A/B split. Further clustering comes about because the sections of the manuscript were written at different times, with significant gaps of time between them. Between writings the authors lost some fluency in the 'method', and when they sat down to compose the next sections the writing came out different.

An example. If you look at figure 4 of the document you will see that 103, 108, 111 and 116 of quire 20 are plotted close together. These are pages of the same two sheets, which suggests that they might have been written in one sitting. The other pages of quire 20 (same hand, 3) are plotted further away. They were written at some different time.

Because the manuscript is a fabrication, it did not trouble the authors if there was variation in the writing. They never felt obliged to follow any particular standard.

The document ends 'If the pages are meaningless, its author would have had to vary the method used to generate the pages according to the subjects of the illustrations.' And this is indeed what happened. The method did vary with time, and with author. In the absence of any formal structure the writing unintentionally 'developed', section by section.

RE: KMeans analysis of Currier A, B, C - Addsamuels - 07-02-2025

(07-02-2025, 06:53 AM) MarcoP wrote: I like the 3d visualization! [...]

Top Words for PCA Component 1 (Main Variation in Folio Texts):
shedy: 0.3327, chedy: 0.3215, qokeedy: 0.2618, qokain: 0.2556, qokedy: 0.2362,
ol: 0.1917, qokeey: 0.1662, qol: 0.1595, qokaiin: 0.1453, qokal: 0.1430,
daiin: -0.1378, chor: -0.1231, chol: -0.1229, otedy: 0.1155, lchedy: 0.1152,
cthy: -0.0998, okain: 0.0966, aiin: 0.0943, shey: 0.0919, okedy: 0.0895,
qokar: 0.0857, qotedy: 0.0840, ar: 0.0791, chey: 0.0781, okeedy: 0.0780,
sho: -0.0777, al: 0.0742, chy: -0.0733, qokey: 0.0732, qoteedy: 0.0714,
sheedy: 0.0712, otain: 0.0686, oteedy: 0.0686, chdy: 0.0641, or: 0.0630,
qoky: 0.0622, chckhy: 0.0609, ain: 0.0597, okeey: 0.0596, cthor: -0.0592

Top Words for PCA Component 2 (Second Major Variation in Folio Texts):
ar: 0.2653, aiin: 0.2365, daiin: -0.2322, shedy: -0.1937, qokain: -0.1881,
al: 0.1773, qol: -0.1606, qokeedy: -0.1519, qokedy: -0.1432, chor: -0.1392,
chol: -0.1335, cthy: -0.1329, dain: -0.1291, or: 0.1129, okar: 0.1094,
otar: 0.1085, chy: -0.1044, cthor: -0.0932, sho: -0.0922, chdy: 0.0907,
air: 0.0897, shol: -0.0889, oteey: 0.0869, ol: -0.0860, shey: -0.0859,
qotchy: -0.0838, okal: 0.0836, am: 0.0816, otaiin: 0.0787, shor: -0.0753,
shy: -0.0749, oteody: 0.0738, cthol: -0.0717, qokal: -0.0714, qoty: -0.0705,
chaiin: -0.0704, lchedy: -0.0666, cheody: 0.0661, oteos: 0.0631, qokeey: -0.0621

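The lists above appear to be ranked by absolute loading (for example, daiin at -0.1378 sits between qokal at 0.1430 and otedy at 0.1155). A small sketch of that ranking step, using a handful of the PCA 1 values copied from the post; the helper name `top_words` is hypothetical, and in a real run the loadings would come from the fitted PCA rather than being typed in.

```python
# A few PCA 1 loadings copied from the post above (not the full list).
vocab = ["daiin", "shedy", "chor", "qokeedy", "ol", "chedy"]
pc1 = [-0.1378, 0.3327, -0.1231, 0.2618, 0.1917, 0.3215]


def top_words(vocab, loadings, n):
    """Return the n (word, loading) pairs with the largest absolute loading."""
    ranked = sorted(
        zip(vocab, loadings), key=lambda pair: abs(pair[1]), reverse=True
    )
    return ranked[:n]


for word, weight in top_words(vocab, pc1, 4):
    print(f"{word}: {weight:.4f}")
```

Ranking by absolute value keeps strongly negative words such as daiin in the list, which matters because the sign only tells you which end of the axis a word pulls towards.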
CHATGPT Analysis:
Analysis of Top Words for PCA Component 1 and 2
PCA Component 1 (Main Variation in Folio Texts)
This component strongly separates words related to "qokeedy", "qokain", and "shedy" from words like "daiin" and "chor".

High Positive Loadings (Strong Contribution to One Side)
- Words like "shedy" (0.3327), "chedy" (0.3215), "qokeedy" (0.2618), "qokain" (0.2556), and "qokedy" (0.2362) are dominant.
- Many of these words share the prefix "qok-" or the ending "-edy".
- This suggests that folios containing these words are clustered together, likely forming a distinct linguistic or stylistic group.
- The presence of "ol" (0.1917) and "qol" (0.1595) could indicate a second pattern within this cluster.
Negative Loadings (Opposite Side of the Spectrum)
- Words like "daiin" (-0.1378), "chor" (-0.1231), and "chol" (-0.1229) push in the opposite direction.
- This indicates that folios dominated by "daiin" and "chor" are distinctly different from those dominated by "shedy" and "qokeedy".
- "cthy" (-0.0998) and "chy" (-0.0733) are also found in the negative range, reinforcing this separation.
Patterns Noticed
- The "qok-" prefix is heavily weighted in PCA 1, suggesting a major stylistic or lexical variation in some folios.
- Words ending in "-edy" (shedy, chedy, qokeedy) seem to be characteristic of one category.
- Words with "daiin" and "chor" are defining features of another category.

PCA Component 2 (Second Major Variation in Folio Texts)
This component captures another variation, possibly a structural or phonetic shift, with vowel-heavy words on one side and consonant-heavy words on the other.

High Positive Loadings (One Group of Texts)
- "ar" (0.2653), "aiin" (0.2365), "al" (0.1773), and "or" (0.1129) dominate this component.
- These words tend to begin with vowels, especially A and O.
- This suggests a separate folio style where vowel-initial words are more common.
Negative Loadings (Opposing Text Style)
- "daiin" (-0.2322), "shedy" (-0.1937), "qokain" (-0.1881), "qokeedy" (-0.1519), "chor" (-0.1392) are dominant on this side.
- These words are more consonant-heavy, with strong "qok-" and "shedy" patterns.
- This confirms the trend that "daiin" is in a separate category from vowel-initial words.
Patterns Noticed
- PCA 2 separates texts that favor vowel-heavy words ("aiin", "ar", "al") from those using consonant-heavy words ("qokain", "shedy", "chor").
- The negative side mixes words from both ends of PCA 1: "shedy", "qokain" and "qokeedy" load positively on PCA 1, while "daiin", "chor" and "chol" load negatively there.

Conclusion: What Does This Tell Us?

PCA 1 is mostly driven by whether a folio contains "qok-" and "shedy" words or "daiin" and "chor" words.
PCA 2 shows a vowel vs. consonant-heavy variation, suggesting another structural difference in the text.
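One way to check how the two components relate is to cross-tabulate the signs of the loadings for words that appear in both posted lists. The values below are copied from the post; the selection of twelve words and the helper `sign` are just an illustrative choice, not part of the original analysis.

```python
from collections import defaultdict

# (PCA 1 loading, PCA 2 loading) for words that appear in both posted lists.
loadings = {
    "shedy": (0.3327, -0.1937),
    "qokain": (0.2556, -0.1881),
    "qokeedy": (0.2618, -0.1519),
    "qokedy": (0.2362, -0.1432),
    "daiin": (-0.1378, -0.2322),
    "chor": (-0.1231, -0.1392),
    "chol": (-0.1229, -0.1335),
    "cthy": (-0.0998, -0.1329),
    "ar": (0.0791, 0.2653),
    "aiin": (0.0943, 0.2365),
    "al": (0.0742, 0.1773),
    "or": (0.0630, 0.1129),
}


def sign(x):
    return "+" if x > 0 else "-"


# Group words by the sign pattern of their two loadings.
quadrants = defaultdict(list)
for word, (pc1, pc2) in loadings.items():
    quadrants[sign(pc1) + sign(pc2)].append(word)

for key in sorted(quadrants):
    print(f"PCA 1 {key[0]}, PCA 2 {key[1]}: {', '.join(quadrants[key])}")
```

In this subset the "shedy"/"qok-" words and the "daiin"/"chor" words sit at opposite ends of PCA 1 but both load negatively on PCA 2, while "ar", "aiin", "al" and "or" load positively on both. That pattern would be consistent with the V-shape MarcoP mentions, though twelve words are not enough to settle it.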

RE: KMeans analysis of Currier A, B, C - Addsamuels - 15-02-2025

I have a new chart from the categories [link]. I have now managed to place the labels onto the chart, so behold: [link]. You can clearly see that the categories are perhaps an even better determinant of language than the Currier classification.