KMeans analysis of Currier A, B, C
The Voynich Ninja (https://www.voynich.ninja), Forum: Voynich Research, Analysis of the text, Thread: /thread-4483.html
KMeans analysis of Currier A, B, C - Addsamuels - 06-02-2025

I hope these links are useful: [link] [link]

Generated using ChatGPT.

Note the 'transitioning' [link] and [link]. (Ignore [link] since there's not enough writing.)

ChatGPT says:

What Do PCA Component 1 and PCA Component 2 Mean?

PCA (Principal Component Analysis) is a technique used to reduce the dimensionality of data while keeping as much of its variation as possible.

1. Why Do We Use PCA in This Case?

Your text data is transformed into a high-dimensional numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency). This means that each folio is represented as a vector with many dimensions (one for each unique word). Since we can't easily visualize such high-dimensional data, we use PCA to compress this data into just two principal components (PCA Component 1 & PCA Component 2), the two directions that capture the largest share of the variation between folios.

2. What Are PCA Components?
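To make the idea of a component concrete: a principal component is a direction in the TF-IDF word space along which the folios vary most, and a folio's PCA coordinate is its projection onto that direction. The following is only a minimal sketch under stated assumptions, not the code ChatGPT actually ran (that code is not shown in the thread). The four toy "folios" are made-up strings built from word types listed in this thread, the smoothed IDF formula is one common variant, and power iteration stands in for a library PCA routine. The names `tfidf_rows` and `first_component` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows): one TF-IDF row per document, one column per unique word."""
    tokens = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokens for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for toks in tokens for w in set(toks))
    # Smoothed IDF; this particular variant is an assumption.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokens:
        counts = Counter(toks)
        rows.append([counts[w] * idf[w] for w in vocab])
    return vocab, rows


def first_component(rows, n_iter=200, seed=0):
    """Leading principal direction and per-document scores, via power iteration."""
    n, d = len(rows), len(rows[0])
    # PCA works on mean-centred columns.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Random start so the start vector is not orthogonal to the answer.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        # One step of v <- X^T (X v), followed by normalisation.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
    return v, scores


# Toy strings only; these are NOT real folio transcriptions.
toy_folios = [
    "daiin chol chor daiin",
    "chol daiin chor cthy",
    "shedy qokeedy chedy qokain",
    "qokeedy shedy chedy shedy",
]

vocab, rows = tfidf_rows(toy_folios)
loadings, scores = first_component(rows)

for text, score in zip(toy_folios, scores):
    print(f"{score:+.3f}  {text}")
```

The overall sign of a component is arbitrary, so only relative positions matter: in this toy data the two daiin/chol/chor strings should fall on one side of the axis and the two shedy/chedy/qokeedy strings on the other.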
Would you like to extract the most important words contributing to PCA 1 and PCA 2? That would tell you what words are driving the differences!

RE: KMeans analysis of Currier A, B, C - RobGea - 07-02-2025

Never mind

RE: KMeans analysis of Currier A, B, C - RobGea - 07-02-2025

As a hopefully helpful reference, check these out.
A Principal Component Analysis of the Voynich Manuscript Pages -> Donald Fisk [link]
Voynich Manuscript - Basic Analyses -> Sarah Goslee [link]

RE: KMeans analysis of Currier A, B, C - MarcoP - 07-02-2025

I like the 3d visualization! What are the colors and shapes in the 2d plot?

EDIT: I now understand that the colors are the clusters and the shapes the Currier classification as A (circle), B (triangle), C/undecided (square). If so, it's clear that PCA1 is basically enough to separate A from B. I don't remember if we were ever able to identify an intuitive meaning for PCA2.

At the bottom of this comment, a similar result (same V-shape) is shown for selected bigrams, rather than the entire set of unique words. [link]

Colors detailed here: [link]

It's interesting that bigrams show Pharma (yellow) as a partially separated cluster.

RE: KMeans analysis of Currier A, B, C - dashstofsk - 07-02-2025

There is another plausible explanation for the cluster plots in [link] and for the observation 'the text correlates more closely with the illustrations than would be expected by chance'. The explanation is very simple. I can envisage the following scenario.

The authors of the manuscript invented a private alphabet, together with a method for forming the text with the intention of creating an invented work.
The 'method' for creating the text had no formal standard and so the individual authors could apply it using their own interpretation. Of the principal authors, hand 1's method gave language A; hands 2 and 3 gave language B. This explains the A/B split.

Further clustering comes about because the sections of the manuscript were written at different times, with significant gaps of time between them. Between writings the authors lost some fluency in the 'method', and when they sat down to compose the next sections the writing came out different.

An example. If you look at figure 4 of the document you will see that 103, 108, 111 and 116 of quire 20 are plotted close together. These are pages of the same two sheets, which suggests that these pages might have been written in one sitting. The other pages of quire 20 (same hand, 3) are plotted further away. They were written at some different time.

Because the manuscript is a fabrication it did not trouble the authors if there was variation in the writing. They never felt obliged to follow any particular standard.

The document ends 'If the pages are meaningless, its author would have had to vary the method used to generate the pages according to the subjects of the illustrations.' And this is indeed what happened. The method did vary with time, and with author. In the absence of any formal structure the writing unintentionally 'developed', section by section.

RE: KMeans analysis of Currier A, B, C - Addsamuels - 07-02-2025

(07-02-2025, 06:53 AM)MarcoP Wrote: I like the 3d visualization!
Top Words for PCA Component 1 (Main Variation in Folio Texts):

shedy: 0.3327, chedy: 0.3215, qokeedy: 0.2618, qokain: 0.2556, qokedy: 0.2362
ol: 0.1917, qokeey: 0.1662, qol: 0.1595, qokaiin: 0.1453, qokal: 0.1430
daiin: -0.1378, chor: -0.1231, chol: -0.1229, otedy: 0.1155, lchedy: 0.1152
cthy: -0.0998, okain: 0.0966, aiin: 0.0943, shey: 0.0919, okedy: 0.0895
qokar: 0.0857, qotedy: 0.0840, ar: 0.0791, chey: 0.0781, okeedy: 0.0780
sho: -0.0777, al: 0.0742, chy: -0.0733, qokey: 0.0732, qoteedy: 0.0714
sheedy: 0.0712, otain: 0.0686, oteedy: 0.0686, chdy: 0.0641, or: 0.0630
qoky: 0.0622, chckhy: 0.0609, ain: 0.0597, okeey: 0.0596, cthor: -0.0592

Top Words for PCA Component 2 (Second Major Variation in Folio Texts):

ar: 0.2653, aiin: 0.2365, daiin: -0.2322, shedy: -0.1937, qokain: -0.1881
al: 0.1773, qol: -0.1606, qokeedy: -0.1519, qokedy: -0.1432, chor: -0.1392
chol: -0.1335, cthy: -0.1329, dain: -0.1291, or: 0.1129, okar: 0.1094
otar: 0.1085, chy: -0.1044, cthor: -0.0932, sho: -0.0922, chdy: 0.0907
air: 0.0897, shol: -0.0889, oteey: 0.0869, ol: -0.0860, shey: -0.0859
qotchy: -0.0838, okal: 0.0836, am: 0.0816, otaiin: 0.0787, shor: -0.0753
shy: -0.0749, oteody: 0.0738, cthol: -0.0717, qokal: -0.0714, qoty: -0.0705
chaiin: -0.0704, lchedy: -0.0666, cheody: 0.0661, oteos: 0.0631, qokeey: -0.0621

CHATGPT Analysis:

Analysis of Top Words for PCA Component 1 and 2

PCA Component 1 (Main Variation in Folio Texts)

This component strongly separates words related to "qokeedy", "qokain", and "shedy" from words like "daiin" and "chor".
PCA Component 2 (Second Major Variation in Folio Texts) This component captures another variation, possibly a structural or phonetic shift, with vowel-heavy words on one side and consonant-heavy words on the other.
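How a list of loadings turns into a statement like "this component separates X from Y" can be illustrated with a short sketch: rank the words by the absolute size of their loading, then group them by sign. The values below are the first 13 entries of the PCA Component 1 list posted above, copied as-is. The function name `split_by_sign` is hypothetical, and this is not necessarily how ChatGPT produced its summary.

```python
# First 13 PCA Component 1 loadings, copied from the list in this post.
pc1_loadings = {
    "shedy": 0.3327, "chedy": 0.3215, "qokeedy": 0.2618, "qokain": 0.2556,
    "qokedy": 0.2362, "ol": 0.1917, "qokeey": 0.1662, "qol": 0.1595,
    "qokaiin": 0.1453, "qokal": 0.1430, "daiin": -0.1378, "chor": -0.1231,
    "chol": -0.1229,
}


def split_by_sign(loadings, top_n=5):
    """Rank words by |loading|, then return the top words on each side of zero."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    positive = [word for word, value in ranked if value > 0][:top_n]
    negative = [word for word, value in ranked if value < 0][:top_n]
    return positive, negative


positive, negative = split_by_sign(pc1_loadings)
print("positive side:", positive)
print("negative side:", negative)
```

A folio rich in words from one group is pushed toward that end of the axis. Which end counts as positive is arbitrary, so only the grouping carries meaning.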
Conclusion: What Does This Tell Us?
RE: KMeans analysis of Currier A, B, C - Addsamuels - 15-02-2025

I have a new chart from the categories [link]. I have now managed to place the labels onto the chart, so behold: [link]

You can see that the categories are perhaps an even better determinant of language than the Currier classification.