"The Currier languages revisited" revisited - Printable Version +- The Voynich Ninja (https://www.voynich.ninja) +-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html) +--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html) +--- Thread: "The Currier languages revisited" revisited (/thread-5153.html) |
RE: "The Currier languages revisited" revisited - kckluge - 23-12-2025

(22-12-2025, 08:06 PM)oshfdk Wrote: (22-12-2025, 07:49 PM)Jorge_Stolfi Wrote: * And the chances of obtaining useful insights will improve a lot if one uses the good old "scientific method": make a hypothesis, then devise the simplest and most effective way to test for it, and do that.

The purpose of PCA isn't to "explain" anything. It's a data visualization tool. The purpose of PCA is to figure out how to optimally position a hyperdimensional flashlight relative to an N-dimensional point cloud so that the shadow the point cloud casts on a 2-D plane (or, by analogy, 3-D volume) captures the maximal amount of the covariance in the data. It's like an X-ray -- an X-ray is a projection of the structure of a 3-D body onto a 2-D image in the same way that a PCA plot is a projection of the structure of an N-D point cloud onto a 2-D plane or 3-D volume. PCA plots don't "explain" anything any more than X-ray images "explain" anything -- they have to be interpreted.

I strongly suspect that when you complain about "PCA" here and in your previous #19 & #21 you aren't really complaining about PCA as such. In fact, I strongly suspect that you aren't even really complaining about applying PCA to the relative frequency distribution of the 40 most-common glyph bigrams (or in Rene's analysis, all the glyph bigrams) in the text. I strongly suspect that what you're really complaining about is analyzing the text in terms of relative bigram frequency behavior at all. The reason I suspect that is that it would explain why you and Rafal are talking past each other in #19, #20, and #21, and why you and Stolfi are talking past each other in #39 and #40.

No one is suggesting that it's not a good idea to interrogate the data to test specific hypotheses.
That doesn't make exploratory data analysis useless. You don't do exploratory data analysis to "explain" the data; you do exploratory data analysis to look for structure in the data. If you find structure in the data, then that structure (hopefully) usefully constrains the set of hypotheses it makes sense to test. For instance, if you want to look for potential cribs by comparing word usage in the Pharma pages and the Herbal pages, it probably makes sense to confine your analysis to the Herbal A pages that cluster (statistically and physically) with the Pharma pages. That's not something you'd know without doing the exploratory data analysis first.

As for whether there is value in analyzing the distribution of relative glyph bigram frequencies:

Here's a clearly stated assumption: The Currier transcription alphabet imposes an appropriate set of equivalence classes on the collections of ink strokes drawn on the vellum of MS 408.

Here's a hypothesis: The pages of MS 408 containing a big picture of a single plant have text drawn from a single underlying distribution in the relative frequency space of the 40 most common glyph bigrams in the text, regardless of what the unknown process generating that distribution is.

Since my ability to visualize 40-D space isn't any better than anyone else's, I'll use PCA to optimally (in a specific sense) linearly project the data onto a 2-D plane. When we do that, what do we see? There are two clearly defined linearly separable clumps of points in the 2-D projection, which also means those two clumps of points are clearly defined and linearly separable in the full 40-D space. So that hypothesis is pretty clearly falsified.

You don't appear to find that result very interesting. I do, because it seems to me that if someone wants to try to decipher or translate the text it would be really, really useful to know up front if it was a smart idea to treat the entire text as a single homogeneous bucket of whatever-it-is.
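
As a rough illustration of the kind of projection described above, here is a minimal Python sketch. It is not the code used for any plot in this thread. The function names (`bigram_counts`, `pca_project`), the use of "." as a word separator, the choice to normalize each page over the selected bigrams only, and the four toy "pages" (invented strings that merely look like transcribed words) are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def bigram_counts(text):
    """Count within-word glyph bigrams; words are assumed to be separated by '.'."""
    counts = Counter()
    for word in text.split("."):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def pca_project(pages, top_n=40, n_components=2):
    """Project per-page relative bigram frequencies onto the leading principal axes."""
    per_page = {name: bigram_counts(text) for name, text in pages.items()}

    # Pick the top_n most common bigrams over all pages together.
    total = Counter()
    for counts in per_page.values():
        total.update(counts)
    vocab = [bigram for bigram, _ in total.most_common(top_n)]

    # One row per page, one column per bigram, rows normalized to relative
    # frequencies (assumption: normalized over the selected bigrams only).
    X = np.array([[counts[b] for b in vocab] for counts in per_page.values()],
                 dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    X = X / np.where(row_sums == 0, 1.0, row_sums)

    # PCA via SVD of the mean-centered data matrix.
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:n_components].T          # page coordinates in the 2-D "shadow"
    explained = (S ** 2) / (S ** 2).sum()      # fraction of variance per axis
    return list(per_page), coords, explained[:n_components]


# Invented toy input, NOT transcription data.
pages = {
    "p1": "daiin.chol.chor.daiin.shol",
    "p2": "chol.daiin.chor.shol.chol",
    "p3": "qokedy.qokeedy.shedy.chedy.qokedy",
    "p4": "shedy.qokedy.chedy.qokeedy.shedy",
}

names, coords, explained = pca_project(pages, top_n=10)
for name, (x, y) in zip(names, coords):
    print(f"{name}: PC1={x:+.3f} PC2={y:+.3f}")
```

The output coordinates still have to be interpreted, exactly as argued above: the sketch only finds the axes of largest spread, and says nothing about why pages end up where they do.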
Here's another hypothesis: There is no correlation between the scribal hand identified by Lisa Fagin-Davis as writing a given page with a big picture of a single plant and which of those two underlying distributions in relative bigram frequency space the text on that page belongs to. If we look at the data, what does that tell us? All the pages in one clump are written by her Scribe 1, and all the pages in the other are written by her Scribes 2, 3, and 5. So that hypothesis is pretty clearly falsified. That's something any theory about the history of the object should account for. It may not "explain" the history, but it certainly constrains it.

Here's another hypothesis: the quantitative differences in relative bigram frequency between the text on the pages with a big picture of a plant on them and the text on the pages with pictures of strange women lying in ponds (but not distributing swords)* are larger than the typical quantitative differences in relative bigram frequencies between texts on different subjects in a common natural language. That is also a perfectly reasonable and perfectly testable hypothesis, although no one has tested it that I know of. I really wish someone would, as it would put to bed one way or another a whole bunch of hand-wavey arguments about the text.

So I'm very comfortable defending the utility of analyses in relative bigram frequency space.

(* If you don't get the reference, you should watch Monty Python and the Holy Grail.)

RE: "The Currier languages revisited" revisited - dashstofsk - 23-12-2025

Am I right in understanding that these PCA plots are computed just on character pair frequencies? And that these plots are being used to judge how A-like or B-like certain pages are? It seems to me that this might lead to wrong conclusions. For example, look at the following PCA plot, from quires 8, 13, 20. It shows [link] and [link]
in the middle of the quire 20 domain of points. You might conclude that these two pages are representative of quire 20. Yet a different PCA plot shows these two pages outside of the quire 20 domain. Also, [link] and [link], which were previously distant from quire 20, are now much closer.

The difference between the plots is that the second was computed using additional measures: frequencies of whole words, prefixes, suffixes, character end pairs (first character of word + last character), bridge pairs (last character of word + first character of following word), gallows frequencies, and frequency of words containing r, d. Surely you need to include some such additional measures alongside character pairs when deciding if groups of pages are related.

RE: "The Currier languages revisited" revisited - oshfdk - 23-12-2025

(23-12-2025, 06:55 AM)kckluge Wrote: So I'm very comfortable defending the utility of analyses in relative bigram frequency space.

I did watch it a few times but it was long ago. I didn't get the reference, but I hope it's not essential for the discussion.

I can't really reply to your points one by one, because I don't think I fully understand many of them. I'm not so wise in the ways of science and I'm not really involved with the whole languages/scribes/hands topics. Maybe that's the reason I keep "talking past" other people in my posts, I don't know.

I can try to explain my dissatisfaction with the exploratory use of PCA and similar advanced statistical techniques without some guiding hypothesis, using a simple example. Suppose you run PCA on some data from the manuscript and you get a curiously good separation between something and something.

1) Does this reflect a specific underlying feature of the manuscript or is this a spurious random pattern?
After all, absolutely any pattern can arise purely through chance, and a random pattern that clearly separates two classes can still be meaningless.

2) Does this reflect some feature of the internal design of the manuscript (intentional, and helpful for getting useful information about the manuscript) or a by-product of some external factors (say, temperature and sunlight in the scriptorium, the shoes the scribe is wearing, whether the scribe is anxious or calm, any of this affecting the writing patterns, which won't tell us anything about the content)?

In most STEM studies it would be easy to handle the question of "spurious or real": you just take another sample of data, another preparation, and try reproducing. So, if we were working with a natural phenomenon, we would just take 10 more of those Voynich Manuscripts, we would repeat the study on them, we would find out that, say, 8 of 10 show the same pattern in this experiment, and we would call this result significant. Maybe even later someone would investigate the two Voynich Manuscripts that failed the test, and would find out that one of them is a 20th-century forgery and another was soaked in water and poorly retraced, turning our original study into the pinnacle of reproducibility.

However, there is only one Voynich Manuscript, and as the number of tests performed grows, it is expected that by chance there will be more and more curiously interesting results which don't reflect any real features of the manuscript. And this makes each of these results less and less convincing.

I'm not sure how to address the question of intentional or a by-product, but I think it requires some model of what the manuscript is and how and why it was created. And I think this model should be created before running the tests, and the tests should be designed to serve as the evidence for or against the model. Retrofitting the model after the test results are known is pointless, since the space of all possible models is huge.
You can invent many conflicting scenarios that would explain any distribution of the data.

Now why I singled out PCA specifically: it provides visual patterns. Human brains are hardwired to extract and boost and extrapolate visual patterns. So, on top of the possibility that the original result is spurious or a by-product, we add the possibility of visually misinterpreting the significance of the result. And now, if I'm following the thread correctly, there is the idea that 2D PCA is not enough and 3D is the way to go, which would further blur the significance of the result.

I think the reason why there are so few interesting and (more or less) universally accepted statistical results for the manuscript is that such results should be 1) simply presented (one-dimensional or scalar); 2) unimodal (do not compare different domains open to interpretations - like texts and images); 3) based on a relatively well-defined feature of the manuscript and a large sample size. An example ticking all of the above would be Stolfi's binomial distribution of the word and token lengths, which seems useful even without some model explaining it (I don't remember if the Chinese hypothesis came to be before this result or after).

I'm not sure whether the above is coherent, but this is my view of the matter. Overall, I don't mind reading exploratory statistical studies if there is some clear understanding of their essential limitations when dealing with a single immutable object of study.

RE: "The Currier languages revisited" revisited - MarcoP - 23-12-2025

(23-12-2025, 12:18 AM)kckluge Wrote: Since the Zodiac pages have an ordering, I plotted the "no labels" coordinates for the 1st two eigenvectors with lines connecting each page to the next one (the last one, Sagittarius, is the point at the bottom at x ~= 0.025):

Thank you, Karl, your trajectory plot is quite interesting.
Maybe I am hallucinating and biased, but it seems to me that, if one takes y=x as a quick and dirty approximation of [link], the zodiac pages tend to get progressively closer to the gap (finally jumping across with the last sample, Sagittarius). Since the first two circular diagrams f70r1/2 have a different format, comparing them with the zodiac proper (from Pisces to Sagittarius) isn’t necessarily significant in this context.

The plot on the right shows Y-X (i.e. PCA2-PCA1) as a measure of “A-ness”, with 0 (PCA2=PCA1) being the hypothetical boundary between A and B.

RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 23-12-2025

(23-12-2025, 03:25 AM)ReneZ Wrote: All theoretically possible, but this is not what we are looking at here. To be on the safe side I created plots involving six different 'base' vectors, which showed nothing of interest - just the cloud getting narrower and narrower.

That is what I did too, back in the day.

We now have mappings to N-space and projections to 2-space that show the A and B pages neatly separated by a hyperplane, except for one or two special cases. That is good.

My quibble is just that there may be other magical projections that show equally neat separation between some pair of sections with the same language, like Herbal-B from Pharma. Even if there is such a projection, PCA is not guaranteed to find it, even if we give it just those two sections. The centroid-to-centroid vector will not necessarily work either.

All the best, --stolfi

RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 23-12-2025

(23-12-2025, 06:55 AM)kckluge Wrote: The purpose of PCA is to figure out how to optimally position a hyperdimensional flashlight relative to an N-dimensional point cloud so that the shadow the point cloud casts on a 2-D plane (or, by analogy, 3-D volume) captures the maximal amount of the covariance in the data.

That's correct; and, lacking better ideas, one may as well run a PCA and use the 2 or 3 components with largest spread. But one must be aware that those may not be the best axes for the purpose of understanding the data, e.g. to see the shape of the cloud, or to identify clusters and determine criteria to separate clusters. Even if every cluster is a neat spherical Gaussian cloud with the same radius...

All the best, --stolfi

RE: "The Currier languages revisited" revisited - ReneZ - 24-12-2025

(23-12-2025, 07:23 PM)MarcoP Wrote: Thank you, Karl, your trajectory plot is quite interesting.

Given the limited number of words, there is quite a lot of noise on top of the trend. The trend is less clear in the circular zodiac texts, for which one can imagine several reasons, but these are all speculative.

RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 24-12-2025

Re the nature of the difference between Languages A and B. I suspect that many people have made the following observation already, but here it is again anyway.

This is the plot of the relative frequencies frA(w), frB(w) of the most common words in Herbal-A and Herbal-B. (The plot only shows words that have frequency >= 0.5% in both sets):

And here is the same plot when each word in either file is mapped to its "root", as defined below:

The mapping from word to "root" entails the following substitutions:
These rules were motivated partly by guesses at what are the "important" letters, and partly by the observation that some pairs (like o and a, r and s, Ch and ee) are often swapped by the Scribe, eventual Retracers, and the Transcribers.

The frequencies here are computed with the algorithm I described earlier in some other thread, where uncertain spaces ',' are treated as '.' or no-space with 50% probability, independently. So in ky.Cho,d,Shy.ol the words ky and ol count as 1, Cho and Shy count as 0.5 each, and Chod, d, dShy, and ChodShy count as 0.25 each.

The transcription is a merge of my own new partial transcription from the BL 2014 images with Rene's IVTFF transcription. The processing above considers only paragraph text (excluding "labels" and "titles") of pages that are definitely "herbal" (thus excluding f1r).

Discussion: I interpret this result as evidence that Herbal-A and Herbal-B are not really different languages, but the same language spelled or encoded in fairly different ways. The differences would involve different usage of the letters that are deleted by the rules above, and possibly different choices between alternatives that are mapped to the same letter.

All the best, --stolfi

PS: the "fr«" in the second plot is a bug in my script. Please disregard. It should be counted as just "fr".

RE: "The Currier languages revisited" revisited - nablator - 24-12-2025

There was a thread in 2019-2020: [link].

RE: "The Currier languages revisited" revisited - Jorge_Stolfi - 24-12-2025

(Yesterday, 03:38 PM)nablator Wrote: There was a thread in 2019-2020: [link].

Thanks! Maybe @tavie should move my post above to that thread? Otherwise I would repost it there too.

All the best, --stolfi
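
The counting rule for uncertain spaces that Stolfi describes in his 24-12-2025 post (each ',' is independently a word break or no break with probability 0.5) can be sketched as below. This is a reconstruction from his one worked example, not his script. The function name `expected_word_counts` is hypothetical, and the sketch assumes that '.' is always a certain word break and that empty pieces are ignored.

```python
from collections import defaultdict


def expected_word_counts(line):
    """Expected word counts when each ',' is a space with probability 0.5.

    '.' is treated as a certain word break. Within a '.'-delimited segment,
    the pieces between commas may fuse with their neighbours. A candidate
    word made of pieces i..j requires:
      * every comma inside it to be "no space"  (0.5 each),
      * the comma just before it, if any, to be a space (0.5),
      * the comma just after it, if any, to be a space (0.5).
    """
    counts = defaultdict(float)
    for segment in line.split("."):
        pieces = [p for p in segment.split(",") if p]
        k = len(pieces)
        for i in range(k):
            for j in range(i, k):
                weight = 0.5 ** (j - i)      # internal commas are not spaces
                if i > 0:
                    weight *= 0.5            # comma on the left is a space
                if j < k - 1:
                    weight *= 0.5            # comma on the right is a space
                counts["".join(pieces[i:j + 1])] += weight
    return dict(counts)


counts = expected_word_counts("ky.Cho,d,Shy.ol")
for word, weight in sorted(counts.items()):
    print(word, weight)
```

On the example from the post this gives ky and ol a count of 1, Cho and Shy 0.5 each, and Chod, d, dShy, and ChodShy 0.25 each, matching the figures quoted there. Summing weights over candidate words avoids enumerating all 2^k space/no-space configurations of a segment explicitly.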