(09-08-2023, 10:27 PM)Torsten Wrote: (04-08-2023, 12:42 PM)kckluge Wrote: 1) It is fatally methodologically flawed to do any sort of cluster analysis on 102 data points ...
Concerning the "failure on three fronts" of our "cosine distance" analysis:
1. The "curse of dimensionality" does exist, but in many cases it is a blessing rather than a curse. Several successful methods of statistical physics would not work in less than billion-dimensional space. Two vectors always define a plane, regardless of dimensionality, which, coarsely speaking, "focuses" the statistics; a situation different from simple data distribution. Unfortunately, we were not able to find any website dealing with advanced mathematical statistics of this kind in a more than superficial way, so your hat may remain uneaten. But perhaps you might consider this argument: If you were right, then how could topic modeling programs work at all? Basically, they use the same principle.
To illustrate the potential complications of working in a 7000-dimensional space, consider the question "what fraction of the volume of an N-dimensional unit hypersphere lies inside radius (1 - ep)?" The answer (by a straightforward extension of the square-cube law) is (1 - ep)^N. In the case of a 7000-dimensional space, the fraction of the volume of the unit hypersphere that lies more than 0.001 inside the radius of the hypersphere is
0.999^7000 = 0.000908693836
In other words, 99.9% of the volume of a unit hypersphere in 7000-d space lies in the thin shell of thickness 0.001 between r = 0.999 and r = 1.0 (for a non-unit hypersphere, ep scales with the radius). As a result, if you randomly pick a point inside the unit hypersphere, with very high probability it will be at a distance very close to 1.0 from the origin, because that's where almost all the volume of the hypersphere is.
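To make the arithmetic easy to check, here's a minimal sketch (plain Python; the numbers are just the ones from the example above) of the shell-volume calculation:

Code:
# Fraction of an N-dimensional unit hypersphere's volume lying inside
# radius (1 - ep) is (1 - ep)**N, so almost all the volume sits in a
# thin outer shell once N is large.
N = 7000      # dimensionality from the example above
ep = 0.001    # shell thickness

inner_fraction = (1.0 - ep) ** N        # volume inside r = 0.999
shell_fraction = 1.0 - inner_fraction   # volume in the shell [0.999, 1.0]

print(inner_fraction)   # ~0.000909
print(shell_fraction)   # ~0.999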
This is related to the property that in very high-dimensional spaces all the points in a data set will be close to the same distance from each other -- pick a random point in the data and find the smallest hypersphere centered on it that contains all the other points; as a consequence of the result above, with high probability the other points will lie at distances from that center point very close to the radius of the bounding hypersphere.
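And a rough sketch of that concentration effect (NumPy assumed; the 200 uniform random points and the dimensions chosen are arbitrary, nothing Voynich-specific about them):

Code:
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 100, 7000):
    pts = rng.random((200, dim))                 # 200 random points in [0,1]^dim
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    d = np.sqrt(np.clip(d2, 0.0, None))          # pairwise Euclidean distances
    np.fill_diagonal(d, np.nan)                  # ignore self-distances
    ratio = np.nanmin(d, axis=1) / np.nanmax(d, axis=1)
    print(f"dim={dim:5d}  mean nearest/farthest distance ratio = {ratio.mean():.3f}")

# As dim grows the ratio climbs toward 1: each point's nearest and farthest
# neighbors become nearly indistinguishable by distance alone.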
For those interested in more detailed/analytic discussions and demonstrations of the non-intuitive weirdnesses of high-dimensional spaces, here are some references:
+ Barum Park, "The Curse of Dimensionality"
+ Bill Shannon, "The Curse of Dimensionality"
+ Avrim Blum, John Hopcroft, and Ravindran Kannan, _Foundations of Data Science_, Chapter 2 ("High-Dimensional Space")
+ Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space"
+ Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft, "When Is 'Nearest Neighbor' Meaningful?"
Having said that... those results are geometric or geometric/distributional arguments; it turns out that real-world high-dimensional data sets apparently often behave as if they were of much lower dimensionality:
+ [link]
So, in the immortal words of Andy Rooney, "Always keep your words soft and sweet, just in case you have to eat them." Other things being equal I'd still recommend caution working in high-dimensional spaces, but "fatally methodologically flawed" was clearly off-base. Mea culpa.
(09-08-2023, 10:27 PM)Torsten Wrote: Thus:
a) Topic modeling works (despite the "curse of dimensionality").
b) In the VMS it cannot even correctly identify the two Currier clusters.
c) The non-existence of separated sections (and topics) requires an explanation.
The problem is that when you say, "Let us, for the moment, assume two well-separated domains, A and B," there is an extent to which that is setting up a straw man. It is entirely possible to have multiple populations with overlapping tails in some feature space; the existence of overlapping tails isn't evidence against the existence of multiple underlying distributions.
The analysis of the rank-ordered pairwise distance curve depends heavily on the assumption of "well-separated domains". The lack of a clear inflection point does support the proposition that the A pages and B pages do not form "well separated" clusters in this feature space. It does not, however, support your conclusion that "the curve descends smoothly, almost linearly, with increasing rank. This behavior confirms the hypothesis of a continuous evolution from Currier A to B..."
To illustrate that this is not the case, I'm going to present an example using Euclidean distance. For the purposes of the example, the paper's use of cosine similarity is a distinction without a difference. As the Wikipedia page on cosine similarity explains with admirable clarity, measuring the similarity of two vectors in an N-dimensional space using cosine similarity is mathematically equivalent to measuring their similarity using the Euclidean distance between the points where the vectors intersect the unit hypersphere centered at the origin.
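A small sketch of that equivalence for anyone who wants to see it numerically (NumPy assumed): for unit-length vectors u and v, ||u - v||^2 = 2*(1 - cos(u, v)), so the two measures induce the same ordering of pairs.

Code:
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=50)
v = rng.normal(size=50)

u_hat = u / np.linalg.norm(u)        # project onto the unit hypersphere
v_hat = v / np.linalg.norm(v)

cos_sim = float(u_hat @ v_hat)
eucl = float(np.linalg.norm(u_hat - v_hat))

print(cos_sim, eucl)
print(np.isclose(eucl ** 2, 2.0 * (1.0 - cos_sim)))   # True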
Plot1_GaussClusters.png is a plot of two clusters (500 points each) generated by circular Gaussians (both with standard deviation = 7.5, one centered at (15, 15) and the other centered at (45, 15)):
[attachment=7533]
While there is a small overlap in the tails, there are still clearly two distinct underlying distributions. Plot2_GaussClustRankOrdPairDist.png is the corresponding plot of the rank-ordered pairwise distances:
[attachment=7531]
You'll notice the same behavior of "the curve descends smoothly, almost linearly, with increasing rank" -- despite that, it would not be correct to conclude that this reflected "a continuous evolution" -- the points come from two distinct (if slightly overlapping) distributions.
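For anyone who wants to reproduce the synthetic example, here's a sketch along the lines of what I ran (NumPy/matplotlib assumed; it's not the exact script, so the seed and plot cosmetics will differ from the attachment):

Code:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
a = rng.normal(loc=(15, 15), scale=7.5, size=(500, 2))   # cluster A
b = rng.normal(loc=(45, 15), scale=7.5, size=(500, 2))   # cluster B
pts = np.vstack([a, b])

# all pairwise Euclidean distances (upper triangle only), largest first
diff = pts[:, None, :] - pts[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))
iu = np.triu_indices(len(pts), k=1)
ranked = np.sort(d[iu])[::-1]

plt.plot(ranked)
plt.xlabel("rank")
plt.ylabel("pairwise distance")
plt.title("Rank-ordered pairwise distances, two-Gaussian data")
plt.show()

The curve descends smoothly even though, by construction, the data come from two distinct clusters.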
If instead of looking at the rank ordering of all pairwise distances you look at the within-cluster vs. between-cluster distributions of pairwise distances, you get this:
[attachment=7534]
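The corresponding within/between split is only a few more lines (same synthetic data; again a sketch rather than the exact script):

Code:
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(loc=(15, 15), scale=7.5, size=(500, 2))
b = rng.normal(loc=(45, 15), scale=7.5, size=(500, 2))

def pairwise(x, y):
    # matrix of Euclidean distances between rows of x and rows of y
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

within_a = pairwise(a, a)[np.triu_indices(len(a), k=1)]
within_b = pairwise(b, b)[np.triu_indices(len(b), k=1)]
between = pairwise(a, b).ravel()

print("mean within-cluster distance: ", np.concatenate([within_a, within_b]).mean())
print("mean between-cluster distance:", between.mean())
# The between-cluster mean is substantially larger -- the structure that the
# single rank-ordered curve smooths over.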
This shows that the average distance between a pair of data points from within a cluster is substantially lower than the average distance between a pair of points from different clusters. Replicating your experiment using the version of Takahashi's transcription from the interlinear file, using all text (running text, labels, diagrams, etc.) and ignoring uncertain spaces, this is what I get for the corresponding plot using cosine similarity on the A and B language folios -- note the same pattern of higher average within-group similarity compared to between-group similarity.
[attachment=7535]
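For completeness, here is a hedged sketch of that within/between cosine comparison. The per-folio word counts and the A/B labels (folio_counts, folio_lang) are placeholders -- in practice they come from parsing the transcription, which isn't reproduced here -- so this is the shape of the computation, not the exact code behind the plot:

Code:
import numpy as np
from collections import Counter
from itertools import combinations

# hypothetical inputs: {folio_id: Counter of word types}, {folio_id: "A" or "B"}
folio_counts: dict[str, Counter] = {}
folio_lang: dict[str, str] = {}

def cosine(c1: Counter, c2: Counter) -> float:
    # cosine similarity between two word-frequency vectors
    words = set(c1) | set(c2)
    v1 = np.array([c1.get(w, 0) for w in words], dtype=float)
    v2 = np.array([c2.get(w, 0) for w in words], dtype=float)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

within, between = [], []
for f1, f2 in combinations(folio_counts, 2):
    sim = cosine(folio_counts[f1], folio_counts[f2])
    (within if folio_lang[f1] == folio_lang[f2] else between).append(sim)

if within and between:
    print("mean within-language similarity: ", np.mean(within))
    print("mean between-language similarity:", np.mean(between))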
With respect to the heatmap, Rene is correct that smaller average token counts explain the lower within-group similarity of the herbal folios. Comparing Herbal A with the Bio and Stars sections, this shows the effect on the cosine similarity of using 4 pages/sample rather than 2 pages/sample (= 1 folio). Note that the within-group similarity using two folios/sample becomes comparable to the Stars section using one folio/sample. (While it would make sense to use the herbal bifolios as samples, it was quicker to code it by combining pairs of sequential folios, keeping the Herbal A and Herbal B folios separate.)
[attachment=7536]
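The folio-pairing step is nothing fancy; roughly, "combining pairs of sequential folios" means merging the word counts of consecutive folios (per section and language) before computing similarities, so each sample carries about twice the tokens. Same placeholder structures as in the previous sketch:

Code:
from collections import Counter

# hypothetical inputs: per-folio word counts, and folio ids in manuscript order
folio_counts: dict[str, Counter] = {}
folio_order: list[str] = []

paired_counts: dict[str, Counter] = {}
for i in range(0, len(folio_order) - 1, 2):
    f1, f2 = folio_order[i], folio_order[i + 1]
    paired_counts[f"{f1}+{f2}"] = folio_counts[f1] + folio_counts[f2]

# paired_counts can then be fed to the same cosine-similarity comparison.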
(09-08-2023, 10:27 PM)Torsten Wrote: 2. We also were using ed-statistics in our 2020 paper (see Timm & Schinner 2020, p. 6). However, when considering not only the Herbal A+B and Bio sections, but rather all sections, then a different picture (without dramatic jumps!) emerges for the frequencies of ed-tokens: [...]
See also the discussion by Zandbergen: "… the overall statistics demonstrate that there is a continuum, and the other (not herbal) pages actually 'bridge the gap'." (Zandbergen)
I'm familiar with that page of Rene's (and now that I have a good processing pipeline set up to process the whole manuscript, I will probably look more deeply into that). His plots are based on doing something similar to Principal Components Analysis. Not sure why he didn't just do PCA -- my best guess would be that finding the eigenvectors of a 355 x 355 matrix was too slow. The friendly caveats I'd put on what he did are:
* An advantage of doing conventional PCA is that it lets you answer the question "what fraction of the covariance (scatter) in the data is captured by the top N axes?" (see the sketch after this list). As it is, we don't know how much of the overall spread in the data is preserved by the 4 axes his method finds.
* PCA (and PCA-like) methods try to find the N dimensions that maximally capture the overall scatter in the data; those may not be the N dimensions that maximally tease out structure within the data. Given that, if I were looking for substructure in the A and B language folios, I'd do PCA on just those folios, because doing PCA on everything forces it to focus on capturing the big A/B scatter, potentially at the expense of bringing out substructure within the languages.
* A related issue is that plots of 2-D projections of the data inherently squash structure that isn't aligned with the axes. Applying a standard cluster analysis method to the 4-D data would provide better insight into how "clustery" groups of folios are in the space.
* To the extent that there is overlap in the space of points from different groups of folios, is that because there aren't discrete dialects, or is that because they have overlapping tails in the feature space?
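For what it's worth, the explained-variance check in the first bullet is nearly a one-liner with standard PCA (scikit-learn assumed; X here is just a random placeholder for the real folio-by-feature matrix):

Code:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((100, 355))   # placeholder data matrix

pca = PCA(n_components=4)
pca.fit(X)
print("fraction of variance captured by the top 4 axes:",
      pca.explained_variance_ratio_.sum())

# The same fit run on just the A folios (or just the B folios) would be the
# substructure check suggested in the second bullet.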
On a completely unrelated note, in Rene's breakdown of dialect substructure, he finds
* Bb used in the Biological section
* Bb' used on the central bifolio of the biological section
* Bh used in about half the Herbal-B section
* Bhb used in the other half of the Herbal-B section, more similar to Bb.
I wonder if the Bhb folios are the Herbal B folios Lisa Fagin Davis' analysis attributes to the same scribe who did the Bio B folios?
On a related unrelated note, in writing a function to return manuscript section (including breaking the herbal pages into A & B), I noticed that Lisa Fagin Davis identifies f58 & f65 -- which are Herbal A pages -- as having been written by Scribe 3. This is interesting because it is the only example of anyone other than Scribe 1 working in an 'A' dialect.
Karl
(P.S. What wines pair well with crow?)