The Voynich Ninja

Full Version: A new Timm & Schinner publication regarding the Malta conference
@Torsten:

Quote: Discussion of Voynich Paleography, Torsten Timm and Andreas Schinner, August 13, 2023

"We fear that even the serious Voynich manuscript research is currently splitting into two schools of thought, most likely drifting apart to a point where communication is no longer possible."

I did not quite understand what is meant by "two schools" in Voynich MS research. I am not aware that such a thing exists. In my opinion, the "Voynich community" is far too fragmented for something like this to emerge. Maybe you can explain this in more detail ?
(13-08-2023, 02:51 PM)bi3mw Wrote: @Torsten:

Quote: Discussion of Voynich Paleography, Torsten Timm and Andreas Schinner, August 13, 2023

"We fear that even the serious Voynich manuscript research is currently splitting into two schools of thought, most likely drifting apart to a point where communication is no longer possible."

I did not quite understand what is meant by "two schools" in Voynich MS research. I am not aware that such a thing exists. In my opinion, the "Voynich community" is far too fragmented for something like this to emerge. Maybe you can explain this in more detail ?

Yes, I would be curious to know what these two schools are. There do seem to be a lot of schools and I find it hard to see how Voynich research can be resolved into two specific schools. Maybe on a specific issue there are differences of opinion.
(13-08-2023, 04:18 PM)Mark Knowles Wrote: There do seem to be a lot of schools ...

Yes, here are three universally known examples (I'm sure there are many more):

Hoax vs. authentic
Cipher vs. natural language
Gibberish vs. meaning (here I would classify, besides Gordon Rugg's theory, also Timm & Schinner's hypothesis).
(09-08-2023, 10:27 PM)Torsten Wrote: Two vectors always define a plane, regardless of dimensionality, which, coarsely speaking, "focuses" the statistics;

This is not the point, and it is not even correct.
If you have many more dimensions than vectors, then every new vector will always add a new dimension, and this is precisely where the "curse of dimensionality" sets in.

Two vectors do not define a plane if they are co-linear and this is the most extreme case where you have more vectors than dimensions. All vectors could be along a single line and set up a one-dimensional space. This is not usually what one has in mind when talking about clusters, but the principle remains the same.

I am sure that there is a large group of people who have not yet made up their mind whether the text is meaningful or not. I certainly haven't.
What I do have a stronger opinion about is that it hasn't been demonstrated either way.

Saying that I don't accept the proof that the text is meaningless, does not imply that I won't accept that the text could be meaningless.

Finally, yes, I have seen statistical evidence for text properties in between A and B, but these are text properties, not handwriting properties and we don't yet know that the two can be safely linked to each other.
(14-08-2023, 01:45 AM)ReneZ Wrote: Finally, yes, I have seen statistical evidence for text properties in between A and B, but these are text properties, not handwriting properties and we don't yet know that the two can be safely linked to each other.

You also write "Herbal-B and Bio-B are quite different". You write further about the Recipes section: "The pages that also appear to be in Bio-B 'dialect' are those of ff. 103, 107, 108, 111, 112 and 116, which are three bifolia. The other three bifolia with ff. 104, 105, 106, 113, 114 and 115 use a more varied vocabulary" [Zandbergen].

Doesn't that mean that your observations indicate that the Astro/Cosmos pages 'bridge the gap' between Herbal A and Herbal B and the Recipes pages 'bridge the gap' between Herbal B and Bio B? If not, where exactly do you see a contradiction between your observations and our claim of a "gradual evolution of a single system from 'state A' to 'state B'"? [Timm & Schinner 2020, p. 6]

How would you explain the following token frequencies for the most common tokens in Herbal A within each section, if not as a gradual decrease?

Code:
          Herbal (A)  Pharma (A)   Astro+Cosmo  Herbal (B)  Stars (B)    Bio (B)
daiin     403 (5.0%)    99 (3.9%)    48 (1.0%)   72 (2.2%)   122 (1.1%)  84 (1.2%)
chol      228 (2.8%)    45 (1.8%)    25 (0.5%)   13 (0.4%)    62 (0.6%)  14 (0.2%)
chor      155 (1.9%)    24 (0.9%)    10 (0.2%)    6 (0.2%)    19 (0.2%)   1 (0.0%)
s         133 (1.6%)    26 (1.0%)    16 (0.3%)   23 (0.7%)    12 (0.1%)  13 (0.2%)
shol      104 (1.3%)    11 (0.4%)     8 (0.2%)   11 (0.3%)    24 (0.2%)  18 (0.3%)
dy        102 (1.3%)    17 (0.7%)    31 (0.6%)   35 (1.1%)    11 (0.1%)  52 (0.8%)
cthy       98 (1.2%)     3 (0.1%)     4 (0.1%)    3 (0.1%)     1 (0.0%)   1 (0.0%)
sho        96 (1.2%)     8 (0.3%)     6 (0.1%)    2 (0.1%)    12 (0.1%)     (0.0%)
chy        95 (1.2%)     6 (0.2%)    11 (0.2%)   11 (0.3%)    11 (0.1%)   9 (0.1%)
dain       80 (1.0%)    13 (0.5%)     7 (0.1%)   11 (0.3%)    53 (0.5%)  47 (0.7%)
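
A minimal sketch of how counts and percentages like those in the table above could be tabulated, assuming each section is available as a plain list of tokens. The function name `token_shares` and the toy token lists are hypothetical; they are not the transcription, and this is not the code behind the table.

```python
from collections import Counter

def token_shares(sections, tokens):
    """Return {token: {section: (count, percent of section tokens)}}."""
    table = {}
    for name, words in sections.items():
        counts = Counter(words)
        total = len(words)
        for tok in tokens:
            share = 100.0 * counts[tok] / total if total else 0.0
            table.setdefault(tok, {})[name] = (counts[tok], share)
    return table

# Toy data for illustration only; these are NOT real section texts.
toy_sections = {
    "Herbal (A)": ["daiin", "chol", "daiin", "chor", "dy"],
    "Bio (B)": ["qokedy", "shedy", "daiin", "qokain", "dy",
                "ol", "shedy", "chedy", "qol", "qokeedy"],
}
shares = token_shares(toy_sections, ["daiin", "chol"])
for tok, row in shares.items():
    print(tok, {sec: f"{n} ({pct:.1f}%)" for sec, (n, pct) in row.items()})
```

Because the percentages are relative to each section's own token total, sections of very different length can be compared directly, which is what the table relies on.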
Let me start by saying that I have full confidence in Lisa Fagin Davis' identification of the five hands.

The second important point is that it is methodologically wrong to base the number of scribes on textual statistics. There is no reason to believe that the two are linked to each other. In fact, people have been trying to find such a correlation since Lisa's results were published, so far without a clear correlation, apart from, of course, the original Currier A vs. Currier B correlation with his Hand 1 vs. Hand 2.

(22-08-2023, 02:23 AM)Torsten Wrote: If not, where exactly do you see a contradiction between your observations and our claim of a "gradual evolution of a single system from 'state A' to 'state B'"?

I did not make any statement about that claim. What I did say is that the evidence presented in the first paper mentioned in this post is not valid. The A and B languages are easy to distinguish. The intermediate forms (depending on which statistic is used) cover only a small part of the text.
Strangely enough, these correlate very strongly with the use of normal sized folios vs. foldout folios, which still lacks an explanation.

From what I have seen, there are a few jumps in the textual statistics rather than a continuous change, but the data are too noisy to be specific about it. I certainly don't feel confident about stating a cause (or excluding a cause) based on what we have.
I'll add that it is almost certain that the bifolia, quires, and sections are not in their original order, so we really can't base conclusions about changes in token frequency on the current order of folios. As I've said above, I'm working now on a detailed codicological study that will result in a theoretical reconstruction of the original sequence, at least partially (there is not enough evidence to re-order the entire manuscript). I hope that that will help others in their work.
I have doubts about a smooth transition from language A (the botanical pages and the recipes with jars) to language B (q13 and q20), with the Astro pages as a transitional stage, but for a completely different reason.
The book is composed of independent thematic blocks. There are at least two botanical blocks, now bound out of order: one with naturalistic plant drawings, the other with "idiomatic" drawings.
q20 shows a preliminary stitching, and q13 has a distinctly different vertical dimension; both are considered language B. These were independent books.
The presence of q9, with all its irregular placement in the book and its unused holes that do not correspond to the existing sewing supports, also testifies to this.

+ the q19 sheets were cut to the desired size from a scroll (?).
(09-08-2023, 10:27 PM)Torsten Wrote:
(04-08-2023, 12:42 PM)kckluge Wrote: 1) It is fatally methodologically flawed to do any sort of cluster analysis on 102 data points ...

Concerning the "failure on three fronts" of our "cosine distance" analysis:

1. The "curse of dimensionality" does exist, but in many cases it is a blessing rather than a curse. Several successful methods of statistical physics would not work in less than billion-dimensional space. Two vectors always define a plane, regardless of dimensionality, which, coarsely speaking, "focuses" the statistics; a situation different from simple data distribution. Unfortunately, we were not able to find any website dealing with advanced mathematical statistics of this kind in a more than superficial way, so your hat may remain uneaten. But perhaps you might consider this argument: If you were right, then how could topic modeling programs work at all? Basically, they use the same principle.

To illustrate the potential complications of working in a 7000-dimensional space, consider the question "what fraction of the volume of an N-dimensional unit hypersphere lies inside radius (1 - ep)?" The answer (by a straightforward extension of the square-cube law) is (1 - ep)^N. In the case of a 7000-dimensional space, the fraction of the volume of the unit hypersphere that lies more than 0.001 inside the surface of the hypersphere is

0.999^7000 ≈ 0.000908693836

In other words, about 99.9% of the volume of a unit hypersphere in 7000-d space lies in the thin shell of thickness 0.001 between r = 0.999 and r = 1.0 (for a non-unit hypersphere, ep scales with the radius). As a result, if you randomly pick a point inside the unit hypersphere, with very high probability it will be at a distance very close to 1.0 from the origin, because that's where almost all the volume of the hypersphere is.
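
The arithmetic above can be checked in a few lines. This is only a sketch of the (1 - ep)^N formula from the text; the function name `inner_volume_fraction` is an illustrative choice.

```python
import math

def inner_volume_fraction(n_dims, eps):
    """Fraction of an n-dimensional unit ball's volume within radius (1 - eps)."""
    return (1.0 - eps) ** n_dims

frac_inside = inner_volume_fraction(7000, 0.001)
frac_shell = 1.0 - frac_inside
print(f"inside r = 0.999:   {frac_inside:.12f}")
print(f"in the outer shell: {frac_shell:.4%}")

# Same quantity via logarithms, which stays accurate for much larger n_dims.
frac_inside_log = math.exp(7000 * math.log1p(-0.001))
```

The logarithmic form is included because the direct power underflows to zero once N gets large enough, while the exponent itself remains easy to inspect.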




This is related to the property that in very high dimensional spaces all the points in a data set will be close to the same distance from each other -- pick a random point in the data and find the smallest hypersphere containing all the other points; as a consequence of the result above, with high probability the other points will lie very close to the radius of the bounding hypersphere from the point at the center.
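
A small simulation can make this concentration of distances visible. It is a sketch under stated assumptions: the points are independent standard Gaussians (not Voynich data), and the "relative contrast" (max - min) / min of the distances from one point to all others is used as the measure of spread.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min of distances from one random point to all the others."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_points)]
    ref, others = pts[0], pts[1:]
    dists = [math.dist(ref, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)
contrast_low = relative_contrast(200, 2, rng)      # 2-d: nearest and farthest differ a lot
contrast_high = relative_contrast(200, 2000, rng)  # 2000-d: all distances nearly equal
print(f"2-d: {contrast_low:.2f}   2000-d: {contrast_high:.2f}")
```

For data like this, the contrast shrinks as the dimension grows, which is the effect the references below analyze in detail.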




For those interested in more detailed/analytic discussions and demonstrations of the non-intuitive weirdnesses of high dimensional spaces, here are some references:

+ Barum Park, "The Curse of Dimensionality"
+ Bill Shannon, "The Curse of Dimensionality"
+ Avrim Blum, John Hopcroft, and Ravindran Kannan, _Foundations of Data Science_, Chapter 2 ("High-Dimensional Space")
+ Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space"
+ Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, "When Is 'Nearest Neighbor' Meaningful?"

Having said that... those results are geometric or geometric/distributional arguments; it turns out that real-world high-dimensional data sets apparently often behave as if they had much lower dimensionality.

So, in the immortal words of Andy Rooney, "Always keep your words soft and sweet, just in case you have to eat them." Other things being equal I'd still recommend caution working in high-dimensional spaces, but "fatally methodologically flawed" was clearly off-base. Mea culpa.

(09-08-2023, 10:27 PM)Torsten Wrote: Thus:

a) Topic modeling works (despite the "curse of dimensionality").
b) In the VMS it cannot even correctly identify the two Currier clusters.
c) The non-existence of separated sections (and topics) requires an explanation.

The problem is that when you say, "Let us, for the moment, assume two well-separated domains, A and B," there is an extent to which that is setting up a strawman. It is entirely possible to have multiple populations with overlapping tails in some feature space; the existence of overlapping tails isn't evidence against the existence of multiple underlying distributions.

The analysis of the rank-ordered pairwise distance curve depends heavily on the assumption of "well-separated domains". The lack of a clear inflection point does support the proposition that the A pages and B pages do not form "well separated" clusters in this feature space. It does not, however, support your conclusion: "the curve descends smoothly, almost linearly, with increasing rank. This behavior confirms the hypothesis of a continuous evolution from Currier A to B..."

To illustrate that this is not the case, I'm going to present an example using Euclidean distance. For the purposes of the example, the paper's use of cosine similarity is a distinction without a difference. As the Wikipedia page on cosine similarity explains with admirable clarity, measuring the similarity of two vectors in an N-dimensional space using cosine similarity is mathematically equivalent to measuring their similarity using the Euclidean distance between the points where the vectors intersect with the unit hypersphere centered at the origin.
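
The equivalence rests on the identity |u' - v'|^2 = 2 (1 - cos(u, v)), where u' and v' are u and v scaled to unit length. A short numeric check, with two arbitrary example vectors chosen purely for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def unit(v):
    """Scale a vector so that it ends on the unit hypersphere."""
    norm = math.hypot(*v)
    return [a / norm for a in v]

u = [3.0, 1.0, 0.0, 2.0]
v = [1.0, 4.0, 1.0, 0.0]
cos_uv = cosine_similarity(u, v)
chord = math.dist(unit(u), unit(v))  # Euclidean distance on the unit hypersphere

# The squared chord length equals 2 * (1 - cosine), up to rounding.
print(chord ** 2, 2.0 * (1.0 - cos_uv))
```

Since the chord length is a decreasing function of the cosine, ranking pairs by one gives the reverse of ranking them by the other, so a rank-ordered curve has the same shape either way.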




Plot1_GaussClusters.png is a plot of two clusters (500 points each) generated by circular Gaussians (both with standard deviation = 7.5, one centered at (15, 15) and the other centered at (45, 15)):

[attachment=7533]

While there is a small overlap in the tails, there are still clearly two distinct underlying distributions. Plot2_GaussClustRankOrdPairDist.png is the corresponding plot of the rank-ordered pairwise distances:

[attachment=7531]

You'll notice the same behavior of "the curve descends smoothly, almost linearly, with increasing rank" -- despite that, it would not be correct to conclude that this reflected "a continuous evolution" -- the points come from two distinct (if slightly overlapping) distributions.
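
For anyone who wants to reproduce this kind of curve, here is a minimal sketch using the parameters stated above (two circular Gaussians, standard deviation 7.5, centers (15, 15) and (45, 15), 500 points each). The random seed is arbitrary, and this is not the script that produced the attached plots.

```python
import math
import random

rng = random.Random(1)
SIGMA = 7.5
cluster_a = [(rng.gauss(15, SIGMA), rng.gauss(15, SIGMA)) for _ in range(500)]
cluster_b = [(rng.gauss(45, SIGMA), rng.gauss(15, SIGMA)) for _ in range(500)]
points = cluster_a + cluster_b

# All pairwise Euclidean distances, sorted from largest to smallest.
ranked = sorted(
    (math.dist(points[i], points[j])
     for i in range(len(points)) for j in range(i + 1, len(points))),
    reverse=True,
)

# Print a few values along the rank-ordered curve.
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    idx = min(int(q * len(ranked)), len(ranked) - 1)
    print(f"rank fraction {q:.2f}: distance {ranked[idx]:.1f}")
```

Plotting `ranked` against its index gives the rank-ordered curve; the cluster membership of each pair is deliberately ignored here, which is why the curve alone cannot reveal the two distributions.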



If instead of looking at the rank ordering of all pairwise distance you look at the within-cluster vs. between-cluster distributions of pairwise distances, you get this:

[attachment=7534]

This shows that the average distance between a pair of data points from within a cluster is substantially lower than the average distance between a pair of points from different clusters. Replicating your experiment using the version of Takahashi's transcription from the interlinear file, using all text (running text, labels, diagrams, etc.) and ignoring uncertain spaces, this is what I get for the corresponding plot using cosine similarity on the A and B language folios -- note the same pattern of higher average within-group similarity compared to between-group similarity.

[attachment=7535]
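
The within-cluster vs. between-cluster comparison can be sketched for the synthetic Gaussian example as follows. The assumptions: the same cluster parameters as above, but only 200 points per cluster to keep it quick, an arbitrary seed, and Euclidean distance rather than cosine similarity.

```python
import math
import random
from itertools import combinations, product

rng = random.Random(2)
SIGMA = 7.5
cluster_a = [(rng.gauss(15, SIGMA), rng.gauss(15, SIGMA)) for _ in range(200)]
cluster_b = [(rng.gauss(45, SIGMA), rng.gauss(15, SIGMA)) for _ in range(200)]

# Pairs drawn from the same cluster vs. pairs that straddle the two clusters.
within = [math.dist(p, q)
          for cluster in (cluster_a, cluster_b)
          for p, q in combinations(cluster, 2)]
between = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"mean within-cluster distance:  {mean_within:.1f}")
print(f"mean between-cluster distance: {mean_between:.1f}")
```

Splitting the pairs by group label is what separates the two distributions; for the folio data the labels would be the A/B language assignments, with similarity in place of distance.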



With respect to the heatmap, Rene is correct that smaller average token counts explain the lower within-group similarity of the herbal folios. Comparing Herbal A with the Bio and Stars sections, this shows the effect on the cosine similarity if you have 4 pages/sample rather than 2 pages/sample (= folio). Note that the within-group similarity using two folios/sample becomes comparable to the Stars section using one folio/sample. (While it would make sense to use the herbal bifolios as samples, it was quicker to code it combining pairs of sequential folios, separating the Herbal A and Herbal B folios.)

[attachment=7536]
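
The sample-size effect can be reproduced with purely synthetic data. In this sketch, both samples of a pair are always drawn from the same Zipf-like toy distribution, so any difference in similarity comes from the token count alone. The vocabulary size, sample sizes, and seed are arbitrary assumptions, not values from the manuscript.

```python
import math
import random
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(x * x for x in c1.values()))
    n2 = math.sqrt(sum(x * x for x in c2.values()))
    return dot / (n1 * n2)

def mean_similarity(sample_size, rng, trials=200, vocab_size=2000):
    """Average cosine similarity of two samples drawn from the SAME distribution."""
    vocab = range(vocab_size)
    weights = [1.0 / (rank + 1) for rank in vocab]  # Zipf-like toy distribution
    total = 0.0
    for _ in range(trials):
        a = Counter(rng.choices(vocab, weights, k=sample_size))
        b = Counter(rng.choices(vocab, weights, k=sample_size))
        total += cosine(a, b)
    return total / trials

rng = random.Random(3)
sim_small = mean_similarity(100, rng)
sim_large = mean_similarity(400, rng)
print(f"100 tokens/sample: {sim_small:.2f}   400 tokens/sample: {sim_large:.2f}")
```

Shorter samples give noisier count vectors and therefore lower similarity even when the underlying "language" is identical, which is why pooling folios raises the within-group similarity.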



(09-08-2023, 10:27 PM)Torsten Wrote: 2. We also were using ed-statistics in our 2020 paper (see Timm & Schinner 2020, p. 6). However, when considering not only the Herbal A+B and Bio sections, but rather all sections, then a different picture (without dramatic jumps!) emerges for the frequencies of ed-tokens: [...]

See also the discussion by Zandbergen: "… the overall statistics demonstrate that there is a continuum, and the other (not herbal) pages actually 'bridge the gap'." (Zandbergen)

I'm familiar with that page of Rene's (and now that I have a good processing pipeline set up to process the whole manuscript, I will probably look more deeply into that). His plots are based on doing something similar to Principal Component Analysis. Not sure why he didn't just do PCA -- my best guess would be that finding eigenvectors of a 355^2 matrix was too slow. The friendly caveats I'd put on what he did are:

* An advantage of doing conventional PCA is that it lets you answer the question "what fraction of the covariance (scatter) in the data is captured by the top N axes?" As it is, we don't know how much of the overall spread in the data is preserved by the 4 axes his method finds.

* PCA (and PCA-like) methods are trying to find the N dimensions that maximally capture the overall scatter in the data; those may not be the N dimensions that maximally tease out structure within the data. Knowing that that's the case, if I were looking for substructure in the A and B language folios, I'd do PCA on just those folios, because doing PCA on everything forces it to focus on capturing the big A/B scatter, potentially at the expense of better bringing out substructure within the languages.

* A related issue is that plots of 2-D projections of the data inherently squash structure that isn't aligned with the axes. Applying a standard cluster analysis method on the 4-D data would provide better insight into how "clustery" groups of folios are in the space.

* To the extent that there is overlap in the space of points from different groups of folios, is that because there aren't discrete dialects, or is that because they have overlapping tails in the feature space?
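
To make the first caveat concrete, here is a sketch of how the "fraction of scatter captured by the top N axes" can be computed. It runs on synthetic 6-dimensional data, not folio data, and it uses plain power iteration with deflation on the covariance matrix as a stand-in for a library PCA routine, so that it needs only the standard library. All names are illustrative.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvalues(cov, k, rng, iters=300):
    """Largest k eigenvalues of a symmetric PSD matrix (power iteration + deflation)."""
    d = len(cov)
    work = [row[:] for row in cov]
    values = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(d) for j in range(d))
        values.append(lam)
        # Deflate: subtract the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
    return values

rng = random.Random(4)
scales = [5.0, 3.0, 1.0, 1.0, 1.0, 1.0]  # toy data: two dominant directions
data = [[rng.gauss(0.0, s) for s in scales] for _ in range(500)]
cov = covariance_matrix(data)
total_variance = sum(cov[i][i] for i in range(len(cov)))  # trace = total scatter
explained_top2 = sum(top_eigenvalues(cov, 2, rng)) / total_variance
print(f"fraction of variance captured by the top 2 axes: {explained_top2:.2f}")
```

The ratio of the leading eigenvalues to the trace is the number that is missing for the 4 axes discussed above; a low value would mean the 2-D plots hide most of the spread.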



On a completely unrelated note, in Rene's breakdown of dialect substructure, he finds

* Bb used in the Biological section

* Bb' used on the central bifolio of the biological section

* Bh used in about half the Herbal-B section

* Bhb used in the other half the Herbal-B section, more similar to Bb.

I wonder if the Bhb folios are the Herbal B folios Lisa Fagin Davis' analysis attributes to the same scribe who did the Bio B folios?



On a related unrelated note, in writing a function to return manuscript section (including breaking the herbal pages into A & B), I noticed that Lisa Fagin Davis identifies f58 & f65 -- which are Herbal A pages -- as having been written by Scribe 3. This is interesting because it is the only example of anyone other than Scribe 1 working in an 'A' dialect.



Karl



(P.S. What wines pair well with crow?)
P.S. Sorry for the weird spacing in that last post -- I know it's a known bug, but with the mixture of quotes and inserted figures I wanted to preview it before posting...