A common idea in analyses of the Voynich text is that it is enough to look only at some statistics to learn something about the text. In this way the Voynich text is treated as a black box. For instance, Sterneck et al. write: "statistical approaches offer a novel way of analyzing the Voynich blackbox. Statistical methods offer tools that capture relevant features of the text without understanding its meaning, and more importantly, allow a certain degree of flexibility with the accuracy of the transcription itself" [Sterneck et al., p. 1].
The problem with this type of black-box research is that, in order to learn something about the Voynich text, statistical results must be interpreted with the text itself in mind. Furthermore, before using a statistical method such as Natural Language Processing, Artificial Intelligence, or topic modeling, it is often necessary to check whether all of its requirements are fulfilled. For instance, many statistical methods expect some level of consistency across the text. But without knowing the text itself it goes unnoticed that the Voynich text is not homogeneous and therefore does not fulfill this requirement.
For instance, the paper of Sterneck et al. itself warns that "topic modeling relies on word frequencies and expects consistency across texts" [Sterneck et al., p. 4]. The paper assumes that it is enough to "consider the topic distributions in conjunction with Currier language" [Sterneck et al., p. 4]. However, if we look into the text itself, it becomes evident that "no obvious rule can be deduced which words form the top-frequency tokens at a specific location, since a token dominating one page might be rare or missing on the next one" [Timm & Schinner, p. 3]. Sterneck et al. are not aware of this problem and interpret their statistical results in the context of a hypothetically homogeneous text. They conclude: "We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations)" [Sterneck et al., p. 1].
Claire Bowern then uses this interpretation of the statistical results to argue that the difference between Currier A and B can be explained as "two methods of encoding at least one natural language" [Bowern and Lindemann, p. 289], and other differences within the Voynich text as follows: "This result suggests that different scribes may have used different encipherment strategies or written about different subjects" [Bowern and Lindemann, p. 303].
Is it indeed possible to explain the variation within the Voynich text this way? And what do their statistical results mean in the context of the Voynich text?
- In Currier A, only 0.32% (36/11348) of the word tokens contain the sequence EVA-'ed'.
- In the Cosmological section, already 9.55% (257/2691) of the word tokens use the sequence 'ed'.
- In the Herbal section in Currier B, 16.3% (528/3233) of the word tokens contain 'ed'.
- In Quire 20 (Stars, Currier B) the number is 19.4% (2073/10673).
- And in Quire 13 (Biological, Currier B) the number is 27.9% (1925/6911).
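Shares like these can be reproduced from any transliteration with a few lines of code. A minimal sketch in Python; the token lists below are tiny hypothetical stand-ins for a real EVA transcription file, not actual Voynich counts:

```python
def share_containing(tokens, seq):
    """Fraction of word tokens that contain the sequence `seq`."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if seq in t) / len(tokens)

# Hypothetical miniature samples in EVA notation (placeholders only).
sections = {
    "Currier A (sample)": ["daiin", "chol", "chor", "shol", "daiin"],
    "Quire 13 (sample)":  ["shedy", "qokeedy", "chedy", "daiin", "okedy"],
}

for name, tokens in sections.items():
    print(f"{name}: {share_containing(tokens, 'ed'):.2%}")
```

On a full transliteration the same function, applied per section, yields the percentages listed above.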
According to Claire Bowern it is possible to explain the differences between Currier A and B as the result of "two different encoding methods", the differences between hands "with scribal differences and with different subjects", and the differences between illustration types with different subjects. However, the difference in EVA-'ed' between Herbal B and Quire 13 (16.3% vs. 27.9%) is as dramatic as the difference between Currier A and Herbal B (0.32% vs. 16.3%), even though according to Lisa Davis the Herbal B part as well as the whole of Quire 13 were written by the same hand. Conversely, the difference between Herbal B and Quire 20 (16.3% vs. 19.4%) is even smaller, although the two sections use different illustration types and, according to Lisa Davis, were written by different hands. Claire Bowern's interpretation of the statistical results obviously does not fit the Voynich text.
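Whether such differences are too large to come from a homogeneous text can also be checked formally, for example with a chi-square test of homogeneity on the 'ed' counts. A hedged pure-Python sketch (the helper name `chi2_2x2` is mine, not from any of the cited papers):

```python
def chi2_2x2(hits_a, n_a, hits_b, n_b):
    """Chi-square statistic for a 2x2 table:
    feature present/absent in two text sections."""
    miss_a, miss_b = n_a - hits_a, n_b - hits_b
    total = n_a + n_b
    hits, miss = hits_a + hits_b, miss_a + miss_b
    stat = 0.0
    for obs, row, col in [(hits_a, n_a, hits), (miss_a, n_a, miss),
                          (hits_b, n_b, hits), (miss_b, n_b, miss)]:
        expected = row * col / total
        stat += (obs - expected) ** 2 / expected
    return stat

# 'ed' counts from above: Currier A (36/11348) vs. Herbal B (528/3233).
# The critical value for 1 degree of freedom at p = 0.05 is about 3.84.
print(chi2_2x2(36, 11348, 528, 3233) > 3.84)  # True -> reject homogeneity
```

The statistic here exceeds the critical value by orders of magnitude, i.e. the consistency assumption fails long before any topic model is fitted.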
Let's also look at the distribution of a single word type, taking EVA-'shedy' as an example.
- EVA-'shedy' occurs only once in Herbal A (1/10616 ≈ 0.01%)
- it is the 12th most frequent word in Herbal B (35/3233 = 1.1%)
- it is the 10th most frequent word in Quire 20 (113/10673 = 1.06%)
- it is the second most frequent word in Currier B (395/20817 = 1.9%)
- it is the most frequent word in Quire 13 (247/6911 = 3.57%)
- it is the most frequent word on folio 103v (15/449 = 3.34%), but it doesn't occur at all on folio 105v (0/390 = 0%)
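Frequency and rank of a single word type are straightforward to compute per section or per folio. A sketch, assuming each section is available as a flat list of word tokens (the sample list is a made-up placeholder, not real Voynich data):

```python
from collections import Counter

def type_stats(tokens, word):
    """Relative frequency and frequency rank (1 = most frequent)
    of one word type; tied types share the higher rank."""
    counts = Counter(tokens)          # missing words count as 0
    freq = counts[word] / len(tokens)
    rank = 1 + sum(1 for c in counts.values() if c > counts[word])
    return freq, rank

# Hypothetical miniature token list.
tokens = ["shedy", "qokeedy", "shedy", "daiin", "shedy", "chedy"]
print(type_stats(tokens, "shedy"))  # (0.5, 1): half the tokens, rank 1
```

Running this over each folio separately is what exposes cases like 103v vs. 105v, which a section-level average hides.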
The frequency counts for Herbal A and B and for the quires behave as the counts for 'ed' would suggest. However, the counts also differ between individual folios. How can we explain the counts for 'shedy' on folios 103v and 105v? Both folios belong to Quire 20, both share the same illustration type and, according to Lisa Davis, the same scribal hand, yet the difference between them (0% vs. 3.34%) is as dramatic as that between Currier A and Quire 13. Therefore, none of the three explanations used by Claire Bowern fits in this case.
Note: The distribution of 'shedy' is not an exception. Even for EVA-'daiin' it is possible to point to pages without a single instance of 'daiin'.
This means that even by applying three different explanations ("different encoding methods", "different scribal hands", and "different topics") it is obviously not possible to account for the properties of the real Voynich text in a satisfactory way. So before applying a statistical method to the VMS, you should also ask yourself: Can I apply this method to the Voynich text at all, and what do my results mean in the context of the Voynich text?