The Voynich Ninja

Full Version: Orlov YN - Language recognition methods and Voynich Manuscript analysis
Don't know if this is really news; feel free to move or delete this if it's old news.


Orlov Yurii Nikolaevich of the Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences has published a short linguistic analysis of the VMS, with the following conclusion:

"The fact that the eigenvalues of MV transcriptions lie in a circle rather than an ellipse
distinguishes texts without vowels. The twice larger radius of this circle indicates that the possible
neighborhoods of pairs of symbols are more variable than for one language. 
Thus, the results obtained in this section do not contradict the proposed concept of the MV compound language
and supplement it with one more statistical argument. It seems important to emphasize that all
these arguments are fundamentally different; express the features of independent statistics
indicating that the interpretation of the MV as a composite manuscript is quite admissible"


The original Russian version is here: [link]
Google Translate's take on it is attached.
So if I got this right, the author uses 4 independent statistical methods.
(VMS = Voynich manuscript)

1. Using the Logarithmic symbol distribution model developed by S.M. Gusein-Zade:
  - an argument is given in favor of the fact that VMS is written in some language without the use of vowels.

2. Using the Hurst exponent:
  - it is shown that the manuscript contains entries in various languages.

3. Taking the following assumptions:
a1. The manuscript is a bilingual text with a common alphabet.
a2. Vowels were removed from the text before recoding.
a3. Recoding consisted in the unambiguous replacement of a letter with a symbol.
a4. Spaces in text are not considered characters.

and doing some statistical analysis of frequencies in modern texts gives us:
  - the statistical hypothesis that [the VMS was written in two languages but in the same alphabet] can be considered quite acceptable.

4. Use of spectral portraits of bigram matrices.
  - points to text without vowels
  - points to more than one language

Overall, these 4 independent statistics indicate that the interpretation of the VMS as a composite manuscript is quite admissible.
 - 'composite' here means (I think) that the VMS text has no vowels and uses 2 (or maybe more) different languages.
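
To make assumptions a2 to a4 concrete, here is a minimal Python sketch (not the author's code). The vowel set, the sample sentence and the replacement symbols are all made up for illustration. It shows one property the rank-frequency approach relies on: a one-to-one recoding leaves the rank-ordered frequency distribution unchanged.

```python
from collections import Counter

VOWELS = set("aeiouy")  # assumption: which letters count as vowels


def devowel(text):
    """Keep consonant letters only: drops vowels (a2) and spaces/punctuation (a4)."""
    return "".join(c for c in text.lower() if c.isalpha() and c not in VOWELS)


def recode(text, key):
    """Unambiguous replacement of each letter with one symbol (a3)."""
    return "".join(key[c] for c in text)


def rank_frequencies(text):
    """Relative symbol frequencies, sorted from most to least common."""
    counts = Counter(text)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)


plain = devowel("The quick brown fox jumps over the lazy dog")
# Hypothetical key: map each distinct consonant to an arbitrary capital letter.
key = {c: chr(ord("A") + i) for i, c in enumerate(sorted(set(plain)))}
cipher = recode(plain, key)

print(plain)
print(cipher)
print(rank_frequencies(plain) == rank_frequencies(cipher))
```

Because the recoding only renames symbols, any statistic built on sorted frequencies cannot tell the plain and recoded texts apart, which is why such a comparison can be made without knowing the key.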
(23-08-2022, 04:37 PM)RobGea Wrote: Overall these 4 independent statistics indicating that the interpretation of the VMS as a composite manuscript is quite admissible.
 - 'composite' here means (i think) the VMS text has no vowels and uses 2 (or maybe more) different languages.
Thanks for the summary. Admissible assumptions in a way, maybe, but insufficient to explain the paradoxical nature of Voynichese: a case of "ex falso quodlibet".
- About the abjad hypothesis see: [link]
- About languages/dialects: there are many glaring statistical inconsistencies that don't fall neatly into 2, 3 or 4 categories, and yet the text looks mostly homogeneous to the naked eye over large parts.
a1. The manuscript is a bilingual text with a common alphabet.

a2. Vowels were removed from the text before recoding.


There is one subtlety in the research. The conclusions in the work are confirmed for bilingual text only on the example of distantly related languages of the Indo-European family. For closely related languages (English-German, French-Italian, Russian-Bulgarian) it does not work!
I write this without having read the article, but I have a significant doubt.

The approach of mixing two languages and removing vowels does not seem to be able to explain either the low entropy or the word structure. Entropy does not seem to have been looked at in the experiments....

Reducing entropy is possible, but this comes with rather significant changes at the character level, and I believe that these would invalidate the results of the test in the paper (but again with the same caveat).
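
For anyone who wants to check the entropy point on their own corpora, here is a minimal Python sketch of first-order and second-order (conditional) character entropy, applied to a text with and without vowels. The vowel set and the sample sentence are assumptions for illustration; a sentence this short says nothing about real languages.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: adjust per language


def devowel(text):
    """Drop vowels, spaces and punctuation, keeping consonant letters only."""
    return "".join(c for c in text.lower() if c.isalpha() and c not in VOWELS)


def char_entropy(text):
    """First-order entropy of single characters, in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def conditional_entropy(text):
    """Second-order entropy H(next | current) = H(pairs) - H(first characters)."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    h_pairs = -sum(n / total * math.log2(n / total) for n in pairs.values())
    return h_pairs - char_entropy(text[:-1])


sample = "the quick brown fox jumps over the lazy dog"
with_vowels = "".join(c for c in sample if c.isalpha())
for label, text in (("with vowels", with_vowels), ("devoweled", devowel(sample))):
    print(label, round(char_entropy(text), 3), round(conditional_entropy(text), 3))
```

Run on a sizeable historical text, this would give the numbers to compare against the known low conditional entropy of Voynichese.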
There's also the video [link] of the presentation, also in Russian.

I watched it briefly. The idea of the author, if I got him right, is that vowelless rank vs frequency letter distributions tend to cluster for languages from the same language group. E.g. the distributions for German and English (both belonging to the Germanic group) are closer to each other than those for e.g. English and French.
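
A minimal Python sketch of what "closer to each other" could mean here, assuming the closeness measure is the L1 (taxicab) distance between rank-ordered consonant frequency vectors. The vowel set and the fixed profile length are assumptions for illustration, not taken from the presentation.

```python
from collections import Counter

VOWELS = set("aeiouy")  # assumption: adjust per language


def rank_profile(text, size=20):
    """Consonant frequencies sorted in descending order, padded or cut to `size`."""
    counts = Counter(c for c in text.lower() if c.isalpha() and c not in VOWELS)
    total = sum(counts.values())
    freqs = sorted((n / total for n in counts.values()), reverse=True)
    return (freqs + [0.0] * size)[:size]


def l1_distance(p, q):
    """Taxicab distance between two equally long frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest(sample, references):
    """Name of the reference text whose rank profile is nearest to the sample."""
    target = rank_profile(sample)
    return min(references, key=lambda name: l1_distance(target, rank_profile(references[name])))
```

With `references` a dict of language name to a long reference text, `closest(page_text, references)` would return the nearest language for one page, which is roughly the kind of per-folio assignment plotted in Figure 7.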

Ultimately he concludes that the VMS seems to be bilingual (some folios one language, some in another) with Danish and Latin being the best candidates (selected from European languages). There's a slide (Figure 7 in the paper) plotting the supposed languages against the folio space - it would be interesting to check this supposition versus Currier A/B, but apparently the author does not do that.

He also employs other statistical methods as indicated by RobGea (Hurst and spectral portraits) shown to support the idea of bi- (or multi-) lingualism of the VMS text.

It is true that he does not touch the entropy issue, at least in the video.

There's been a previous publication by the same author, and for sure it was discussed here in the News section. I had a recollection that back then I had been sceptical about it for some reason, but could not recall the exact one. On glancing at this new paper, however, I instantly recollected what the matter was: he still claims that the VMS was produced in the 16th century. It is a bad sign, in my opinion, for any serious investigation of the VMS to be mistaken in such very basic facts about it.
Apologies for overlap with prior comments that showed up after I composed this...

* There is a related paper (in English) which Orlov is a co-author on at [link], which should also be read by anyone looking at this paper who doesn't speak Russian (but does speak English) -- same techniques applied.

* Obvious but necessary caveat that my review is based on the Google translation from Russian to English, so apologies for any errors in my comments caused by flaws in the automated translation.

* Paper appears to have been presented at a conference that ran Feb 4-5, 2021, so it's unclear that there's any reasonable accessibility-based explanation for lack of awareness of/reference to key prior work.

* Very first sentence of intro (in translation): "The Voynich Manuscript(hereinafter MV) [1] is a manuscript dated by researchers of the 16th century." (sic) -- lack of awareness of C-14 dating.

* "Numerous studies to decipher this text have been carried out for more than a hundred years, but without success. The existing versions about the authorship, content, and language of the manuscript, a review of which can be found in [2–4], are not sufficiently convincingly supported by full-fledged statistical studies." -- references 2 - 4 are Nick Pelling's _The Curse of the Voynich_, J. G. Barabe's report for McCrone on the materials analysis, and Levitov's '87 book describing his "solution." Along with a reference to Yale's catalog entry and Landini & Zandbergen '98, those are the only references to anything to do with the manuscript. No reference to existing overviews of the statistical properties of the text such as Bowern & Lindemann or Reddy & Knight.

* "There is also no consensus on how many and what signs are in the MV. There is a so-called "European transcription" (EVA [6]) mapping characters of the manuscript into the Latin alphabet. In addition, there is a transcription of Takahashi [7] - also in Latin, but with different frequencies." Where to begin? EVA isn't a transcription, it's a transcription alphabet; the Takahashi transcription uses EVA. Raw EVA is used in the analysis without any recognition/discussion of the commitments that's making regarding the nature of the underlying script (are ligatured gallows single characters? "iin"/"iiin"?) -- the results are only going to be as valid as those implicit assumptions are...

* First statistical analysis performed: comparison, using the L1 norm (taxicab distance), of the rank-ordered character frequency vectors against those of various languages -- GIGO issue wrt use of raw EVA.

* Second statistical analysis performed: comparison of "the Hurst exponent for a series of the number of letters enclosed between the two most frequently occurring identical letters" -- unfortunately, the labels for the graph on p. 10 got stripped out during the translation process. Again, the issue of how EVA affects those distance counts is a potential problem. The Arxiv paper referenced above presents the same or a similar analysis; part of the conclusion from that analysis given there is, "In case of the Manuscript observed distributions are shifted to the right and have much less acute maximum compared to all other curves on Fig.8. This means that statistics of the Manuscript does not agree with statistics of texts written in one particular language. Roughly speaking, symbols in the Manuscript are placed 'more randomly' compared to the latter. Further analysis of these issues will be presented in the following sections of the paper. There are two main options here: the Manuscript is written in a special constructed language or it is written in several languages."

* Here we get to one of the key failings of the paper(s): the authors show no awareness of (or at the very least do not engage in any way with) any of the prior observations regarding statistically distinct "languages" in the mss. going back to Currier's paper and confirmed/refined by multiple published cluster analysis studies over the intervening decades.

* They then compare the spectral properties of the digram frequency matrices for the two EVA-based transcriptions they use vs. languages in the Germanic and Romance families with and without vowels. Not clear what text corpora are being used (i.e., are they using 16th (sic) century or earlier texts in the various languages or more modern samples?).

* One of their conclusions is that the differences between the character statistics examined in the different sections of the mss. are more comparable in magnitude to the differences between languages than within languages (again, with the caveats that go with using raw EVA as input). Without prejudice to what that *means*, it points to an issue with other analyses that merrily assume that the differences between "languages" in the mss merely reflect changes in topic/author/etc. rather than difference in language/cipher key or system/etc. That should be demonstrated, not assumed. BTW, having fed Herbal A & Bio B (in Currier) into my monoalphabetic cipher solver in the past, that is consistent with what I have observed wrt within-/between-language differences in the matching metric used there (a chi^2 statistic on the digram frequency stats) -- the magnitude of the difference is more consistent with two different languages than variation within a language -- not that I think Voynichese is a monoalphabetic substitution at the glyph level...

* The bottom line of the work (as given in the Arxiv paper) is, "Concerning the Manuscript, it seems most plausible that it was written in two languages having the same alphabet without vowel letters: 30% of the text is written in one of the Germanic languages (Danish or German) and the rest 70% – in one of the Romance languages (Latin or Spanish)."

* The bottom line of my impression of the paper:

1) The statistical analyses per se seem fine, subject to all the appropriate caveats about using raw EVA.

2) With regard to their overall conclusion (as given in the Arxiv paper), there's an old joke about two economists who are walking down the street when they see what appears to be a $20 bill lying on the sidewalk. One of them starts to bend down to pick it up, and the other one says, "Don't bother -- if that really was a $20 bill, someone would have picked it up by now." I hate to be that economist, but...if the Voynich text were a monoalphabetic cipher with EVA characters mapping to consonants in a devoweled Germanic or Romance language I'd think it'd have been solved by now -- especially with the increasing availability of historical text corpora. On the other hand, _chacun à son goût_.

3) The lack of any apparent awareness of prior work on different "languages" within the text is troubling, although independent confirmation by different means still has value.

4) The lack of any apparent awareness of the broader array of prior work on statistical characteristics of the text is also troubling. They don't address the question of what the entropy statistics of devoweled European languages look like (I genuinely don't remember how that sorts out compared to the Voynich text). Referencing/replicating Reddy & Knight's observations on word length distribution ("However, Stolfi (2005) show that Pinyin Chinese, Tibetan, and Vietnamese word lengths follow a binomial distribution, and we found (Figure 3) that certain scripts that do not contain vowels, like Buckwalter Arabic and devoweled English, have a binomial distribution as well. The similarity with devoweled scripts, especially Arabic, reinforces the hypothesis that the VMS script may be an abjad.") as an independent line of support for Voynichese as a devoweled European language would have helped make their case.

5) As an aside regarding the results of applying Sukhotin's algorithm to the Voynich text, it's pretty clear that its identification of Currier O, A, and C as vowels is an artifact of the verbose glyph combinations that start with them (which also are the main contributors to pulling down the 1st and 2nd order entropy values). I don't consider that result strong evidence for the existence of vowels in "Voynichese".
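
For reference, a minimal Python sketch of the second analysis described above, under two assumptions that may not match the paper. First, the series is read as the number of symbols between successive occurrences of the single most frequent symbol. Second, the Hurst exponent is estimated with a plain rescaled-range (R/S) fit; the paper's estimator may differ.

```python
import math
from collections import Counter


def gap_series(text):
    """Symbols enclosed between successive occurrences of the most frequent symbol."""
    target = Counter(text).most_common(1)[0][0]
    positions = [i for i, c in enumerate(text) if c == target]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def rescaled_range(chunk):
    """R/S of one window: range of cumulative deviations over the standard deviation."""
    mean = sum(chunk) / len(chunk)
    running, cumulative = 0.0, []
    for x in chunk:
        running += x - mean
        cumulative.append(running)
    std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / len(chunk))
    if std == 0:
        return None  # constant window carries no information
    return (max(cumulative) - min(cumulative)) / std


def hurst_exponent(series, min_window=8):
    """Slope of log(mean R/S) against log(window size), windows doubling in size."""
    xs, ys = [], []
    n = min_window
    while n <= len(series) // 2:
        chunks = [series[i:i + n] for i in range(0, len(series) - n + 1, n)]
        values = [rs for rs in map(rescaled_range, chunks) if rs is not None]
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
        n *= 2
    if len(xs) < 2:
        raise ValueError("series too short for an R/S fit")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    return slope_num / slope_den
```

Calling `hurst_exponent(gap_series(transcription))` on a transcription string gives one number per text. Note that the gap lengths depend directly on how the transcription alphabet splits glyphs, which is the raw-EVA caveat again.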
(25-08-2022, 11:17 PM)kckluge Wrote: There is a related paper (in English) which Orlov is a co-author on at [link], which should also be read by anyone looking at this paper who doesn't speak Russian (but does speak English) -- same techniques applied.

Thanks for the link, I don't remember this one being referenced in the forum!

Reference #1 therein (the 2016 preprint) is, I think, the very article that I mentioned (the one that we discussed earlier).