TLDR at the bottom. I may have misunderstood something and/or made errors. I hope I did.
The discussion started by Bakker continued in the post I linked yesterday [link]: a post by Victor Mair [link]. From [link], I see that Mair is a Sinologist who has been at UPenn since 1979; the page also has a list of selected publications. Peter Bakker's publications are listed here [link].
Mair's post was also commented on by Richard Sproat [link] (a computational linguist at Google):
Quote: The problem with injunctions such as Bakker's (or Victor's) that it is not worth trying to decipher something because it is probably a hoax, or there isn't enough text for Shannon unicity, or whatever, is that such appeals universally fail to have any stopping power on the enthusiast. And why should they? If the existence of dozens or hundreds of equally plausible previous "decipherments" of a corpus fail to dissuade them, why should other considerations?
Witness the hundreds of attempts to decipher the Phaistos Disk, or the Indus Valley corpus.
I am not sure I fully understand all that these linguists say, but I think they agree that the text is not decipherable.
Bakker points out that the script appears to be alphabetical: characters alternate like vowels and consonants, and word length is comparable to that of alphabetically written languages. But if one tries to match the text with a written language, no matches are found. In addition to what Bakker says, two of the problems are the binomial distribution of word lengths and the high frequency of reduplication (in particular, the fact that the most frequent word, daiin, often appears as daiin.daiin).
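As a concrete illustration of the reduplication point, here is a minimal sketch of how one could count adjacent repetitions; the token list below is a made-up stand-in, not a real transliteration (with a full EVA transliteration one would first split the text into word tokens):
Code:
# Count adjacent exact repetitions (e.g. "daiin daiin") in a sequence of word tokens.
# The token list is a tiny made-up sample; a real test would load a full EVA
# transliteration and split it into word tokens first.
from collections import Counter

tokens = ["fachys", "ykal", "ar", "ataiin", "shol", "shory",
          "daiin", "daiin", "qokeey", "okaiin", "daiin", "chedy"]

adjacent_repeats = [(a, b) for a, b in zip(tokens, tokens[1:]) if a == b]
print(f"{len(adjacent_repeats)} adjacent repetitions out of {len(tokens) - 1} word pairs")
print(Counter(a for a, _ in adjacent_repeats).most_common(5))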
One can then consider one of these ideas (as discussed in [link]):
1. Words are numbers, i.e. entries in a nomenclator, as in Rene's mod2 system
2. Verbose ciphers
I understand that Bakker did not discuss these options because they are outside the scope of linguistic investigation. If one of these two hypotheses is true, the text as it appears tells us nothing at all about the underlying language, its morphology and grammar, with the possible exception that (if there are no null symbols) reduplication should still be a feature of the plaintext. From this perspective, the problem is purely cryptographic.
1. Number-based nomenclator
I think it is obvious that the "numbers" idea means "undecipherable". Almost 70% of the roughly 8,000 Voynich word types are hapax legomena, and they account for about 14% of word tokens; 20% of the tokens belong to types that appear fewer than 5 times. As is well known, even relatively frequent word types do not form long repeating sequences: many of the repeating sequences include reduplication. The vocabulary size [link] for a moderately inflected language is more than 100,000 words.
The 14% of word tokens that are hapax legomena are totally hopeless. But the main problem is the size of the search space: something like 8k^100k ≈ 1.0E+390000 possible codebooks (for each language one wants to consider).
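To keep the arithmetic in one place, a few lines of Python reproduce these back-of-envelope figures (the counts are the rounded ones quoted in this post, not numbers computed from an actual transliteration):
Code:
# Back-of-envelope figures for the nomenclator hypothesis, using the rounded
# counts quoted above.
import math

word_types   = 8_000      # approximate number of distinct Voynich word types
hapax_share  = 0.70       # ~70% of types occur exactly once...
hapax_tokens = 0.14       # ...and they cover ~14% of the word tokens
lexicon_size = 100_000    # vocabulary of a moderately inflected language

# Conceivable codebooks that assign each of the ~100k lexicon entries
# one of the ~8k observed word types:
log10_space = lexicon_size * math.log10(word_types)
print(f"~{round(hapax_share * word_types):,} hapax types, ~{hapax_tokens:.0%} of tokens")
print(f"nomenclator search space ~ 10^{log10_space:,.0f}")   # about 10^390,000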
2. Verbose cipher
The verbose cipher idea looks better, but I think it is also unapproachable. A first implication of the idea is that labels are too short to be verbosely encoded words (see Koen's comment [link]). We should then conclude that labels are meaningless, but with a leap of faith we can still hope that there is something meaningful in the main text. Anyway, since [link], we should deduce that spaces are not significant and treat the whole text as a uniform string: if we don't ignore spaces, words are too short for a verbose cipher.
We should set an upper bound on the length of the verbose encoding of a plaintext character: without such a limit the search space is infinite, but the smaller the threshold, the higher the risk of missing the true decipherment. Bigrams could be a good starting point.
We have the beginning of our coded string:
fachysykalarataiinsholshorycthresykorsholdy
without a hint of how many words it encodes or in what language. We know that about 300 different bigrams appear in the manuscript, and we can now try mapping each bigram to a character of the alphabets of all the languages and dialects we want to consider.
Again, the search space is huge, though immensely smaller than for the nomenclator: something like 300^26 ≈ 2.5E+64 (for each target language, assuming an alphabet size similar to English). If one could test 1,000 options per nanosecond, searching this space exhaustively would still take on the order of 10^45 years.
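Here is a sketch of the bookkeeping this involves, using the string above; non-overlapping bigram segmentation is just one arbitrary choice, and the ~300 figure refers to the whole manuscript, not to this short fragment:
Code:
# Segment the space-stripped opening line into non-overlapping bigrams and
# redo the search-space / brute-force-time arithmetic.
stream = "fachysykalarataiinsholshorycthresykorsholdy"   # spaces already removed
bigrams = [stream[i:i + 2] for i in range(0, len(stream) - 1, 2)]
print(bigrams[:8], f"... {len(set(bigrams))} distinct bigrams in this fragment")

distinct_bigrams = 300    # roughly 300 distinct bigrams in the whole manuscript
alphabet_size = 26        # one bigram per plaintext letter, English-sized alphabet
space = distinct_bigrams ** alphabet_size
print(f"candidate mappings per language: ~{space:.1e}")   # ~2.5e64

rate_per_second = 1e12    # 1,000 candidate mappings per nanosecond
years = space / rate_per_second / (3600 * 24 * 365)
print(f"exhaustive search: ~{years:.1e} years")           # ~8e44 years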
We had to give up labels, so we are not in the happy position of [link], with small groups of words that have a specific context: we have no words and no context. Mappings will have to be evaluated on the basis of the quality of the decoding of the main text, something difficult to automate and suspiciously similar to what Yokubinas, Cheshire and Ardic do.
With such a huge number of variables, one will likely get several locally good decipherments for various languages, and it would be impossible to judge them reliably without significant knowledge of XV Century forms of Georgian or Scottish Gaelic etc. I guess this is the problem that Sproat refers to when he mentions Shannon unicity and the many different but equally plausible "decipherments".
Also, the Currier A/B drift means that the mapping will not be homogeneous: does a locally good decipherment break down because it is not correct, or because the drift affects the mapping?
This is much simplified: we don't know whether 2 characters is a reasonable length for the verbose encoding of a single plaintext character; actually, there are EVA trigrams that look like good candidates to encode a single character. Different sequences could have different lengths, some plaintext characters might not be encoded at all (abjad), there could be null characters, null sequences, etc.
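Just to make concrete what one candidate attempt looks like once these extra degrees of freedom are allowed, here is a toy greedy decoder; the mapping table is invented purely for illustration (it mixes bigrams, a trigram and a null sequence) and is in no way a proposed decipherment:
Code:
# Toy greedy decoder for a variable-length verbose cipher: longest match first.
# The table maps EVA bigrams/trigrams to Latin letters and is purely invented;
# nothing here is a proposed decipherment.
TOY_TABLE = {
    "cth": "p",                                   # a trigram for one plaintext letter
    "fa": "m", "ch": "a", "ys": "r", "yk": "t", "al": "e",
    "ar": "s", "at": "o", "ai": "n",
    "in": "",                                     # "" marks a null sequence
}

def decode(stream: str, table: dict) -> str:
    lengths = sorted({len(k) for k in table}, reverse=True)   # try longest groups first
    out, i = [], 0
    while i < len(stream):
        for n in lengths:
            chunk = stream[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append("?")   # no code group matches here: mark it and move on
            i += 1
    return "".join(out)

print(decode("fachysykalarataiin", TOY_TABLE))    # gibberish, as expected from a toy table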
The idea of a verbosely encoded abjad could maybe allow us to preserve labels and word spaces (verbose encoding inflates word length while dropping vowels shrinks it, so the two effects could compensate). The search space would also be smaller (something like 300^20 ≈ 3.5E+49 for a ~20-consonant alphabet), but still many orders of magnitude too large, and more ambiguity would be introduced by the absence of vowels.
TLDR: I think that, if we must totally discard the phonetically-written-weird-language hypothesis, these three experts are right and the Voynich manuscript is either totally meaningless or undecipherable.