The Voynich Ninja

Full Version: The Linguistics of the Voynich Manuscript (Bowern et al. 2020)
Pages: 1 2 3 4 5
There's a new paper on the VM at [link]:

Quote:[link]
[link], [link]
September 2020
 

The Voynich Manuscript is a 15th Century illustrated cipher manuscript. In this overview of recent approaches to the Voynich manuscript, we summarize and evaluate current work on the language that underlies this document. We provide arguments for treating the document as natural language (rather than a medieval hoax) and show how we can make statistical arguments about the phonology, morphology, and structure of the document, even though the contents remain undecipherable.

ETA: Claire Bowern is a legit linguist, so I would expect this paper to be a serious effort.
Now that I've had a chance to read the paper, I'd say that it is clear and generally well-written but it is best viewed as setting forth the status quaestionis of the VM rather than advancing a key new insight.

The paper does not consider Schinner and Timm's work, which in my view is a major oversight since their auto-citation proposal, if I understand it correctly, can generate some of the supposed linguistic features discussed in the paper (like the Zipfian behaviors).

They are also probably too quick to dismiss a late medieval/early modern fake here:
Bowern et al. 2020:5 Wrote:We also consider it unlikely that the Voynich Manuscript is an ancient hoax. The cost associated with the production of such a manuscript and the number of people involved make it unlikely that it was created purely to deceive. A much smaller hoax would have served the same purpose with much less expense. Moreover, people who assume that the manuscript is a medieval gibberish hoax massively underestimate the amount of effort required to produce sustained language-like nonsense.[5]

In my view, arguments that something can't be a fake because it involves too much effort are poor. The Hitler Diaries impressed people because the forger produced 60 volumes, but that merely showed the industriousness of the forger. Besides, "purely to deceive" is too narrow a reason to produce one. Many fakes are sold (or offered for sale) for solid money. Indeed the purchase history of this codex, in a script no one can read, is good confirmation that an unreadable book can fetch a pretty penny.

Footnote 5 is pretty interesting: "5. We tested this point in an undergraduate class and found that beyond about 100 words, the task of writing language-like non-language is very difficult. It is too easy to make local repetitions and words from other languages."

I'd like to know more about this undergraduate experiment--as well as others in asemic writing--especially on the nature of local repetitions (how are they comparable to the VM's repetitiousness?). Again, here, some engagement with the Schinner & Timm proposal may be enlightening. Can undergraduates, briefly trained in auto-citation, simulate features of the text?

Very helpful are their conclusions on character entropy:
Bowern et al. 2020:10 Wrote:The entropy of Voynichese is unlike any other language or script. Plausible manipulations of the script were investigated, including various shorthand abbreviations and devoweling the script. These do affect the character entropy, but not to the extent that would be required to bring Voynichese to the level of other languages. The only manipulation of this type that brings the conditional entropy to Voynich levels is systematic conflation of phonemic distinctions, such as conflating all vowels to a single character, recoding based on dividing characters into whether they occur in the first or second half of the alphabet, or sorting all characters in the word into alphabetical order.
If these options for the character entropy are correct, I would say that it does not bode well for having a now-meaningful text (as opposed to a once-meaningful text), since "systematic conflation of phonemic distinctions" and "sorting all characters in the word into alphabetical order" both lose information. Perhaps the second option, "recoding based on dividing characters into whether they occur in the first or second half of the alphabet," does not lose information, depending on how it's done.
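
For anyone who wants to try this at home, here is a minimal sketch of the kind of measurement involved. It is my own illustration, not the authors' code. It assumes that "conditional entropy" means the second-order character entropy H(next character | previous character) estimated from bigram counts, and that the plaintext uses a Latin alphabet. The function names are made up for the example.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: a Latin-alphabet plaintext


def conditional_entropy(text: str) -> float:
    """H(next char | previous char) in bits, estimated from bigram counts."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])  # how often each char appears as the first of a pair
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (a, _b), n in pairs.items():
        # p(a, b) * log2 p(b | a)
        h -= (n / total) * math.log2(n / firsts[a])
    return h


def conflate_vowels(text: str) -> str:
    """Replace every vowel with a single symbol (distinctions are lost)."""
    return "".join("a" if ch in VOWELS else ch for ch in text)


def sort_within_words(text: str) -> str:
    """Sort the letters of each word alphabetically (also lossy)."""
    return " ".join("".join(sorted(word)) for word in text.split())
```

Running conditional_entropy on a text before and after each transformation shows how far the value moves. The information loss is easy to see on single words: conflate_vowels turns both "secreta" and "sacrata" into "sacrata", and sort_within_words maps every anagram of a word to the same string, so neither step can be undone without outside knowledge.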
Here are some observations after a quick first read of the paper. I wrote this before reading Stephen's comments and then I edited my post: there may still be some repetition of things that Stephen already pointed out.

The paper largely is a survey of the best literature about Voynichese. It is quite extensive and can in part be seen as an update to Reddy and Knight "What we know about the Voynich manuscript" (2011).

The authors address all possible angles of a linguistic interpretation of Voynichese. I agree with Stephen that their treatment of "meaningless hoax" ideas is more superficial. 

When presenting various proposed solutions, the authors note that "When [solvers] discuss the data, they focus almost entirely on the lexicon, ignoring morphology and syntax". This is an excellent remark: the process we see over and over is VMS to word-salad to "meaningful" text; but languages are not word-salads and an actual translation requires more than a dictionary.

So Bowern and Lindemann give much space to discussion of structure in the ms: this is done with ample reference to researchers like Stolfi, Guy** and Gheuens, whose work with MATTR receives great attention (bravo Koen! and bravo Nablator, who wrote the software!).

Another excellent detail of this work is that they compared abbreviated and full versions of a Latin text (Secreta Secretorum) finding that entropy is higher in the abbreviated text.

This passage about the low entropy lists a number of interesting ideas:
Quote:The entropy of Voynichese is unlike any other language or script. Plausible manipulations of the script were investigated, including various shorthand abbreviations and devoweling the script. These do affect the character entropy, but not to the extent that would be required to bring Voynichese to the level of other languages. The only manipulation of this type that brings the conditional entropy to Voynich levels is systematic conflation of phonemic distinctions, such as conflating all vowels to a single character, recoding based on dividing characters into whether they occur in the first or second half of the alphabet, or sorting all characters in the word into alphabetical order.

Another small bit that happens to overlap with my (obviously non-original) opinions is that EVA:m and EVA:g are line-final variants of endings that are written differently elsewhere in the text (the authors propose '-iin' and '-y' respectively). I think of these characters as 'abbreviations' but the views of the authors are slightly different.

Bowern & Lindemann appear to be convinced that Pelling's axiom is indeed valid: each Voynichese word largely corresponds to a word in the underlying language.

I am happy to also see a quantitative discussion of reduplication (with measures that are quite close to the 1% we often discussed on the forum):
Quote:Full reduplication, in which the entire word is repeated, is also common in Voynich. However, it is still within the realm of plausibility for natural language texts. In Voynich A each word has a 0.84% chance of repeating while in Voynich B that chance is 0.94%. The range among the samples in our language corpus is 0.02%-4.8%, with an average of 0.63%

I find this result interesting and I am looking forward to looking into the corpus used, to examine the details. I had missed the footnote mentioned by Stephen, and of course the repetition in human attempts at meaningless text is also of the greatest interest. I hope all the benchmark texts are available for download, but I have not checked yet.
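
For comparison with other texts, the measure in the quote can be approximated with something like the sketch below. This is only my reading of "each word has a X% chance of repeating": I assume it is the share of adjacent word pairs in which both words are identical, and I ignore line and paragraph breaks. The paper may define it differently, and the function name is made up.

```python
def full_reduplication_rate(tokens: list[str]) -> float:
    """Share of adjacent token pairs in which the second token repeats the first."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    # Assumption: the denominator is the number of adjacent pairs.
    return repeats / (len(tokens) - 1)
```

Feeding it the word list of a transliteration (split on spaces) and the word lists of the benchmark texts would make the figures directly comparable, as long as the same tokenization is used for all of them.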

I am not sure I fully understand their explanation for quasi-reduplication (which here is simply called "repetitiveness").

Quote:This repetitiveness is at least partly the result of the relatively limited set of character combinations and the predictable structure of Voynich words.

About the "not structure-preserving script" mentioned in the final Summary: I think that an option that fits this description is Rene's mod2 cipher system (which assumes a nomenclator). I am not sure that a verbose-cipher would be enough to produce the rigid word-structure of Voynichese, but it seems that the authors are going to further investigate these ideas, so hopefully we will read more in the future.
Some passages (e.g. "4.2 Phrases") seem to suggest that an artificial language is also being seriously considered, but they seem to be thinking of a Latin-based language, rather than an a-priori language as proposed by Friedman.

Of course there also are statements that do not match my own opinions, but I no longer believe everything I think, so they don't seem to be worth mentioning at the moment.


**  Apparently, there is a typo in the discussion of Guy's application of Sukhotin's algorithm: EVA:y ('g' in Guy's transliteration) was identified as a vowel but is missing from the list in the pre-print.

Another note:
Quote:[Some characters] closely resemble numbers: cf. q, d, y.
l should also be added to the list (its shape was a frequent variant of 4).
Just for completeness (and to get as many possibilities on the table as possible), I think it could be argued that all the following glyph-shapes are similar to medieval numerals:

o i s l q v d y (0 1 2 4 4 7 8 9)
(07-09-2020, 08:13 AM)MarcoP Wrote: Of course there also are statements that do not match my own opinions, but I no longer believe everything I think, so they don't seem to be worth mentioning at the moment.

Hi, Marco:

Thank you for your prior comments.  

In general, not believing everything you think is a very wise approach. Only hard-won experience brings this realization, which is not an easy process, but a fair amount of objective truth is gained through it.
 
When you feel like it, I would be quite interested in what you have to say. You have spent many years actively* looking at this problem, and those years bring with them certain views that are justified.

Also, if these differences are major, you could consider publishing a discussion of further considerations not present in this work. I think it would be tough to get it published in the same journal -- but it could be done on an open platform.  I, for one, would support such an effort, if you (and maybe Emma?) wanted to do that.

Frankly, I would give the same advice to Torsten.  

*the amount of work you (and many others on the board and the amateur community) have put in is staggering.  I also greatly appreciate your willingness to jump in and provide data related to a problem posed.   The amount of work involved is sobering, to those, like me, who are quite new to the issues involved.  I realize the need to push things forward but I am still in the process of understanding what has gone before and the reasons behind those conclusions, so any discussion is welcomed!
I haven't yet read the full article, though skimming it I can see it is more a recap of existing research than advancing new theories. I shall comment later when I've had a further look. I can only say I'm immensely glad to see that "language" solutions are much more respectable and popular than they were only five years ago.
Great read. Thank you for posting this, Stephen. I get the sense Prof Bowern really did her homework, and tried to be as informed and up to date as possible about the state of VMs research. And congrats to Koen and nablator for the citations! You guys have come up with two pretty interesting tools for analyzing Voynichese glyph placement at the levels of the line and the vord, respectively, and I'm happy to see both of you getting the recognition you deserve for your work.

I keep hearing rumors, both here and on Nick Pelling's blog, that someone has attempted to execute Torsten Timm's self-citation method using low-tech methods, and failed to replicate his results. I consider my Google-fu fairly decent, as I find all kinds of obscure pieces of medical research and deals on medications for people all the time. But I was completely unable to find anything published substantiating even an attempt to replicate TT's method, to say nothing of any results.
The dismissal of the idea that the text is fake or gibberish is good and solid. They used comparisons of Zipf's law, proportional frequency, and MATTR to show that the Voynich text is broadly similar to natural language. While these measures capture related things, all three would be hard to fake.

Quote:As with the Zipfian word distribution, we find Voynich to be well within the expected values for natural language texts, and far from random gibberish. If the Voynich text is meaningless, its creators mimicked natural language in a sophisticated way.

I don't find it credible that a hoaxer before 1910 could have achieved this either by design or by luck.
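
For readers who have not met MATTR before, here is a rough sketch of how a moving-average type-token ratio can be computed. It is not the software used for the paper, just my own illustration: it slides a fixed-size window over the word list one token at a time and averages the share of distinct words in each window. The window size of 500 is an arbitrary default that I picked, not a value taken from the paper.

```python
from collections import Counter


def mattr(tokens: list[str], window: int = 500) -> float:
    """Moving-average type-token ratio over all windows of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    counts = Counter(tokens[:window])  # word counts inside the current window
    distinct_sum = len(counts)         # running sum of distinct words per window
    n_windows = 1
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        distinct_sum += len(counts)
        n_windows += 1
    return distinct_sum / (n_windows * window)
```

Because the window length is fixed, the value does not depend on the total length of the text the way a plain type-token ratio does, which is what makes it usable for comparing the manuscript with benchmark texts of different sizes.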
As I am not a linguist, I did not understand which of the analyses are the authors' own. Do the graphs reflect their own results? On what documents, and from what period, do they base their analyses? Is it marked somewhere?
Thank you for sharing the paper. The paper is interesting. But it is also possible to point to some issues.

These are my main findings so far:

1) There is a reference to the paper of Daruka (2020) on page 3 but there is only a comment: "Daruka (2020) likewise comes to the conclusion that the Voynich Manuscript is a hoax and contains gibberish,  though created by different means than those suggested by Rugg". By which different means Daruka comes to the same conclusions is not shown or discussed.

2) Daruka (2020) references arguments presented by Schinner (2007), Timm (2014), Timm (2016) and Timm & Schinner (2020). These arguments are not mentioned and the results presented there are not discussed. This way some important arguments for a structured pseudo-text hypothesis are not discussed.
For instance, observations like the unusual random walk results (Schinner 2007), or "the deep correlation between frequency, similarity, and spatial vicinity of tokens" (Timm & Schinner 2020, p.4) are not addressed.

3) Daruka (2020) references the text generation method presented by Timm (2014) and Timm & Schinner (2020). This text generation method is not mentioned.

4) The paper argues that "gibberish is by nature random" (Bowern & Lindemann 2020, p. 4). There is no reference given for this statement. Moreover, this statement is contradicted by an experiment that is described on p. 5. The result for this experiment is described this way: "It is too easy to make local repetitions and words from other languages." But local repetition is exactly what we see in the Voynich manuscript and it is at least possible to interpret local repetitions as some kind of structure. Moreover, this experiment confirms a statement by D'Imperio: "The scribe, faced with the task of thinking up a large number of such dummy sequences, would naturally tend to repeat parts of neighboring strings with various small changes and additions to fill out the line until the next message-bearing word or phrase" ([link] p. 31).

5) The paper argues that the text behaves "non-language like at the character level" but "above the word to line and paragraph, as well as in the distribution of words across the manuscript, it looks like a natural language" (Bowern & Lindemann 2020, p. 4). These two observations contradict each other. The presented conclusion is therefore at least surprising: "This strongly implies that the manuscript is encoded natural language".

6) The paper argues that word tokens can be divided into 'prefixes', 'roots/midfixes', and 'suffixes' (Bowern & Lindemann 2020, p. 11f). This interpretation is contradicted by another observation: characters depend on the previous character. To illustrate the second observation, a table is even presented: "Frequency of each Voynich Character given the previous character" (Bowern & Lindemann 2020, p. 12). But if a character depends on the previous character, this fact alone explains why glyphs are typical for certain positions. There is no need for 'prefixes', 'roots/midfixes', and 'suffixes'. The paper also doesn't discuss how the assumed affixes behave compared to affixes used in natural languages.

7) The paper names Tiltman (1950) as the source for the 'prefixes', 'roots/midfixes', and 'suffixes' hypothesis. But at least in 1967 Tiltman actually did "divide words into what I call 'roots' and 'suffixes'" (Tiltman 1967, p. 7). Moreover, Tiltman described glyph sequences like 'ok' and 'qok' as roots and sequences like 'aiin' as suffixes:
[attachment=4749]

8) The paper argues that "the most common word in many natural languages is a definite article like ‘the’, a connective like ‘and’, or a preposition like ‘in’/‘of’" (Bowern & Lindemann 2020, p. 14). The paper therefore suggests that <daiin> is a connective (Bowern & Lindemann 2020, p. 14). This is contradicted by the fact that such words are normally evenly distributed, whereas evenly distributed words do not exist in the Voynich text (see Timm & Schinner 2020, p. 5).

9) The paper argues that Currier A and B either use two different methods to encode natural language or encode two different languages. Unfortunately, the observation that common words in Currier A like <daiin> also occur frequently in Currier B, but not vice versa, is not addressed in this context (see Timm & Schinner 2020, p. 7).

10) The paper argues with "Moving average type token ratio" (MATTR) (Bowern & Lindemann 2020, p. 14) and even argues "Voynich most closely resembles the averages for Germanic and Iranian, and least resembles those for Turkic, Dravidian, and Kartvelian". But the paper does not say whether the MATTR analysis was done for the whole manuscript or only for Currier A or B. Since the paper argues that Currier A and B encode language differently, it would be important to know what the MATTR results stand for and whether Currier A and B behave differently.

11) The paper argues again "we find Voynich to be well within the expected values for natural language texts, and far from random gibberish" (Bowern & Lindemann 2020, p. 16). Unfortunately, nobody argues that the text represents random gibberish.

12) On page 17 the line and the paragraph are discussed as functional units. The paper suggests that the words are ordered or that the "same word will be written differently depending on where it appears in the line" (Bowern & Lindemann 2020, p. 17). There is no discussion of whether such patterns can be observed in natural languages as well (they don't behave this way).

13) The paper argues on p. 19 that the "repetitiveness is at least partly the result of the relatively limited set of character combinations and the predictable structure of Voynich words". A deeper analysis of this phenomenon can be found in Timm & Schinner 2020. But this paper is not mentioned.

14) On p. 18f the paper discusses "phrases" but doesn't mention the fact that longer repeated word sequences are missing. The paper of Akinori Ito from 2002, "Observation of left and right entropy in Voynich MS", argues: "This result suggests that words in VMS are quite context-independent." A similar result was presented by Cárdenas et al. in 2014. The paper with the title "Does network complexity help organize Babel’s library?" argues: "It can be observed that the distributions of these 'randomized texts' are similar to those of the originals, which implies that inhomogeneous distribution of connectivity depends not on the way words are linked, but rather on the frequency of words in the text". The results of both papers are highly relevant to the question of whether phrases exist, but they are not mentioned or discussed. Instead the paper only refers to a single pattern found by Currier in 1976: "A word that begins with qo- is usually preceded by a word that ends with -y" (Bowern & Lindemann 2020, p. 19). But this pattern occurs on character level and not on word level.

15) The paper argues that the word type <aiin> "is usually preceded by a short one or two-letter word (e.g. ar aiin, or aiin, s aiin)". The paper suggests that "short words may represent articles or prepositions" (Bowern & Lindemann 2020, p. 19). No concrete numbers are given. I have counted 20 instances of 'ar aiin', 47 instances of 'or aiin' and 30 instances of 's aiin'. The pattern behind this observation is that two words tend to occur together if, joined, they also form a commonly used word (see Timm 2015, p. 23). A word <araiin> is indeed used 10 times, the word <olaiin> occurs 52 times and the word <saiin> occurs 144 times.

16) The paper argues "we would expect ok- to be feminine singular, ot- to be masculine singular, op- to be feminine plural, and of-  to be masculine plural." (Bowern & Lindemann 2020, p. 19). This directly contradicts the three-part structure presented earlier. Now the paper assumes a two-part structure as suggested by Tiltman. Moreover, earlier in the paper gallows are seen as part of the 'root/midfix', whereas now it is assumed that gallows are used as prefixes.

17) The paper argues "We do find certain constraints on what can follow gallows characters. For example, p/f are never followed by e" (Bowern & Lindemann 2020, p. 19). But words like <shefeeedy> on folio 48v or <qopeeedar> on folio 50r do exist; they are merely extremely rare (see [link]).

18) The paper argues "pages that are nearest neighbors in topic modeling tend to be adjacent to one another in the manuscript" and concludes the text is "inconsistent with a hoax" (Bowern & Lindemann 2020, p. 20). An explanation for this observation is given in Timm & Schinner 2020 on page 7: "Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'". The alternative explanation is not addressed.

19) The paper concludes "The higher structure of the manuscript itself is completely consistent with natural language" (Bowern & Lindemann 2020, p. 23). The paper only discusses some patterns on character level and that the text is structured in some way. Then the paper tries to explain the patterns found as evidence for the natural language hypothesis. But the paper does not discuss the patterns that are missing. Typical for some kind of morphological structure are relations between different word classes, a set of grammatical rules, sentence structures or the existence of common phrases. Such patterns are not addressed.
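
Counts like those in point 15 can be checked along the following lines. This is only a sketch under stated assumptions: the transliteration is assumed to be already split into a flat list of EVA word tokens (how line breaks and uncertain spaces are treated is left open), and the function name is made up for illustration.

```python
from collections import Counter


def pair_and_joined_counts(
    tokens: list[str], pairs: list[tuple[str, str]]
) -> dict[tuple[str, str], tuple[int, int]]:
    """For each (first, second) pair, return (adjacent-pair count, joined-word count)."""
    bigrams = Counter(zip(tokens, tokens[1:]))  # consecutive word pairs
    unigrams = Counter(tokens)                  # single word tokens
    return {(a, b): (bigrams[(a, b)], unigrams[a + b]) for a, b in pairs}
```

Called with pairs such as ("ar", "aiin") or ("s", "aiin"), it returns how often the two words stand next to each other and how often the joined form occurs as a single word, so the two figures can be compared for any transliteration.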

In the end the paper argues that since the text is structured in some way and since some kind of context dependency exists, the text is "far from random gibberish". This is basically the same argumentation as in [link]. But "simply because the VM has some structure, even one that resembles language in some ways, does not entail that it is likely to have a genuine linguistic structure" ([link]).