The Voynich Ninja

Thanks for this - many things to comment on.

Quote:1) There is a reference to the paper of Daruka (2020) on page 3 but there is only a comment: "Daruka (2020) likewise comes to the conclusion that the Voynich Manuscript is a hoax and contains gibberish, though created by different means than those suggested by Rugg". By which different means Daruka comes to the same conclusions is not shown or discussed.

yep - word limits.

Quote:2) Daruka (2020) refers to arguments presented by Schinner (2007), Timm (2014), Timm (2016) and Timm & Schinner (2020). These arguments are not mentioned and the results presented there are not discussed. This way some important arguments for a structured pseudo-text hypothesis are not discussed.

I've added reference to your work. But please understand (like in the language theories section) there are many many many possible papers to discuss and naturally many members of this forum are likely to feel left out that we weren't able to give their work citation or extensive discussion.

Quote:4) The paper argues that "gibberish is by nature random" (Bowern & Lindemann 2020, p. 4). There is no reference given for this statement. Moreover, this statement is contradicted by an experiment that is described on p. 5.

I don't see a contradiction here. "Gibberish" is a cover term for many different text creation processes where there's no underlying syntactic structure.

Quote:5) The paper argues that the text behaves "non-language like at the character level" but "above the word to line and paragraph, as well as in the distribution of words across the manuscript, it looks like a natural language" (Bowern & Lindemann 2020, p. 4). These two observations contradict each other. The presented conclusion is therefore at least surprising: "This strongly implies that the manuscript is encoded natural language".

No, it implies that there is discourse and syntactic structure but the method of forming words makes the character-level statistics look non-language-like. Which is in itself a result. Furthermore, these sorts of discourse properties are difficult (if not impossible) to fake, whereas it's pretty easy to make character-level changes that affect bigram frequency statistics (nulls, and mergers, for example).

Quote:9) The paper argues that Currier A and B did use two different methods to encode natural language or did encode two different languages. Unfortunately the observation that common words used in Currier A like <daiin> also occur frequently in Currier B but not vice versa is not addressed in this context (see Timm & Schinner 2020, p. 7).

We don't actually argue that - in that we don't present much information about it. We assert it based on others' work.

Quote:10) The paper argues with "Moving average type token ratio" (MATTR) (Bowern & Lindemann 2020, p. 14) and even argues "Voynich most closely resembles the averages for Germanic and Iranian, and least resembles those for Turkic, Dravidian, and Kartvelian". But the paper does not say if the MATTR analysis was done for the whole manuscript or only for Currier A or B. Since the paper argues that Currier A and B did encode language differently it would be important to know what the MATTR results stand for and if Currier A and B behave differently.

I will check this but from memory we did look at this, and I don't think the results were very different.
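
For reference, MATTR averages the type-token ratio over every fixed-length window of consecutive words, so the result does not depend on overall text length. A minimal Python sketch, assuming the text is already split into word tokens; the function name `mattr` and the default window of 500 are illustrative choices, not values taken from the paper:

```python
from collections import Counter


def mattr(tokens, window=500):
    """Moving-average type-token ratio: mean TTR over every sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(tokens)
    if n == 0:
        return 0.0
    if n <= window:
        # Text shorter than one window: fall back to the plain TTR.
        return len(set(tokens)) / n

    counts = Counter(tokens[:window])   # word counts inside the current window
    total = len(counts) / window        # TTR of the first window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]         # keep len(counts) equal to the number of types
        counts[tokens[i]] += 1
        total += len(counts) / window
    return total / (n - window + 1)
```

Running the same function separately on the Currier A pages and the Currier B pages (however one chooses to split them) would show whether the two behave differently.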

Quote:11) The paper argues again "we find Voynich to be well within the expected values for natural language texts, and far from random gibberish" (Bowern & Lindemann 2020, p. 16). Unfortunately, nobody argues that the text represents random gibberish.

If one argues that the text was created with no underlying syntax structure (that is, simply word strings without meaning) then by definition that's random from a linguistic standpoint.

Quote:12) On page 17 the line and the paragraph are discussed as functional units. The paper suggests that the words are ordered or that the "same word will be written differently depending on where it appears in the line" (Bowern & Lindemann 2020, p. 17). There is no discussion if such patterns could be observed in natural languages as well (they don't behave this way).

Capitalization is a way in which words are written differently depending on where they appear in a functional unit. There are languages with stressed and unstressed clitics - same functional material with different realizations in different syntactic contexts.

Quote:16) The paper argues "we would expect ok- to be feminine singular, ot- to be masculine singular, op- to be feminine plural, and of- to be masculine plural." (Bowern & Lindemann 2020, p. 19). This directly contradicts the three part structure presented earlier. Now the paper assumes a two part structure as suggested by Tiltman. Moreover, earlier in the paper gallows are seen as part of the 'root/midfix' now it is assumed that Gallows are used as prefixes.

If you read the paper carefully, you'll see we do not assume this at all. We say that gallows are not articles and use this as an example of how one needs to take the consequences of linguistic arguments seriously. And when you do, things like assumptions that gallows are articles in some contexts (superficially an idea, since there are four of them, etc) just don't stand up.

Quote:18) The paper argues "pages that are nearest neighbors in topic modeling tend to be adjacent to one another in the manuscript" and concludes the text is "inconsistent with a hoax" (Bowern & Lindemann 2020, p. 20). An explanation for this observation is given in Timm & Schinner 2020 on page 7: "Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'". The alternative explanation is not addressed.

sorry we didn't cite you.

Gibberish is not necessarily random. There are many ways to construct gibberish in a rational orderly way and still have the result be unintelligible or meaningless.

On the other hand, something with poor syntax or low syntax is sometimes still intelligible. Some things have a very low level of syntax, like information in charts or in note form.

Maybe this section of the paper could be explained more specifically to make the meaning/context more clear to readers.

Thanks for the response.

(08-09-2020, 05:51 PM)cbowern Wrote: Thanks for this - many things to comment on.

I've added reference to your work. But please understand (like in the language theories section) there are many many many possible papers to discuss and naturally many members of this forum are likely to feel left out that we weren't able to give their work citation or extensive discussion.

If in your opinion your paper covers all relevant arguments for the pseudo-text hypothesis, there is indeed no need to change anything.

(08-09-2020, 05:51 PM)cbowern Wrote: I don't see a contradiction here. "Gibberish" is a cover term for many different text creation processes where there's no underlying syntactic structure.

My point was about the statement that gibberish must be "random". Another term for gibberish is "mumbo-jumbo" since if someone tries to generate something meaningless out of his head he will repeat himself. That manually generated gibberish is repetitive was also a result of the experiment the paper describes on p. 5.

(08-09-2020, 05:51 PM)cbowern Wrote: No, it implies that there is discourse and syntactic structure but the method of forming words makes the character-level statistics look non-language-like. Which is in itself a result. Furthermore, these sorts of discourse properties are difficult (if not impossible) to fake, whereas it's pretty easy to make character-level changes that affect bigram frequency statistics (nulls, and mergers, for example).

So the observation that the character-level statistics look non-language-like doesn't matter?

(08-09-2020, 05:51 PM)cbowern Wrote:

Quote:9) The paper argues that Currier A and B did use two different methods to encode natural language or did encode two different languages. Unfortunately the observation that common words used in Currier A like <daiin> also occur frequently in Currier B but not vice versa is not addressed in this context (see Timm & Schinner 2020, p. 7).

We don't actually argue that - in that we don't present much information about it. We assert it based on others' work.

My point was that a lot more about Currier A and B is known today.

(08-09-2020, 05:51 PM)cbowern Wrote: If one argues that the text was created with no underlying syntax structure (that is, simply word strings without meaning) then by definition that's random from a linguistic standpoint.

There is no doubt that there is an underlying syntax structure. The question is if there is also some semantic structure.

(08-09-2020, 05:51 PM)cbowern Wrote:
Quote:12) On page 17 the line and the paragraph are discussed as functional units. The paper suggests that the words are ordered or that the "same word will be written differently depending on where it appears in the line" (Bowern & Lindemann 2020, p. 17). There is no discussion if such patterns could be observed in natural languages as well (they don't behave this way).

Capitalization is a way in which words are written differently depending on where they appear in a functional unit. There are languages with stressed and unstressed clitics - same functional material with different realizations in different syntactic contexts.

Indeed in different syntactic and semantic contexts words can be written differently. But the paper comes to the conclusion: "All of these observations lead to generalizations which appear to be typographical rather than linguistic in nature." So the "character-level statistics look non-language-like" and the patterns described on line and paragraph level "appear to be typographical rather than linguistic in nature."

Since the paper comes to the conclusion that "the word and line level metrics show it to be regular natural language" I would be interested in reading an explanation for the described typographical features.

(08-09-2020, 05:51 PM)cbowern Wrote: If you read the paper carefully, you'll see we do not assume this at all. We say that gallows are not articles and use this as an example of how one needs to take the consequences of linguistic arguments seriously. And when you do, things like assumptions that gallows are articles in some contexts (superficially an idea, since there are four of them, etc) just don't stand up.

My wording was incorrect. What I meant was that in 'ok-', 'qok-' the gallows are part of the prefixes whereas on page 12 gallows are listed as part of the "Roots/Midfixes".

I simply don't understand what the paper wants to say with the three-field structure described on page 11f. Do you want to argue that such a structure exists or not?

Gibberish / random / mumbojumbo seem to be words which can be used to prove or disprove any given Voynich-related assertion, depending on how you (re)define them.

If you can't be bothered to define what you actually mean by such imprecise / multivalent words, then you really shouldn't be using them in the first place. :-(

Hi Claire, thank you for answering questions about your paper here. I am looking forward to reading your upcoming paper as well.

About character entropy, I wonder if you considered verbose ciphers? Quite a few current researchers consider this to be a possibility. The idea would be that a group of Voynichese characters, like [aiin], would represent one source text character. It is in a way the opposite of abbreviation and is in my opinion much more promising.

I demonstrated recently (with a lot of help from Marco) that treating Voynichese like a verbose cipher can raise character entropy values to the very edge of normality.

The major issue with a verbose cipher hypothesis is that it makes words very short. For example if [aiin] is one character in the source text, then [daiin] is now only two characters. This implies that the source text would be something like a syllabic language or another language with words split, in syllables or otherwise.
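
To make the idea concrete, here is a rough Python sketch of how one might compare conditional character entropy before and after treating a group such as [aiin] as a single symbol. The group list, the greedy longest-match rule, and all function names are assumptions for illustration; this is not the procedure used in the post mentioned above.

```python
import math
from collections import Counter


def merge_groups(word, groups=()):
    """Split a word into symbols; each listed group counts as ONE symbol.

    Uses greedy longest-match, which is an assumption of this sketch.
    """
    ordered = sorted(groups, key=len, reverse=True)
    symbols, i = [], 0
    while i < len(word):
        for g in ordered:
            if word.startswith(g, i):
                symbols.append(g)
                i += len(g)
                break
        else:
            symbols.append(word[i])  # no group matched: keep the single character
            i += 1
    return symbols


def to_symbols(words, groups=()):
    """Flatten a word list into one symbol stream, with ' ' marking word ends."""
    stream = []
    for w in words:
        stream.extend(merge_groups(w, groups))
        stream.append(" ")
    return stream


def conditional_entropy(symbols):
    """H(next symbol | current symbol) in bits, estimated from bigram counts."""
    pairs = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (a, _b), n_ab in pairs.items():
        h -= (n_ab / total) * math.log2(n_ab / firsts[a])
    return h
```

Calling `conditional_entropy(to_symbols(words))` and `conditional_entropy(to_symbols(words, ["aiin"]))` on the same transliteration gives the two figures to compare. Note that `merge_groups("daiin", ["aiin"])` returns only two symbols, which is the word-shortening effect described above.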

(08-09-2020, 09:36 PM)Koen G Wrote: The major issue with a verbose cipher hypothesis is that it makes words very short. For example if [aiin] is one character in the source text, then [daiin] is now only two characters. This implies that the source text would be something like a syllabic language or another language with words split, in syllables or otherwise.

Right, but a syllabic language would be inconsistent with the relatively high number of hapax legomena in the text.
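
A quick way to check this on any tokenized text is to count the words that occur exactly once. A minimal sketch in Python; the function name and the choice to report hapaxes as a share of distinct word types are assumptions, since the thread does not fix a definition:

```python
from collections import Counter


def hapax_stats(tokens):
    """Return (number of hapax legomena, share of distinct word types that are hapaxes)."""
    counts = Counter(tokens)
    if not counts:
        return 0, 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes, hapaxes / len(counts)
```

Applied to the same transliteration before and after any proposed splitting of words into syllable-like units, this shows how much of the hapax count survives the split.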

Gibberish refers specifically to language, while random is a statistical (=mathematical) term.

Random is commonly used to signify "arbitrary", "unpredictable" or "not based on any model".

Since the Voynich MS text was definitely created by a human being, using the term "random" is hard to justify.

Also, since gibberish is partly subjective (something can be gibberish to person A and not to person B), one could even argue that the Voynich MS is gibberish and we are all like person A. At least for the time being.

(Not a very helpful argument, though).

(08-09-2020, 09:31 PM)nickpelling Wrote: Gibberish / random / mumbojumbo seem to be words which can be used to prove or disprove any given Voynich-related assertion, depending on how you (re)define them.

If you can't be bothered to define what you actually mean by such imprecise / multivalent words, then you really shouldn't be using them in the first place. :-(

It's a perennial problem when using language to talk about language.

Hi Claire,
I have been looking into the details of what you wrote in this passage (p.19):

Bowern and Lindemann Wrote: Full reduplication, in which the entire word is repeated, is also common in Voynich. However, it is still within the realm of plausibility for natural language texts. In Voynich A each word has a 0.84% chance of repeating while in Voynich B that chance is 0.94%. The range among the samples in our language corpus is 0.02%-4.8%, with an average of 0.63%.

I have run reduplication counts on the corpora you published.

The numbers I get are reasonably comparable with yours, though not identical:
range: 0.0%-4.4%
average: 0.58%
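
For anyone who wants to repeat the count, here is a minimal Python sketch of one way to measure it, assuming "chance of repeating" means the share of adjacent word pairs in which both words are identical. The paper's exact definition (for example, whether pairs spanning line breaks count) is not stated in the quoted passage, so small differences in the numbers are to be expected:

```python
def reduplication_rate(tokens):
    """Share of adjacent word pairs in which the second word repeats the first."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)


def reduplication_percent(text):
    """Convenience wrapper: split on whitespace and report the rate as a percentage."""
    return 100.0 * reduplication_rate(text.split())
```

The function name and the whitespace tokenization are choices made for this sketch only.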

I have looked into some of the files that result in a reduplication rate higher than 1%. My impression is that they have been extracted from Wikipedia without considering the layout of the pages. For instance, here are two instances of reduplication from one of the files, each with what appears to be the corresponding Wikipedia page:

"entre los endividus lengoues lengoues"
[attachment=5153]

"la rua des âlpes a fribôrg fribôrg"
[attachment=5152]

Is this interpretation correct? Have you found any case of reduplication above 0.5% in any long European text that can be interpreted as linguistic, rather than an artifact of how the text was extracted from Wikipedia? If so, could you please share them? I think such cases likely exist and I would be interested in knowing where they originated.

Thank you once again for sharing your results and the corpora you have collected. I am very happy that there are linguists looking into the subject of Voynich reduplication! I am fascinated by it, but I lack the linguistic competence to tackle the problem except from a simple quantitative point of view.

I have checked some of the language samples collected from different Wikipedia websites. The unusually high number of word repetitions is probably caused by commonly used code snippets like "f f f border aaa solid font size" or "border style margin em em em". In one of the samples there is even a sequence using the word "bar" 209 times in succession.

I wonder if Bowern and Lindemann were aware of this problem. At least they wrote: "However, it is not always possible to completely remove all English metadata from language written in the Latin script. This may have an effect on languages with very little text and short entries."
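
Residue of this kind can be flagged automatically before computing any repetition statistics. A minimal Python sketch, assuming whitespace-tokenized input; the function name and the threshold of three identical words in a row are arbitrary choices for illustration:

```python
from itertools import groupby


def long_runs(tokens, min_len=3):
    """Return (word, run_length) for each run of identical consecutive words of at least min_len."""
    runs = []
    for word, group in groupby(tokens):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((word, length))
    return runs
```

On the snippet quoted above, `long_runs("border style margin em em em".split())` flags the triple "em". A run like that is far more likely to be leftover markup than a feature of the language, so files with many such runs are candidates for manual inspection.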

cbowern

-JKP-

Torsten

nickpelling

Koen G

Stephen Carlson

ReneZ

cbowern

MarcoP

lurker