The Voynich Ninja
[Article] Lindemann and Bowern (2020) is available - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: News (https://www.voynich.ninja/forum-25.html)
+--- Thread: [Article] Lindemann and Bowern (2020) is available (/thread-3408.html)

Pages: 1 2


Lindemann and Bowern (2020) is available - cbowern - 29-10-2020

A new paper by Luke Lindemann and myself is now available [link], in which we pick apart some of the details of character entropy. Here is the abstract. The paper also includes links to some corpus materials which might be useful (freely available from [link]). This paper expands on some of the material from our other 2020 paper (an overview for the Annual Review of Linguistics). We have another couple in the works.

This paper outlines the creation of three corpora for multilingual comparison and analysis of the Voynich manuscript: a corpus of Voynich texts partitioned by Currier language, scribal hand, and transcription system, a corpus of 294 language samples compiled from Wikipedia, and a corpus of eighteen transcribed historical texts in eight languages. These corpora will be utilized in subsequent work by the Voynich Working Group at Yale University. We demonstrate the utility of these corpora for studying characteristics of the Voynich script and language, with an analysis of conditional character entropy in Voynichese. We discuss the interaction between character entropy and language, script size and type, glyph compositionality, scribal conventions and abbreviations, positional character variants, and bigram frequency. This analysis characterizes the interaction between script compositionality, character size, and predictability. We show that substantial manipulations of glyph composition are not sufficient to align conditional entropy levels with natural languages. The unusually predictable nature of the Voynichese script is not attributable to a particular script or transcription system, underlying language, or substitution cipher. Voynichese is distinct from every comparison text in our corpora because character placement is highly constrained within the word, and this may indicate the loss of phonemic distinctions from the underlying language.
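For readers new to the measures discussed here and throughout this thread, below is a minimal sketch (not the authors' code) of how unigram entropy (h1) and conditional character entropy (h2) can be computed from a plain text sample. The function names and the sample string are illustrative only.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """h1: Shannon entropy of single characters, in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conditional_entropy(text):
    """h2: entropy of a character given the character that precedes it."""
    bigrams = Counter(zip(text, text[1:]))
    total = sum(bigrams.values())
    first = Counter(a for a, _ in bigrams.elements())
    # H(Y|X) = -sum p(x,y) * log2( p(x,y) / p(x) )
    return -sum((n / total) * math.log2((n / total) / (first[a] / total))
                for (a, _), n in bigrams.items())

sample = "this is only an illustrative sample sentence"
print(f"h1 = {unigram_entropy(sample):.3f} bits")
print(f"h2 = {conditional_entropy(sample):.3f} bits")
```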


RE: Lindemann and Bowern (2020) is available - Ruby Novacna - 29-10-2020

Thanks for the comparison with historical texts, although I would also like to see the statistics of the Old Slavic texts.

P.S. Just a formal note: have the headings of two columns in the table on page 46 been reversed?


RE: Lindemann and Bowern (2020) is available - cbowern - 29-10-2020

If you can point us to digitized versions we'd be very happy to include other texts!


RE: Lindemann and Bowern (2020) is available - Ruby Novacna - 29-10-2020

A small example: the Gospel of Matthew in Greek and Old Church Slavonic [link]


RE: Lindemann and Bowern (2020) is available - nablator - 29-10-2020

(29-10-2020, 03:49 PM)Ruby Novacna Wrote: A small example: Gospel of Matthew

That's Mark.

Matthew: [link]


RE: Lindemann and Bowern (2020) is available - Koen G - 29-10-2020

Wonderful, I'm looking forward to subsequent papers. Some remarks:

1) This one is not aimed at the paper but rather at general practice:

"The Recipes section is distinguished by pages with paragraphs of text separated by assortments of labelled herbs, leaves, or roots."

--> this is one of the two reasons why I find the traditional section names problematic: the name "recipes section" is usually reserved for Q20 (stars), but sometimes people understandably apply it to the preceding section instead (traditionally known as "pharma section").

The other reason is that especially to newcomers, the traditional section names may appear as an actual assessment of their contents rather than just names.

2) The colored overview of section - hand - language on p.6 is very handy.

3) Why did you use character set size (h0) instead of h1? If I understand correctly, h1 is the more telling value since it also takes frequency into account. And isn't the exact h0 of a handwritten manuscript almost impossible to determine? Capitals, ligatures, abbreviations, positional variants... should all be counted, since we cannot eliminate these in Voynichese either.

4) Some time ago I mentioned my blog post on "improving" h2 by merging frequent n-grams. That post contained a mistake in the numbers, but I corrected and expanded upon it recently: [link] (a sketch of this kind of merge follows after this list).
Basically, I take this, as you phrase it in your paper, to the extreme: "If certain glyphs occur primarily in a particular sequence, this may be evidence that the sequence of glyphs represents a single character." It is interesting to see how the various VM sections react differently.

I cannot stress enough that my entropy posts are meant as experiments and not as proposed solutions or transcription systems. And I do agree with your statement that "one cannot simply assume that the low character entropy is due to our over-splitting of characters; even when they are grouped together, Voynichese is still unusual compared to other language samples."

Still, I think it must be noted that Voynichese's h2 can be lifted higher than is the case for any of the standard transliterations, without losing information and, importantly, while keeping h1 in check.

5) I am very happy that you explain why abjad solutions are a bad idea. :)
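To make points 3 and 4 above concrete, here is a rough sketch (my own illustration, not Koen's actual procedure nor the paper's code) of the kind of bigram-merging experiment described in point 4: repeatedly fuse the most frequent adjacent pair into a single composite token and watch how h0 (here taken as log2 of the token inventory size), h1 and h2 respond. The sample string is a placeholder; a real run would use an EVA transliteration.

```python
import math
from collections import Counter

def h1(tokens):
    """Unigram entropy in bits per token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def h2(tokens):
    """Conditional entropy of a token given the preceding token."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first = Counter(a for a, _ in bigrams.elements())
    return -sum(n / total * math.log2((n / total) / (first[a] / total))
                for (a, _), n in bigrams.items())

def merge_most_frequent_bigram(tokens):
    """Fuse the single most frequent adjacent pair into one composite token."""
    (a, b), _ = Counter(zip(tokens, tokens[1:])).most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # e.g. 'c' + 'h' -> composite token 'ch'
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Placeholder text; substitute an EVA transliteration for a real experiment.
tokens = list("daiin.okeedy.qokeedy.chol.chedy.daiin.okaiin")
for step in range(5):
    print(f"step {step}: h0={math.log2(len(set(tokens))):.2f} "
          f"h1={h1(tokens):.2f} h2={h2(tokens):.2f}")
    tokens = merge_most_frequent_bigram(tokens)
```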

Having read the entire paper, I can only say that I certainly agree with the conclusions, and it should be mandatory reading for anyone interested in Voynichese statistics.


RE: Lindemann and Bowern (2020) is available - RenegadeHealer - 30-10-2020

Koen, I thought of you and your entropy-increasing ngram substitution experiments when I read this paper. I can see you recommending people who don't quite understand your work first read Lindemann & Bowern 2020 for some background. In fact, I think L&B should be required reading for anyone who posts a VMs theory that assumes a natural language plaintext (written out *or* conventionally abbreviated), or a simple substitution cipher. L&B explains fairly convincingly why scribal abbreviations, colloquialisms, dialect/register variations, and novel writing systems cannot possibly account for how predictable and rigid Voynichese glyph placement is.

cbowern Wrote: this may indicate the loss of phonemic distinctions from the underlying language

These are the lines I'm thinking along these days: systematic but idiosyncratic lossy compression. I'm thinking of notebook jottings over an extended period of time, from one person to himself, composed of ungrammatical, barely legible words written line-by-line. Enough information to jog the writer's memory, but not enough for anyone else to unambiguously figure out what the writer meant. Dotless Arabic and cursive Chinese characters are good historical comparisons. A manuscript written in either would probably not be hard for the original writer to read aloud. But if this same manuscript were found years later, stripped of author or cultural context, it's very possible two different readers would read it two very different ways, with two mutually exclusive meanings.

Wildstyle graffiti makes for an interesting modern parallel, though it's never been used for long-form writing. Wildstyle is very legible to the writer, or to anyone who watches a time-lapse video of him writing it. But being able to look at someone else's wildstyle piece and figure out what it says takes a lot of practice. Legible to those in-the-know, illegible to those not.


RE: Lindemann and Bowern (2020) is available - Emma May Smith - 06-11-2020

Very clear argument which addresses (and potentially settles) a number of outstanding issues. The conclusion of poor phonemic distinction doesn't surprise me but is welcome when so well argued. The size of the core script hints at either a very tiny inventory or a lack of distinction. I think we've had several debates on this forum where Book Pahlavi has been mentioned as an example of what's actually plausible. Arabic rasm is another (and may be worth adding to future analyses, as it can easily be recreated using a modern text with a recoded transliteration).
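The rasm recoding mentioned above can be sketched roughly as follows (a deliberately partial simplification of my own, not anything from the paper): collapse dotted letters that share a skeleton into a single archigrapheme, then run the usual entropy measures on the recoded text. The mapping below ignores positional subtleties (e.g. the behaviour of nūn, yāʾ, fāʾ and qāf), so treat it as an assumption-laden toy.

```python
# Partial, simplified mapping of Arabic letters to shared rasm skeletons.
# Letters not listed are left unchanged; positional subtleties are ignored.
RASM_MAP = {
    "ت": "ب", "ث": "ب",   # bā' / tā' / thā' share a skeleton
    "ج": "ح", "خ": "ح",   # jīm / hā' / khā'
    "ذ": "د",              # dāl / dhāl
    "ز": "ر",              # rā' / zāy
    "ش": "س",              # sīn / shīn
    "ض": "ص",              # sād / dād
    "ظ": "ط",              # tā' / zā'
    "غ": "ع",              # 'ayn / ghayn
}

def to_rasm(text: str) -> str:
    """Recode a modern Arabic string into a crude dotless approximation."""
    return "".join(RASM_MAP.get(ch, ch) for ch in text)

print(to_rasm("الشمس والقمر"))  # sample phrase ("the sun and the moon"), recoded
```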

I suppose the question is whether markers of phonemic distinction survive. Was the reader expected to bring all the missing linguistic knowledge needed to successfully interpret the manuscript, or are there other guides still in the text? We know, for example, that words beginning [ch] often appear beginning [dch] or [ych] at the start of a line. Maybe this distinction represents a "tell" of which value [ch] has in each case. There may be other examples, but I feel it's outside the scope of this thread. I'm very intrigued to consider what remnants of distinction can be found by the interaction between glyphs.
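The line-initial [ch]/[dch]/[ych] pattern is the kind of thing that is easy to check mechanically. Below is a rough sketch of one way to count it; the file name, the one-manuscript-line-per-row layout, and the tracked EVA prefixes are assumptions, not a description of any particular transliteration file.

```python
from collections import Counter

PREFIXES = ("dch", "ych", "ch")  # longest first so 'dch'/'ych' win over 'ch'

def classify(word):
    for p in PREFIXES:
        if word.startswith(p):
            return p
    return None

line_initial, elsewhere = Counter(), Counter()

# Assumed layout: plain-text EVA transliteration, one manuscript line per row,
# words separated by '.' (adjust to whatever transcription file you use).
with open("voynich_eva.txt", encoding="utf-8") as f:
    for line in f:
        words = [w for w in line.strip().split(".") if w]
        for i, word in enumerate(words):
            p = classify(word)
            if p:
                (line_initial if i == 0 else elsewhere)[p] += 1

print("line-initial:", dict(line_initial))
print("elsewhere:   ", dict(elsewhere))
```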


RE: Lindemann and Bowern (2020) is available - RobGea - 06-11-2020

(06-11-2020, 01:33 AM)Emma May Smith Wrote: Maybe this distinction represents a "tell" of which value [ch] has in each case. There may be other examples, but I feel it's outside the scope of this thread. I'm very intrigued to consider what remnants of distinction can be found by the interaction between glyphs.

Hi Emma May Smith, I would be very interested to hear you expand a bit more on this topic.

Is this lack of phonemic distinction a bit like this (a random, hopefully pertinent, example from Wikipedia):

"The earliest [Czech] texts were written in primitive orthography, which used the letters of the Latin alphabet without any diacritics,
 resulting in ambiguities, such as in the letter c representing the k /k/, c /ts/ and č /tʃ/ phonemes."


RE: Lindemann and Bowern (2020) is available - RenegadeHealer - 06-11-2020

Emma, I started an experiment where I took a text of Psalm 23 [link], rendered it with Japanese katakana, and then used a simple algorithm to swap each katakana glyph with another in the standard 5x10 grid (gojūon) that Japanese speakers use to order their language's phonology. I wanted to see whether anyone could figure out that it was Psalm 23 after all of these steps. I was motivated to try this experiment after reading Mark Knowles' thread about making an unbreakable cipher. I didn't end up finishing it or posting it, but I have no doubt you guys here would have little trouble decoding it. I ended up concluding that it's very hard to make a code that no one without the key can decode. However, it's not hard to make a code that's relatively easy for people with specific background knowledge to decode, but dauntingly difficult for anyone who lacks that background knowledge and has no idea where to find someone who has it.
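One way such a kana swap could be implemented is sketched below (a reconstruction of the general idea, not the actual algorithm used): shift each katakana glyph to the same vowel column of the next consonant row in the gojūon grid. Only the complete unvoiced rows are handled here; voiced kana, the ヤ/ワ rows, and ン pass through unchanged.

```python
# Complete five-kana rows of the katakana gojūon grid (simplified subset).
ROWS = [
    "アイウエオ",  # a-row
    "カキクケコ",  # ka-row
    "サシスセソ",  # sa-row
    "タチツテト",  # ta-row
    "ナニヌネノ",  # na-row
    "ハヒフヘホ",  # ha-row
    "マミムメモ",  # ma-row
    "ラリルレロ",  # ra-row
]

# Map each kana to the kana in the same column of the next row (wrapping).
SWAP = {
    kana: ROWS[(r + 1) % len(ROWS)][c]
    for r, row in enumerate(ROWS)
    for c, kana in enumerate(row)
}

def encode(text: str) -> str:
    """Apply the row-shift substitution; unknown characters pass through."""
    return "".join(SWAP.get(ch, ch) for ch in text)

print(encode("カタカナ"))  # toy example: 'カタカナ' -> 'サナサハ' under this shift
```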

Emma May Smith Wrote: I'm very intrigued to consider what remnants of distinction can be found by the interaction between glyphs.

What immediately comes to mind here are the hints that the VMs text may have been written and retouched in multiple passes. Specifically JKP and Brian Cham's observations that many "plumes" or "flourishes" (the stroke that turns EVA e into s, or EVA i into r) appear to have been added later, in a different shade of ink. This makes me wonder if the creator of the VMs text found the encoding system too lossy even for him/herself to decode reliably, such that (s)he went back later and added in some extra hints.