The Voynich Ninja
[Article] Lindemann and Bowern (2020) is available - Printable Version



RE: Lindemann and Bowern (2020) is available - Emma May Smith - 06-11-2020

The conclusion of the paper is that the Voynich script makes poor phonemic distinctions: the underlying language has more phonemes than the script can accurately represent.

An Example Using English

We can think of it in the following way: English has the phonemes /p b/, /t d/, and /k g/, three pairs distinguished by the presence or absence of voicing. A hypothetical new English script might have only three glyphs, one for each pair, without showing whether the voiced or unvoiced sound is meant. So 'p' might mean /p/ or /b/, 't' might mean /t/ or /d/, and 'k' could mean /k/ or /g/. Thus in English, "dog", "tog", "tock", and "dock" would all be written "tok". A reader would need to work out which word is meant from the context, such as "the tok chases the kat" or "the ship sails from the tok".

We kan assume that the reater of the Voynich text unterstants the unterlying language well enough and prings their knowledge of the language to the text in orter to work out the kontext, the possiple options for each wort, and kan choose which one is korrekt. (Any second language readers, please forgive me.)

Let's further imagine that in this new English script sibilants are distinguished by voice. So there are separate glyphs for /s/ and /z/, which are 's' and 'z', respectively. In English the plural suffix spelled 's' has a different sound depending on the preceding phoneme: a word ending with a voiced sound takes a /z/ plural, and a word ending with an unvoiced sound takes an /s/ plural (it's more complex than this, but we'll keep it simple for this example).

The plural of "tok" written in this new English script could be "toks" or "tokz". While "tok" is ambiguous, "toks" and "tokz" are less so, because the final 's' and 'z' give a clue to the voicing of the 'k'. Effectively there are two 'k's: one unvoiced taking the 's' plural and the other voiced taking the 'z' plural. Thus "toks" must be "docks" or "tocks" in normal English, while "tokz" must be "dogs" or "togs".
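To make the idea concrete, here is a toy sketch in Python (the word list, the spelling rules, and the handling of 'ck' are invented purely for this illustration, not taken from the paper):

Code:
# Toy model of the hypothetical English script described above:
# voiced/unvoiced pairs are merged, but /s/ and /z/ stay distinct.
MERGE = str.maketrans({"b": "p", "d": "t", "g": "k", "c": "k"})

LEXICON = ["dog", "tog", "tock", "dock",
           "dogs", "togs", "tocks", "docks"]

def to_script(word):
    """Render a normal-English spelling in the ambiguous script."""
    # English plural: /z/ after a voiced sound, /s/ after an unvoiced one.
    if word.endswith("s") and word[-2] in "bdg":
        word = word[:-1] + "z"
    word = word.replace("ck", "k")  # treat 'ck' as a single /k/
    return word.translate(MERGE)

def candidates(script_word):
    """All lexicon words the script word could stand for."""
    return [w for w in LEXICON if to_script(w) == script_word]

print(candidates("tok"))   # ['dog', 'tog', 'tock', 'dock'] - ambiguous
print(candidates("tokz"))  # ['dogs', 'togs']
print(candidates("toks"))  # ['tocks', 'docks']

The plural glyph cuts the candidate set in half: exactly the kind of recoverable distinction described above.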

An Example from the Voynich Manuscript

In the Voynich text we have many words beginning [ch]. They occur at the start of lines less often than expected. We also see words beginning [ych] and [dch], which occur at the start of lines more often than expected. We could suggest that words beginning [ch] sometimes add [y] or [d] when they occur at the start of lines. But how is it decided whether they take [d], [y], or nothing?

We could propose that there are three different values for [ch] (we'll call them [ch1] and [ch2] and [ch3]), and that they act differently at the start of lines. [ch1] occurs at the start of lines without an additional glyph before it, [ch2] takes [y] before it at the start of lines, and [ch3] takes [d]. The reason why they act differently is explained above in the English example: the script doesn't make a distinction between certain phonemes, but the writer knows the distinction and knows that they interact differently at the start of lines. So even though the distinction isn't in the script it is still in the text.
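As a rough sketch of how such a tally might be gathered (assuming a plain word-per-token transliteration with one manuscript line per file line; the parsing here is simplified and hypothetical):

Code:
from collections import Counter, defaultdict

def line_start_behaviour(lines):
    """For each word type beginning with [ch], count how its
    line-initial occurrences appear: bare, with [y], or with [d].
    `lines` is a list of lines, each a list of EVA word tokens."""
    behaviour = defaultdict(Counter)
    for line in lines:
        if not line:
            continue
        first = line[0]
        if first.startswith("ch"):
            behaviour[first][""] += 1            # bare [ch...] line-initially
        elif first.startswith(("ych", "dch")):
            behaviour[first[1:]][first[0]] += 1  # strip the added glyph
    return behaviour

A word type whose line-initial counts concentrate on a single prefix would behave like one consistent underlying value ([ch1], [ch2], or [ch3]).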

Please remember: the above is just an example of the idea, not a statement of what I think is true.

In any contexts where glyphs interact (such as line start, line end, word break combinations, and maybe others) we might be able to pick up the linguistic knowledge of the writer leaving hints about the phonemic distinction. The knowledge extracted from these hints, if it fits together coherently, could allow us to restore the distinction lost in the text.

I have ideas of how this could be done, but this comment is already quite long, so I'll leave it here for others to respond. I'm wary that we might want a different thread to discuss this, as we are drifting somewhat from the original topic.

Late addition: I wonder if we should call this "orthographic cheshirization"?


RE: Lindemann and Bowern (2020) is available - MarcoP - 09-11-2020

Many thanks to Claire and Luke for sharing their work! I agree with what others wrote: the paper presents a great quantity of significant information. The future papers that are mentioned will touch on other important subjects, some of which have received very little attention.
I was particularly intrigued by the figure at p. 30, where diplomatic and normalized texts are compared. As far as I know, almost nothing has been done in this area. My own very limited experiments with a copy of a text by Bonaventura ([link]) suggested that scribal inconsistency results in a large increase in MATTR values (the same word type can be rendered in several different ways).

I would be grateful if Claire could help me understand some details about the corpus.
I downloaded the GitHub corpora in order to have a look at MATTR for the historical texts. When looking at the files, I noticed that for the Codex Wormianus and the Necrologium the two files are labelled dip and fac (diplomatic and facsimile), with no "normalized" file.
In particular, Necrologium_dip, though based on a smaller character set than Necrologium_fac, uses more characters than the normal Latin alphabet; e.g. the long s is represented by a special character (see "monacus" in the attached image).
I understand that, at p. 30, the left Necrologium point corresponds to "dip" and the right point corresponds to "fac". Is that correct? Or is the figure based on a normalized file that was not included in the corpus?

[attached image]

Anyway, here are my results for MATTR with window size 50 vs 200. As always, it is possible I made errors along the way.
I added to the samples from the Yale corpus the normalized and diplomatic transcriptions of the Bonaventura text I analysed a while ago.
  • For the Latin Necrologium, Secretum and Bonaventura, the abbreviated/diplomatic version has much higher values than the "normalized" transcription
  • For the Latin Casebook and the Icelandic Codex the variation is much smaller
  • Hebrew with and without vowels also shows a small difference
I include Voynich EVA samples for Currier A and B.
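For anyone who wants to reproduce these numbers, here is a minimal MATTR sketch (window sizes 50 and 200; whitespace tokenization assumed, and the file name is only a placeholder):

Code:
from collections import Counter

def mattr(tokens, window):
    """Moving-Average Type-Token Ratio: the mean of
    (distinct types / window) over every sliding window
    of `window` consecutive tokens."""
    if len(tokens) < window:
        raise ValueError("text shorter than window")
    counts = Counter(tokens[:window])
    total = len(counts)   # running sum of per-window type counts
    n_windows = 1
    for i in range(window, len(tokens)):
        old, new = tokens[i - window], tokens[i]
        counts[new] += 1
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
        total += len(counts)
        n_windows += 1
    return total / (n_windows * window)

# tokens = open("Necrologium_dip.txt", encoding="utf-8").read().split()
# print(mattr(tokens, 50), mattr(tokens, 200))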


RE: Lindemann and Bowern (2020) is available - MarcoP - 26-12-2020

Apparently, there is a problem with some of the texts.

I first spotted this in the Italian text by Brunetto Latini. The text comes from here: [link]
This is the version in the corpus: [link]

The problem is that words are joined across lines.
lo quale è ritratto in vulgare

becomes:
lo quale èritratto in vulgare
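This looks like what happens when source lines are concatenated without a separator; a minimal sketch of the failure mode and the fix (the cause is my assumption, not a confirmed diagnosis):

Code:
lines = ["lo quale è", "ritratto in vulgare"]

# Concatenating lines directly fuses the last word of one line
# with the first word of the next:
print("".join(lines))   # lo quale èritratto in vulgare

# Joining with a space keeps the word boundary intact:
print(" ".join(line.strip() for line in lines))   # lo quale è ritratto in vulgare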

The problem also affects the Georgian text:
[link]
[link]

When dealing with character entropy (the subject of this paper) the impact will be minimal. But it would be great to have a reliable corpus that can also be used for higher-level statistics (e.g. word entropy).


RE: Lindemann and Bowern (2020) is available - MichelleL11 - 26-12-2020

Marco says
“Apparently, there is a problem with some of the texts.”

This is really too bad. Is there any way to quantify the issue so it could be factored in? Or does the impact on your results vary depending on the details?

I guess it’s good that the character entropy results aren’t too heavily hit.


RE: Lindemann and Bowern (2020) is available - MarcoP - 27-12-2020

(26-12-2020, 03:35 PM)MichelleL11 Wrote: Marco says
“Apparently, there is a problem with some of the texts.”

This is really too bad. Is there any way to quantify the issue so it could be factored in? Or does the impact on your results vary depending on the details?

I guess it’s good that the character entropy results aren’t too heavily hit.

Hi Michelle,
I am not sure it's such a big deal: I don't think this corpus has been much used yet and most files are likely correct. 

In my opinion, the best way forward is to check the whole corpus (in particular the Titus files) and upload a corrected version.


RE: Lindemann and Bowern (2020) is available - Torsten - 09-06-2021

Lindemann and Bowern have updated their paper about "Character Entropy" (see [link]):

Quote:For this update we developed an improved method for extracting language text from Wikipedia, removing metadata and wikicode, and we have rebuilt our corpus based on current wikipedia dumps.

Their results for the Wikipedia corpus are now more plausible. As expected, Hawaiian has the lowest h2 value (see p. 28), and there is less overlap between different script types (see Figure 11 on p. 20); instead, languages using the same type of script now tend to form clusters.

Some of the results differ dramatically; see for instance the results for languages like Wu and Zhuang. But the results have also changed for languages like English: the character set size for English is now 27 instead of 28, and the h2 value is 3.525 instead of 3.448. For some reason the general trend is that the h2 values for the Wikipedia corpus are now lower. As of today the corpus material has not been updated (see [link]), so it is not possible to check the new results.
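For reference, the h2 statistic discussed here is the second-order (conditional) character entropy H(next character | current character); a minimal sketch of how it can be estimated from bigram counts (plain-text input assumed, no smoothing):

Code:
import math
from collections import Counter

def h2(text):
    """Conditional character entropy H(c[i+1] | c[i]) in bits,
    estimated from raw bigram frequencies."""
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])          # counts of the conditioning char
    n = sum(bigrams.values())
    h = 0.0
    for (a, b), c in bigrams.items():
        p_ab = c / n                     # joint probability p(a, b)
        p_b_given_a = c / firsts[a]      # conditional probability p(b | a)
        h -= p_ab * math.log2(p_b_given_a)
    return h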

The values for the Voynich manuscript have also changed. The authors explain this with a "minor alteration to the Maximal Voynich transcription system".

There are still some mistakes. For instance, the h2 value for labels would be expected to be higher than for paragraphs, but the table on p. 38 shows that this is not the case for Hand 5. For Hand 5 Lindemann counts 2111 labels, whereas I only count 15 on folio f66r. The reason for this result is probably the way the authors interpret the interlinear transcription file. The file uses markers like 'P' for paragraphs and 'L' for labels. It seems as if the authors also interpreted markers like 'R' for right column as markers for labels.
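A quick way to check this hypothesis would be to tally locus types directly; here is a hypothetical sketch (the tag syntax in the regex is illustrative only and may not match the actual transcription file):

Code:
import re
from collections import Counter

# Hypothetical locus tag such as "<f66r.3,@Lf>": the letter after '@'
# is read as the locus type ('P' paragraph, 'L' label, 'R' right
# column, ...). The real file syntax may differ.
LOCUS = re.compile(r"<f\w+\.\d+,@(\w)")

def locus_type_counts(lines):
    counts = Counter()
    for line in lines:
        m = LOCUS.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# If only 'L' loci count as labels, Hand 5 yields a small number;
# treating every non-'P' locus (including 'R') as a label would
# inflate the count, which could explain the 2111 labels reported.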

They also added a short response to the review Andreas Schinner and I published in Cryptologia (see Timm & Schinner 2021, [link]). However, they only reiterate their conclusion: "Voynichese appears unnatural only below the word level. At the level of page and paragraph, Voynichese is comparable to natural language and structured text" (Lindemann & Bowern 2021).
It is easy to point to non-language-like features at the word level and above, and in our review we point to some of them (see Timm & Schinner 2021). The most obvious is the existence of Currier A and B and the continuous shift from Currier A to Currier B (see Timm & Schinner 2020, p. 6, [link]).
Moreover, our argument is precisely that the Voynich text is more structured than natural language: "the level of context dependency is on a higher level than expected for a linguistic system" (Timm 2016, p. 7, [link]).
The Voynich text is clearly structured. But the mere fact that the text has some structure does not mean that the structure is genuinely linguistic. There are, for instance, repetitive text fragments like "shol chol shoky okol sho chol chol chal shol chol chol shol" on folio 42r or "qokeedy qokeedy qokedy qokedy qokeedy" on folio 75r, but this does not mean that such artificial text fragments must represent some kind of linguistic structure. See for instance the decorative pseudo-scripts described in "Writing that isn't. Pseudo-scripts in comparative view" (Houston, S. 2018, pp. 21-48, [link] or [link]).