The Voynich Ninja


On the term reduplication, it comes from a Late Latin compound re- (back, again) + duplicare (to bend; to double, be composed of two elements, bear two (paternal and maternal) arms; to duplicate, prepare in duplicate; to make a certain reply or rejoinder called a 'duply'; to double, i.e. lengthen, a feast; to line a garment, sail, or shelf with (another) lining). As you can see, the problem is that duplicare came to mean so many things that in order to signify the particular sense of 'duplicate', they disambiguated it by adding a seemingly redundant prefix.

(22-02-2020, 05:19 PM)davidjackson Wrote: I now get CORREL to 0.8535.
Calculating the associated p-value (because I can) we get the p-Value of .30621 (the result is significant at p < .05)

...

[attachment=4038]

Hi David,
thank you again for this discussion, which is extremely helpful to me! I don't know much about statistics and it's great to have a chance to look into new things.
I am not familiar with the concept of P-value, but I guess it is similar to [link] mentioned by Rene at the end of [link]. Visual inspection of the curve suggests that for 6 samples a 0.85 correlation should be significant, but I am not sure I am getting this right.
Also, if I enter r=0.8535 and N=6 in [link], I get a P-value of .030621, which is exactly 10 times smaller than what you found, and just smaller than the 0.05 threshold. So maybe this correlation measure is significant after all?
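
The calculator figure can also be checked by hand. What follows is a minimal sketch, assuming the usual null hypothesis of no correlation between two normally distributed variables, under which the density of r is proportional to (1 - r^2)^((N-4)/2). The function name `pearson_p_value` and the use of Simpson's rule are illustrative choices, not something taken from this thread or from the online calculator.

```python
def pearson_p_value(r, n, steps=2000):
    """Two-sided p-value for a Pearson correlation r from n samples.

    Assumes zero true correlation and bivariate normal data, so that the
    density of r is proportional to (1 - r^2)^((n - 4) / 2).
    """
    if n < 4:
        raise ValueError("this sketch needs at least 4 samples")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    power = (n - 4) / 2

    def density(x):
        # max() guards against tiny negative values from rounding near |x| = 1
        return max(0.0, 1.0 - x * x) ** power

    def simpson(a, b):
        h = (b - a) / steps
        total = density(a) + density(b)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * density(a + i * h)
        return total * h / 3

    upper_tail = simpson(abs(r), 1.0)   # area beyond |r| on one side
    whole = simpson(-1.0, 1.0)          # normalising constant
    return min(1.0, 2 * upper_tail / whole)


print(round(pearson_p_value(0.8535, 6), 6))  # 0.030621
```

For N=6 the density is simply proportional to 1 - r^2, so the integral can also be done on paper: p = 2 * (2/3 - r + r^3/3) / (4/3), which gives the same .030621.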

Thank you also for sharing a plot of %Red vs %Quasi. I attach a similar plot with sample labels added. As you say, evidence "suggests that reduplication and quasi reduplication are linked", but I am not sure we are on solid ground yet (your P-value argument looks like a great way to look at these results).
The plot points out two problems:

the obvious one that, working at such a coarse granularity as whole sections, we only have 6 samples - this implies that only a very high correlation is significant and a single outlier can make the results unreliable;
the diagram-intensive AstroCosmoZodiac pages behave differently from the other more paragraph-oriented pages.

I think I will try a folio by folio analysis, using the normalization system described by Nablator [link], so that we can look at things at a more detailed level.

I think the folio by folio analysis is the way forwards on this topic, good suggestion.

I got the same result as you, just left off the initial zero... Sorry, I've got a terrible strep throat and am on painkillers and antibiotics, so I'm still a bit fuzzy. I'll be stepping out of this topic until later in the week when my head's clearer.

Please do keep on posting your ideas here Marco. Although I'm a bit doubtful we can bring anything out of the murk with pure stats, it is fascinating.

It might be useful to prepare a sheet with page-by-page data anyway, in anticipation of more details about Lisa's scribal hands analysis.

Before tackling page (and/or folio) stats, I experimented with the scrambling idea suggested by Koen.
I attach a graph based on the original/scrambled ratio for reduplication (X axis) and quasi-reduplication (Y) counts. For comparison, I added two ordinary texts: King James Genesis (English) and Mattioli (Latin). I also include Timm and Schinner generated text (TT-AS).
For the language texts, the original has a lower rate of reduplication and quasi-reduplication than the scrambled version (orig/scrambled ratio < 1). In the original texts, reduplication and quasi-reduplication are close to 0%.
For the VMS samples (and TT-AS) the ratios are greater than 1: more reduplication and quasi-reduplication in the original than in the scrambled versions.
The BioQ13 section has high reduplication and quasi-reduplication percentages, yet the scrambled version is particularly close to the original (both ratios between 1.0 and 1.5). In the case of perfect-reduplication, this can be explained by the low MATTR, but the result for quasi-reduplication is more difficult to explain: I guess it means that many of the words are extremely similar to each other. Overall, the VMS sections display a great variety in the values for these two ratios, but they are consistently >1. I guess this means that there is some principle in how Voynichese selects consecutive words that causes reduplication and quasi-reduplication: they are not explained by the nature of the vocabulary alone.
TT-AS behaves quite similarly to Voynichese (Herbal_A in particular): since we know how this text is generated, it could be interesting to see exactly how it manages to replicate this feature.
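
The original/scrambled comparison described above can be sketched as follows. This is only an illustration under stated assumptions: quasi-reduplication is taken to mean two consecutive words that differ by exactly one inserted, deleted or changed character, the default of 20 shuffles is an arbitrary choice, and all function names are hypothetical.

```python
import random


def is_quasi(a, b):
    """True when a and b differ by exactly one character edit (assumed definition)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # One deletion from the longer word must give the shorter one
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def pair_rates(words):
    """Return (reduplication rate, quasi-reduplication rate) over consecutive pairs."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0, 0.0
    redup = sum(a == b for a, b in pairs)
    quasi = sum(is_quasi(a, b) for a, b in pairs)
    return redup / len(pairs), quasi / len(pairs)


def scrambled_rates(words, n_shuffles=20, seed=0):
    """Average the two rates over several random shuffles of the same words."""
    rng = random.Random(seed)
    shuffled = list(words)
    redup_sum = quasi_sum = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        redup, quasi = pair_rates(shuffled)
        redup_sum += redup
        quasi_sum += quasi
    return redup_sum / n_shuffles, quasi_sum / n_shuffles


def ratio(original, scrambled):
    """Original/scrambled ratio, or None when the scrambled rate is zero."""
    return original / scrambled if scrambled else None


# Tiny illustrative sample, far too short for a meaningful measurement
sample = "qokeedy qokeedy qokedy qokedy qokeedy shedy qokain chol daiin".split()
orig_redup, orig_quasi = pair_rates(sample)
scr_redup, scr_quasi = scrambled_rates(sample)
print(ratio(orig_redup, scr_redup), ratio(orig_quasi, scr_quasi))
```

A ratio above 1 means the pattern is more frequent in the real word order than chance ordering of the same vocabulary would give.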

Thank you Marco, it's good to have this out of the way. So there is some system that more or less doubles the amount of reduplication compared to a shuffled text.

As you say, there is the matter that Voynichese words are very similar to each other. And this in turn may be inherent to the system that governs reduplication. Hmm...

Has it ever been studied which words exactly reduplicate? Is there a "class" of words that does this more often? Do the same kinds of words quasi-duplicate? Are some words exempt from duplication? (Getting close to LAAFU effects here).

Hi Koen,
there is so much to do in this area!

You can find some data about perfectly-reduplicating words here:
[link]
Possibly, the most interesting thing is that aiin never reduplicates.
It seems that frequently reduplicating words are quite different in A (short bench-words) and B (qo-words).

Here I posted some details about quasi-reduplication, but I focussed on the difference between the two words, rather than the words themselves:
[link]
What I found is that q- words and (to a lesser extent) o-words seem to be particularly active in the phenomenon.
But checking if words that reduplicate more than expected (i.e. more frequently than in the scrambled text) also quasi-reduplicate with a similar rate seems like another interesting idea. Thank you!

I am also thinking of checking these patterns:
X X X'
X' X X
i.e. perfect-reduplication followed by quasi-reduplication and vice-versa. I suspect this could be something I did in the past, but I cannot find it at the moment.
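
Counting the two triple patterns is straightforward. The sketch below assumes, as a working definition, that X' is any word exactly one character edit away from X; the function names are hypothetical and the sample line is only a short run of Voynichese words used for illustration.

```python
def is_quasi(a, b):
    """True when a and b differ by exactly one character edit (assumed definition)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def count_triples(words):
    """Return the counts of (X X X', X' X X) patterns in a list of words."""
    xxq = qxx = 0
    for a, b, c in zip(words, words[1:], words[2:]):
        if a == b and is_quasi(b, c):
            xxq += 1  # perfect reduplication followed by quasi-reduplication
        if is_quasi(a, b) and b == c:
            qxx += 1  # quasi-reduplication followed by perfect reduplication
    return xxq, qxx


line = "qokeedy.qokeedy.qokedy.qokedy.qokeedy".split(".")
print(count_triples(line))  # (2, 1)
```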

The word tokens in the VMs are not randomly distributed. There "exists an inherent relation between word similarity and context: when we look at three most frequent words on each page, for more than half of the pages two of three will differ in only one detail" ([link], p. 3). "All pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear" (Timm & Schinner 2019, p. 3).

This increases the chance that frequently used word tokens on a page occur duplicated or next to similar ones. There are even sequences with four or more similar word tokens in succession:
<f1r.P3.15> kol.chol.chol.kor.chal
<f15v.P.5> otchor.chor.chor.ytchor
<f27v.P.1> fochof.chof.cho.sho
<f42r.P3.20-21> sho.chol.chol.chal-shol.chol.chol.shol
<f75r.P.38> qokeedy.qokeedy.qokedy.qokedy.qokeedy

[The three most frequently used word tokens on folio [link] are <daiin> (7), <chol> (7), and <dain> (6). The most frequently used word tokens on folio f15r are <chor> (5), <chol> (4), and <daiin> (4). On folio [link] <sho> (3), <dchy> (3), and <shy> (2) are the most frequently used word tokens. The most frequently used word tokens on folio [link] are <shol> (11), <chol> (8), and <daiin> (5). The most frequently used word tokens on folio [link] are <qokain> (22), <shedy> (16), <qokeedy> (14) and <qokedy> (14).]

These are the results of a page-by-page analysis. As always, I might have made errors in the process.
I normalized page measures according to the process [link]: the measure is the difference with the expected number of occurrences based on the overall average measures and the number of word-couples in each page. A page that has exactly the expected number of Perfect-Reduplication and Quasi-Reduplication will be plotted at 0,0.
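
As a rough sketch of this normalization (my reading of the one-sentence description above, not necessarily the exact procedure that was used): the expected count for a page is the overall rate multiplied by the number of word couples on that page, and the plotted value is observed minus expected. The function name and the input layout are hypothetical.

```python
def normalised_counts(pages):
    """pages: list of (number_of_word_couples, observed_count), one entry per page.

    Returns observed minus expected for every page, where the expected count
    is the overall rate times the number of word couples on that page.
    """
    total_couples = sum(couples for couples, _ in pages)
    total_observed = sum(observed for _, observed in pages)
    overall_rate = total_observed / total_couples
    return [observed - overall_rate * couples for couples, observed in pages]


# Made-up numbers: two pages with 100 and 200 word couples, 3 occurrences each
print(normalised_counts([(100, 3), (200, 3)]))
```

The same function would be applied once to the perfect-reduplication counts and once to the quasi-reduplication counts, giving the two coordinates of each page. By construction the values sum to zero over all pages.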

I have also processed individual sections. I used the Image classification in the ZL transliteration file: a few text-only pages are not classified and are not coloured in the plot (there are only a handful of these, so they do not significantly affect the results).

This table summarizes the results and lists the colours used in the attached plot:

__ALL__ pgs:207 corr: 0.338 p-val:0.0000006

------------------
HerbalA pgs:95 corr:-0.039 p-val:0.7046817 Blue
Pharma_ pgs:16 corr: 0.252 p-val:0.3464567 Orange
AstroCZ pgs:15 corr:-0.225 p-val:0.4199572 Purple
HerbalB pgs:32 corr: 0.311 p-val:0.0833608 Red
StarQ20 pgs:23 corr: 0.673 p-val:0.0004313 Yellow
Bio_Q13 pgs:20 corr:-0.137 p-val:0.5657872 Green

As can be seen, the overall correlation is not very high (0.34), but the number of samples is large enough to cause a quite low p-value. Here I have included the 19 pages that have a 0 count for both measures. If I remove them, I get correlation: 0.321, p-value:0.000007.
Since correlation is regarded as significant when p-value<0.05, these results confirm that the two phenomena are linked in some unknown way.

Some sections have correlations close to zero (HerbalA, AstroCosmo, Q13). All have high p-values, with the important exception of the Stars-Q20 section, with a 0.0004 p-value. While (Q20 aside) pages inside a section do not show much correlation between Perfect and Quasi reduplication, the average values for the individual sections are different and result in a significant correlation (see [link]). The overall result then appears to be due to two factors:

different behaviour of the individual sections; this is similar to what Rene observed when discussing the [link];
page-by-page correlation in Q20. This also shows in the attached plot, where the yellow squares spread diagonally from bottom-left to top-right.
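
The first factor can be illustrated with a toy example. The data below are synthetic and have nothing to do with the manuscript measurements: two "sections" that each show zero internal correlation, but with different averages, produce a high correlation once they are pooled. `pearson_r` is a plain textbook implementation written for this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_of(points):
    xs, ys = zip(*points)
    return pearson_r(xs, ys)


# Synthetic toy data, NOT the manuscript measurements
section_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
section_b = [(5, 5), (5, 6), (6, 5), (6, 6)]

print(r_of(section_a))                        # 0.0
print(r_of(section_b))                        # 0.0
print(round(r_of(section_a + section_b), 3))  # 0.962
```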

I have generated reduplication/quasi-reduplication plots including all the language samples in the text corpora by Koen (621 texts) and [link] (54 samples).
Thanks to Koen and Brian who shared their corpora, and to Jonas Alin who pointed out Cham's corpus [link].

This is the overall plot:
[attachment=4259]

The plot is made unreadable by the very high reduplication rate of a single outlier N-PML (all samples labelled with the N- prefix are from Cham's corpus). The text is described as "Sabir, a.k.a. Mediterranean Lingua Franca - Extracts from The Bourgeois Gentleman by Molière (1670) 240 words". It apparently is an attempt to reproduce spoken language.
In this text, almost 12% of all consecutive words are identical. On the other hand, quasi-reduplication never occurs.
A fragment from this text (reduplication highlighted):

Se ti sabir,
Ti respondir;
Se non sabir,
Tazir, tazir.
Mi star Mufti:
Ti qui star ti?
Non intendir:
Tazir, tazir.

Mahametta per Giordina
Mi pregar sera é mattina:
Voler far un Paladina
Dé Giourdina, Dé Giourdina.
Dar turbanta, é dar scarcina,
Con galera é brigantina,
Per deffender Palestina.
Mahametta, etc.

Star bon Turca Giourdina?
Hi valla.
Hu la ba ba la chou ba la ba ba la da.

Ti non star furba?
No, no, no.
Non star furfanta?
No, no, no.

If one removes N-PML from the data-set, the plot looks like this:
[attachment=4258]

Blue samples correspond to the VMS
The orange square is the text generated by Timm and Schinner's software.
Purple samples are from Cham's corpus, yellow samples from Koen's.
The large green circles are samples I have chosen for further discussion.

The red line is X=Y: it makes it easy to see that typically reduplication is more frequent than quasi-reduplication. Though nothing comes close to the VMS, these are a couple of texts that exhibit both phenomena, with more quasi-reduplication than reduplication:

N-EMY is a Mayan (Ch’olti’) dictionary: Bocabulario Grande by Fray Francisco Morán (1695).
From the text:

agua menuda puz puz ; palpal ha
azul - color yax yax
apuntalar tontei ; nostenahib, el tribo
a donde, por donde - tuba
alisar yulyul, yuhlin
agua clara, berde, azul - yaxha
ageno, agena yantal
abergonsar tzubalez. tzublez

It seems that most occurrences of reduplication are Mayan expressions. For instance, in Ch’olti’ "blue" is "yax yax". Quasi reduplication occurs both in Ch’olti’ and Spanish, where variants for the same words are given (for instance, I guess that the Spanish "ageno agena" are the masculine and feminine for "stranger").

Slav_.m is a text collected by Koen, originally named "Slav_NovumTestamentum.txt". According to Google Translate, the text is Bulgarian.
Perfect reduplication only occurs twice:
да.да (yes yes)
нет.нет (no no)

All occurrences of quasi reduplication are concentrated in the first few lines, that describe the genealogy of Christ:

фарес родил есрома есром родил арама арам родил аминадава аминадав родил наассона
fares rodil esroma esrom rodil arama aram rodil aminadava aminadav rodil naassona
Perez the father of Hezron, Hezron the father of Ram, Ram the father of Amminadab, Amminadab the father of Nahshon

These are consecutive occurrences of the names of people in two different cases: in all couples, the two words differ only in the final character. Since the file is rather short (less than 3000 words), these occurrences of quasi reduplication have a considerable weight.

I have also investigated sequences of three words in one of these two forms: X X X' (perfect reduplication, followed by quasi-reduplication) and X' X X (quasi-reduplication followed by perfect reduplication). I have compared the rate of the occurrences in the original file with the rate in a random scrambling of the file (averaging over 20 different scrambles for each file).
Here the plot immediately makes clear that the whole VMS is quite different from anything else. Again, the red line is X=Y: the position of the VMS samples means that the two triple patterns are much more frequent in the actual manuscript than in the scrambled versions: this was somewhat expected, since [link] that both reduplication and quasi-reduplication appear to be more frequent than in scrambled data.

[attachment=4257]

Anyway, these triple patterns are not easily produced by random order: their extensive presence in the VMS might confirm that reduplication and quasi-reduplication are related and tend to appear consecutively.
The vast majority of the other data fall at 0,0: i.e. triple patterns never appear in the original file and its scrambled versions.
In order to have a look at the detail of the other files, I plotted the data in logarithmic scale. To move samples away from the origin, I simply added a very small quantity (0.0001) to both measures (I understand that this is a clumsy solution). Please remember that logarithmic scale reduces the apparent distance between far away samples: the VMS is not as close to the rest as it appears to be here.

[attachment=4256]

Erasm.h might be the file that comes closest to the VMS (though with a frequency one order of magnitude lower). The complete file name in Koen's corpus is ErasmusProverbsLatinEnglish.txt. The sample contains 13417 words, about 1/3 of the VMS, but it includes only a single "triple pattern" (X' X X), while 37 appear in the whole VMS.

nothyng stycketh more fastly than that that is receyued and taken of pure youth not yet infected wyth peruerse and croked maners or opinions

This single case looks entirely coincidental. Also, it occurs in a text that does not have a high frequency of reduplication and quasi-reduplication (see the other plots above).

Only two files, both in Cham's corpus, contain 2 occurrences of triple patterns:

N-HAW - An 1861 Hawaiian text.

aia la ke kau nei hoi
ka hoku pakipika
iluna o ke aouli
ka lani kiekie
lehulehu
lehulehu
lelulehu no lakou

o keia pepa aole ia na kekahi haole aole aole hoi na ka mea hookahi aka na na kanaka

I have not tried to understand what these patterns mean. From the graph above, one can see that reduplication is quite frequent in Hawaiian. I guess it may be a feature of the language (as for the Mayan language discussed above).

N-FIN - the Finnish Kalevala

This is the collection of poetry that Jonas Alin pointed out in the thread I linked above. The poems (called "Runes") were transcribed by Elias Lönnrot in the first half of the XIX Century; he travelled through the country in search of the last people to be familiar with an oral tradition that likely dated many centuries back. The content of the poems is mostly pagan, with only limited references to Christianity.
The relatively high number of reduplication and quasi-reduplication is one of the consequences of the strongly alliterative style of the poems.
I had a few exchanges on [link] with a native speaker with a degree in Finnish language (u/Vilmiira) who helped me understand something of the two occurrences of triple patterns in Kalevala.

(Rune 11, 180) Enkä huoli huitukoille, huitukoille, haitukoille;

This means something like "I have no interest in carefree people, carefree people, carefree people"

Vilmiira Wrote:Huitukoille = huitukka + (o)i + lle, where the (o)i indicates plural form, and lle is the ending meaning to or for someone.

huitukka has a connotation of a young, carefree girl [actually a person] who runs around and could also indicate somewhat loose morals, but it's not a really bad word. Haitukka is a version of this, it does not have a separate meaning but rather a poetic version, that doesn't really exist otherwise.

(Rune 42, 219) Itse seppo Ilmarinen, toinen lieto Lemminkäinen, nepä tuossa soutelevat, soutelevat, joutelevat selviä selän vesiä, lake'ita lainehia.

The blacksmith, Ilmarinen, with the flighty Lemminkainen, they are rowing, rowing, gliding over the clear waters of the sea, over the waste of waves.

Here I render joutelevat as "gliding" (on the basis of the English translation by Crawford)

Vilmiira Wrote:Soutelevat = soutaa + ele + vat, where soutaa is the root verb "to row", the ele-suffix creates a new frequentative version of the verb (to row continuously or in more relaxed manner), and vat-suffix makes it plural.
...
Here again, the word joutelevat cannot really be translated, at least with my knowledge. It seems to be a similar thing to the previous example, where the word is purposefully taken to look almost the same.

From what Vilmiira wrote, I understand that the Finnish language had and still has the possibility of slightly altering a word for expressive reasons. This alteration may introduce a slightly different meaning, but sometimes the meaning appears to be unaltered.

In the case of the Kalevala, there is little doubt that reduplication and quasi-reduplication are due to a single cause: alliteration. Another alliterating feature in this text that could make it comparable with Voynichese is a preference to have consecutive words with the same initial sounds. This is an example from the Academia paper [link] - Frog and Eila Stepanova.

[attachment=4255]

But coming back to reduplication, quasi-reduplication and the triple patterns, one should note that the Kalevala is not anywhere near the frequencies in the VMS. In Voynichese, reduplication and quasi-reduplication appear in an amazing 3% of consecutive word couples. In the Kalevala, they are one order of magnitude rarer (0.3% total).
Similarly, the two triple patterns occur once every 1000 Voynich words (37 occurrences in 36293 words), while in the Kalevala they are about 35 times rarer (2 occurrences in 68156 words).
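
The arithmetic behind this comparison, using only the counts quoted above (the variable names are just for this example):

```python
vms_triples, vms_words = 37, 36293
kalevala_triples, kalevala_words = 2, 68156

# Occurrences of the two triple patterns per 1000 words
vms_per_1000 = 1000 * vms_triples / vms_words
kalevala_per_1000 = 1000 * kalevala_triples / kalevala_words

print(round(vms_per_1000, 2))                      # 1.02
print(round(kalevala_per_1000, 3))                 # 0.029
print(round(vms_per_1000 / kalevala_per_1000, 1))  # 34.7
```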

If the VMS is meaningful and if similar Voynichese words correspond to similar words in some language, the text appears to make a uniquely intensive use of alliteration. I am sure that alliterative compositions were traditional in many other languages in addition to Finnish and related languages, but I don't expect that finding actual examples will be easy.


Stephen Carlson

MarcoP

davidjackson

Koen G

MarcoP

Koen G

MarcoP

Torsten

MarcoP

MarcoP