Entropy is just a single figure, and one has a better view if one looks at the entire distribution.
The single-character entropy condenses the difference between a picture like this:
and the case where all frequencies are equal, into a single number. More interestingly, the conditional character entropy condenses into a single number how the left and right squares of the following differ:
Clearly, many different distributions will lead to the same entropy value. The detailed distribution is like a simplified 'fingerprint' of the language.
To convert the Voynich MS text to real Latin, it is not sufficient to increase the entropy by the relevant amount. The entire detailed distribution has to become similar.
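To make this concrete, here is a minimal Python sketch (the function name and toy strings are illustrative, not from any actual script) showing how a whole frequency distribution collapses into one number, and how two different distributions can share the same value:

```python
# Minimal sketch (illustrative): a whole character frequency
# distribution collapses into the single number H1.
from collections import Counter
from math import log2

def h1(text):
    """Single-character entropy in bits."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Two different distributions can share the same entropy value:
print(h1("aaaaaabbbccd"))  # frequencies 6,3,2,1 -> ~1.73 bits
print(h1("ddddddcccbba"))  # same frequency profile, other symbols -> same ~1.73 bits
```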
With respect to the question of removing Eva-q:
a great number of words only differ in that they start with qo instead of o.
By removing the q, this distinction is lost and the words become indistinguishable.
This would be a reduction of the entropy.
(10-03-2019, 08:51 AM)ReneZ Wrote: With respect to the question of removing Eva-q:
a great number of words only differ in that they start with qo instead of o.
By removing the q, this distinction is lost and the words become indistinguishable.
This would be a reduction of the entropy.
Hi Rene,
I don't trust my entropy scripts too much, but if I try processing modified EVA as described by Emma I get an increase in conditional entropy:
Unmodified EVA (Takahashi):
Entropy1: 3.8605
Entropy2: 5.7293
Conditional: 1.8688
Initial qo- replaced with o-:
Entropy1: 3.7885
Entropy2: 5.7115
Conditional: 1.9231
Bigram entropy (Entropy2) is nearly identical, likely because q is almost always followed by o: the presence of q does not add much bigram information / disorder.
Character entropy (Entropy1) decreases, since replacing qo- makes q almost disappear: single characters are easier to guess, since you basically get a smaller alphabet.
The result is that conditional entropy (E2 - E1) is higher. There are fewer distinct words, but their structure is less predictable.
Does this make sense?
I only consider "inside-word" digraphs (digraphs including space are ignored).
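For reference, a minimal sketch of this kind of computation (function names and the toy line are illustrative, not the actual script): H1 over single characters, H2 over inside-word digraphs (pairs spanning a space are skipped), and the conditional value as H2 - H1.

```python
# Illustrative sketch of the computation described above.
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def h1_h2_cond(text):
    words = text.split()
    chars = Counter(c for w in words for c in w)
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    h1, h2 = entropy(chars), entropy(bigrams)
    return h1, h2, h2 - h1

eva = "qokeedy qokedy okedy okar"  # toy line, not the Takahashi file
modified = " ".join("o" + w[2:] if w.startswith("qo") else w
                    for w in eva.split())  # initial qo- -> o-
print(h1_h2_cond(eva))
print(h1_h2_cond(modified))
```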
(10-03-2019, 02:13 AM)DONJCH Wrote: Wait, surely that procedure described by Rene, by combining glyphs, would REDUCE entropy (degree of disorder)?
Or am I missing something?
As I understand it, that procedure would increase digraph entropy (and consequently conditional entropy) because it would make it harder to guess the next glyph from the current glyph.
E.g. consider:
po
chedy shy shokeedar al kedal
chok
chy
Here, whenever you see 'c' you can guess that the next glyph is 'h'. It is one of the ways in which this text is highly ordered.
If you combine 'ch' into 'C' you get:
po
Cedy shy shokeedar al kedal
Cok
Cy
Now it's difficult to guess what comes after C, since you have three different options (Ce, Co, Cy).
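A self-contained sketch of this replacement on the toy lines above (toy data only); on this input both the digraph entropy H2 and the difference H2 - H1 go up once 'h' is no longer a near-certain successor of 'c':

```python
# Sketch of the ch -> C replacement on the toy lines above.
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def h1_h2(text):
    words = text.split()
    h1 = entropy(Counter(c for w in words for c in w))
    h2 = entropy(Counter(w[i:i + 2] for w in words for i in range(len(w) - 1)))
    return h1, h2

toy = "po chedy shy shokeedar al kedal chok chy"
for label, t in [("EVA", toy), ("ch->C", toy.replace("ch", "C"))]:
    h1, h2 = h1_h2(t)
    print(f"{label}: H1={h1:.3f} H2={h2:.3f} H2-H1={h2 - h1:.3f}")
```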
Rene's CUVA, which differs from EVA only by similar simple replacements, results in a higher entropy than EVA (see the Conditional Entropy table, just after mid-page, on the linked page).
What happens if the VMS is not a verbal language? Then comparing its entropy with that of any language is wrong. All the analyses done so far are based on the assumption that the VMS is a verbal language. But what if it is another way of communication, a visual language?
Maybe a lot of the research is anachronistic. What was meaningful to a man of the 15th century can be meaningless to us.
Marco,
that looks OK to me. What my description referred to was the overall entropy, specifically the word entropy.
It is interesting to see that H1 and H2 are also reduced.
For the conditional character entropy (the difference between H1 and H2), the behaviour is essentially unpredictable, because it is already extremely low.
We already saw examples of that in the entropy thread of Anton.
If the purpose of this type of change would be to bring the text closer to Latin (etc), then the effect is actually the opposite.
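To separate the two quantities, here is a minimal sketch of the word entropy referred to above (the toy line is invented): merging qo- words into their o- twins collapses word types and lowers this number.

```python
# Illustrative sketch of word-level entropy.
from collections import Counter
from math import log2

def word_entropy(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

line = "qokeedy okeedy qokedy okedy okedy"    # toy line, not a real folio
print(word_entropy(line))                     # 4 word types -> ~1.92 bits
print(word_entropy(line.replace("qo", "o")))  # 2 word types -> ~0.97 bits
```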
W.r.t. Eva and Cuva, the purpose of the two is very different.
Eva is intended to represent electronically the text as completely and closely as possible, while using an alphabetical representation, and allowing rendition through a TT font.
As such, it is an improvement over the previously most common alphabet (Currier), because the latter has 36 symbols (including numbers), cannot represent many of the ligatures, and has no way to represent the so-called 'weirdoes'.
However, Eva is not suitable for character-based statistics, because things that look very much like single characters (ch, iin) are represented by multiple characters.
To do statistics, one should devise one's own alphabet based on one's own assumptions about what are single characters. One also has to decide what to do with ligatures and weirdoes.
(Note that this can be done by a script, using an Eva transcription as input. It is therefore not at all a big effort).
Cuva is just one example of that, one that I have been using. I don't think it is in any way 'optimal'.
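As an illustration of the kind of script meant here (the replacement table is purely illustrative and is NOT the actual Cuva definition):

```python
# Minimal sketch: remap an Eva transcription into one's own analysis
# alphabet before computing statistics.
REPLACEMENTS = [
    ("ch", "C"),   # treat the ch ligature as one character
    ("sh", "S"),
    ("iin", "M"),  # longer groups must come before their substrings,
    ("in", "N"),   # so iin is replaced before in
]

def remap(eva_text):
    for old, new in REPLACEMENTS:
        eva_text = eva_text.replace(old, new)
    return eva_text

print(remap("daiin shedy chol"))  # -> "daM Sedy Col"
```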
The statistical analysis of entropy, conditional character entropy, and bigram distribution (or character pair frequency distribution) "heat maps" is very interesting.
Perhaps relevant to this is the detailed global pan-linguistic study of consonant frequency in 50 languages from widely varying families and areas, which was performed and reported in the paper "On Consonant Frequency in Egyptian and Other Languages" by Carsten Peust in Lingua Aegyptia 16 (2008), pp. 105-134:
https://archiv.ub.uni-heidelberg.de/propylaeumdok/2676/1/Peust_On_consonant_frequency_in_Egyptian_2008.pdf
I was particularly struck by the statistics for Modern Greek (15th c. late medieval Byzantine Greek was much closer to Modern Greek than to Ancient Greek). It uses a small number of consonants very frequently, much more so than other European languages:
/s/ 18.2%
This Modern Greek /s/ is the highest single consonant frequency of all European languages in this study. Globally, it is only exceeded by Maori /t/, 26.3% (Maori is likely to have entropy and distribution stats very similar to Hawaiian, which is among the most similar to Voynichese), Bambara /n/, 20.2%, Tagalog /n/, 19.9% (Tagalog also comes up among the most similar entropies and distributions to Voynichese in Rene's studies as I understand them), Japanese /n/, 18.8%, and Maori /k/, 18.5%.
By contrast Latin's most frequent consonant /t/ is 15.7%, English /t/ is only 13.2%, and French /r/ is 13.7%.
Modern Greek is also the only language in the world in this study with /s/ as the most frequent consonant, except for Ancient Georgian (where /s/ reaches only 12.5%).
Greek /s/ is most frequent non-initially, 19.8%, but also occurs with 13.4% frequency in initial position. Large numbers of Greek nouns and adjectives have final /s/.
/t/ 15.3%
Modern Greek /t/ occurs with a striking 24.5% frequency in word-initial position, exceeded only by Maori /t/, 30.7%, and slightly by Modern Hebrew /h/, 24.9%. In all three cases, this is the first letter of the definite article, although Modern Greek also has /o/ and /i/ as definite articles.
Note that Modern Greek has two consonants with higher frequency than any consonant in English or French.
In fact, I found it quite revealing to examine all 50 languages in this global study, ranking them by the combined frequency percentage of their two most frequent consonants. The median frequency of the two most frequent consonants is 27.6%, the mean is 27.5%, and the sample standard deviation is 4.86% (computed with Bessel's correction, since the 50-language study is a sample). 40 of the 50 languages in the study fall within one standard deviation of the mean; only the top 5 and the bottom 5 do not. Modern Greek is the 4th of the top 5, along with Maori, Japanese, Bambara, and Tagalog. The bottom 5 are Ingush, Czech, Ossetic, Manchu, and Yoruba. (Maori at 44.8%, and Ingush and Czech at 15.9% and 16.9%, are the extreme outliers in this sample.) Thus Modern Greek in this respect patterns with Asian and particularly Austronesian languages rather than with other European and Indo-European languages, almost all of which have a more typical top-two consonant frequency.
/n/ 11.9%
Not frequent in initial position, very frequent in non-initial position. /s/ and /n/ are the only consonants that can occur word-finally in Modern Greek. (Word-final /r/ also occurred in Ancient Greek.)
/r/ 9.2%
Almost non-existent in initial position (1.0%), very frequent elsewhere.
/k/ 8.7%
/p/ 8.2%
/m/ 6.2%
/l/ 5.1%
/dh/ 3.6% (This is the Modern Greek d, delta, pronounced as a voiced /th/, as in English "the".)
/kh/ 2.6%
/ph/ 2.3%
/th/ 2.2%
I write the fricatives /x/ and /f/ as /kh/ and /ph/ to emphasize their relationship to the Greek stop series /p/, /t/, /k/.
/v/ 2.0%
/gh/ 1.6%
/d/ 1.3% (Written as "nt" in Modern Greek; voiced stops are rare and occur mainly in borrowings.)
/z/ 0.9%
/b/ 0.4% (Written as "mp")
/g/ 0.4% (Written as "gk")
Now as is well known, Greek has always been a language that is relatively heavy on vowels and light on consonants. The same is true of Polynesian languages such as Hawaiian and Maori, and to a lesser extent other Austronesian languages such as Tagalog. In terms of entropy and bigram distribution studies, etc., it might be of interest to also examine the old Baybayin abugida that was used to write Tagalog prior to the 16th century, which contained 3 vowels and 14 consonants.
Finally, one form of medieval Greek was actually written in an almost vowelless abjad: the Judaeo-Greek (also called Yevanic) language spoken and written by the Romaniote Greek Jewish and Constantinopolitan Karaite Greek Jewish communities, which was written in the Hebrew script, and marked most vowels with the standard Hebrew vowel diacritic dots rather than with letters. These communities had slowly declined in the modern era, and then the Nazi occupation of Greece in World War II virtually wiped them out, but in medieval times the Greek Jewish community thrived in many areas of the Mediterranean. There was even a substantial Judaeo-Greek speaking community in southern Italy, some of whom migrated to northern Italy and other areas in the medieval period. (I also wish to thank D.N. O'Donovan for bringing to my attention a late 15th century reference to a "Karaite script" of Hebrew claimed to be written without the use of the letters aleph, ayin, he, chet, bet, and tsadi. I guess this may also have referred to a form of Greek, since it is one of the few languages which might reasonably be written without these letters, although standard Judaeo-Greek does use them.)
I think it would also be quite interesting to do entropy and bigram distribution studies of this medieval Judaeo-Greek language written in the Hebrew script. In addition to the lack of most vowels in the script, it was also notable for being based on the colloquial vernacular of Byzantine Greek, without the "Atticisms" that Byzantine and Modern Greek have often employed in writing to make the language look more like Ancient Greek than it actually is. For all of these reasons, I would very much like to see the entropy and bigram distribution statistics of Judaeo-Greek in the Hebrew script, to see how they compare with those of Hawaiian, Tagalog, and Voynichese.
(10-03-2019, 08:51 AM)ReneZ Wrote: With respect to the question of removing Eva-q:
a great number of words only differ in that they start with qo instead of o.
By removing the q, this distinction is lost and the words become indistinguishable.
This would be a reduction of the entropy.
Surely only at the level of words? At the level of glyphs the entropy increases because [q] is so predictable.
I think these two different changes in entropy together would be interesting, as word order is considered too variable yet glyph order too structured. (I thought that was one of the main paradoxes of the text.)
(10-03-2019, 06:28 PM)Emma May Smith Wrote: Surely only at the level of words? At the level of glyphs the entropy increases because [q] is so predictable.
One has to be careful. Removing q removes information. But it also shortens the words. Thus, the average information per character may go up or down - it is not easy to predict. (Entropy is basically the average information per unit.)
q is not really predictable. What is predictable is what comes after it. So, removing q makes the question "what comes after a space?" a little bit more predictable = lower entropy.
What would really increase the entropy is to replace the pair qo by a new character. Then, all information is maintained, but the words are shortened, so the average information per unit goes up.
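A minimal sketch of that last point, using word-level information content as a crude stand-in for "information" (the toy line and function name are invented): the qo -> Q merge keeps the total bits unchanged but spreads them over fewer characters.

```python
# Sketch: replacing qo by one new symbol preserves word information
# while shortening the text, so information per character goes up.
from collections import Counter
from math import log2

def bits_per_char(text):
    words = text.split()
    counts = Counter(words)
    total = len(words)
    bits = sum(-log2(counts[w] / total) for w in words)  # total word information
    chars = sum(len(w) for w in words)                   # spaces not counted
    return bits / chars

line = "qokeedy qokedy okedy qokal daiin"      # toy line, not a real folio
print(bits_per_char(line))                     # same bits over 28 characters
print(bits_per_char(line.replace("qo", "Q")))  # same bits over 25 characters
```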
(10-03-2019, 06:28 PM)Emma May Smith Wrote: I think these two different changes in entropy together would be interesting, as word order is considered too variable yet glyph order too structured. (I thought that was one of the main paradoxes of the text.)
One can treat the word order and the glyph order inside words as two independent things. This can be shown by the following thought experiment.
Take a known, meaningful plain text. (Not too long.)
Sort all words according to decreasing frequency. This gives word list 1.
Now take a Voynich MS transcription and sort its words according to decreasing frequency. This gives word list 2.
Now make a dictionary which consists of pairing up word list 1 and word list 2 with each other.
Using this dictionary, one can "translate" the plain text to "Voynichese" on a word by word basis.
The resulting text has a "normal" word order and is meaningful. However, its glyph order follows the peculiar Voynich patterns.
(Of course, there are reasons to believe that this is not how the Voynich text was made. One of the main reasons are the peculiarities of line-initial words and line-initial characters, as you well know. But that's another story.)
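A sketch of the thought experiment (both toy texts are invented, and the pairing assumes the Voynich word list has at least as many word types as the plain text):

```python
# Pair words of a plain text and a Voynich transcription by frequency
# rank, then "translate" word by word: normal word order, meaningful
# text, but Voynich-shaped words.
from collections import Counter

def rank_dictionary(plain, voynich):
    rank1 = [w for w, _ in Counter(plain.split()).most_common()]
    rank2 = [w for w, _ in Counter(voynich.split()).most_common()]
    return dict(zip(rank1, rank2))  # pair word lists rank by rank

plain = "the cat saw the dog and the dog saw the cat"
voynich = "daiin okedy qokeedy daiin chol daiin shedy ol daiin chol okedy"
mapping = rank_dictionary(plain, voynich)
print(" ".join(mapping[w] for w in plain.split()))
```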