The Voynich Ninja

Full Version: Experiments with language corpora
(29-05-2018, 11:21 AM)MarcoP Wrote:
(28-05-2018, 09:57 PM)DONJCH Wrote: Can a confidence interval be calculated for any of these figures?

I am an expert of statistics, but I guess the answer can only be "yes" - we have a set of numbers, so they can be fed to any numerical method. The meaningfulness of the output is another matter.

Of course, it also depends on the measurement we want to evaluate. Qualitatively, from the graphs above, it seems that we can be relatively confident about Voynichese H1, while conditional entropy shows strong variations depending on both the subset of the text (Currier A / B) and the transcription system.

Did you mean to say you are NOT an expert in statistics rather than that you are?
The rest of your sentence suggests that you meant to say the former...

Not that it matters... but, if you are, please excuse the lecture!

Each of the points on your graphs will have an error bar around it, whether that is calculated or not.
This represents the (usually) 95% confidence limits for the measurement.
The error decreases with the size of the dataset.

So some of those conditional entropy values in your graph may not be (statistically) significantly different from each other, and this information would be valuable to know. The 2.02 value for EVA may be essentially the same as that for the next language, or the uncertainty may be even wider; we just do not know.

A good software package would do all this for you.
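
For illustration, a bootstrap is one generic way to attach such error bars to an entropy value: resample the text, recompute the statistic many times, and read the 95% interval off the sorted replicates. A minimal Python sketch, assuming resampling in word-sized chunks (all names and sample words here are my own invention, not from any particular package):

Code:
import random
from collections import Counter
from math import log2

def h1(text):
    """First-order entropy in bits per character (spaces included, as a toy)."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def bootstrap_ci(words, stat, n_boot=1000, alpha=0.05):
    """Resample the words with replacement, recompute the statistic,
    and take the empirical 2.5% and 97.5% points of the replicates."""
    reps = sorted(stat(" ".join(random.choices(words, k=len(words))))
                  for _ in range(n_boot))
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2))]

words = "daiin ol chedy qokeedy daiin shedy okaiin".split()  # invented sample
print(bootstrap_ci(words, h1))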

This is not meant as criticism, your work is awesome!
Hello DONJCH,

what you write is not wrong, but it is not quite as simple as that.

The entropy measures that have been presented are usually quantities computed for one specific piece of text expressed in one specific character set. In such a case, there is one entropy value for each measure (i.e. one for H1, one for H2, one for H2−H1) and there are no error bars.
These are the entropy measures for _that text_.
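(For concreteness, a minimal Python sketch of what is being computed for a single string; the snippet is invented, and spaces are counted as characters here, which real computations may or may not do:)

Code:
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def h1(text):
    """Single-character entropy H1."""
    return entropy(Counter(text))

def h2(text):
    """Character-pair (bigram) entropy H2."""
    return entropy(Counter(zip(text, text[1:])))

text = "daiin daiin okeody qokeey"  # invented EVA-like snippet
print(h1(text), h2(text) - h1(text))  # H1 and conditional entropy H2 - H1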

This changes as soon as one starts to interpret it as an entropy measure for "the language" in which the text is expressed. In that case, one could (or should) take many different sample texts in that language and establish the error bars as you suggest. One can compute sigmas, even though the distribution will not be Gaussian.

This is where it gets a bit complicated, because the entropy values depend (to some extent) on the length of the text sample, and to a lesser extent on the subject matter, the author and the age of the text.

Shorter texts will have larger error bars, but there is also a deterministic component in the dependence on length.
This is negligible for H1, visible for H2, significant for H3, and causes prohibitive problems in computing H4 and higher in a reliable manner.
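
To see the deterministic part of this length effect, one can compute the same statistics on nested prefixes of a single text. A rough sketch of my own (the file path is a hypothetical placeholder, not one of my actual files):

Code:
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def h_n(text, order):
    """Entropy of character n-grams of the given order."""
    return entropy(Counter(text[i:i + order]
                           for i in range(len(text) - order + 1)))

text = open("latin_sample.txt").read()  # hypothetical plain-text sample
for length in (1000, 5000, 20000, 80000):
    p = text[:length]
    print(length, h_n(p, 1), h_n(p, 2), h_n(p, 3))
# In a short sample most trigrams occur only once, so the higher-order
# estimates come out biased low; H1 stabilises much earlier.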

The plot that Marco included from my web site shows two effects related to this uncertainty. On the one hand, two Latin texts written 16 centuries apart, but of similar length (because I truncated both), show very similar values for H1 and H2 − H1.
On the other hand, different transcription alphabets for Voynichese show (partly) different statistics, because the information is 'packetised' in different manners. Still, only Eva shows a clearly significant difference.

For the many source texts used by Marco, there is exactly one sample text in each language, which is also a bit short, so we remain uncertain about the size of the error bars for the respective languages.
Zandbergen-Senpai, Co-father of EVA,
This unworthy one thanks you for your detailed and as usual crystal clear reply!

Seriously though that's great. I'm with you now.

So if the underlying distribution is not Gaussian, what is it? Poisson or binomial maybe?
In biochemistry we often force the issue by using a log transformation, or just get the 95th percentile non-parametrically by lining up the values in order and cutting off the top and tail.
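(In code, that non-parametric trick is just a sort and two cuts; a toy sketch of my own:)

Code:
def percentile_interval(values, central=0.95):
    """Line the values up in order and cut off the top and tail."""
    xs = sorted(values)
    k = int(len(xs) * (1 - central) / 2)
    return xs[k], xs[-k - 1]

# with only a handful of values this is just min and max; it needs many
print(percentile_interval([1.8, 1.9, 2.0, 2.0, 2.1, 2.2, 2.3, 2.4]))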

Be that as it may, I guess with the VMS we have to go folio by folio or quire by quire, then get the standard deviation of the results? Has this been done?

Lastly, my big question would be: what would happen to the stats for the whole VMS if we slipped in a passage from the Bible, in say Latin, transcribed phonetically into Voynichese? This harks back to the steganography (WW2) thread, and assumes perhaps 10% plaintext in a 90% covertext of pure Voynichese.

I bet the entropy stats would not be able to detect this. Your expert eyes would be a different matter, of course, depending on how cunning the author was.
(03-06-2018, 02:50 AM)DONJCH Wrote: Did you mean to say you are NOT an expert in statistics rather than that you are?

You are right of course, thank you DONJCH!
Since the entropy values are positive-only, I would expect the distribution to be more Poisson-like, but it is quite a smooth distribution and as long as one is reasonably far away from zero it won't make a big difference.

The same method of defining 95th-percentile values that you suggest was used for the radio-carbon dating statistics, where the distribution is far from Gaussian.

Here is the Dutch news item claiming that the first line was "cracked" by Kondrak and Hauer.
What a total waste of time!
After some time, here is an update on my experiments with the UDHR corpus.
I have added a fourth property: the percentage of words starting with the most frequent symbol. E.g. in the string:
aa aba caa addd
the value is 75% (3 of 4 words starting with the most frequent symbol "a")
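
In code this measure is straightforward; a small Python sketch of my own that reproduces the 75% of the example above:

Code:
from collections import Counter

def pct_init_most_freq(text):
    """Percentage of words starting with the overall most frequent character."""
    chars = Counter(c for c in text if not c.isspace())
    top = chars.most_common(1)[0][0]
    words = text.split()
    return 100 * sum(w.startswith(top) for w in words) / len(words)

print(pct_init_most_freq("aa aba caa addd"))  # -> 75.0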

I have also added another Voynich dataset: the first 13,000 characters of the Currier-D'Imperio transcription (I extracted the data from one of Rene's IVTT files); the length of the sample is comparable with the average length of UDHR files.

The Voynich datasets I now consider are:
_C-D_13K the Currier-D'Imperio file described above
_EVA_ZL_ALL the whole Zandbergen-Landini transcription of the manuscript
_NEAL_A Currier A subset of Takahashi's transcription, modified as suggested by Philip Neal
_NEAL_B same as above for Currier B
These correspond to the blue diamonds in the plots.

I have compared each UDHR file with the Voynich samples on four measures:
  1. ENT1: 1st order entropy
  2. COND: conditional entropy
  3. REP1000: number of word repetitions per 1000 words. I have updated this count to include repetitions in which the two occurrences are separated by a dash (in addition to repetitions separated by a space). Reduplications with no separation (like "purpur") are also included; see the sketch after this list.
  4. %_INIT_MOST_FREQ: percentage of words starting with the most frequent character
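Of the four, REP1000 is the only one that needs care about separators. Here is one possible reading of the definition in Python (my interpretation of the description above, so the exact counts may differ from Marco's):

Code:
import re

def rep1000(text):
    """Exact word repetitions per 1000 words: pairs separated by a space
    or a dash, plus unseparated reduplications like "purpur"."""
    words = re.split(r"[ -]+", text.lower())
    pairs = sum(a == b for a, b in zip(words, words[1:]))
    redup = sum(1 for w in words
                if len(w) >= 4 and len(w) % 2 == 0
                and w[:len(w) // 2] == w[len(w) // 2:])
    return 1000 * (pairs + redup) / len(words)

print(rep1000("daiin daiin ol purpur chedy-chedy otedy"))  # invented sample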
Entropy graphs (the second one "zooms" into the VMS area). As is well known, most languages have higher entropy values than Voynichese.
REP1000 / %_INIT_MOST_FREQ graphs. Most languages cluster near the origin, with zero repetition and less than 10% of words starting with the most frequent character.


The five best matches are (by increasing distance from the VMS samples):
1.16 hms Hmong, Southern Qiandong (China)
1.53 auc Waorani (Ecuador)
1.66 pam Pampangan (Philippines)
1.89 cot Caquinte (Peru)
1.99 rar Rarotongan (Polynesia)
These correspond to the green circles in the plots.
Both the numeric distances and the plots should make it clear that hms Hmong is a better match than all the other UDHR samples, according to these specific measures.
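
For reference, the distance here is presumably something like a Euclidean distance over the four measures after putting them on a common scale; I have not posted the exact formula, so the following is only a hypothetical reconstruction with invented numbers:

Code:
from math import sqrt
from statistics import mean, stdev

# measures[lang] = (ENT1, COND, REP1000, %_INIT_MOST_FREQ); values invented
measures = {"vms": (3.8, 2.1, 25.0, 55.0),
            "hms": (3.9, 2.3, 20.0, 48.0),
            "eng": (4.1, 3.2, 0.5, 8.0)}

# z-score each column so the four scales are comparable
cols = list(zip(*measures.values()))
mu = [mean(c) for c in cols]
sd = [stdev(c) for c in cols]
z = {k: [(v - m) / s for v, m, s in zip(vals, mu, sd)]
     for k, vals in measures.items()}

def dist(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(z[a], z[b])))

for lang in ("hms", "eng"):
    print(lang, round(dist("vms", lang), 2))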

These languages come from very different places. Geographically, none of them looks likely for the VMS, but I am considering looking into some of them, in order to understand the kind of phenomena that can result in Voynich-like texts.

I have manually selected these six languages as the closest ones that are also geographically European or close to Europe:
3.13 gla Gaelic, Scottish
3.48 gle Gaelic, Irish
3.52 hye Armenian
3.65 ydd Yiddish, Eastern
3.74 nld Dutch
3.77 eus Basque Euskara
These correspond to the orange circles in the plots.

The best one (Scottish Gaelic) only ranks 77th among the 378 languages I considered. These European candidates perform quite poorly in the entropy plot and also in the repetition measure. The Basque UDHR text has three instances of exactly repeated words: I am fairly sure that word repetition is an actual linguistic phenomenon in Basque, but the evidence suggests that the phenomenon is considerably more frequent in Voynichese.

On the other hand, the two variants of Gaelic are good matches in terms of words beginning with the most frequent symbol: both have several short function words starting with a-. Armenian and Eastern Yiddish also perform rather well according to this measure. But these four languages do not seem to make use of word repetition. As a result, in the second plot as well, all European languages appear rather distant from the VMS samples.

I also manually selected three languages (yellow circles) that are somewhat intermediate between the green and orange samples: they fit slightly better than the orange circles and are slightly closer to Europe than the green circles.
2.30 plt Malagasy, Plateau (Madagascar)
2.40 flm Chin, Falam (Southeast Asia)
2.69 njo Naga, Ao (North-East India)

Even if the results are so far negative, I believe this research is promising: whatever system was used to create the VMS, if the text has a linguistic meaning, it is very likely to be expressed in a language that has a close relative in the UDHR corpus. Assuming that this is the case, why are the results negative? There are several explanations, not mutually exclusive; for instance:
  • We have not found the right set of statistical properties yet.
  • Our geographical assumptions are wrong (e.g. Voynichese is a close relative of Hmong).
  • The encoding used in the manuscript is not directly comparable with the writing system used in the UDHR (e.g. the ms is written in a complex cipher that scrambles the statistics; or the ms is written alphabetically while the corresponding language in the UDHR is written syllabically).