The Voynich Ninja

Full Version: How to recombine glyphs to increase character entropy?
(02-05-2020, 11:23 AM)MarcoP Wrote: Another difference is that I used the Zandbergen-Landini transcription, while Koen used Takahashi's. But I don't think this really makes a difference.

Takeshi Takahashi's transliteration is used in different sources, and they are not all the same. I have been working with the one that can be extracted from the interlinear file.

There are also several versions of the ZL transliteration, as I have been fixing errors and improving its consistency.

The differences in the location of word spaces are likely to be significant.

At the glyph level the correspondence between the two is 97.5%, and this difference will show up in the statistics.

To test two different tools or approaches, both should use an identical input data set.

To see the impact of transliteration errors, one could apply the same tool to two different source texts and see what difference it makes. This is only an indicative test, because the two sources may share common errors.

Standard texts for reproducible analyses may be found [link].
Thank you very much, Rene! You are right, of course: I apologize for the lack of references.

The Takahashi and ZL transcriptions I used in the previous posts are:

ZL_ivtff_1c.txt - which you mentioned [link], but which no longer seems to be available anywhere;

TT_ivtff_v0a.txt - which also no longer seems to be available online.

Have you ever considered using some kind of version control to distribute ivtt files? That could make older versions available, so that experiments that we run today may still be reproduced a couple of years from now.
....(baffled to see all these "investigations" going on here, all exactly the same as i performed years ago, with the exact same results ...)

You will see that using versions of transcriptions will prove useful for one purpose and useless for another, because transcriptions are still subjective, although I suspect that many of you disagree. That is also because the errors made by the scribe (using the wrong letter in a given position) amount to approximately 1% of the total. These so-called errors are not clearly visible, but they can be detected at the statistical level, as well as through grammatical composition. Here on the forum they are often presented as exceptions or anomalies and used as arguments for why a certain theory does not hold, which is the exact opposite of how I see it.

In my experience it is best to use the same file over and over again, even if the transcript later turns out to contain some errors.
Any changes in between must be backwards compatible, otherwise comparisons with older investigations and analyses become useless, or at least less significant,
though it of course also depends somewhat on what you investigate.

It is also possible to change the transcription text in such a way that you obtain the results you are looking for. That is often done in publications regarding the Voynich.
Indeed, there are some good uses for older files, in particular regression testing of one's tools.

I am much in favour of having some (semi-)automated way of running tests/experiments, such that they can be easily repeated for different source data or different options. This makes it easier to move to newer versions.

In any case, I prepared an area with a few of the older files, including the ones listed by Marco.
If there are others, please let me know.

There is a link to this area below the caption of Table 12 on [link]. The page may need to be refreshed first.

This link points to [link].
(03-05-2020, 01:14 PM)Davidsch Wrote: ....(baffled to see all these "investigations" going on here, all exactly the same as i performed years ago, with the exact same results ...)

Wonderful, David, it will save me a lot of work if you have already gone down this path. Could you tell us what your findings were? Or do you have a link?
Rene, I was mailing with Marco about the best way to establish a range to aim for. He suggested something like avg - stdDev to avg + stdDev. Do you think this would be a good approach? This is all just to get a ballpark idea of what is "normal".

The values for h2/h1 would be as follows:

Code:
      Space         Nospace
AVG    0.78         0.83
STDEV  0.04         0.03

This would result in a range of 0.74 - 0.82 for texts with spaces and 0.79 - 0.86 for those without.
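For readers who want to reproduce these numbers: h1 is the first-order (single-character) entropy and h2 the conditional entropy of a character given its predecessor, so h2/h1 is at most 1. A minimal sketch (the sample string is illustrative, not Voynich data):

```python
import math
from collections import Counter

def h1_h2(text):
    """First-order entropy h1 and conditional entropy h2, in bits/char.

    h2 is computed as H(bigram) - h1: the entropy of a character
    conditioned on the character before it.
    """
    n = len(text)
    h1 = -sum(c / n * math.log2(c / n) for c in Counter(text).values())
    m = n - 1
    bigrams = Counter(zip(text, text[1:]))
    h_bigram = -sum(c / m * math.log2(c / m) for c in bigrams.values())
    return h1, h_bigram - h1

h1, h2 = h1_h2("the quick brown fox jumps over the lazy dog " * 40)
ratio = h2 / h1  # directly comparable to the Space/Nospace averages above
```

Removing spaces before calling the function gives the "Nospace" variant of the measure.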
This has a clear meaning for data that are normally distributed (i.e. 'Gaussian').  Data that can only take positive values cannot be Gaussian, but for some part of the curve it may look 'similar' to Gaussian.
If data are normally distributed, then about two thirds (68%) falls within the range you propose and roughly one third lies outside, which is still quite a lot.

For distributions with an unknown shape, one can use 'percentiles'. The 95% interval is used a lot because, in a normal distribution, it roughly corresponds to 2-sigma. There is then only a 5% probability that the quantity falls outside the two limits.

Percentiles are not computed as easily as the mean and standard deviation (which need only one pass through the data), but Excel may have functions for this (?)
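As an alternative to a spreadsheet, the 95% interval can be read off with stdlib percentiles in Python. The Gaussian sample below is purely illustrative, seeded around the "Space" values from the table:

```python
import random
import statistics

random.seed(0)
# Illustrative sample: roughly Gaussian around avg=0.78, stdev=0.04
data = [random.gauss(0.78, 0.04) for _ in range(10_000)]

# quantiles(n=40) returns 39 cut points at 2.5% steps;
# the first and last bound a central 95% interval
qs = statistics.quantiles(data, n=40)
lo, hi = qs[0], qs[-1]

mu = statistics.fmean(data)
sd = statistics.stdev(data)
# For Gaussian data, (lo, hi) is close to mu -/+ 1.96 * sd;
# for skewed data the percentile interval is the safer choice
```

For genuinely skewed distributions the percentile bounds and the mean-plus-minus-2-sigma bounds will disagree, which is exactly the point of using percentiles.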
This snippet illustrates length variation in medieval Latin words (it is slightly on the long side, but not an extreme example; it is reasonably typical):


Many of the words in this selection are abbreviated, which shortens them, and yet even with the abbreviations, we see the following word lengths:

     6 10 15 5 11 3 8 3 3 4 10 5 9 9 3 6 9 4 4 (the last word and indulgere each break across a line)

  • 19 words, 11 of them abbreviated (more than half)
  • range: 3 to 15 characters (the last word has been chopped across a line, it is 4 chars, and indulgere is one word)
  • average word length (unexpanded) 6.7 characters
  • average word length (abbreviations expanded) 7.6 characters
  • 16 different characters at the beginnings of words
[attachment=4412]


Then I took a snippet from [link] where some of the tokens tend to be longer. These stats are for the first 19 tokens:

    5 5 6 4 6 6 5 4 2 4 3 4 7 6 5 4 7 4 5
  • 19 words (starting with tedal), unknown whether there are abbreviations
  • range: 2 to 7 glyphs (and I was being generous in assigning "2" since it is a half-space, it may actually be 3 to 7 glyphs)
  • average token length 4.8 (minims were treated as individual glyphs but they may not be)
  • 6 different characters at the beginnings of tokens
[attachment=4411]
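The per-snippet figures above (count, length range, average, distinct word-initial characters) can be reproduced with a short helper. The sample string below is illustrative only, not the actual Latin or VMS snippet:

```python
def word_stats(text):
    """Count, length range, average length, and number of distinct
    initial characters of the whitespace-separated words in `text`."""
    words = text.split()
    lengths = [len(w) for w in words]
    return {
        "n_words": len(words),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "initials": len({w[0] for w in words}),
    }

stats = word_stats("in nomine domini amen")  # illustrative sample
```

Running this over a transliteration file would require first stripping the IVTFF metadata, which is not shown here.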


But... the almost 30% difference in token length isn't the most important difference.

The more important patterns are:
  • the variation in length, with medieval Latin ranging from 3 to 15 characters and the VMS within a narrower range, from 2 to 7 (Note that the "2" in this equation is actually a half-space. It might actually be a 6 rather than 2 + 4, which makes it even less varied than the Latin sample, but there are 2-glyph tokens in the VMS, so I decided to go with the half-space.)
  • the positionality of specific characters in the VMS, which is partly illustrated by the limited character set used at the beginnings of tokens
If you take the spaces at face value, the VMS is lacking in word-length variation (in addition to being highly repetitious and positional).
If you are suspicious of spaces (I have been suspicious of spaces for a long time, partly because of this kind of pattern), then there remains the possibility of abbreviations, of markers/modifiers, of both, or of tokens split across a space.

------------------
Caveat: I am well aware that one should not do statistical comparisons on small sample sets. In fact, it is folly to do so. But we have many statistical attacks on larger sample sets (and those are the ones that should be considered important), yet I still get the feeling that the differences are not being internalized by a number of people offering substitution solutions, and I still sense resistance to the idea that the spaces might be suspect.

So, I thought I would post this to help researchers visualize what is going on in the larger-sample studies but with a smaller amount of data and actual clips of text, just to get the idea across.
As with almost any other topic, good references on word-length statistics can be found [link] (4.6 Word length distribution).

In this case, the fundamental work is [link].

Stolfi's research is mentioned in Reddy and Knight's well-known 2011 paper [link]: they add some interesting observations.

Recently [link] mentioned the corpus of texts in different languages collected by [link].

I cleaned the texts by removing (most) punctuation and computed two simple statistical measures for word lengths:
  • average value
  • standard deviation (a measure of variability)
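A sketch of the two measures, with a crude punctuation strip (the regex is an assumption; the exact cleaning used for the plot is not described):

```python
import re
import statistics

def length_stats(text):
    """Average and (population) standard deviation of word lengths,
    after replacing punctuation with spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", text)
    lengths = [len(w) for w in cleaned.split()]
    return statistics.fmean(lengths), statistics.pstdev(lengths)

avg, sd = length_stats("Lorem ipsum, dolor sit amet!")  # illustrative
```

Each text sample then contributes one (average, standard deviation) point to the scatter plot.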
The VMS samples (blue) are:
  • The Currier-D'Imperio transliteration (about half the ms is included)
  • Currier A and B using the EVA transliteration by Zandbergen and Landini (ZL_ivtff_1c.txt), with and without uncertain spaces.
This is the resulting plot:
[attachment=4416]

Of course, the two measures are positively correlated: if a text has longer words it also has more room for variability.
The graph is dominated by a few extreme outliers, for instance:
  • KAL (Greenlandic) - From Cham's corpus description: "Family: Eskimo-Aleut Notes: Polysynthetic language; words can be very long."
  • THA (Thai) does not use spaces between words, so the counts here correspond more to syllables in a sentence than to sounds in a word.

This plot restricts the average length to a range closer to the centre of the diagram:
[attachment=4415]

As one can see, considering uncertain spaces in the VMS has a much smaller effect than the transliteration system used. The Currier-D'Imperio system joins several EVA sequences into single symbols, so that daiin is encoded as 8AM and chol as SOE.
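The joining effect can be mimicked with a greedy longest-match rewriter. The mapping below is only enough to reproduce the two examples mentioned (daiin to 8AM, chol to SOE); it is not the full Currier-D'Imperio table:

```python
# Partial, illustrative EVA -> Currier-D'Imperio style mapping;
# covers only the two examples in the text, not the real system.
TABLE = {"iin": "M", "ch": "S", "d": "8", "a": "A", "o": "O", "l": "E"}

def to_cd(word, table=TABLE):
    """Rewrite a word greedily, always matching the longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(word[i])  # pass unknown glyphs through unchanged
            i += 1
    return "".join(out)
```

Because several EVA characters collapse into one symbol, average word length drops, which moves the sample toward the lower left of the scatter plot.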
Several languages have the same average word-length as the VMS. For instance Italian (ITA) and Middle English (ENM) are close to VMS-CD. Ancient Greek (GRC) and Mongolian (MON) are close to the EVA samples.
As JKP showed for Latin, the difference is that the other texts with comparable average word length all have considerably greater variability (Standard Deviation): they appear to be "higher" on the plot.

The Arabic abjad (ARA) is one of the text samples that comes closest to the VMS (this is also discussed by Reddy and Knight).

Stolfi observed that some languages are really close to Voynichese word-length behaviour: pinyin Chinese (LZH), Vietnamese (VIE) and Tibetan (not in Cham's corpus). They appear at the bottom left of the graph. Stolfi could match Voynichese to these languages by using an encoding system (which he calls OKO) even more "compressive" than CD: this has the effect of further reducing the average length of words, so that the results become comparable with Chinese and similar languages.

One of Stolfi's plots (frequencies of different word-lengths):
[EDIT: this plot appears to be based on word types (i.e. ignoring word frequencies) while the scatter-plot above was computed on word tokens]
[attachment=4417]
Very interesting, Marco. Do we know how OKO works? He seems to eliminate all vowel bigrams, but I'd like to see the actual substitutions. The link on Stolfi's site doesn't work for me. I wonder what its entropy values are and how those compare to Vietnamese.