The Voynich Ninja

Full Version: Discussion of "A possible generating algorithm of the Voynich manuscript"
(30-10-2019, 09:03 PM)Koen G Wrote:
(30-10-2019, 08:13 PM)ReneZ Wrote: That's quite remarkable: the encoded Machiavelli is almost identical in behaviour to the Voynich text.

Indeed, at least in these per-page statistics. This probably means that Voynichese vocabulary alone is enough to account for the high frequency of small edit distances?

Since the mapping matched word frequencies in the two texts (Machiavelli and VMS), we basically imported the VMS network of similar words into Machiavelli, which already started with a very similar TTR. For instance, one of the properties mentioned by Torsten is that in Voynichese, word types that occur frequently also have the highest number of similar word types: this mapping ensures that the encoded Machiavelli shares this property.
In the original Italian, there are 2 words with edit distance 1 from "che" (the second most frequent word); in the encoded version, "che" is mapped to "ol" and there are 38 similar word types.
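In rough code, the substitution boils down to rank-matched replacement, something like this (a minimal Python sketch; the function names and the toy word lists are only illustrative, not the script actually used):

Code:
from collections import Counter

def frequency_rank_map(source_words, target_words):
    # Map each source word type to the target word type of the same
    # frequency rank (rank-matched substitution).
    src_ranked = [w for w, _ in Counter(source_words).most_common()]
    tgt_ranked = [w for w, _ in Counter(target_words).most_common()]
    # Surplus source types are left unmapped here; a real mapping
    # would need a fallback rule for them.
    return dict(zip(src_ranked, tgt_ranked))

def encode(source_words, mapping):
    # Rewrite the source text word by word using the rank mapping.
    return [mapping.get(w, w) for w in source_words]

# Toy example: the 2nd most frequent Italian word is replaced by the
# 2nd most frequent word of the comparison vocabulary, and so on.
italian = "che la che di che la di il".split()
voynich = "daiin ol daiin chedy ol daiin chedy aiin".split()
print(encode(italian, frequency_rank_map(italian, voynich)))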
Ah yes, I understand Marco. But I can't think of another way of mapping that would be "fair". Word frequency is one of those things that allow mapping without any interpretation.

So I guess the results of this exercise were a bit predictable? You end up importing many similar words, so edit distance per page is low. And any patterns are predetermined by the Italian text.
Let's be very careful.
We need to be as critical about these statistics and their interpretation as about those of Torsten.

What is not a problem is the statistical significance. All results (Torsten's and these) are based on large text samples.

Now to start with Koen's last remark.
It was to be expected that all three modified versions of Machiavelli and Pliny would be closer to the Voynich text than the originals. However, it was not expected - at least by me - that the modified Machiavelli would be that close to the Voynich text.

Obviously, the two originals behave completely differently - see the dark purple and light blue bars.

The modified versions of Pliny only impose a word structure, and both are even more oriented towards small edit distances than Voynichese. This suggests that these word structures are perhaps more rigid than Voynichese. pliny_mod1 uses Roman numerals, which indeed 'feels' more rigid than Voynichese.

pliny_mod2 is interesting, because its structure allows one to predict the frequencies of the various edit distances. For all words that are not the same (edit distance 0), the ratio should be 20:5:1 for edit distances 1, 2, 3, and smaller again for 4.
Since edit distance 0 accounts for 19%, that leaves 81% for the others, and the predicted frequencies would be:
62, 15.5, 3
which fits almost exactly.
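The arithmetic is simply the 81% split in the assumed 20:5:1 ratio, e.g. (a throwaway check, the function name is made up):

Code:
def predicted_percentages(ratio, p_zero=0.19):
    # Split the remaining 1 - p_zero probability mass according to the
    # assumed ratio for edit distances 1, 2 and 3.
    total = sum(ratio)
    return [round(100 * (1 - p_zero) * r / total, 1) for r in ratio]

print(predicted_percentages([20, 5, 1]))  # -> [62.3, 15.6, 3.1]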

So now, what does this very close match between the modified Machiavelli and Voynichese mean?
I would say that it answers exactly this question from Marco:

Quote:For instance, in the VMS, 87% of words have a distance from a previous word in the same page that is smaller than 3. Is this value enough to confirm Timm's theory?

The answer is 'No', because a meaningful text that was not generated through auto-copying has exactly these statistics.

Now does this show that the Voynich text was not necessarily generated by auto-copying?

Here too I have to say: no, or at least, not yet. This has not yet been demonstrated.

This is mainly because Marco's test isn't exactly testing Torsten's method. That method says that the first lines of paragraphs are copied not from the same page but from previous first lines of paragraphs. That means that these lines should not be searched for matching previous words, but should instead be searched as sources for matching words from later text on the same page.

On average, probably more than 13% of the words are on first lines, keeping in mind that the average paragraph length is strongly affected by the short paragraphs in quire 20.

This would be a more complicated test. For the modified Machiavelli text this is a meaningless point.

However, we can still learn more. In the modified Machiavelli, the page breaks are arbitrary: the text is not page-oriented, and the breaks were introduced at arbitrary points.
So, one can predict that the minimum edit distance will vary over each page as follows:
Since the appearance of similar words is purely a matter of chance, near the top of each page the average will be higher, since there are simply fewer words to compare with. As one goes down the page, the average will decrease as there are more and more words to compare with.
This would be the typical behaviour of a text that was not created using auto-copying.

For a text that was created using auto-copying, such a trend should not exist, and the minimum edit distance should be much more equal over each page.
In fact, if this could be visualised (e.g. using colour-coded words), one should see more or less constant behaviour over the text, except for the first lines of paragraphs, which should be higher because their words were also taken from earlier pages.
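To make the suggested test concrete, the per-line profile could be computed roughly like this (a sketch only: the helper names are mine, a plain Levenshtein distance is used, and the paragraph-initial-line exception described above is not handled):

Code:
def levenshtein(a, b):
    # Plain Levenshtein distance, two-row dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def min_distance_profile(page_lines):
    # For each line of a page (a list of word lists), return the average
    # minimum edit distance of its words to all earlier words on the page.
    seen, profile = [], []
    for line in page_lines:
        dists = [min((levenshtein(w, s) for s in seen), default=None) for w in line]
        dists = [d for d in dists if d is not None]
        profile.append(sum(dists) / len(dists) if dists else None)
        seen.extend(line)
    return profile

A downward trend in this profile would be the chance effect described above; a flat profile is what auto-copying would predict.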
(30-10-2019, 10:15 PM)MarcoP Wrote: Since the mapping matched word frequencies in the two texts (Machiavelli and VMS), we basically imported the VMS network of similar words into Machiavelli, which already started with a very similar TTR. For instance, one of the properties mentioned by Torsten is that in Voynichese, word types that occur frequently also have the highest number of similar word types: this mapping ensures that the encoded Machiavelli shares this property.
In the original Italian, there are 2 words with edit distance 1 from "che" (the second most frequent word); in the encoded version, "che" is mapped to "ol" and there are 38 similar word types.


Indeed, searching for similar words in the Voynich manuscript is like searching for trees in a forest. There are far too many of them. It would be much more interesting to count the number of similar word types.

In 2014 I ran a test for rarely used words: "In this paper all words occurring seven times and all words occurring eight times are used as two separate control samples. ..." (Timm 2015, p. 12ff).

To run this test it was necessary to define what 'similar' means and what 'near to each other' means:
"To consider the peculiarities of the VMS script, the edit distance is defined as follows: If a glyph is deleted, added or replaced by a similar glyph, this is counted as one change. Also, the change from 'ee' into 'ch' or from 'eke' or 'kch' into 'ckh' is counted as one change. If a glyph is replaced by a non-similar glyph, this is treated as deleting one glyph and adding another glyph. This is counted as two changes." (Timm 2015, p. 6).

"Near to each other means within a range of three lines before and after a word, and similar means that it is possible to transform them into each other by changing three or fewer glyphs (edit distance <=3)." (Timm 2015, p. 13).

(30-10-2019, 08:33 AM)MarcoP Wrote: Could shapchedyfeey in [link] result from sho.pcheey.pchey in f8r?


For testing a rarely used word I find it useful to check the transcription first. In this case only Takahashi transcribes this token as <shapchedyfeey>. Currier and the first study group transcribed it as two tokens, <shapchedy> and <fchy> (see [link]). In my eyes the 'ee' is indeed a 'ch'. I would therefore transcribe this token as <shapchedyfchy>.

The most similar word to 'shapchedy' on page [link] is <chypchey> in f26r.P.1. Therefore I find it more likely that <chypchey> was used as the source word for 'shapchedy'. But maybe <sho.pcheey.pchey> on page f8r was used as the source for generating <chypchey>.

The sequence 'pchey' in <chypchey> is also similar to 'fchy'. Therefore <chypchey> is also a possible source for the sequence 'fchy'. It is quite common that a source word was used twice. See for instance the tokens <pcheey> and <pchey> in <sho.pcheey.pchey>.
Hi Rene,
I think that the results for mod1 and mod2 tell us more than the encryption of Machiavelli: your two methods autonomously generate a network of similar words, as Timm's method does.
You wrote:

Quote:pliny_mod2 is interesting, because its structure allows one to predict the frequencies of the various edit distances. For all words that are not the same (edit distance 0), the ratio should be 20:5:1 for edit distances 1, 2, 3, and smaller again for 4.

This seems a promising approach. I was thinking that it might be worth applying the method to Machiavelli (or to a different text with a more Voynich-like TTR than Pliny), but I believe these ratios are not optimal. The histogram for Voynichese looks more like 15:10:5. I wonder how difficult it would be to tune the method to get results closer to Voynichese?

I attach measures for pairs of consecutive words: exact reduplication (distance=0) and quasi-reduplication (distance=1). I have normalized the measures, dividing by the number of words in each text.
Unless I messed something up, the frequency of quasi-reduplication in mod1 and mod2 is about four times as large as in the VMS (14% vs 3%).
[attachment=3598]
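For reference, the measure itself is simple (a quick sketch using a plain Levenshtein distance and ignoring transliteration subtleties):

Code:
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def reduplication_rates(words):
    # Fraction of tokens that exactly repeat (distance 0) or nearly
    # repeat (distance 1) the immediately preceding token, divided by
    # the total number of words, as described above.
    if not words:
        return 0.0, 0.0
    exact = sum(1 for w1, w2 in zip(words, words[1:]) if w1 == w2)
    quasi = sum(1 for w1, w2 in zip(words, words[1:])
                if w1 != w2 and levenshtein(w1, w2) == 1)
    n = len(words)
    return exact / n, quasi / n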
(31-10-2019, 05:41 PM)MarcoP Wrote: I think that the results for mod1 and mod2 tell us more than the encryption of Machiavelli: your two methods autonomously generate a network of similar words, as Timm's method does.
You wrote:



Quote:pliny_mod2 is interesting, because its structure allows one to predict the frequencies of the various edit distances. For all words that are not the same (edit distance 0), the ratio should be 20:5:1 for edit distances 1, 2, 3, and smaller again for 4.



This seems a promising approach. I was thinking that it might be worth applying the method to Machiavelli (or to a different text with a more Voynich-like TTR than Pliny), but I believe these ratios are not optimal. The histogram for Voynichese looks more like 15:10:5. I wonder how difficult it would be to tune the method to get results closer to Voynichese?

I can give that a go.

Note that the mod1 and mod2 approaches both assign numbers to words, and the number is increased by 1 whenever there is a new word. This means that, most of the time, the closest recent word in terms of edit distance is either an identical word (distance 0) or the word whose numeral is 1 less.

The encoding for mod1 uses Roman numerals, so the minimal change can have an edit distance of 1, 2 or more, and this could be predicted. It is easier for mod2, which is more regular. It uses 5-'digit' words, and one could call both this and the Roman numerals 'high-endian'. That is: minimal changes happen on the right-hand side of the 'word'.
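As an illustration of the 'high-endian counter' idea (the digit alphabet and base below are placeholders, not the actual mod2 encoding):

Code:
import string

DIGITS = string.ascii_lowercase[:20]   # 20 pseudo-glyphs acting as digits
BASE = len(DIGITS)

def counter_word(n, width=5):
    # Write integer n as a fixed-width word with the most significant
    # 'digit' first, so consecutive numbers usually differ only in the
    # rightmost position.
    out = []
    for _ in range(width):
        out.append(DIGITS[n % BASE])
        n //= BASE
    return "".join(reversed(out))

print(counter_word(7), counter_word(8))      # aaaah aaaai  (edit distance 1)
print(counter_word(419), counter_word(420))  # aabat aabba  (a 'carry': distance 2)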

It is an interesting question whether the Voynich words could form some enumeration system. It seems doubtful, because minimal changes seem to happen anywhere in the word. However, a closer study of the word structure might still bring further insights.

Note that edit distance depends on the representation (transliteration). The results for the same text will differ depending on whether it is given in Eva, Cuva, FSG, etc.

PS: looking at the _mod2 words again, I wonder if my prediction of 20:5:1 was right; probably it should have been 20:4:1. In that case, the percentages would become 65, 13 and 3, which doesn't fit quite as well, and I need to think about that a bit.
(31-10-2019, 06:22 AM)ReneZ Wrote: That method says that the first lines of paragraphs are copied not from the same page but from previous first lines of paragraphs. That means that these lines should not be searched for matching previous words, but should instead be searched as sources for matching words from later text on the same page.

On average, probably more than 13% of the words are on first lines, keeping in mind that the average paragraph length is strongly affected by the short paragraphs in quire 20.

This would be a more complicated test. For the modified Machiavelli text this is a meaningless point.

However, we can still learn more. In the modified Machiavelli, the page breaks are arbitrary: the text is not page-oriented, and the breaks were introduced at arbitrary points.
So, one can predict that the minimum edit distance will vary over each page as follows:
Since the appearance of similar words is purely a matter of chance, near the top of each page the average will be higher, since there are simply fewer words to compare with. As one goes down the page, the average will decrease as there are more and more words to compare with.
This would be the typical behaviour of a text that was not created using auto-copying.

For a text that was created using auto-copying, such a trend should not exist, and the minimum edit distance should be much more equal over each page.
In fact, if this could be visualised (e.g. using colour-coded words), one should see more or less constant behaviour over the text, except for the first lines of paragraphs, which should be higher because their words were also taken from earlier pages.

Hi Rene,
again I have tried something along the lines you suggested. I am not sure I got everything right, but here are my results.
I have simplified the method you proposed, since I did not want to go into the details of detecting paragraph boundaries (though that is certainly feasible with some more effort).

As a first experiment, I processed each page, collecting the average distance of words from the closest words in previous lines. I considered the first 8 lines of each page. The plot starts from line 2 (since we need a previous line to compare with). The graph on the right is normalized by dividing by the line:2 measure. For this particular experiment I removed pliny_mod1 and Q13 and added Q20, which seems closer to Timm's generated text (more pages, more text and more paragraphs than Q13).

[attachment=3635]

It seems that, even for auto-copying, the growing set of matchable words leads to progressively smaller and smaller values. All curves have a clear descending trend. The exception is what happens with Timm's data at line:5. I believe this is due to paragraph ends / new paragraphs, since most of the paragraphs in this data-set appear to be 4 or 5 lines long.


I tried an even simpler variant: matching each line against the first line of the page. Here there is a clearer difference: Timm's data appear to steadily increase over the first lines. Line 2 is by far closer to line 1 than all the other lines are. I believe this is because line 2 is derived from line 1 only, while line 3 is derived from lines 1 and 2, and so on. Each new line introduces more random variants, so words progressively diverge from those in line 1. Again, things change when new paragraphs begin.

[attachment=3634]

All the other data-sets oscillate around 1 with deviations of about 10%. Interestingly, mod2 also seems to progressively increase, though at a much lower rate than Timm's generated text. I guess this is due to the "counter" assuming values that are progressively higher than those in the first line.
The yellow line for the whole VMS appears to be almost perfectly flat. 
But Q20 might have some "paragraph-effect", possibly due to the features of the first lines of paragraphs (e.g. 'p' and 'f', Grove words) that make them less comparable with the rest. Similarly to Timm's data, there is a rising trend on the first few lines, but clearly the deviation is much smaller (11% vs 48%).
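For anyone who wants to check my numbers, the two measurements boil down to something like this (a simplified sketch: paragraph detection and the normalisation by the line:2 value are left out, and the helper names are mine):

Code:
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def avg_min_dist(line, pool):
    # Average, over the words of `line`, of the minimum edit distance
    # to any word in `pool`; None if either list is empty.
    if not pool or not line:
        return None
    return sum(min(levenshtein(w, p) for p in pool) for w in line) / len(line)

def line_profiles(pages, max_lines=8):
    # For each line index 2..max_lines, average over all pages:
    # (a) distance to the words of all previous lines on the page,
    # (b) distance to the words of the first line only.
    prev_curve, first_curve = {}, {}
    for k in range(2, max_lines + 1):
        a_vals, b_vals = [], []
        for page in pages:                      # page = list of lines (word lists)
            if len(page) < k:
                continue
            seen = [w for line in page[:k - 1] for w in line]
            a = avg_min_dist(page[k - 1], seen)
            b = avg_min_dist(page[k - 1], page[0])
            if a is not None:
                a_vals.append(a)
            if b is not None:
                b_vals.append(b)
        prev_curve[k] = sum(a_vals) / len(a_vals) if a_vals else None
        first_curve[k] = sum(b_vals) / len(b_vals) if b_vals else None
    return prev_curve, first_curve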
Excellent, many thanks Marco.

I didn't initially intend to experiment myself, but I found a simple implementation of the Levenshtein distance in Fortran, so I have been doing something similar - not yet completed.

Looking at the very first picture, it appears that:
- Quire 20 is not significantly different from the Voynich text as a whole.
- Only the unchanged Latin and Italian plain texts lie above the Voynich MS.
- The modified Machiavelli, which was closest to the Voynich text in the other comparison, is actually a little bit below the Voynich MS text in edit distance, so the small edit distances in the Voynich MS are not conclusively a consequence of auto-copying.
- Torsten's simulated text seems to be aiming at an average edit distance of 1.5. However, as the comparison text increases down the page, the probability of finding words with smaller differences also increases.

I am not so sure how to interpret the second plot with the ratios. I think that the absolute differences between line 2 and line 8 are probably more significant.
Had the algorithm been aiming for an average edit distance of over 2, it would probably be closer to the Voynich MS text. However, an average over 2 basically means a range of 0 to over 4, and this is hardly auto-copying anymore.

The 'absolute drop' of the plain texts is 1.2 - 1.3 (reading from the figure), and this is also true for the Voynich MS text and the modified Machiavelli.

For Pliny_mod2 and Torsten's app it is 0.6 - 0.7. The mod2 text is special in the sense that the words are very short, so edit distances are automatically smaller.

Looking at all this, I suspect that the substitution made on the Machiavelli text, if applied to Pliny, would likewise come very close to the Voynich text.
Combining this with other observations, I am beginning to wonder if one of the main features of the Voynich MS text is that it is really page-oriented, i.e. properties are quite page-dependent. This would fit with an encyclopedic work (like Pliny).
Keep in mind: "We don't argue that the text was created by a computer program and we also don't argue that our program is able to simulate the complexity of human behavior" (Timm & Schinner 2019, p. 15). We have kept the algorithm as simple as possible. Our goal was to demonstrate that even our simple implementation "reproduces the intriguing key properties of the original text, including the presence of long-range correlations, the 'binomial-like' word length distribution, and both of Zipf’s laws" (Timm & Schinner 2019, p. 2). We also say: "Of course, it is possible to pinpoint quantitative differences between the real VMS and the used facsimile text (most likely any facsimile text). An example is the quantitative deviation of the <q>-prefix distribution from the original VMS text" (Timm & Schinner 2019, p. 15).

To keep the algorithm simple, the app uses words from previous paragraph-initial lines as source words for generating paragraph-initial lines, and source words from the same page otherwise (see Timm & Schinner 2019, p. 12). This way the app only needs a single line of text as an initialization parameter. As a result, words from other pages are introduced only in paragraph-initial lines, and it is therefore expected that the second line is closer to the initial line of a page than the following lines are.
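A minimal sketch of just this source-selection rule (not the actual program, without the glyph-modification step; the data structure and the seed line are only illustrative):

Code:
import random

def pick_source_word(pages, page, on_paragraph_initial_line):
    # `pages` is the list of pages generated so far; each page is a dict
    # with 'lines' (lists of words) and 'initial_lines' (the subset that
    # starts a paragraph). Paragraph-initial lines draw their source
    # words from earlier paragraph-initial lines, everything else draws
    # from the current page.
    if on_paragraph_initial_line:
        pool = [w for p in pages for line in p["initial_lines"] for w in line]
    else:
        pool = [w for line in page["lines"] for w in line]
    return random.choice(pool) if pool else None

# Toy usage with a single seed line:
seed = {"lines": [["fachys", "ykal", "ar", "ataiin"]],
        "initial_lines": [["fachys", "ykal", "ar", "ataiin"]]}
print(pick_source_word([seed], seed, on_paragraph_initial_line=True))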

When it comes to initial lines in the VMS, we argue instead: "There was a similar problem for the author of the VMS every time he/she was starting a new (empty) page. In such a case it was probably useful to use another page as source. There is some evidence that the scribe preferred the last completed sheet for this purpose (see Timm 2015, p. 16)" (Timm & Schinner 2019, p. 11). It is unlikely that source words from the previous sheet were used only for generating initial lines: "One peculiarity for the words used seven or eight times is that they often appear on subsequent pages. For instance 'qodal' appears on four consecutive sheets: f51v, f52r, [link] and f54v. For the glyph groups which occur seven or eight times, this happens more often on subsequent sheets (37 times) than on the front- and back of a page (5 times)" (Timm 2015, p. 17). Another observation for paragraph-initial lines is: "on some pages several paragraphs begin with similarly spelled glyph groups. For instance, on page f3r two paragraph initial groups ending with a final 'm'-glyph occur: <tsheoarom> in f3r.P.15 and <pcheoldom> in f3r.P.18. In the whole of the VMS, there are only seven paragraph initial groups with a final m-glyph. Therefore, it is remarkable that two of these occur on page f3r. Moreover, both groups are slightly similar to each other and it is possible to identify a possible source group for them on the same page in line f3r.P.9: <sheoldam>" (Timm 2015, p. 30f). (See also table XXIV, "Similarities for paragraph initial glyph groups", in Timm 2015, p. 95ff.) Because of these observations I conclude that the scribe was using tokens he was able to see as source words.
I'm not sure if I understand this correctly, but isn't Marco's final graph evidence against a progressive system in general, regardless of whether computers are involved?