The Voynich Ninja

Full Version: Discussion of "A possible generating algorithm of the Voynich manuscript"
(23-10-2019, 05:29 AM)ReneZ Wrote: It is not written down specifically (and it is again something that I wanted to double-check) but it seems to be implied that every word in the MS (after the initialisation) is the result of auto-copying. That is, there are no words that are 'new seeds' or incidental re-initialisations.
(I have asked earlier in this thread about this, but I think that this question was understood in a different way).
In any case, these points clarify that the initialisation procedure is too important just not to mention.

Then, if this assumption (no new seeds) is true, one could verify the auto-copying hypothesis by checking for each word in the MS if there is a recent (how far back?) similar word (which max. edit distance?) from which it could be derived. 
This seems to be the most basic test of the method.

Hi Rene,
I have been trying to do something along the lines you suggest, but I am not sure I could produce anything helpful or new. As always, I might have made mistakes in the process.

[attachment=3595]
This histogram measures the minimum Levenshtein distance of words from preceding words in the same page. The first 200 words of each page are considered. For distance=0, this means that a word-type is repeated and the value is complementary to TTR. As we already know from Koen's experiments, Timm's text behaves similarly to Q13.
I have included comparisons for Latin (Pliny) and Italian (Machiavelli). Here I have generated "pseudo-pages" by splitting the text at a fixed length. Again, as observed by Koen, Voynichese and Timm's generated text are close to Italian, with respect to TTR / distance=0.
Values for distance=1 depend on the phenomenon that Timm has analysed with his networks of words: Voynichese has values close to those for distance=0, while Latin and Italian drop to about half their distance=0 values.
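
For readers who want to reproduce this kind of histogram, here is a minimal sketch in Python. It is not the script used for the attachment: the function names, the handling of the first word of each page (skipped here, since it has no predecessor) and the pseudo-page splitter are all assumptions made for illustration.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def min_distance_histogram(pages, max_words=200):
    """Count, for every word after the first on a page, the smallest
    distance to any earlier word on that same page."""
    hist = Counter()
    for page in pages:
        words = page[:max_words]  # only the first max_words tokens are used
        for i in range(1, len(words)):
            hist[min(levenshtein(words[i], w) for w in words[:i])] += 1
    return hist

def pseudo_pages(words, size=200):
    """Split a running text into fixed-length chunks (assumed chunk size)."""
    return [words[i:i + size] for i in range(0, len(words), size)]
```

Each page is a list of tokens, so a transliteration line such as "qokeedy.qokedy" would first be split on the dots. The quadratic scan over earlier words is slow but adequate for 200-word pages.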

I am facing two problems here:
1. definition of a meaningful quantitative measure;
2. definition of an acceptance threshold on that measure.

For instance, in the VMS, 87% of words have a distance from a previous word in the same page that is smaller than 3. Is this value enough to confirm Timm's theory? It certainly is considerably higher than in ordinary written languages and autocopying could explain this difference.

One could focus on the remaining 13% of words and see if there is something that cannot be accounted for by Timm's theory, but I am not sure how this could be done. Autocopying allows the combination of previous words to generate new words. Could shapchedyfeey in [link] result from sho.pcheey.pchey in f8r? Maybe, with a sufficient effort, one could define a quantitative measure that tells us how likely this is, but at the moment I cannot think of anything that would add much to what we already know.
Timm and Schinner wrote that "the scribe had complete freedom to implement random personal aesthetic preferences, spontaneous impulses, or even idiosyncrasies". How can we exclude that shapchedyfeey results from a spontaneous impulse? I guess this sentence means that some deviation of the generated text from actual Voynichese must be regarded as acceptable. How do we set an acceptance threshold?
Timm, you did not understand the question correctly. To form a new word, why can't one just take any character from the EVA alphabet and insert it (or change one) anywhere in the base word? Wouldn't that be an easier and faster way to create a meaningless text? Instead, they strictly adhere to certain rules that ensure the existence of invalid combinations of characters.
Hi Marco,

since Latin and Italian do not have the network of similar words, the percentages for small edit distances >0 will necessarily be much smaller.

If it were possible to do a word-for-word substitution of the plain text to use the Voynich vocabulary, it might provide a more interesting test.

In any case, the yellow bars above are already interesting.
Very interesting, Marco.
It seems to me that Timm's method is intended to mimic exactly these statistics, and apparently he did a decent job.

What I've been thinking for a while already though, is that someone with enough skill could probably make something similar for Latin. It would need much more freedom than the Voynichese generator, but you could easily tweak it to generate something that's phonetically regular in Latin and approaches Latin's edit distance and TTR stats.

So my feeling is that these stats, while very illuminating, certainly don't tell the whole story.

Rene: you mean for example sort the words of both texts by frequency and substitute in that order?
(30-10-2019, 08:35 AM)Wladimir D Wrote: Timm, you did not understand the question correctly. To form a new word, why can't one just take any character from the EVA alphabet and insert it (or change one) anywhere in the base word? Wouldn't that be an easier and faster way to create a meaningless text? Instead, they strictly adhere to certain rules that ensure the existence of invalid combinations of characters.

An experiment says more than a thousand words. By doing the experiment you can check yourself if your intuition is correct. Is method C indeed easier and faster than method B? You can also check the result for method B if you have replaced any character with any other character or if you have preferred certain replacements.
Koen, indeed.

In general about edit distance: if two three-letter words are completely different, do they have an edit distance of three?
Hi Marco,

if it is not too complicated to repeat the exercise with other texts, there are also the texts that were analysed [link], but you would not only have to create your own page breaks: the single words would also have to be converted to lines.
I suspect this could be done with standard Unix commands.

These texts are a plain text of Pliny with a word-by-word substitution that would favour short edit distances.

Rene
(30-10-2019, 09:58 AM)ReneZ Wrote: In general about edit distance: if two three-letter words are completely different, do they have an edit distance of three?

Hi Rene,
yes, Levenshtein distance works like that. A related measure (Levenshtein ratio) computes a similarity coefficient in the 0-1 range that takes length into account.
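
A small sketch of both points, in Python. The normalisation below (one minus distance divided by the longer length) is only one simple way to get a 0-1 coefficient. Libraries that offer a "Levenshtein ratio" define it in different ways, so this is an assumption and not necessarily the measure meant above.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical words, 0.0 when every
    position of the longer word has to be edited."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

# Two completely different three-letter words need three substitutions.
print(levenshtein("dog", "cat"))
# One edit is a small change for a seven-letter word.
print(similarity("qokeedy", "qokedy"))
```

The raw distance treats one edit in a two-letter word and one edit in a seven-letter word alike, which is why a length-aware coefficient can be useful alongside it.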

This is the graph for the new experiments. The Pliny files are those you linked.

[attachment=3596]

I encoded Machiavelli as suggested by you and Koen, by mapping words according to number of tokens, so the mapping is:
  e  -> daiin 
  che  -> ol 
  di  -> chedy 
  non  -> aiin 
  a  -> shedy 
  ...
The most frequent Italian words are mapped into the most frequent Voynichese words.
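
The frequency-rank mapping can be sketched in a few lines of Python. The function names are made up for this illustration, ties in frequency are broken by order of first appearance, and source words left over when the target vocabulary runs out are kept unchanged. All of these are assumptions, not details of the actual experiment.

```python
from collections import Counter

def rank_mapping(source_tokens, target_tokens):
    """Pair the n-th most frequent source word with the n-th most
    frequent target word."""
    src = [w for w, _ in Counter(source_tokens).most_common()]
    tgt = [w for w, _ in Counter(target_tokens).most_common()]
    return dict(zip(src, tgt))  # zip stops at the shorter vocabulary

def encode(tokens, mapping):
    """Word-for-word substitution; unmapped words pass through unchanged."""
    return [mapping.get(w, w) for w in tokens]

# Toy data only, not the real frequency lists.
italian = "e che e di e che".split()
voynich = "daiin ol daiin chedy daiin ol".split()
mapping = rank_mapping(italian, voynich)
print(encode(italian, mapping))
```

Because the substitution is one-to-one, the number of types and the positions of repeats are unchanged, which is why distance=0 stays the same after encoding.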

This mapping, like what you did with Pliny, does not impact distance=0, which still is complementary to the original TTR. Pliny has too many types per N tokens to be a good match, but it seems clear that both methods of encoding produce a number of similar tokens that can also appear next to each other.

These methods do not generate exact reduplication (which, as you explained above, should be present in the original text, in order to appear in the ciphered output).
They also do not generate sequences made of several occurrences of two similar word-types, like the famous 5-tokens / 2-types
<f75r.38,+P0>    qokeedy.qokeedy.qokedy.qokedy.qokeedy
These are the numbers of such sequences with length 3 or more (since the files have different lengths, I include the number of K-words).
[attachment=3597]

Apart from the VMS and Timm's text, the only other occurrence is in the original Machiavelli: "ne se ne".
9 of the 45 occurrences in the VMS do not involve exact reduplication, e.g.:

<f76r.34,+P0>    cheor.shey.qoolkal.shedy.shedy.shey.shedy.ollchy.shlches.shcthy.sain.oly
<f107v.39,+P0>  okain.cheor.olkaiin.oain.cheary.raiin.okaiin.odaiin.okaiin.y
<f108r.32,+P0>  ykedar.chedy.qokey.lkeedy.otedy.okedy.otedy.otal.shol.alodar.or.alold
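
One way to search for such sequences is sketched below in Python. The exact criteria behind the counts above are not stated, so the definition used here is an assumption: a run of at least three consecutive tokens drawn from exactly two word-types whose edit distance is at most 1. The scan is greedy and non-overlapping, so it may split or miss runs that a more careful search would find.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def two_type_runs(tokens, min_len=3, max_dist=1):
    """Greedy scan for runs built from exactly two similar word-types
    (assumed definition, see the note above)."""
    runs = []
    i = 0
    while i < len(tokens):
        types = []
        j = i
        while j < len(tokens):
            t = tokens[j]
            if t not in types:
                if len(types) == 2:
                    break  # a third type ends the run
                if types and levenshtein(types[0], t) > max_dist:
                    break  # second type is not similar enough
                types.append(t)
            j += 1
        if j - i >= min_len and len(types) == 2:
            runs.append(tokens[i:j])
            i = j  # continue after the run
        else:
            i += 1
    return runs

line = "ykedar.chedy.qokey.lkeedy.otedy.okedy.otedy.otal.shol.alodar.or.alold"
print(two_type_runs(line.split(".")))
```

Raising max_dist or allowing single-type runs would change the counts, so any comparison between texts should use the same settings throughout.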
That's quite remarkable: the encoded Machiavelli is almost identical in behaviour with the Voynich text.
(30-10-2019, 08:13 PM)ReneZ Wrote: That's quite remarkable: the encoded Machiavelli is almost identical in behaviour with the Voynich text.

Indeed, at least in these per page statistics. This probably means that Voynichese vocabulary alone is enough to account for the high frequencies of small edit distance?

It would probably be different for small window MATTR though. EDIT: of course, it would retain all TTR properties of the original Italian...
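
To make the point in the EDIT concrete, here is a minimal Python sketch of a moving-average TTR. The window size, the function name and the toy word list are assumptions chosen for illustration. Since a one-to-one word substitution keeps the pattern of repeats intact, the value is the same before and after encoding.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean share of distinct words
    over every window of consecutive tokens."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Toy example: a one-to-one mapping leaves the value unchanged.
italian = "e che e di e che non a e di".split()
mapping = {"e": "daiin", "che": "ol", "di": "chedy", "non": "aiin", "a": "shedy"}
encoded = [mapping[w] for w in italian]
print(mattr(italian, window=3), mattr(encoded, window=3))
```

A real comparison would use a much longer text and a window of a few dozen tokens or fewer.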