Hello. I did some statistical analysis of the most frequent repeated word sequences, and a principal component analysis (PCA) of the transition relationships between characters, on the same auto-generated text that Timm & Schinner used in their article, compared with an excerpt of the Voynich manuscript (the 'recipes' section, f103r-116v, the same excerpt their algorithm was aiming to simulate). I used my own routines written in C++ to extract and assemble the information.
Comments on the results text files (attached)
The sizes of both texts, measured in number of words (tokens), are similar: 10832 vs 10681. However, Timm and Schinner's algorithm somewhat under-predicts the number of unique words (2228 vs 3103 in the Voynich text). The frequency distribution of the most frequent words looks similar, except for the three most frequent words, whose frequencies are over-predicted by roughly a factor of two.
The total numbers of consecutive word repetitions (the same word occurring two or three times in a row) are also somewhat under-predicted.
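For reference, here is a minimal sketch of the kind of counting behind the figures above (this is not my actual routine; it just assumes a plain text file of whitespace-separated EVA words, and the file name is only an example):

[code]
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::ifstream in("voynich_103r-116v_eva.txt");   // or timm_autogen1.txt
    std::unordered_map<std::string, int> freq;       // word -> frequency
    std::string word, prev;
    long tokens = 0, consecutiveRepeats = 0;

    while (in >> word) {                             // simple whitespace tokenisation
        ++tokens;
        ++freq[word];
        if (word == prev) ++consecutiveRepeats;      // same word directly repeated
        prev = word;
    }

    std::cout << "tokens: " << tokens << "\n"
              << "unique words (types): " << freq.size() << "\n"
              << "consecutive repetitions: " << consecutiveRepeats << "\n";
}
[/code]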
For the lists of repeated phrases of two or three words: words with a low edit distance to 'cheedy' (such as 'chedy', 'eedy', etc.) dominate the lists of repeated phrases in their auto-generated text. The total number of repeated two- and three-word phrases is also over-predicted, by roughly a factor of two. Both texts contain more repeated two- and three-word phrases than randomly word-shuffled texts would (as seen from comparison with the expected values), but both texts lack longer repeated sequences.
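The repeated-phrase figures come from a straightforward n-gram tally along the following lines (again only a sketch, not my actual routine; the exact counting convention in the attached files may differ, and the expected values for the shuffled case were obtained separately from shuffled copies of the same word list):

[code]
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Count occurrences of n-grams that appear more than once in the token sequence.
static long countRepeatedNgrams(const std::vector<std::string>& words, std::size_t n) {
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + n <= words.size(); ++i) {
        std::string key = words[i];
        for (std::size_t j = 1; j < n; ++j) key += ' ' + words[i + j];
        ++counts[key];
    }
    long repeated = 0;
    for (const auto& kv : counts)
        if (kv.second > 1) repeated += kv.second;    // occurrences belonging to repeated phrases
    return repeated;
}

int main() {
    std::ifstream in("voynich_103r-116v_eva.txt");
    std::vector<std::string> words;
    for (std::string w; in >> w; ) words.push_back(w);

    std::cout << "repeated 2-word phrases: " << countRepeatedNgrams(words, 2) << "\n"
              << "repeated 3-word phrases: " << countRepeatedNgrams(words, 3) << "\n";
}
[/code]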
Another difference is that I see no single-character (single-glyph) words in the auto-generated text. The Voynich text has many single-character words, some of them among the most frequent words.
For the PCA, things get more peculiar...
I used the same procedure to analyse the data as in my article in Cryptologia, to show relationships between the transition frequencies of individual characters to other characters, based on analysis of the word vocabularies. See the resulting score plots of the characters in the auto-generated text (left) and the Voynich recipes text (right), plotted along the first two principal components (horizontal vs. vertical axes). The characters group together similarly in both plots, and also show the same placements relative to the directions of the original variable axes (if those were projected into the plots).
[attachment=3965] [attachment=3966]
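For anyone who wants to try something along these lines themselves, here is a rough sketch of how a character-to-character transition matrix can be built from the word vocabulary and projected onto its first two principal components. This is not my actual routine (that follows the procedure described in the Cryptologia article, whose normalisation differs in detail); the sketch assumes the Eigen library, ignores word-boundary transitions, and uses a plain covariance PCA, just to illustrate the idea:

[code]
#include <Eigen/Dense>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Collect the vocabulary (distinct words) from the text.
    std::ifstream in("timm_autogen1.txt");            // or voynich_103r-116v_eva.txt
    std::set<std::string> vocab;
    for (std::string w; in >> w; ) vocab.insert(w);

    // Index the characters that occur in the vocabulary.
    std::set<char> charSet;
    for (const auto& w : vocab)
        for (char c : w) charSet.insert(c);
    std::vector<char> chars(charSet.begin(), charSet.end());
    std::map<char, int> index;
    for (std::size_t i = 0; i < chars.size(); ++i) index[chars[i]] = static_cast<int>(i);

    // T(i, j): how often character i is followed by character j within the vocabulary words.
    const int n = static_cast<int>(chars.size());
    Eigen::MatrixXd T = Eigen::MatrixXd::Zero(n, n);
    for (const auto& w : vocab)
        for (std::size_t k = 0; k + 1 < w.size(); ++k)
            T(index[w[k]], index[w[k + 1]]) += 1.0;

    // Normalise rows to relative transition frequencies and centre the columns.
    for (int i = 0; i < n; ++i) {
        double s = T.row(i).sum();
        if (s > 0) T.row(i) /= s;
    }
    Eigen::MatrixXd X = T.rowwise() - T.colwise().mean();

    // PCA via eigendecomposition of the covariance matrix (eigenvalues come out ascending).
    Eigen::MatrixXd cov = (X.transpose() * X) / double(n - 1);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);
    Eigen::VectorXd pc1 = es.eigenvectors().col(n - 1);  // largest eigenvalue
    Eigen::VectorXd pc2 = es.eigenvectors().col(n - 2);  // second largest
    Eigen::VectorXd s1 = X * pc1, s2 = X * pc2;

    // One line per character: its scores on the first two principal components.
    for (int i = 0; i < n; ++i)
        std::cout << chars[i] << "\t" << s1(i) << "\t" << s2(i) << "\n";
}
[/code]

Plotting the two score columns against each other gives a score plot of the same general kind as the attached ones, although the exact layout will depend on the normalisation choices.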
What I found peculiar, though, is that about half of the characters from the auto-generated text fall almost exactly on a straight line, and many of the remaining characters also seem to line up along a second line crossing the first. The characters from the Voynich text, by contrast, appear much more irregularly placed (similar to what you find for words in natural language). Could this be an indication that, if a simple text-generation algorithm such as the one in the article is used, deeper analysis of the transition frequencies between the characters will also reveal a simpler, mathematically quantifiable relationship?
Personally, I'm not sure what to think of Timm and Schinner's generating algorithm. It could be true that a similar process was used to write the Voynich manuscript, but in that case I think it must have been more complicated or more arbitrary. And would a medieval scribe have had the patience and/or the motivation to generate the text that way?
Attached files:
The words/phrases analysis on the auto-generated text: 'v_analysis_timm.txt'
The words/phrases analysis on the Voynich manuscript text f103r-116v: 'v_analysis_103r-116v_EVA.txt'
Excerpt from the Voynich manuscript text: 'voynich_103r-116v_eva.txt'
Text sample generated by the algorithm: 'timm_autogen1.txt'