Hello. I did some statistical analysis of the most frequent repeated word sequences, and a principal component analysis (PCA) of the transition relationships between characters, on the same auto-generated text that Timm & Schinner used in their article, compared with an excerpt of the Voynich manuscript (the 'recipes' section, f103r-116v, the same excerpt their algorithm was aiming to simulate). I used my own routines written in C++ to extract and assemble the information.
Comments on the results text files (attached)
The sizes of both texts, measured in number of words (tokens), are similar: 10832 vs 10681. However, Timm and Schinner's algorithm somewhat under-predicts the number of unique words (2228 vs 3103 in the Voynich text). The frequency distribution of the most frequent words looks similar, except for the three most frequent words, whose frequencies are over-predicted by roughly a factor of two.
The total numbers of consecutive word repetitions (the same word occurring two or three times in a row) are also somewhat under-predicted.
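For reference, here is a minimal sketch of the kind of counting behind the figures above (this is not my actual routine; it just assumes a plain text file of whitespace-separated EVA words, and the file name is only an example):

[code]
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::ifstream in("voynich_103r-116v_eva.txt");   // or timm_autogen1.txt
    std::unordered_map<std::string, int> freq;       // word -> frequency
    std::string word, prev;
    long tokens = 0, consecutiveRepeats = 0;

    while (in >> word) {                             // simple whitespace tokenisation
        ++tokens;
        ++freq[word];
        if (word == prev) ++consecutiveRepeats;      // same word directly repeated
        prev = word;
    }

    std::cout << "tokens: " << tokens << "\n"
              << "unique words (types): " << freq.size() << "\n"
              << "consecutive repetitions: " << consecutiveRepeats << "\n";
}
[/code]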
For the lists of repeated phrases of two or three words: words with a low edit distance to 'cheedy' (such as 'chedy', 'eedy', etc.) dominate the lists of repeated phrases in their auto-generated text. The total number of repeated two- and three-word phrases is also over-predicted, by roughly a factor of two. Both texts contain more repeated two- and three-word phrases than randomly word-shuffled texts would (as seen from comparison with the expected values), but both texts lack longer repeated sequences.
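The repeated-phrase figures come from a straightforward n-gram tally along the following lines (again only a sketch, not my actual routine; the exact counting convention in the attached files may differ, and the expected values for the shuffled case were obtained separately from shuffled copies of the same word list):

[code]
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Count occurrences of n-grams that appear more than once in the token sequence.
static long countRepeatedNgrams(const std::vector<std::string>& words, std::size_t n) {
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + n <= words.size(); ++i) {
        std::string key = words[i];
        for (std::size_t j = 1; j < n; ++j) key += ' ' + words[i + j];
        ++counts[key];
    }
    long repeated = 0;
    for (const auto& kv : counts)
        if (kv.second > 1) repeated += kv.second;    // occurrences belonging to repeated phrases
    return repeated;
}

int main() {
    std::ifstream in("voynich_103r-116v_eva.txt");
    std::vector<std::string> words;
    for (std::string w; in >> w; ) words.push_back(w);

    std::cout << "repeated 2-word phrases: " << countRepeatedNgrams(words, 2) << "\n"
              << "repeated 3-word phrases: " << countRepeatedNgrams(words, 3) << "\n";
}
[/code]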
Another difference is that I see no single-character (single-glyph) words in the auto-generated text. The Voynich text has many single-character words, some of them among the most frequent words.
For the PCA, things get more peculiar...
I used the same procedure to analyse the data as in my article in Cryptologia, to show relationships between the transition frequencies of individual characters to other characters, based on analysis of the word vocabularies. See the resulting score plots of the characters in the auto-generated text (left) and the Voynich recipes text (right), plotted along the first two principal components (horizontal vs. vertical axes). The characters group together similarly in both plots, and also show the same placements relative to the directions of the original variable axes (if those were projected into the plots).
[attachment=3965] [attachment=3966]
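For anyone who wants to try something along these lines themselves, here is a rough sketch of how a character-to-character transition matrix can be built from the word vocabulary and projected onto its first two principal components. This is not my actual routine (that follows the procedure described in the Cryptologia article, whose normalisation differs in detail); the sketch assumes the Eigen library, ignores word-boundary transitions, and uses a plain covariance PCA, just to illustrate the idea:

[code]
#include <Eigen/Dense>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Collect the vocabulary (distinct words) from the text.
    std::ifstream in("timm_autogen1.txt");            // or voynich_103r-116v_eva.txt
    std::set<std::string> vocab;
    for (std::string w; in >> w; ) vocab.insert(w);

    // Index the characters that occur in the vocabulary.
    std::set<char> charSet;
    for (const auto& w : vocab)
        for (char c : w) charSet.insert(c);
    std::vector<char> chars(charSet.begin(), charSet.end());
    std::map<char, int> index;
    for (std::size_t i = 0; i < chars.size(); ++i) index[chars[i]] = static_cast<int>(i);

    // T(i, j): how often character i is followed by character j within the vocabulary words.
    const int n = static_cast<int>(chars.size());
    Eigen::MatrixXd T = Eigen::MatrixXd::Zero(n, n);
    for (const auto& w : vocab)
        for (std::size_t k = 0; k + 1 < w.size(); ++k)
            T(index[w[k]], index[w[k + 1]]) += 1.0;

    // Normalise rows to relative transition frequencies and centre the columns.
    for (int i = 0; i < n; ++i) {
        double s = T.row(i).sum();
        if (s > 0) T.row(i) /= s;
    }
    Eigen::MatrixXd X = T.rowwise() - T.colwise().mean();

    // PCA via eigendecomposition of the covariance matrix (eigenvalues come out ascending).
    Eigen::MatrixXd cov = (X.transpose() * X) / double(n - 1);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);
    Eigen::VectorXd pc1 = es.eigenvectors().col(n - 1);  // largest eigenvalue
    Eigen::VectorXd pc2 = es.eigenvectors().col(n - 2);  // second largest
    Eigen::VectorXd s1 = X * pc1, s2 = X * pc2;

    // One line per character: its scores on the first two principal components.
    for (int i = 0; i < n; ++i)
        std::cout << chars[i] << "\t" << s1(i) << "\t" << s2(i) << "\n";
}
[/code]

Plotting the two score columns against each other gives a score plot of the same general kind as the attached ones, although the exact layout will depend on the normalisation choices.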
What I found peculiar, though, is that about half of the characters from the auto-generated text fall almost exactly on a straight line, and many of the remaining characters also seem to line up along a second line crossing the first. The characters from the Voynich text, by contrast, appear much more irregularly placed (similar to what you find for words in natural language). Could this be an indication that, if a simple text-generation algorithm such as the one in the article is used, deeper analysis of the transition frequencies between the characters will also reveal a simpler, mathematically quantifiable relationship?
Personally, I'm not sure what to think of Timm and Schinner's generating algorithm. It could be true that a similar process was used to write the Voynich manuscript, but in that case I think it must have been more complicated or more arbitrary. And would a medieval scribe have had the patience and/or the motivation to generate the text that way?
Attached files:
The words/phrases analysis on the auto-generated text: 'v_analysis_timm.txt'
The words/phrases analysis on the Voynich manuscript text f103r-116v: 'v_analysis_103r-116v_EVA.txt'
Excerpt from the Voynich manuscript text: 'voynich_103r-116v_eva.txt'
Text sample generated by the algorithm: 'timm_autogen1.txt'