Following up on my previous post #53, I think I have some very interesting results:
• I can now propose a metric for the evaluation of slot grammars (a problem which I posed in post #39).
• I developed a random generator of pseudo-Voynich texts, which writes remarkably good Voynichese.
I think I can now say the LOOP grammar effectively captures the peculiar structure underlying the Voynichese word types and could be fruitful for future studies. But I’ll let you judge it.
-----------------------
A METRIC FOR EVALUATING SLOT GRAMMARS
It’s relatively easy to use a slot grammar to divide any word type into chunks. I’m not going to delve into the details here, but I think any programmer who reads post #53 can easily write software to do it (basically, scan the word and check where it matches the slot grammar).
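For anyone who prefers code to words, here is a minimal Python sketch of the idea. The data structures and the toy grammar are mine, invented only for illustration; they are not the actual LOOP-4 tables of post #53, and a real implementation may need backtracking instead of the greedy longest-match used here.
[code]
# Minimal sketch of 'chunkification' against a slot grammar.
# ASSUMED encoding: the grammar is an ordered list of slots, each slot a set
# of admissible chunks; every slot is optional and the whole slot sequence
# may repeat up to `repeats` times (a LOOP grammar).

def chunkify(word, slots, repeats=4):
    """Return the list of chunks if `word` matches the grammar, else None."""
    chunks, pos = [], 0
    for _ in range(repeats):                      # each pass is one 'loop'
        for slot in slots:
            # greedy: try the longest admissible chunk of this slot first
            for chunk in sorted(slot, key=len, reverse=True):
                if word.startswith(chunk, pos):
                    chunks.append(chunk)
                    pos += len(chunk)
                    break                         # slot filled, go to next slot
        if pos == len(word):
            return chunks                         # whole word covered
    return None                                   # not generated by the grammar

# Tiny toy grammar, NOT the LOOP-4 grammar:
toy_slots = [{"q"}, {"o", "a"}, {"k", "t", "d"}, {"e", "ee"}, {"dy", "y", "aiin", "ain"}]
print(chunkify("qokaiin", toy_slots))   # -> ['q', 'o', 'k', 'aiin']
print(chunkify("otedy", toy_slots))     # -> ['o', 't', 'e', 'dy']
[/code]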
The result of the ‘chunkification’ is a table similar to this:
But this table is just another slot grammar: remove the first column and rename the headers:
For lack of a better name I’ll call the second grammar the ‘chunkified’ version of the original grammar.
I propose as a ‘figure of merit’ for a slot grammar the total number of unique chunks in the chunkified grammar (Nchunktypes). That is to say: the total number of unique chunks in the second table above. The lower the number, the ‘better’ the grammar. This applies to any grammar, with a loop or without (the latter being simply a grammar whose ‘loop’ repeats only once).
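In code, the figure of merit boils down to the size of a set. A sketch, reusing the chunkify() toy from above (again, names and structures are my own):
[code]
def n_chunk_types(word_types, slots, repeats=4):
    """Nchunktypes: number of distinct chunks appearing in the chunkified
    grammar, counting only the word types the grammar actually covers."""
    unique_chunks = set()
    for w in word_types:
        parts = chunkify(w, slots, repeats)
        if parts is not None:
            unique_chunks.update(parts)
    return len(unique_chunks)
[/code]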
Nchunktypes works because (see post #39) chunkifying the Trivial LL*CS grammar yields the Trivial 1*WS grammar (this is very easy to verify by hand), which is quite remarkable. The Trivial 1*WS grammar uses as many chunks as there are word types in the original text, so it will always score worst on this metric, and thus Nchunktypes is able to reject both trivial grammars. Also, this completely bypasses the intractable mathematical problem inherent in the use of the efficiency metric (see again post #39).
Note 1: instead of counting only unique chunk types (Nchunktypes), it’s also possible to count every chunk in the table with its multiplicity (that would be what I called Ncharset (of the chunkified grammar) in post #39). I have no idea which of the two is better (a nice math problem, but not very important at the moment, and maybe someone has already solved it, see Note 2).
Note 2: I find it hard to believe that all this has not already been worked out, probably in some very different form and context but with equivalent meaning, in some branch of mathematics (group theory?). I really don’t know, but if anyone has an idea, please let me know. By the way, the trivial 1*WS grammar is invariant under chunkification.
---------------------
Well, you already knew where I was heading… this is a comparison table between the LOOP-4 grammar (defined as in post #53) and Zattera’s SLOT and ThomasCoon’s V2 grammars. To get more data points, I also used a low-coverage LOOP-2 grammar (the same as LOOP-4, but with only two repeats), and two high-coverage grammars (SLOT 2X and V2 2X) obtained by duplicating SLOT and V2 (turning them into 2-loop grammars).
Note: LOOP-4 has 606 chunks, not 605 as previously reported. I missed one due to a bug.
Can a better grammar exist, with fewer than 606 chunk types? Yes, it could: I did not test many variations, but I bet it will be a variant of LOOP-4.
-------------------------------------------
GENERATING HIGH-QUALITY VOYNICHESE TEXTS
As I said in post #53, I planned to use the chunkified LOOP-4 grammar, together with the word token frequency data, to build a random Voynich word generator. This task turned out, I think, very well.
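Since I’m not describing the engine in detail here, take the following only as my rough sketch of how a chunk-level Markov generator could look: a transition table estimated from the chunkified word tokens, with artificial start/end symbols of my own invention. The actual transition table used by Asemic is in the “LOOP-4 Chunkified and transitions” file linked below.
[code]
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"   # artificial word-boundary symbols (my convention)

def train_transitions(chunkified_tokens):
    """Estimate chunk-to-chunk transition counts from chunkified word tokens."""
    counts = defaultdict(Counter)
    for chunks in chunkified_tokens:
        path = [START] + chunks + [END]
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    return counts

def generate_word(counts, rng):
    """Random walk over the transition table until the END symbol is drawn."""
    word, cur = [], START
    while True:
        options = counts[cur]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            return "".join(word)
        word.append(nxt)
        cur = nxt

# Usage sketch: chunkified_tokens would come from chunkifying every word
# token of RF1a-n with the LOOP-4 grammar (not included here).
rng = random.Random(22)        # seed 22, as in Asemic-22
# counts = train_transitions(chunkified_tokens)
# text = " ".join(generate_word(counts, rng) for _ in range(n_tokens))
[/code]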
This is a random sample of the output (I call it Asemic-22 because it used 22 as the random seed):
The output has been generated so that it has exactly the same number of words as Voynich RF1a-n (I’ll skip the details of how words with rare characters are handled), to make comparisons easier. I generated just two texts, Asemic-0 and Asemic-22: it only takes a few seconds, but I think two were enough, as I don’t want to uselessly chase the ‘perfect’ random seed. Here follow the comparisons with RF1a-n (I’m only posting the graphs, but of course I can also provide the data tables).
CHARACTERS STATISTICS
Character and bigram distributions are indistinguishable from the true Voynich:
Note: in Asemic-0 ‘y’ and ‘h’ switch place (the frequencies of ‘y’ and ‘h’ are very similar).
Note: SPACE is represented by a blank in the 2D graph. ‘Rare’ characters are not shown. Columns ‘y’ and ‘h’ are switched in Asemic-0.
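For anyone who wants to redo the comparison on their own transliteration, the character and bigram frequencies are trivial to compute (a sketch; here I treat the space between words as a character, as in the 2D graph):
[code]
from collections import Counter

def char_and_bigram_freqs(text):
    """Relative frequencies of single characters and of character bigrams.
    `text` is the transliteration as one string, words separated by spaces."""
    chars = Counter(text)
    bigrams = Counter(a + b for a, b in zip(text, text[1:]))
    n_c, n_b = sum(chars.values()), sum(bigrams.values())
    return ({c: n / n_c for c, n in chars.items()},
            {b: n / n_b for b, n in bigrams.items()})
[/code]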
WORDS STATISTICS
The word token distribution (Zipf’s law) is essentially the same; you cannot even see the Voynich curve in the graph because it lies completely beneath the two Asemic curves (it tracks Asemic-0 especially well).
By the way, this also demonstrates once more that a meaningful language is not needed at all to get Zipf’s law.
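For reference, the curves in the graph are nothing more than the token frequencies sorted in decreasing order (a sketch):
[code]
from collections import Counter

def zipf_curve(tokens):
    """Rank/frequency data behind a Zipf plot: token frequencies sorted
    in decreasing order, paired with their ranks."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(range(1, len(freqs) + 1)), freqs

# ranks, freqs = zipf_curve(asemic_tokens)
# import matplotlib.pyplot as plt
# plt.loglog(ranks, freqs)    # a straight-ish line = Zipf-like behaviour
[/code]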
The distribution of the lengths of word tokens, which is quite peculiar to the VMS as identified (IIRC) by Stolfi, is more or less indistinguishable:
The distribution of the lengths of word types shows some small systematic differences (too few types with lengths 6 and 7, too many with lengths 10 and 12); this too will be discussed later.
Side-by-side comparison of the vocabulary (100 most frequent word types): you can see how the word type rankings are quite similar on a word-for-word basis.
OTHER DATA
Entropies are identical or very similar; only the word entropy is marginally higher for the true Voynich (as usual, more on this later):
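A sketch of how such entropies can be computed; note it’s my assumption that the figures being compared are the plain Shannon entropies of the character and word-token distributions (the table may include higher-order entropies as well):
[code]
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (bits per symbol) of an empirical distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def text_entropies(text):
    """Character entropy and word entropy of a transliteration string."""
    words = text.split()
    chars = "".join(words)                  # characters, spaces excluded
    return {"h_chars": shannon_entropy(Counter(chars)),
            "h_words": shannon_entropy(Counter(words))}
[/code]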
Miscellaneous data:
The difference in the number of word types (+15% for the Voynich with respect to Asemic) and in the number of hapax legomena (words appearing only once in the text, +30% for the Voynich with respect to Asemic) is the main recognizable difference between Voynich and Asemic, and it drives (I think) all the differences in word distributions (and word entropy) we have seen so far. See below for the discussion.
Note: the difference in the maximum word type length is trivial, because Asemic can generate words with at most 4 chunks (and only 58 words in the whole VMS have more than 4 chunks).
Comparison of hapax legomena:
Asemic generates many hapax legomena which are not found in the true Voynich, and fails to generate many of the Voynich’s hapaxes. It should also be noted, however, that about one fourth of the hapaxes generated by Asemic also appear in the true Voynich (i.e. Asemic-22 finds 5911-4174 = 1737 of the VMS hapax legomena). See the discussion below!
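The overlap above is just a set comparison. A sketch, where I read ‘also appear in the true Voynich’ as ‘present anywhere in the Voynich vocabulary’ (if it should instead be the overlap of the two hapax sets themselves, replace the vocabulary with hapax_set(voynich_tokens)):
[code]
from collections import Counter

def hapax_set(tokens):
    """Word types occurring exactly once in a list of word tokens."""
    return {w for w, n in Counter(tokens).items() if n == 1}

def hapax_overlap(asemic_tokens, voynich_tokens):
    """Total Asemic hapaxes and how many of them also occur in the Voynich."""
    hap_a = hapax_set(asemic_tokens)
    voy_vocab = set(voynich_tokens)
    return len(hap_a), len(hap_a & voy_vocab)
[/code]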
---------------------------------------------------------------
DISCUSSION (Voynich vs. Asemic)
Asemic generates fewer hapax legomena (and, in general, fewer rare words) than the Voynich does. This ultimately drives all the differences in the word statistics. I think I know why this happens: a full explanation would be long but, briefly, at each slot of the ‘chunkified’ grammar all the different paths which the composition of a word type can take are mixed together, so the most common paths are preferentially taken and this reduces the number of rare words.
It would be rather easy to generate a ‘perfect’ Voynich by using a tree structure instead of a slot grammar, but I don’t think this is needed, or even useful.
In fact, imagine I fake a quire of Voynich pages, written by Asemic. How are we going to distinguish it from true Voynich pages?
• Word-by-word token distributions do not help: indeed the differences are much smaller than the differences between Currier A and Currier B.
• Hapax legomena do not help: every page of the Voynich has new hapax legomena, so finding more in Asemic is wholly expected, and the fake pages also use some of the true Voynich hapax legomena, which is consistent with the fake being original.
• Only the ratio of hapax legomena to the number of word tokens would be useful for the distinction, and only if one already knows to look for it. Otherwise it could easily be explained away, just as the differences between Currier A and Currier B can be (i.e. different topics, if the text is believed meaningful).
--------------------------
DISCUSSION 2: COMPLEXITY OF THE PSEUDO-VOYNICH ASEMIC GENERATOR
As I said in post #53, I hoped that a relatively small number of parameters would be needed to describe the Markov chain which drives the Asemic generator. In fact, I found that 8165 ‘transitions’ are needed, and I had hoped for far fewer (even if the vast majority of them are needed only for rare word types).
This has some consequences: not that I thought the VMS could actually have been written by using a Markov chain based on a slot grammar, but finding a low number of transitions would have increased the probability that some meaningless mechanism was actually used. Thus, finding 8165 transitions actually decreases the probability of the Voynich being a meaningless text (which is not what I expected).
This is wholly preliminary, I have yet to think thoroughly about the implications.
-----------------------------
WHAT THE LOOP GRAMMAR AND CHUNKIFICATION CANNOT DO
By construction, it’s impossible to model the effects of things such as paragraphs and line breaks, the correlations between words (both short and long range), and the appearance of words repeated multiple times in a row in the text.
Paragraphs and line breaks are rather trivial, i.e.: just insert paragraph breaks where a word begins with a ‘gallows’ character.
Words repeated up to four times are much more problematic (by the way, I think this is one of the most important pieces of evidence against the VMS being a ‘regular’ language encoded in some way). From a simple probabilistic argument, Asemic should generate about 16 ‘daiin daiin’, which it actually does. It should generate 0.34 ‘daiin daiin daiin’ (there are none in Asemic-0 and Asemic-22), and should generate 4 ‘daiin’ in a row only once every 131 runs. So intra-word correlations and chance do not explain the words repeated four times, at all.
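The probabilistic argument is just this: if word tokens were drawn independently, the expected number of runs of k consecutive ‘daiin’ is roughly (N - k + 1) * p^k, with p the observed relative frequency of ‘daiin’ and N the number of tokens. A sketch to check it on any token list:
[code]
from collections import Counter

def expected_runs(tokens, word, k):
    """Expected number of length-k windows entirely equal to `word`,
    if tokens were independent draws with the word's observed frequency."""
    n = len(tokens)
    p = Counter(tokens)[word] / n
    return (n - k + 1) * p ** k

def observed_runs(tokens, word, k):
    """Count the length-k windows in the text that are all equal to `word`."""
    return sum(all(t == word for t in tokens[i:i + k])
               for i in range(len(tokens) - k + 1))

# e.g. expected_runs(voynich_tokens, "daiin", 2) vs observed_runs(voynich_tokens, "daiin", 2)
[/code]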
Now I have to confess I don’t know much about short- and long-range correlations between words. I know a correlation has been found between the last character of a word and the first of the following one, but I have not checked. I also know of studies which claim the true Voynich can be distinguished from a scrambled version of itself, but again I have not checked them. So it would be interesting to see how Asemic behaves in this respect (it should fail those tests, if the claims are true). It would also be interesting to compare Asemic with the texts generated by Timm & Schinner’s self-citation mechanism.
But I think the chunkification process might be of help in studying word correlations, because it might allow defining a new measure of the ‘distance’ between two words, based on their chunks rather than on their visual appearance.
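Just as a first idea of what I mean (nothing definitive): chunkify both words and take an ordinary edit distance over the chunk sequences, so that two words differing in a single chunk are at distance 1, however different the raw glyph strings may look. A sketch, reusing chunkify() from the earlier code:
[code]
def chunk_distance(word_a, word_b, slots, repeats=4):
    """Levenshtein distance computed over chunk sequences instead of characters.
    Returns None if either word is not covered by the grammar."""
    a, b = chunkify(word_a, slots, repeats), chunkify(word_b, slots, repeats)
    if a is None or b is None:
        return None
    prev = list(range(len(b) + 1))            # edit-distance DP, row by row
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # delete a chunk
                           cur[j - 1] + 1,                # insert a chunk
                           prev[j - 1] + (ca != cb)))     # substitute a chunk
        prev = cur
    return prev[-1]
[/code]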
------------------------
RAW DATA
The Asemic-22 text is here:
I’d be glad if any of you used it for some statistical analysis vs. the real Voynich with your own tools (you may have to remove the first few header lines).
The Excel dump of the chunkified LOOP-4 grammar is in two files. In “LOOP-4 chunks” (very similar to the one in post #53) you find the chunkification of all the Voynich words, the categorization of the chunks and the chunk list (you may have to resize the columns to be able to see the full lists).
In “LOOP-4 Chunkified and transitions” you find the chunkified slot grammar, followed by the full table of the transitions used by the Markov chain engine to write Asemic’s texts.