The Voynich Ninja

A family of grammars for Voynichese
(04-12-2024, 01:54 PM)oshfdk Wrote:
(04-12-2024, 10:42 AM)Mauro Wrote: I also take the chance to compliment you on your work on the marginalia (just saw it yesterday, I have a lot of threads to read xD). I wish I had 1/100th of your skills and knowledge in paleography and image processing!! And of course, clues external to the bare transcribed text could be (and are) very, very useful.

Thank you so much, but I think there is some misunderstanding: while I do have some skills in automated image processing, paleography is definitely not on my CV. I'm not even sure what paleography is exactly, I assume it has something to do with history and images. With the marginalia I only tried to enhance the multispectral images and accidentally stumbled upon a strange feature at the bottom right of f116v, that's my only contribution, I think. My main area of interest is the text, that's why I'm so curious about the results you have shared and the general methodology.

Because you said you reviewed a lot of European signatures of the XV-XVII centuries, I assumed doing that fell within the field of paleography (not sure exactly how it's defined, tbh  Big Grin ). In any case, an enviable skill!
(04-12-2024, 06:40 PM)Mauro Wrote: Because you said you reviewed a lot of European signatures of the XV-XVII centuries, I assumed doing that fell within the field of paleography (not sure exactly how it's defined, tbh  Big Grin ). In any case, an enviable skill!

I primarily used the Vatican Digital Library (and also a Swiss digital artifacts site, though I don’t remember its name). I opened the index and began searching for letters from people whose signatures might start with A. I reviewed around 120–140 letters from 40–50 people, I think. Enough to get a basic understanding of how people signed letters back then, but certainly not enough to earn the paleography badge  Big Grin
I tried to keep this post to the bare minimum, cutting a lot of details (you're free to ask anything, of course). These are some preliminary results I got exploring the LOOP grammars and what they can do. They look promising, but with all the mandatory caveats needed with the VMS…

----------------

I used the LOOP grammar to divide each Voynich word type into word chunks. I experimented with some variations of the grammar, and in the end I settled on this one:

[attachment=9492]

The loop is repeated up to four times.

Notice this grammar makes no pretense of efficiency, and efficiency is not my aim at all. I want the simplest possible grammar with high coverage, which in this case is a hefty 99.28% (on RF1a-n; this excludes word types containing 'rare' characters, but I could just as well add them, I skip them to save myself some tedium…). Only 58 word types (out of 8031) are missed, and they all need more than 4 chunks, e.g. 'tchodypodar' would be [t][chod][yp][od][ar]. Just for colour, 'fachys' is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.
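In case anyone wants to play with this: below is a minimal Python sketch of how such a looped chunkification can be implemented. The slot contents are hypothetical placeholders (not the actual LOOP grammar of the attachment), and the greedy longest-match-per-slot rule is my own assumption; plug in the real slot definitions to reproduce the coverage figure.

Code:
# Minimal sketch of the looped chunkification: a chunk is whatever one pass
# through the six slots consumes. The slot contents below are placeholders,
# NOT the actual LOOP grammar of the attachment.
SLOTS = [
    {'q', 't', 'k', 'p', 'f', 'd', 's'},    # hypothetical slot 1
    {'ch', 'sh', 'e', 'ee', 'eee'},         # hypothetical slot 2
    {'d', 'l', 'r', 'y', 'o'},              # hypothetical slot 3
    {'a', 'o'},                             # hypothetical slot 4
    {'i', 'ii', 'iii'},                     # hypothetical slot 5
    {'n', 'm', 'r', 'l', 's'},              # hypothetical slot 6
]

def chunkify_loop(word, slots=SLOTS, max_loops=4):
    """Return the list of chunks, or None if the word needs more than
    `max_loops` passes (i.e. it is one of the uncovered word types)."""
    chunks, pos = [], 0
    for _ in range(max_loops):
        if pos == len(word):
            return chunks
        chunk = ''
        for slot in slots:
            for piece in sorted(slot, key=len, reverse=True):   # longest match per slot
                if word.startswith(piece, pos):
                    chunk += piece
                    pos += len(piece)
                    break
        if not chunk:                 # no slot matched: not covered by the grammar
            return None
        chunks.append(chunk)
    return chunks if pos == len(word) else None

# coverage = sum(chunkify_loop(w) is not None for w in word_types) / len(word_types)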

------------------------

So, I divided all the Voynichese word types into chunks following the above grammar, which resulted in 605 different chunks, relatively few (the other variations I tried resulted in more chunk types). Many are quite rare, e.g. 'chai', used by just one word, 'ychaies'. I also grouped the chunks into 'categories', which is best explained with an example: 'ched' uses the slots PE---S, so it goes in the same 'category' as 'shed' and 'sheed', which use the same slots. Instead 'aiin' uses ---AIS, so it goes in a different category (the 'Y' final slot is treated separately and is not considered in the categories, but I may add it in the future, I'll see). The maximum number of possible categories is 64 (2 raised to the 6th power, the number of slots inside the word chunk), but only 51 different ones are actually used by the VMS (some of them extremely rare, e.g. the 'chai' above is the only chunk in its category), which is mildly encouraging.
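To make the 'category' idea concrete, here is a tiny sketch that encodes which of the 6 slots a chunk used as a 6-character pattern. The one-letter slot labels are partly guessed (only P, E, A, I, S appear in the examples above), and it assumes the chunkifier also reports the matched slot indices.

Code:
# Sketch: a chunk's 'category' = the pattern of slots it uses.
# SLOT_NAMES is partly guessed from the examples above; the third label is unknown to me.
SLOT_NAMES = "PE?AIS"

def category(slot_indices, n_slots=6):
    """Chunks matching slots 1, 2, 6 -> 'PE---S'; slots 4, 5, 6 -> '---AIS'."""
    return ''.join(SLOT_NAMES[i] if i in slot_indices else '-' for i in range(n_slots))

print(category({0, 1, 5}))   # PE---S  ('ched', 'shed', 'sheed')
print(category({3, 4, 5}))   # ---AIS  ('aiin')
# At most 2**6 = 64 patterns are possible; only 51 occur in the VMS.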

This means that the whole VMS (except 58 word types) can be written using 605 different chunks, repeated up to 4 times, which I think is an interesting result. But remember: changing the grammar will result in a different chunkification, so I'm certainly not claiming that these are the right chunks! They are a possibility, and it doesn't look too bad to me, for now.

---------------------------

Many criticisms can be raised, of course; I thought of these:
• 605 chunks is not a big number but it's also not a small one. It could be that any text could be 'decomposed' with analogous results. I find it improbable for natural languages; I would have no clue how to build a slot grammar for a natural language (their word types are too varied). I also made a sanity check using syllabified natural languages. Results are preliminary, not checked thoroughly, and must be taken with a pinch of salt because syllabifying a text is not trivial, but to keep it short: with texts of comparable length (and even shorter ones) I find at least ~1200 different syllables in certain Italian and Latin texts (they usually have more), and many more for English. Of course languages written natively with a syllabary will have far fewer, but they're improbable in medieval Europe.
• Some chunks are almost whole words, so it's not surprising they can generate Voynich words. Seen from another point of view, I'm approaching the "Trivial 1*WS" grammar (which uses ~8000 symbols) with a grammar with 4 slots (the maximum number of chunks per word) and 605 symbols (+ null). The Ncharset (see post #39) would be rather high, but I need more data to calculate this number; it cannot be higher than 4*(605+1) but will probably be lower, because I won't need all the chunks in all 4 slots.

As another sanity check: the chunks include all the single 'characters'. This would be very bad for the grammar if many words turned out to be chunkified using only single-character chunks, but apart from obvious cases (e.g. word types such as 'd', 'dy', 'chy', etc.) this does not seem to be the case.

------------------------

Now what I want to do in the coming days is to calculate the frequencies of the chunks on word tokens, e.g. P(START)-->'qok', P(START)-->'ched'… P(qok)-->'eee' and so on. I think (and hope xD) the number of non-zero probabilities will not be that large (which would be another encouraging thing). Once I've done that I should have a 4-slot (unlooped) grammar, one slot per chunk position, with 605 symbols (+ null and end-of-word) + frequencies, which should essentially 'capture' all the statistics of the VMS text (except of course long/short-range things like possible correlations between words or the effect of paragraphs etc.), and I can use it as a Markov chain to generate a pseudo-Voynich with random words, then see how it compares to the real VMS on some statistics (and especially on the vocabulary). I'll let you know what happens (also in the case of a negative result, which is always around the corner with the VMS xD).
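For reference, a minimal sketch of this plan (my own naming, not the actual software): learn the chunk-to-chunk transition counts from the chunkified word tokens, then sample pseudo-words. The real generator described above also conditions on the slot position; this simplified version is a plain first-order chain, and `chunkified_tokens` is hypothetical input data (one chunk list per word token).

Code:
import random
from collections import defaultdict, Counter

def learn_transitions(chunkified_tokens):
    """Count chunk-to-chunk transitions, with START and END pseudo-states.
    chunkified_tokens: e.g. [['qok', 'ched', 'y'], ['d', 'aiin'], ...]"""
    counts = defaultdict(Counter)
    for chunks in chunkified_tokens:
        path = ['START'] + chunks + ['END']
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return counts

def sample_word(counts, rng):
    """Walk the chain from START to END and glue the visited chunks into a word."""
    state, pieces = 'START', []
    while True:
        nxt = rng.choices(list(counts[state]),
                          weights=list(counts[state].values()))[0]
        if nxt == 'END':
            return ''.join(pieces)
        pieces.append(nxt)
        state = nxt

# counts = learn_transitions(chunkified_tokens)
# rng = random.Random(22)      # the seed of Asemic-22
# pseudo_text = ' '.join(sample_word(counts, rng) for _ in range(n_tokens))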

-----------------

Feel free to post any criticisms, remarks, suggestions, requests for additional information, whatever.

----------------

The Excel file dump with the word types parsed into chunks, the chunk categories and the list of unique chunks found is here:

[link]

The most used chunk categories are in the following table. Inside each category the chunk lists themselves are +- ordered by frequency (calculated on word types, not tokens); many are quite rare (and would be even rarer over word tokens), but I have no actual frequency stats at the moment (the ordering is an unforeseen but welcome side effect of the method I used to process the VMS lol, it looks right but I'm not 100% sure, some low-frequency chunks could be out of order).
[attachment=9493]
# of uses includes all the occurrences in the word types, e.g. 'olorol' counts as three 'ol'-like chunks.

Certain categories came as a surprise, e.g. the 4th-ranked 'chd/yk'-like one and, most of all, the single 'o', which appears much more frequently (though in uncommon word types) than I imagined. An example is 'oees' [o][ees], another is 'dalo' [d][al][o]. The distribution of the 51 categories of chunks vs. frequency (among word types) is this:
 [attachment=9494]
At the moment, I don't have stats on the chunk type frequencies, nor on the distribution of word types vs. the number of chunks. More software xD
Following up on my previous post #53, I think I have obtained some very interesting results:

• I can now propose a metric for the evaluation of slot grammars (a problem which I posed in post #39).
• I developed a random generator of pseudo-Voynich texts, which writes remarkably good Voynichese.

I think I can now say the LOOP grammar effectively captures the peculiar structure underlying the Voynichese word types and could be fruitful for future studies. But I’ll let you judge it.

-----------------------

A METRIC FOR EVALUATING SLOT GRAMMARS

It's relatively easy to use a slot grammar to divide any word type into chunks. I'm not going to delve into details here, but I think any programmer who reads post #53 can easily write software to do it (basically, scan the word and check where it matches the slot grammar).

The result of the ‘chunkification’ is a table similar to this:

[attachment=9512]

But this table is just another slot grammar: just remove the first column and rename the headers:

[attachment=9513]

For lack of a better name I’ll call the second grammar the ‘chunkified’ version of the original grammar.

I propose as a ‘figure of merit’ for a slot grammar the total number of unique chunks in the chunkified grammar (Nchunktypes). That is to say: the total number of unique chunks in the second table above. The lower the number, the ‘better’ the grammar. This applies to any grammar, with a loop or without (that would be a grammar where the ‘loop’ repeats only 1 time).

Nchunktypes works because (see post #39) by chunkifying the Trivial LL*CS grammar we get the Trivial 1*WS grammar (this is very easy to verify by hand), which is quite remarkable. The Trivial 1*WS grammar uses as many chunks as there are word types in the original text, so it will always get the worst (highest) score on this metric; hence Nchunktypes is able to reject both trivial grammars. Also, this completely bypasses the intractable mathematical problem inherent in the use of the efficiency metric (see again post #39).

Note 1: instead of counting only unique chunk types (Nchunktypes), it's also possible to count all the chunks in the table, with duplicates across slots (that would be what I called the Ncharset of the chunkified grammar in post #39). I have no idea which is the better one (a cool math problem, but not very important at this moment, and maybe someone has already solved it, see Note 2).

Note 2: I find it hard to believe that all this has not already been worked out, probably in some very different form and context but with equivalent meaning, in some branch of mathematics (group theory?). I really don't know, but if anyone has an idea, please let me know. By the way, the 1*WS trivial grammar is invariant under chunkification.
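To make the two counts of Note 1 concrete, a small sketch (toy data and my own representation, not the actual files; the real chunkified LOOP-4 grammar is in the Excel dump posted further down):

Code:
# The 'chunkified' grammar as a list of slots, each a set of chunk strings (toy data).
chunkified_grammar = [
    {'qok', 'ched', 'd', 'o'},      # slot 1
    {'ched', 'aiin', 'al', 'o'},    # slot 2
    {'aiin', 'y', 'dy'},            # slot 3
    {'y', 'dy'},                    # slot 4
]

# Nchunktypes: each distinct chunk counted once, across all slots.
n_chunk_types = len(set().union(*chunkified_grammar))

# Nchunktokens (the Ncharset of the chunkified grammar): chunks counted per slot.
n_chunk_tokens = sum(len(slot) for slot in chunkified_grammar)

print(n_chunk_types, n_chunk_tokens)   # toy values: 8 and 13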

---------------------

Well, you already knew what I was coming to… this is a comparison table between the LOOP-4 grammar (defined as in post #53) and Zattera's SLOT and ThomasCoon V2 grammars. To get more data points, I also used a low-coverage LOOP-2 grammar (the same as LOOP-4, but only two repeats), and two high-coverage grammars (SLOT 2X and V2 2X) obtained by duplicating SLOT and V2 (transforming them into 2-loop grammars).


[attachment=9514]
Note: LOOP-4 has 606 chunks, not 605 as previously reported. I missed one due to a bug.

Can a better grammar exist, with fewer than 606 chunk types? Yes, it could; I did not test many variations, but I bet it'll be a variant of LOOP-4.

-------------------------------------------

GENERATING HIGH-QUALITY VOYNICHESE TEXTS

As I said in post #53, I planned to use the chunkified LOOP-4 grammar, together with the word token frequency data, to build a random Voynich word generator. This task ended, I think, very well.

This is a random sample of the output (I call it Asemic-22 because it used 22 as the random seed):

[attachment=9515]

The output was generated so it has exactly the same number of word tokens as Voynich RF1a-n (I'll skip the details of how words with rare characters are handled here) to make comparisons easier. I generated just two texts, Asemic-0 and Asemic-22: it only takes a few seconds, but I think two were enough; I don't want to go uselessly chasing the 'perfect' random seed. Here follow the comparisons with RF1a-n (I'm only posting the graphs, but of course I can also provide the data tables).

CHARACTER STATISTICS

Character and bigram distributions are indistinguishable from true Voynich:

[attachment=9516]
Note: in Asemic-0 'y' and 'h' switch places (the frequencies of 'y' and 'h' are very similar).

[attachment=9517]
Note: SPACE is represented by a blank in the 2D graph. ‘Rare’ characters are not shown. Columns ‘y’ and ‘h’ are switched in Asemic-0.


WORD STATISTICS

The word token distribution (Zipf's law) is essentially the same; you cannot even see the Voynich curve in the graph because it lies completely beneath the two Asemic curves (it tracks Asemic-0 especially well).

[attachment=9518]
By the way, this also demonstrates once more that a meaningful language is not needed at all to get Zipf’s law.

The distribution of word token lengths, which is quite peculiar to the VMS as identified (iirc) by Stolfi, is +- indistinguishable:

[attachment=9519]

The distribution of word type lengths has some small systematic differences (too few types with lengths 6 and 7, too many with lengths 10 and 12); this too will be discussed later.

[attachment=9520]

Side-by-side comparison of the vocabulary (100 most frequent word types): you can see how the word type rankings are quite similar on a word-for-word basis.

[attachment=9521]


OTHER DATA

Entropies are identical or very similar; only the word entropy is marginally higher for true Voynich (as usual, more on this later):


[attachment=9522]
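For anyone who wants to reproduce this kind of comparison on the Asemic files, a minimal sketch of a plain unigram entropy (my own helper; the table may use slightly different definitions, and a conditional entropy would additionally need bigram counts):

Code:
import math
from collections import Counter

def shannon_entropy(symbols):
    """Plain unigram Shannon entropy in bits for a sequence of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# text = open('asemic-22.txt', encoding='utf-8').read()    # hypothetical file name
# h1_chars = shannon_entropy(''.join(text.split()))        # character entropy, spaces removed
# h1_words = shannon_entropy(text.split())                 # word-token entropy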

Miscellaneous data:


[attachment=9523]

The difference in the number of word types (+15% Voynich with respect to Asemic) and in the number of hapax legomena (words appearing only once in the text, +30% Voynich with respect to Asemic) is the main recognizable difference between Voynich and Asemic, and it drives (I think) all the differences in word distributions (and word entropy) we have seen so far. See later for the discussion.

Note: the difference in the maximum word type length is trivial because Asemic can at most generate words with 4 chunks (and only 58 words in the whole VMS have more than 4 chunks).

Comparison of hapax legomena:
[attachment=9524]

Asemic generates many hapax legomena which are not found in the true Voynich, and fails to generate many of the Voynich's hapaxes. It should also be noted, however, that about one fourth of the hapaxes generated by Asemic also appear in the true Voynich (e.g. Asemic-22 finds 5911 - 4174 = 1737 of the VMS hapax legomena). See discussion below!
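The comparison above is just set arithmetic on the two vocabularies; a sketch with hypothetical variable names:

Code:
from collections import Counter

def hapaxes(tokens):
    """Word types that occur exactly once in a list of word tokens."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c == 1}

# voynich_tokens and asemic_tokens are the two token lists (hypothetical names)
# hap_v, hap_a = hapaxes(voynich_tokens), hapaxes(asemic_tokens)
# shared = hap_v & hap_a        # VMS hapaxes that Asemic also 'finds'
# only_asemic = hap_a - hap_v   # hapaxes not present in the true Voynich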

---------------------------------------------------------------

DISCUSSION (Voynich vs. Asemic)

Asemic generates fewer hapax legomena (and, in general, rare words) than Voynich does. This ultimately drives all the differences in the word statistics. I think I know why this happens: it would take long to explain fully, but briefly: at each slot of the 'chunkified' grammar all the different paths the composition of a word type can take are mixed together, so the most common paths are preferentially taken, and this reduces the number of rare words.

It would be rather easy to generate a ‘perfect’ Voynich by using a tree structure instead of a slot grammar, but I don’t think this is needed, or even useful.

In fact, imagine I fake a quire of Voynich pages, written by Asemic. How are we going to distinguish it from true Voynich pages?

• The word-by-word token distribution does not help: indeed the differences are much smaller than those between Currier A and Currier B.
• Hapax legomena do not help: every page of the Voynich has new hapax legomena, so finding more in Asemic is wholly expected, and the fake pages also use some of the true Voynich hapax legomena, which is consistent with the fake being original.
• Only the ratio of hapax legomena to word tokens would be useful for the distinction, but only if one already knows about it. Otherwise it could easily be explained away, just as the differences between Currier A and Currier B can be (e.g. different topics, if the text is believed meaningful).

--------------------------

DISCUSSION 2: COMPLEXITY OF THE PSEUDO-VOYNICH ASEMIC GENERATOR

As I said in post #53, I hoped that a relatively small number of parameters would be needed to describe the Markov chain which drives the Asemic generator. In fact, I found that 8165 'transitions' are needed, and I had hoped for far fewer (even if the vast majority of them are needed only for rare word types).

This has some consequences: not that I thought the VMS could actually have been written using a Markov chain based on a slot grammar, but finding a low number of transitions would have increased the probability that some meaningless mechanism was used. Thus, finding 8165 transitions actually decreases the probability of the Voynich being a meaningless text (which is not what I expected).

This is wholly preliminary, I have yet to think thoroughly about the implications.

-----------------------------

WHAT THE LOOP GRAMMAR AND CHUNKIFICATION CANNOT DO

By construction, it's impossible to model the effects of paragraphs and line breaks, the correlations between words (both short and long range), and the appearance of words repeated multiple times in a row in the text.

Paragraphs and line breaks are rather trivial, e.g. just insert a paragraph break where a word begins with a 'gallows'.

Words repeated up to four times are much more problematic (by the way, I think this is one of the most important pieces of evidence against the VMS being a 'regular' language encoded in some way). From a simple probabilistic argument, Asemic should generate about 16 occurrences of 'daiin daiin', which it actually does. It should generate 0.34 occurrences of 'daiin daiin daiin' (there are none in Asemic-0 and Asemic-22), and should generate 4 'daiin' in a row only once every 131 runs. So intra-word correlations and chance do not explain the words repeated four times, at all.
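For the record, the 'simple probabilistic argument' has the form E ≈ N·p^k: with N word tokens drawn independently and a word of token probability p, about N·p^k starting positions give k copies in a row. A tiny sketch with placeholder numbers (not the exact RF1a-n counts behind the figures above):

Code:
# Back-of-the-envelope expectation for k identical word tokens in a row, assuming
# independent draws: roughly N * p**k (ignoring edge and overlap corrections).
# The numbers below are illustrative placeholders, not the exact RF1a-n counts.
N = 37_000          # hypothetical total number of word tokens
p = 860 / N         # hypothetical token probability of 'daiin'
for k in (2, 3, 4):
    print(k, round(N * p ** k, 3))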

Now I have to confess I don't know much about short- and long-range correlations between words. I know a correlation has been found between the last character of a word and the first of the following one, but I have not checked. I also know of studies which claim the true Voynich can be distinguished from a scrambled version of itself, but yet again I have not checked them. So it would be interesting to see how Asemic behaves in this respect (it should fail the tests, if the claims are true). It would also be interesting to compare Asemic with the texts generated by Timm & Schinner's self-citation mechanism.

But I think the chunkification process might be of help in studying word correlations. This is because it might allow defining a new measure of the 'distance' between two words, based on their chunks rather than on their visual appearance.
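One way such a chunk-based distance could be made concrete (my own formulation, nothing already implemented) is an ordinary edit distance computed over chunk sequences instead of single characters:

Code:
def chunk_distance(chunks_a, chunks_b):
    """Levenshtein distance where the symbols are chunks, not single characters.
    e.g. chunk_distance(['qok', 'ched', 'y'], ['qok', 'sh', 'y']) == 1."""
    m, n = len(chunks_a), len(chunks_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if chunks_a[i - 1] == chunks_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]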

------------------------

RAW DATA
The Asemic-22 text is here:

[link]

I'd be glad if any of you used it for some statistical analysis vs. the real Voynich with your own tools (you may have to remove the first header lines).


The Excel dump of the chunkified LOOP-4 grammar is in two files. In “LOOP-4 chunks” (very similar to the one of post #53) you find the chunkification of all the Voynich words, the categorization of chunks and the chunk list (you may have to resize the columns to see the full lists).

[link]


In “LOOP-4 Chunkified and transitions” you find the chunkified slot grammar, followed by the full table of the transitions used by the Markov chain engine to write Asemic’s texts.

[link]
Erratum:

The word token distribution graph in the previous post is wrong. This is the correct one (the conclusions do not change, the distributions are essentially the same):

[attachment=9526]
(08-12-2024, 06:42 PM)Mauro Wrote: I propose as a ‘figure of merit’ for a slot grammar the total number of unique chunks in the chunkified grammar (Nchunktypes). That is to say: the total number of unique chunks in the second table above. The lower the number, the ‘better’ the grammar. This applies to any grammar, with a loop or without (that would be a grammar where the ‘loop’ repeats only 1 time).

I have some difficulties following your argument, so I hope you won't mind a few dumb questions on my part. How do you define "unique chunks" here? Is it unique per slot or unique in all the slots? 

From the way I understand it, the trivial grammar that just lists 25 most common Voynichese characters in 10 slots would have either 25 NChunktypes or 250 NChunktypes with almost 100% coverage. Is this correct?
(09-12-2024, 04:24 AM)oshfdk Wrote:
(08-12-2024, 06:42 PM)Mauro Wrote: I propose as a ‘figure of merit’ for a slot grammar the total number of unique chunks in the chunkified grammar (Nchunktypes). That is to say: the total number of unique chunks in the second table above. The lower the number, the ‘better’ the grammar. This applies to any grammar, with a loop or without (that would be a grammar where the ‘loop’ repeats only 1 time).

I have some difficulties following your argument, so I hope you won't mind a few dumb questions on my part. How do you define "unique chunks" here? Is it unique per slot or unique in all the slots?

I'm happy to answer any questions, and I apologize for any lack of clarity in what I wrote. More drawings would have been useful, but they take a lot of time to make and then upload, and I kept the explanations as tight as I could to avoid an interminably long post.

With "unique chunks" I mean: count each chunk once, even if it appears in more than one slot of the "chunkified" grammar. That is: count chunk "types". However (see the Note 1 of my post) one could also just count the chunks as they are (so that the same chunk is counted multiple times when it appears in more than one slot): I'm not sure which of the two counting methods is the 'right' one, because both counts (Nchunktypes and, let's say, Nchunktokens) have the same propriety of rejecting both the trivial grammars.


(09-12-2024, 04:24 AM)oshfdk Wrote: From the way I understand it, the trivial grammar that just lists 25 most common Voynichese characters in 10 slots would have either 25 NChunktypes or 250 NChunktypes with almost 100% coverage. Is this correct?

Yes, right. It would have 25 "Nchunktypes", or equivalently 250 "Nchunktokens". The coverage is lower than 100% because it cannot generate words with more than 10 characters.


I'll add an example here which may make it clearer how the 'scoring' of grammars works:

Let's say we want to 'score' the grammar of your example. That would be the trivial LL*CS grammar (length * character set, 10 slots * 25 characters in your example). This grammar is rejected because, if we 'chunkify' it, it's very easy to see that each word type becomes a different chunk. So after the 'chunkification' we get the other trivial grammar, the 1*WS (one slot containing every word type in the text), and Nchunktypes is now ~8000 = the number of word types in the text (with length <= 10). This is the highest (and worst) possible number of chunks one could get, so the trivial LL*CS grammar goes at the bottom of the 'score' list.

If we want instead to 'score' directly the 1*WS trivial grammar, the 'chunkification' gives us the same grammar (it's invariant under 'chunkification'), and yet again it goes at the bottom of the 'score' list.

So the 'chunkification' works equally well on both the limit cases of trivial grammars (they even get the same score!), which is what we want from a sound 'scoring' method.
I still don't get it, unfortunately. Could you maybe show it to me step by step using this simple example: say, we have words CAT, BAT, MAT, COT, BOT, MOP. The trivial character grammar is [A|B|C|M|O|P|T]x3 (3 slots of 7 elements each). How to chunkify this grammar?
(09-12-2024, 09:34 AM)oshfdk Wrote: I still don't get it, unfortunately. Could you maybe show it to me step by step using this simple example: say, we have words CAT, BAT, MAT, COT, BOT, MOP. The trivial character grammar is [A|B|C|M|O|P|T]x3 (3 slots of 7 elements each). How to chunkify this grammar?

Sure.

Grammar: [A|B|C|M|O|P|T]x3 (trivial LL*CS)

Word 1: CAT.

We start with a 'null' chunk and add one character at a time as we find it in the grammar.

Start from the first character of CAT: C
It matches the 'C' in slot 1, 3rd position, so the first character of the chunk is C
Next character: A
It matches the 'A' in slot 2, 1st position, so the chunk is now CA
Next character: T
It matches the 'T' in slot 3, 7th position, so the chunk is now CAT

End of word: CAT is 'chunkified' as CAT

Word 2: BAT

Exactly the same as above: it's chunkified as BAT

... etc. etc.: every word in the list is 'chunkified' as the word itself.

So the 'chunkified' grammar we end with is [CAT, BAT, MAT, COT, BOT, MOP]x1, the trivial 1*WS grammar. It has 6 chunks, one for each of the words in the list, which is the worst possible number of chunks one can get.



Let's try the same with the [CAT, BAT, MAT, COT, BOT, MOP]x1 (trivial 1*WS) grammar instead.

Word 1: CAT
It matches the 'CAT' in the 1st (and only) slot in the grammar, 1st position: chunkified as CAT

Word 2: BAT
It matches 'BAT' in the 1st slot of the grammar, 2nd position: chunkified as BAT

.. etc. etc., so we get back exactly the same grammar we started with, and we get again the worst possible score.




If the explanation is not enough, I'm here! Btw, that's exactly how my software, basically, works.
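For completeness, the scan described above in code form (a sketch, not the actual program; the longest-match rule in it is my assumption and is irrelevant here, since all the entries of each grammar have the same length):

Code:
def chunkify(word, grammar):
    """grammar: a list of slots, each slot a list of strings.
    Scan the word left to right; within one pass move forward through the slots,
    appending each match to the current chunk; when the slots are exhausted,
    close the chunk and start a new pass from the first slot."""
    chunks, pos = [], 0
    while pos < len(word):
        chunk, slot_idx = '', 0
        while slot_idx < len(grammar) and pos < len(word):
            match = next((s for s in sorted(grammar[slot_idx], key=len, reverse=True)
                          if word.startswith(s, pos)), None)
            if match is not None:
                chunk += match
                pos += len(match)
            slot_idx += 1
        if not chunk:        # nothing matched in a whole pass: word not covered
            return None
        chunks.append(chunk)
    return chunks

LL_CS = [list('ABCMOPT')] * 3                          # trivial LL*CS grammar above
WS = [['CAT', 'BAT', 'MAT', 'COT', 'BOT', 'MOP']]      # trivial 1*WS grammar above

print(chunkify('CAT', LL_CS))   # ['CAT'] -> chunkifying LL*CS collapses it to 1*WS
print(chunkify('CAT', WS))      # ['CAT'] -> 1*WS is invariant under chunkification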
Thank you for your effort! Probably I'm missing some context or some important background idea; I still don't understand how and why this metric works. To me it seems that any grammar with 100% coverage will reduce to the 1*WS grammar after chunkification, because it should be possible to match any word to the grammar, producing a single chunk.