(10-12-2024, 01:31 AM)ReneZ Wrote: I am also still trying to fully understand...
Thank you René. One thing that does not help is that the whole exposition is currently spread over multiple posts, with changing and evolving ideas on top of that. I could (indeed I should) rewrite everything from scratch in a single, coherent document (which might go into a new thread): my prose won't get any better, I fear, but at least the reading will be simpler.
But the last thing I want is to come across as an insistent and annoying guy constantly peddling his pet theory... what's your opinion?
(10-12-2024, 01:31 AM)ReneZ Wrote: Taking this line as a starting point:
(06-12-2024, 09:53 AM)Mauro Wrote: Only 58 words (out of 8,031) are missed, and they all have more than 4 chunks, e.g. 'tchodypodar' would be [t][chod][yp][od][ar]. Just for colour, 'fachys' is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.
One risk I see is overmodelling. It is probably sufficient, even better (subjectively) not to aim for more than, say, 98% coverage. There will be errors in the text. We don't know how many and where, but the high number of hapax is also a hint in that direction.
The major issue of word spaces is largely covered by a looped slot system, which is probably one of the most attractive aspects of it.
The word tchodypodar is a tricky one. It is very long, compared to average word length, but all its transitions are common. Now it 'feels' wrong that this should have as many as 5 chunks, and it also 'feels' wrong that the common bigram 'dy' is split into two different chunks. The word 'fachys' is nice and short but highly irregular. It is the type of word that I would discard when setting up a model.
On a side note, the first character of each paragraph could be left out of consideration for the known reasons, but that would change 'fachys' into 'achys' which is only a bit better.
Yes, chasing coverage for its own sake can be problematic, overmodelling is always a risk, and yes, errors (transcription errors, interpretation of spaces, etc.) are very much to be expected in the transcription, and they will skew or alter the results.
Coverage: it's not that I chased coverage. At the beginning, yes, of course, but once I hit upon the LOOP grammar, coverage became very high with no extra effort. And indeed the SLOT and ThomasCoon grammars also reach very high coverage when the 'loop mechanism' is applied to them by repeating them twice, as I showed (and the whole problem of 'separable' words melts away). So I think that at least the basic idea of 'looping grammars' is probably sound.
Errors/transcription errors/interpretation of spaces: everything you said is true, and of course it influences the results. In the past I was always very willing to dismiss exceptional words as flukes or errors (and to dismiss 'separable' words too, as just being two words joined together). Then, in my last bout of Voynich mania, I decided to go the opposite way: take the whole text at face value, without invoking any exceptions, and see what would happen. Exceptions could always be added later if needed. When I came to the first rough versions of LOOP (not yet thought of as a loop at that time!), I was very surprised to find that almost every word stopped being 'weird' or 'separable' or 'exceptional' (this happens with SLOT and Coon's grammars too, just by looping them twice), so I stopped worrying about textual errors. I don't need them for LOOP (or SLOT x2 or Coon x2) to work, and in any case they can always be factored in later (and will only simplify the results). Details may change, but I doubt the basics will.
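To give a concrete idea of the 'loop mechanism', here is a minimal sketch of how a looped slot chunkifier can work. The slot contents below are invented placeholders, not the actual LOOP-4 (or SLOT x2 / Coon x2) inventories, and the greedy longest-match strategy is only one possible implementation choice, not necessarily the one used in my software.
[code]
# A minimal sketch of a 'looped' slot chunkifier (placeholder slot sets,
# greedy longest-match; only the looping idea itself is the point here).

SLOTS = [
    {"q", "d", "s", "t", "k", "f", "p"},               # slot 1 (placeholder set)
    {"o", "a", "e", "ch", "sh", "aiin", "ain", "ar"},  # slot 2 (placeholder set)
    {"y", "ys", "dy", "in", "l", "r"},                 # slot 3 (placeholder set)
]

def chunkify(word, slots=SLOTS):
    """Greedy looped parse: walk the slots left to right, always taking the
    longest chunk of the current slot that matches; when the last slot is
    passed, loop back to slot 1 (the 'repeat the grammar twice' trick, but
    repeated as many times as needed). Returns the chunk list, or None if a
    whole pass over the slots makes no progress."""
    chunks, pos, slot, idle_loops = [], 0, 0, 0
    while pos < len(word):
        best = max((c for c in slots[slot] if word.startswith(c, pos)),
                   key=len, default=None)
        if best is not None:
            chunks.append(best)
            pos += len(best)
            idle_loops = 0
        slot += 1
        if slot == len(slots):      # end of the grammar: loop back to slot 1
            slot = 0
            idle_loops += 1
            if idle_loops > 1:      # a full loop matched nothing: give up
                return None
    return chunks

print(chunkify("daiin"))    # ['d', 'aiin']
print(chunkify("fachys"))   # ['f', 'a', 'ch', 'ys']
[/code]
With any reasonable slot inventory, repeating the grammar in this way is what makes the 'separable' words parse without exceptions: a word that needs two passes is simply a word that loops once.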
Overmodelling and, I would add, overfitting: I am indeed worried that what I did, after all, is something that works on every possible text of any kind (provided one has first defined a reasonable grammar to use), so the results obtained on the VMS could just be some generic property of my software rather than give true insights into the VMS. I ran a sanity check (albeit preliminary and with many caveats) comparing the results with the syllabification of natural languages (briefly reported in post #53 iirc), with encouraging results. But yeah, overmodelling/overfitting is a big, possibly fatal, risk with my model. But I also think the metric based on Nchunktypes (or Nchunktokens) is very interesting: after all, it gives a way to reduce a text to its basic 'atoms' (the chunks), and then to compare different ways of 'atomizing' the text to see which one is more 'compact' (needs the least information to be described). And getting a good score on this metric is encouraging for LOOP.
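To make the metric concrete, here is a toy sketch of the counting involved. It is just an illustration of what Nchunktypes and Nchunktokens count; the real comparison of course runs over the full set of VMS word types and a full chunkification, and the actual scoring may combine the two numbers differently.
[code]
from collections import Counter

def chunk_metrics(chunkified_words):
    """chunkified_words: iterable of chunk lists, one per word type.
    Returns (Nchunktypes, Nchunktokens): the size of the chunk inventory and
    the total number of chunks needed to cover all the word types."""
    counts = Counter(chunk for word in chunkified_words for chunk in word)
    return len(counts), sum(counts.values())

# Toy comparison: the same three word types 'atomized' in two different ways.
by_loop  = [["d", "aiin"], ["f", "a", "ch", "ys"], ["d", "ar"]]
by_glyph = [list("daiin"), list("fachys"), list("dar")]
print(chunk_metrics(by_loop))    # (7, 8)   -- fewer, larger atoms
print(chunk_metrics(by_glyph))   # (10, 14) -- single glyphs as atoms
[/code]
The 'atomization' that describes the same word types with the smaller inventory and the fewer tokens is the more compact one, which is exactly the sense in which this is a form of data compression.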
As a side note: I said before I was pretty sure that the 'chunkification' and its associated metric had already been 'discovered' before, in a different form, in some branch of mathematics. Now I realize what I did is actually a form of data compression, similar to 'zipping' a file in order to reduce its size (with smaller sizes being better). Not sure what I can get from this consideration, if anything, nor whether it's good or bad for me, but it could be useful.
(10-12-2024, 01:31 AM)ReneZ Wrote: However, my main question is: how does your asemic text relate to the slot model? Is it generated based entirely on it?
It's entirely based on the slot model plus 'bi-chunk' frequencies (analogous to bi-gram frequencies). It works like this (a rough code sketch follows the numbered steps):
1) The VMS word types are divided into chunks following the grammar. E.g., with LOOP-4: 'daiin' = [d] + [aiin].
2) I compute the 'chunkified' grammar, simply by putting every chunk found into the corresponding slot. E.g.: [d] goes in slot 1 of the chunkified grammar, [aiin] goes in slot 2.
3) I compute all the 'bi-chunk' probabilities, e.g. P(STARTOFWORD followed by [d]) = xx; P([d in 1st slot] followed by [aiin]) = yy; etc.
4) These probabilities give the transitions of the Markov chain, while the chunks are the nodes. E.g.: starting from node START, a random number chooses a chunk in the first slot of the chunkified grammar as the next node, based on its relative frequency; let's say it was [d]. Next slot: the random generator chooses [aiin], based on the probability of finding [aiin] after a [d] in the first slot. Next slot: the random generator chooses [END], based on the probability of finding [END] after [aiin] in the second slot.
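To make steps 2)-4) concrete, here is a minimal sketch of the Markov chain. The transition table below is a tiny invented placeholder, not the real bi-chunk frequencies extracted from the VMS; it only shows the mechanics: states are (slot, chunk) pairs, and the next chunk is drawn according to the bi-chunk probability conditioned on the previous slot and chunk.
[code]
import random

# Placeholder bi-chunk table: P(next chunk | previous slot, previous chunk).
# "START"/"END" are pseudo-chunks; the numbers are invented for illustration.
TRANSITIONS = {
    ("START", None): {"d": 0.6, "f": 0.3, "ch": 0.1},     # distribution over slot 1
    (1, "d"):        {"aiin": 0.5, "ar": 0.3, "y": 0.2},  # chunks seen after [d] in slot 1
    (1, "f"):        {"a": 0.7, "ar": 0.3},
    (1, "ch"):       {"y": 0.6, "END": 0.4},
    (2, "aiin"):     {"END": 0.9, "y": 0.1},
    (2, "ar"):       {"END": 1.0},
    (2, "a"):        {"ch": 0.6, "END": 0.4},
    (2, "y"):        {"END": 1.0},
    (3, "y"):        {"END": 1.0},
    (3, "ch"):       {"ys": 0.5, "END": 0.5},
    (4, "ys"):       {"END": 1.0},
}

def generate_word(transitions=TRANSITIONS, max_slots=6):
    """Walk the chain: draw the slot-1 chunk from the START distribution, then
    keep drawing the next chunk conditioned on (previous slot, previous chunk)
    until END comes up or no outgoing transition is known."""
    word, state = [], ("START", None)
    for slot in range(1, max_slots + 1):
        dist = transitions.get(state)
        if not dist:
            break
        chunks, weights = zip(*dist.items())
        nxt = random.choices(chunks, weights=weights, k=1)[0]
        if nxt == "END":
            break
        word.append(nxt)
        state = (slot, nxt)
    return "".join(word)

print([generate_word() for _ in range(5)])
# e.g. ['daiin', 'far', 'chy', 'fachys', 'dar'] -- varies from run to run
[/code]
Concatenating such words (with spaces) is, in essence, how an asemic text in the style of the chunkified grammar is produced.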
All these steps can be followed 'easily' in the Excel files I posted in post #54: first the 'Chunks' sheet, then the 'Chunkified' sheet. (Don't mind the chunk 'categories' columns in the first sheet, the X__X__X: that was an idea which I think is cool but did not develop further. They are not actually used anywhere, even though the total number of chunks (Nchunktypes) is written at the bottom of that table.)
*** But I want to add a very important thing: Asemic Voynich is rather cool, but its primary purpose was to verify that all the software steps are working properly. It adds something to the discussion, because it's nice to be able to see where the VMS differs from an (optimized, and possibly overfitted) random process, and maybe learn something from that, but it's not the main point.
And yet another side note... In a sense, Asemic Voynich is a 'trick' conceptually similar to what Zattera did with his SLOT MACHINE grammar: it superimposes a structure (a deterministic state machine in Zattera's case, a probabilistic frequency table in the case of Asemic Voynich) on an underlying slot grammar, thus constraining (deterministically or probabilistically) the paths it can take. What do I make of this? I don't know at the moment.
