nablator > 09-12-2024, 11:50 AM
(28-11-2024, 10:10 AM)nablator Wrote: If some words are indeed "separable" (as suggested by M. Zattera's article), maximizing the F1 score or any metric is futile: the model is not meant to match "separable" words. Fitting the model to data containing "separable" words is probably what creates the wrap-around effect in the MZ slot sequence: slots 7-11 have 6 glyphs in common with slots 0-2, resulting in multiple possibilities of separation: for example sheolkeedy (in f115v) could be, according to Massimiliano Zattera's 12-slot sequence:
sheol keedy
sheo lkeedy
she olkeedy
These 6 words exist in the VM: it is unclear where the space should be inserted. There could be a rule, for example, to choose the first of the three because it maximizes the length of the first word, but we don't know.
Maybe a non-ambiguous slot sequence, allowing a unique re-parsing from a space-less transliteration, would better capture the partial ordering principle at work in the VM.
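The ambiguity in the sheolkeedy example can be illustrated with a minimal sketch. It assumes a stand-in vocabulary holding only the six word types named in the post, and the helper name `two_word_splits` is hypothetical; it is not taken from anyone's code in this thread. The "longest first word" tie-break is only the possible rule mentioned above, not an established one.

```python
def two_word_splits(text, vocabulary):
    """Return every (left, right) split of text where both halves are known words."""
    return [
        (text[:i], text[i:])
        for i in range(1, len(text))
        if text[:i] in vocabulary and text[i:] in vocabulary
    ]


# Assumption: a tiny stand-in vocabulary with just the six word types from the post.
vocabulary = {"sheol", "keedy", "sheo", "lkeedy", "she", "olkeedy"}

splits = two_word_splits("sheolkeedy", vocabulary)

# Hypothetical tie-break: prefer the split with the longest first word.
longest_first = max(splits, key=lambda pair: len(pair[0]))

print(splits)
print(longest_first)
```

With a real transliteration the vocabulary would be the full list of word types, and any string with more than one valid split would be flagged as ambiguous.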
Mauro > 09-12-2024, 01:43 PM
(09-12-2024, 11:27 AM)oshfdk Wrote: Thank you for your effort! Probably, I'm missing some context or some important background idea, I still don't understand how and why this metric works. To me it seems that any grammar with 100% coverage will reduce to the 1*WS grammar after chunkification, because it should be possible to match any word to the grammar, producing a single chunk.
Mauro > 09-12-2024, 01:58 PM
(09-12-2024, 11:50 AM)nablator Wrote: Hello Mauro,
If I understand your "chunkification" process correctly, the pattern matching is greedy: it picks the first of the three possibilities in the example that I posted earlier on the first page of this thread (sheol keedy, sheo lkeedy, she olkeedy), and I don't see why that is a better choice than the other two.
oshfdk > 09-12-2024, 02:47 PM
(09-12-2024, 01:58 PM)Mauro Wrote: You are perfectly right: the pattern-matching algorithm is 'greedy', it will stop at the first match it finds. Actually the algorithm is more sophisticated and I could go on and find all the possible alternative matches too, but I think this would not be very useful. It's one of those things which would be very easy to explain face-to-face just by scribbling on a piece of paper, but it's not easy to explain in written form. I'll try my best.
It does not matter because, after all, the grammar structure uniquely determines the chunk structure (even if there may be many possible alternative 'chunkings'). We could easily re-arrange the grammar to get the alternative chunkings. For instance, in LOOP-4 all the 'consonants' [d, k, l, r...] etc. are at the end, so a word such as 'daiin' is chunked as [d] + [aiin]. But I could just put that slot at the beginning, and 'daiin' would become [daii] + [n].
So, what discriminates between the two alternative forms of LOOP-4? It's the total number of chunk types, Nchunktypes, found by chunkifying with each grammar: I actually made a test with the 'consonants' at the beginning (just past the [q, ch, sh, y] slot) and it finds 609 Nchunktypes, so it's slightly worse than the standard LOOP-4 at 606 (but a very viable grammar nonetheless, I guess). I also tested with 'consonants' both at the beginning and the end and it resulted in ~675 chunks (I don't remember exactly).
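The effect of moving the 'consonants' slot can be reproduced with a minimal sketch of greedy chunking. Everything here is an assumption made for illustration: the four-slot toy grammars are not LOOP-4, glyphs are treated as single characters (so 'ch' and 'sh' are not handled), and the names `chunkify` and `count_chunk_types` are hypothetical rather than taken from Mauro's program.

```python
def chunkify(word, slots):
    """Greedy left-to-right chunking against a looped slot sequence.

    Each glyph goes to the earliest slot at or after the current position.
    If no such slot accepts it, the current chunk is closed and matching
    restarts from the first slot. Returns None if a glyph fits no slot.
    """
    chunks, current, pos = [], "", 0
    for glyph in word:
        match = next((j for j in range(pos, len(slots)) if glyph in slots[j]), None)
        if match is None:
            # Wrap around: close the chunk and retry from the first slot.
            match = next((j for j in range(len(slots)) if glyph in slots[j]), None)
            if match is None:
                return None
            if current:
                chunks.append(current)
            current = ""
        current += glyph
        pos = match + 1
    if current:
        chunks.append(current)
    return chunks


def count_chunk_types(words, slots):
    """Number of distinct chunks produced over a word list (a toy Nchunktypes)."""
    types = set()
    for word in words:
        types.update(chunkify(word, slots) or [])
    return len(types)


consonants = {"d", "k", "l", "r", "n"}
# Assumption: toy grammars, one with consonants last and one with consonants first.
consonants_last = [{"a", "o"}, {"i"}, {"i"}, consonants]
consonants_first = [consonants, {"a", "o"}, {"i"}, {"i"}]

print(chunkify("daiin", consonants_last))   # ['d', 'aiin']
print(chunkify("daiin", consonants_first))  # ['daii', 'n']

words = ["daiin", "dain", "ol"]
print(count_chunk_types(words, consonants_last))   # 4
print(count_chunk_types(words, consonants_first))  # 5
```

On this toy word list the consonants-last ordering yields fewer chunk types, which mirrors the kind of comparison described in the post (606 versus 609), though the toy numbers carry no evidential weight.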
About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view: it's not that defining a word as 'separable' excuses the original grammar for not having found it
Mauro > 09-12-2024, 04:45 PM
(09-12-2024, 02:47 PM)oshfdk Wrote:
After reading this reply I realized that the (loop) grammar matching mechanics are more complicated than I thought, which probably explains my persistent confusion. I cannot really understand your results so far, but I'm looking forward to new updates, especially those a bit less technical in nature
ReneZ > 10-12-2024, 01:31 AM
(06-12-2024, 09:53 AM)Mauro Wrote: Only 58 words (out of 8031) are missed and they all have more than 4 chunks, e.g. ‘tchodypodar’ would be [t][chod][yp][od][ar]. Just for colour, ‘fachys’ is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.
nablator > 10-12-2024, 10:10 AM
(09-12-2024, 01:58 PM)Mauro Wrote: Addendum. About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view: it's not that defining a word as 'separable' excuses the original grammar for not having found it
Mauro > 10-12-2024, 01:31 PM
(10-12-2024, 01:31 AM)ReneZ Wrote: I am also still trying to fully understand...
(10-12-2024, 01:31 AM)ReneZ Wrote: Taking this line as a starting point:
(06-12-2024, 09:53 AM)Mauro Wrote: Only 58 words (out of 8031) are missed and they all have more than 4 chunks, e.g. ‘tchodypodar’ would be [t][chod][yp][od][ar]. Just for colour, ‘fachys’ is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.
One risk I see is overmodelling. It is probably sufficient, even better (subjectively) not to aim for more than, say, 98% coverage. There will be errors in the text. We don't know how many and where, but the high number of hapax is also a hint in that direction.
The major issue of word spaces is largely covered by a looped slot system, which is probably one of the most attractive aspects of it.
The word tchodypodar is a tricky one. It is very long, compared to average word length, but all its transitions are common. Now it 'feels' wrong that this should have as many as 5 chunks, and it also 'feels' wrong that the common bigram 'dy' is split into two different chunks. The word 'fachys' is nice and short but highly irregular. It is the type of word that I would discard when setting up a model.
On a side note, the first character of each paragraph could be left out of consideration for the known reasons, but that would change 'fachys' into 'achys' which is only a bit better.
(10-12-2024, 01:31 AM)ReneZ Wrote: However, my main question is: how does your asemic text relate to the slot model? Is it generated based entirely on it?
Mauro > 10-12-2024, 01:47 PM
(10-12-2024, 10:10 AM)nablator Wrote:
Yes, but when it's too easy to match anything, the slot sequence can't be optimized. There must be some rule or limit to prevent over-modeling (as ReneZ said), otherwise it's more a certainty than a risk. I think the MZ 12-slot sequence is already too long: it includes a limited wrap-around effect that can be extended by allowing full loops. Allowing 3 or even 4 loops improves the coverage, but I don't see the point.
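The trade-off between allowed loops and coverage can be sketched as follows. This is only an illustration under stated assumptions: the four-slot grammar and the four-word list are toy stand-ins (not the MZ 12-slot sequence or any real word list), glyphs are single characters, matching is greedy, and the names `loops_needed` and `coverage` are hypothetical.

```python
def loops_needed(word, slots):
    """Passes through the slot sequence that greedy matching needs for word.

    Returns None if some glyph fits no slot at all.
    """
    loops, pos = 1, 0
    for glyph in word:
        match = next((j for j in range(pos, len(slots)) if glyph in slots[j]), None)
        if match is None:
            # No later slot accepts the glyph: start another pass from slot 0.
            match = next((j for j in range(len(slots)) if glyph in slots[j]), None)
            if match is None:
                return None
            loops += 1
        pos = match + 1
    return loops


def coverage(words, slots, max_loops):
    """Fraction of words matched using at most max_loops passes."""
    needed = (loops_needed(word, slots) for word in words)
    hits = sum(1 for n in needed if n is not None and n <= max_loops)
    return hits / len(words)


# Assumption: toy slot sequence and toy word list, for illustration only.
slots = [{"a", "o"}, {"i"}, {"i"}, {"d", "k", "l", "r", "n"}]
words = ["daiin", "ol", "okal", "dalar"]

for max_loops in (1, 2, 3):
    print(max_loops, coverage(words, slots, max_loops))
```

Coverage can only go up as more loops are allowed, which is the concern raised above: without a cap on loops (or some other penalty), higher coverage says little about how well the slot sequence fits.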
Mauro > 03-01-2025, 11:28 AM