(09-05-2024, 06:16 AM)ReneZ Wrote: While the word: "qokeeokeedy" is not attested, the similar word "qokeokedy" is.
That begs the question: can we come up with some non-subjective way to define valid and invalid words?
I can at least describe how I came up with [qokeeokeedy].
For each bigram, we can calculate the probability that any given token of the first glyph will be followed by the second glyph, ignoring word breaks. For [qo], the probability that any [q] will be followed by [o] is something like 97%. The next most probable case varies by language: in Currier A it's [qk] at 1.4%, and in Currier B it's [qe] at 1.2%. Of course these calculations require some working assumptions about what counts as a glyph (which I won't go into here, except to acknowledge that different choices about this might have led to different conclusions).
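As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the text has already been split into glyph tokens (so a choice such as treating [ee] as one glyph is made beforehand), and the function name and toy input are hypothetical, not taken from any real transcription.

```python
from collections import Counter, defaultdict

def transition_probs(words):
    """Return {glyph: {next_glyph: probability}}, ignoring word breaks.

    `words` is a list of words, each a list of glyph tokens.
    """
    # Flatten the words into one glyph stream so word breaks are ignored.
    stream = [glyph for word in words for glyph in word]
    pair_counts = defaultdict(Counter)
    for first, second in zip(stream, stream[1:]):
        pair_counts[first][second] += 1
    probs = {}
    for first, followers in pair_counts.items():
        total = sum(followers.values())
        probs[first] = {g: n / total for g, n in followers.items()}
    return probs

# Toy input with abstract glyphs (NOT Voynich data).
toy_words = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]]
probs = transition_probs(toy_words)
print(probs["a"])  # "a" is followed by "b" three times and by "c" once
```

Running the same function separately on Currier A and Currier B pages would give one table per "language".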
For every glyph there's another glyph that's statistically more likely to follow it than any other. The statistics are very different overall for Currier A and Currier B. For Currier B, if we generate a string by following the most statistically probable steps from glyph to glyph, we get a continuous loop: [qokeedyqokeedyqokeedy]. For Currier A, the continuous loop is instead [choldaiincholdaiincholdaiin]. Inserting word breaks here between glyphs that are ordinarily separated by spaces gives us a repeating [qokeedy] in Currier B or a repeating [choldaiin] or [chol.daiin] in Currier A.
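The loop-finding step can be sketched like this. The probability table below uses placeholder numbers invented for illustration (they are not measured values), and `greedy_loop` is a hypothetical helper name; the table is simply arranged so that the Currier B loop described above falls out.

```python
def greedy_loop(probs, start):
    """Follow the most probable successor from `start` until a glyph repeats.

    Returns (lead_in, loop) as strings; loop is "" if the walk dead-ends.
    """
    path = []
    seen = {}  # glyph -> position in path where it was first visited
    glyph = start
    while glyph not in seen:
        seen[glyph] = len(path)
        path.append(glyph)
        followers = probs.get(glyph)
        if not followers:
            return "".join(path), ""  # no known successor: no loop
        glyph = max(followers, key=followers.get)
    cut = seen[glyph]
    return "".join(path[:cut]), "".join(path[cut:])

# Placeholder transition table (invented numbers, NOT measured values).
toy_probs = {
    "q": {"o": 0.97, "k": 0.02, "e": 0.01},
    "o": {"k": 0.50, "l": 0.30, "r": 0.20},
    "k": {"ee": 0.60, "e": 0.30, "a": 0.10},
    "ee": {"d": 0.40, "y": 0.25, "o": 0.12},
    "d": {"y": 0.80, "a": 0.20},
    "y": {"q": 0.60, "o": 0.40},
    "a": {"r": 0.55, "l": 0.45},
    "r": {"o": 0.70, "a": 0.30},
}

print(greedy_loop(toy_probs, "q"))  # starts inside the loop
print(greedy_loop(toy_probs, "a"))  # starts outside and falls into it
```

Starting from a glyph outside the loop gives a short lead-in before the walk joins the same cycle.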
But for each glyph in these loops, there's also a second most probable choice of following glyph -- and third, and fourth, and so on. In the case of [qo], the next-most-probable choice isn't very probable at all. But in other cases, the probabilities of the first and other "choices" are much closer. So I experimented to see what sequences result if we take the [qokeedy] loop and substitute a single moderately less probable "transition" within it, or start with some other glyph that isn't part of the loop, such as [a], and do the same thing.
The word [qokeeokeedy] is what we get if we start at [q] and, the first time around, substitute the third most probable option [o] (11.98%) for the most probable option [d] (39.79%). This word doesn't occur in the VM, but as Rene pointed out, [qokeokedy] does, and so does [qokeeoky], both of which are similar to it and "weird" in more or less the same way it is, with its two gallows.
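Here is a minimal sketch of that substitution experiment. The table holds placeholder numbers invented for illustration (not measured values), the function name is hypothetical, and I'm assuming the walk stops when it would return to the starting glyph.

```python
def walk_with_substitution(probs, start, sub_step=None, sub_rank=0, max_len=20):
    """Greedy walk from `start`, except at transition number `sub_step`
    (0-based), where the `sub_rank`-th most probable successor is taken
    (0 = most probable). Stops when the walk would return to `start`,
    when no successor is available, or at `max_len` glyphs.
    """
    path = [start]
    for step in range(max_len - 1):
        followers = probs.get(path[-1], {})
        ranked = sorted(followers, key=followers.get, reverse=True)
        rank = sub_rank if step == sub_step else 0
        if rank >= len(ranked):
            break
        nxt = ranked[rank]
        if nxt == start:
            break
        path.append(nxt)
    return "".join(path)

# Placeholder transition table (invented numbers, NOT measured values).
toy_probs = {
    "q": {"o": 0.97, "k": 0.02, "e": 0.01},
    "o": {"k": 0.50, "l": 0.30, "r": 0.20},
    "k": {"ee": 0.60, "e": 0.30, "a": 0.10},
    "ee": {"d": 0.40, "y": 0.25, "o": 0.12},
    "d": {"y": 0.80, "a": 0.20},
    "y": {"q": 0.60, "o": 0.40},
}

plain = walk_with_substitution(toy_probs, "q")
# Transition 3 is the one leaving the first [ee]; rank 2 is the third choice.
variant = walk_with_substitution(toy_probs, "q", sub_step=3, sub_rank=2)
print(plain, variant)
```

Looping over every `sub_step` and a few values of `sub_rank` would enumerate the single-substitution variants of the loop.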
Most other sequences of similar length "predicted" by this method will get broken up across pairs or even larger groups of words, but most of them are actually attested -- [qokeedychedy], [qotedyqokeedy], [arokeedyqokeedy], [aiinShedyqokeedy], etc. -- as long as each alternative choice of "transition" is individually somewhat probable.
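Checking whether a predicted sequence is attested across word breaks could look something like the sketch below. It assumes a transcription where words are separated by "." or spaces; the sample lines are made up for the example and are not real folio text. Plain substring matching also ignores glyph boundaries, so a stricter check would compare glyph tokens instead.

```python
def is_attested(sequence, lines):
    """True if `sequence` occurs in any line once word breaks are removed."""
    for line in lines:
        joined = line.replace(".", "").replace(" ", "")
        if sequence in joined:
            return True
    return False

# Made-up sample lines (NOT real transcription data).
sample_lines = ["qokeedy.chedy.daiin", "ar.okeedy.qokeedy"]

print(is_attested("qokeedychedy", sample_lines))
print(is_attested("qokeeokeedy", sample_lines))
```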
(09-05-2024, 06:16 AM)ReneZ Wrote: what is the most likely valid word that is not attested?
That's a really interesting question. I suspect different methods would yield different answers, but it's probably still worthwhile to try. The method I outlined above would give us one way to identify the "most probable" sequence that doesn't actually occur (not necessarily the best way, but a way). Another promising source of likely valid but unattested words is Torsten Timm's paper, starting at page 66 -- thinking of all the words marked with (---): [doir], [daiiral], etc. I gather he'd classify all of these as "likely," although I'm not sure he'd have a method for ranking any one of them as "the most likely."
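One possible reading of "most probable" under the transition method is to score each candidate by the product of its transition probabilities and keep the best-scoring one that isn't attested. The sketch below does that with log-probabilities; the scoring rule is my assumption rather than an established procedure, and the function names, table and candidates are hypothetical toy values.

```python
import math

def sequence_log_prob(glyphs, probs):
    """Sum of log transition probabilities along a glyph sequence."""
    total = 0.0
    for first, second in zip(glyphs, glyphs[1:]):
        p = probs.get(first, {}).get(second, 0.0)
        if p == 0.0:
            return float("-inf")  # transition never observed
        total += math.log(p)
    return total

def most_probable_unattested(candidates, probs, attested):
    """Best-scoring candidate whose joined form is not in `attested`."""
    unattested = [c for c in candidates if "".join(c) not in attested]
    return max(unattested, key=lambda c: sequence_log_prob(c, probs), default=None)

# Toy values (invented, NOT measured).
toy_probs = {"q": {"o": 0.9, "k": 0.1}, "o": {"k": 1.0}, "k": {"o": 1.0}}
candidates = [["q", "o", "k"], ["q", "k", "o"], ["o", "k", "o"]]
attested = {"qok"}

best = most_probable_unattested(candidates, toy_probs, attested)
print(best)
```

This favours short sequences, since every extra transition can only lower the score, so candidates would probably need to be compared within a fixed length.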