The Voynich Ninja

Full Version: A Universal Template
(09-05-2024, 09:12 AM)Koen G Wrote: Now Voynichese isn't a language like English, so the term phonotactics may not even be relevant. But still, if you were to count everything that could potentially be formed according to existing bi/trigrams, you're also assuming that there was no good reason why those words were not included in the first place. You'd be saying "they might as well have been there".

And you might be right. :)

My first generator picked random existing trigrams (including symbols for spaces, starts of lines and ends of lines). It created long words like qopcheododaiin and otchalteedchy more frequently than words of that length occur in the VMs. I could improve it a bit with a parameter that selects a subset of a transliteration to extract trigrams from, and a parameter for maximum word length. Apart from the unusual length of some words, I don't know if there is a statistical test that can detect a difference between these generated lines and actual VMs lines.
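A minimal sketch of this kind of trigram generator might look like the following (the file name and the '^'/'$' line markers are placeholders, not necessarily what the actual script does; word spaces are assumed to appear as '.' in the transliteration):

Code:
import random
from collections import defaultdict

def build_trigram_table(lines):
    """Collect all character trigrams; '^' and '$' mark line start/end,
    and word spaces are assumed to be written as '.' in the input."""
    table = defaultdict(list)
    for line in lines:
        seq = "^" + line.strip() + "$"
        for i in range(len(seq) - 2):
            table[seq[i:i + 2]].append(seq[i + 2])
    return table

def generate_line(table, max_len=80):
    """Start from a random line-initial bigram and keep sampling
    attested trigram continuations until '$' or max_len is reached."""
    seq = random.choice([k for k in table if k.startswith("^")])
    while len(seq) < max_len:
        followers = table.get(seq[-2:])
        if not followers:
            break
        nxt = random.choice(followers)
        if nxt == "$":
            break
        seq += nxt
    return seq[1:]  # drop the '^' marker

# lines = open("transliteration.txt").read().splitlines()  # placeholder path
# print(generate_line(build_trigram_table(lines)))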

Setting constraints on bigrams is too crude to model the lack of certain trigrams like "eod" in Q13. Trigrams might be the right size (3) to set constraints on. Plus there is the fact that a horizontal or vertical segment in a grid is defined by 3 coordinates, which is the basis for my zigzag path cipher. Also there are reasons to put glyphs in a particular order that give them a dimensional quality.
(09-05-2024, 06:16 AM)ReneZ Wrote: While the word: "qokeeokeedy" is not attested, the similar word "qokeokedy" is:
[link]


That raises the question: can we come up with some non-subjective way to define valid and invalid words?

I can at least describe how I came up with [qokeeokeedy].

For each bigram, we can calculate the probability that any given token of the first glyph will be followed by the second glyph, ignoring word breaks.  For [qo], the probability that any [q] will be followed by [o] is something like 97%.  The next most probable case varies by language: in Currier A it's [qk] at 1.4%, and in Currier B it's [qe] at 1.2%.  Of course these calculations require some working assumptions about what counts as a glyph (which I won't go into here, except to acknowledge that different choices about this might have led to different conclusions).
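In code, that calculation is straightforward; here is a minimal Python sketch, assuming the text has already been reduced to a list of glyph tokens (the glyph segmentation itself is the working assumption mentioned above):

Code:
from collections import Counter, defaultdict

def transition_probs(glyphs):
    """P(next glyph | current glyph), ignoring word breaks."""
    pair_counts = defaultdict(Counter)
    for a, b in zip(glyphs, glyphs[1:]):
        pair_counts[a][b] += 1
    probs = {}
    for a, followers in pair_counts.items():
        total = sum(followers.values())
        probs[a] = {b: n / total for b, n in followers.items()}
    return probs

# Toy illustration (a real run would use a whole transliteration):
# probs = transition_probs(list("qokeedyqokaiin"))
# probs["q"]  ->  {"o": 1.0}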

For every glyph there's another glyph that's statistically more likely to follow it than any other.  The statistics are very different overall for Currier A and Currier B.  For Currier B, if we generate a string by following the most statistically probable steps from glyph to glyph, we get a continuous loop: [qokeedyqokeedyqokeedy].  For Currier A, the continuous loop is instead [choldaiincholdaiincholdaiin].  Inserting word breaks here between glyphs that are ordinarily separated by spaces gives us a repeating [qokeedy] in Currier B or a repeating [choldaiin] or [chol.daiin] in Currier A.
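Something like the following reproduces that greedy walk (reusing the transition table from the sketch above; how it behaves depends entirely on the glyph model, e.g. whether [ee], [ch] and [iin] count as single glyphs):

Code:
def greedy_walk(probs, start, steps=30):
    """Repeatedly append the single most probable following glyph."""
    seq = [start]
    for _ in range(steps):
        followers = probs.get(seq[-1])
        if not followers:
            break
        seq.append(max(followers, key=followers.get))
    return "".join(seq)

# With Currier B statistics this should settle into the loop described
# above, e.g. greedy_walk(probs, "q") -> "qokeedyqokeedyqokeedy..."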

But for each glyph in these loops, there's also a second most probable choice of following glyph -- and third, and fourth, and so on.  In the case of [qo], the next-most-probable choice isn't very probable at all.  But in other cases, the probabilities of the first and other "choices" are much closer.  So I experimented to see what sequences result if we take the [qokeedy] loop and substitute a single moderately less probable "transition" within it, or start with some other glyph that isn't part of the loop, such as [a], and do the same thing.

The word [qokeeokeedy] is what we get if we start at [q] and substitute the third most probable option [o] (11.98%) for the first most probable option [d] (39.79%) the first time around.  This word doesn't occur in the VM, but as Rene pointed out, [qokeokedy] does, and so does [qokeeoky], both of which are similar to it and "weird" in more or less the same way it is, with its two gallows.
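The substitution experiment can be expressed as a small variant of the same walk: at one chosen step, take the k-th most probable follower instead of the first (again only a sketch; the step index and rank in the comment are illustrative):

Code:
def walk_with_substitution(probs, start, sub_at, rank, steps=12):
    """Greedy walk, except that at step `sub_at` the `rank`-th most
    probable follower (rank 0 = most probable) is taken instead."""
    seq = [start]
    for step in range(steps):
        followers = probs.get(seq[-1])
        if not followers:
            break
        ordered = sorted(followers, key=followers.get, reverse=True)
        pick = ordered[min(rank, len(ordered) - 1)] if step == sub_at else ordered[0]
        seq.append(pick)
    return "".join(seq)

# If [ee] counts as one glyph, substituting the third-ranked follower
# at the step where [d] would otherwise follow [ee] reproduces the
# [qokeeokeedy] construction: walk_with_substitution(probs, "q", 3, 2)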

Most other sequences of similar length "predicted" by this method will get broken up across pairs or even larger groups of words, but most of them are actually attested -- [qokeedychedy], [qotedyqokeedy], [arokeedyqokeedy], [aiinShedyqokeedy], etc. -- as long as each alternative choice of "transition" is individually somewhat probable.


(09-05-2024, 06:16 AM)ReneZ Wrote: what is the most likely valid word that is not attested?

That's a really interesting question.  I suspect different methods would yield different answers, but it would probably still be worthwhile to try.  The method I outlined above would give us one way to identify the "most probable" sequence that doesn't actually occur (not necessarily the best way, but a way).  Another promising source of likely valid but unattested words is Torsten Timm's paper ([link], starting at page 66) -- thinking of all the words marked with (---): [doir], [daiiral], etc.  I gather he'd classify all of these as "likely," although I'm not sure he'd have a method for ranking any one of them as "the most likely."
But if there is a very plausible missing word, then why is this word missing in the first place? There must have been something working against its inclusion.
(09-05-2024, 12:46 PM)Koen G Wrote: But if there is a very plausible missing word, then why is this word missing in the first place? There must have been something working against its inclusion.


Let's say we have the ability to calculate the "plausibility" of a word (say, as the product of bigram likelihoods). Then the plausibility of all possible strings could be calculated and checked against the text. This would provide us with a list of the most plausible missing words.
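As a sketch (building on the kind of transition table discussed earlier in the thread), the plausibility score could be computed as a sum of log bigram probabilities, and candidate strings ranked by it:

Code:
import math

def plausibility(glyphs, probs):
    """Log of the product of bigram transition probabilities along a
    word, given as a sequence of glyphs."""
    score = 0.0
    for a, b in zip(glyphs, glyphs[1:]):
        p = probs.get(a, {}).get(b, 0.0)
        if p == 0.0:
            return float("-inf")  # contains a bigram never seen in the text
        score += math.log(p)
    return score

# Rank all candidate strings up to some length and keep the plausible
# ones that never occur in the text:
# candidates.sort(key=lambda w: plausibility(w, probs), reverse=True)
# missing = [w for w in candidates if w not in attested_words]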

The test then would be to discover whether a) the missing words are distributed randomly across all plausible words (which would be natural-language-like), or b) they are governed by further rules, such as longer-range dependencies (hypothetically, no word should contain two instances of [l], or have [k] exactly two characters after [d]). These further rules could be plugged back into the plausibility calculation and the list of plausible missing words rechecked. Done iteratively, we should be left either with a situation where all the missing words are highly implausible, or one where they are randomly spread across all plausible words.

Would something like this exhaust the word formation rules?

The same "plausibility index" could likewise be checked against not type presence but token amount, with a token/plausibility ratio. This might even better give a sense of the "naturalness" or not of the text. Is the ratio constant? Is [daiin] simply the most plausible word rather than the most common one?
(09-05-2024, 12:46 PM)Koen G Wrote: But if there is a very plausible missing word, then why is this word missing in the first place? There must have been something working against its inclusion.

This depends, of course, on how the text generation worked. A system to generate meaningless text will work quite differently from a system to encode a plaintext.

In case there is a plaintext behind it all, one theoretical possibility would be a nomenclator, or more specifically a numbering (enumeration) system. Then, a valid word might be a very rare word, which does not appear in the extant part of the MS. One could even imagine that a vocabulary was set up at the beginning, and some words simply do not occur in the entire text, even in the parts that were generated or planned but have not come down to us.
(We really do not know how much of the original we are missing).

This would also work in case the apparent words only encode syllables.

In case the text was generated ad hoc and is meaningless, the cause would be a very different one. This as yet unknown cause would still need to include an explanation of why there is such a clear word grammar.
One thing is easy to do, and I would do it now if I had my Linux laptop with me.

This is to declare one of the remaining folios in quire 20 as missing, and see which words are then lost from the vocabulary. Not sure what this will tell us, but we don't know until we try.

Any volunteer is welcome to try it, and I intend to try it next week in any case.
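For anyone who wants to try before then, the experiment is essentially a set difference; a sketch in Python, assuming a mapping from folio id to its word tokens has already been extracted from a transliteration (the folio id in the comment is only illustrative):

Code:
def lost_words(pages, dropped_folio):
    """pages: dict mapping folio id -> list of word tokens on that folio.
    Returns the word types that vanish from the vocabulary entirely
    if `dropped_folio` is treated as missing."""
    full_vocab = {w for words in pages.values() for w in words}
    reduced_vocab = {w for folio, words in pages.items()
                     if folio != dropped_folio for w in words}
    return full_vocab - reduced_vocab

# e.g. lost_words(pages, "f103r") lists the word types attested
# only on that folio.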
(09-05-2024, 08:11 PM)ReneZ Wrote: One thing is easy to do, and I would do it now if I had my Linux laptop with me.

This is to declare one of the remaining folios in quire 20 as missing, and see which words are then lost from the vocabulary. Not sure what this will tell us, but we don't know until we try.

There are many hapax legomena on every page; at least these (plus a few others, much less frequent, that appear twice on one page and nowhere else) would be lost if their page were removed.
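Counting these is quick; a sketch, again assuming a folio-to-tokens mapping:

Code:
from collections import Counter

def page_only_types(pages):
    """For each folio, the word types whose every occurrence in the
    whole text is on that folio (hapax legomena plus the 'twice on
    one page, nowhere else' cases)."""
    global_counts = Counter(w for words in pages.values() for w in words)
    return {folio: {w for w, n in Counter(words).items()
                    if n == global_counts[w]}
            for folio, words in pages.items()}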

[link]
If the Universal Template is the relevant template, then the rules of Voynichese are necessarily implicit in the template.

The first and golden rule will be: the preferred glyph combinations are those that move the text towards the paradigms, QOKEEDY and CHOLDAIIN.

Forbidden combinations are those that prevent the text from resolving into the paradigms.

We can extrapolate the more specific rules from the paradigms themselves. The rules are dictated by the paradigms.  

Valid words - attested or otherwise - conform, and invalid words do not.

In the example given by Patrick in this thread, [qokeeokeedy] can be constructed from QOKEEDY. [iinqeeadk] cannot, and does not follow from any natural combination of the paradigms.

What is a natural combination?

An example: [qokaiin] is a valid word but [aiinqok] is not. In natural combinations the prefix-suffix order is preserved. 

What is the relationship between the two paradigms? (Why are there two, anyway? Why do they overlap in the benched glyphs?) We can work out the rules by answering that question. That must follow from the template, logically. 

(An important thing to notice is that QOKEEDY splits into three bigrams, QO + KEE + DY, whereas CHOLDAIIN bifurcates into two trigrams, CHOL + DAIIN. Many tangles of the text are the result of that tension and there must be rules concerning it.)

QOKEEDY coheres into tripartite forms. CHOLDAIIN fragments. QOKEEDY is continuous and cyclic. CHOLDAIIN is discontinuous. Thus the rule: [y] can appear at the start or end of words, but [iin] can only be final. We discern that rule from the paradigms. 

Natural combinations also preserve CV alternation, although the two paradigms have different patterns. QOKEEDY has perfect CV alternation (with doubled vowel, VV) but CHOLDAIIN is bifurcated by CC combinations.
(09-05-2024, 06:16 AM)ReneZ Wrote: what is the most likely valid word that is not attested?

One approach to this question could be to synthesize a large text sample according to some language model, then examine the population of unattested words.

We can simulate text statistically by sampling the character correlation matrix.  The largest that I can comfortably handle is fifth order;  that is, the probability of each succeeding character is determined by the preceding 4-mer (for Bennett 1976, a "Fifth-Order Monkey").  Using all paragraph text from Takahashi IT2a-n.txt as input, typical output rambles along like this:

ar sheey qokar qokaiir ychey choltaiir chody shol daiin qokaiin y ycheey cheey qokeeo ypchd ar lkain qopar okaiin okal  okedy qotaly s oteey roiir cthol dol qokeeo chockhey qokalchy olpchedy ysheedy qotchy dor chey qokai qokeeedy qokaiin chckhol daiin yteedy cheor aiin qokal kol cheody chey pshol cheky cthes qopchedy ol sheey teey okeoly olchey  okeeolkaiin qoaiin ychedy pcheey shkeody qokal keeain cheol qotaiin yteey raiiin chedy qokaldy sairy poeeady  pchey cthey dchckheody qotedy cphy oteodaiin...
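A minimal sketch of such a Fifth-Order Monkey (not the actual code used here; the input cleaning and the seeding are placeholders):

Code:
import random
from collections import Counter, defaultdict

def train_monkey(text, order=4):
    """Counts of each following character for every `order`-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def babble(model, seed, n_chars=1_000_000, order=4):
    """Sample characters one at a time, conditioned on the preceding
    `order` characters; `seed` should be at least `order` characters."""
    out = list(seed)
    for _ in range(n_chars):
        followers = model.get("".join(out[-order:]))
        if not followers:
            break
        chars, weights = zip(*followers.items())
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

# text = open("IT2a-n.txt").read()   # paragraph text, suitably cleaned
# sample = babble(train_monkey(text), seed=text[:4])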

When such a text of one million EVA characters is parsed into words, 90% of the tokens are attested in the source.  The exact frequencies change from run to run because the matrix is sampled pseudorandomly.  But some words are common near the top of the most-frequent-but-unattested list:

orchy
shodar
chokchdy
qotcheey
lchedaiin
lchd
lches
ychaiin
chotchor
qoksho

Each of these represents about 0.2% of the unattested tokens generated, or ~0.02% of the synthetic text as a whole.  Thus the statistical model will produce several instances of orchy in a sample of 35 000 words.  The predicted frequency does not trivially account for its absence in the source.
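The tallying step is equally simple in outline (a sketch; `sample` is the synthetic text from the generator above, and the parsing of the source into words is a placeholder):

Code:
from collections import Counter

def unattested_report(sample, source_words, top=10):
    """Fraction of synthetic tokens attested in the source, plus the
    most frequent word types that never occur there."""
    tokens = sample.split()
    attested = sum(1 for t in tokens if t in source_words)
    missing = Counter(t for t in tokens if t not in source_words)
    return attested / len(tokens), missing.most_common(top)

# source_words = set(open("IT2a-n.txt").read().split())  # placeholder parsing
# frac, top_missing = unattested_report(sample, source_words)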

The "model" here is a brute statistical description of the local EVA character sequence, so there is nothing to be learned by peeking under the hood.  Could there still be interesting patterns in the population of possibly "unattested valid" words of this type?
(10-05-2024, 08:59 AM)obelus Wrote: But some words are common near the top of the most-frequent-but-unattested list:
orchy
shodar
chokchdy
qotcheey
lchedaiin
lchd
lches
ychaiin
olshy
dolchedy
oteedar
chotchor
qoksho

Unattested?

Some of these are attested in IT2a-n.txt paragraphs:
<f82v.14,+P0> [link]
<f99v.34,+P0> [link]
<f116r.8,+P0> [link]

Some of these exist in other (better) transliterations:
[link]
[link]