The Voynich Ninja

Full Version: Vord paradigm tool
(31-10-2022, 04:35 AM)ReneZ Wrote: Whether the advanced grammar of Stolfi, or the various slot representations, or the network set up by Torsten is the preferable representation is for me a matter of taste. They all have their shortcomings. There are two more that one can imagine, namely the tree diagram from the start of the word (or backwards from the end of the word), or a character transition diagram. Both are mentioned in Torsten's post.

The question of how well each model predicts the occurrence of words in the Voynich text is not a matter of taste. For instance, the table in the first post of this thread suggests that you can combine any element in the V3 column with any element in the CF column. However, this is not possible. EVA-e in V3 is only combinable with EVA-y in the CF column, whereas EVA-a and EVA-o do not combine with EVA-y. On the other hand, certain combinations are far more likely than others. For instance, EVA-a before EVA-in/iin/iiin is far more likely than EVA-o (94% vs. 5%). Also, EVA-n alone is very uncommon, since in 98% of cases EVA-n occurs after EVA-i. Therefore a plain table or slot machine is obviously not able to reproduce what we see in the VMS, since it provides no way to exclude certain glyph combinations.
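
To make this concrete, here is a minimal Python sketch (with invented slot contents and an invented set of attested pairs, purely for illustration) of why a plain slot table overgenerates: the Cartesian product of the slot columns contains combinations that a real transliteration never shows.

Code:
from itertools import product

# Toy slot columns, loosely in the spirit of a V3/CF table (contents invented).
V3 = ["a", "o", "e"]
CF = ["y", "in", "iin"]

# Toy set of combinations actually observed in some transliteration (invented).
attested = {("a", "in"), ("a", "iin"), ("o", "in"), ("e", "y")}

generated = set(product(V3, CF))        # what the plain table allows
overgenerated = sorted(generated - attested)

print(len(generated), "combinations allowed by the table")
print(len(overgenerated), "of them unattested:", overgenerated)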

It is therefore an improvement if we use a model that takes only existing glyph combinations into account. This is what Stolfi tried to achieve with his word grammar. In his grammar he defined different branches for each stage. For instance, there are three different rules for the "Final" stage: "Y", "A.M", and "A.IN" (see Stolfi's grammar). In this way Stolfi's grammar describes a kind of tree, since for each stage multiple rules/branches exist.

However, Stolfi's grammar is not able to explain why <aiiin> is far more common than <oiin>. Stolfi wrote about this question: "It would be nice if the predicted word frequencies matched the frequencies observed in the Voynich manuscript. Unfortunately this is not quite the case, at least for the highly condensed grammar given here" (see Stolfi). It is therefore possible to improve the model further by also adding likelihoods for each rule. To do so, Stolfi added the number of words for each grammar rule: "The primary purpose of the COUNT and FREQ fields is to express the relative 'normalness' of each word pattern. We think that, at the present state of knowledge, this kind of statistical information is essential in any useful word paradigm" [Stolfi].
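
As a rough illustration of what attaching counts to rules buys you, here is a small Python sketch of a weighted stage-by-stage generator. The stages, alternatives and counts are all invented for this example; Stolfi's actual grammar is far larger and uses his COUNT/FREQ fields.

Code:
import random

# Invented weighted grammar: each stage maps its alternatives to a count.
grammar = {
    "prefix":  {"qo": 300, "o": 200, "": 500},
    "gallows": {"k": 400, "t": 250, "": 350},
    "middle":  {"e": 300, "ee": 250, "": 450},
    "final":   {"dy": 350, "y": 300, "aiin": 500, "oiin": 25},
}

def sample_word(g):
    """Pick one alternative per stage, weighted by its count, and concatenate."""
    parts = []
    for alts in g.values():
        choices, weights = zip(*alts.items())
        parts.append(random.choices(choices, weights=weights, k=1)[0])
    return "".join(parts)

print([sample_word(grammar) for _ in range(8)])

With counts like these, <aiin>-type endings come out far more often than <oiin>-type ones, which is exactly the kind of frequency information a bare slot table or an unweighted grammar cannot express.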

By using the network approach it is also possible to describe the relations between words for the VMS. The model works for any part of the manuscript as well as for the whole manuscript; for example, if <chedy> is used more frequently, this also increases the frequency of similar words like <shedy> or <qokeedy>. The advantage is that the model covers the complete list of words for the VMS and is able to explain the word frequencies as well as the preference for certain glyph combinations. The model also explains some of the statistics for the Voynich text, such as the almost mathematically exact binomial-like word-length distribution for word types as well as for word tokens. The explanation is that high-frequency tokens also tend to have a high number of similar words, and that similar words have similar lengths.
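
A minimal sketch of the edit-distance-1 network idea (the word list and counts below are only illustrative, not real VMS frequencies):

Code:
from itertools import combinations

def within_one_edit(a, b):
    """True if a and b differ by at most one substitution, insertion or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one substitution
    return a[i:] == b[i + 1:]           # one insertion/deletion

# Illustrative word counts (not the real numbers).
words = {"chedy": 500, "shedy": 425, "qokeedy": 300, "qokedy": 270,
         "qokeey": 290, "okeedy": 100, "qotedy": 90, "cheedy": 120}

edges = [(a, b) for a, b in combinations(words, 2) if within_one_edit(a, b)]
degree = {w: sum(w in edge for edge in edges) for w in words}
for w in sorted(words, key=words.get, reverse=True):
    print(f"{w:8s} count={words[w]:4d} neighbours={degree[w]}")

The point made above is that in the real network the high-frequency words tend to have many such neighbours, and that neighbours tend to have similar lengths.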


(31-10-2022, 04:35 AM)ReneZ Wrote: The tree diagram may be used as an example to show the main problem.
If one creates it, it will quickly diverge into quite a wide tree.
Because each of the three parts of the words has its own structure, the stem part will be replicated many times in this tree, under different word starts.

Since "replicated" parts belong together there is no replication, e.g. the <aiin>-part in <qokaiin>, <okaiin>, <kaiin>, <daiin>, <chaiin> and <aiin> is always the same root. With other words there is only one tree with one "root" and multiple branches. Starting point for this approach are the most frequently used words, e.g. for the whole manuscript the most frequently used words <daiin>, <ol>, and <chedy>. The question is only if also the leaves are included or if some threshold is used.


(31-10-2022, 04:35 AM)ReneZ Wrote: I do like the slot system, but it still requires a lot of tuning before it can really help to understand the word structure.

Unfortunately the slot approach doesn't work, since Voynich characters depend on each other. You write yourself: "The picture for the Voynich MS text is much sparser. There are far fewer valid character combinations" [Zandbergen 2018].
Hermes777 suggests that one advantage of paradigms and other such analytical tools is that they let us identify abnormal words so that we can puzzle over them and try to account for them.  The fact that the various word grammars and networks and so on seem to show so much agreement as to which words are the abnormal ones is gratifying, even if the reasons *why* they're abnormal according to each model often seem to be strikingly different, and even if the specific "rules" that are being broken in each case often seem to be impossible to translate from one model into another.

In that spirit, I'd be curious to see how the abnormalities in the following two words would play out in this paradigm:

[qokeeoky] <f90v1.9> 
[qokeokedy] <f111r.40>

Like [lkshykchy], they each have two [k]s in them.  Unlike [lkshykchy], there's no noted ambiguity in spacing, and the issue doesn't seem to be resolvable as a concatenation of two valid words.

I'm particularly interested in these two words because they seem highly abnormal from the standpoint of word structure, but not from the standpoint of individual glyph-by-glyph transitional probabilities.  That is, if glyphs are filling prefix-stem-suffix "slots" in words, it's hard to account for them, but if glyphs are just constrained by the other glyphs near them, it's not.  One model I experimented with predicted the similar-looking but unattested word [qokeeokeedy] as reasonably probable.
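
For what it's worth, here is a rough sketch of what a glyph-by-glyph transition scorer of that general kind might look like. This is not the model pfeaster used; the probabilities are invented, each EVA letter is treated as a single glyph, and word boundaries are marked with ^ and $.

Code:
import math

# Invented bigram probabilities P(next glyph | current glyph); anything not
# listed falls back to a small floor value for unseen transitions.
P = {
    ("^", "q"): 0.15, ("q", "o"): 0.95, ("o", "k"): 0.30, ("k", "e"): 0.45,
    ("e", "e"): 0.35, ("e", "o"): 0.10, ("e", "d"): 0.25, ("d", "y"): 0.80,
    ("k", "y"): 0.05, ("o", "d"): 0.10, ("e", "y"): 0.15, ("y", "$"): 0.85,
}

def log_score(word, floor=1e-4):
    """Sum of log bigram probabilities over the word, boundaries included."""
    glyphs = ["^"] + list(word) + ["$"]
    return sum(math.log(P.get(pair, floor)) for pair in zip(glyphs, glyphs[1:]))

for w in ["qokeeoky", "qokeokedy", "qokeeokeedy", "qokeedy"]:
    print(f"{w:12s} log-score = {log_score(w):8.2f}")

Under a scorer like this, a second [k] costs nothing special as long as each local transition is common, which is exactly the contrast described above: such words look abnormal to a slot model but unremarkable to a purely local one.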

Of course I'd welcome any other analyses of these words too, but they seem very much like the sort of thing Hermes777 is looking to use this tool to examine.  I think I can imagine what a self-citation/network analysis of them would look like, for example.
(31-10-2022, 03:44 PM)pfeaster Wrote: In that spirit, I'd be curious to see how the abnormalities in the following two words would play out in this paradigm:

[qokeeoky] <f90v1.9> 
[qokeokedy] <f111r.40>

In my eyes "abnormal" words are quite normal for the VMS.

The most common word similar to [qokeeoky] is [qokeeody], with 13 instances. The most similar words on f90v1 are [qekeody] in <f90v1.P.4>, [okeeody] in <f90v1.P.5>, and [qocheoty] in <f90v1.P.7>.
The most common word similar to [qokeokedy] is [qokedy], with 272 instances. The most similar words near line 40 of f111r are [qofchedy] in <f111r.P.36>, [kechedy] in <f111r.P.38>, and [qokechdy] in <f111r.P.39>. There is also [qokeedy] as the word next to [qokeokedy].


(31-10-2022, 03:44 PM)pfeaster Wrote: One model I experimented with predicted the similar-looking but unattested word [qokeeokeedy] as reasonably probable.

This is indeed reasonable. A similar duplication pattern can be found for instance on f70v2 and f71r:
<f70v2.R3.1> [okey] ... [oteotey] ... [oteoteotsho]
<f71r.R1.1> [oky] ... [okeoky] ... [okeokeokeody]
[qokeeoky] and [qokeokedy] present the problem of the second [k]. In the template I'm using, this is why there is a column in the third compartment, C, that allows [ch], [sh], and by extension gallows and benched gallows in that place. We can see where these vords are problematic by putting them onto the paradigm thus:


In the case of [qokeeoky] the problem is resolved by allowing the compound vowel formation [eeo] in the vowel slot in compartment B. 



In the second case we allow [eo] as the vowel in compartment B and we must allow the double ending [dy] instead of just final [d]. This is a case of what I call the fortification of the ending. Some vords have fortified endings.

The final forms of glyphs provoke vord breaks. Sometimes we find them duplicated. Here [y] has been added to reinforce or fortify the [ed]. An emphatic ending. The reason endings are fortified, it seems, is to remove any ambiguity as to whether a word break (space) follows. Here [-ed] might be ambiguous (or 'weak') whereas final [y] insists on a word break.

Here is another, perhaps clearer, way of presenting much the same thing. Different presentations give emphasis to different habits and structures, but they all describe the same processes, slippery though they are.

Any paradigm like this is useless without an accompanying set of rules. The model only depicts the elements from which vords are made. How they are made, how the elements combine, is governed by rules, or rather probabilities.

Yes, the model is misleading because it suggests all possibilities are equal, but it is only intended as a guide to the resources of vords.  In fact, some possible combinations are far more likely than others. Indeed.

The template is not intended as a vord generator, but even so I don’t see it as a fault in the model that it might generate non-vords, or non-extant vords.

What of non-extant vords?

oltol occurs once, but oltor never.

Is there any hard and fast rule that eliminates oltor as a possibility? Arguably, it is not forbidden; it is just not present in our sample.

Similarly, qochol is to be found, but qochor is not, even though ochor is found 19 times and ochol only 12.

okeel appears yet oteel does not, even though [k] and [t] are generally interchangeable.

Similarly there are any number of non-extant vords:

qopod
dosh
cheechom
dochor
qochor (yet chor occurs 527 times)
otolshy
keeral
dashaiin
otoro
qoteel
ochad
qotocho

All of them are conforming, and look the part, but none are in our text. It is a moot question whether they might exist in a larger sample. But they are not inherently impossible. They are possible vords. A template like this tells us that our text contains a selection of possible vords. It is not a fault that it does not eliminate all but the extant vords: it doesn't try to.
(31-10-2022, 02:50 PM)Torsten Wrote: The question of how well each model predicts the occurrence of words in the Voynich text is not a matter of taste.

That is true, but that was not the original question. The original question was to find ways to model the words. A simple list of all words fully and perfectly predicts the occurrence of words in the Voynich text, but that is where its use ends.

What is a matter of taste is which property of the text one finds most interesting, or which property one would like to highlight. The network of words highlights the fact that one can connect words that differ by an edit distance of only one. Stolfi's word grammar highlights that such a grammar exists in the first place. The slot tables do something similar, but are more visual (easy to understand) and less complete.

The network does not show the word structure, and the grammar and slot tables do not show the connectivity of similar words.

Non-existent words are a problem in properly understanding all this, though I don't want to over-emphasise the importance of them.

In any case, there are "non-existent" words that did occur on the now-missing pages, but we don't know them. They are valid words, though.

Furthermore, only if the MS has a meaningful text are there likely to be valid words that do not occur anywhere in the entire MS.
(31-10-2022, 02:50 PM)Torsten Wrote: However, Stolfi's grammar is not able to explain why <aiiin> is far more common than <oiin>. Stolfi wrote about this question: "It would be nice if the predicted word frequencies matched the frequencies observed in the Voynich manuscript. Unfortunately this is not quite the case, at least for the highly condensed grammar given here" (see Stolfi).

For me, this is not a problem. If the MS text were the result of a truly arbitrary process, then we would expect to be able to predict such frequencies; but if it is meaningful language, then we would not.

Compare English:

bat - bet - bit - bot - but
cat - cet - cit - cot - cut

There is no pattern in the frequencies.
(Disclaimer: just the first example I could think of.)
(01-11-2022, 12:13 AM)ReneZ Wrote: What is a matter of taste is which property of the text one finds most interesting, or which property one would like to highlight. The network of words highlights the fact that one can connect words that differ by an edit distance of only one. Stolfi's word grammar highlights that such a grammar exists in the first place. The slot tables do something similar, but are more visual (easy to understand) and less complete.

The network does not show the word structure, and the grammar and slot tables do not show the connectivity of similar words.

There is only one Voynich text. Therefore every model for the text has to handle features like the connectivity of similar words. See for instance the word <qokeedy> (a short sketch for computing these edit distances follows the lists below).
For <qokeedy> there are similar words with ED=1 like: qokeey, qokedy, okeedy, qoteedy, qoeedy, qokeed, qokeeody, qolkeedy, qokeeedy, qodeedy, qokededy, qopeedy
There are words with ED=2: okeey, okedy, oteedy, qotedy, keedy, qoteey, qoked, qokdy, ...
With ED=3: otedy, qoky, oteey, okey, ...
With ED=4: oky, otey, qoty, ody, qok, qoy, qod, eey, ...
With ED=5: qo, dy, ok, qy, ... 
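
For anyone who wants to reproduce lists like these, the distances are ordinary Levenshtein edit distances; a compact Python sketch:

Code:
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

for w in ["qokeey", "okeey", "otedy", "oky", "qo"]:
    print(w, levenshtein("qokeedy", w))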

How does Stolfi's grammar generate <qokeedy> as well as <qokeed>, <qoked>, <qokd>, <qok>, and <qo>? Since the order of glyphs does not change, it is easy to produce some type of grammar covering all the words, as long as the elements can be optional. Stolfi's grammar does this by using optional rules and by repeating elements n times (with n = 1, 2, 3, 4, 5). Therefore the grammar contains rules like the CrS rule (see Stolfi's grammar):
CrS:
  .
  OR
  OR.OR
  OR.OR.OR
  OR.OR.OR.OR
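
Read as a pattern, a rule block like that simply says "zero to four repetitions of OR". Here is a loose Python illustration of the same idea of optional, repeatable elements (a toy pattern in the general style of a condensed grammar, not a transcription of Stolfi's rules):

Code:
import re

word_pattern = re.compile(r"""
    ^(qo|o|)      # optional prefix
     (k|t|)       # optional gallows
     (e{0,3})     # zero to three repetitions of e
     (d|)         # optional d
     (y|)$        # optional final y
""", re.VERBOSE)

for w in ["qokeedy", "qokeed", "qoked", "qokd", "qok", "qo", "qokkedy"]:
    print(w, bool(word_pattern.match(w)))

Exactly like Stolfi's optional rules, this accepts <qokeedy>, <qokeed>, <qoked>, <qokd>, <qok> and <qo> alike, but says nothing about how frequent each of them should be.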

In this way Stolfi's grammar in fact demonstrates the same thing as the following statement by D'Imperio: "There seem to be very strong constraints in combinations of symbols; only a very limited number of letters occur with each other letter in certain positions of a 'word'." (see Currier 1976). However, in my eyes D'Imperio's statement is easier to understand.
A "word model" for the Voynich text is, of course, nothing new. Tiltman created a model many decades ago, and Stolfi's work stands out as an inspiration to many researchers. The question is really why we want such a model. A model should provide us with more than a simple restatement that words exist as strings of glyphs and go beyond bigram frequencies. It should reveal some structure of words. For a natural language (leaving aside the question of nature of the Voynich text) we would look to learn about syllable structure or morphology from such a model.

A word model doesn't need to do everything, and it can't. Sometimes models have been assessed on how well they fit the Voynich lexicon: do they generate every word that occurs, and does every word they generate occur? This is actually a somewhat poor measure. As Rene has pointed out, gaps occur in natural languages, and minimal sets of possible alternations may be impossible to construct. Even minimal pairs for a particular contrast can be difficult. The rules of a language may allow words but do not demand them. This goes even more strongly for frequency: "sphere" is a phonologically marginal word in English, yet the OED puts it in the same frequency band as "dog" and colour terminology.

A good model explains common observations: why only one gallows glyph per word? Why is /d/ before a gallows so rare, but /d/ after a gallows so common? Why is /ch/ so common, but not after /o, y/, except at the beginning of a line? And when we say "explain", we mean not that it provides a cause, but rather that it provides a rule. So we might say, in answer to the foregoing questions: "there is only one normal slot for a gallows in a word", and "the slot for a gallows precedes the slot for /d/", and even "/o, y/ before /ch/ is abnormal and not part of regular word formation". These answers still require more explanation, but they take us forward: glyphs have slots! Slots have order! The final word may be the result of multiple processes!
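
Rules of that kind are easy to write down as explicit checks. A toy Python sketch, covering only the three observations just mentioned (EVA letters are treated individually; the line-initial exception for /ch/ is not modelled):

Code:
GALLOWS = set("ktpf")   # the four plain EVA gallows; benched gallows not handled

def check(word):
    """Flag violations of three toy word-formation rules."""
    problems = []
    if sum(c in GALLOWS for c in word) > 1:
        problems.append("more than one gallows")
    if any(a == "d" and b in GALLOWS for a, b in zip(word, word[1:])):
        problems.append("d before a gallows")
    if any(c in "oy" and word[i + 1:i + 3] == "ch" for i, c in enumerate(word)):
        problems.append("o/y before ch")
    return problems or ["looks regular"]

# Test strings only; some of these may not be attested words.
for w in ["qokeedy", "daiin", "qokeeoky", "dkaiin", "ychedy"]:
    print(w, check(w))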

I've looked into a word model for the Voynich text (thanks to Marco for mentioning it) and the results were enlightening. It made me realise that the text can't simply be an unaltered natural language (there must be something more happening), but also that there are strong rules behind word formation, not an arbitrary process. Words certainly weren't formed on the fly, at the whim of the author.
1. Words with multiple gallows are present.
The question is: can I judge them as one word, or are there two?

2. I have a correction in one gallows. Are they two letters in combination? Why would a split work here? Does a meaningless text work here?

3. Obviously two of the same gallows in a row. Are they really the same, or is it just a deception?

When I use EVA, all the words look the same. But they are not. There are more differences than first apparent.
Do you know the difference between an f and a t, or a v and a u?

It's not what it seems.

Translated with an online translation tool (free version).