The Voynich Ninja

Full Version: Vord paradigm tool
(09-11-2022, 02:00 AM)Emma May Smith Wrote: I'm currently rechecking my model for predictions/rules. Would you be interested in testing them?

The 'leapfrog' pattern from my previous post is unusual because it can easily be implemented as a functional operator on the big correlation matrix.  Other schemes, of greater linguistic merit, may be roundabout or impossible (for me!) to check in that way.

A practical strategy is to generate the statistics on a case-by-case basis.  We first formulate the rule scheme as a statement containing two ordered character variables, such as the proven-productive question
  • To what extent does the final glyph of a word predict the initial glyph of its sequel?
Expressed as a string pattern, we tabulate occurrences of the rule scheme

...(first glyph) + (space) + (second glyph)...

and calculate correlations between the variable characters.  For the whole of the IT paragraph text (without consolidating any EVA glyphs, and respecting line breaks), the matrix of correlation probabilities looks like this:

[attachment=6947]
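For anyone wishing to replicate the tabulation, here is a minimal Python sketch of the counting step. The toy lines below are made-up EVA-style text, not the IT file, and each character is treated as one glyph (no EVA consolidation):

```python
from collections import Counter

def boundary_bigrams(lines):
    """Count (word-final glyph, next word-initial glyph) pairs,
    respecting line breaks: no pairs are counted across lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for w1, w2 in zip(words, words[1:]):
            counts[(w1[-1], w2[0])] += 1
    return counts

# Toy EVA-style sample (not real Voynich data):
sample = ["qokeedy otedy daiin", "otaiin qokal chedy"]
print(boundary_bigrams(sample))
```

Normalizing the counts (per row, or over the whole table) then gives the correlation probabilities being plotted.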

...again with the first character in rows, and the second in columns.  Following the procedure of Smith and Ponzi 2019, we can divide this matrix by one generated from the same text sample, except with the words within each line scrambled.  It shows the factor by which line break combinations deviate from statistical expectation:

[attachment=6950]
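The null model can be sketched the same way: shuffle the words within each line, recount, average, and divide. This is only my rough reading of the Smith and Ponzi 2019 procedure, so treat the details (number of trials, averaging) as assumptions:

```python
import random
from collections import Counter

def boundary_counts(lines):
    """Count (word-final glyph, next word-initial glyph) pairs per line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for w1, w2 in zip(words, words[1:]):
            counts[(w1[-1], w2[0])] += 1
    return counts

def scrambled_baseline(lines, trials=100, seed=0):
    """Average the same counts over copies of the text in which the
    words of each line are shuffled in place (the null model)."""
    rng = random.Random(seed)
    total = Counter()
    for _ in range(trials):
        shuffled = []
        for line in lines:
            words = line.split()
            rng.shuffle(words)
            shuffled.append(" ".join(words))
        total += boundary_counts(shuffled)
    return {pair: n / trials for pair, n in total.items()}

def deviation(observed, baseline):
    """Factor by which each observed pair deviates from expectation."""
    return {p: observed[p] / baseline[p] for p in observed if baseline.get(p)}
```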

...which is just a visualization of your Tables 3.1-3.8 that highlights the positively deviant combinations, but suppresses their absolute frequency.  Our numerical values are in satisfactory agreement (considering haste and shortcuts).  It is to be hoped that the human mind can find patterns in the pixels.  Is the general anisotropy of the picture telling us anything?

It would be most interesting to test some non-obvious rule schemes.  The work of dropping a new string pattern into the graphics code is minimal.
As Marco pointed out in this thread, my vord paradigm based on QOKEEDY stands in contrast to Emma May Smith’s preferred example vord QOKEDAR.

The difference is the final consonant. As Marco writes:

If one matches [qokedar] to qokeedy, it becomes clear that a slot for the "coda" is missing: one must allow that the vowel of the last syllable is followed by consonants.

In my quest for simplicity I removed the provision for the final consonant. It was in my initial model but I attempted to eliminate it.

It is, however, a necessary complication (if QOKEEDY is our paradigm). The paradigm, in fact, is QOKEEDY, with or without a Final, or a “coda” to the last syllable.

(I will just call this slot in the model ‘Final’ and the glyphs that tend to appear there ‘Finals’, alluding to scripts like Hebrew or Arabic where letters can take final forms. There is ample evidence that in Voynichese some of the ‘Finals’ are vord-final forms of other glyphs that are transformed when they occur at the end of vords, i.e. before a space.)

To clarify things, a better iteration of the paradigm looks like this:

[attachment=6948]

QOKEEDY remains the default vord, but it has two forms, one ending [y] and one ending vowel/final. Arguably QOKEDAR makes QOKEEDY redundant, but there are other reasons for having QOKEEDY as the default.

I have also taken Marco’s sage advice and admit vords without a core (or subject). The distinction between vords with and without a core then becomes a matter of interest.

* * *

Every graphic presentation gives emphasis to different aspects of the default vord structure. I choose to give emphasis to the tripartite structure (compartments A, B and C) and to the core glyph (almost invariably a gallows, or at least a consonant, shown in red). I also want to emphasize the conspicuous alternation of consonants and vowels: CVCVCV etc.

All the same, this correction only fixes one end of the paradigm – it allows a final consonant. It does not fix the other shortcoming of QOKEEDY, namely that a good portion of vords begin with a VC prefix (e.g. [ol-]). But QOKEDAR fails at that hurdle too.

Assuredly, there are rules (or rather probabilities) that govern which glyphs can appear in which slots, but the model becomes unwieldy if we try to account for them all. Even with the best of rules there are exceptions; few are hard and fast. Rather, it suffices to make the general rule that any consonant glyph can appear in any consonant slot, and vowels in vowel slots, but some are much more likely than others, on a scale from ‘almost always’ to ‘not once in this sample.’

The other rule for using this tool is simply: in every case match vords to the paradigm as nearly (and as consistently) as possible.
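One way to make that matching rule mechanical is a regular expression over the slots. To be clear, the slot names and glyph inventories below are my own illustrative guesses for a sketch, not the author's exact model:

```python
import re

# Hypothetical regex rendering of the tripartite paradigm:
# compartment A (optional [q] + [o]), a gallows core, compartment B
# vowels, compartment C, and an optional Final slot.  Glyph classes
# are assumptions for illustration only.
PARADIGM = re.compile(
    r"^(?P<A>q?o)?"          # compartment A: optional [q] plus [o]
    r"(?P<core>[tkpf])?"     # core glyph (a gallows); may be absent
    r"(?P<B>e{0,3})"         # compartment B: run of [e]
    r"(?P<C>d?[ya])?"        # compartment C: consonant + vowel
    r"(?P<final>[rlnms])?$"  # optional Final slot
)

def parse_vord(vord):
    """Return the slot assignment for a vord, or None if it fails to
    match the paradigm at all."""
    m = PARADIGM.match(vord)
    return m.groupdict() if m else None
```

Vords outside the QOKEEDY family simply return None, which itself flags them as non-paradigmatic; coreless vords parse with an empty core slot.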

Although this is a vord paradigm, I am actually interested in LINES and apply the model to reveal line patterns. (I am specifically intrigued by how some lines seem to rehearse various permutations of one aspect or other of the paradigm, very often what I am calling the Finals. There is a lot of play with these Finals.)

[attachment=6953]

As a philosophical aside, I think the background to the manuscript is broadly Platonic (rather than Aristotelian) and it is an appropriately Platonic move to apply a paradigm in order to observe samenesses and differences. (I suspect what appear as the core of vords are, Platonically speaking, paradigmata.) Paradigm is the right term.

Hermes, I would be interested in why you analyse [okeom] and [qokeody] differently.

You have:
  • o-ke-om
  • qo-keo-dy

So the B slot has two different possibilities in these words ke and keo. Yet these two words could be united with the analysis:
  • o-keo-m
  • qo-keo-dy

I see that your model allows empty slots, so would this be possible?

(Also, I think we're really close in our analysis of words. Once [o, y/a] are taken as the fundamental markers of the "sections" then the three slot model presents itself so clearly. Each possible choice for a slot also obeys the same glyph order: [t, k, p, f, d, s, l, r] > [ch, sh] > [e, ee, eee] > [o, y/a]. Given that much of the low entropy must be due to this glyph order within sections, I wonder if taking these sections as the fundamental units of analysis would be better? That is, we're not worried too much that [k] precedes [ch], but rather that [cheo] precedes [ty] or [o] precedes [ksho]? And further, whether [oksho] (or whatever) takes [r] over [l]?)
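That section-based analysis could be prototyped with a splitter that encodes exactly this glyph order. The glyph classes here are illustrative and incomplete (for instance [i], [n], and [m] are not handled), so this is a sketch of the idea rather than a full tokenizer:

```python
import re

# Hypothetical "section" splitter: each section ends in a marker
# [o], [y], or [a], and glyphs inside follow the proposed order
# [t k p f d s l r] > [ch sh] > [e ee eee] > [o y/a].  The EVA
# digraphs ch/sh are kept whole.
SECTION = re.compile(r"q?[tkpfdslr]*(?:ch|sh)?e{0,3}[oya]?")

def sections(word):
    """Split a word into sections; glyphs outside the classes above
    are silently skipped."""
    return [s for s in SECTION.findall(word) if s]
```

This yields [qo][kee][dy] for [qokeedy] and [cheo][ty] for [cheoty], so co-occurrence statistics could then be computed over whole sections rather than single glyphs.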

(10-11-2022, 04:53 AM)obelus Wrote: The 'leapfrog' pattern from my previous post is unusual because it can easily be implemented as a functional operator on the big correlation matrix.  Other schemes, of greater linguistic merit, may be roundabout or impossible (for me!) to check in that way.

I guess that's the big challenge to any theorist: state your theories clearly, plainly, and testably. It's better to have a simple statement that can be checked with statistics than a complicated one that is too difficult to test.
(10-11-2022, 04:53 AM)obelus Wrote: The 'leapfrog' pattern from my previous post is unusual because it can easily be implemented as a functional operator on the big correlation matrix.  Other schemes, of greater linguistic merit, may be roundabout or impossible (for me!) to check in that way.

There are a couple of different ways to analyze "leapfrog" patterns (I like that term!), and I believe they have different implications.  

One is to work out matrices for n-grams beginning with each glyph, such as [o*], [o**], [o***], etc., where each * is a single wildcard glyph: thus, what percentage of the time is [o***] followed by [y], as in [otedy]. 

An alternative is to compare specific values of [o*], [o**], [o***] against the matrices for specific corresponding sequences [*], [**], [***].  So, for example, [oted] makes up some percentage of tokens of [ted].  If the [o] had no statistical impact on the glyph that follows after [oted], we would expect the frequency of [otedy] to be roughly that same percentage multiplied by the frequency of [tedy].  Not exactly, of course; the data is likely to be rather noisy.  But that's how things ought to trend.

The first approach shows how likely one glyph is to appear a certain number of spaces ahead of another glyph, without factoring in the intervening glyphs.  This will tend to expose the same kinds of patterns as a "slot-based" word morphology.  Thus, [o***] might commonly be followed by [y] just because [o] and [y] commonly occupy "slots" that are separated by that many positions and the intervening slots are commonly filled rather than empty.

The second approach will instead show how the presence of the first glyph affects what usually happens after a specific given sequence of intervening glyphs.  This should factor out many of the strictly "slot-based" patterns and let us detect other kinds of correlation over a distance.  Thus: [y] may be common after [ted], but admitting that to be so, is it then more or less common than usual when [o] appears beforehand?
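For one concrete case, that comparison boils down to two conditional rates, e.g. P([y] after [oted]) versus P([y] after [ted]); if [o] had no effect at a distance, they should be roughly equal. A small helper (the input is just a glyph string; the example text below is invented):

```python
def follow_rate(text, context, target):
    """Fraction of occurrences of `context` that are immediately
    followed by `target` (overlapping occurrences included)."""
    n = hits = 0
    start = text.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(text):
            n += 1
            hits += text[nxt] == target
        start = text.find(context, start + 1)
    return hits / n if n else 0.0

# Compare the two rates; a large gap suggests [o] acts at a distance.
print(follow_rate("otedyotedytedk", "oted", "y"))
print(follow_rate("otedyotedytedk", "ted", "y"))
```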

The results of the second approach can also be generalized by aggregating the statistics for all possible values of, say, [o***] against [***], examined case by case, to see if there are any consistent patterns.

The sample "rules" I listed earlier were worked out using the second approach.

So, for example, it's not just that the glyph four, five, or six places ahead of one [t] is more likely than average to be another [t].  That could be explained in terms of average word lengths and a "slot-based" word morphology.  Rather, it's that the glyph four, five, or six places ahead of one [t] is even more likely to be another [t] than it should be based on aggregate matrices for intervening glyph sequences.

Whether this sort of pattern explains or is explained by a tendency of similar words to recur near each other, I don't know!
I wonder if the observations about [qo] triplets could be analysed in a more statistically robust way?

The general finding was that the relationship/ratios between the counts of triplet words like [qotedy], [otedy], [tedy] were patterned according to the initial glyph of the third word. So that all triplets like [qot*], [ot*], [t*] would share similar ratios, and that they would differ from triplets like [qok*], [ok*], [k*].
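Extracting those ratios from a word-frequency list is mechanical enough; a sketch (the example counts in the test reuse the [taiin] figures quoted later in the thread):

```python
def prefix_totals(freqs, g):
    """Token totals for the [qo+g...], [o+g...], and bare [g...] word
    families, given a word -> count mapping and a gallows glyph g."""
    qo = sum(n for w, n in freqs.items() if w.startswith("qo" + g))
    o = sum(n for w, n in freqs.items() if w.startswith("o" + g))
    bare = sum(n for w, n in freqs.items() if w.startswith(g))
    return qo, o, bare
```

Running this for g = [t] and g = [k] over the full lexicon would give the triplet ratios directly, ready for a significance test.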
(10-11-2022, 04:49 PM)Emma May Smith Wrote: The general finding was that the relationship/ratios between the counts of triplet words like [qotedy], [otedy], [tedy] were patterned according to the initial glyph of the third word. So that all triplets like [qot*], [ot*], [t*] would share similar ratios, and that they would differ from triplets like [qok*], [ok*], [k*].

Interesting -- so, contrastive pairs of triplets such as these?

[qotaiin] 76  [otaiin] 136  [taiin] 33
[qokaiin] 259  [okaiin] 199  [kaiin] 43

[qotain] 53  [otain] 89  [tain] 8
[qokain] 255  [okain] 186  [kain] 24

[qotal] 55  [otal] 104  [tal] 12
[qokal] 181  [okal] 116  [kal] 12

[qotar] 57  [otar] 99  [tar] 29
[qokar] 140  [okar] 107  [kar] 39

Where the [qot*]:[ot*] ratio seems on average to be something like 1:2 and the [qok*]:[ok*] ratio something like 4:3?
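For what it's worth, summing the four contrastive pairs above confirms those ballpark figures:

```python
# Totals over the four [t]/[k] pairs listed above:
qot = 76 + 53 + 55 + 57      # qotaiin + qotain + qotal + qotar = 241
ot = 136 + 89 + 104 + 99     # otaiin + otain + otal + otar = 428
qok = 259 + 255 + 181 + 140  # = 835
ok = 199 + 186 + 116 + 107   # = 608

print(round(qot / ot, 2))  # 0.56, i.e. roughly 1:2
print(round(qok / ok, 2))  # 1.37, i.e. roughly 4:3
```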
Yes, that's the one. I think a few other glyphs (like [d] and [l]) have their own patterns too.

The kind of arguments that we can make from these triplets are pretty substantial, if they can be reliably proven.
(10-11-2022, 06:50 PM)Emma May Smith Wrote: Yes, that's the one. I think a few other glyphs (like [d] and [l]) have their own patterns too.

The kind of arguments that we can make from these triplets are pretty substantial, if they can be reliably proven.

I'm not sure if this is what you have in mind, but I have one script that's designed to find all pairs of words that vary in some consistent way (e.g., beginning [ok] versus [qok]) and to provide statistics about the pairs side by side for comparison.  It would have to be modified a bit to handle triplets all at once, but maybe it can contribute something even in its current form.  Right now it's set up to use the ZL transcription and to ignore comma breaks, and limited to paragraphic text, though it would be easy to change those settings.
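Not the actual script, but a minimal stand-in for the pair-matching step might look like this (it pairs words differing only by a given prefix and sorts by combined token count):

```python
from collections import Counter

def contrastive_pairs(tokens, prefix_a, prefix_b):
    """Find all word pairs whose spellings differ only in having
    prefix_a versus prefix_b (e.g. [ok*] vs [qok*]), and report
    their token counts side by side, most frequent pairs first."""
    counts = Counter(tokens)
    pairs = []
    for word, n in counts.items():
        if word.startswith(prefix_a):
            other = prefix_b + word[len(prefix_a):]
            if other in counts:
                pairs.append((word, n, other, counts[other]))
    pairs.sort(key=lambda p: p[1] + p[3], reverse=True)
    return pairs
```

Words with no counterpart under the other prefix are simply dropped, matching the "at least one token of each" condition.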

This may all be reinventing the wheel, but here are a few quick results, for whatever they may be worth:

Among word pairs with at least one [qot*] token and one [ot*] token:
 - The [qot*] group has 931 total tokens, and the [ot*] group has 1474 total tokens (ratio: 0.63 : 1).
 - In the top ten word pairs by frequency (#1-#10), the [qot*] group has 603 total tokens, and the [ot*] group has 938 total tokens (ratio: 0.64 : 1).
 - In the next lower group of ten word pairs by frequency (#11-#20), the [qot*] group has 154 total tokens, and the [ot*] group has 282 total tokens (ratio: 0.55 : 1).
 - Among all those twenty word pairs, there's only a single word pair where the [qot*] word has more tokens than the [ot*] word: [qotchy] 60 vs. [otchy] 40, in position #10.

Meanwhile, among word pairs with at least one [qok*] token and one [ok*] token:
 - The [qok*] group has 2761 total tokens, and the [ok*] group has 1754 total tokens (ratio: 1.57 : 1)
 - In the top ten word pairs by frequency, the [qok*] group has 1993 total tokens, and the [ok*] group has 1089 total tokens (ratio: 1.83 : 1).
 - In the next lower group of ten word pairs by frequency, the [qok*] group has 337 total tokens, and the [ok*] group has 264 total tokens (ratio: 1.28 : 1).
 - Among all those twenty word pairs, there are only three word pairs in which the [qok*] word has fewer tokens than the [ok*] word:  [okeol] 57 vs. [qokeol] 46 (position #11); [okchey] 31 vs. [qokchey] 24 (position #15); and [okam] 26 vs. [qokam] 22 (position #19).

The individual word-pair ratios vary a bit: the top five word pairs by frequency for [qot*]:[ot*] yield 0.64, 0.56, 0.78, 0.53, 0.58, and the top five word pairs by frequency for [qok*]:[ok*] yield 1.30, 1.97, 2.94, 2.02, 2.56.  But the former definitely tend to be well under 1, while the latter definitely tend to be well above 1.
(10-11-2022, 08:33 PM)pfeaster Wrote: use the ZL transcription and to ignore comma breaks, and limited to paragraphic text, though it would be easy to change those settings.

Are you using "ivtt" to make these choices? If not, you may find that it is in fact trivially easy to change these using that tool.
(10-11-2022, 08:33 PM)pfeaster Wrote: The individual word-pair ratios vary a bit: the top five word pairs by frequency for [qot*]:[ot*] yield 0.64, 0.56, 0.78, 0.53, 0.58, and the top five word pairs by frequency for [qok*]:[ok*] yield 1.30, 1.97, 2.94, 2.02, 2.56.  But the former definitely tend to be well under 1, while the latter definitely tend to be well above 1.

Thanks Patrick, those seem to fit well.