The Voynich Ninja

Full Version: Vord paradigm tool
The slot-based word morphology approach does a good job of exposing consistent patterns that underlie a majority of word forms, while the "network" approach does a good job of predicting word frequencies based on edit distance from the [ol], [chedy], and [daiin] prototypes.

I'm not sure either of those approaches has a clear advantage over a model based on a transitional probability matrix giving which glyph will follow a given glyph or sequence of glyphs, without regard to its position within a word, and in which spaces are inserted into a continuous stream of text according to largely consistent patterns.

Torsten's [ol], [chedy], and [daiin] prototypes correspond rather closely to closed loops [cholcholchol...], [qokeedyqokeedyqokeedy...], and [daiindaiindaiin...], made up of the highest-probability sequences of individual graphemes in particular parts of the manuscript, such that if transitional probability matrices have high predictive power in themselves, we should also expect to see word frequencies correlate with "edit distance" as Torsten has found -- though not exactly, since the words [ol] and [chedy] will often represent partial cycles in contexts such as [~r.ol~] and [~l.chedy~].  So I think it might be possible to predict the "network" patterns on the basis of transitional probability matrices, with no need to factor in "edit distance" as such.  Perhaps we could think of a statistical test that would yield a different result depending on which model is more effective.
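
To make the closed-loop idea concrete, here is a rough sketch of pulling such a loop out of a transition table by always following the most probable successor. The trans mapping and the tokenization into glyph states are assumed as given; with single-glyph states a loop like [chol] can emerge, while [daiin] needs at least bigram states to carry the [ii] context.

Code:
def greedy_loop(trans, start):
    # Follow the most probable successor from `start` until a state repeats,
    # then return the repeating cycle.  `trans` maps state -> {next: prob};
    # states may be single glyphs, or glyph bigrams for loops like [daiin].
    path, seen = [start], {start: 0}
    while True:
        state = max(trans[path[-1]], key=trans[path[-1]].get)
        if state in seen:
            return path[seen[state]:]          # the closed loop
        seen[state] = len(path)
        path.append(state)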

I'm less sure about transitional probability matrices being able to account for the patterns exposed by slot-based word morphologies.  But I wouldn't rule out that they could.

A while ago I calculated matrices for each bigram in Currier B (e.g., what the next single glyph would be after each pair of glyphs, such as [k] after [qo]) and generated some text randomly based on them.  Of course I had to make some working assumptions about what a "glyph" is, all of which are open to challenge, and I don't think analysis of preceding bigrams goes deep enough.  But the results, which I may have quoted here before, came out looking like this (with spaces inserted between any two glyphs that are more often separated by a space than not):

[ol.qokeodar.ar.okaiin.Shkchedy.Shdal.qotam.ytol.dal.cheokeedy.chkal.Shedy.qokair.odain.al.ol.daiin.cheal.qokeeey.lkain.chcPhedy.kchdy.cheey.otar.cheor.aiin.Shedy.dal.dochey.opchol.okchy.Sheoar.ol.oeey.otcheol.dy.chShy.lkar.ain.okchedy.l.chkedy.oteedar.ShecKhey.okaiin.chor.olteodar.okal.qokeShedy.ol.ol.Sheey.kain.cheky.chey.chol.chedy]

I *think* most of these words would conform to most proposed Voynichese word morphologies (including this new one), but they were generated without reference to any slot system.  Any apparent word structure that appears here is a byproduct of the transitional probability matrices operating freely.
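
For anyone who wants to replicate this, here is a rough sketch of the procedure: the glyph tokenization is assumed as given, and space_frac[(a, b)] is taken to be the fraction of times the pair a, b is separated by a space in the source.

Code:
import random
from collections import Counter, defaultdict

def train_bigram_model(glyphs):
    # `glyphs` is the source text pre-tokenized into glyphs, spaces removed.
    nxt = defaultdict(Counter)
    for a, b, c in zip(glyphs, glyphs[1:], glyphs[2:]):
        nxt[(a, b)][c] += 1                     # next glyph after each bigram
    return nxt

def generate(nxt, seed, length, space_frac):
    out = list(seed)                            # seed = two starting glyphs
    for _ in range(length):
        counts = nxt[tuple(out[-2:])]
        out.append(random.choices(list(counts), list(counts.values()))[0])
    text = out[0]                               # insert '.' wherever a space
    for a, b in zip(out, out[1:]):              # is more likely than not
        text += ('.' if space_frac.get((a, b), 0) > 0.5 else '') + b
    return text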
(04-11-2022, 04:21 PM)pfeaster Wrote: I'm less sure about transitional probability matrices being able to account for the patterns exposed by slot-based word morphologies.  But I wouldn't rule out that they could.

A while ago I calculated matrices for each bigram in Currier B (e.g., what the next single glyph would be after each pair of glyphs, such as [k] after [qo]) and generated some text randomly based on them.  Of course I had to make some working assumptions about what a "glyph" is, all of which are open to challenge, and I don't think analysis of preceding bigrams goes deep enough.  But the results, which I may have quoted here before, came out looking like this (with spaces inserted between any two glyphs that are more often separated by a space than not):
To analyse the bigrams within vords, I took data from [link] and made the following matrix:
[attachment=6920]
I expanded the analysis to some of the most common bigrams in search of common trigrams, with the idea of recreating a system based on monograms, bigrams, and trigrams, and in the future also testing the possibility of a homophonic cipher based on them, but that is off-topic for this thread.
The matrix gives a global view, and it is easy to see the most common bigrams and the preference of some glyphs for the first or the second position in a bigram ...
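
A small sketch of how such a matrix can be tabulated, with vords assumed to be already split into glyphs under some chosen tokenization; the printing is just for a quick look.

Code:
from collections import Counter

def bigram_matrix(vords):
    # Count adjacent glyph pairs inside each vord and print them as a matrix:
    # rows are the first glyph of the pair, columns the second.
    counts = Counter()
    for vord in vords:
        counts.update(zip(vord, vord[1:]))
    glyphs = sorted({g for pair in counts for g in pair})
    print(''.join(f'{g:>6}' for g in [''] + glyphs))
    for a in glyphs:
        print(''.join(f'{x:>6}' for x in [a] + [counts[(a, b)] for b in glyphs]))
    return counts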
(04-11-2022, 01:05 PM)pfeaster Wrote:
(03-11-2022, 11:36 PM)Hermes777 Wrote: The sequence is continuous because:

Qokeedy – qokaiin = the word break is [y.q]

Qokaiin – olkaiin = the word break is [n.o]

Olkaiin – olkeedy = the word break is [n.o]

Thus: qokeedy.qokaiin.olkaiin.olkeedy. And so on. The line of least resistance.

Do you see evidence of a sequence of those four forms recurring continuously in that specific order, or could a model like the following one also express the pattern you have in mind?

Yes, thanks for the suggestion. I will explore along those lines. As you write, a vord paradigm is useful for "exposing consistent patterns that underlie a majority of word forms" and that is really all I expect the tool to do.
(04-11-2022, 04:21 PM)pfeaster Wrote: I'm less sure about transitional probability matrices being able to account for the patterns exposed by slot-based word morphologies.  But I wouldn't rule out that they could.

Adding to this digression, here is text generated by sampling the fourth-order correlation matrix (each succeeding character selected according to the preceding trigram):

ch saiin Shekchdy qokaiin chol qokedy chol oteedy qokar dar am daiin ar Shaiin chedy qokaiin tolor chdy rchdy otedy orair al lkaiin omolShey chey lkeey olkear oly dal lkair opar ytar qoty dar otar aiin odaiin chdy chedy otchorol chl karaly qokal qoteey dchedeey qokain ol Shedyo daiin chey raiin Shos fShedy okain She otaiil olalchkalches aiis alchedy qoteedair otaiin darorar qokeedy pychedy r al cheedy okain okain checKhy oltal otaiin otar oly tchdar oteey dal chedaiin okaly qoky lkaiin aldShedy qotaiin todaiin otedy opchedy laiikam sain sar lkeedy dalom qoky chedy okeey chedy roiin oldy qokam yShedy qokeody qokal keody teey oky okam chedy qokeedy qotol qokeedy qoetchy qotain chor ol olar otchdy tcheokedeedy tary ypchedy okeol cPhy tar ar chor chedy okolkeey Shedy al Shedy otaiin an cheey lShlchep dal toy Shek eeo lol chody daiin qokedy Sheey chd okal okaoy rSheody chear okey rody tar Shdy okeey chcKhedy dal chekeedy lcheol qol qokedal orain otaiin cheo r air cheokeeos aiin ol cheey qoto

The source was Currier B paragraph text from the Takahashi IT transcription (140204 tokens).  There is no separate rule for spaces; they are treated as part of the unmodified EVA set.  Already at this level of correlation, preferred word-break combinations seem to be emerging (y.q, n.ch, etc.).  Only 2/167 of the generated words contain two gallows.  Fully 142/167 of the word tokens are present verbatim in the source.  The remainders are

Shekchdy, tolor, rchdy, omolShey, olkear, otchorol, karaly, dchedeey, Shedyo, otaiil, olalchkalches, qoteedair, darorar, pychedy, tchdar, aldShedy, laiikam, qoetchy, tcheokedeedy, tary, okolkeey, lShlchep, okaoy, rSheody, qoto

...but every component 4-mer must exist somewhere in the source in order to have an entry in the correlation matrix.  So as the correlation depth approaches the typical word length, the slot-based generalizations will become valid.  With increasing order, the matrices become increasingly sparse, until sampling the 140204-th order matrix yields the unique complete source text.  In this rather dry sense, the higher-order matrices descriptively account for all patterns, and provide no explanations.
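
The 4-mer claim is easy to check directly; a minimal sketch, treating the source and generated text as plain EVA strings with spaces included, matching the setup described here:

Code:
def unattested_kmers(source, generated, k=4):
    # Every 4-mer of the generated text should occur in the source, because
    # only attested trigram -> next-character transitions have matrix entries.
    source_kmers = {source[i:i + k] for i in range(len(source) - k + 1)}
    return [generated[i:i + k] for i in range(len(generated) - k + 1)
            if generated[i:i + k] not in source_kmers]

# An empty result confirms the claim for a given generated sample.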
I am not sure I understand correctly, but I guess that, to reproduce the whole MS, the matrix for N-grams with N = (length of the longest repeating sequence) + 1 should be enough?
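
One way to make that precise: if the longest repeated substring has length L, then every context of length L + 1 occurs exactly once and so has a unique continuation, making the sampling deterministic. A sketch that finds L by binary search over sets of k-mers (fast enough for ~140k characters):

Code:
def longest_repeat(text):
    # Largest k such that some k-gram occurs at least twice in `text`.
    def has_repeat(k):
        seen = set()
        for i in range(len(text) - k + 1):
            kmer = text[i:i + k]
            if kmer in seen:
                return True
            seen.add(kmer)
        return False
    lo, hi = 0, len(text) - 1
    while lo < hi:                  # binary search: has_repeat is monotone in k
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo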
(05-11-2022, 09:18 PM)obelus Wrote: Fully 142/167 of the word tokens are present verbatim in the source.  The remainders are

Shekchdy, tolor, rchdy, omolShey, olkear, otchorol, karaly, dchedeey, Shedyo, otaiil, olalchkalches, qoteedair, darorar, pychedy, tchdar, aldShedy, laiikam, qoetchy, tcheokedeedy, tary, okolkeey, lShlchep, okaoy, rSheody, qoto

The parameters of your experiment were slightly different from mine, apart from your use of fourth-order rather than third-order analysis: I used the "ZL" transcription instead of Takahashi, ignored spaces for calculating matrices, and then inserted spaces only according to whether a given sequence of two glyphs has a space more often than not.  But I'd guess these differences shouldn't affect the results too much.

I just checked to see how many of the word tokens in my third-order sample are present verbatim in the source, and came up with 52/57 (or 53/57 if we count [sheoar], which turns up as a string but not a separate word).  The remainders, other than [sheoar], are:

qokeodar, cheokeedy, dochey, olteodar 

My 52/57 is about 91%, while obelus's 142/167 is about 85%.  I believe the percentage of hapax legomena in "real" Voynichese is about 18%, so a typical excerpt of "real" Voynichese text should only score around 82% at best when compared against the rest of the text.
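
The score being compared here is simple enough to state as code; a minimal sketch, assuming the generated and source texts are already split into vord-token lists:

Code:
def attested_fraction(generated_tokens, source_tokens):
    # Fraction of generated word tokens that occur verbatim in the source.
    vocab = set(source_tokens)
    return sum(t in vocab for t in generated_tokens) / len(generated_tokens)

# 52/57 is about 0.91 and 142/167 about 0.85, against the roughly 0.82
# expected for a genuine excerpt if about 18% of "real" word tokens are
# hapax legomena.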

I'm not sure why the percentage goes down from my third-order sample to obelus's fourth-order sample -- I'd have guessed it would be the other way around.  But it seems that moving beyond these orders for experiments of this kind is likely to yield diminishing returns, as far as simply generating attested words. 

(05-11-2022, 09:18 PM)obelus Wrote: In this rather dry sense, the higher-order matrices descriptively account for all patterns, and provide no explanations.

Like any raw data, they need analysis and interpretation.  Examples of the kind of "rule" we might infer from them could look something like this (ignoring spaces):
  • In Currier B, a glyph preceded by e is more likely than usual to be followed by y.
  • In Currier B, d is more likely to be followed by a than by y if it is preceded by y, r, p, n, or l.
  • t is more likely to occur if there's another t four, five, or six places before it.

Maybe some such "rules" reflect the ability of glyphs to occupy different word slots.  But maybe it's the other way around, and apparent word structure emerges from the application of these "rules."  I'm not sure how best to tell the difference.
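
Such "rules" can be tested directly against the glyph stream. A sketch for the first rule above, comparing the conditional probability with the baseline (glyphs is the Currier B text as a glyph list, spaces ignored as stated above):

Code:
from collections import Counter

def test_e_skip_y(glyphs, context='e', target='y'):
    # Distribution of the glyph two places after each occurrence of `context`.
    after = Counter(c for a, _, c in zip(glyphs, glyphs[1:], glyphs[2:])
                    if a == context)
    conditional = after[target] / sum(after.values())
    baseline = glyphs.count(target) / len(glyphs)
    return conditional, baseline   # the rule holds if conditional >> baseline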
(07-11-2022, 03:46 PM)pfeaster Wrote: I'm not sure why the percentage goes down from my third-order sample to obelus's fourth-order sample -- I'd have guessed it would be the other way around.

Possibly because you inserted spaces where they are more likely to be than not. Also, the words tolor and qoto do exist.

I did something similar with spaces, line-start and line-end symbols: starting with a random line-start trigram, then matching the last two EVA letters to pick the next trigram randomly. The generated text has far too many long words:

ol qokair shar otol chotedy teey oeedy qokarol sar rosho lcheeal oty qotchey teodol dar ot chkcheaiin ykeey lchedy or okaiin olp okcholchey chey kchor solkain or chol chody sheeos chetsheolkain dypchekchykaiin choreal or chedy

Here, for example, chetsheolkain would match actual words with one or two inserted spaces: chet sheolkain (or sheol kain).
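
Roughly, the procedure looks like this (a sketch; the '^' and '$' line-start/line-end markers are placeholders of my own, not EVA):

Code:
import random
from collections import defaultdict

def chain_line(lines):
    trigrams = defaultdict(list)   # trigrams indexed by their first two chars
    starts = []
    for line in lines:
        s = '^' + line + '$'
        for i in range(len(s) - 2):
            tri = s[i:i + 3]
            trigrams[tri[:2]].append(tri)
            if tri[0] == '^':
                starts.append(tri)
    out = random.choice(starts)            # begin with a line-start trigram
    while out[-1] != '$':
        out += random.choice(trigrams[out[-2:]])[2]   # match last two letters
    return out[1:-1]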
(04-11-2022, 04:21 PM)pfeaster Wrote: Torsten's [ol], [chedy], and [daiin] prototypes correspond rather closely to closed loops [cholcholchol...], [qokeedyqokeedyqokeedy...], and [daiindaiindaiin...], made up of the highest-probability sequences of individual graphemes in particular parts of the manuscript, such that if transitional probability matrices have high predictive power in themselves, we should also expect to see word frequencies correlate with "edit distance" as Torsten has found -- though not exactly, since the words [ol] and [chedy] will often represent partial cycles in contexts such as [~r.ol~] and [~l.chedy~].  So I think it might be possible to predict the "network" patterns on the basis of transitional probability matrices, with no need to factor in "edit distance" as such.  Perhaps we could think of a statistical test that would yield a different result depending on which model is more effective.

The reason that the word frequencies correlate is that "all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear" [Timm & Schinner 2019, p. 3]. This means it is also possible to predict the occurrence of a word. A good example of this observation is the change from Currier A to Currier B, i.e. the shift from <chol>-dominated to <chedy>-dominated text (see the linked thread): if <chedy> is used more frequently, this also increases the frequency of similar words, like <shedy> or <qokeedy>. In your blog post "Rightward and Downward in the Voynich Manuscript" you also describe some examples of the effect that the occurrence of a glyph like <sh> or <ch> increases the tendency towards another occurrence. In this way the word frequencies confirm that a deep correlation between similar words exists (see also the examples I gave in this thread).
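
For reference, the mutual similarity in question can be made concrete with plain Levenshtein distance; a sketch listing vocabulary items within edit distance 1 of a prototype such as <chedy> (distance measured on raw EVA characters, which only approximates glyph-level distance):

Code:
def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def neighbors(word, vocabulary, d=1):
    return [w for w in vocabulary if 0 < levenshtein(word, w) <= d]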
(07-11-2022, 03:46 PM)pfeaster Wrote: Like any raw data, they need analysis and interpretation.  Examples of the kind of "rule" we might infer from them could look something like this (ignoring spaces):
  • In Currier B, a glyph preceded by e is more likely than usual to be followed by y.
  • In Currier B, d is more likely to be followed by a than by y if it is preceded by y, r, p, n, or l.

Quite right, it was cranky of me to fault a descriptive data structure for not delivering concise heuristics.  Patterns emerge as soon as we visualize the raw data in light of a good question, such as the relationships between a glyph's nearest neighbors quoted above.  The pair correlations for a generic glyph's preceding and following neighbors can be calculated from the third-order matrix by aggregating over the second coordinate (summing entries for the middle glyph).  The resulting probabilities for Currier B paragraph text (with spaces) add up to unity, as they must:

[attachment=6937]

Rows are labelled by preceding character, columns by following character, in order of decreasing frequency from the upper left.  Pfeaster's e?y rule is a prominently dark, high-probability element, and we can read off others relating to word boundaries:
  • A character preceded by d is likely to be followed by a space (true enough)
  • A character preceded by a space is likely to be followed by o or h (makes sense)
The general clustering of probability toward the upper left is due in part to the greater density of high-frequency characters per length of text, which increases their chance associations.  This bias is normally removed by dividing each element by its probability in the absence of correlations.  The values then represent the factor by which each pair probability deviates from that of a scrambled text:

[attachment=6938]

The e?y rule remains, occurring 4.8 times more often than expected by chance.  It and other deviants (q?k, i?n, y?q) are mostly driven by the highest-frequency trigrams
     dy_     _qo     edy     in_     che     _ch     y_q     ey_     qok     aii     iin
Impressive values in the lower right-hand corner are statistical flukes due to low character frequency:  m?s is represented by exactly two matches in the text sample, odam s and pcheam sokedy.

Since the baseline frequency factor (unity) appears deep in the blue end, only Russian speakers (or a logarithmic scale) will be able to pick out pairs that avoid the leapfrog relationship.  It looks as if the high-frequency glyph i drives away most over-next-door neighbors.
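
For concreteness, a sketch of the aggregation and normalization described above, assuming T is the third-order count array T[a, m, b] (preceding, middle, following glyph):

Code:
import numpy as np

def skip_pair_factors(T):
    pair = T.sum(axis=1).astype(float)   # aggregate over the middle glyph
    p = pair / pair.sum()                # joint probability P(a, b)
    pa = p.sum(axis=1, keepdims=True)    # marginal of the preceding glyph
    pb = p.sum(axis=0, keepdims=True)    # marginal of the following glyph
    return p / (pa * pb)                 # deviation factor vs scrambled text

A factor of 1 is chance level; per the figures, the e?y element comes out around 4.8, and a logarithmic color scale would make the sub-unity (avoidance) pairs easier to pick out.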

So... once we have framed a rule scheme, we can certainly search for instances in the data, and unexpected ones may turn up.  But can we glean new schemes from the data by some "hypothesis free" method other than trial and error?

(P.S. to an earlier post:  Checking a 100x larger sample of fourth-order text, the fraction of 'correct' word tokens converges to 86%.  tolor and qoto only appear in Currier A, right?  Thus they are among the uncounted.)
I'm currently rechecking my model for predictions/rules. Would you be interested in testing them?

(I don't want to start posting them unless they're welcome. I have a blog for that.)