The Voynich Ninja - descent with modification


Many thanks to the presenters last Sunday. Not only did they deliver new results, but subsequent discussion on the forum has been stimulating. For example, Torsten crisply summarized a prediction that follows from the self-citation hypothesis:

(09-08-2024, 03:36 PM)Torsten Wrote: ...the scribe might introduce new spelling variants. For example, he could decide to add [aiin] alongside [daiin]. This change would affect only the text generated after [aiin] was introduced, leading to observable developments in the manuscript. Decide for yourself whether the patterns observed in the Voynich text align with this description.

OK, let us attempt to decide on quantitative grounds.

The minimal model is a scribe working page by page from top down. Self-citation would begin with the top line of text, and introduce variants as new lines are generated out of words visible above. Therefore the text is predicted to deviate further and further from the first line as we advance down the page. Space-delimited "words" are the units of composition in this picture. So here is a first-pancake metric of wordwise similarity between lines:

Take each word of a non-page-initial line, and compute its minimum possible edit distance to a word in the initial line. (It is to be hoped that this minimal-edit selection correlates somehow, retrospectively, with the scribal method.) Calculate the Mean Minimum Edit Distance for the words in that line. In some rough sense the MMED score captures how directly the collective of words can be derived from the initial line. As we proceed down the page, more and more mutated versions of the first-line words are visible to the scribe, so the compounded mutations are expected to increase the MMED score as a function of line rank.
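The MMED definition above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the plots: it assumes a plain unit-cost Levenshtein distance and whitespace-delimited words, and the function names `edit_distance` and `mmed` are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def mmed(line: str, first_line: str) -> float:
    """Mean, over the words of `line`, of the minimum edit distance
    to any word of `first_line` (assumed reading of the definition)."""
    words = line.split()
    targets = first_line.split()
    if not words or not targets:
        raise ValueError("both lines must contain at least one word")
    return sum(min(edit_distance(w, t) for t in targets) for w in words) / len(words)


# Toy example: three words one edit away from a first-line word, one word three edits away.
print(mmed("aiin shedy okeedy chol", "daiin chedy qokeedy"))  # 1.5
```

In the toy call, [aiin], [shedy] and [okeedy] each sit one edit from a first-line word and [chol] sits three edits from [chedy], so the mean is (1 + 1 + 1 + 3) / 4.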

Happily, Torsten has posted [link not shown], which implements self citation. By chopping the generated text into 75 pseudo-pages of 16 lines each, we can approximate the bulk layout of a vms sample (below). The statistical traces of pagewise self-citation, if present, should manifest on each page independently. Therefore we stack all of the individual-page results together with a co-average representing each line. In the plot below, for example, the point at rank 2 represents an average of all 75 second lines' MMED scores relative to their own first lines, and so on:

generated_text.txt
[attachment=9014]
This plot is not saying anything new or interesting about the text generator; it serves merely as validation of the MMED measure, showing that it can pick up a macro property that emerges from the word-by-word generation algorithm. The farther we progress down each page, the more the words deviate from their first-line exemplars. Open markers show the same calculation performed with words randomly shuffled among the available positions, in which case no trend is expected or observed. MMED values appear to saturate as more than 15 lines are included.
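For readers who want to try the pagination, co-averaging and shuffled control on their own text, here is a rough sketch. It is not the code behind the plots. The helper names (`paginate`, `mmed_by_rank`, `shuffle_words`) are invented here, blank lines are assumed to have been removed beforehand, and the control is assumed to permute all words of a page, first line included, among the existing word positions.

```python
import random
from statistics import mean


def edit_distance(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mmed(line, first_line):
    """Mean minimum edit distance of the words in `line` to the words in `first_line`."""
    words, targets = line.split(), first_line.split()
    return sum(min(edit_distance(w, t) for t in targets) for w in words) / len(words)


def paginate(lines, lines_per_page):
    """Cut a list of non-empty text lines into full pseudo-pages; leftover lines are dropped."""
    n_pages = len(lines) // lines_per_page
    return [lines[p * lines_per_page:(p + 1) * lines_per_page] for p in range(n_pages)]


def mmed_by_rank(pages):
    """Co-average: element k is the mean MMED of line rank k + 2 over all pages."""
    n_lines = min(len(page) for page in pages)
    return [mean(mmed(page[r], page[0]) for page in pages) for r in range(1, n_lines)]


def shuffle_words(page, rng):
    """Control: permute the words of one page while keeping each line's word count."""
    words = [w for line in page for w in line.split()]
    rng.shuffle(words)
    shuffled, start = [], 0
    for line in page:
        count = len(line.split())
        shuffled.append(" ".join(words[start:start + count]))
        start += count
    return shuffled
```

With `lines` holding the generated text, `mmed_by_rank(paginate(lines, 16))` gives one point per line rank from 2 to 16, and applying `shuffle_words(page, random.Random(0))` to every page first gives the control series.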

Finally, what the crowd paid to see: We repeat the analysis with paragraph text from Takahashi IT2a-n.txt, using 84 pages that contain at least 16 lines.

IT2a-n.txt
[attachment=9013]
Oh well... I have not yet decided for myself whether the patterns align. The greater noise present in the real text might just obscure a trend of the magnitude seen in the synthetic text.

One way forward would be to refine the line-comparison function, in hopes of increasing its sensitivity, decreasing the noise, or accounting for reference lines other than the page-initial one.

Another is to observe the Perseids from a dark location; at mine the radiant is just now rising.


I think that, for comparison, it would be great to have this graph for a text in Latin, or any other highly inflected language. I have no idea what the baseline is for MMED.

I believe that the mean edit distance should increase at a declining rate as changes accumulate, up to a distance of "equal change likelihood". Individual glyphs may "revert" closer to the original, but so long as the average edit distance is low, any change is more likely to be away from the original than toward it. Only once words are more distinct from the initial state will a random change be as likely to lower the edit distance as to raise it. (I guess that the word patterns are more complex than the pages are long, so that point would never(?) be reached in practice.)

So changing [qokeedy] to [okeedy] would give an edit distance of one. But given the number of possible further changes to [okeedy], all but one---the addition of [q]---will increase the edit distance further. It's not until we end up with a word like [ytcheo] that we can genuinely say that any possible change will lower the edit distance. But the mean will hover around an equal edit distance word like [otchody] (for example), where given changes might have a 50/50 chance to raise/lower the edit distance.

If that is right, then we shouldn't see large or continuous drops in edit distance, but only occasional blips in an upward trend until the "equal change likelihood" plateau. The first graph shows the kind of trend possible, but the second graph doesn't. Even were the second graph to show an overall increase in variation over lines, the pattern of leaps and falls in variation doesn't support the proposed method. (Though note that the scale is already higher, so it could represent a plateau. But that itself would invalidate the hypothesis, which requires an initial sustained increase.)

Maybe I have it wrong though, so I would be happy to be corrected. The only possible excuse is that the esthetic preference of the scribe somehow controls edit distance in a way which cannot be modeled.
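The "equal change likelihood" argument can be probed with a toy random walk. This is only a sketch under loud assumptions: the glyph inventory `GLYPHS` is an arbitrary illustrative alphabet, every step is a uniformly random single-glyph substitution, insertion or deletion, and none of this is the self-citation algorithm itself. It records the mean edit distance to the starting word after each step, so the shape of the curve can be inspected.

```python
import random

GLYPHS = "acdehiklnoqrsty"  # illustrative glyph inventory (an assumption)


def edit_distance(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mutate(word, rng):
    """Apply one random single-glyph edit; never delete the last remaining glyph."""
    ops = ("sub", "ins", "del") if len(word) > 1 else ("sub", "ins")
    op = rng.choice(ops)
    if op == "sub":
        i = rng.randrange(len(word))
        glyph = rng.choice([g for g in GLYPHS if g != word[i]])
        return word[:i] + glyph + word[i + 1:]
    if op == "ins":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(GLYPHS) + word[i:]
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def mean_drift(start, steps, trials, seed=0):
    """Mean edit distance to `start` after 0..steps successive random edits."""
    rng = random.Random(seed)
    totals = [0] * (steps + 1)
    for _ in range(trials):
        word = start
        for s in range(1, steps + 1):
            word = mutate(word, rng)
            totals[s] += edit_distance(word, start)
    return [t / trials for t in totals]


curve = mean_drift("qokeedy", steps=40, trials=200)
print([round(v, 2) for v in curve[:10]])
```

The first step always lands at distance 1, and by the triangle inequality the distance after s steps can never exceed s. Whether and where the curve flattens depends on the mutation model assumed here.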

(12-08-2024, 08:03 AM)obelus Wrote:
(09-08-2024, 03:36 PM)Torsten Wrote: ...the scribe might introduce new spelling variants. For example, he could decide to add [aiin] alongside [daiin]. This change would affect only the text generated after [aiin] was introduced, leading to observable developments in the manuscript. Decide for yourself whether the patterns observed in the Voynich text align with this description.

OK, let us attempt to decide on quantitative grounds.

The minimal model is a scribe working page by page from top down.

My statement is about "developments in the manuscript" and not about a "development on a page". See for instance the distribution for the vords [link not shown] and [link not shown], or the distribution of vords containing [link not shown].

Note: "Of course, it is possible to pinpoint quantitative differences between the real VMS and the used facsimile text (most likely any facsimile text). An example is the quantitative deviation of the <q>-prefix distribution from the original VMS text. However, we are not aware of any statistical property of the VMS that qualitatively contradicts our proposed self-citation algorithm. We deliberately did not fine-tune the algorithm to pick an 'optimal' sample for this presentation. ... Keep in mind that the VMS was created by a human writer who had complete freedom to vary some details of the generating algorithm on the spur of a moment. An exact reproduction of all of his/her mental rules is not only most likely impossible, but would still leave the problem of unpredictable random (aesthetic) decisions." [Timm & Schinner, 2020, p. 15]

@Torsten:
My apologies for attributing to you a prediction that you did not make; and my thanks for planting a different idea. Would not self citation necessarily obey one-way causation of textual novelty—as you describe—but on other scales? On individual pages we have some confidence about the order of composition, and a scribe could not have cited words that they had not yet created. The MMED plot for your simulated text clearly shows this causal horizon progressing down the pseudo-pages. The trend would be baffling if observed in a conventional "linguistic" text.

Of course texts generated by human computers will contain various confounds, but their past will not have been affected by their future, so we potentially could find telltale indicators of citation-with-variation in their time-ordered output. My first-attempt plot for Takahashi pages is not definitive... but if a refined treatment (with a better autocorrelation function, more appropriate transcription, whatever) were to find a statistically significant slope, then I would be statistically significantly persuaded that you are on the right track.

I fully agree that the value of SelfCitationTextgenerator lies in the clarity of its function, not the verisimilitude of its output. That is precisely why we can attribute downpage MMED increase to self citation in this one case. An eighth-order correlation matrix generates impressive midline Voynichese, with adjacent-word phenomena and all, but is less useful as a research tool.

@Emma May Smith:
Your reasoning that word mutations must eventually converge on a statistically equilibrated vocabulary appears to be borne out by Torsten's simulated text. Parsing it into 45 pages of 26 lines each,

generated_text.txt
[attachment=9017]
MMED values for the first 15 lines are not exactly as before, since now fewer pages are co-averaged. Truly random scatter could be lowered by processing a larger sample, but the finite vms will require a sharper tool.

@oshfdk:
You are right to recommend that tools invented for the vms should be tested on non-mysterious samples; Torsten's procedurally simulated text seemed to be the most relevant last night. Although Early Modern English is not highly inflected, I do have the 1611 King James Bible suitably formatted. Treating each of the 89 Gospel chapters as a page, and computing MMED relative to their first verses,

1611 King James Gospels
[attachment=9018]
The word-scrambled results (open markers) give some impression of the inherent statistical noise. As expected (?) no trend is evident. The absolute values for a Finnish or classical Latin text will indeed be different. It is interesting to consider what kind of document, at what levels of organization, would deviate strongly from zero slope.

(13-08-2024, 12:17 PM)obelus Wrote: @Torsten:
My apologies for attributing to you a prediction that you did not make; and my thanks for planting a different idea. Would not self citation necessarily obey one-way causation of textual novelty—as you describe—but on other scales? On individual pages we have some confidence about the order of composition, and a scribe could not have cited words that they had not yet created. The MMED plot for your simulated text clearly shows this causal horizon progressing down the pseudo-pages. The trend would be baffling if observed in a conventional "linguistic" text.

Of course texts generated by human computers will contain various confounds, but their past will not have been affected by their future, so we potentially could find telltale indicators of citation-with-variation in their time-ordered output. My first-attempt plot for Takahashi pages is not definitive... but if a refined treatment (with a better autocorrelation function, more appropriate transcription, whatever) were to find a statistically significant slope, then I would be statistically significantly persuaded that you are on the right track.

Indeed, in a text generated by self-citation it must be possible to observe a gradual evolution over time.

"It is possible to distinguish Currier A and B based on frequency counts of tokens containing the sequence <ed>. The summary in Table 2 shows, e.g., that if <chedy> is used more frequently, this also increases the frequency of similar words, like <shedy> or <qokeedy> .... At the same time, also words using the prefix <qok-> are becoming more and more frequent, whereas words typical for Currier A like <chol> and <chor> vanish gradually. Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'. Since words typical for Currier A also exist in Currier B, but not the other way round, it is reasonable to assume that the order shown in Table 2 indeed represents the original sequence in which the sections of the VMS had been created" [Timm & Schinner 2020, p. 6].
[attachment=9019]

Another observation is that a word type tends to be frequently used only in conjunction with similar words. This occurs because a more frequently used word token is more often selected as a source word, leading to the generation of additional similar word tokens. As these similar tokens accumulate, the likelihood of the word being generated again increases, reinforcing the pattern. This is also observable for the Voynich text since "high-frequency tokens also tend to have high numbers of similar words. This is illustrated in greater detail in Figure 3: 'isolated' words usually appear just once in the entire VMS while the most frequent token [daiin] (836 occurrences) has 36 counterparts with edit distance 1" [Timm & Schinner 2020, p. 6].
[attachment=9020]
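The relation between token frequency and the number of similar word types can be tabulated with a short script. This is a sketch, not the procedure of Timm & Schinner: it assumes unit-cost Levenshtein distance, treats "similar" as edit distance exactly 1, and uses a brute-force comparison of all type pairs, which is slow but adequate for a vocabulary of a few thousand types. The name `neighbour_counts` is made up for this example.

```python
from collections import Counter


def edit_distance(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def neighbour_counts(tokens):
    """Map each word type to (token frequency, number of other types at distance 1)."""
    freq = Counter(tokens)
    types = list(freq)
    result = {}
    for w in types:
        neighbours = sum(
            1 for v in types
            if v != w
            and abs(len(v) - len(w)) <= 1   # cheap pre-filter: distance 1 changes length by at most 1
            and edit_distance(w, v) == 1
        )
        result[w] = (freq[w], neighbours)
    return result


sample = "daiin daiin daiin aiin dain saiin chol".split()
for word, (count, near) in sorted(neighbour_counts(sample).items()):
    print(word, count, near)
```

Plotting the second number against the first for a full transcription would give a scatter comparable in spirit to the figure cited above.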

(13-08-2024, 09:09 PM)Torsten Wrote: Indeed, in a text generated by self-citation it must be possible to observe a gradual evolution over time.

There appear to be three different / complementary processes proposed here:

1. Similar words co-occur near each other on an individual page because later tokens have been copied from earlier ones on the same page. This is the process obelus has been exploring with the "descent with modification" approach.

2. Similar words co-occur within groups of pages because tokens on destination pages have been copied from tokens on source pages (which in turn may mutually resemble each other because of process #1).

3. The process of making minor changes to words is asymmetrical, in the sense that changes in one direction are (or become) more probable than changes back in the other direction, and tend therefore to accumulate and become progressively more frequent:

(13-08-2024, 09:09 PM)Torsten Wrote: "It is possible to distinguish Currier A and B based on frequency counts of tokens containing the sequence <ed>. The summary in Table 2 shows, e.g., that if <chedy> is used more frequently, this also increases the frequency of similar words, like <shedy> or <qokeedy> .... At the same time, also words using the prefix <qok-> are becoming more and more frequent, whereas words typical for Currier A like <chol> and <chor> vanish gradually. Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'. Since words typical for Currier A also exist in Currier B, but not the other way round, it is reasonable to assume that the order shown in Table 2 indeed represents the original sequence in which the sections of the VMS had been created" [Timm & Schinner 2020, p. 6].

So in this particular case, the hypothesis seems to be that the writer (if only one) would at some point have begun (for subjective aesthetic reasons, consciously or unconsciously) substituting the previously absent sequence [ed] in place of other sequences when copying words, but not substituting other sequences for [ed] in turn, or at least not at anything like the same rate. Is that a correct summary?
Here's a chart of the relative percentages of [Ed], [Eo], and [Ey] by page of running text, sorted first by overall percentage of [Ed], and secondly (when there are no [Ed] present) by ratio of [Eo] to [Ey]. [E] is any quantity of [e].

[attachment=9043]
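A page ordering of this kind can be approximated as follows. This is a sketch, not the script behind the chart. It assumes that the "relative percentages" are shares among the three sequence counts, that [E] is matched by the regular expression `e+`, and that the secondary sort key is the [Eo] share of [Eo] plus [Ey] (used here instead of a raw ratio to avoid division by zero). The page names in the example are hypothetical.

```python
import re

PATTERNS = {
    "Ed": re.compile(r"e+d"),
    "Eo": re.compile(r"e+o"),
    "Ey": re.compile(r"e+y"),
}


def shares(page_text):
    """Percentage of each sequence among all [Ed]/[Eo]/[Ey] matches on a page."""
    counts = {name: len(rx.findall(page_text)) for name, rx in PATTERNS.items()}
    total = sum(counts.values())
    return {name: (100.0 * c / total if total else 0.0) for name, c in counts.items()}


def sort_pages(pages):
    """Order pages by [Ed] share, then by the [Eo] share of [Eo] + [Ey].

    `pages` maps a page name to its transliterated running text.
    """
    rows = [(name, shares(text)) for name, text in pages.items()]

    def key(row):
        s = row[1]
        eo_ey = s["Eo"] + s["Ey"]
        return (s["Ed"], s["Eo"] / eo_ey if eo_ey else 0.0)

    return sorted(rows, key=key)


# Hypothetical three-page example
demo = {
    "p1": "chey shey cheol",
    "p2": "chedy shedy chey",
    "p3": "cheol cheor chey qokeey",
}
for name, s in sort_pages(demo):
    print(name, {k: round(v, 1) for k, v in s.items()})
```

Any set of pages can be laid out along such a spectrum, so the ordering by itself does not establish a direction of change.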

A "gradual shift" from [Ey] to [Eo] looks as compelling here on the left as the "gradual shift" towards [Ed] does on the right, and seemingly "early" pages such as (link not shown) do get sorted towards the far left side of this chart. I suppose this sequence would be consistent with the whole of Currier A having been written (with the shift in preference over time from [Ey] to [Eo]), with the resulting pages then being taken out of order and used as sources for copying during a second stage, after the shift in preference to [Ed]. But the apparent [Ey]-to-[Eo] trajectory within Currier A doesn't continue seamlessly into Currier B.

(18-08-2024, 06:31 PM)pfeaster Wrote: There appear to be three different / complementary processes proposed here:

1. Similar words co-occur near each other on an individual page because later tokens have been copied from earlier ones on the same page. This is the process obelus has been exploring with the "descent with modification" approach.

2. Similar words co-occur within groups of pages because tokens on destination pages have been copied from tokens on source pages (which in turn may mutually resemble each other because of process #1).

3. The process of making minor changes to words is asymmetrical, in the sense that changes in one direction are (or become) more probable than changes back in the other direction, and tend therefore to accumulate and become progressively more frequent:

Using only recently written pages as a source in 2. would increase the cumulative effect, if there is any tendency to drift. This tendency would need to be asymmetrical, even if the transitions in both directions are possible. More probable additions than removals could maybe explain the asymmetry? For example: eey would evolve to edy more often than change back to ey.

Quote: But the apparent [Ey]-to-[Eo] trajectory within Currier A doesn't continue seamlessly into Currier B.

It would require a lot of reordering in Currier A to follow the trajectory: maybe there is no trajectory and the ratio varies randomly both in Currier A and Currier B? I don't know.


(18-08-2024, 07:30 PM)nablator Wrote: It would require a lot of reordering in Currier A to follow the trajectory: maybe there is no trajectory and the ratio varies randomly both in Currier A and Currier B? I don't know.

Nor do I -- I'm just suggesting that the evidence for a gradual [Ey] to [Eo] shift in Currier A is comparable to the evidence for a gradual shift towards [Ed] in Currier B. In both cases, we can arrange pages along a continuous spectrum, but what that means isn't necessarily clear.

I just modified my "Voynich Imager" script so that instead of analyzing paragraphs separately, it can now instead track the positions of features within whole pages (to test hypotheses about self-citation).

If [Ed] became more frequent over time because of the cumulative effects of asymmetrical changes made in its favor during copying, and if a significant number of words were typically sourced from earlier positions on the same page (as needed to explain the co-occurrence of similar words nearby each other), I believe we should expect to see the relative frequency of [Ed] increasing on average from the top of each page to the bottom of each page. At the top of the page, we should expect more words to be "first generation" copies from some earlier source which should (because earlier) have contained a smaller proportion of [Ed]; each will have had perhaps one opportunity to undergo a change. At the bottom of the page, we should instead expect a higher proportion of words to be "second generation" or later copies of recently re-copied words, and hence more likely to contain the favored change because they've had more opportunities to undergo it.

However, the relative frequency of [Ed] doesn't perceptibly increase, on average, as we move down a page (brighter = more prevalent):

[attachment=9047]

Within Currier B, if we plot [Ed] in red, [Ey] in blue, and [Eo] in green, we get this:

[attachment=9048]

The proportion of [Ed] relative to [Ey] and [Eo] doesn't increase, on average, as we move down the page.
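A numerical counterpart to the images above would be the share of [Ed] among [Ed], [Ey] and [Eo] at each line rank, pooled over pages, plus a least-squares slope. The following is a plain-text sketch, not the "Voynich Imager" script. It assumes each page is a list of transliterated lines, and the names `ed_share_by_rank` and `slope` are invented for this example.

```python
import re

ED = re.compile(r"e+d")
E_ANY = re.compile(r"e+[doy]")  # [Ed], [Eo] or [Ey]


def ed_share_by_rank(pages, n_ranks):
    """Pooled share of [Ed] among [Ed]/[Eo]/[Ey] for line ranks 1..n_ranks.

    Ranks with no matching sequence at all yield None.
    """
    hits = [0] * n_ranks
    totals = [0] * n_ranks
    for page in pages:
        for rank, line in enumerate(page[:n_ranks]):
            hits[rank] += len(ED.findall(line))
            totals[rank] += len(E_ANY.findall(line))
    return [h / t if t else None for h, t in zip(hits, totals)]


def slope(values):
    """Least-squares slope of the values against line rank (1-based), skipping None."""
    points = [(i + 1, v) for i, v in enumerate(values) if v is not None]
    if len(points) < 2:
        raise ValueError("need at least two ranks with data")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical two-page, two-line example
demo = [["chey chedy", "chedy shedy"], ["chey", "chedy chey"]]
series = ed_share_by_rank(demo, 2)
print(series, slope(series))
```

Under the expectation described above, the slope over real pages would be positive; a value indistinguishable from zero would match what the images suggest. Judging significance would need a control such as shuffling lines within pages, which this sketch leaves out.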

(18-08-2024, 06:31 PM)pfeaster Wrote: So in this particular case, the hypothesis seems to be that the writer (if only one) would at some point have begun (for subjective aesthetic reasons, consciously or unconsciously) substituting the previously absent sequence [ed] in place of other sequences when copying words, but not substituting other sequences for [ed] in turn, or at least not at anything like the same rate. Is that a correct summary?

This description only holds under two assumptions: (1) an infinite copy iteration, and (2) direct relations for [Ed], [Eo], and [Ey]. According to my theory, neither condition is fulfilled.

The first important point is that far more potential variations than just the changes between [Ed], [Eo], and [Ey] exist. For example, [chedy] could change to [chey], but also to [shey], [cheedy], [keedy], [tedy], and so on. The relation between words containing [Ed], [Eo], and [Ey] is therefore more complex.
Secondly, not every word containing [Eo] has similar counterparts containing [Ed] or [Ey]. There are, for instance, also words like [cheol], [cheor], [sheol], and [sheor], which are not directly related to vords like [chey], [chedy] or [cheody]. (Therefore the part for Currier A in your graph shows the shift from words like [chol/chor] to words like [cheol/cheor] within Currier A.)
Third, each page essentially starts fresh. Take folio 103r, for instance. The most common word on (link not shown) is [qokeey], but you also find variations like [okeey], [qokeedy], and [oteey] (and notably, [shedy], [shey], [chedy], and [chey] are also (link not shown)). However, this doesn’t imply that the same vords are frequently used on f103v. Instead, the most common word on (link not shown) is [shedy], along with related forms like [shey], [chey], and [chedy]. This means it is not possible to describe the Voynich text as an infinite copy iteration.

In essence, a word like [chedy] appears more frequently because it has many possible substitutions for its components. For example, [ch] can be replaced with [sh]. The [e] can be substituted with [ee] or [eee], and [dy] can be replaced with [y]. On the other hand, [cheody] includes an [o], which normally can be substituted with [a] or [y]. However, the scribe seems to avoid writing sequences like [ad] (possibly for aesthetic reasons, whether consciously or unconsciously), and [y] is used frequently only as the first or last glyph of a word. There are only 47 instances of [ad] (for instance, there is only one instance of [cheady], on (link not shown), and no instance of [cheydy]). This means the only available transition for [od] is to remove the [o]. This suggests that the lower frequency of vords like [cheody] compared to vords like [chedy] is due to the scribe's tendency to avoid words like [cheady] and [cheydy].
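The component substitutions named above can be enumerated mechanically. The sketch below is an illustration only: the substitution table holds just the four rules mentioned in this post, the split of a word into components is supplied by hand, and the `allowed` filter hard-codes the two avoidance tendencies (no [ad], and [y] only word-initial or word-final) as strict rules, which is a simplification of "seems to avoid".

```python
# Substitutions mentioned in the post; not a complete rule set.
ALTERNATIVES = {
    "ch": ["sh"],
    "e": ["ee", "eee"],
    "dy": ["y"],
    "o": ["a", "y"],
}


def single_substitutions(components):
    """List (component, replacement, resulting word) for every one-component substitution."""
    components = list(components)
    results = []
    for i, comp in enumerate(components):
        for alt in ALTERNATIVES.get(comp, []):
            word = "".join(components[:i] + [alt] + components[i + 1:])
            results.append((comp, alt, word))
    return results


def allowed(word):
    """Hypothetical strict filter: no [ad], and [y] only as first or last glyph."""
    return "ad" not in word and "y" not in word[1:-1]


for parts in (["ch", "e", "dy"], ["ch", "e", "o", "dy"]):
    print("".join(parts))
    for comp, alt, word in single_substitutions(parts):
        status = "ok" if allowed(word) else "avoided"
        print(f"  [{comp}] -> [{alt}]: {word} ({status})")
```

Under these assumptions all four single substitutions of [chedy] pass the filter, while for [cheody] both substitutions of the [o] component, [cheady] and [cheydy], are rejected.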
