The Voynich Ninja - descent with modification


An interesting observation is that words similar to [chedy], such as [chey] and [cheody], already appear in Currier A. This indicates that by the time words like [chedy] were introduced into the text, a number of potential source vords were already present.


1. <chey> in Currier A
The word <chey> is frequently used in Currier A. It occurs in close vicinity to similar tokens such as <shey> and <chy>.

Frequency of vords similar to <chey> in Currier A:

        ch-           che-           chee-
-ol     chol  (280)   cheol  (71)    cheeol  ( 3)
-o      cho   ( 48)   cheo   (25)    cheeo   ( -)
-y      chy   (104)   chey   (78)    cheey   (34)
-dy     chdy  (  9)   chedy  ( 2)    cheedy  ( 3)
-ody    chody ( 44)   cheody (27)    cheeody ( 4)

2. <chey> in Currier B

<chey> is frequently used in Currier B. It occurs in close vicinity to similar tokens such as <chedy>, <shey>, and <shedy>.

Frequency of vords similar to <chey> in Currier B:

        ch-           che-            chee-
-ol     chol  ( 89)   cheol  ( 88)    cheeol  (  6)
-o      cho   ( 12)   cheo   ( 33)    cheeo   ( 16)
-y      chy   ( 31)   chey   (238)    cheey   (122)
-dy     chdy  (119)   chedy  (470)    cheedy  ( 52)
-ody    chody ( 42)   cheody ( 46)    cheeody (  8)
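
For anyone wanting to rebuild grids like the two above from a transcription, here is a minimal Python sketch. It assumes the text has already been split into a list of EVA word tokens. The function names and the toy sample are hypothetical, and this is not necessarily how the counts above were produced.

```python
from collections import Counter

STEMS = ["ch", "che", "chee"]
SUFFIXES = ["ol", "o", "y", "dy", "ody"]


def grid_counts(tokens):
    """Count every stem+suffix word type in a list of EVA tokens."""
    counts = Counter(tokens)  # missing word types count as 0
    return {
        suffix: {stem: counts[stem + suffix] for stem in STEMS}
        for suffix in SUFFIXES
    }


def print_grid(grid):
    """Print the grid with one row per suffix and one column per stem."""
    print(" " * 8 + "".join(f"{stem + '-':<16}" for stem in STEMS))
    for suffix, row in grid.items():
        cells = "".join(
            f"{stem + suffix:<8}({n:>3})   " for stem, n in row.items()
        )
        print(f"{'-' + suffix:<8}{cells}")


# Toy sample (made up for illustration, not taken from the manuscript).
sample = "chol chey chedy chey shedy cheol chol cheey chedy".split()
grid = grid_counts(sample)
print_grid(grid)
```

Running the same function separately on Currier A and Currier B tokens would give two grids that can be compared cell by cell.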

(19-08-2024, 01:16 AM)Torsten Wrote: The first important point is that far more potential variations than just the changes between [Ed], [Eo], and [Ey] exist. For example, [chedy] could change to [chey], but also to [shey], [cheedy], [keedy], [tedy], and so on. The relation between words containing [Ed], [Eo], and [Ey] is therefore more complex.

That's true if what we're interested in is the full range of words that could result from modifying each source word according to your model. But in each case, a source word containing [Ed], [Eo], or [Ey] would still have that specific sequence either kept or changed when copied, regardless of whether other parts of the same word might change or not. Unless your hypothesis posits that a change to one part of a word makes another simultaneous change to another part of the same word less likely, I wouldn't expect this to make a difference.

I compared those three sequences because they happen to represent the three highest-probability transitions from [E]. However, if you prefer not to compare those three features with each other, there's still the grayscale display limited to [Ed] itself, which I believe shows more or less the same thing.

What I'm trying to assess is the hypothesis that [Ed] becomes more common over time as the cumulative effect of a tendency to switch towards it more often than away from it (please correct me if that misrepresents what you're arguing). Could that happen between pages if it didn't also occur to some degree within pages? It must happen somewhere if it happens at all. I suppose it's possible that the writer favored [Ed] when copying words from other pages, but not when copying words from earlier on the current page -- but that would introduce an added complication.

An alternative hypothesis would be that each page starts out with a slightly different set of transitional probabilities (either as a consequence of how it begins or from some other cause) and that these manifest themselves through the mix of word forms found on that page, with certain "types" of words being more or less common because of the different transitional probabilities implicated in forming them (rather than the other way around -- i.e., the different probabilities result from a page-specific mix of word "types" that arises in some other way). How would you go about ruling that out?

(19-08-2024, 01:16 AM)Torsten Wrote: Secondly, not for all words containing [Eo] similar words containing [Ed] or [Ey] exist. There are for instance also words like [cheol], [cheor], [sheol], and [sheor] which are not directly related to vords like [chey], [chedy] or [cheody].

In this context, variation among [cheol], [cheor], [Sheol], and [Sheor] would count as "no change": [Eo] remains [Eo]. If [Ed] is steadily increasing, it should still increase relative to any unchanged proportion of [Eo], shouldn't it?

(19-08-2024, 01:16 AM)Torsten Wrote: Third, each page essentially starts fresh. Take folio 103r, for instance. <...> this doesn’t imply that the same vords are frequently used on f103v.

I've been assuming for this discussion only that the words at the start of each new page have been copied from some existing page(s). If new pages didn't build on previously written text, there couldn't be any cumulative effects. I believe your arguments at least require Currier A text to have been available as a source for all Currier B text, and for earlier Currier B (with a lower proportion of [Ed]) to have been available in turn as a source for later Currier B (with a higher proportion of [Ed]) -- correct?

(24-08-2024, 04:38 PM)pfeaster Wrote: What I'm trying to assess is the hypothesis that [Ed] becomes more common over time as the cumulative effect of a tendency to switch towards it more often than away from it (please correct me if that misrepresents what you're arguing). Could that happen between pages if it didn't also occur to some degree within pages? It must happen somewhere if it happens at all. I suppose it's possible that the writer favored [Ed] when copying words from other pages, but not when copying words from earlier on the current page -- but that would introduce an added complication.

I did not argue for a cumulative effect of a tendency to switch towards [ed] more frequently than away from it. This might suggest that the text was generated by a stateless machine, such as a mechanical device. However, my hypothesis posits that the text was actually produced by a medieval scribe, unaided by any additional tools. Unlike a machine, a human being has the capacity to learn and adapt, meaning their behavior can change in similar situations over time.

For instance, in Currier A, the scribe writes the word [chey] in close proximity to similar tokens like [shey] and [chy]. In Currier B, however, the scribe writes [chey] near tokens such as [chedy], [shey], and [shedy]. This suggests there was a period before the scribe began using [ed] and a period after they adopted this practice.

(24-08-2024, 04:38 PM)pfeaster Wrote:
(19-08-2024, 01:16 AM)Torsten Wrote: Secondly, not for all words containing [Eo] similar words containing [Ed] or [Ey] exist. There are for instance also words like [cheol], [cheor], [sheol], and [sheor] which are not directly related to vords like [chey], [chedy] or [cheody].

In this context, variation among [cheol], [cheor], [Sheol], and [Sheor] would count as "no change": [Eo] remains [Eo]. If [Ed] is steadily increasing, it should still increase relative to any unchanged proportion of [Eo], shouldn't it?

Words like [cheol], [cheor], [Sheol], and [Sheor] are related to words like [chol], [chor], [Shol], and [Shor]. This suggests a possible transformation from [cho] to [cheo], leading to instances of [cheo]. This pattern is evident in Currier A. For example, in Herbal A, there are 228 occurrences of [chol], but only 28 occurrences of [cheol]. In contrast, in Pharma A, there are 45 instances of [chol], alongside 40 instances of [cheol].

The four most frequent 'ch/sh'-word types for Herbal (A) and Pharma (A):

             1st          2nd           3rd          4th           word count
Herbal (A)   chol (228)   chor  (155)   shol (104)   sho   (96)    8,087
Pharma (A)   chol ( 45)   cheol ( 40)   chor ( 24)   cheor (24)    2,529
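
Because the two sections differ a lot in length, it may help to scale the counts before comparing them. The sketch below uses only numbers quoted in this post (the Herbal A count of 28 for [cheol] comes from the paragraph above, not from the table). The helper name `per_thousand` is made up for illustration.

```python
# Counts as quoted in the post; "words" is the section word count.
sections = {
    "Herbal (A)": {"words": 8087, "chol": 228, "cheol": 28},
    "Pharma (A)": {"words": 2529, "chol": 45, "cheol": 40},
}


def per_thousand(count, total):
    """Occurrences per 1,000 words of running text."""
    return 1000 * count / total


summary = {}
for name, data in sections.items():
    summary[name] = {
        "chol_per_1000": per_thousand(data["chol"], data["words"]),
        "cheol_per_1000": per_thousand(data["cheol"], data["words"]),
        # How many [cheol] tokens there are for each [chol] token.
        "cheol_to_chol": data["cheol"] / data["chol"],
    }

for name, row in summary.items():
    print(
        f"{name}: chol {row['chol_per_1000']:.1f}/1000, "
        f"cheol {row['cheol_per_1000']:.1f}/1000, "
        f"cheol:chol {row['cheol_to_chol']:.2f}"
    )
```

On these figures the [cheol]:[chol] ratio is much higher in Pharma A than in Herbal A, which is the contrast described above.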

Presumably there are spatial clusters of related words in Anna Karenina, and they evolve in correlated ways, since it is a coherent narrative generated by a stateful author. But is the text intrinsically directional? With a transcription alone, would Venusians be able to determine whether it was written from front to back or back to front?

A more constrained copy-and-mutate process will definitely yield text with a kind of abstract statistical directionality. With reference to a single specific line of text, the universe of all possible lines contains a greater number of lines at greater edit distance than at lesser edit distance. On these grounds alone, sequential random modification of the reference line will usually produce lines at greater and greater edit distance from the original (until equilibrated). Obviously this asymmetry (Patrick's characteristic #3?) might be obscured by a human scribe's distractions, an inconvenient mutation rate, or the macro-state metric chosen to detect it. Given the very small number of word tokens per vms line, the present attempt to spot intra-page evolution of the vocabulary (Patrick's characteristic #1) uses a metric (MMED) informed by the proposed mutation process. Maybe statistical directionality could even be detected in poetry or musical notation, which are composed by humans using incomplete and not-strictly-random self-citation. A minimal positive result is shown at the end of the post.
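
The edit-distance asymmetry can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it mutates a single line one character at a time and tracks plain Levenshtein distance from the starting line. It is not the MMED metric, the alphabet and sample line are arbitrary choices, and spaces are treated like any other character.

```python
import random

ALPHABET = "acdehiklnoqrsty"  # arbitrary subset of EVA letters (assumption)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mutate(line, rng):
    """Apply one random edit: substitute, insert, or delete a character."""
    kind = rng.choice(["sub", "ins", "del"])
    pos = rng.randrange(len(line))
    if kind == "sub":
        return line[:pos] + rng.choice(ALPHABET) + line[pos + 1:]
    if kind == "ins":
        return line[:pos] + rng.choice(ALPHABET) + line[pos:]
    # Never delete the last remaining character.
    return line[:pos] + line[pos + 1:] if len(line) > 1 else line


def drift(reference, steps, rng):
    """Distance from the reference after each successive mutation."""
    line, distances = reference, [0]
    for _ in range(steps):
        line = mutate(line, rng)
        distances.append(levenshtein(reference, line))
    return distances


def mean_drift(reference, steps=40, runs=200, seed=0):
    """Average the drift curve over many independent mutation chains."""
    rng = random.Random(seed)
    totals = [0] * (steps + 1)
    for _ in range(runs):
        for i, d in enumerate(drift(reference, steps, rng)):
            totals[i] += d
    return [t / runs for t in totals]


profile = mean_drift("qokeedy chedy shedy qokain chey")
print([round(x, 2) for x in profile[::10]])
```

Individual chains can wander back toward the reference, but the averaged curve is what carries the directionality. A real scribe's process would of course be far less uniform than this.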

Torsten's generated_text.txt is a transparent case of Voynichesque self-citation. Dividing it this time into 37 consecutive pages of 32 lines each, we can calculate the MMED measure with respect to the first line of the current page, or the first line of the following page:
[attachment=9119]
Each point represents the average, over all pages, of a line's MMED score. On the left, this measure of edit distance evolves away from the first line as we progress down the page. On the right the trend is inverted, as expected: edit distance decreases as we approach the reference line on the following page. By construction, there is page-to-page continuity across the 1200 lines of generated text, so a reference line from the 'future' of each page still affects its MMED scores. In this test with a sample size and mutation texture that are broadly similar to the vms, directionality is observable in the co-averaged pages. (In an earlier post I overlapped the page contents, so the absolute MMED values were not correct. The plots above are... less erroneous.)
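
MMED is not spelled out in this post. One plausible reading, assumed here purely for illustration, is a mean minimum edit distance: for each word in a line, take the smallest edit distance to any word in the reference line, then average over the line. The sketch below applies that hypothetical definition to non-overlapping pages and averages the score by line position. The function names, the `use_next_page` switch, and the toy lines are all made up.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mean_min_edit_distance(line, reference):
    """Average, over words in `line`, of the distance to the closest reference word."""
    words, ref_words = line.split(), reference.split()
    if not words or not ref_words:
        return 0.0  # simplification: empty lines contribute zero
    return sum(min(levenshtein(w, r) for r in ref_words) for w in words) / len(words)


def page_profile(lines, page_len, use_next_page=False):
    """Average score per line position across non-overlapping full pages."""
    pages = [
        lines[i:i + page_len]
        for i in range(0, len(lines) - page_len + 1, page_len)
    ]
    if use_next_page:
        # Reference is the first line of the following page; last page is dropped.
        pairs = [(page, nxt[0]) for page, nxt in zip(pages, pages[1:])]
    else:
        pairs = [(page, page[0]) for page in pages]
    if not pairs:
        raise ValueError("not enough lines for the requested page layout")
    return [
        sum(mean_min_edit_distance(page[k], ref) for page, ref in pairs) / len(pairs)
        for k in range(page_len)
    ]


# Toy input: two "pages" of three lines each.
toy = [
    "chey shey", "chedy shey", "chol daiin",
    "okey okey", "okedy okey", "okey dar",
]
current = page_profile(toy, page_len=3)
following = page_profile(toy, page_len=3, use_next_page=True)
print(current, following)
```

Slicing into non-overlapping pages matters here, since overlapping page contents would distort the absolute values.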

Apparently the scatter in results for the vms can be reduced by analyzing a subset of paragraph text that is more uniform, from page to page, than the text as a whole. Quire 20 has rough unity of layout, Currier language, and hand, with many long lines per page. Now considering the first 32 lines of 23 separate Q20 pages, and analyzing exactly as above:
[attachment=9120]
A trend does seem to emerge in the current-page plot on the left, with later lines again evolving away from the first. On the right, with edit distance keyed to the following page, no trend is visible through the noise. Unlike generated_text.txt, Q20 shows no evidence here of page-to-page continuity (in the canonical page order).

Curiously, the vms trend may be driven by the first line in particular. When we use the second line on the page as a reference, subsequent Q20 lines do not evolve away from it as convincingly as they did from the first:
[attachment=9130]
Meanwhile generated_text.txt on the right (with its consistent line-by-line generation algorithm) behaves as it did with the first-line reference.

Nablator posted a transcription of Speculum Humanae Salvationis in another thread. Using 85 columns from the long central chapters (with headings, capital letters, and punctuation removed), the MMED trend keyed to the first line of each column is understandable:
[attachment=9128]
As with chapters of the King James Bible, no large-scale down-column trend is evident. But the value for the second line specifically is a clear outlier, with a significantly lower MMED value. Just skimming the transcription with my Venusian grasp of Latin, I see that the first couplet of each column rhymes, and often contains a word repeated or inflected between the lines. This degree of line-to-line similarity is apparently enough to show up in the plot, and indeed to account for the effect size seen in Q20.

Of course the trends are maddeningly weak. Since EVA is a graphical notation, 'edit' similarity is mostly capturing visual resemblance. And the Word As A Functional Unit is baked in by default, until a better-performing analysis is discovered.

Thematically appropriate soundscape: Alvin Lucier.

(26-08-2024, 06:19 PM)Torsten Wrote: I did not argue for a cumulative effect of a tendency to switch towards [ed] more frequently than away from it. This might suggest that the text was generated by a stateless machine, such as a mechanical device.

What I had in mind here wasn't a strictly algorithmic process such as a machine might carry out, but a scribe's evolving and subjective preferences about which parts of words to change, and to what, and which parts of words not to change.

"It is possible to distinguish Currier A and B based on frequency counts of tokens containing the sequence <ed>. The summary in Table 2 shows, e.g., that if <chedy> is used more frequently, this also increases the frequency of similar words, like <shedy> or <qokeedy> ...."

There seem to be two paths by which a word containing [ed] could come to be used under your hypothesis: (1) the scribe has changed a source word such as [chey] to a "related" word such as [chedy]; or (2) the scribe has copied a preexisting source word containing [ed] and kept the [ed].

Since Currier A doesn't contain source words with [ed], path (2) requires that path (1) must have happened at some earlier point to produce an appropriate source word. So the process apparently needs to start, either directly or indirectly, with path (1). And for path (1) to furnish an increasing proportion of source words for path (2), the scribe would need to be changing words without [ed] to words with [ed] more often than the other way around, perhaps not according to any strict algorithm, but nevertheless consistently over the long term.

"Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'. Since words typical for Currier A also exist in Currier B, but not the other way round, it is reasonable to assume that the order shown in Table 2 indeed represents the original sequence in which the sections of the VMS had been created."

If path (1) were to be followed consistently, the proportion of previously-modified source words containing [ed] would steadily increase, and path (2) would then also be more likely to occur because the probability of selecting a source word that already contains [ed] would go up, leading to a kind of "snowballing" effect. This is the (subjective, human-powered) mechanism I assumed you were describing here, but maybe I read more into your explanation than you intended. Did you have a different mechanism in mind for the gradual evolution from Currier A to Currier B?
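
To make the two paths concrete, here is a toy expected-value model in Python. It is a sketch of the mechanism as I've described it, not of anything stated in the paper. It assumes each new word is copied from a source word drawn uniformly from everything written so far, that a source word without [ed] gains it with probability `toward` (path 1), and that a source word with [ed] loses it with probability `away`. Those parameter names and values are made up.

```python
def ed_share(steps, toward, away, seed_words=100):
    """Expected share of [ed] words after each newly written word.

    The text starts with `seed_words` words, none of which contain [ed].
    """
    share, total = 0.0, seed_words
    history = [share]
    for _ in range(steps):
        # Path (2): source has [ed] and keeps it. Path (1): source lacks [ed] and gains it.
        p_new = share * (1 - away) + (1 - share) * toward
        share = (share * total + p_new) / (total + 1)
        total += 1
        history.append(share)
    return history


favoring = ed_share(steps=5000, toward=0.05, away=0.01)
disfavoring = ed_share(steps=5000, toward=0.01, away=0.05)
print(round(favoring[-1], 3), round(disfavoring[-1], 3))
```

In this toy model the share rises from zero in both runs and heads toward toward / (toward + away), so the asymmetry between the two probabilities sets how far the snowball can roll rather than whether it starts. Whether a human scribe's shifting preferences behave anything like fixed probabilities is exactly what's in question.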

(28-08-2024, 09:24 AM)obelus Wrote: A trend does seem to emerge in the current-page plot on the left, with later lines again evolving away from the first. On the right, with edit distance keyed to the following page, no trend is visible through the noise. Unlike generated_text.txt, Q20 shows no evidence here of page-to-page continuity (in the canonical page order).

These are some very interesting plots. If we disregard the single value for line 31 on the left as an outlier, the trend is less pronounced, but would still slope upward a bit.

(28-08-2024, 09:24 AM)obelus Wrote: Curiously, the vms trend may be driven by the first line in particular. When we use the second line on the page as a reference, subsequent Q20 lines do not evolve away from it as convincingly as they did from the first:

We know (e.g., from tavie's work) that there are systematic differences between the composition of first lines of paragraphs and the lines that follow -- differences that go well beyond the conspicuously disproportionate frequencies of [p] and [f] in first lines. I wonder what would happen if you were to try to control for that somehow, maybe by disregarding a few select edit types ([p] to [t], [f] to [k], [Sh] to [ch], etc.). Or if you were to run a similar study at the paragraph level rather than the page level -- or else at the page level, but omitting the first paragraph and always starting with the first line of the second paragraph. Or reversing the process and gauging edit distance from the last full line, working "backwards" up the page. Or.....?
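
One way to try the first of those controls would be to fold the selected glyph pairs together before measuring edit distance, so that those substitutions cost nothing. A rough Python sketch, with hypothetical function names, and with the caveat that blanket replacement also rewrites [p] and [f] inside bench gallows like [cph] and [cfh]:

```python
# Glyph pairs to treat as equivalent; the multi-character pair goes first.
EQUIVALENTS = [("sh", "ch"), ("p", "t"), ("f", "k")]


def normalize(word):
    """Fold the selected glyph variants together (lower-cased EVA)."""
    word = word.lower()
    for old, new in EQUIVALENTS:
        word = word.replace(old, new)
    return word


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def controlled_distance(a, b):
    """Edit distance that ignores the edit types listed in EQUIVALENTS."""
    return levenshtein(normalize(a), normalize(b))


print(controlled_distance("Shedy", "chedy"))
print(controlled_distance("pchedy", "tchedy"))
print(controlled_distance("chey", "chedy"))
```

The same wrapper could be dropped into a per-paragraph or bottom-up version of the study without changing anything else.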

(29-08-2024, 12:30 AM)pfeaster Wrote: Since Currier A doesn't contain source words with [ed], path (2) requires that path (1) must have happened at some earlier point to produce an appropriate source word. So the process apparently needs to start, either directly or indirectly, with path (1). And for path (1) to furnish an increasing proportion of source words for path (2), the scribe would need to be changing words without [ed] to words with [ed] more often than the other way around, perhaps not according to any strict algorithm, but nevertheless consistently over the long term.

In my view it is necessary to distinguish between the source words the scribe selects and the methods the scribe uses to modify these words. If I understand you correctly, your focus is on the source words. However, from my perspective, the key change lies in how the scribe modified these words. In Currier A, it's already possible to find words like [chdy], [chey], [cheody], or [keody]. However, at that stage, the scribe did not modify such words into words containing [ed], such as [chedy], [shedy], or [okedy]. At some point, though, the scribe began incorporating [ed] into words, making it possible to transform [cheody] into [chedy], or [shey] into [shedy]. This has two effects. First, it impacts the text, as more and more word types containing [ed] begin to appear. Second, it also influences the scribe, as such words become increasingly familiar to him.

(29-08-2024, 12:30 AM)pfeaster Wrote: "Now, reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B'. Since words typical for Currier A also exist in Currier B, but not the other way round, it is reasonable to assume that the order shown in Table 2 indeed represents the original sequence in which the sections of the VMS had been created."

If path (1) were to be followed consistently, the proportion of previously-modified source words containing [ed] would steadily increase, and path (2) would then also be more likely to occur because the probability of selecting a source word that already contains [ed] would go up, leading to a kind of "snowballing" effect. This is the (subjective, human-powered) mechanism I assumed you were describing here, but maybe I read more into your explanation than you intended. Did you have a different mechanism in mind for the gradual evolution from Currier A to Currier B?

If you removed the word "steadily", I would agree with your description. In my view it is more of an up and down, since the text changes from page to page. See for instance the pages in Currier B. The folios f107r/f107v contain fewer [ed] words than the folios f108r/f108v. This even happens for folios of the same bifolio (see the number of instances of [ed] on f111r vs f111v).

One open question is why the "snowballing" effect occurred with words containing [ed], but not with words containing [eod], despite the existence of words like [cheody] and [keody] in Currier A. I have some ideas, but I don’t have a definitive answer to this question. One idea might be that, for some reason, in the eyes of the scribe [o] was not interchangeable with [a] before [d] and was therefore in some way "dispensable" (only 47 words with [ad] are listed).
Note: According to the Curve-Line Hypothesis, a line glyph like [a] cannot precede a curve glyph like [d]. Possible glyphs after [a] are line glyphs like [i], [r], and also [l] (see "Die Harmonie der Glyphenfolgen").
