12-08-2024, 08:03 AM
Many thanks to the presenters last Sunday. Not only did they deliver new results, but subsequent discussion on the forum has been stimulating. For example, Torsten crisply summarized a prediction that follows from the self-citation hypothesis:
(09-08-2024, 03:36 PM)Torsten Wrote: ...the scribe might introduce new spelling variants. For example, he could decide to add [aiin] alongside [daiin]. This change would affect only the text generated after [aiin] was introduced, leading to observable developments in the manuscript. Decide for yourself whether the patterns observed in the Voynich text align with this description.
OK, let us attempt to decide on quantitative grounds.
The minimal model is a scribe working page by page, from the top down. Self-citation would begin with the top line of text and introduce variants as new lines are generated out of words visible above. The text is therefore predicted to deviate further and further from the first line as we advance down the page. Space-delimited "words" are the units of composition in this picture. So here is a first-pancake metric of wordwise similarity between lines:
Take each word of a non-page-initial line and compute its minimum possible edit distance to any word in the initial line. (It is to be hoped that this minimal-edit pairing correlates somehow, retrospectively, with the scribal method.) Then calculate the Mean Minimum Edit Distance (MMED) over the words in that line. In some rough sense the MMED score captures how directly the line's words, taken together, can be derived from the initial line. As we proceed down the page, more and more mutated versions of the first-line words are visible to the scribe, so the compounded mutations are expected to increase the MMED score as a function of line rank.
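For concreteness, here is a minimal Python sketch of the metric as just defined. Plain Levenshtein distance is assumed for the edit distance, and the helper names are mine, not anything taken from Torsten's code:

[code]
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

def mmed(line: str, first_line: str) -> float:
    """Mean Minimum Edit Distance from a line's words to the page-initial line's words."""
    first_words = first_line.split()
    dists = [min(levenshtein(w, f) for f in first_words) for w in line.split()]
    return sum(dists) / len(dists)
[/code]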
Happily, Torsten has posted a generator [link] that implements self-citation. By chopping the generated text into 75 pseudo-pages of 16 lines each, we can approximate the bulk layout of a VMS sample (below). The statistical traces of pagewise self-citation, if present, should manifest on each page independently, so we stack the individual-page results and represent each line rank by its average across pages. In the plot below, for example, the point at rank 2 is the average of all 75 second lines' MMED scores relative to their own first lines, and so on:
generated_text.txt
[attachment=9014]
This plot is not saying anything new or interesting about the text generator; it serves merely as validation of the MMED measure, showing that it can pick up a macro property that emerges from the word-by-word generation algorithm. The farther we progress down each page, the more the words deviate from their first-line exemplars. Open markers show the same calculation performed with words randomly shuffled among the available positions, in which case no trend is expected or observed. MMED values appear to saturate as more than 15 lines are included.
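For reference, the pseudo-page bookkeeping and the shuffled control might look like the following sketch, which reuses the mmed helper from above. The assumptions here are mine: that generated_text.txt carries one generated line per text line, and that the shuffle permutes words across a whole page while preserving each line's word count:

[code]
import random

PAGE_LEN = 16  # lines per pseudo-page

def paginate(lines, page_len=PAGE_LEN):
    """Chop a flat list of text lines into consecutive, complete pseudo-pages."""
    return [lines[i:i + page_len]
            for i in range(0, len(lines) - page_len + 1, page_len)]

def rank_profile(pages):
    """Average MMED at each line rank (2..PAGE_LEN) across all pages."""
    profile = []
    for r in range(1, PAGE_LEN):                  # r = 1 is a page's second line
        scores = [mmed(page[r], page[0]) for page in pages]
        profile.append(sum(scores) / len(scores))
    return profile                                # profile[0] is plotted at rank 2, etc.

def shuffled_page(page, rng=random):
    """Permute a page's words at random while preserving each line's word count."""
    words = [w for line in page for w in line.split()]
    rng.shuffle(words)
    out, k = [], 0
    for line in page:
        n = len(line.split())
        out.append(" ".join(words[k:k + n]))
        k += n
    return out

# Assumed layout: one generated line per text line in generated_text.txt
lines = open("generated_text.txt", encoding="utf-8").read().splitlines()
pages = paginate(lines)[:75]
filled_markers = rank_profile(pages)
open_markers = rank_profile([shuffled_page(p) for p in pages])
[/code]

Using only complete 16-line pseudo-pages keeps the per-rank averages comparable from rank to rank.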
Finally, what the crowd paid to see: we repeat the analysis with paragraph text from the Takahashi transcription, IT2a-n.txt, using the 84 pages that contain at least 16 lines.
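How those pages are selected depends on the layout of the transcription file. Assuming it simply carries one locus per line in the form quoted at the end of this post (e.g. <f1r.1> followed by that line's words), the page grouping might be sketched as follows; the regular expression and the file layout are assumptions, not a description of the actual file:

[code]
import re
from collections import defaultdict

# Assumed input format: one locus per line, e.g. "<f1r.1> fachys ykal ar ..."
LOCUS = re.compile(r"^<(f\d+[rv]\d*)\.(\d+)>\s*(.*)$")

def pages_with_min_lines(path, min_lines=16):
    """Group transcription lines by folio side; keep pages with at least min_lines lines."""
    pages = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            m = LOCUS.match(raw.strip())
            if m:
                folio, _line_no, text = m.groups()
                pages[folio].append(text)
    return {folio: lines for folio, lines in pages.items() if len(lines) >= min_lines}

# e.g. vms_pages = pages_with_min_lines("IT2a-n.txt")
[/code]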
IT2a-n.txt
[attachment=9013]
Oh well... I have not yet decided for myself whether the patterns align. The greater noise present in the real text might just obscure a trend of the magnitude seen in the synthetic text.
One way forward would be to refine the line-comparison function, in hopes of increasing its sensitivity, decreasing the noise, or accounting for reference lines other than the page-initial one.
Another is to observe the Perseids from a dark location; at mine the radiant is just now rising.
The initial line of f1r is
<f1r.1> fachys ykal ar ataiin shol shory cthres y kor sholdy
The first word of the next line, <f1r.2>, is sory, at Levenshtein distance 1 from shory in the line above. Averaging the minimum edit distances for all 11 words of <f1r.2> gives an MMED of 1.81. Averaging the 84 such rank-2 MMED values over the pages considered gives 2.63, the value plotted at line rank 2.
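As a sanity check, the levenshtein helper sketched earlier reproduces the first of these numbers:

[code]
first = "fachys ykal ar ataiin shol shory cthres y kor sholdy"
print(min(levenshtein("sory", w) for w in first.split()))   # prints 1, attained at "shory"
[/code]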