20-12-2021, 10:22 AM
(20-12-2021, 01:45 AM)MichelleL11 Wrote:
(19-12-2021, 07:00 PM)MarcoP Wrote: Treating sentences as lines is certainly interesting, but I doubt that VMS lines are sentences (but who knows?).
Is there any way to test this? What data would you expect if it was sentences?
Hi Michelle,
Obelus' experiments with King James Bible treat sentences as lines:
(19-12-2021, 10:15 AM)obelus Wrote: * King James Bible, Scofield Ed., Books 1-5: chapters parsed as "paragraphs," grammatical sentences as "lines."
I guess that Obelus' findings are consistent with what could be expected:
(19-12-2021, 10:15 AM)obelus Wrote: One pattern is apparent even to a non-linguist: complementary distributions of objective vs subjective pronouns. For example, "thou" and "thee:"
[plot omitted]
Distribution of the objective case is reversed. "He" and "him" show the same pattern:
[plot omitted]
Similarly, "they" and "them:"
[plot omitted]
Whether such patterns are observable in English-language texts generally, or other subject-verb-object languages, is a question for the technical literature I suppose.
I believe that these patterns could be observed in any other English text as well as in other languages with a more-or-less rigid positional syntax.
These patterns are thought-provoking parallels for some of the VMS phenomena, but there are significant differences, e.g.:
- VMS line-start and line-end words are often morphologically different from other words, with distinct prefixes (at line start) and suffixes (at line end).
- Graphically, VMS lines do not appear to be sentence-like; in particular, they are reasonably well aligned on the right margin in comparison with (say) verses in a poem.
- The VMS shows significant paragraph structure, apparent for instance in Obelus' 'shol' and '*ta*' plots: lines at the top and bottom of paragraphs behave differently. This does not seem to be the case with the King James Bible (at least in the cases shown by Obelus; I guess that processing the text differently or choosing different words could reveal paragraph patterns).
20-12-2021, 12:43 PM
I think Obelus' little graphs to the right are a nice addition to the visualization.
Considering sentence structure is interesting, and perhaps the most obvious solution to LAAFU, after all sentences are structural units as well. But Marco mentions the most important objections to this possibility.
I think especially the fact that the VM text fills up the available space is interesting in this regard. This is another of the VM's paradoxes: the text fills up the space like water, but its lines have rigid properties. It's like a liquid and a solid at the same time.
I guess this impression might be an effect of generalization, and closer inspection of relevant tokens vs. margin behavior could reveal something else, but as far as I'm aware the VM only leaves a right-side indent at the end of a paragraph.
20-12-2021, 05:50 PM
Marco and Koen, one potential solution I see to this paradox is to identify ngrams that would make good candidates for metadata, and ignoring these whilst looking for ngram patterns across lines and down paragraphs and pages. The finer points of the experimental methodology I'd want to spend some time crafting include:
- A better-argued justification, with reference to mainstream history and information science, for why the VMs text containing embedded metadata is likely enough to be worth taking seriously as a hypothesis.
- A point-system rubric for vetting candidate ngrams on their likelihood of functioning as metadata, if there are any that do in the first place.
- Specifying and justifying a cutoff score, and marking those vords that exceed it for non-consideration in statistical analyses. This could be done in several different ways, each with its own altered copy of a machine-readable VMs transcription. For example, one copy could replace each glyph of suspected metadata with EVA=w, while another could delete them as if they were not there at all. One version could replace or delete candidate metadata strings wherever they occur in the text, while another could leave ngrams unchanged if they don't appear to occur at any obvious transition point.
- Feed these stripped VMs transcriptions to Patrick and Obelus's bots, and compare the outputs to those of the virgin VMs transcription.
If this went well, I would then want to compare the plots of ngram occurrence by horizontal and vertical position in the VMs, with those of English texts that clearly do not contain embedded metadata. I'd be especially curious to see a comparison to an English text that was formatted to insert a line break after every sentence, continue no sentences over more than one line, and continue no paragraphs over more than one page.
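As a concrete illustration of the replace/delete idea above, here is a minimal Python sketch (the function name, the `mode` flag, and the candidate list are all hypothetical; real vetting would of course be driven by the rubric, not a hand-picked list):

```python
def strip_metadata(text, candidates, mode='delete'):
    """Mask or delete candidate metadata ngrams in a transcription line.

    mode='delete' removes the ngrams outright; mode='mask' replaces each
    glyph with 'w' (cf. the EVA=w suggestion above). Longer candidates
    are handled first so they aren't broken up by shorter ones.
    """
    for ng in sorted(candidates, key=len, reverse=True):
        repl = '' if mode == 'delete' else 'w' * len(ng)
        text = text.replace(ng, repl)
    return text

# Hypothetical example: treat 'daiin' as a metadata candidate.
print(strip_metadata('qokeedy.daiin.chedy', ['daiin'], mode='mask'))
# qokeedy.wwwww.chedy
```

Each mode would yield its own altered copy of the transcription, as described in the bullet points.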
20-12-2021, 11:07 PM
I'm not sure how many members of this forum are comfortable working with Python scripts, but I've just uploaded a version of the code I've been using in case anyone would like to use or adapt it for further experiments (or just look it over, for that matter).
There are two scripts in the zip folder, one that I used for pre-processing the ZL transcription, and then another script that generates images based on it. I think the only dependency that wouldn't come by default with something like Anaconda is OpenCV.
I'm afraid I haven't gone to the trouble of creating any kind of GUI or command-line argument parser. Instead, all the various parameters are laid out in lines 4-56 and need to be set there by editing the script itself. I'll admit that's not ideal, and it can be hard to keep track of all the variables (in fact, I just discovered to my chagrin that I'd inadvertently left a "switch" on limiting analysis to recto pages when I generated the images to go with my forum posts of December 8th and 14th -- oops). I hope my explanations for what the individual "switches" do will be clear, but here are configurations for a few common scenarios in case the overarching logic isn't:
To track distributions of discrete vords:
ignore_spacing=0; by_vord=1; vord_position=0
To track distributions of glyphs, bigrams, etc., irrespective of spacing:
ignore_spacing=1; by_vord=0
To track distributions of exact strings including spaces, such as [or.d]:
ignore_spacing=0; by_vord=0
To track distributions of prefixes (disregarding cases in isolation as self-standing vords):
ignore_spacing=0; by_vord=1; vord_position=1; exclude_total_match=1
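Since the post mentions the lack of a command-line argument parser, here is one minimal way the switches above could be exposed via Python's argparse instead of by editing the source (a sketch only: the flag names mirror the variables listed above, and wiring them into the rest of the script is omitted):

```python
import argparse

# Sketch: expose the positional-analysis "switches" as CLI flags.
# Defaults correspond to the "discrete vords" scenario above.
parser = argparse.ArgumentParser(description='VMs positional-analysis switches')
parser.add_argument('--ignore-spacing', type=int, choices=[0, 1], default=0)
parser.add_argument('--by-vord', type=int, choices=[0, 1], default=1)
parser.add_argument('--vord-position', type=int, choices=[0, 1], default=0)
parser.add_argument('--exclude-total-match', type=int, choices=[0, 1], default=0)

# Example: the "glyphs/bigrams irrespective of spacing" scenario.
args = parser.parse_args(['--ignore-spacing', '1', '--by-vord', '0'])
print(args.ignore_spacing, args.by_vord, args.vord_position)  # 1 0 0
```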
I've tried to comment the script enough for someone else to follow what it does and make changes if wanted. I'm sure it's less efficient and streamlined than it could be, and I'm a bit self-conscious about releasing it as a result -- but at least it seems to work, which is probably the important thing.
I agree that separate rightwardness and downwardness graphs as shown on the right in Obelus's displays would be a nice addition.
I'm doubtful about lines corresponding to grammatical sentences for the reasons Marco listed, as well as because of differences in line length that seem attributable to foldout page width (longer) or illustrations (shorter). Why should the sentences on pages where a plant illustration happens to fill up the whole right-hand side of the page be consistently shorter than usual?
But even so, looking for sentence-like patterns in lines might still be just as productive as looking for word-like patterns in vords. I don't think anyone would say the latter approach hasn't been illuminating regardless of whether vords actually correspond to words or not.
As an alternative scenario, it's not hard to imagine ways of encoding lines that would show forms of left-to-right variation scalable to any length and independent of content. Take the following text copied in a clockwise spiral with spaces preserved after the letters that precede them, an asterisk inserted at each turn, and empty cells ignored:
[attachment=6127]
THIS I*HN A*EG*MIS *THE *ES*SE*D*D
Regardless of how long a line is, the average profile of its "words" will change steadily from left to right, with [*] becoming increasingly prevalent and all other glyphs becoming less prevalent.
THIS IS A*HEUER *ENO *R T SIN*OTHER *MBHE*GNOL *A SD*DEN *E T*AR*A*G
The first line of a paragraph might need to include additional signposts to initiate the path, if there are multiple possibilities, leading to the presence there of glyphs that are rare (though not forbidden) anywhere else.
→THIS IS ↓RE RH←PARGA↑OST→HE FI↓NA←PA F↑ T →LI
I don't at all mean to suggest this is actually how Voynichese works, and I'm offering it only as one way paragraph and line patterns could conceivably arise as a byproduct of a simple kind of cipher, rather than as a consequence of grammatical or narrative structure. Which is to say, I don't think anything's off the table.
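For what it's worth, the spiral construction above can be sketched in a few lines of Python (a toy reconstruction of the idea, not the exact grid used in the attachment):

```python
def spiral_copy(text, width):
    """Copy text along a clockwise inward spiral through a row-major grid
    of the given width, joining straight runs with '*' at each turn and
    skipping empty cells (positions past the end of the text)."""
    rows = [text[i:i + width] for i in range(0, len(text), width)]
    height = len(rows)

    def cell(r, c):  # None for cells outside the (possibly ragged) grid
        return rows[r][c] if r < height and c < len(rows[r]) else None

    legs, top, bottom, left, right = [], 0, height - 1, 0, width - 1
    while top <= bottom and left <= right:
        legs.append([cell(top, c) for c in range(left, right + 1)])
        legs.append([cell(r, right) for r in range(top + 1, bottom + 1)])
        if top < bottom:
            legs.append([cell(bottom, c) for c in range(right - 1, left - 1, -1)])
        if left < right:
            legs.append([cell(r, left) for r in range(bottom - 1, top, -1)])
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    runs = [''.join(ch for ch in leg if ch) for leg in legs]
    return '*'.join(run for run in runs if run)

print(spiral_copy('ABCDEFGHI', 3))  # ABC*FI*HG*D*E
```

Note how the legs get shorter as the spiral closes in, so [*] necessarily grows more frequent toward the end of the output, matching the left-to-right drift described above.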
02-01-2022, 11:14 AM
Hi Patrick,
thank you for sharing your scripts!
I just ran a quick check and they work for me. I easily reproduced the balneo/Q13 'daiin' plot you posted earlier. A parameter parser could probably help when running multiple experiments, but I understand that coding it in an optimal way may take some time.
02-12-2022, 08:27 PM
Following from Patrick's talk on this topic at the conference I wanted to revisit our discussion here. The research is very interesting and demands that we either identify a cause or adjust our approach to the text. There must be a systematic and pervasive reason for directionality and any theory needs to address it.
When Patrick first posted his findings I thought that line length must be a main cause (we'll leave downwardness to the side for now). Because the rightwardness of a word is relative, the positional value of a word in the same ordinal position differs between lines of different lengths. So the third word in a line of nine words has a positional value of 0.333, while the third word in a line of ten words has a value of 0.300.
Thus we could potentially explain rightwardness in this way: words which appear more to the left have the "ability" to make lines longer, or words which appear more to the right have the "ability" to make lines shorter. This would not only provide an explanation for the pervasiveness of the effect across multiple different glyph pairs, but also for the differing strengths of the effect: we're simply dealing with a single parameter by which glyphs vary.
A good candidate for that parameter would be word length. Were words starting with/containing a certain glyph to be shorter or longer on average then they would influence how much space is left for more words in that same line. Short words would appear more leftward because they leave more space for further words and appear more frequently in longer lines. All of this should be testable, and I think we may have at least some of the answers already to prove/disprove this hypothesis.
However, I note that [qo] is found to be more leftward than [o], yet words beginning [qo] are definitely longer on average than words beginning [o]. So there must be something else happening. Maybe there is interplay between both the addition of a new word reducing the rightwardness of other words in that line and the likelihood that the new word will begin with a certain glyph. So (ignoring for the moment that last words are excluded) a word beginning [o] in the third position of a nine-word line would have its rightwardness reduced by the presence of a tenth word, but if that tenth word also began with [o] then the overall rightwardness of all words beginning [o] would be (3/10 + 10/10)/2 = 0.65.
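Emma's arithmetic can be made concrete with a little Python (the helper function and the toy line are mine, using the 1-based-position-over-line-length definition of rightwardness from the paragraph above; the token spellings are invented):

```python
def mean_rightwardness(lines, prefix):
    """Mean relative position (1-based index / line length) of tokens
    beginning with `prefix`, pooled over all lines."""
    vals = [(i + 1) / len(line)
            for line in lines
            for i, word in enumerate(line) if word.startswith(prefix)]
    return sum(vals) / len(vals) if vals else None

# Toy ten-word line: an [o]-word in third position plus another
# appended as the tenth word.
line = ['x'] * 10
line[2], line[9] = 'okain', 'oly'
print(mean_rightwardness([line], 'o'))  # (3/10 + 10/10) / 2 = 0.65
```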
Of course, these explanations aren't likely for individual word pairs, such as [chckhy] and [shckhy], which a) are the same length and b) are unlikely to commonly appear in the same line. So we would be back to explaining the phenomenon by reference to their relationship to other words, or to a selectional procedure which causes them to occur earlier in the line, rather than their ability to control the line length. The question is whether there is such a relationship other than the line start, and whether those words could influence line length (if they even need to).
I suppose I'm saying that I want to continue discussing this topic, but am not much further forward understanding potential causes. Thoughts?
03-12-2022, 02:43 PM
In order to investigate the issues pointed out by Emma, I quickly edited one of the scripts to remove all white space and work on character position in the line, rather than word position. Positions are still mapped into 10 discrete slots.
The plot for '[NOTq]o' vs 'qo' (where I inserted an L at the beginning of each line, in order to match line initial o)
[attachment=7022]
The plot for 'al' vs 'ol'
[attachment=7023]
Of course, these experiments are different, since strings like 'ol' or 'ko' are counted even when they appear "inside" words and not as prefixes or suffixes. But if one compares with the word-based plots posted earlier in this thread, it is possible to see that the ratios are qualitatively very similar. So (unless I messed up something) it seems that the phenomenon does not strictly depend on words and spaces but on actual horizontal positions in the line.
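For anyone wanting to replicate the binning, this is roughly what the character-position version amounts to (my own minimal sketch, not Marco's edited script; it assumes ZL-style '.' word separators and spaces are stripped first):

```python
def slot_counts(line, target, n_slots=10):
    """Count occurrences of `target` in a line with word breaks removed,
    binned into n_slots discrete horizontal slots by character position."""
    s = line.replace('.', '').replace(' ', '')
    counts = [0] * n_slots
    for i in range(len(s) - len(target) + 1):
        if s[i:i + len(target)] == target:
            counts[min(i * n_slots // len(s), n_slots - 1)] += 1
    return counts

# 'qo' occurs at character positions 0 and 7 of this 13-glyph line,
# landing in slots 0 and 5.
print(slot_counts('qokeedy.qokain', 'qo'))  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```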
03-12-2022, 04:46 PM
(03-12-2022, 02:43 PM)MarcoP Wrote: In order to investigate the issues pointed out by Emma, I quickly edited one of the scripts to remove all white space and work on character position in the line, rather than word position. Positions are still mapped into 10 discrete slots.
The contrastive distribution patterns among word sets (such as those containing [k] versus those containing [t]) often seem to recapitulate patterns among the characters themselves, so from that standpoint this seems like a defensible strategy. There may also be differences among sets of words where the characters are initial, medial, or final, but multiple factors could be at play there. One possibility that could be worth investigating is whether abstract word patterns (e.g., body rank structures) display positional preferences of their own.
I had to cut a few points when turning my written paper submission into a presentation, a couple of which had to do with possible implications for "hot topics" in Voynichology, including your (Emma's and Marco's) discovery of statistical anomalies among word-break combinations. What I'm wondering is whether differences between predicted and actual token counts for word-break combinations might be due to positional distribution patterns -- e.g., words that end [y] and words that begin [q] tend to occupy similar positions within lines and paragraphs, so the word-break combination [y.q] is overrepresented, whereas the positions of words that end [n] and words that begin [q] don't share much overlap, so the word-break combination [n.q] is underrepresented.
Since finishing the paper, I've tweaked my position-visualization script a bit to work with word-break combinations. Basically, I subtract the first word from lines when graphing word-initial elements and the last word from lines when graphing word-final elements, so that each pixel (before resizing) now represents a word break rather than a word. Then I can create two graphs for review:
(1) A color graph with words ending with a particular glyph (the left-hand element in the word-break combination) represented in blue and words starting with another glyph (the right-hand element in the word-break combination) represented in red, each normalized separately so that the area of highest frequency is at 100% brightness. In RGB color space, blue plus red equals pink, so pink parts of the image should reveal where the distributions of the two word sets overlap.
(2) A grayscale graph of actual tokens of the word-break combination.
Here are graphs for the word-break combination [n.q], which is underrepresented:
[attachment=7027]
There's not much pink in the graph on the left, but the graph on the right seems to be brighter where the graph on the left is pinker (or purpler).
Here are graphs for the word-break combination [y.q], which is overrepresented:
[attachment=7026]
The graph on the left shows a lot more pink (indicating more overlap), and the brightness in the graph on the right is more evenly spread out throughout the paragraph.
So I think we might have been able to predict that [n.q] would be underrepresented, and that [y.q] would be overrepresented, based just on the degree of overlap in distribution of words ending [n] and [y] and words beginning [q].
I haven't done enough testing to say yet whether this case is typical, but it hints at a possible connection between these two phenomena (uneven positional distributions and statistical anomalies among word-break combinations).
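To make the connection testable, here is a toy sketch of comparing the observed count of a word-break combination against the count predicted if word endings and beginnings were independent (the function and the sample lines are invented for illustration; the positional-overlap graphs above capture much more than these raw counts):

```python
def break_stats(lines, left_glyph, right_glyph):
    """Observed vs expected-under-independence counts of the word-break
    combination [left_glyph.right_glyph] over all adjacent word pairs."""
    pairs = [(a, b) for line in lines for a, b in zip(line, line[1:])]
    n = len(pairs)
    ends = sum(a.endswith(left_glyph) for a, _ in pairs)
    starts = sum(b.startswith(right_glyph) for _, b in pairs)
    observed = sum(a.endswith(left_glyph) and b.startswith(right_glyph)
                   for a, b in pairs)
    expected = ends * starts / n if n else 0.0
    return observed, expected

# Invented toy lines: here [y.q] comes out overrepresented vs chance.
lines = [['ay', 'qo', 'an', 'qy'], ['oy', 'qok', 'dal', 'dy']]
print(break_stats(lines, 'y', 'q'))  # (2, 1.0)
```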
04-12-2022, 12:41 AM
Hi Marco, thanks for that. If we assume that lines tend to contain the same number of glyphs (which is reasonable over long stretches) then it's clear that word-line length isn't strictly important. However, I suppose that we could still have some effect from two sources:
- That some glyphs are so tightly bound to a certain position in words, such as /q/ or /n/, that word-line length would still be important. New instances could only be added to the right-hand side of the line with a new word, so the glyph-line length would be less important.
- Other relationships - such as exclusion or inclusion of similar words or proxy line length control - might still play a part.
I'm a little vague on my thinking, I know, so will do some experiments myself and see if I can't come up with some actual examples of what I think might be happening.
Patrick, I love that we can bring different phenomena together and consider how they combine. Do we know the directional cause of overlap vs under/overrepresentation? Can we exclude the influence of the final glyph of a word from impacting the distribution of the initial glyph?