The Voynich Ninja

Full Version: Rightward and Downward in the Voynich Manuscript - Patrick Feaster
(24-08-2021, 08:26 PM)Emma May Smith Wrote: Interestingly, these aren't consistent with your results in every case, but are the differences enough to have an impact? Could something like this be the cause?
Whether it's the cause or not, I agree that investigating possible effects of different line lengths on rightwardness scores would be very worthwhile.  It would seem odd if line length had no impact on them at all.  I do find that limiting the analysis to lines of a single length (in words) will occasionally "flip" a pair of averages; I've chalked this up to random noise, but it would be extremely interesting if there turned out to be a pattern to it.
(21-08-2021, 05:53 PM)MarcoP Wrote: Three days ago, Patrick Feaster published a new blog post, Rightward and Downward in the Voynich Manuscript. Like the previous one (discussed in an earlier thread), it is full of observations, data, and graphics (that's why reading it took me a few days). Patrick's work is innovative and thought-provoking: not only does he compute interesting quantitative measures, but he also presents them visually, making his results much more accessible.



For instance, these two graphs show line-position statistics for words starting with o- and qo-:

[Image: line-position plots for o- and qo- words]

Words in each line are assigned to one of 10 positions (longer lines contribute more than one word to some positions, while shorter lines leave some positions empty). In the first plot, the Y axis shows the percentage of words starting with o- and qo-.

The plot on the right has the same 10 positions on the X axis and shows the o/qo ratio on the Y axis. It makes visually clear that, in the first half of the line (with the exception of position 2), the two classes of words have similar frequencies; in the second half of the line, the frequency of o- rises while that of qo- drops, so the ratio increases steeply.
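For anyone who wants to reproduce this kind of plot, here is a minimal Python sketch (my own reconstruction, not Patrick's code; it assumes each line has already been split into a list of word strings):

def position_counts(lines, n_bins=10):
    # Count o- and qo-initial words falling into each of n_bins line positions.
    o_counts = [0] * n_bins
    qo_counts = [0] * n_bins
    totals = [0] * n_bins
    for words in lines:
        if len(words) < 2:
            continue  # skip one-word lines to avoid a zero division
        for i, w in enumerate(words):
            # spread word index i over bins 0..n_bins-1 (0-based version of the 1..10 positions)
            pos = int(round((n_bins - 1) * i / (len(words) - 1)))
            totals[pos] += 1
            if w.startswith("qo"):
                qo_counts[pos] += 1
            elif w.startswith("o"):
                o_counts[pos] += 1
    return o_counts, qo_counts, totals

The o/qo ratio plotted on the right is then o_counts[pos] / qo_counts[pos] for each position.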



This research really deserves to be read carefully. But if you don't feel like reading the whole post, examining the graphs and reading the conclusions will give you an idea of Patrick's results.

Hi Patrick, Marco,

I tried replicating the above graphs. I kept paragraph lines only (+P, @P, =P, *P).

All frequencies are relative to the total count of o* + qo* words.
In ZL_ivtff_1r:
o* words count = 6917
qo* words count = 5138

[Image: KY0Gy.png]

[Image: 4Dma8.png]

Not nearly the same as yours...

EDIT: ZL_ivtff_1r graph modified (not sure what happened)
(27-08-2021, 11:33 AM)nablator Wrote: I tried replicating the above graphs. I kept paragraph lines only (+P, @P, =P, *P) and removed half spaces (",").
Not the same...
Your curves do look different from the ones Marco and I came up with, although they still show variation in the ratio over the course of a line.  I wonder whether you might be dividing the line into segments differently than we both did.  (Of course, there's no one "right" way to do this.)  My first impression is that your first group might overlap our groups 1 and 2, and that our last group might overlap your groups 9 and 10 -- or something like that.  Maybe this has to do with rounding up or down when defining group boundaries?
Hi Nablator,
my impression is that there are at least two differences from Patrick's experiments:

1. As Patrick said, it could be that there are differences in how the mapping to 10 positions is done; I used this method (where i is the word position in the range 0..lineLength-1):
  pos=1+int(round(9.0*i/float(lineLength-1)))
I am not sure this is exactly what Patrick did, but it appears to be close enough.

2. qo- words are about 16% of all words: accordingly, the qo- line in Patrick's plot averages about 0.16; in your graph it's about 0.04 (much lower). Maybe you are normalizing in a different way? You don't seem to be dividing the position counts by the total number of words at each position.
(27-08-2021, 01:00 PM)MarcoP Wrote: 1. As Patrick said, it could be that there are differences in how the mapping to 10 positions is done; I used this method (where i is the word position in the range 0..lineLength-1):

  pos=1+int(round(9.0*i/float(lineLength-1)))

I am not sure this is exactly what Patrick did, but it appears to be close enough.

Yes, thanks, rounding could be the issue; I did it differently: int(10*(i-1)/NF) to get an index in [0-9].

@awk -F '.' "BEGIN {} {for (i=1; i<=NF; i++) {if (substr($i, 1, 1) == \"o\") {cc++; cpo[int(10*(i-1)/NF)]++} if (length($i) > 1 && substr($i, 1, 2) == \"qo\") {cc++; cpqo[int(10*(i-1)/NF)]++}}} END {for (i=0; i<10; i++) print cpo[i]/cc\"\t\"cpqo[i]/cc}" zl_paragraphs_no_formatting.txt

As you can see, I'm dividing the counts by cc, the total count of qo + o words. As a check, the totals sum correctly to 1.
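For anyone without awk handy, here is (I believe) the same computation in Python; a sketch that assumes one dot-separated line of words per input line on stdin:

import sys

cc = 0            # total count of o* + qo* words
cpo = [0] * 10    # o* counts per position
cpqo = [0] * 10   # qo* counts per position
for line in sys.stdin:
    words = line.strip().split(".")
    nf = len(words)
    for i, w in enumerate(words, start=1):
        b = int(10 * (i - 1) / nf)   # position index 0..9, truncating like awk's int()
        if w.startswith("o"):
            cc += 1
            cpo[b] += 1
        if w.startswith("qo"):
            cc += 1
            cpqo[b] += 1
for b in range(10):
    print(cpo[b] / cc, cpqo[b] / cc, sep="\t")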
(27-08-2021, 01:00 PM)MarcoP Wrote: 1. As Patrick said, it could be that there are differences in how the mapping to 10 positions is done; I used this method (where i is the word position in the range 0..lineLength-1):
  pos=1+int(round(9.0*i/float(lineLength-1)))

Assuming it's Python, this method disadvantages the first and last positions. EDIT: It's more complicated than that; depending on lineLength, realistic values (much lower than 100) seem less problematic, but I don't know whether the sizes of the intervals assigned to each position average out evenly.

Try:
lineLength = 100
for i in range(0, lineLength):
    # MarcoP's mapping: scale i into 1..10 with rounding
    pos=1+int(round(9.0*i/float(lineLength-1)))
    print(i, pos)

Mine is, I believe, equivalent to:
lineLength = 100
for i in range(0, lineLength):
    # my mapping: equal-width intervals, truncation instead of rounding
    pos=1+int(10.0*i/float(lineLength))
    print(i, pos)

In this case (lineLength = 100) it gives an equal interval (of size 10) to each position. 10x10 = 100, no unfair advantage to any position.

As I don't have Python installed, I used an online interpreter.
(27-08-2021, 02:44 PM)nablator Wrote: Assuming it's Python, this method disadvantages the first and last position....  Mine...gives an equal interval (of size 10) to each position. 10x10 = 100, no unfair advantage to any position.
For those who aren't set up to run the experiment easily, the first method splits a hypothetical hundred-word line into groups of 6, 11, 11, 11, 11, 11, 11, 11, 11, and 6 words.  But here's how the two methods play out with more Voynich-scale lines:

5 words = 1, 3, 5, 8, 10 versus 1, 3, 5, 7, 9
6 words = 1, 3, 5, 6, 8, 10 versus 1, 2, 4, 6, 7, 9
7 words = 1, 3, 4, 5, 7, 9, 10 versus 1, 2, 3, 5, 6, 8, 9
8 words = 1, 2, 4, 5, 6, 7, 9, 10 versus 1, 2, 3, 4, 6, 7, 8, 9
9 words = 1, 2, 3, 4, 5, 7, 8, 9, 10 versus 1, 2, 3, 4, 5, 6, 7, 8, 9
10 words = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 versus 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
11 words = 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10 versus 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
12 words = 1, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10 versus 1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10
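A quick Python 3 loop along these lines reproduces the lists above; note that Python 3's round() rounds halves to even, which matters for the first method:

def pos_marco(i, L):
    # first method (MarcoP): scale and round
    return 1 + int(round(9.0 * i / (L - 1)))

def pos_nablator(i, L):
    # second method (Nablator): equal intervals, truncation
    return 1 + int(10.0 * i / L)

for L in range(5, 13):
    print(L, "words =",
          [pos_marco(i, L) for i in range(L)], "versus",
          [pos_nablator(i, L) for i in range(L)])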

So at this scale, the first method doesn't seem quite as egregiously unequal as it does for a hundred-word line.  But this all goes to underscore how subtle differences in methods of calculation can have big effects.
Hi Nablator,
indeed the two methods are different. As one can also see from Patrick's numbers, your method apparently results in fewer words in positions 5 and 10.
These are the total word counts I get for each position:

1 4809
2 3240
3 3503
4 3093
5 2659
6 3802
7 3456
8 3140
9 3603
10 1278

Of course, which method is best is debatable. If you normalize by the total number of words in each position (the totals listed above) instead of by the sum of o+qo words, you get something closer to Patrick's plot. The numbers on the Y axis then actually correspond to pattern frequency, and the anomalies at positions 5 and 10 due to how positions are computed disappear.
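In code, the two normalizations differ only in the denominator (a sketch, reusing the hypothetical position_counts helper from earlier in the thread):

o_counts, qo_counts, totals = position_counts(lines)
grand = sum(o_counts) + sum(qo_counts)
share_of_pool = [o / grand for o in o_counts]               # Nablator's normalization
per_position  = [o / t for o, t in zip(o_counts, totals)]   # frequency at each position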

Top is my attempt at writing in Python what you did with awk.
Middle is the same, but with the normalization I described above.
Bottom: my re-computation of Patrick's plot.

[Image: the three plots described above]
The plots are much closer with the normalization, but there are still differences, most notably at position 2. Unless I messed something up, these are likely due to how positions are computed. The normalized version of your method appears to produce the smoother results (this was already apparent from your ratio plots, which are not affected by the change in normalization).
I've been experimenting with different strategies for displaying rightward/downward tendencies and thought I'd share a few examples of one particular kind of display I think seems promising.  My examples will involve statistics for individual glyphs.

I start by dividing paragraphs into five sectors: 1 = first line, 5 = last line, 2-4 = whatever remains in the middle, divided into thirds.  Next, I divide the lines into fifths (for a first analysis) and sevenths (for a second analysis), in each case splitting off the first and last adjacencies as well.  This maps positions onto a 2D grid with unequally sized "cells," 5x7 in one case, 5x9 in the other.
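(In Python, one plausible reading of the sector assignment -- a sketch, not my actual script; line_index is 0-based and n_lines is the paragraph's line count:)

def paragraph_sector(line_index, n_lines):
    # 1 = first line, 5 = last line, 2-4 = middle lines divided into thirds
    if line_index == 0:
        return 1
    if line_index == n_lines - 1:
        return 5
    middle = n_lines - 2                     # lines strictly inside the paragraph
    return 2 + (line_index - 1) * 3 // middle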

I then track pairs of glyphs adjacency by adjacency through each line, counting the first glyphs and second glyphs assigned to each line segment.  That may seem like a clunky approach, and it comes partly from re-using a script I designed with other things in mind; but still, it gives me totals for all glyph tokens in first position, second position, last position, and second-to-last position, as well as a pair of totals for each group in the middle with their "windows" offset by just one position.  The second and second-to-last glyphs are also factored into the outermost groups at either extremity.  Finally, I divide each result by the total count in each of the cells in the grid to normalize the figures for group size.

On the left of my display, I show the results for the first two positions; on the right, I show the results for the last two positions; and in the middle I show the results for intermediate positions, with the fifths and sevenths "interleaved," and with the pairs of measurements offset by a slight (arbitrary) distance along the x axis.  Each of the four "interleaved" series of plot points is connected by a separate line.  On the top, I show the separate results by paragraph sector color-coded in a way that's meant to suggest a blue-to-red spectrum: (1) dark blue, (2) light blue, (3) gray, (4) yellow-orange, (5) red.  On the bottom, in green, I show the aggregate values for all paragraph sectors lumped together.

Note: I count benched gallows and sequences of multiple [i] or [e] as single glyphs.

Here's a sample display showing data for [Sh] in Quires 1-3.

[Image: voynich-q1-3-Sh.jpg]

The bottom shows the composite curve for all paragraph positions sloping gently downward from around 4.5% to 2% of total glyphs by line position.  Meanwhile, the top shows further variation across paragraphs.  The points in dark blue indicate that [Sh] is disproportionately well-represented in the first lines of paragraphs, while the points in red show that it's markedly underrepresented in the last lines.  And then there's a snarl in the middle, although light blue (first third) seems to fall mostly below yellow-orange (last third).  Similar patterns turn up for [Sh] in other parts of the manuscript, especially that first-line-of-paragraph boost, which seems to be a constant.

Here's another sample, this one for [ii] in the whole of Currier B, as distinct from [i] and [iii].

[Image: voynich-b-ii.jpg]

There's not much to the aggregate (green) curve at the bottom, except for a shallow middle-of-the-line dip.  But the display at the top shows the different paragraph positions sorting rather nicely into order, not perfectly, but still perceptibly (I think): from the bottom, dark blue, light blue, gray, yellow-orange, red.  If wishful thinking isn't causing me to see things here, this would point to [ii] becoming more common the further downward we go in a paragraph.  For what it's worth, [i] doesn't appear to share this pattern.

Some may argue that Currier B is too heterogeneous to be meaningfully analyzed, so here's an equivalent graph limited to Quire 20:

[Image: voynich-q20-ii.jpg]

A bit more mixed up, but blue still looks like it's sorting mainly to the bottom, red and yellow to the top, and gray into the middle.

Other glyphs seem to have distinctive curve shapes more than simple rightward or leftward preferences that could be summed up in a single numerical score.  A case in point is [q]; here's a display for it across all of Currier B:

[Image: voynich-b-q.jpg]

That odd double hump seems to be typical of [q] across different sections, and it looks a bit to me like the curves Marco plotted for words beginning with [qo].  Once again, there's also notable variation by position within paragraphs -- particularly in the last line (red), although the fact that last lines are often atypically short might factor into the deviation in shape.

I can say more about other patterns these displays seem to be turning up on a glyph-by-glyph basis if there's interest, but I'm still not sure I've hit upon the best approach to visualizing what's going on.  Does anyone have other ideas about what an optimal display might look like?  What can't we get a good look at here yet?
Suggestion accepted!

The "rightwardness" and "downwardness" measures are dimensionless coordinates that allow us to collate position information from multiple paragraphs... in search of layout-level patterns, I suppose.   Perhaps such data could be represented directly as two-dimensional heat maps or histograms.  In computing the example graphics below, I have ignored interesting tweaks and ramifications discussed in pfeaster's posts.  They can easily be incorporated if it seems worthwhile.

Given some string-pattern of interest, let us compute the rightwardness and downwardness values for every match found in a text sample.  The result is a list of {rightwardness, downwardness} ordered pairs that are coordinates in the dimensionless paragraph-space.  We can immediately bin these string-pattern matches and plot them as a two-dimensional histogram.  For example, here are counts of words beginning with "qo" in all of the 'paragraph'-style text, represented as grey level.  Sisu the cat chose 10 word bins and 5 line bins by walking across my keyboard:

[attachment=5776]
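(For concreteness, a minimal numpy sketch of the coordinate computation and binning; my_paragraphs is a hypothetical nested list of paragraphs -> lines -> words, and the refinements mentioned above are ignored:)

import numpy as np

def paragraph_coords(paragraphs, pattern="qo"):
    # Return (rightwardness, downwardness) in [0, 1) for every matching word.
    pts = []
    for par in paragraphs:
        n_lines = len(par)
        for j, words in enumerate(par):
            for i, w in enumerate(words):
                if w.startswith(pattern):
                    pts.append((i / len(words), j / n_lines))
    return np.array(pts)

pts = paragraph_coords(my_paragraphs)   # hypothetical input
H, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1],
                                   bins=[10, 5], range=[[0, 1], [0, 1]])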

Following practice in this thread, the counts can be scaled by the total number of words in each bin to yield fractional populations:

[attachment=5777]

Tabulating only rightwardness amounts to using only one downwardness bin:

[attachment=5783]

It is characteristic of such data displays that patterns are visible, but quantitative values are difficult to extract.  Here the fractional populations from the previous one-dimensional histogram are shown on Cartesian axes (orange), together with analogous results for words beginning with "o" (blue):

[attachment=5784]

Thus the plots posted by pfeaster and MarcoP are satisfactorily reproduced, so there is reason to believe that we are on the same page computationally;  remaining differences may be due to the ivtt extract used.

Finally, we can plot a two-dimensional histogram of the ("o"-initial)÷("qo"-initial) ratio:

[attachment=5785]

The known rightward and downward trend is visible.
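(Computationally, given count histograms H_o and H_qo produced as in the earlier sketch for the two patterns, the ratio map is a single guarded division:)

import numpy as np
# NaN where the qo bin is empty, so empty bins don't fake a trend
ratio = np.divide(H_o, H_qo, out=np.full_like(H_o, np.nan), where=H_qo > 0)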

No specific question is addressed by these particular graphics, but the general form may be a natural way of representing paragraph-level structure.  Binning artifacts discussed earlier in the thread are naturally still present.