The Voynich Ninja - [split] Merged words across lines


I computed a few measures in order to compare Voynichese with known language texts with and without hyphenation. As always, I could have made errors, so be careful.

I used these texts:

the ten pages from a medieval Bonaventura manuscript I transcribed and shared; 283 lines, 65 of which end with a broken word that continues on the next line
the English printed text I mentioned in the post I previously linked; punctuation, hyphen/minus included, was removed; 1340 lines, 218 hyphenations
the same English text re-arranged so that no words are broken: with a fixed maximum line length of 40 characters, a word that does not fit is moved to the next line; 1280 lines, no hyphenation
the whole VMS in Takahashi's transliteration; 5207 lines
VMS Currier A, using the Zandbergen-Landini transliteration (ZL_ivtff_1b.txt), ignoring uncertain spaces; 1939 lines
VMS Currier B, as above; 2892 lines
VMS Quire 20 "Stars"; Takahashi's transliteration; 1083 lines
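
For anyone who wants to reproduce the "fixed" control text, here is a minimal sketch of the re-arrangement described in the list above. It is not necessarily the code that was actually used: the function name `rewrap` and the greedy strategy are assumptions, and a word longer than the limit is simply left on a line of its own.

```python
def rewrap(words, max_len=40):
    """Greedy re-wrap: a word that does not fit is moved to the next line."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        # Assumption: an over-long word is kept whole on its own line.
        if len(candidate) <= max_len or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
```

Feeding it the full word list of the text (punctuation already removed) yields lines with no broken words, so any "across" matches in that version are accidental.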

Each line was split into three parts:

first word
"inside" words
last word

E.g.:
qualities / of his mind and most of all / in
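
A minimal sketch of this three-way split, assuming words are separated by whitespace (real transliteration files may use other separators, e.g. dots). The name `split_line` and the handling of one-word lines (counted as a first word only) are my assumptions, not something stated in the post.

```python
def split_line(line):
    """Split a line into (first word, inside words, last word)."""
    words = line.split()
    if not words:
        return None, [], None
    if len(words) == 1:
        # Assumption: a one-word line has a first word but no last word.
        return words[0], [], None
    return words[0], words[1:-1], words[-1]


first, inside, last = split_line("qualities of his mind and most of all in")
print(first, "/", " ".join(inside), "/", last)
```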

For each line, I tried concatenating the last word with the first word of the following line. I then checked whether the resulting word occurs in the text. The percentage of matches appears in the "across %" column. Unlike Matthias, I tested all lines, including those where the last word is long. I also treated paragraph breaks like all other line breaks.
The same processing was done for all consecutive pairs of words in the "inside" part of the line. See the "inside %" column.
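
The two percentages can be sketched as follows. This is only an illustration under stated assumptions: words are whitespace-separated, the "text" against which matches are checked is the set of all word tokens in the same file (line-edge tokens included), and `match_rates` is a hypothetical name.

```python
def match_rates(lines):
    """Return (across %, inside %) for a list of text lines."""
    rows = [line.split() for line in lines if line.split()]
    # Assumption: a "match" means the joined string is a token of the same text.
    vocabulary = {word for row in rows for word in row}

    # Last word of each line joined with the first word of the next line.
    across = [prev[-1] + nxt[0] for prev, nxt in zip(rows, rows[1:])]

    # Consecutive pairs among the "inside" words (first and last excluded).
    inside = []
    for row in rows:
        inner = row[1:-1]
        inside.extend(a + b for a, b in zip(inner, inner[1:]))

    def pct(candidates):
        if not candidates:
            return 0.0
        hits = sum(c in vocabulary for c in candidates)
        return 100.0 * hits / len(candidates)

    return pct(across), pct(inside)
```

With a hyphenated text one would expect the first number to be clearly higher than the second, since a broken word often also occurs unbroken somewhere else in the text.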

[attachment=4578]

The two texts with hyphenation have many more matches across lines than inside. The text with no hyphenation (Lincoln_fixed) has uniform low numbers across lines and inside.
The VMS has many more matches inside lines than across. This is the opposite of what happens with hyphenation: I take this as evidence that words are not simply broken across lines. It could still be that words are broken across lines, but at least one of the two halves is transformed in some way (IMO this scenario is not particularly likely, but who knows?).

These numbers also point to the fact that combining two Voynichese words is much more likely to result in a "legal" word than in English or Latin. This feature is mentioned in Timm and Schinner's paper and I think it might also be related to [link], but I am sure that more can be said and researched about the subject.

Why do across-lines combinations result in fewer matches? As first observed by Currier, the first and last words of lines are different from the rest. For instance, word-initial EVA:y is much more frequent at line-start than elsewhere; the last word often ends with EVA:m/g: combining two such words is less likely to result in a legal word than taking two consecutive "ordinary" words from inside a line. For instance, dam+yshor results in the bizarre digraph "my"; yshor+dam would have been more plausible, but Voynich line-effects work in the opposite direction.
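
A rough way to check these line effects on any transliteration is to compare how often words start with y at line start versus elsewhere, and how often they end with m or g at line end versus elsewhere. The sketch below assumes EVA text with whitespace-separated words; the function names are hypothetical, and the input in the usage note is made up purely to show the call.

```python
def share(words, predicate):
    """Fraction of words satisfying predicate (0.0 for an empty list)."""
    return sum(1 for w in words if predicate(w)) / len(words) if words else 0.0


def line_edge_rates(lines):
    rows = [line.split() for line in lines if line.split()]
    first = [row[0] for row in rows]
    not_first = [w for row in rows for w in row[1:]]
    last = [row[-1] for row in rows]
    not_last = [w for row in rows for w in row[:-1]]

    def starts_y(word):
        return word.startswith("y")

    def ends_mg(word):
        return word.endswith(("m", "g"))

    return {
        "y_initial_first": share(first, starts_y),
        "y_initial_elsewhere": share(not_first, starts_y),
        "mg_final_last": share(last, ends_mg),
        "mg_final_elsewhere": share(not_last, ends_mg),
    }
```

For example, `line_edge_rates(["yshor daiin chedy dam"])` is a toy call on an invented line, not on an actual transcribed one; on a real file, large gaps between the "first"/"last" and "elsewhere" values would reflect the effect described above.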

I also computed the % of hapax legomena on word tokens and average word length for the three positions: first, inside and last.
The two texts with hyphenation show a higher frequency of hapax legomena in the first and last position: this is due to words broken across lines, which rarely result in accidentally legal words. Hyphenation makes average word length rather constant in different positions. On the contrary, when fixed length is applied, the first word of the line tends to be longer. I guess this is because long and short words tend to alternate and short words are often squeezed at the end of the line, leaving a longer word for the beginning of the next line (I think it should be possible to check if this explanation works, or to find a better explanation, by closely examining fixed-line-length texts).
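
A minimal sketch of the per-position measures, under my own assumptions: a hapax legomenon is a word that occurs exactly once in the whole text, the percentage is taken over the tokens found in each position, one-word lines count as "first" only, and `position_stats` is a hypothetical name.

```python
from collections import Counter


def position_stats(lines):
    """Return {position: (hapax % on tokens, mean word length)}."""
    rows = [line.split() for line in lines if line.split()]
    counts = Counter(word for row in rows for word in row)
    groups = {
        "first": [row[0] for row in rows],
        "inside": [w for row in rows for w in row[1:-1]],
        # Assumption: a one-word line contributes a first word only.
        "last": [row[-1] for row in rows if len(row) > 1],
    }
    stats = {}
    for name, tokens in groups.items():
        if not tokens:
            stats[name] = (0.0, 0.0)
            continue
        hapax_pct = 100.0 * sum(counts[t] == 1 for t in tokens) / len(tokens)
        mean_len = sum(len(t) for t in tokens) / len(tokens)
        stats[name] = (hapax_pct, mean_len)
    return stats
```

Running this on the hyphenated and the fixed-length versions of the same text is one way to examine the explanation proposed above for the longer line-initial words.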

Hapax results for the VMS are higher in both the first and last position. This is similar to what happens with hyphenation, but that explanation is ruled out by Matthias' method: combining hypothetical word fragments across lines does not result in Voynichese words. As discussed above, Voynichese hapax legomena at line boundaries are more likely morphological anomalies, as originally observed by Currier.
Voynich words in the first position are consistently longer: this appears to be further evidence in favour of a fixed-length layout.
For words in the last position, things are more complex. It seems clear that they are shorter in Q20, but in Currier A they appear to be slightly longer than words inside lines.

(19-07-2020, 07:03 AM)MarcoP Wrote: It could still be that words are broken across lines, but at least one of the two halves is transformed in some way (IMO this scenario is not particularly likely, but who knows?).

Hi, Marco,

thanks for your work. - I think your idea that words are broken across lines but at least one of the two halves is transformed in some way, is not so far-fetched. The first and last words in a line may have a "special meaning". This would certainly clash with breaks across lines and would have to be adjusted somehow. In other words: Laafu and line breaks are a problem.

Marco said, "These numbers also point to the fact that combining two Voynichese words is much more likely to result in a "legal" word than in English or Latin. This feature is mentioned in Timm and Schinner's paper and I think it might also be related to [link], but I am sure that more can be said and researched about the subject."

This is the uneasiness I had with interpreting these results from the beginning. I tried to think of some type of "control" to eliminate this general effect from muddying the results. Marco's "inside" tests are one way and I agree the results argue that split across lines is not happening.

But I also agree with bi3mw that the fact that across-line matches produced measurably fewer legal words than the inside "random" matches lends support to the idea that line-end words (or word portions) are modified from "inner" words.

But wouldn't mere physical placement at line ends be enough? What scenario for decoding/understanding would necessitate modifying the ends of lines when it is physically evident what the ends of lines are from the script?

(19-07-2020, 06:23 PM)MichelleL11 Wrote:
...What scenario for decoding/understanding would necessitate modifying the ends of lines when it is physically evident what the ends of lines are from the script?

Paragraph end characters (m, g) which were sometimes drawn like the "-tis/-tes" or "-cis/ces" abbreviation or a "-rum" symbol, were moderately common in medieval texts and were not only at the ends of paragraphs, but sometimes at the ends of sections within a paragraph. It was a way of expressing section ends, but sometimes also etcetera, osv., "and so on" rather than the more traditional etcetera symbol.

Here are some examples:

[attachment=4583]

(19-07-2020, 07:03 AM)MarcoP Wrote: The two texts with hyphenation have many more matches across lines than inside. The text with no hyphenation (Lincoln_fixed) has uniform low numbers across lines and inside.
The VMS has many more matches inside lines than across. This is the opposite of what happens with hyphenation: I take this as evidence that words are not simply broken across lines.

I agree with MarcoP.
Until more evidence comes to light, I think we can put simple word-breaks-across-lines on the shelf.
Summary: Are vords broken across lines? -- Probably not.

RobGea Wrote:I agree with MarcoP.
Until more evidence comes to light, I think we can put simple word-breaks-across-lines on the shelf.
Summary: Are vords broken across lines? -- Probably not.

We have to keep in mind that lines with line-breaks when the columns are wide (as they are in the VMS) are going to be less frequent than unbroken lines. There might only be two or three (or none) per paragraph. So, proportionately, we are less likely to find line-broken patterns that match in-line patterns.

And there are two other issues... padding, and marker symbols like the diagrams I posted just above.

Maybe we could say, for now: "Assuming there is no filler text at the ends, and assuming there are no line-end-marker symbols at the ends, VMS vords are probably not broken across lines."

(20-07-2020, 11:13 PM)-JKP- Wrote: Maybe we could say, for now: "Assuming there is no filler text at the ends, and assuming there are no line-end-marker symbols at the ends, VMS vords are probably not broken across lines."

It's a little like saying "Assuming that the earth has not been invaded by CO2-emitting invisible aliens, global warming is caused by humans".

Of course I am exaggerating. I know that the evidence I provided is not conclusive and I may have made errors (while evidence for human-generated global warming is totally solid). I also know that the assumptions you picked are not as far-fetched as my invisible aliens. But why pick only those two? One can list countless ideas that could potentially obfuscate hyphenation. Assuming that Voynichese does not have a rule that only allows hyphenation for hapax legomena, assuming that the text is not an Alberti cipher, assuming that lines are not to be read bottom-to-top and left-to-right, assuming that the first half of each line reads in the same direction as the second half... and so on ad infinitum.

Yet, in my opinion, speculative ideas do not have the same weight as evidence. RobGea was right in writing "until more evidence comes to light". I know that this point of view is not shared by everybody, e.g. people have written that collecting evidence is quite an effort, so they prefer not to take part in it, yet they are happy to speculate. I respect this point of view, though I believe it makes discussions less constructive.

About the specific ideas you put forward:

Filler text at the end of lines would result in line-end words being longer than the others, and this is not the case: according to Elmar Vogt, they are actually shorter. Also, the idea of statistically relevant filler text is somewhat anachronistic, since it appears to have been introduced by Trithemius half a century after the VMS was created. In his application of the method, the cipher-text has a TTR (type-token ratio) close to 1, while (as amply discussed by Koen) the VMS has figures similar to plain-text.
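
For reference, the type-token ratio is just the number of distinct words divided by the total number of word tokens. A minimal sketch (the function name is my own; note that TTR depends on text length, so only samples of comparable size should be compared):

```python
def type_token_ratio(tokens):
    """Distinct words (types) divided by total words (tokens)."""
    tokens = list(tokens)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy input: 3 types over 4 tokens.
print(type_token_ratio("a b a c".split()))
```

A value close to 1 means that almost no word is ever repeated.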

I guess that by "line-end-marker" symbols you mean specific abbreviations, as in the tiny fragments with "etc" you posted. I must say that calling "etc" a "marker" seems misleading to me. As you probably know, I like the idea that EVA:m/g are abbreviations, but I am not aware of this hypothesis resulting in evidence for hyphenation.
Anyway, you could explore this possibility, whatever you mean, and show that this results in defensible figures for hyphenation.

Are vords broken across lines? -- There is no evidence suggesting that they are, while there is some evidence suggesting that they are not.

(21-07-2020, 05:42 AM)MarcoP Wrote:
...But why pick only those two? ..

Because I can see reasons for those being more likely than some of the other possibilities, and proposing too many possibilities at once doesn't appear to be as productive in terms of discussion.

Quote:About the specific ideas you put forward:

Filler text at the end of lines would result in line-end words being longer than the others, and this is not the case: according to Elmar Vogt, they are actually shorter. Also, the idea of statistically relevant filler text is somewhat anachronistic, since it appears to have been introduced by Trithemius half a century after the VMS was created. In his application of the method, the cipher-text has a TTR close to 1, while (as amply discussed by Koen) the VMS has figures similar to plain-text.

Why would line-end filler be longer? It seems to me if you are filling whatever space is left to make the right-hand column line up (to double-justify the text), then the filler text might be quite short and would not necessarily affect every line. Filler can have spaces.

There is evidence in the VMS that some of it might be filler (put "might" in capitals because I'm not at all sure; rather than filler, this might be an artifact of vertically-written blocks of text). These two folios in particular suggest the possibility: [link] f81v. They appear to be related in terms of general layout, and they are on the same sheet, and yet one is ragged (left-justified) and the other is double-justified.

Quote:I guess that by "line-end-marker" symbols you mean specific abbreviations, as in the tiny fragments with "etc" you posted. I must say that calling "etc" a "marker" seems misleading to me. As you probably know, I like the idea that EVA:m/g are abbreviations, but I am not aware of this hypothesis resulting in evidence for hyphenation.

I'm hesitant to call them specific abbreviations because symbols in the VMS that resemble medieval abbreviations (in form or position) might have a different function in the VMS; that's why I substituted the word "marker" (it might not be the best word, but at least it does not carry as many assumptions).

The VMS frequently uses a glyph that resembles a paragraph-end marker (g, m) in many of the same positions as one would find it in medieval plaintext. Such a glyph might obfuscate line-break statistics if it is treated as a letter rather than as... a marker or something else.

(21-07-2020, 06:31 AM)-JKP- Wrote: Why would line-end filler be longer? It seems to me if you are filling whatever space is left to make the right-hand column line up (to double-justify the text), then the filler text might be quite short and would not necessarily affect every line.

I agree, it could just be the very short words (up to three letters) that were appended to the end of a line. Anyway, I have this impression with some lines.

edit: I mean something like this:

[attachment=4588]

Yes, something like that.

Here's another example...

If you look at [link], you'll notice it's quite ragged on the right. The preceding folio [link] is much straighter, with many short tokens. Maybe meaningful? Maybe filler to straighten out the margin?

[attachment=4589]
