19-07-2020, 07:03 AM
I computed a few measures in order to compare Voynichese with known language texts with and without hyphenation. As always, I could have made errors, so be careful.
I used these texts:
- the ten pages from a medieval Bonaventura manuscript I transcribed and shared [link unavailable]; 283 lines, 65 of which end with a broken word that continues on the next line
- the English printed text I mentioned in the post I previously linked; punctuation, including hyphens and minus signs, was removed; 1340 lines, 218 hyphenations
- the same English text re-arranged so that no words are broken: with a maximum line length of 40 characters, a word that does not fit is moved to the next line; 1280 lines, no hyphenation
- the whole VMS in Takahashi's transliteration; 5207 lines
- VMS Currier A, using the Zandbergen-Landini transliteration (ZL_ivtff_1b.txt), ignoring uncertain spaces; 1939 lines
- VMS Currier B, as above; 2892 lines
- VMS Quire 20 "Stars"; Takahashi's transliteration; 1083 lines
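The re-arrangement of the English text into unbroken lines (third item above) can be sketched roughly as follows. This is only an illustration, not the script I actually ran: the function name `rewrap` is made up, and I assume here that the spaces between words count towards the 40-character limit.

```python
def rewrap(words, max_len=40):
    """Greedy fill: a word that does not fit is moved to the next line."""
    lines = []
    current = ""
    for word in words:
        candidate = word if not current else current + " " + word
        # Assumption: a word longer than max_len gets a line of its own.
        if len(candidate) <= max_len or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    # Invented sample text, only to show the behaviour.
    sample = "one two three four five six seven eight nine ten eleven twelve"
    for line in rewrap(sample.split(), max_len=20):
        print(line)
```

No word is ever split, so the only line effect left in such a text is the one caused by the length limit itself.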
Each line was split into three parts:
- first word
- "inside" words
- last word
For example: qualities / of his mind and most of all / in
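The three-way split can be written in a few lines. A minimal sketch, with a hypothetical helper name `split_line`; how to treat lines with a single word is my assumption here (such a word is counted as "first" only).

```python
def split_line(line):
    """Return (first word, list of inside words, last word) for one line."""
    words = line.split()
    if not words:
        return None, [], None
    if len(words) == 1:
        # Assumption: a one-word line has a first word but no last word.
        return words[0], [], None
    return words[0], words[1:-1], words[-1]


if __name__ == "__main__":
    first, inside, last = split_line("qualities of his mind and most of all in")
    print(first, "/", " ".join(inside), "/", last)
```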
For each line, I tried concatenating the last word with the first word of the following line. I then checked whether the resulting word occurs in the text. The percentage of matches appears in the "across %" column. Unlike Matthias, I tested all lines, including those whose last word is long. I also treated paragraph breaks like all other line breaks.
The same processing was done for all pairs of consecutive words in the "inside" part of the line. See the "inside %" column.
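The two percentages can be computed along these lines. This is a sketch under assumptions, not my exact code: `match_percentages` is a made-up name, and "occurs in the text" is taken to mean that the concatenation is one of the word types of the same text, exactly as written.

```python
def match_percentages(lines):
    """Return (across %, inside %) for a list of text lines."""
    rows = [line.split() for line in lines if line.split()]
    vocabulary = {word for row in rows for word in row}

    # Last word of each line + first word of the following line.
    across = [a[-1] + b[0] for a, b in zip(rows, rows[1:])]

    # Consecutive pairs within the "inside" part (first and last word excluded).
    inside = []
    for row in rows:
        middle = row[1:-1]
        inside.extend(x + y for x, y in zip(middle, middle[1:]))

    def percent(candidates):
        if not candidates:
            return 0.0
        hits = sum(1 for c in candidates if c in vocabulary)
        return 100.0 * hits / len(candidates)

    return percent(across), percent(inside)


if __name__ == "__main__":
    # Invented toy text: "be" + "low" across the break gives "below".
    toy = ["the in side inside be", "low a b below"]
    print(match_percentages(toy))
```

Paragraph breaks get no special treatment here, since the lines are simply processed in sequence.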
[attachment=4578]
The two texts with hyphenation have many more matches across lines than inside. The text with no hyphenation (Lincoln_fixed) has uniformly low numbers both across lines and inside.
The VMS has many more matches inside lines than across. This is the opposite of what happens with hyphenation: I take this as evidence that words are not simply broken across lines. It could still be that words are broken across lines, but at least one of the two halves is transformed in some way (IMO this scenario is not particularly likely, but who knows?).
These numbers also point to the fact that combining two Voynichese words is much more likely to result in a "legal" word than combining two words in English or Latin. This feature is mentioned in Timm and Schinner's paper and I think it might also be related to [link unavailable], but I am sure that more can be said and researched about the subject.
Why do across-lines combinations result in fewer matches? As first observed by Currier, the first and last words of lines are different from the rest. For instance, word-initial EVA:y is much more frequent at line-start than elsewhere, and the last word often ends with EVA:m or EVA:g. Combining two such words is less likely to result in a legal word than taking two consecutive "ordinary" words from inside a line. For instance, dam+yshor results in the bizarre digraph "my"; yshor+dam would have been more plausible, but Voynich line-effects work in the opposite direction.
I also computed the % of hapax legomena on word tokens and average word length for the three positions: first, inside and last.
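A sketch of how these two measures can be obtained per position. The name `position_stats` is hypothetical, and I assume that a hapax legomenon is a word occurring exactly once in the whole text, with the percentage taken over the tokens found in each position.

```python
from collections import Counter


def position_stats(lines):
    """Map each position to (hapax % on tokens, average word length)."""
    rows = [line.split() for line in lines if line.split()]
    counts = Counter(word for row in rows for word in row)

    groups = {"first": [], "inside": [], "last": []}
    for row in rows:
        groups["first"].append(row[0])
        groups["inside"].extend(row[1:-1])
        if len(row) > 1:  # assumption: a one-word line has no "last" word
            groups["last"].append(row[-1])

    stats = {}
    for name, tokens in groups.items():
        if not tokens:
            stats[name] = (0.0, 0.0)
            continue
        hapax = sum(1 for t in tokens if counts[t] == 1)
        mean_len = sum(len(t) for t in tokens) / len(tokens)
        stats[name] = (100.0 * hapax / len(tokens), mean_len)
    return stats


if __name__ == "__main__":
    # Invented toy lines, only to show the output format.
    for position, values in position_stats(["aa b b cc", "dd b b aa"]).items():
        print(position, values)
```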
The two texts with hyphenation show a higher frequency of hapax legomena in the first and last position: this is due to words broken across lines, whose fragments rarely happen to be legal words. Hyphenation keeps average word length rather constant across positions. In contrast, when a fixed maximum line length is applied without hyphenation, the first word of the line tends to be longer. I guess this is because long and short words tend to alternate and short words are often squeezed in at the end of the line, leaving a longer word for the beginning of the next line (I think it should be possible to check if this explanation works, or to find a better explanation, by closely examining fixed-line-length texts).
Hapax results for the VMS are higher in both the first and last position. This is similar to what happens with hyphenation, but this possibility is ruled out by Matthias' method: combining hypothetical word fragments across lines does not result in attested Voynichese words. As discussed above, Voynichese hapax legomena at line boundaries are more likely to be morphological anomalies, as originally observed by Currier.
Voynich words in the first position are consistently longer: this appears to be further evidence in favour of a fixed-length layout.
For words in the last position, things are more complex. It seems clear that they are shorter in Q20, but in Currier A they appear to be slightly longer than words inside lines.