03-05-2022, 02:01 PM
A couple of years ago, when I was playing around with entropy statistics, I also wanted to test the impact of spaces. I don't recall if I ever published these data, but I do remember emailing with Marco about it. We concluded that a good way to test spaces would be to remove the variable entirely: strip all spaces from every text tested, to level the playing field.
Predicted outcome:
- h1: "space" is a frequent character, so removing it flattens the character distribution and may increase h1.
- h2: I'd expect an increased h2 in all texts, because removing spaces will create novel bigrams where the endings and beginnings of words meet.
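For anyone who wants to try this kind of test, here is a minimal sketch in Python. It assumes h1 means the first-order character entropy and h2 the conditional entropy of a character given the one before it (joint bigram entropy minus the entropy of the first character), which is the usual reading in Voynich entropy discussions; the function names are my own and this is not necessarily the exact tooling behind the graphs.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy in bits of a Counter of observed symbols."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def h1(text):
    """First-order character entropy."""
    return shannon_entropy(Counter(text))

def h2(text):
    """Conditional entropy of a character given its predecessor
    (assumed definition: H(bigram) - H(first character of the bigram))."""
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    return shannon_entropy(bigrams) - shannon_entropy(firsts)

def strip_spaces(text):
    """Remove every whitespace character, including line breaks."""
    return "".join(text.split())

sample = "the quick brown fox jumps over the lazy dog"
print(h1(sample), h2(sample))
print(h1(strip_spaces(sample)), h2(strip_spaces(sample)))
```

On a short sample like this the numbers mean very little; the comparison only becomes meaningful on texts of a reasonable and comparable length.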
Results:
[attachment=6477]
The big red cloud is the comparison corpus of medieval texts in various languages. Purple is the same corpus with spaces removed. As predicted, there is a general shift towards the top-right. Across these texts, h2 increases by between 3% (some Latin texts) and 16% (an English text). The median increase is 9%.
The green dots are various sections of the VM in EVA. Their h1 does not change much, but h2 does increase, as predicted. The increase for EVA texts is around 10%, close to the median. In other words, taking spaces out of the equation does not help EVA catch up with other texts at all, since the h2 of those texts increases just as much (if not more).
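The percentages here are relative changes in h2 per text. A small sketch of that arithmetic follows; the h2 values in it are made-up placeholders chosen for round numbers, not the measurements behind the graph, and the function names are hypothetical.

```python
from statistics import median

def relative_increase(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

def median_increase(pairs):
    """Median percentage change over (before, after) pairs."""
    return median(relative_increase(b, a) for b, a in pairs)

# Placeholder (h2 with spaces, h2 without spaces) pairs, NOT real data.
placeholder_pairs = [(2.0, 2.1), (2.0, 2.2), (2.0, 2.4)]
print(median_increase(placeholder_pairs))
```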
Finally, the blue dot is a slightly modified Herbal A, which tries to mitigate EVA's most entropy-reducing characteristics. Benched gallows were unstacked, then benches were replaced by novel glyphs. The sequences "ain" and "aiin" were also replaced. Here, h1 does increase, as it does in normal texts. Moreover, h2 goes up by 15%, which would be the biggest gain if it weren't for one English text.
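To give an idea of what such a modification can look like, here is a rough sketch. Everything specific in it is an assumption on my part for illustration: the unstacking order (gallows first, then bench), the uppercase placeholder letters standing in for the novel glyphs, and the choice to map "ain" and "aiin" to single glyphs. The actual substitutions used for the blue dot may differ.

```python
# Ordered substitutions: unstacking must run before bench replacement.
# All replacement symbols are hypothetical placeholders.
REPLACEMENTS = [
    # 1. Unstack benched gallows (assumed order: gallows, then bench).
    ("cth", "tch"), ("ckh", "kch"), ("cph", "pch"), ("cfh", "fch"),
    # 2. Replace benches with single novel glyphs.
    ("ch", "C"), ("sh", "S"),
    # 3. Replace the frequent sequences (assumed: one glyph each).
    ("aiin", "N"), ("ain", "M"),
]

def modify_eva(text):
    """Apply the substitutions above, in order, to an EVA string."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

print(modify_eva("cthy daiin chol shedy dain"))
```

Because each multi-character EVA sequence collapses into one symbol, the transformed text has a larger alphabet and fewer highly predictable bigrams, which is the effect the modification is after.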
Here is the same graph again, but only with texts that had their spaces removed:
[attachment=6478]
Conclusion: accounting for some of the more obvious effects of EVA and eliminating spaces as a variable is not enough to fix Voynichese's entropy problem.