The Voynich Ninja

Spaces and Entropy
A couple of years ago, when I was playing around with entropy statistics, I also wanted to test the impact of spaces. I don't recall if I ever published these data, but I do remember emailing with Marco about it. We concluded that a good way to test spaces would be to remove the variable entirely: strip all spaces from every text tested, to level the playing field.

Predicted outcome:
- h1: "space" is a frequent character, so removing it may increase h1.
- h2: I'd expect an increased h2 in all texts, because removing spaces will create novel bigrams where the endings and beginnings of words meet.
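
For readers who want to try the comparison themselves, here is a minimal sketch of the measurement. It assumes h1 is the single-character entropy and h2 is the conditional entropy of a character given the one before it; the thread does not spell out the exact definitions or tooling used, so the function names and the toy sample string are illustrative only.

```python
import math
from collections import Counter


def h1(text: str) -> float:
    """Single-character entropy in bits."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def h2(text: str) -> float:
    """Conditional entropy H(next char | current char) in bits."""
    pairs = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    firsts = Counter(text[:-1])            # counts of the first member of each pair
    total = len(text) - 1
    return -sum(
        n / total * math.log2(n / firsts[a])
        for (a, _b), n in pairs.items()
    )


def strip_spaces(text: str) -> str:
    """Remove all whitespace so word endings meet word beginnings."""
    return "".join(text.split())


# Toy input: a made-up string of EVA-like words, not a real transcription.
sample = "daiin chedy qokeedy shedy daiin chol qokain chedy"
for label, t in (("with spaces", sample), ("no spaces", strip_spaces(sample))):
    print(f"{label}: h1={h1(t):.3f} h2={h2(t):.3f}")
```

A string this short says nothing about the manuscript; the same functions would need to be run on full-length texts to give meaningful numbers.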

Results:

[attachment=6477]

The big red cloud is the comparison corpus of medieval texts in various languages. Purple is the same corpus with spaces removed. As predicted, there is a general shift towards the top-right. The h2 of all texts increases by 3% (some Latin texts) to 16% (an English text). The median increase is 9%.

The green dots are various sections of the VM in EVA. Their h1 does not change much, but h2 does increase, as predicted. The increase for EVA texts is around 10%, close to the median. In other words, taking spaces out of the equation does not help EVA at all to catch up with other texts, since the h2 of those other texts increases just as much (if not more).

Finally, the blue dot is a slightly modified Herbal A, which tries to mitigate EVA's most entropy-reducing characteristics. Benched gallows were unstacked, then benches replaced by novel glyphs. "Ain" and "aiin" were also replaced. Here, h1 does increase like in normal texts. Moreover, h2 goes up by 15%, which would be the biggest gain if it weren't for one English text.

Here is the same graph again, but only with texts that had their spaces removed:

[attachment=6478]

Conclusion: accounting for some of the more obvious effects of EVA and eliminating spaces as a variable is not enough to fix Voynichese's entropy problem.
I didn't understand the yellow dot, what is it?
(03-05-2022, 02:17 PM)Ruby Novacna Wrote: I didn't understand the yellow dot, what is it?

It's the "no space" version of the blue dot. This is Herbal A, but in modified EVA. Benches (ch, sh) are rewritten as single characters. Ain and Aiin are also rewritten as single characters.
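
A rough sketch of that kind of rewrite, for anyone who wants to reproduce it. The replacement symbols (M, N, C, S) are placeholders I picked for illustration, not the ones actually used, and the unstacking of benched gallows is left out because the mapping is not given in this thread.

```python
# Hypothetical replacement glyphs: any symbols not otherwise used in the
# transcription would do. These are NOT the ones used for the graph.
REPLACEMENTS = [
    ("aiin", "M"),
    ("ain", "N"),
    ("ch", "C"),
    ("sh", "S"),
]


def modify_eva(text: str) -> str:
    """Rewrite benches and ain/aiin as single characters."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Toy input: made-up EVA-like words, not a real transcription.
print(modify_eva("daiin chedy shol qokain"))  # prints "dM Cedy Sol qokN"
```

If spaces are removed first, a sequence like "ch" could in principle appear across a former word boundary and get replaced too, so the order of the two steps may matter.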
But where is Q13 & Co?
I had all data from before apart from the modified EVA, so I quickly selected one section. If I were at my computer, I could prepare the files, run the test, enter the data to include the others as well, but they would certainly form a cloud like the black dots. Q13 is always much lower than the rest - without having tested it, I can almost guarantee that it would be lower than the yellow dot. The yellow dot is like giving it the best odds, without going crazy with replacements.
(03-05-2022, 02:56 PM)Koen G Wrote: without going crazy with replacements.
Indeed, I am familiar with Excel, where you can filter the data easily and quickly search and replace too.
Did you ever come to a conclusion why the Voynich increase in h2 is bigger than most texts? Or could it simply be that the effect would be larger for texts with a small starting h2?
Generally I think removing spaces is a good idea for entropy statistics as it also removes a lot of potential parsing errors (what's a space, half-space and so on).

But do I read this graph correctly that in EVA h1 slightly decreases when removing spaces?
The black dots are minimally skewed to the left compared to the green ones.

The modified Herbal A, in contrast, behaves very similarly to medieval (and likely also contemporary) texts when removing spaces. What exactly causes this difference from original EVA?
Clearly unstacking glyph combinations makes VM text behave more like traditional ones in this regard.

But can we infer from this that spaces in the VM indeed function like they do in ordinary texts? This would indicate they are not nulls and glyph combinations across spaces are different from glyph combinations confined by spaces. I would expect that; has this been investigated?

What we can say is that spaces have the same influence on entropy in all EVA sections you analyzed (green/black dots) as the dots are transposed almost exactly the same way. That's an indication they also mean the same thing in those texts.

The bottom line remains in any case - h2 remains extremely low, be it due to our insufficient parsing or to the underlying plaintext.
(03-05-2022, 05:22 PM)Emma May Smith Wrote: Did you ever come to a conclusion why the Voynich increase in h2 is bigger than most texts? Or could it simply be that the effect would be larger for texts with a small starting h2?

I think so. It's not an absolute rule, but the general tendency in the corpus is that Latin texts have little to gain by removing spaces (both in absolute terms and as a percentage). A language like Italian sits somewhere in between, and languages like German and English gain a lot.

Imagine a text where conditional entropy has been maximized by shuffling everything, including spaces. In this text, removing spaces will have a small effect on h2. The VM almost sits at the other end. Some very predictable pairs are removed if you take out spaces. Granted, these will be replaced by other predictable pairs in places where common final and common initial glyphs meet. But there will be more variation overall.
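
The thought experiment can be tried out roughly as follows. This is only a sketch: h2 is assumed to be the conditional entropy of a character given the previous one, the input is a made-up repetitive word list rather than a real transcription, and on a sample this small the numbers are noisy.

```python
import math
import random
from collections import Counter


def h2(text: str) -> float:
    """Conditional entropy H(next char | current char) in bits."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = len(text) - 1
    return -sum(
        n / total * math.log2(n / firsts[a])
        for (a, _b), n in pairs.items()
    )


def shuffle_chars(text: str, seed: int = 0) -> str:
    """Shuffle every character, spaces included (fixed seed for repeatability)."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


def space_gain(text: str) -> float:
    """Percentage change in h2 when spaces are removed."""
    before = h2(text)
    after = h2(text.replace(" ", ""))
    return 100 * (after - before) / before


# Toy input: made-up, highly repetitive EVA-like words, not a real transcription.
ordered = "daiin chedy qokeedy shedy daiin chol qokain chedy " * 20
scrambled = shuffle_chars(ordered)

print(f"ordered:   h2={h2(ordered):.3f}, gain without spaces {space_gain(ordered):+.1f}%")
print(f"scrambled: h2={h2(scrambled):.3f}, gain without spaces {space_gain(scrambled):+.1f}%")
```

Comparing the two printed gains on real, full-length texts would show whether the effect of removing spaces really shrinks as the text approaches maximal conditional entropy.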

Note that in unmodified EVA, Voynichese's gains were just a bit above the mean. I think this is because of the aforementioned scenario, where omitting spaces creates novel common pairs, like [nc, nq, ns]... These in turn will keep h2 low.

In fact, if we look at the absolute increase in h2 (rather than a percentage), EVA's gains are below the mean.
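
To make the percentage-versus-absolute point concrete, here is a tiny worked example. The h2 values in it are invented purely for illustration and are not taken from the graph or the underlying data.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in bits, gain as a percentage of the start value)."""
    absolute = after - before
    percent = 100 * absolute / before
    return absolute, percent


# Invented values, NOT measurements: one text with a low starting h2
# and one with a higher starting h2.
low_start = gains(2.0, 2.2)
high_start = gains(3.0, 3.27)

for label, (absolute, percent) in (("low start", low_start), ("high start", high_start)):
    print(f"{label}: +{absolute:.2f} bits, +{percent:.1f}%")
```

With these made-up numbers, the text with the low starting h2 gains more as a percentage while gaining less in absolute terms, which is the pattern described for EVA.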
(03-05-2022, 05:49 PM)Bernd Wrote: Generally I think removing spaces is a good idea for entropy statistics as it also removes a lot of potential parsing errors (what's a space, half-space and so on).
But can we infer from this that spaces in the VM indeed function like they do in ordinary texts? This would indicate they are not nulls and glyph combinations across spaces are different from glyph combinations confined by spaces. I would expect that; has this been investigated?
Spaces are needed if the same glyph can be either a letter on its own or the start of a bigram or trigram that forms a letter. They would be a way to distinguish between the two cases. Spaces would then be separators between letters (or groups of letters), and by extension glyph combinations would only occur between spaces.