(03-05-2022, 05:49 PM)Bernd Wrote: But do I read this graph correctly that in EVA h1 slightly decreases when removing spaces?
I had not calculated the h1 difference yet, but it is a bit all over the place, and usually very small. Latin texts also shift only slightly in h1 when spaces are removed. The change in h1 ranges from a 1% decrease to a 5% increase, with a median of just a 1% increase upon removing spaces.
I find this hard to grasp intuitively, but it must have something to do with the balance between glyph frequencies, and to what extent the removal of spaces upsets this balance. Maybe someone else can give a better explanation for this.
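For anyone who wants to try the h1 comparison themselves, here is a minimal sketch (the sample string is a placeholder, not one of the texts behind the spreadsheet):

```python
from collections import Counter
from math import log2

def h1(text):
    """Unigram (first-order) entropy in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

sample = "daiin chedy qokeedy daiin shedy"  # placeholder text
print(h1(sample))                   # h1 with spaces
print(h1(sample.replace(" ", "")))  # h1 after removing spaces
```

Whether h1 rises or falls depends on how removing the (very frequent) space symbol rebalances the remaining glyph frequencies, which is exactly the balance discussed above.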
I don't think we can use this data to determine whether something can be a language or not. After all, we have no control group. Maybe all kinds of different data sets behave this way when spaces are removed.
Here is a copy of the spreadsheet for those who are interested in the numbers. I made this two years ago without sharing in mind (and over many iterations), so it's a bit messy, but you should be able to figure it out: [link]
If h1 measures the median, or frequency of appearance, then removing the spaces will only remove the far-right data point (I assume that the space will be by far the most frequent character in all texts).
(The median is the middle number in a sorted, ascending or descending, list of numbers. )
So what we seem to be looking at is that spaces in the VM corpus function in a similar fashion to those in vernacular texts, i.e., there are lots of them. Removing them forces every text up the graph by the same relative amount.
BTW, did you see this article on the entropy of word ordering across linguistic families? Can you do something similar with the VM and see if you can fit it into their results?
[link]
Thanks, Koen.
Though I understand the idea behind conditional entropy, I find it hard to intuit quite how a change to the text would affect the measure.
I appreciate that you have probably run quite a few experiments on the text, making alterations and measuring the results. Can I ask if you've split glyphs, and what the effect was? By "split", I mean something like: randomly replacing [o] with either of two new glyphs [o1] and [o2].
I'm curious because we know that combining glyphs which are in complementary distribution increases h2. So surely creating glyphs which are in contrastive distribution should also increase h2? Or have I misunderstood something?
(03-05-2022, 11:11 PM)Emma May Smith Wrote: By "split", I mean something like: randomly replacing [o] with either of two new glyphs [o1] and [o2].
Especially the y, divided into single, initial and final?
When I was corresponding with Marco about this, I remember we agreed how difficult it is to predict entropy behavior. There are just so many variables.
Take your example. Imagine a situation where "o" can be followed by a wide range of glyphs. Now, after your proposed transformation, it can only be followed by two glyphs. (Although now those two glyphs can be followed by everything that could originally follow "o", so you get a bunch of new combinations there).
So it's really hard to predict, but I would guess that in the majority of cases you'd be correct: splitting a glyph will normally increase the number of possible combinations and hence increase h2. I'll try to test this tomorrow, although I'm not sure yet how to replace half of all o's randomly. Humans are bad at random so it would have to be automated somehow. It would be the random (unconditional) selection that increases conditional entropy.
In a way, "splitting glyphs" is a method VM translators may use to add entropy back into the mix, for example by choosing whether to translate a glyph as "t" or "d". If this selection happens at the whim of the translator, it is not conditioned by surrounding glyphs so it will increase h2.
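The prediction can be checked directly. A sketch of such a test (the sample text, the 50/50 split rule and the stand-in glyph "Q" are my assumptions, not the actual experimental setup):

```python
import random
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy of a frequency table, in bits."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def h2(text):
    """Conditional entropy of a character given the previous one:
    H(bigrams) minus H(unigrams over the first character of each bigram)."""
    return entropy(Counter(zip(text, text[1:]))) - entropy(Counter(text[:-1]))

random.seed(42)  # reproducible "coin flips"
text = "okeody otol okedy oteody okaiin otaiin" * 50  # placeholder text
# Split glyph: each "o" becomes either "o" or the new glyph "Q" at random.
split = "".join(random.choice("oQ") if ch == "o" else ch for ch in text)

print(h2(text), h2(split))  # h2 typically rises after the split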
Edit: I must add that it is the same mechanism that eventually leads to one-way cipher solutions, because it would imply that a lot of information was lost upon encryption.
Hi Koen, I guess that's my suggestion: h2 could be low because the script faultily records distinctions. A reader familiar with the language wouldn't have a problem restoring the correct reading from their own knowledge. A bit like English "th", which is restored to /th/ or /dh/ by recognizing the whole word.
Quote:I'll try to test this tomorrow, although I'm not sure yet how to replace half of all o's randomly. Humans are bad at random so it would have to be automated somehow.
For each [o]:
1. Generate a random integer.
2. Check the remainder of that number divided by 2.
3. If the remainder is 1, change the [o]. Otherwise leave it
A bit of searching suggests that Python can generate a random number with a bit size of 1, so effectively a random boolean.
(Unless I've misunderstood what you were asking.)
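The three steps above translate almost directly into Python; `random.getrandbits(1)` is the one-bit random number mentioned, so steps 1 and 2 collapse into a single draw (the glyph names are placeholders):

```python
import random

def split_glyph(text, glyph="o", replacements=("o1", "o2")):
    """Replace each occurrence of `glyph` with one of two new glyphs,
    chosen by a random bit."""
    out = []
    for ch in text:
        if ch == glyph:
            out.append(replacements[random.getrandbits(1)])  # 0 or 1
        else:
            out.append(ch)
    return "".join(out)

random.seed(0)  # make the run repeatable
print(split_glyph("okeody otol"))
```

Running on a real transliteration would just mean reading the file first and writing the transformed text back out.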
(04-05-2022, 01:19 AM)Emma May Smith Wrote: Hi Koen, I guess that's my suggestion: h2 could be low because the script faultily records distinctions. A reader familiar with the language wouldn't have a problem restoring the correct reading from their own knowledge. A bit like English "th", which is restored to /th/ or /dh/ by recognizing the whole word.
English is of course rather nasty when it comes to representing single phonemes with bigrams.
sh, ch, ph and th are all quite frequent, yet look at words like methane / pothole and bishop / mishap, where the same letter pairs do not represent single phonemes at all.
The human mind is quite a capable instrument.
With respect to the question about randomly replacing o's, the bitrans tool I posted at my web site can do this for you.
As David said, I think what we see in the graphs is that spaces divide Voynichese into distinct vords with the familiar prefixes and suffixes; clearly they are not randomly distributed. So it is no wonder entropy changes similarly to other texts when spaces are removed.
We also see that a few differences in parsing/unstacking do have a noticeable impact on entropy.
Koen, do any of the Latin texts you analyzed contain scribal abbreviations, e.g. symbols for -us, -um, -bus, et?
It would be interesting to use such abbreviated text as control group to see how combining/unstacking glyphs in a known text changes entropy. Should be easily doable with a python script.
Even including abbreviations, ambiguities and scribal errors, I do not see how adaptations in parsing could transform Voynichese into a readable natural language. With its strict rules, repetitions and highly similar vords, it just does not work like this.
(04-05-2022, 12:22 PM)Bernd Wrote: Koen, do any of the Latin texts you analyzed contain scribal abbreviations, e.g. symbols for -us, -um, -bus, et?
No, all texts were normalized. Generally speaking, abbreviation symbols will take a text's entropy statistics further away from the VM, so this is not something I am concerned about.
One thing I can try is take a normalized Latin text and introduce abbreviation symbols by replacing certain letter groups with numerals.
1 = con, com, cun, cum
2 = tur, ur
3 = us, os
4 = ris, tis, cis
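That substitution can be sketched with a simple regex pass (the sample words below are made up for illustration, not taken from the actual corpus):

```python
import re

# The letter groups from the list above, mapped to their numeral symbols.
abbrev = {
    "con": "1", "com": "1", "cun": "1", "cum": "1",
    "tur": "2", "ur": "2",
    "us": "3", "os": "3",
    "ris": "4", "tis": "4", "cis": "4",
}
# Sort longest-first so longer groups win wherever they overlap with shorter ones.
pattern = re.compile("|".join(sorted(abbrev, key=len, reverse=True)))

def abbreviate(text):
    """Replace each listed letter group with its abbreviation symbol."""
    return pattern.sub(lambda m: abbrev[m.group()], text)

print(abbreviate("dominus conturbat"))  # -> "domin3 12bat"
```

The reverse mapping is one-to-many, which is exactly the information loss described below: a "4" alone cannot tell you whether to read ris, tis or cis.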
Doing this removes some information from the text, because when we now see "4", we must guess from context whether it represents ris, tis or cis. Therefore, we could hypothesize that some entropy statistics would be reduced. However, they all increase:
h0: 4.64 -> 4.86
h1: 4.01 -> 4.15
h2: 3.31 -> 3.38
It was to be expected that h1 would increase, since we introduce several new, frequent symbols.
H2 increases as well, probably in part because the non-abbreviated parts of the Latin text still behave normally. Moreover, abbreviation condenses the text, which is also likely to increase h2.