The Voynich Ninja

Full Version: Spaces and Entropy
(04-05-2022, 01:32 PM)Koen G Wrote:
(04-05-2022, 12:22 PM)Bernd Wrote: Koen, do any of the Latin texts you analyzed contain scribal abbreviations, e.g. symbols for -us, -um, -bus, et?
One thing I can try is take a normalized Latin text and introduce abbreviation symbols by replacing certain letter groups with numerals.

1 = con, com, cun, cum
2 = tur, ur
3 = us, os
4 = ris, tis, cis

Doing this will remove some information from the text, because when we now see "4", we must guess from context whether it represents ris, cis or tis. Therefore, we might hypothesize that some entropy statistic will be reduced. However, all of them increase.

h0: 4.64 -> 4.86
h1: 4.01 -> 4.15
h2: 3.31 -> 3.38

It was to be expected that h1 would increase, since we introduce several new, frequent symbols.
h2 increases as well, probably in part because the non-abbreviated parts of the Latin text still behave normally. Moreover, abbreviation condenses the text, which is also likely to increase h2.
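For anyone who wants to reproduce this, here is a minimal Python sketch of the experiment. The input file name is an assumption, and the entropy helpers just follow the usual definitions (h0 = log2 of alphabet size, h1 = unigram entropy, h2 = conditional entropy of a symbol given its predecessor):

Code:
import math
from collections import Counter

def h0(text):
    # alphabet-size entropy: log2 of the number of distinct symbols
    return math.log2(len(set(text)))

def h1(text):
    # first-order entropy in bits per symbol
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def h2(text):
    # conditional entropy of a symbol given its predecessor:
    # H(X2 | X1) = H(X1, X2) - H(X1)
    n = len(text) - 1
    joint = Counter(zip(text, text[1:]))
    h_joint = -sum(c / n * math.log2(c / n) for c in joint.values())
    return h_joint - h1(text[:-1])

# The four abbreviation groups from the post; longer strings go first
# so that e.g. "tur" is consumed before "ur".
SUBSTITUTIONS = [("con", "1"), ("com", "1"), ("cun", "1"), ("cum", "1"),
                 ("tur", "2"), ("ris", "4"), ("tis", "4"), ("cis", "4"),
                 ("ur", "2"), ("us", "3"), ("os", "3")]

def abbreviate(text):
    for group, symbol in SUBSTITUTIONS:
        text = text.replace(group, symbol)
    return text

latin = open("latin_normalized.txt").read()  # assumed input file
for label, t in [("plain      ", latin), ("abbreviated", abbreviate(latin))]:
    print(label, round(h0(t), 2), round(h1(t), 2), round(h2(t), 2))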

Hi, Koen:

Thanks for doing this additional investigation. Your results definitely confirm the conclusions provided by Lindemann and Bowern that Marco discussed in the parallel entropy thread.

I think at this point there is quite strong support for the conclusion that hypothesized abbreviation in the text is not going to help "normalize" the very low conditional entropy seen in Voynichese, particularly if such abbreviation parallels the scribal abbreviations used in medieval Latin. Because that was overwhelmingly the most popular kind of abbreviation at the time, it would, in my opinion, be the most likely approach if abbreviation were used at all.

Further, I am having trouble imagining what kind of manipulation of an underlying natural language could both be termed "abbreviation" and have the desired effect. Of course, this doesn't eliminate the possibility of abbreviation -- it just eliminates it as a central cause of the issue we are trying to understand, namely the low conditional entropy.

Thanks again,

Michelle
Thanks, Michelle, I think you are correct.

If anyone can think of an abbreviation that would decrease entropy (without entirely ruining the text), I would be happy to test it, but it will be quite the challenge!
I'm not sure how to put it differently:

Abbreviations increase entropy.
Verbosity decreases entropy.

It is really just a simple matter of information content divided by text length.
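To make that concrete with made-up numbers: suppose a passage carries roughly 400 bits of information. Spelled out in full over 100 characters, that is 400/100 = 4.0 bits per character. Abbreviate it down to 80 characters without losing meaning, and the same 400 bits now give 400/80 = 5.0 bits per character; pad it verbosely out to 160 characters and it drops to 400/160 = 2.5 bits per character.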
(04-05-2022, 06:00 PM)ReneZ Wrote: It is really just a simple matter of information content divided by text length.

In practice, abbreviation in manuscripts often destroys information though (the same symbol replacing various strings). If this is driven to the extreme in an experimental situation, you might reach a point where your h2 will drop. For example, replace everything that could be abbreviated by medieval scribes with "1". I predict that h2 will follow a bell curve as you replace more strings. But this comes at a significant cost in information, and the text may have become impossible to read. And the result will still look nothing like Voynichese.
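The bell-curve prediction could be tested with a sketch like this one. The list of abbreviatable groups is illustrative, not historically complete, and the input file name is an assumption:

Code:
import math
from collections import Counter

def h2(t):
    # conditional entropy H(next symbol | current symbol), in bits
    n = len(t) - 1
    joint = Counter(zip(t, t[1:]))
    prev = Counter(t[:-1])
    hj = -sum(c / n * math.log2(c / n) for c in joint.values())
    hp = -sum(c / n * math.log2(c / n) for c in prev.values())
    return hj - hp

# Each group is collapsed onto the same symbol "1", one more per step;
# if h2 really follows a bell curve, it should rise and then fall.
GROUPS = ["con", "com", "cum", "tur", "rum", "bus", "ris", "tis",
          "que", "us", "os", "ur", "er", "is", "et"]

text = open("latin_normalized.txt").read()  # assumed input file
for i, g in enumerate(GROUPS, 1):
    text = text.replace(g, "1")
    print(f"after {i:2d} groups replaced: h2 = {h2(text):.3f}")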
(04-05-2022, 01:32 PM)Koen G Wrote: Generally speaking, abbreviation symbols will take a text's entropy statistics further away from the VM, so this is not something I am concerned about.

[...]
Thank you, that's to be expected.
Finding strong arguments against VM characters acting as Latin abbreviations is also important, since several symbols look like such abbreviations and are placed accordingly.

To sum up: we have a ciphertext with low h1 and extremely low h2 that stands out from contemporary manuscripts, yet whose spaces seem to behave similarly to those of ordinary texts. That's why I like the Roman numeral hypothesis, though it probably is not that simple.

I expect the VM plaintext, if it exists, to either be highly verbose or not ordinary text at all.
(04-05-2022, 06:00 PM)ReneZ Wrote: I'm not sure how to put it differently:

Abbreviations increase entropy.
Verbosity decreases entropy.

It is really just a simple matter of information content divided by text length.

Statistics is not my thing, so I am reluctant to add my ideas; however, I believe they might point all of you in a new direction. I don't know about other languages, but Slovenian was first taught by pronouncing syllables. In any old grammar book, there was a table of syllables, which could include up to five letters. The table of syllables shown here is from the first Slovenian printed book, from 1550.
[attachment=6487]
After mastering syllables, a text was written by dividing the words into syllables, as shown in the Our Father prayer. 
[attachment=6488]
By the end of the 17th century, most words included proper vowels, but the words were still short, since there were not that many complicated compound words.
[attachment=6489]
Substituting full vowels for semi-vowels came into practice in the first Slovenian books; in dialectal speech, however, the semi-vowels were not pronounced, and a text written down that way would look as if it were written in a semi-abjad.
Assuming that the author of the VM wrote down the words as he heard them, he would omit a lot of vowels.
On the other hand, there was a practice in Slavic writing, before the time of the printing press, of joining several short unstressed words into so-called word blocks. This could increase the length of a word but decrease the proportion of vowels.
In Slovenian there is also the matter of how words are built by adding syllables to form new words. In such cases, a long word could be divided into three different words, each having a separate meaning.

I don't know if any other language has similar characteristics. But these are a few ideas to consider.
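The vowel-dropping idea is easy to test on the entropy side. A crude sketch, deleting every vowel (a real dialect speaker would only drop the unstressed semi-vowels, so this overstates the effect; the file name is an assumption):

Code:
import math, re
from collections import Counter

def h1(t):
    n = len(t)
    return -sum(c / n * math.log2(c / n) for c in Counter(t).values())

def h2(t):
    # conditional entropy H(next char | current char)
    n = len(t) - 1
    joint = Counter(zip(t, t[1:]))
    hj = -sum(c / n * math.log2(c / n) for c in joint.values())
    return hj - h1(t[:-1])

text = open("slovenian_sample.txt").read().lower()  # assumed input file
abjad = re.sub(r"[aeiou]", "", text)  # crude semi-abjad: drop all vowels
for label, t in [("full text ", text), ("semi-abjad", abjad)]:
    print(label, "h1 =", round(h1(t), 2), "h2 =", round(h2(t), 2))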
(04-05-2022, 06:26 PM)Koen G Wrote:
(04-05-2022, 06:00 PM)ReneZ Wrote: It is really just a simple matter of information content divided by text length.

In practice, abbreviation in manuscripts often destroys information though (the same symbol replacing various strings).

Well yes, if you remove or reduce information, that will again reduce entropy. If con is replaced by 9 and cum is also replaced by 9, you can no longer see which of the two it originally was after the replacement. Information is lost.

When someone applies a homophonic cipher, such as the typical diplomatic ciphers of the Renaissance (and let's ignore the role of nulls, doubles and nomenclator words), the text length is not changed, but a lot of information is added. Even though it is meaningless information, the entropy will increase.
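A minimal sketch of that effect, with an invented homophone table (the symbols and the input file name are assumptions, not taken from any historical cipher). Each homophone is a single character, so the text length stays the same while h1 goes up:

Code:
import math, random
from collections import Counter

def h1(t):
    n = len(t)
    return -sum(c / n * math.log2(c / n) for c in Counter(t).values())

# Frequent plaintext letters get several interchangeable cipher symbols,
# chosen at random at each occurrence; everything else passes through.
TABLE = {
    "e": ["1", "2", "3", "4"],
    "a": ["5", "6", "7"],
    "i": ["8", "9"],
    "t": ["0", "#"],
}

def encipher(text):
    return "".join(random.choice(TABLE[ch]) if ch in TABLE else ch
                   for ch in text)

plain = open("latin_normalized.txt").read()  # assumed input file
cipher = encipher(plain)
print("h1 plain :", round(h1(plain), 2))
print("h1 cipher:", round(h1(cipher), 2))  # higher, despite equal length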
I did some experiments with the text. One of them seems to me to have been extremely successful, although I only achieved this by removing spaces; the difference between the same experiment with and without spaces was enormous. As usual, I used the basic counts of letters, words and bigrams, without calculating the h indices, but the result seems good. I'll think a little more about whether it can be further improved or changed by trying to discover certain rules, and I will share this later. Unfortunately, the problem of strings like "shol shot shol shol daiin dain", "cheol dol cthey ykol dol dolo ykol dol" or "otaiin otair otair okeedy taiin aiin s aiin" isn't solved, even though the general result is better. Maybe such strings really are formed with numbers. Who knows.
Brought in here from another thread:

(20-05-2022, 09:34 AM)Searcher Wrote: But what if we apply the transformations first, and then remove the spaces?


This is something I was planning to look into back when I was writing about entropy, but for some reason I didn't fully pursue this line of investigation. I did do a little bit of it in the opening post of this thread, but didn't take it all the way yet. This approach may have some potential.

To run a quick test, I referred back to my original analysis, where apparently I had come up with a series of transformations for Herbal A: [qok, chol, chor, che, chy, ol, cho, or, qot, ar, eey, al, qo]. Note that this is after eliminating the possible EVA effects [ch, sh, ain, aiin].

With spaces preserved, I got the following entropy values for Herbal A, which I still considered a bit too low:

h1 = 4.20
h2 = 2.94

Now, if we assume the VM somehow messed with spaces as well, we can just take spaces out of the equation. So if I remove spaces from the transformed file and compare it to a no-space corpus, this is what we get:

[attachment=6551]

The closest dot to the red one is a Slavic text with the following values: h0 = 5.044, h1 = 4.556, h2 = 3.634.
The values for the transformed VM file are: h0 = 5.209, h1 = 4.571, h2 = 3.650.

Note that at this point I have been messing a lot with the file and all of this should be approached with caution. What I would say is the following:
If we assume a verbose cipher and omit spaces as a variable, it is possible to reach normal entropy values for Voynichese. Whether this leads to a viable "text" is a whole different question.

Also, so far I have only been performing transformations based on numerical data. A real approach would ideally be more systematic (for example, [o] is a null, [o] modifies the preceding glyph, things like that).
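For what it's worth, the pipeline above can be approximated in a few lines of Python. The replacement symbols, the longest-first ordering, and the file name are illustrative choices on my part, not necessarily the exact procedure:

Code:
import math
from collections import Counter

def entropies(t):
    # returns (h0, h1, h2) in bits per character
    n = len(t)
    uni = Counter(t)
    h1 = -sum(c / n * math.log2(c / n) for c in uni.values())
    m = n - 1
    joint = Counter(zip(t, t[1:]))
    prev = Counter(t[:-1])
    hj = -sum(c / m * math.log2(c / m) for c in joint.values())
    hp = -sum(c / m * math.log2(c / m) for c in prev.values())
    return math.log2(len(uni)), h1, hj - hp

# EVA effects plus the verbose-cipher groups from the post. Sorting
# longest-first makes sure e.g. "chol" is consumed before "cho" and "ol".
# Uppercase replacement symbols are arbitrary placeholders.
EVA_EFFECTS = ["aiin", "ain", "ch", "sh"]
GROUPS = ["qok", "chol", "chor", "che", "chy", "ol", "cho", "or",
          "qot", "ar", "eey", "al", "qo"]

def transform(text):
    symbols = iter("ABCDEFGHIJKLMNOPQ")
    for group in sorted(EVA_EFFECTS + GROUPS, key=len, reverse=True):
        text = text.replace(group, next(symbols))
    return text

eva = open("herbal_a_eva.txt").read()  # assumed transcription file
t = transform(eva)
for label, s in [("with spaces   ", t), ("spaces removed", t.replace(" ", ""))]:
    h0_, h1_, h2_ = entropies(s)
    print(label, round(h0_, 2), round(h1_, 2), round(h2_, 2))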
Hm, that's quite interesting...
What intrigues me more than the entropy increase by substitution is the increase of h2 from removing spaces.
If I read the graph correctly, that's a gain of over 20%, which is massive and much more than in any other document in the text file you shared on the previous page.

I think if we assume a highly verbose cipher, we must either treat spaces as nulls to overcome the shortness of vords, or alternatively they may separate syllables / word parts.

Does your transformation destroy the peculiar vord structure with glyphs favoring the beginning / middle / end - or to put it differently - is there a significant difference between glyph combinations within the transformed vords and across spaces?

If we assume spaces are randomly inserted (with a minimum and maximum vord length), there should be no difference across these spaces. But I would assume that even after transformation such differences would remain, although weaker than in traditional parsings like EVA.
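Something like the following quick sketch could test this (the file name is an assumption; it compares glyph pairs inside vords with pairs straddling a single space -- if spaces were random, the two distributions should look alike):

Code:
from collections import Counter

text = open("herbal_a_transformed.txt").read()  # assumed input file

# pairs of adjacent non-space glyphs are, by definition, within a vord
within = Counter((a, b) for a, b in zip(text, text[1:])
                 if a != " " and b != " ")
# pairs separated by exactly one space straddle a vord boundary
across = Counter((a, c) for a, b, c in zip(text, text[1:], text[2:])
                 if b == " " and a != " " and c != " ")

print("top within-vord pairs:", within.most_common(5))
print("top cross-space pairs:", across.most_common(5))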