The Voynich Ninja

Full Version: How to recombine glyphs to increase character entropy?
As you may or may not know, the character entropy of Voynichese is extremely low. This is caused by a number of phenomena, not least the fact that characters appear in predictable places. For example, in EVA, if you have [i], it is likely followed by [i] or [n] and preceded by [i] or [a].

Rene posted the following in Marco's thread today:

Quote:- how to recombine glyphs in order to increase entropy? I did some very initial experiments with this that were quite promising, but the number of possible permutations is a big problem here.

This is something I was also exploring last year.

An intuitive example is the following: [aiin] is an extremely common unit in EVA, where it is written as a series of characters. This might artificially deflate character entropy, so what if we replace [aiin] with a single character?
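
To make this concrete, here is a minimal sketch in Python of the measures involved (my own illustration, not anyone's actual tooling from this thread): h1 is character entropy, h2 the conditional entropy of a character given its predecessor, computed as bigram entropy minus character entropy. The sample string and the replacement character 'A' are invented for the example.

Code:
from collections import Counter
from math import log2

def entropy(counts):
    # Shannon entropy in bits of a frequency distribution
    total = sum(counts.values())
    return -sum(v / total * log2(v / total) for v in counts.values())

def h1(text):
    # character entropy
    return entropy(Counter(text))

def h2(text):
    # conditional entropy = bigram entropy - character entropy
    return entropy(Counter(text[i:i + 2] for i in range(len(text) - 1))) - h1(text)

text = "daiin okaiin qokeedy chol daiin shey qokaiin dal"  # toy EVA-like sample
collapsed = text.replace("aiin", "A")  # 'A' stands in for a new single glyph

for label, t in (("original", text), ("collapsed", collapsed)):
    print(f"{label}: h1={h1(t):.3f} h2={h2(t):.3f} h2/h1={h2(t)/h1(t):.3f}")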

(Note: I take a humble stance when it comes to statistics and will be happily corrected by those more skilled in the subject.)

What I learned was the following:
  • I preferred the ratio h2/h1 because it gives the text's realized h2 as a percentage of its highest possible h2. Very much simplified: a text with h2/h1 = 0.5 reaches 50% of its maximal h2. You cannot compare h2 outright because h1 has to be taken into account as well (at least, this is what I took from Anton's teachings, and it seemed to work out in practice).
  • My strategy was basically to introduce new characters to replace common letter groups. This did raise h2, but also h1, since I was inflating the alphabet.
  • Cumulative changes interact in an unpredictable way. The first couple of changes you make might actually reinforce each other, with all changes combined resulting in a greater h2/h1 increase than the sum of their separate effects. After a while, however, the inflated alphabet takes its toll and I was unable to further raise h2/h1. So in the beginning you might get great results but after a few changes there are already diminishing returns, until you reach a critical point where little to no increase is possible.
These were my findings about replacing glyph groups by new characters:
- Replacing [qo] by something else has little effect. I think this is because the new character is still extremely positionally dependent (it always follows space).
- I was unable to do anything useful with [ee] either.
- Replacing clusters like [aiin], [aiir]... has the highest effect I've found. I-clusters in vanilla EVA transcriptions are a huge burden on character entropy.
- Collapsing benches into one new character each has a great effect. Collapsing benched gallows does not, but combined with collapsing normal benches, the effect is great.
- Collapsing the bigrams [ar, or, al, ol, am] (one of Nick's suggestions) has a decent effect as well.
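
For what it's worth, the cumulative version can be scored step by step like this (a sketch reusing the h1/h2 helpers from the block above; the single-character substitutes and the file name "eva.txt" are placeholders I made up). Diminishing returns show up as ever smaller gains per step.

Code:
# reuses h1/h2 from the earlier sketch; substitutes and "eva.txt" are placeholders
steps = [("ch", "C"), ("sh", "S"),             # benches
         ("aiin", "A"), ("aiir", "R"),         # i-clusters
         ("ar", "1"), ("or", "2"), ("al", "3"), ("ol", "4"), ("am", "5")]

t = open("eva.txt").read()
for src, dst in steps:
    t = t.replace(src, dst)
    print(f"after {src} -> {dst}: h2/h1 = {h2(t) / h1(t):.3f}")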

Combining the three most effective measures, I was able to raise h2/h1 to a barely acceptable level. However, h2 alone was still much too low.

Removing spaces has an extremely beneficial effect, but I felt uncertain about how "legal" this step was. It is a change of a different order than combining some of EVA's strokes.

Since writing this post I have not had any better ideas... So Rene I wonder what your own experience is and if other members might have ideas.
Never mind. The forum software nuked my long message.

I have to get back to work. I don't have time to write it all again. Here is the condensed version.


...
I've always felt that treating the spaces as nonliteral or possibly as syllable breaks was the quickest and easiest way to change a number of the statistics (including entropy) that are generated by computational attacks.

Since there were some researchers who doubted whether syllable breaks even occurred in medieval text, I collected numerous examples where words have been broken across syllable boundaries. It happens. Not often, but often enough that the concept obviously existed in the Middle Ages. This can especially be seen in maps.

Also, the alchemical manuscript that has ciphers sprinkled through numerous folios, the one Rene mentioned on the forum several years ago, has a page at the beginning where words have been broken across syllables. My feeling is that this was done in preparation for encipherment. Why? Because 1) the manuscript is full of ciphers, 2) many pieces of folios have been cut out (probably enciphered text, or plaintext), and 3) it's obviously not a page about grammar (it has lots of dates and name-dropping, the kind of thing that is more likely to be enciphered).

If the cutout chunks were enciphered text, they may have been sent as a message. If they were plaintext, they may have been the key to the message and were cut out to hide them.


Koen Wrote:Removing spaces has an extremely beneficial effect, but I felt uncertain about how "legal" this step was.


I have always been suspicious of the VMS spaces. I think if any kind of re-arrangement of the VMS text is "legal" this would be high on the list.
Something I did not study in detail yet is what exactly the difference is between Voynich text without spaces and "normal" text without spaces. If you eliminate EVA-related issues like -iin, then removing spaces tackles a lot of your remaining problems.

This approach is only "legal" though if you assume that spaces were used in an artificial way that decreases entropy.
(28-04-2020, 11:55 PM)Koen G Wrote: Something I did not study in detail yet is what exactly the difference is between Voynich text without spaces and "normal" text without spaces. If you eliminate EVA-related issues like -iin, then removing spaces tackles a lot of your remaining problems.

This approach is only "legal" though if you assume that spaces were used in an artificial way that decreases entropy.

Hi Koen,
I am glad to see you continue this interesting line of investigation!

I made a simple experiment with two language texts (Italian=Machiavelli and Latin=Pliny) and two Voynich sections (Herbal A and Bio/Q13), each using two different transliterations (Currier and EVA). I may have made errors, so I think it would help if someone could double-check.

_TEXT__ sp_c.h1 sp_b.h sp_cnd sp_cnd/h1 | nosp_c.h1 nosp_b.h nosp_cnd nosp_cnd/h1
_HA_Cur   3.787  6.319   2.531   0.668  |   3.911     6.607    2.696    0.689
Q13_Cur   3.725  5.735   2.010   0.540  |   3.752     5.862    2.109    0.562
_HA_EVA   3.790  5.980   2.190   0.578  |   3.815     6.063    2.248    0.589
Q13_EVA   3.805  5.672   1.867   0.491  |   3.805     5.740    1.936    0.509
machiav   3.935  7.034   3.098   0.787  |   3.982     7.298    3.316    0.833
_pliny_   4.013  7.379   3.367   0.839  |   4.003     7.506    3.504    0.875


The first four numbers in each row are for the original file, while the last four are for the no-space version.
cnd/h1 is the ratio between conditional entropy and character entropy: the measure you are referring to, I think.
Removing spaces always results in a considerable increase in entropy. In the case of Italian and Latin this could be due to two reasons:
1. there is a certain correlation between characters and spaces that results in frequent bigrams (e.g. Italian words often end with a vowel, Latin words often end with -s). By removing these frequent space-bigrams, you increase bigram entropy (b.h in the table).
2. removing spaces creates new bigrams that are very rare or absent in the original. For instance, 'ee' in Italian:
nuoveecomeebbeamicizieesoldati
nuove e come ebbe amicizie e soldati
This is again an increase in bigram entropy.

I believe that, if you go this way, using space-less language texts as a benchmark would make things "legal".
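
For anyone who wants to reproduce this, the comparison only needs a bigram-entropy function (again a sketch, reusing entropy() and Counter from the earlier block; "italian.txt" is a placeholder file name):

Code:
def bigram_h(t):
    # entropy of the bigram distribution (b.h in the table above)
    return entropy(Counter(t[i:i + 2] for i in range(len(t) - 1)))

t = open("italian.txt").read()
print("with spaces:", bigram_h(t))
print("no spaces:  ", bigram_h(t.replace(" ", "")))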
Thanks, Marco!
I did not respond earlier because I got a new PC and I had to get all the java stuff installed. 

For Herbal A in EVA, my conditional entropy (h2) goes from 2.1 to 2.3. You seem to get different values, where the difference between space and no-space is much smaller. I redownloaded the file (Takahashi), then pre-processed it by eliminating punctuation and line endings. Still, my results are similar...

Results for a few randomly selected texts of similar size:

[attachment=4263]
After discussing things and exchanging files with Koen via email, we might have understood the differences between our results. The main reason is that, when removing spaces, Koen also removed line separators (newlines), so that the text is transformed into a single very long word. In any text, removing spaces creates bigrams that typically do not otherwise occur. In an ordinary text, removing newlines produces the same bigrams as removing spaces. But the VMS has characters that are considerably more frequent at line start (e.g. f-, p-) or line end (e.g. -m, -g): when removing newlines you generate new bigrams like 'mf' and 'mp'.
Another difference is that I used the Zandbergen Landini transcription, while Koen used Takahashi's. But I don't think this really makes a difference.
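
Concretely, the two preprocessing variants differ only in how whitespace is stripped (a sketch; the file name is a placeholder):

Code:
raw = open("vms_eva.txt").read()

no_spaces = raw.replace(" ", "")        # spaces removed, line breaks kept
one_long_word = "".join(raw.split())    # spaces AND newlines removed;
# joining across line ends is what creates the new 'mf', 'mp' bigrams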

Here is an updated table, where I tried to more closely replicate Koen's way of processing the files:

            WITH_SPACES                        NO_SPACES/NEWLINES          DIFF
___FILE___ char.h bigr.h cond.h cond/char| char.h bigr.h cond.h cond/char| cond/char
VMS_TT_H.A 3.8178 5.9237 2.1059 0.5516   | 3.8107 6.1390 2.3283 0.6110   | 0.0594
VMS_TT_Q13 3.7949 5.6247 1.8298 0.4822   | 3.8026 5.7961 1.9935 0.5242   | 0.0420
Machiavell 3.9354 7.0337 3.0982 0.7873   | 3.9823 7.3056 3.3234 0.8345   | 0.0472
___Pliny__ 4.0127 7.3793 3.3666 0.8390   | 4.0025 7.5086 3.5061 0.8760   | 0.0370
After cleaning up my corpus to remove excess tabs, carriage returns, spaces, etc., I established the following target values. The lowest value is the non-VM text with the lowest conditional entropy (h2) or h2/h1 ratio. The median value obviously depends entirely on the corpus, but it gives a vague indication of an average. I included only Latin-alphabet texts because there are some issues with other alphabets I'd like to sort out first.

- With spaces:

VM h2: 1.78 - 2.23
Lowest h2: 2.82
Median h2: 3.15

VM h2/h1: 0.47 - 0.57
Lowest h2/h1: 0.68
Median h2/h1: 0.77

- Spaces removed:

VM h2: 1.97 - 2.38
Lowest h2: 3.02
Median h2: 3.42

VM h2/h1: 0.52 - 0.62
Lowest h2/h1: 0.72
Median h2/h1: 0.82


Apart from the target values, the main takeaway so far is that removing spaces generally increases entropy. More in-depth analyses follow.

Another thing I should add: when looking at the difference between h2 with and without spaces, the VM is in the first (lower) half of the ranking. This means that most texts actually gain more entropy than the VM does when removing spaces. My intuition would expect the opposite.

However, this is without modifications to EVA. I predict the VM will rise significantly in the "difference" column once EVA is modified.
(02-05-2020, 03:53 PM)Koen G Wrote: Another thing I should add: when looking at the difference between h2 with and without spaces, the VM is in the first (lower) half of the ranking. This means that most texts actually gain more entropy than the VM does when removing spaces. My intuition would expect the opposite.

My guess is that this could be due to last-first combinations across words.
For instance, in Voynichese -y words are often followed by q- words. If you remove spaces, you create a number of new 'yq' bigrams: many identical bigrams lower bigram entropy. I thought of this as an explanation for why Q13 has a lower h2/h1 increase than Herbal-A: it could be that last-first combinations are stronger in B-dialects than in A.

I guess that very few texts (if any) have a last-first correlation as strong as the VMS: in other texts, the joins created by removing spaces are less predictable, hence more entropy is generated.
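
These last-first pairs are easy to count directly (a sketch; the file name is a placeholder):

Code:
from collections import Counter

words = open("vms_eva.txt").read().split()
joins = Counter(a[-1] + b[0] for a, b in zip(words, words[1:]))
print(joins.most_common(10))  # how dominant are pairs like 'yq'?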
Ah yes, of course. I was focused on spaces as a predictable character, but did not yet consider that if you remove them, new predictable patterns emerge.
I slightly edited the values in the previous post after streamlining corpus formatting and encoding.