The Voynich Ninja

Full Version: How to recombine glyphs to increase character entropy?
(27-05-2020, 06:30 AM)Koen G Wrote: Very interesting, Marco. Do we know how OKO works? He seems to eliminate all vowel bigrams, but I'd like to see the actual substitutions. The link on Stolfi's site doesn't work for me. I wonder what its entropy values are and how those compare to Vietnamese.

The page is available via the WayBackMachine, but I haven't looked into the details yet.
I spent some time browsing through the WayBackMachine page, but it does not seem to detail how the OKO encoding works: the page seems to be more about a word-structure model (something earlier than the crust-mantle-core model).

From what Stolfi says in the page linked by Rene, OKO appears to be close to an abjad:
Stolfi Wrote: the EVA groups ch, sh, ee and the platform gallows are counted as single letters, and the symbols a o y e i are assumed to be part of the preceding letter.

The first part of the sentence is similar to how the Currier-D'Imperio, CUVA and other systems work. The second part suggests that (as in abjads) many of the most frequent glyphs ("sounds" in an abjad) are not encoded as independent symbols.
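Just to make that second part more concrete, this is the kind of regrouping I understand Stolfi to mean, sketched in Python (only my guess at the operation, not his actual OKO code; I write the platform gallows as ckh, cth, cph, cfh):

import re

# EVA groups counted as single letters in the quote above; the platform
# gallows are written here as ckh, cth, cph, cfh (my assumption)
UNITS = ["ckh", "cth", "cph", "cfh", "ch", "sh", "ee"]
ATTACH = set("aoyei")   # symbols assumed to be part of the preceding letter

def regroup(word):
    # split the word into the units above (longest first), then glue
    # every a/o/y/e/i onto whatever precedes it
    pieces = re.findall("|".join(UNITS) + "|.", word)
    out = []
    for p in pieces:
        if p in ATTACH and out:
            out[-1] += p
        else:
            out.append(p)
    return out

print(regroup("qokeedy"))   # -> ['qo', 'k', 'ee', 'dy']

Seen this way, a o y e i never appear as independent symbols, which is what makes it feel abjad-like to me.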

Also, I noticed that the histogram I attached in the previous post is based on word types, not tokens: daiin has the same weight as qopchody.

Overall, this is another area that I would like to be more familiar with: it seems clear that something interesting is happening. The relationship of this problem with the subject of this thread is not entirely clear: I think that tackling entropy increase on both text-with-spaces and text-without-spaces is sensible. Since increasing entropy likely implies producing shorter words (in the "with spaces" case), we should be aware that the resulting words will be even more markedly shorter than words in European languages. There might be other implications that escape me at the moment.
Yeah, the impression I got was that he especially had word-length problems in mind when writing the page. With what we know now, I can't use it to test entropy.

I think that if I did everything he suggests there, his h1 would be through the roof. He does suggest a number of glyphs which might be equivalent to others (for example, all gallows are varieties of the same thing?), which is one way to counter the h1 problem. But it's not specific enough to test.
(27-05-2020, 04:58 PM)MarcoP Wrote: I spent some time browsing through the WayBackMachine page, but it does not seem to detail how the OKO encoding works: the page seems to be more about a word-structure model (something earlier than the crust-mantle-core model).

From what Stolfi says in the page linked by Rene, OKO appears to be close to an abjad: 

You're right, Marco, it does seem to be about a word-structure model.

I might be missing your exact meaning about how the OKO encoding works, but I don't believe Stolfi intended to suggest it was similar to an abjad. I might be missing Stolfi's meaning too, but as I read it, it appears to be rather the opposite of an abjad: in the OKO model, every other 'letter' in a 'word' is explicitly a 'vowel' (if you accept that a, o and y are vowels, of course).

To save others going through Stolfi's extensive data tables in Note 017, I think this is the heart of the OKO model (a rough regex version of the pattern follows the definitions below):

Stolfi observes that 98% of the text consists of 'words' which follow the pattern:

Q?O?(KO?)* = Q?O?KO?KO?...KO?

where:

Q = { q }

O = { a o y }

K = { k    ke   ckh  ckhe
      t    te   cth  cthe
      ch   che
      sh   she
      ee   eee
      l    m    s    d
      n    r
      in   ir
      iin  iir
      iiin }

? = one occurrence or none

* = zero or more occurrences
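For anyone who wants to check the 98% figure on a transliteration, the pattern can be turned into a regular expression. This is only my reading of the formula above (with K exactly as in the list, so no p/f gallows), not Stolfi's own code:

import re

# K as listed above, longest groups first; note the list has no p/f gallows
K = ("ckhe|cthe|iiin|ckh|cth|che|she|eee|iin|iir|"
     "ke|te|ch|sh|ee|in|ir|k|t|l|m|s|d|n|r")
OKO = re.compile("^q?[aoy]?(?:(?:" + K + ")[aoy]?)*$")

for w in ["daiin", "qokeedy", "chol", "qopchody"]:
    print(w, bool(OKO.match(w)))   # the last one fails: p is not in K above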

He also says that a good proportion of the other 2% of the text might merely be scribal errors. 

Please do correct any misunderstanding I might inadvertently have inserted in the above.
I've been wrestling with this a lot whenever I had some spare time over the last few weeks, and wrote the results down in a blog post.
Inspired by a post by Koen G, I was playing with entropy, having my key and two questions in mind:

1) What is the simplest way to increase h2/h1?
The starting values were h1=3.86 and h2=2.13 (calculated using the Takahashi transcription).
It appears that the following adjustments give some improvement:
cGh -> Gr (G is any gallows)
ch -> er
sh -> prl
t -> ll
a, i, n removed
Resulting h1=3.12, h2=2.26 (so h2/h1=0.72)
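In case it helps anyone reproduce this: the adjustments are just a few ordered string replacements before computing the entropies. A rough Python sketch (the input file name is a placeholder for your EVA transliteration; spaces and line breaks are dropped; h2 is the conditional bigram entropy):

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h1_h2(text):
    h1 = entropy(Counter(text))
    hbig = entropy(Counter(text[i:i + 2] for i in range(len(text) - 1)))
    return h1, hbig - h1            # h2 = H(bigrams) - H(characters)

text = open("takahashi_eva.txt").read().replace("\n", "").replace(" ", "")

# cGh -> Gr for each gallows G, then ch -> er, sh -> prl, t -> ll;
# with this naive ordering the t coming from cth -> tr is also turned
# into ll -- adjust if that is not what was intended
for old, new in [("ckh", "kr"), ("cth", "tr"), ("cph", "pr"), ("cfh", "fr"),
                 ("ch", "er"), ("sh", "prl"), ("t", "ll")]:
    text = text.replace(old, new)
text = text.translate(str.maketrans("", "", "ain"))   # remove a, i, n

print(h1_h2(text))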

2) How does merging letters change entropy?
The contribution of letter X to the entropy formula is p * log2(p). In the ideal case, if X actually represents two letters with equal probability (p/2 each), then p * log2(p) becomes 2 * (p/2) * log2(p/2) = p * (log2(p) - 1), so the entropy increases (by p, once the sign in the entropy formula is taken into account). In the even more ideal case where every letter represents two equally probable letters, the entropy increases by exactly 1. And in the most ideal case, where every letter represents two equally probable letters independently of context, both h1 and h2 change by the same value (1).
All this is quite logical, since an alphabet twice as big requires 1 more bit per character. An interesting property of merging/splitting, however, is that (in the ideal case) it does not affect the difference (h2 - h1), so it can be used to manipulate h2/h1 without other side effects.
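To see this numerically, here is a small self-contained toy example (not VMS text; h2 is again the conditional bigram entropy). Each character of a Markov-generated string is split into one of two variants at random, independently of context; both h1 and h2 go up by about 1 bit and h2 - h1 stays the same:

import math
import random
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h1_h2(text):
    h1 = entropy(Counter(text))
    hbig = entropy(Counter(text[i:i + 2] for i in range(len(text) - 1)))
    return h1, hbig - h1

rng = random.Random(1)
# toy first-order Markov source over four letters, so that h2 < h1
chars = "abcd"
src = ["a"]
for _ in range(200000):
    src.append(src[-1] if rng.random() < 0.7 else rng.choice(chars))
src = "".join(src)

# split: each letter becomes one of two equiprobable variants (upper/lower case)
split = "".join(c.upper() if rng.random() < 0.5 else c for c in src)

print(h1_h2(src))      # roughly (2.0, 1.1)
print(h1_h2(split))    # roughly (3.0, 2.1): both up by ~1, difference unchanged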
I tried to replicate the results from Koen's blog post using a genetic algorithm in Python.
After many trials I actually got the same changes as Koen did (phew!).
It seems Koen did an awesome job getting his results.
Running the proggie tens of times, sometimes for hundreds of generations, several results turned up, all similar to Koen's, but nothing beat them on h1 and h2 considered together.
The output from my prog was put into nablator's 'Entropy' Java code and the results are posted here.
My numbers are not quite the same because I used a slightly different Q13 text as the original input.

Preprocess text:
sh = %
ch = !
ain = {
aiin = }
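In Python this preprocessing is just a handful of replacements (I do the longer groups first, to be on the safe side):

# map multi-glyph groups to single unused symbols, longest groups first
REPLACEMENTS = [("aiin", "}"), ("ain", "{"), ("sh", "%"), ("ch", "!")]

def preprocess(text):
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

print(preprocess("shedy qokaiin chol dain"))   # -> %edy qok} !ol d{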

Koen's results
java Entropy koen_best.txt
h1 = 4.2599308291379225
h2 = 2.7107551739850644
['qot', 'qok', 'ot', 'or', 'ol', 'ok', 'eey', 'edy', 'ar', 'al', '%e', '!e'] 

My best result
java Entropy my_best.txt
h1 = 4.268190134298851    (my best, but it has a higher h1)
h2 = 2.7127760842416575
['qot', 'qok', 'or', 'ol', 'ok', 'hy', 'eey', 'edy', 'ar', 'al', '%e', '!e']    diff:: 'hy' for 'ot'

A close result
java Entropy nearly.txt   (but with a lower h1)
h1 = 4.24663933742455
h2 = 2.7104328131417854
['qot', 'qok', 'ot', 'or', 'ol', 'ok', 'eey', 'edy', 'ar', 'al', '%ey', '!e']    diff:: '%ey' for '%e', i.e. 'shey' for 'she'
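For reference, what I do with a list like the ones above is map every group to an otherwise unused single character and write the result out for the Java Entropy code. A sketch (the input file name is a placeholder for the preprocessed Q13 text):

# collapse each group in the result list into a single unused character
groups = ['qot', 'qok', 'ot', 'or', 'ol', 'ok', 'eey', 'edy', 'ar', 'al', '%e', '!e']
spare = iter("0123456789AB")                      # characters not used by the text

text = open("q13_preprocessed.txt").read()
for g in sorted(groups, key=len, reverse=True):   # longest groups first
    text = text.replace(g, next(spare))
open("koen_best.txt", "w").write(text)
# then: java Entropy koen_best.txt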
This is really cool, thanks for the effort! What I was doing felt like manually imitating a genetic algorithm, so it is reassuring that we get similar results.

Do you also feel like this approach has been pushed to its limits now?
(17-04-2022, 09:20 PM)Koen G Wrote: This is really cool, thanks for the effort! What I was doing felt like manually imitating a genetic algorithm, so it is reassuring that we get similar results.

No problem, it was fun to do (sort of :) ). Once the code was up and running it gave results that were in accordance with yours, so that helped me push on.


(17-04-2022, 09:20 PM)Koen G Wrote: Do you also feel like this approach has been pushed to its limits now?

Yes, my impression is that what you have done has approached the limits; there is maybe a little left.


Then there is the addition of using spaces in the n-grams, and possibly getting the n-grams from words only.
ReneZ mentioned using spaces as separators but not as characters, so that n-grams across word boundaries are excluded.
I think that won't change much, but spaces are a whole different story.
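If anyone wants to try that, the only change is to collect bigrams inside each word separately, something like this (a sketch; h2 is again the conditional bigram entropy, with spaces used only as separators):

import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h1_h2_wordwise(text):
    words = text.split()                    # spaces as separators, not characters
    chars = Counter(c for w in words for c in w)
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    h1 = entropy(chars)
    return h1, entropy(bigrams) - h1        # no bigrams across word boundaries

print(h1_h2_wordwise("daiin shedy qokeedy daiin"))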

For the method of replacing strings in general, as you mentioned in your blog, it's a small step into the big world of verbose ciphers.

There are also various flavors of entropy: joint entropy, mutual information, cross entropy, etc.
No doubt some of these could be leveraged on the VMS.
An interesting case is the use of entropy calculations to find nulls (under specific conditions), mentioned at the bottom of the page where, incidentally, I also got the entropy code to put in my prog.
I think that this approach is very promising.

One challenge is to avoid increasing h1 too much while increasing h2.

My best result is:
h1: 3.927
h2: 2.971

I am rewriting the little paper I prepared to explain this. The transformation works in both directions, and the attached figure should be compared with Figure 12 of the linked page.

[attachment=6405]