The Voynich Ninja
Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Theories & Solutions (https://www.voynich.ninja/forum-58.html)
+--- Thread: Bit-level Geometry: A deterministic discovery of Voynichese building blocks (/thread-5407.html)

Pages: 1 2


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 01-03-2026

(01-03-2026, 11:36 AM)Mauro Wrote: I have a couple of questions.

How do you consider spaces? As separators between words, or as any other character?

How many bits are required to store the VMS when you compress it using your Minimum Description Length algorithm (dictionary + text)?

Hi there, great questions!

1. The engine is blind to words. It encodes everything into raw UTF-8 bytes, which are then fed into the engine as a continuous stream of 1s and 0s. It doesn't "see" a space; it just sees the bit pattern for the space character (00100000) like it would any other character, and it lets the math decide whether that space forms a structural boundary or not. The tool then uses spaces, tabs and newlines to figure out where those blocks sit, so it can categorise whether a block tends to appear at the start, middle or end of a traditional word grouping.
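To make the encoding step concrete, here is a minimal sketch (not the actual engine; the function name is made up):

```python
# Minimal sketch of the step described above: text becomes raw UTF-8
# bytes, which become one continuous stream of 1s and 0s. A space is
# just another byte pattern (00100000) in that stream.
def to_bitstream(text: str) -> str:
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

bits = to_bitstream("qokeedy daiin")
# The space between the two words shows up as the 8-bit pattern 00100000,
# indistinguishable (at this stage) from any other character's bits.
assert "00100000" in bits
assert len(bits) == 8 * len("qokeedy daiin")
```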

2. The raw, uncompressed EVA transcription is 1,862,208 bits (232,776 bytes). The compressed size depends on which Lens you run through the engine (standard MDL formula), so here's each of them:

  • 8-bit: 903,682 bits (~51.5% reduction)
  • 16-bit: 723,282 bits (~61.2% reduction)
  • 24-bit: 688,640 bits (~63.2% reduction) (winner)
  • 32-bit: 782,379 bits (~57.9% reduction)

This is an intriguing result for me. 24-bit (3 characters) is the mathematical sweet spot which matches what Voynich researchers have long suspected. It suggests the text is built out of tight, highly constrained prefixes, roots and suffixes (e.g. "qok", "che", "dai").
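The post doesn't spell out the exact Lens/MDL formula, but the comparison across fixed chunk widths can be sketched with a textbook-style MDL cost: dictionary bits (each unique chunk stored once) plus Shannon code lengths for the chunk stream. Everything below is an assumption for illustration, not the author's code:

```python
import math
from collections import Counter

# Hypothetical MDL comparison over fixed chunk widths:
# cost = bits to store each unique chunk once (the dictionary)
#      + Shannon code length of the chunk stream (the text).
# The real engine's "Lens" formula may well differ.
def mdl_cost(bits: str, width: int) -> float:
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    counts = Counter(chunks)
    total = len(chunks)
    dict_cost = len(counts) * width
    text_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return dict_cost + text_cost

stream = "0110000101100010" * 50          # toy bit stream, not the VMS
best = min((8, 16, 24, 32), key=lambda w: mdl_cost(stream, w))
```

On this toy stream the 16-bit lens wins because the data literally repeats with period 16; the interesting claim in the post is that the real transcription's sweet spot falls at 24 bits.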

Cheers!


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - ThomasCoon - 02-03-2026

(01-03-2026, 11:07 PM)TheEnglishKiwi Wrote:
  • 8-bit: 903,682 bits (~51.5% reduction)
  • 16-bit: 723,282 bits (~61.2% reduction)
  • 24-bit: 688,640 bits (~63.2% reduction) (winner)
  • 32-bit: 782,379 bits (~57.9% reduction)

This is an intriguing result for me. 24-bit (3 characters) is the mathematical sweet spot which matches what Voynich researchers have long suspected. It suggests the text is built out of tight, highly constrained prefixes, roots and suffixes (e.g. "qok", "che", "dai").

Cheers!

If that's true, this will be a huge step forward in cracking Voynichese. People have suspected for years that the text is made up of bigrams or trigrams, but nobody has been able to figure out what the combinations are: for example, is "qo-" a block by itself, or should we include the gallows letter which follows qo- 80% of the time?

Thank you for your work, TheEnglishKiwi!


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Mauro - 02-03-2026

(01-03-2026, 11:07 PM)TheEnglishKiwi Wrote: 2. The raw, uncompressed EVA transcription is 1,862,208 bits or 232,776 bytes. It depends on which Lens you run through the engine (standard MDL formula) so here's each of them:

  • 8-bit: 903,682 bits (~51.5% reduction)
  • 16-bit: 723,282 bits (~61.2% reduction)
  • 24-bit: 688,640 bits (~63.2% reduction) (winner)
  • 32-bit: 782,379 bits (~57.9% reduction)

This is an intriguing result for me. 24-bit (3 characters) is the mathematical sweet spot which matches what Voynich researchers have long suspected. It suggests the text is built out of tight, highly constrained prefixes, roots and suffixes (e.g. "qok", "che", "dai").

Cheers!

I think it would be interesting to compare your results with mine.

I used a different approach, modeling the VMS word types using a slot grammar, which was then used to derive the 'chunks' composing the words. Then I applied a metric (which I called Nbits): the number of bits needed to compress the text using the identified chunks as the dictionary for the compression algorithm (the procedure is described in an earlier thread; the text was compressed using Huffman codes).
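For readers unfamiliar with the Huffman part of this, here is a toy sketch; the chunk stream is invented for illustration, not the slot-grammar output:

```python
import heapq
from collections import Counter

# Toy sketch of the Nbits idea: given a fixed chunkification, build a
# Huffman code over the chunks and count the bits needed for the text.
def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given counts."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}     # degenerate single-symbol case
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {s: 0 for s in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # every merge adds one bit of depth
            depth[s] += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, s1 + s2))
        tiebreak += 1
    return depth

tokens = ["qok", "che", "dai", "qok", "qok", "che"]   # hypothetical chunk stream
lengths = huffman_code_lengths(Counter(tokens))
text_bits = sum(lengths[t] for t in tokens)           # the "text" part of Nbits
```

The dictionary part of the figure would then be however many bits it takes to store the chunk inventory itself.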

The best compression I found (see the posts in that thread, including a discussion of the problems I ran into) required 480985 bits to encode the whole EVA text (not guaranteed to be optimal in any sense; I used a Monte Carlo engine to sample different 'chunkifications' of words): 15314 bits for the dictionary, 465671 bits for the text.

I was looking forward to an algorithm implementing a 'chunkification' of words without being supervised, so I'm happy to see yours, unsupervised and MDL-based (though I also agree with others here that it's not a given this can be done meaningfully). It would be nice to compare the compression rate you achieved with mine. However, there are a lot of caveats in comparing the two results: I used Huffman codes to compress the text, I considered spaces as word separators (and now I'm unsure of how/if I counted them in the Nbits figure), and I removed words with 'rare' characters (from EVA 'g' downwards). Not knowing exactly what you did, it's hard to say how to make the comparison (nor do I have much time for it). What do you think? Does your 688640 bits improve on my 480985 [I hope so]?

And did you find many 'chunks' which include a space in the middle, i.e. like "y q"?


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - nablator - 02-03-2026

(02-03-2026, 01:11 AM)ThomasCoon Wrote: If that's true, this will be a huge step forward in cracking Voynichese. People have suspected for years that the text is made up of bigrams or trigrams, but nobody has been able to figure out what the combinations are: for example, is "qo-" a block by itself, or should we include the gallows letter which follows qo- 80% of the time?

Again, the performance of a given compression algorithm is not evidence of anything.

If the goal is to figure out the cipher (if it exists) there are many alternatives to the (static) tokenization.

That trigrams are important is not new: you must include constraints at least at the trigram level if you want to generate good-looking pseudo-Voynichese: it handles LAAFU and preferences across word breaks nicely. The only visible problem with a third-order Markov chain program is that it generates too many long words, so the word-length distribution is not good.
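A toy version of such a generator (the training text here is invented for illustration):

```python
import random
from collections import defaultdict

# Third-order Markov chain over characters: the next character is drawn
# conditioned on the previous three, so constraints can reach across the
# space between words. Nothing here caps word length, which is exactly
# the over-long-words problem described above.
def train_trigram_model(text):
    model = defaultdict(list)
    for i in range(len(text) - 3):
        model[text[i:i + 3]].append(text[i + 3])
    return model

def generate(model, seed, n, rng):
    out = list(seed)
    for _ in range(n):
        choices = model.get("".join(out[-3:]))
        if not choices:
            break
        out.append(rng.choice(choices))
    return "".join(out)

sample = "qokeedy qokain daiin chedy qokeedy daiin "
model = train_trigram_model(sample * 10)
pseudo = generate(model, "qok", 40, random.Random(0))
```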


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - nablator - 02-03-2026

(02-03-2026, 11:44 AM)Mauro Wrote: Does your 688640 bits improve on my 480985 [I hope so]?

If I read this correctly, 688640 > 480985, so no? Fewer bits means better compression. It is not difficult to improve on a compression that has a severe restriction like 3-letter max chunks. Even BPE can do better because it is not limited by chunk size.
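For reference, a bare-bones sketch of the BPE idea (one merge per round; not any particular library's implementation):

```python
from collections import Counter

# Byte-pair encoding in miniature: repeatedly fuse the most frequent
# adjacent pair of tokens. Learned chunks can grow arbitrarily long,
# which is why BPE is not limited to 3-letter pieces.
def bpe_merges(tokens, rounds):
    for _ in range(rounds):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

toks = bpe_merges(list("qokeedyqokain"), 2)   # two merges: q+o, then qo+k
```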


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Mauro - 02-03-2026

(02-03-2026, 01:26 PM)nablator Wrote:
(02-03-2026, 11:44 AM)Mauro Wrote: Does your 688640 bits improve on my 480985 [I hope so]?

If I read this correctly, 688640 > 480985, so no? Fewer bits means better compression. It is not difficult to improve on a compression that has a severe restriction like 3-letter max chunks. Even BPE can do better because it is not limited by chunk size.

At face value yes, of course, but the two figures are not directly comparable, so it may be that 688640 is actually better than 480985. Then, as I said, there's no guarantee this kind of exercise is even meaningful, who knows.


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - nablator - 02-03-2026

(02-03-2026, 11:44 AM)Mauro Wrote: I used Huffman codes to compress the text

Entropy gives you a theoretical number of bits; it's better suited to any purpose other than actual compression (like estimating the amount of information). With Huffman codes, which are actual sequences of whole bits (not fractions), performance is a little worse (more bits).
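A small worked example of that gap (the counts are invented for illustration):

```python
import math
from collections import Counter

# The entropy bound allows fractional bits per symbol; Huffman codes must
# use whole bits, so they can only match or exceed the bound.
freqs = Counter({"a": 5, "b": 4, "c": 1})
total = sum(freqs.values())
entropy_bits = -sum(c * math.log2(c / total) for c in freqs.values())

# Huffman on counts {5, 4, 1}: merge c(1)+b(4) -> 5, then merge with a(5);
# resulting code lengths: a=1 bit, b=2 bits, c=2 bits.
huffman_bits = 5 * 1 + 4 * 2 + 1 * 2
# entropy_bits is about 13.61 while huffman_bits is 15: a little worse.
```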


RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Mauro - 02-03-2026

(02-03-2026, 01:42 PM)nablator Wrote:
(02-03-2026, 11:44 AM)Mauro Wrote: I used Huffman codes to compress the text

Entropy gives you a theoretical number of bits; it's better suited to any purpose other than actual compression (like estimating the amount of information). With Huffman codes, which are actual sequences of whole bits (not fractions), performance is a little worse (more bits).

I used Huffman codes only because they're demonstrably optimal (given a certain 'chunkification' of a certain text), so I fancied they could be a sound baseline. They're actually very much anachronistic in the VMS context. If, and I stress 'if', VMS word types are actually composed from a number of chunks, trying to identify them by computing Nbits over a Huffman-coded compression is probably not going to work well, but yet again, who knows. It did not work for me (with the added caveat of the supervising slot grammar, which limits what the compressor can do, a problem which does not apply to Minimum Description Length).