TheEnglishKiwi > 01-03-2026, 11:07 PM
(01-03-2026, 11:36 AM)Mauro Wrote: I have a couple questions.
How do you consider spaces? As separators between words, or as any other character?
How many bits are required to store the VMS when you compress it using your Minimum Description Length algorithm (dictionary + text)?
ThomasCoon > 02-03-2026, 01:11 AM
(01-03-2026, 11:07 PM)TheEnglishKiwi Wrote:
- 8-bit: 903,682 bits (~51.5% reduction)
- 16-bit: 723,282 bits (~61.2% reduction)
- 24-bit: 688,640 bits (~63.2% reduction) (winner)
- 32-bit: 782,379 bits (~57.9% reduction)
This is an intriguing result for me. 24-bit (3 characters) is the mathematical sweet spot, which matches what Voynich researchers have long suspected. It suggests the text is built out of tight, highly constrained prefixes, roots and suffixes (e.g. "qok", "che", "dai").
Cheers!
Mauro > 02-03-2026, 11:44 AM
(01-03-2026, 11:07 PM)TheEnglishKiwi Wrote: 2. The raw, uncompressed EVA transcription is 1,862,208 bits or 232,776 bytes. The compressed size depends on which lens you run through the engine (standard MDL formula), so here is each of them:
- 8-bit: 903,682 bits (~51.5% reduction)
- 16-bit: 723,282 bits (~61.2% reduction)
- 24-bit: 688,640 bits (~63.2% reduction) (winner)
- 32-bit: 782,379 bits (~57.9% reduction)
This is an intriguing result for me. 24-bit (3 characters) is the mathematical sweet spot, which matches what Voynich researchers have long suspected. It suggests the text is built out of tight, highly constrained prefixes, roots and suffixes (e.g. "qok", "che", "dai").
Cheers!
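For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentage reductions from the figures quoted above. The names `RAW_BITS`, `compressed_bits` and `reduction_percent` are made up for this example and are not part of the poster's engine. Recomputing gives roughly 63.0% for the 24-bit lens and 58.0% for the 32-bit lens, slightly different from the quoted ~63.2% and ~57.9%; the ranking is unchanged.

```python
# Raw EVA transcription size as quoted in the post: 232,776 bytes * 8 bits.
RAW_BITS = 1_862_208

# Compressed sizes in bits, keyed by lens width, copied from the quoted list.
compressed_bits = {8: 903_682, 16: 723_282, 24: 688_640, 32: 782_379}


def reduction_percent(compressed: int, raw: int = RAW_BITS) -> float:
    """Percentage by which `compressed` is smaller than `raw`."""
    return 100.0 * (1.0 - compressed / raw)


reductions = {lens: reduction_percent(bits) for lens, bits in compressed_bits.items()}

# The "winner" is simply the lens with the smallest total description length.
best_lens = min(compressed_bits, key=compressed_bits.get)

for lens, pct in reductions.items():
    print(f"{lens}-bit: {compressed_bits[lens]:,} bits ({pct:.1f}% reduction)")
print(f"smallest total: {best_lens}-bit lens")
```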
nablator > 02-03-2026, 01:22 PM
(02-03-2026, 01:11 AM)ThomasCoon Wrote: If that's true, this will be a huge step forward in cracking Voynichese. People have suspected for years that the text is made up of bigrams or trigrams, but nobody has been able to figure out what the combinations are: for example, is "qo-" a block by itself, or should we include the gallows letter which follows qo- 80% of the time?
nablator > 02-03-2026, 01:26 PM
(02-03-2026, 11:44 AM)Mauro Wrote: Do your 688640 bits improve on my 480985 [I hope so]?
Mauro > 02-03-2026, 01:32 PM
(02-03-2026, 01:26 PM)nablator Wrote: (02-03-2026, 11:44 AM)Mauro Wrote: Do your 688640 bits improve on my 480985 [I hope so]?
If I read this correctly, 688640 > 480985, so no? Fewer bits means better compression. It is not difficult to improve on a compression scheme that has a severe restriction like 3-letter max chunks. Even BPE (byte-pair encoding) can do better because it is not limited by chunk size.
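To illustrate the point about chunk size, here is a toy sketch of greedy BPE in Python. It is not the algorithm used by either poster; the function name `bpe_merges` and the tiny word list are assumptions chosen for illustration. Because each merge joins two existing tokens, frequent strings can grow past 3 characters after a few rounds.

```python
from collections import Counter


def bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair.

    Returns the list of merged tokens (in merge order) and the re-tokenized corpus.
    """
    corpus = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs inside each word (no merging across word boundaries).
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so further merges save nothing
        merges.append(a + b)
        # Rewrite every word, replacing each occurrence of (a, b) with the merged token.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    return merges, corpus


# Toy input only: a handful of repeated EVA-style words, not the real transcription.
words = ["qokeedy"] * 4 + ["daiin"] * 3
merges, corpus = bpe_merges(words, num_merges=10)
print(merges)
print(corpus[0], corpus[-1])
```

On real text the merge order and the resulting vocabulary depend entirely on the corpus and on how spaces are handled, so this only shows the mechanism, not any claim about Voynichese.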
nablator > 02-03-2026, 01:42 PM
(02-03-2026, 11:44 AM)Mauro Wrote: I used Huffman codes to compress the text
Mauro > 02-03-2026, 02:05 PM
(02-03-2026, 01:42 PM)nablator Wrote: (02-03-2026, 11:44 AM)Mauro Wrote: I used Huffman codes to compress the text
Entropy gives you a theoretical number of bits; it's better for any purpose other than actual compression (like estimating the amount of information). Huffman codes are actual sequences of whole bits (not fractions of a bit), so performance is a little worse (more bits).
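A small sketch of that comparison, assuming a plain character-level model on a toy string (the sample text and the helper names `entropy_bits_per_symbol` and `huffman_code_lengths` are made up for this example). It computes the Shannon entropy bound and the total length of a Huffman encoding for the same symbol counts; the Huffman total is never below the entropy total and is less than one bit per symbol above it.

```python
import heapq
import math
from collections import Counter


def entropy_bits_per_symbol(counts):
    """Shannon entropy in bits per symbol for the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code built from counts."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}  # a lone symbol still needs one bit
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in the tree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Joining two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy input only, not the real transcription.
text = "daiin chedy qokeedy daiin shedy qokaiin"
counts = Counter(text)
n = sum(counts.values())

entropy_bits = entropy_bits_per_symbol(counts) * n
lengths = huffman_code_lengths(counts)
huffman_bits = sum(counts[s] * lengths[s] for s in counts)

print(f"entropy bound: {entropy_bits:.1f} bits")
print(f"Huffman total: {huffman_bits} bits")
```

Neither number includes the cost of storing the code table or dictionary, which an MDL-style total would have to add.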