Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Printable Version
The Voynich Ninja (https://www.voynich.ninja) - Theories & Solutions (https://www.voynich.ninja/forum-58.html) - Thread: /thread-5407.html
Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 28-02-2026

Hi all, I'm incredibly excited to share the findings of my research and to present a tool for the community to try the method out for yourselves!

In a nutshell: we have created a method that uses data compression to mathematically map the most statistically optimal, repeating building blocks within a closed text.

Constellation Analogy: imagine you've been asked to draw a shape with the stars in the night sky, without being told what the shape is. Hundreds of different shapes might seem to fit perfectly, making it impossible to know which pattern is real just from looking. Our method ignores all of those imaginary shapes. Instead, it measures gravity to prove exactly which stars are bound together in the same system. By mathematically mapping the true structural clusters first, no one wastes time trying to connect stars that are actually light-years apart.

In the context of the manuscript, those stars are the raw characters of the EVA transcription. It's incredibly easy to group the wrong letters together because a cluster might look similar to a word in, say, Latin or Hebrew. So instead of guessing, our engine uses a mathematical principle called Minimum Description Length (MDL) as our "gravity". We can measure exactly which letters are structurally bound together to form highly stable candidate morphs (structural prefixes, word-cores and suffixes). It's not translation: translators still have the hard job of deciphering the meaning, but now you can test your translation theories against mathematically verified structural boundaries rather than waste time on statistically random clusters of letters.

How does this differ from existing MDL/entropy techniques? (simplified)

If you're not familiar with MDL, or want to skip the more technical details, feel free to scroll past this and the next section.

Traditional entropy tools like standard n-gram analysis treat text as fixed blocks of data. They identify a character or token by how often it appears, or by a cryptographic hash. Basically, they count things. Our proprietary engine (patent pending) maps data into a 2D bit-array. We identify a token by its geometric centroid (its centre of gravity), not just as a list of characters. This allows us to recognise patterns even if the transcription varies, so long as the geometric shape is stable. A classic example is "csheedy" and "sheedy". Try this for yourself with the Word Shape Visualiser tool here: [link requires login].
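To make the bit-array idea concrete, here is a minimal toy sketch of that kind of representation. Everything in it is an assumption made for illustration (the 8-bit encoding, the centroid definition); it is not the engine's actual code, which hasn't been shared:

```python
# Toy sketch of a "2D bit-array" word representation (illustrative only):
# stack each character's 8-bit code point as one row of a bit matrix,
# then locate the centroid (centre of gravity) of the set bits.

def bit_matrix(word: str) -> list[list[int]]:
    """One row per character; columns are the 8 bits of its code point."""
    return [[(ord(c) >> (7 - i)) & 1 for i in range(8)] for c in word]

def centroid(matrix: list[list[int]]) -> tuple[float, float]:
    """Mean (row, column) position over all 1-bits."""
    ones = [(r, c) for r, row in enumerate(matrix)
            for c, bit in enumerate(row) if bit]
    return (sum(r for r, _ in ones) / len(ones),
            sum(c for _, c in ones) / len(ones))

for word in ("csheedy", "sheedy"):
    print(word, centroid(bit_matrix(word)))
```

On this toy encoding the shared "sheedy" rows dominate both centroids and the extra leading "c" shifts the result only modestly; whether two words then count as "the same shape" depends on how centroids are compared (e.g. normalised by word length), a detail the public write-up doesn't specify.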
MDL Comparison Continued: for those who might have more questions on this part, here are a few additional comparisons.

1. Traditional methods use fixed-window chunking, e.g. 8-bit or 16-bit, which risks cutting words in the middle. Our engine computes an "optimal symbol length" from the data itself via Shannon entropy minimisation, making segmentation data-driven instead of a hard-coded assumption.
2. Traditional compression tools are often lossy or purely mathematical, which makes inspecting or replaying the data in its original form difficult. Our engine is a reversible ledger and returns the exact original bits, so the patterns found represent the original data, not just mathematical abstractions.
3. Traditional statistical tools will find patterns in random noise, since random data naturally produces frequency spikes. Our engine runs the ingested data against null controls (shuffle, Markov and random samples) before it considers a structure statistically significant. If a pattern appears in the text but at the same frequency in randomised text, it is flagged as noise, not structure.

How to mathematically verify that what the engine finds is valid

It's important that anyone can verify the method; otherwise you'd have no reason to accept anything the engine gives you as "truth". We verify the process through a couple of important, well-known mathematical principles. I've written a guide you can follow along to conduct your own experiment, using EVA text of your own choosing, and verify your own results. Head over to the guide here: [link requires login].

If you can't be bothered going through the guide, here is the bottom line: we mathematically demonstrate that the text is being compressed, a "trick" that only works if you have found real, repeating patterns. You cannot compress random data: if every piece of data is unique, you cannot describe it in fewer symbols without losing information. Finding these patterns reveals a hidden "lego set" used to build the words. While we can only guess where the lego bricks click together, the maths settles it: they only click together when doing so makes the dictionary smaller, not larger. The test covers every single letter, so if we cut just one letter to the left or right, the file size would increase, not decrease. If the engine shows that sh always connects to edy to maximise compression, that is no longer a statistical coincidence; it is a definitive, physical property of the manuscript's dataset. This engine is helpful because humans tend to see patterns where none exist. We can see a face in the clouds, but maths has no brain and cannot be fooled. In short: if the maths can make the dictionary smaller, the pattern is physically real.
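You can sanity-check the "you cannot compress random data" claim yourself with any off-the-shelf compressor, independently of the engine. This is a generic illustration of the shuffle null control, not the engine's own test, and the sample text is a placeholder:

```python
# Generic null-control check: text with real sequential structure compresses
# better than a character-shuffled copy with identical symbol frequencies.
import random
import zlib

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), 9))

text = "qokeedy qokeedy chedy qokain shedy qokedy chedy " * 50  # placeholder EVA sample
shuffled = "".join(random.sample(text, len(text)))  # same characters, random order

print("original :", compressed_size(text))
print("shuffled :", compressed_size(shuffled))  # expect a noticeably larger size
```

If the shuffled copy compressed nearly as well, the "patterns" in the original would be mere frequency effects rather than sequential structure, which is exactly what the shuffle and Markov null controls are meant to separate out.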
USE THE ENGINE YOURSELF

Now for the fun part: we have put together a few different tools to help enthusiasts test their translation theories in a few different ways. Link to the tools: [requires login].

Pattern Decoder [link requires login]: find repeating patterns at character, prefix, word and phrase levels.
1. Paste in an EVA transcription. You can use the Quick Load buttons to load a sample folio.
2. Change the Lens depending on whether you want to focus on individual characters, structural prefixes, repeating word-cores, etc.
3. Click Analyse and see the patterns found in that EVA text. If there is a particular root or phrase you're interested in, click Find Context, which takes you to the next tool.

Context tool [link requires login]: investigate grammatical roles without guessing the meaning.
1. Paste in a pattern you are interested in and provide the corpus you wish to scan it against. If you used the Find Context button from the Pattern Decoder, this is done and the analysis performed automatically.
2. It will show you which words contain your pattern, as well as the words that come before and after it (and their frequency).

Word Shape Visualiser [link requires login]: compare the 2D bit structure of EVA words to see how word prefixes and roots differ in their geometric representation.
1. Enter two EVA words. By default the Lens is set to candidate syllables, prefixes and word-cores.
2. Try "csheedy" and "sheedy" as a test to see how they share a common structural root.

Audit tool [link requires login]: compare a highly compressible EVA block with what you think the English translation is, to test your theory.
1. Provide your EVA text and English word; try "ol" and "the" as an example.
2. Enter a corpus (folio) to compare them against and run the audit.
3. The engine will substitute the English word for the EVA word throughout the corpus you selected, then evaluate the result for grammatical position, entropy impact and morphological family coherence.
4. It will give you a score to review. There are various guides on the page to help you interpret the results.

SUMMARY

I would love to get the community's feedback. I'm happy to share more information, provide more detail on any particular area, collaborate, help test theories, build more tools and more. There is an FAQ section you can view here: [link requires login].

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Typpi - 28-02-2026

I don't think putting "No AI" and "No machine learning" on the website makes it true. Can you give us a proper summary in human words?

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 28-02-2026

(28-02-2026, 07:26 AM)Typpi Wrote: Can you give us a proper summary in human words?

In simple terms, I've shared a tool that uses maths to find the consistent structural building blocks of the text: the repeated "bricks" used to build the words, without having to rely on human guesswork.

Example: take the English word "unbelievable". If you didn't speak English, you might guess the blocks were "unb" - "eliev" - "able". But that's a bad guess, because those pieces aren't very reusable. If the engine cuts the word at "un-", it realises it can reuse that same brick to explain hundreds of other words like "un-happy", "un-known" or "un-do". The maths shows that "un-" is the optimal building block because it's the most efficient way to organise the dataset. If you guess that "cs" is a prefix of "csheedy" and you're wrong, that's potentially years wasted going down the wrong rabbit hole. So this tool acts as a map for translators to test their theories on statistically verified units, rather than random clusters of letters that merely look familiar. And the maths is evidence because the engine can only compress the data if it contains repeating, predictable patterns (see the verification link in the original post for why this is so). To be clear: the maths only proves that a pattern exists. It doesn't prove why it exists or what it means.
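A minimal toy sketch of this dictionary-cost argument (the scoring here is my own simplified stand-in, counted in characters rather than bits, and the word list is a placeholder; it is not the engine's actual cost function):

```python
# Toy MDL-style cost: characters stored in the dictionary plus one
# reference per piece used to spell out each word. A reusable piece
# pays for its dictionary entry; a one-off piece does not.

WORDS = ["unbelievable", "unhappy", "undo", "unknown", "unseen", "unusual"]

def description_length(prefix: str) -> int:
    pieces: set[str] = set()
    refs = 0
    for w in WORDS:
        parts = [prefix, w[len(prefix):]] if w.startswith(prefix) else [w]
        pieces.update(parts)
        refs += len(parts)
    return sum(len(p) for p in pieces) + refs

for prefix in ("un", "unb"):
    print(prefix, description_length(prefix))  # "un" yields the smaller total
```

The real MDL objective would count bits and search over all segmentations rather than testing two hand-picked prefixes, but the ranking logic is the same: cutting at "un" shrinks the total description, cutting at "unb" inflates it.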
RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - nablator - 28-02-2026

(28-02-2026, 07:06 AM)TheEnglishKiwi Wrote: If we find a pattern that repeats, it means the author was following a rule. If the engine shows that sh always connects to edy, that is no longer a coincidence - it is a law of the language.

Hi, the problematic thing is: there is no law, only local preferences. There have been many attempts to tokenize the text into smaller chunks that could be used as building blocks, all of them unconvincing IMHO. The idea is to increase the character conditional entropy, not only the (Shannon) character entropy.

Quote: The math proves exactly where the author(s) of the manuscript connected their building blocks.

I don't think it does. Different compression algorithms produce different results, and there is no guarantee that the optimal compression results in the best tokenization, or that tokenization is even necessary to parse the text. For example, the tokenization used by LLMs (see [link requires login]) is not the one that maximizes compression. Best related to meaning is not the same as best compressed. There is a dictionary in every zip file (the LZW algorithm uses a dictionary). If you zip a text file in English, does the content of the dictionary reveal some deep truth about the building blocks of the English language? Of course not. It may identify some useful morphemes (prefixes, roots, suffixes), but not all: it only identifies frequent repetitions, not how the language is put together.
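For reference, the two quantities nablator distinguishes can be computed directly. A generic sketch, not tied to any particular engine, with a placeholder sample string:

```python
# Character entropy H(X) vs conditional entropy H(X | previous character).
# A low H(X | prev) means the next character is highly predictable from
# the one before it, even when H(X) alone looks unremarkable.
from collections import Counter
from math import log2

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())

def conditional_entropy(text: str) -> float:
    pairs = Counter(zip(text, text[1:]))  # joint (prev, next) counts
    prev = Counter(text[:-1])
    return entropy(pairs) - entropy(prev)  # H(X|prev) = H(prev, X) - H(prev)

sample = "qokeedy shedy qokain chedy qokeedy shedy"  # placeholder EVA-like text
print("H(X)      :", round(entropy(Counter(sample)), 3))
print("H(X|prev) :", round(conditional_entropy(sample), 3))
```

Merging strongly bound pairs into single tokens raises the conditional entropy of the resulting token stream, which is the criterion nablator suggests a good tokenization should improve, rather than raw compression alone.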
RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Mauro - 28-02-2026

Well, I've tried for a long time to find a method to parse Voynich word types into 'chunks', including by using a compression algorithm, without finding anything conclusive. So I'm happy to see this approach taken further with a new tool. Do you have a list of the 'repeating patterns' the algorithm identifies in the manuscript (or, even better, in the different sections of the manuscript)?

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Jorge_Stolfi - 28-02-2026

(28-02-2026, 11:22 AM)TheEnglishKiwi Wrote: In simple terms, I've shared a mathematically grounded tool, designed to identify the true structural building blocks (e.g. prefixes, root words and suffixes), without having to rely on human guesswork (prone to bias).

That claim is questionable already at the theoretical level, because only a vanishingly small subset of all languages admits a factorization L = A B where any "prefix" string from the set A can be concatenated with any "suffix" string from set B. The structure of the vocabulary of a natural language is more like the union (and not the disjoint union) of dozens or hundreds of separate languages L = L1 ∪ L2 ∪ ... ∪ Ln, where potentially each term can be factored into the product of two languages Li = Ai Bi.

But that "potentially" must be strongly stressed, because this model cannot be applied in practice. First, the lexicon extracted from any finite text in the language will miss many of the pairs ai bi from the product Ai Bi, thus preventing the recognition of that product as one of the terms. Second, many of those pairs will be missing even in the "ideal" complete lexicon, because of semantic constraints. And third, phonetic and spelling effects typically would require splitting the language into hundreds of product terms, making the identification of those terms highly questionable.

In your example, for instance, the "able" words actually belong to several categories depending on the pair of suffixes. Thus the prefixes A1 = {"speak", "pass", "read", ...} can pair with the suffixes B1 = {"", "able"}; but the prefixes A2 = {"fus", "coerc", "collaps"} must be paired with B2 = {"e", "ible"}, while A3 = {"abat", "abus", ...} should pair with B3 = {"e", "able"}, and A4 = {"abdic", "abomin", ...} should pair with B4 = {"ate", "able"}, and so on. And some verbs, like "exist", "remain", "comprise", would not have "able" forms even in principle.

Quote: If you didn't speak English, you might chop that word up, guessing the blocks were "unb" - "eliev" - "able". When actually: ...

Why is your "actually" alternative factorization more correct than the "non-speaker" one? Because of semantics? But we don't know the semantics of the Voynichese words...

Quote: The math proves exactly where the author(s) of the manuscript connected their building blocks.

I bet it does not. Math may provide a set of prefixes, cores, and suffixes that have certain favorable statistical properties (like most previous word models had), but it cannot prove that the Author thought of the words in terms of those parts.

All the best, --stolfi
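Stolfi's coverage objection can be checked numerically for any candidate factorization: take a proposed prefix set A and suffix set B and measure how much of the product A·B is actually attested in the lexicon. A minimal sketch, with illustrative placeholder word sets built from his own example:

```python
# Coverage of a candidate factorization L ?= A·B: the closer to 1.0,
# the closer the lexicon is to a true product of the two sets.
from itertools import product

A = {"speak", "pass", "read", "exist"}  # candidate prefixes (placeholders)
B = {"", "able"}                        # candidate suffixes (placeholders)
LEXICON = {"speak", "speakable", "pass", "passable",
           "read", "readable", "exist"}  # attested words only

attested = sum((a + b) in LEXICON for a, b in product(A, B))
total = len(A) * len(B)
print(f"coverage: {attested}/{total} = {attested / total:.2f}")  # "existable" is missing
```

On a real vocabulary the coverage of any single A·B product drops quickly as the sets grow, which is the reason Stolfi argues the lexicon behaves like a union of many small products rather than one factorizable whole.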
RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 01-03-2026

(28-02-2026, 01:08 PM)nablator Wrote: I don't think it does. Different compression algorithms produce different results and there is no guarantee that the optimal compression results in the best tokenization or that tokenization is even necessary to parse the text.

Hi there, thank you for the excellent feedback, I appreciate it!

1 (LZW): The issue with LZW is that it's a greedy form of compression. It reads text left-to-right and grabs whatever repeats first to save space quickly. Our engine uses Minimum Description Length (MDL): we evaluate the mathematical cost of the entire dictionary, which penalises the short-sighted chunks LZW would make. MDL is considered the gold standard for finding roots and affixes without knowing the language (see John Goldsmith's work on unsupervised morphology).

2 (LLM/BPE): LLMs are poor at finding linguistic roots because they use tokenisers like BPE, which merge the most frequent pairs of bytes together. That produces exactly what you describe: "best compressed" does not equal "best meaning". We use geometric centroids and evaluate the global boundary tax, not just raw frequency.

3 (Conditional entropy): I take your point that we need to look at conditional entropy, not just Shannon entropy, but that is actually a description of how our engine works. When a sequence has extremely low conditional entropy (e.g. "sh" strictly predicts "edy"), an MDL engine groups the pair into a single block. By doing so, the entropy between the newly formed blocks goes up.

I must admit, however, that I was too enthusiastic in saying "it is the law of the language". The maths maps out the physical structure of the text, not the underlying semantics. Thanks!

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 01-03-2026

(28-02-2026, 03:12 PM)Jorge_Stolfi Wrote: That claim is questionable already at the theoretical level

Hi Jorge, a huge thanks for taking the time to give such excellent feedback! Allow me to provide some critical clarifications, as I believe you are looking at the tool through the lens of traditional generative linguistics. The tool is strictly an extractive, descriptive compression engine.

Important distinction between "Generative" and "Descriptive": You are absolutely right that language is not a disjoint union. However, the engine isn't trying to write a rulebook for how to generate new Voynichese words (where L = A B falls apart due to exceptions). MDL doesn't care about generating new words; it is strictly descriptive. It treats the entire manuscript as a closed universe of data. It doesn't claim that any prefix can attach to any suffix; rather, it simply draws a mathematical map of the pieces that already exist on the page. It states: "In this specific manuscript, piece A connects to piece B exactly X times."

Phonetic Variations: You correctly point out that roots change spelling when suffixes are added. But MDL doesn't rely on uniform phonetic rules. If the text uses both "-able" and "-ible", the engine simply identifies them as two separate, highly efficient suffix blocks. It maps the messiness exactly as it is, without ever needing to understand the morphological rules behind it.

Global Dictionary Optimisation vs Semantics: Regarding the assumption that you need to know the meaning of a word in order to know where to cut it, MDL addresses this through global dictionary optimisation. Compare two scenarios using the "unbelievable" example. If the engine cuts "unb-", it has to create a new dictionary entry that can only be reused a handful of times (e.g. "unbelievable", "unbalanced"); the mathematical cost of storing "unb-" far outweighs the compression space saved. If it cuts "un-", it can reuse that exact block hundreds of times across the entire text ("unhappy", "undo", "unseen", "unknown", "unusual" and so on). "un-" wins as the structural block because it returns the highest global compression ratio, all without needing to know what the words actually mean.

Author's Intent: As I replied to another comment, I concede this point. You are 100% correct that there is no proof of psychological intent from the author. The maths demonstrates the structural rules of the text, showing how it was physically assembled, but not what was in the author's head.

Given your own extensive work on Voynichese word paradigms, are there specific 'rules' you've observed in Voynichese that you believe a pure compression algorithm would blindly fail to capture?

Cheers, Ben

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - TheEnglishKiwi - 01-03-2026

(28-02-2026, 01:31 PM)Mauro Wrote: Well, I've tried for a long time to find a method to parse Voynich word types into 'chunks', including by using a compression algorithm, without finding anything conclusive. So I'm happy to see this approach taken further with a new tool. Do you have a list of the 'repeating patterns' the algorithm identifies in the manuscript (or even better, in the different sections of the manuscript)?

Hey Mauro, I ran a quick snapshot script; here's the output!
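(The script itself wasn't posted; a minimal sketch of the kind of counting it appears to do, assuming a whitespace-delimited EVA transcription at a placeholder path, and noting that the posted "cores" include space-crossing grams like "y q", so cores are counted over the raw character stream here:)

```python
# Quick snapshot: most frequent word-initial, word-final and internal
# n-grams in a whitespace-delimited EVA transcription.
from collections import Counter

def snapshot(text: str, top: int = 10) -> None:
    words = text.split()
    prefixes = Counter(w[:3] for w in words if len(w) >= 3)
    suffixes = Counter(w[-3:] for w in words if len(w) >= 3)
    # cores counted over the raw stream, so grams may span a space
    cores3 = Counter(text[i:i + 3] for i in range(len(text) - 2))
    cores4 = Counter(text[i:i + 4] for i in range(len(text) - 3))
    for label, counts in [("prefixes", prefixes), ("suffixes", suffixes),
                          ("cores (3-grams)", cores3), ("cores (4-grams)", cores4)]:
        print(label + ":", ", ".join(g for g, _ in counts.most_common(top)))

snapshot(open("eva_transcription.txt", encoding="utf-8").read())  # placeholder path
```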
--- Voynich Quick Snapshot: Most Frequent EVA Patterns ---

**Common Prefixes** (word-initial 3-grams)
ch-, qo-, sh-, ok-, ot-, da-, ol-, ai-, or-, al-

**Common Suffixes** (word-final 3-grams)
-dy, -in, -ey, -ol, -ar, -or, -hy, -al, -!n, -ir

**Common Roots/Cores** (internal 3-grams)
che, iin, aii, edy, y q, qok, cho, she, y o, oke

**Common Roots/Cores** (internal 4-grams)
aiin, y qo, hedy, dy q, y ch, ched, daii, qoke, okee, n ch

--- End Snapshot ---

RE: Bit-level Geometry: A deterministic discovery of Voynichese building blocks - Mauro - 01-03-2026

I have a couple of questions. How do you consider spaces: as separators between words, or as any other character? And how many bits are required to store the VMS when you compress it using your Minimum Description Length algorithm (dictionary + text)?