Hi all, I'm incredibly excited to share the findings of my research, and present a tool for the community to try out the method for yourselves!
In a nutshell:
We have created a method that uses data compression to mathematically map the most statistically optimal repeating building blocks within a closed text.
Constellation Analogy:
Imagine you've been asked to draw a shape using the stars in the night sky, without being told what the shape is. Hundreds of different shapes might seem to fit perfectly, making it impossible to know which pattern is real just by looking at it.
Our method ignores all of those imaginary shapes. Instead, it measures gravity to prove exactly which stars are physically bound together in the same cluster. By mathematically mapping the true structural clusters first, nobody wastes time trying to connect stars that are actually light-years apart.
In the context of the manuscript, those stars are the raw characters of the EVA transcription. It's incredibly easy to group the wrong letters together just because they look similar to a word in, say, Latin or Hebrew.
So instead of guessing, our engine uses a mathematical principle called Minimum Description Length (MDL) as its "gravity". We can mathematically measure exactly which letters are structurally bound together to form highly stable candidate morphs (structural prefixes, word-cores and suffixes).
It's not translation; translators still have the hard job of deciphering the meaning. But now you can test your translation theories against mathematically verified structural boundaries, rather than wasting time on statistically random clusters of letters.
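To make the MDL idea concrete, here is a toy sketch in Python. This is my own illustration, not the engine itself: the corpus, the greedy tokeniser and the cost model are all simplified assumptions. The core idea is that a candidate morph only earns its place in the dictionary if encoding the corpus with it costs fewer bits overall than encoding without it.

```python
import math

def description_length(text, dictionary):
    """Greedy-tokenise `text` using `dictionary` entries (longest first),
    falling back to single characters, then charge:
    dictionary storage cost + log2(symbol count) bits per token."""
    tokens = []
    i = 0
    while i < len(text):
        for entry in sorted(dictionary, key=len, reverse=True):
            if text.startswith(entry, i):
                tokens.append(entry)
                i += len(entry)
                break
        else:
            tokens.append(text[i])
            i += 1
    symbols = set(tokens) | set(dictionary)
    bits_per_token = math.log2(max(2, len(symbols)))
    dict_cost = sum(len(entry) + 1 for entry in dictionary) * 8
    return dict_cost + len(tokens) * bits_per_token

corpus = "shedy okedy shedy chedy shedy " * 5   # made-up EVA-like toy corpus
print(description_length(corpus, []))           # no dictionary
print(description_length(corpus, ["shedy"]))    # real repeating morph: cheaper
print(description_length(corpus, ["zzz"]))      # morph that never occurs: dearer
```

A genuinely repeating morph pays for its own dictionary entry by shortening the token stream; a bogus one only adds cost. That trade-off is the "gravity".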
How does this differ from existing MDL/Entropy techniques (simplified):
If you're not familiar with MDL, or want to skip some of the more technical details, feel free to scroll past this and the next section.
Traditional entropy tools like standard n-gram analysis treat text as fixed blocks of data. They identify a character/token by how often it appears, or use a cryptographic hash. Basically, they count things.
Our proprietary engine (patent pending) maps data into a 2D bit-array. We identify a token by its geometric centroid (its centre of gravity), not just as a list of characters. This allows us to recognise patterns even if the transcription varies, so long as the geometric shape is stable.
A classic example of this is "csheedy" and "sheedy".
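To illustrate the centre-of-gravity idea, here is my own toy reconstruction (the engine's actual bit-mapping is proprietary and presumably different): render each character as one 8-bit row of its code point, stack the rows into a 2D bit array, and take the normalised centroid of the 1-bits.

```python
def bit_matrix(word):
    """One 8-bit row per character (its code point), giving a
    len(word) x 8 binary matrix."""
    return [[(ord(ch) >> b) & 1 for b in range(7, -1, -1)] for ch in word]

def centroid(word):
    """Normalised centre of gravity of the 1-bits: (row, col), each in [0, 1]."""
    m = bit_matrix(word)
    ones = [(r, c) for r, row in enumerate(m) for c, v in enumerate(row) if v]
    row = sum(r for r, _ in ones) / len(ones) / max(1, len(m) - 1)
    col = sum(c for _, c in ones) / len(ones) / 7
    return row, col

print(centroid("csheedy"))
print(centroid("sheedy"))
```

Even in this crude version, the two words' column centroids come out nearly identical, reflecting the shared "sheedy" core despite the extra leading character.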
Try this for yourself with the Word Shape Visualiser tool: [link]
MDL Comparison Continued:
For those who might have more questions on this part, here are a couple of additional comparisons:
1. Traditional methods use fixed-window chunking, e.g. 8-bit or 16-bit blocks. This risks cutting words in the middle. Our engine derives an "optimal symbol length" from the data itself via Shannon entropy minimisation, making the segmentation data-driven rather than a hard-coded assumption.
2. Traditional compression tools are often lossy or purely mathematical, which makes inspecting or replaying the data in its original form difficult. Our engine is a reversible ledger and returns the exact original bits, so the patterns found genuinely represent the original data, not just mathematical abstractions.
3. Traditional statistical tools will find patterns in random noise, since random data naturally produces frequency spikes. Our engine runs the data it ingests against null controls (Shuffle, Markov and Random samples) before it considers a structure statistically significant. If a pattern appears in the text but at the same frequency in randomised text, it is flagged as noise, not structure.
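As a toy illustration of point 1 (my own sketch, with a made-up perfectly periodic corpus): compute the Shannon entropy of non-overlapping n-character chunks, per character, and let the data itself pick the chunk size.

```python
import math
from collections import Counter

def entropy_rate(text, n):
    """Shannon entropy of non-overlapping n-character chunks, in bits per character."""
    chunks = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(chunks)
    total = sum(counts.values())
    h = -sum((k / total) * math.log2(k / total) for k in counts.values())
    return h / n

text = "shedy" * 40  # toy corpus that repeats with period 5
best = min(range(1, 9), key=lambda n: entropy_rate(text, n))
print(best)  # 5: the data selects the repeating unit's length, nothing is hard-coded
```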
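And for point 3, a minimal sketch of a shuffle null control (again my own illustration on an invented corpus): count how often a candidate pattern appears in the real text versus in character-shuffled copies of it.

```python
import random

def shuffled_count(text, pattern, trials=200, seed=0):
    """Average occurrences of `pattern` across character-shuffled copies of `text`."""
    rng = random.Random(seed)
    chars = list(text)
    total = 0
    for _ in range(trials):
        rng.shuffle(chars)
        total += "".join(chars).count(pattern)
    return total / trials

text = "shedy qokedy shedy chedy shedy " * 10
real = text.count("shedy")
null = shuffled_count(text, "shedy")
print(real, null)  # the real count dwarfs the shuffled baseline: structure, not noise
```

A pattern that survives this comparison occurs far more often than chance alone would produce; one that matches the shuffled baseline is just noise.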
How to mathematically verify what the engine finds is valid
It's important that anyone can verify the method, otherwise you'd have no reason to take anything the engine gives you as "truth". We verify the process through a couple of important, well-known mathematical principles.
I've written a guide you can follow to conduct your own maths experiment, using some EVA text of your own choosing, and verify your own results! Head over to the guide here: [link]
If you can't be bothered going through the guide, here is the bottom line:
We mathematically prove that the text is being compressed, a "trick" that only works if you have found real, repeating patterns.
You cannot compress random data. If every piece of data is unique, you cannot describe it using fewer symbols without losing information. Finding these patterns reveals a hidden "Lego set" used to build the words. Whilst we can only guess at what the bricks mean, the maths proves where they click together: a boundary is only kept when it makes the dictionary smaller, not larger. The maths tests every single letter, so if we cut just one letter to the left or right, the file size would increase, not decrease.
If the engine shows that "sh" always connects to "edy" to maximise compression, that is no longer a statistical coincidence; it is a definitive, physical property of the manuscript's dataset.
This engine is helpful because humans tend to see patterns where they do not exist. We can see a face in the clouds, but maths has no brain and cannot be fooled. In short: if the maths can make the dictionary smaller, the pattern is physically real.
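You can see the "random data doesn't compress" principle for yourself with any off-the-shelf compressor. Nothing below is the engine, just stock zlib on two invented strings of equal length:

```python
import random
import zlib

structured = ("shedy qokedy chedy " * 50).encode()  # repeating "Lego bricks"
rng = random.Random(0)
noise = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ")
                for _ in range(len(structured))).encode()

print(len(zlib.compress(structured)), "/", len(structured))  # shrinks dramatically
print(len(zlib.compress(noise)), "/", len(noise))            # barely shrinks
```

The repeating text collapses to a tiny fraction of its size, while the random string of the same length and alphabet stays stubbornly large. That asymmetry is exactly what the verification guide leans on.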
USE THE ENGINE YOURSELF
Now for the fun part: we have put together several tools to help enthusiasts test their translation theories.
Link to the tool: [link]
Pattern Decoder [link]: Find repeating patterns at character, prefix, word and phrase levels.
1. Paste in an EVA transcription. You can use the Quick Load buttons to use a sample Folio.
2. Change the Lens depending on whether you want to focus on individual characters, structural prefixes, repeating word-cores, etc.
3. Click Analyse and see the patterns found in that EVA text. If there is a particular root or phrase you're interested in, you can click Find Context which takes you to the next available tool:
[link]: Investigate grammatical roles without guessing the meaning.
1. Paste in a pattern you are interested in and again provide the Corpus you wish to scan it against. If you used the Find Context button from the Pattern Decoder, it will have done this and performed the analysis for you automatically.
2. It will show you which words contain your pattern, as well as the words that come before and after it (and their frequency).
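For the curious, the kind of context scan this tool performs can be sketched in a few lines (my own simplified version, run on a made-up corpus):

```python
from collections import Counter

def contexts(corpus, pattern):
    """Words containing `pattern`, plus the frequency of the words
    immediately before and after each hit."""
    words = corpus.split()
    hits, before, after = Counter(), Counter(), Counter()
    for i, w in enumerate(words):
        if pattern in w:
            hits[w] += 1
            if i > 0:
                before[words[i - 1]] += 1
            if i + 1 < len(words):
                after[words[i + 1]] += 1
    return hits, before, after

hits, before, after = contexts("daiin shedy qokedy shedy daiin chedy", "edy")
print(hits)    # which words contain the pattern
print(before)  # what tends to precede them
print(after)   # what tends to follow them
```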
Word Shape Visualiser [link]: Compare the 2D bit structure of EVA words to see how word prefixes and roots differ in their geometric representation.
1. Enter two EVA words. By default the Lens is set to candidate syllables, prefixes and word-cores.
2. Try using "csheedy" and "sheedy" as a test to see how they share a common structural root.
[link]: Compare a highly compressible EVA block with what you think the English translation is, to test your theory.
1. Provide your EVA text and an English word; try "ol" and "the" as an example.
2. Enter in a Corpus (Folio) to compare them against and Run the Audit.
3. The engine will replace the EVA word with the English word against the Corpus you selected. It will then evaluate it according to its grammatical position, entropy impact and morphological family coherence.
4. It will give you a score for you to review. There are various guides on the page to help you interpret the results.
SUMMARY
I would love to get the community's feedback. I'm happy to share more information, provide more detail on any particular area, collaborate, help test theories, build more tools and more. There is also an FAQ section, which you can view here: [link]