I understand this topic hasn't received any attention in nearly 10 years, which likely indicates the author has moved on from this line of thinking. However, I am assuming it's still somewhat relevant, as it's included in [link] pinned to the top of Analysis of the text.
After reading this thread I noticed two unanswered question themes: first, how accurate is this on a large scale; and second, which bi-grams are the same, and did the author of the VMS intend for them to be reversed?
Process
Using ThomasCoon's unit chart, TC's explanations, and posted examples of the Units in action, I created a list of all Units and their permutations in EVA. I removed spaces, which TC believes do not carry meaning, and experimented with different orders of operations for breaking the text up line by line.
It wasn't possible for me to directly match TC's process. Part of the difference comes from working in EVA rather than with the VMS itself; a transliteration is finite, whereas working directly with the VMS allows a certain liberty. I attempted to account for this by being overly cautious with my unit permutations.
The other reason, and I could be wrong, is that this seems to have originally been done with pen and paper, and if one way didn't work by the end, TC would go back and try another. This isn't a criticism, but it is difficult to replicate programmatically, particularly where the same VMS characters are broken up in different ways within the same line. That is to say, while my results were not a one-for-one match for TC's, I do believe I got close.
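To make the steps above concrete, here is a minimal sketch of one way the removal could be done programmatically. It is not TC's procedure, nor necessarily the program used here: the example units, the reading of "permutations" as every reordering of a unit's characters, the longest-first removal order, and the function names are all assumptions made for illustration.

```python
from itertools import permutations


def unit_variants(unit):
    """Return every distinct reordering of a unit's characters.

    Assumption: 'permutations' means character reorderings (e.g. ol / lo).
    """
    return {"".join(p) for p in permutations(unit)}


def strip_units(line, units):
    """Drop spaces, then delete each unit variant, longest variants first."""
    text = line.replace(" ", "")
    variants = {v for unit in units for v in unit_variants(unit)}
    # Order of operations matters: a different order can leave different residue.
    for variant in sorted(variants, key=lambda v: (-len(v), v)):
        text = text.replace(variant, "")
    return text


# Hypothetical units, chosen only to show the mechanics.
leftover = strip_units("qokedy dal", ["qo", "ke", "dy"])
print(leftover)
```

Because each variant is removed wherever it occurs, the same characters are always split the same way, which is exactly where a fixed program departs from a by-hand approach that can backtrack.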
Results
Q1. How accurate is this on a large scale?
I used an Archimedes in the bathtub way of measuring success; I measured the original number of EVA characters in the VMS, ran the program replicating TC's method, then measured what was left.
Original characters (transliteration): 191545
Remaining characters: 8956
95.32% accuracy
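For anyone checking the arithmetic, the percentage is simply the share of characters that were removed. A minimal sketch, where the function name is mine and "accuracy" is taken to mean one minus the leftover fraction:

```python
def accuracy_percent(original_chars, remaining_chars):
    """Share of the original characters that were removed, as a percentage."""
    return 100 * (1 - remaining_chars / original_chars)


# Figures reported above for the Units run.
print(round(accuracy_percent(191545, 8956), 2))  # 95.32
```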
For comparison, I selected the 27 highest-frequency bi-grams and used them in place of TC's Units, with no permutations; the only order of operations was to remove them in order of frequency:
ch, he, dy, ai, ok, in, ol, qo, ee, ed, ii, sh, da, ho, ey, ke, ot, yq, eo, ar, yo, al, ka, or, od, yc, hy.
Original characters (transliteration): 191545
Remaining characters: 52711
72.48% accuracy
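The control above could be sketched roughly as follows. This is an illustration under assumptions, not the actual program: it counts overlapping bi-grams within each space-stripped line, and the function names and toy input are hypothetical.

```python
from collections import Counter


def top_bigrams(lines, n=27):
    """Return the n most frequent bi-grams, most frequent first."""
    counts = Counter()
    for line in lines:
        text = line.replace(" ", "")
        # Assumption: overlapping bi-grams, counted within a line only.
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return [bigram for bigram, _ in counts.most_common(n)]


def remaining_after_removal(lines, bigrams):
    """Remove the bi-grams in the given order and count what is left."""
    left = 0
    for line in lines:
        text = line.replace(" ", "")
        for bigram in bigrams:
            text = text.replace(bigram, "")
        left += len(text)
    return left


# Toy input, not real transliteration data.
toy = ["ch ch ch", "ol ol", "dy"]
print(top_bigrams(toy, 1), remaining_after_removal(toy, ["ch"]))
```

Removing strictly in frequency order means an early removal can create or destroy later matches, so a different ordering would likely change the remaining count.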
Q2. Are bi-grams interchangeable?
Using the methodology from my [link] post, I calculated the relative frequency of each Unit and its permutations across all topics, normalizing the results. I am working off the theory that content relates to the illustrations, so a Unit's relative frequency alongside a given kind of illustration shows its relatedness to that broad topic. So if a Unit's permutations have similar relative topic frequencies, that might provide evidence the VMS author intended for them to be reversed or that the characters are interchangeable.
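As a rough illustration of that kind of comparison, the sketch below computes a normalized topic profile for a single unit. The methodology referred to above is not reproduced here, so the details are assumptions: the rate is occurrences per character of topic text, the rates are then scaled to sum to 1 across topics, and the topic names and sample strings are made up.

```python
def topic_profile(unit, topic_texts):
    """Normalized relative frequency of a unit across topics.

    topic_texts maps a topic name to its space-stripped text.
    """
    rates = {}
    for topic, text in topic_texts.items():
        # Occurrences per character, so long sections do not dominate.
        rates[topic] = text.count(unit) / len(text) if text else 0.0
    total = sum(rates.values())
    return {t: (r / total if total else 0.0) for t, r in rates.items()}


# Made-up sample text, only to show the shape of the output.
sample = {"herbal": "olol", "bio": "olxx"}
print(topic_profile("ol", sample))
print(topic_profile("lo", sample))
```

Two permutations with similar profiles would lean toward the same topics; very different profiles would count against treating them as interchangeable.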
Here is a snippet of the results; the full results are here:
ThomasCoon_units_results.xlsx
Discussion
I'm going to add my personal thoughts here. Feel free to skip this as I assume, given the length of time that has passed, most everyone has already made up their mind about this one way or another, so the information above is likely only relevant to newer folks like myself coming across this for the first time.
Assuming content is related to illustrations, I'm not seeing much evidence that TC's Units, or even bi-grams, can be flipped. There are very few examples where permutations have the same major leans, let alone minor leans. Even within this thread the author seemed to be walking away from some of this interchangeability. If the illustrations are not related to the content, or perhaps only exist to tell the reader which way to break up the text, then my results for interchangeability are likely next to worthless. However, I might still point to the raw counts themselves: I would expect that if they could be interchanged freely, we would see a more even split, but there are often vast differences between bi-grams and their reverses.
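The raw-count argument can be checked with something as small as the sketch below. The function name and sample string are hypothetical, and the 50/50 expectation for freely interchangeable pairs is an assumption of the argument rather than an established fact.

```python
def reverse_split(text, bigram):
    """Count a bi-gram and its reverse, plus the forward share of the pair."""
    forward = text.count(bigram)
    backward = text.count(bigram[::-1])
    total = forward + backward
    # Note: for doubled letters such as "ee" the reverse is the same string,
    # so the share is trivially 0.5 and tells us nothing.
    share = forward / total if total else None
    return forward, backward, share


# Made-up sample text with spaces already removed.
print(reverse_split("cholcholdaiinlo", "ol"))
```

A share near 0.5 would fit free interchange; a share near 0 or 1 is the kind of vast difference described above.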
As for TC's Units themselves, the results are very good, and this may suggest there is indeed a way to break the VMS down into smaller non-character units, though TC's original Units are likely not that way, despite the impressive accuracy. TC would very likely have trimmed my cautious list of permutations down quite a bit, but even after that, differentiating c, h, and z alone would balloon the number of units. Also, my not being able to figure out a systematic approach that splits characters the same way every time bothered me, but this could very well just be a failing on my part to understand it.
While my control/comparison test yielded significantly lower results, this close-to-zero-effort approach still yielded what I believe to be non-negligible results. A better order of operations (potentially based on the first or last letter, potentially removing it when the line has an odd number of characters) and a handpicked list of bi-grams may wind up with an accuracy rivaling TC's with far fewer units.
Overall, while I am not sure it's the right way to look at the VMS, I do think ThomasCoon's method is interesting and impressive even if ultimately I lean away from accepting it. The fact he worked this all out by hand makes it even more so. I believe that this, or a similar method, could be leveraged to improve transliteration, as it tends to highlight places where odd bi-grams stand out and deserve a second look.