Edit-Distance computational attack idea

Edit-Distance computational attack idea - Psillycyber - 23-07-2016

Hi all,

So, for quite a while I've wondered why Torsten Timm's auto-copying hypothesis hasn't made a bigger splash in the Voynich community.

Perhaps it would give the community more confidence in the auto-copying hypothesis if it could be statistically (not just subjectively) shown that the average edit distance between words in the VMS is anomalously low compared to other texts.

In theory, it should be possible for someone with more computer programming experience than myself to calculate, using certain pre-determined rules about how to calculate edit-distances:
1. The edit-distance between any two successive words. Then, display both a color-scale-coded, zoomed-out map of the VMS according to successive word edit distances (where red=1, orange=2, etc), along with the overall average successive word edit-distance in the VMS, and compare these with standard texts from other languages.
2. The average edit distance between any particular VMS "word" and all other words in the VMS, weighted according to the frequency of those other words (so, a word that is used twice as often will have its edit-distance count for twice as much as that of a different word only used half as often).
3. The average of these per-word averages, taken over the whole vocabulary. Compare with other standard texts, etc.
4. The average edit-distance between all two-word combinations (not just successively) on any particular page, weighted by the word frequencies. Compare with other standard texts, etc.
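A minimal sketch of items 1 to 3 above, assuming plain Levenshtein distance (insertions, deletions and substitutions each cost 1) as the "pre-determined rule" and assuming the text is already split into words. The function names and the five-word sample are hypothetical, chosen only for illustration; a real run would use a full transcription and comparison texts.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def mean_successive_distance(words):
    """Item 1: average distance between each word and the next one."""
    pairs = list(zip(words, words[1:]))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


def weighted_mean_distance(word, counts):
    """Item 2: average distance from `word` to every other word type,
    weighted by how often that other word occurs."""
    others = [(w, c) for w, c in counts.items() if w != word]
    total = sum(c for _, c in others)
    return sum(edit_distance(word, w) * c for w, c in others) / total


def overall_mean_distance(counts):
    """Item 3: plain average of the item-2 value over all word types."""
    return sum(weighted_mean_distance(w, counts) for w in counts) / len(counts)


# Hypothetical toy sample, not taken from any particular folio.
sample = "chedy shedy chedy qokedy ol".split()
counts = Counter(sample)
print(mean_successive_distance(sample))
print(weighted_mean_distance("chedy", counts))
print(overall_mean_distance(counts))
```

Raw distances grow with word length, so a comparison across languages would probably also need a normalised variant (for example, dividing by the length of the longer word). That choice is left open here.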

I predict that, if someone were to do these sorts of computational attacks on the VMS and other standard texts, one would find stark contrasts. The VMS has no word pairs as dissimilar as "throughout" and "swimming," to pick two random English words. In the VMS, if you had the word "throughout," you'd find plenty of words like "thrughut" or "trough" or "rouge" or "hotrug," but no word like "swimming" that is based on an entirely different word stem (aside from the two other stems one finds in the VMS if one starts from "chedy": yes, there are words like "ol" and "aiin," but no words like "nemrie" that are unassociated with any of those stems).

Natural languages don't work like this. This computational attack would settle, once and for all, that the VMS must either be a constructed language, a cipher, or gibberish.

RE: Edit-Distance computational attack idea - Sam G - 23-07-2016

(23-07-2016, 06:08 PM)Psillycyber Wrote: Hi all,

So, for quite a while I've wondered why Torsten Timm's auto-copying hypothesis hasn't made a bigger splash in the Voynich community.

His theory didn't really explain anything. He would basically just take two lines of similar text and assert that one was copied from the other. Of course, the similarity could be for any other reason (such as similar vocabulary or grammar in a meaningful text).

Quote:the average edit distance between words in the VMS is anomalously low compared to other texts... Natural languages don't work like this.

That English or other European languages don't have this property does not prove that no natural language does. Also, well-developed theories regarding what properties natural languages may or may not have basically do not exist.

Quote:This computational attack would settle, once and for all, that the VMS must either be a constructed language,

If you concede that a valid constructed language could have this property, then you would also need to show that such a language could not have arisen naturally.

Quote:a cipher, or gibberish.

That the VMS is not encrypted is basically proven by the low second-order entropy of the text, since virtually all ciphers increase entropy. The main exception, verbose ciphering, is ruled out by the lack of repeated strings (no long words and no repeated sequences of short words).
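For reference, a minimal sketch of how such a figure can be computed. It assumes "second-order entropy" means the conditional entropy of a character given the character before it, calculated as H(bigrams) minus H(single characters). The function names are hypothetical, and the result depends heavily on which transcription alphabet is used.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(text):
    """H(next char | current char) = H(bigrams) - H(first chars of bigrams)."""
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])  # every character that starts a bigram
    return entropy(bigrams) - entropy(firsts)


# A perfectly predictable string has zero conditional entropy;
# a string where "b" can be followed by either "c" or "d" does not.
print(conditional_entropy("abababab"))
print(conditional_entropy("abcabcabd"))
```

Comparing this value for a VMS transcription against the same value for plaintexts and for enciphered versions of those plaintexts would be one way to check the claim, under the assumptions above.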

RE: Edit-Distance computational attack idea - Anton - 23-07-2016

Quote:So, for quite a while I've wondered why Torsten Timm's auto-copying hypothesis hasn't made a bigger splash in the Voynich community.

I think that is mainly because Torsten's hypothesis implies that the whole stuff is meaningless, while there are many tiny obstacles to the meaninglessness of the text. To name my favourite one:

otol and odaiin are the two most frequent "Voynich stars" (labeled objects in f68r1 and f68r2), and they are both mentioned in [link] (supposed to serve for some introduction or summary).

Quote:That the VMS is not encrypted is basically proven by the low second-order entropy of the text, since virtually all ciphers increase entropy. The main exception, verbose ciphering, is ruled out by the lack of repeated strings (no long words and no repeated sequences of short words).

Is the claim that all ciphers increase entropy mathematically proven? I think it is not, so there are no foundations for stating it. One may state that "all ciphers known to this person increase entropy", but that would not disprove the hypothesis that the VMS is enciphered, since it may well be enciphered with an unknown cipher (indeed, if it were enciphered with a known cipher, it would probably have been deciphered long ago).

So it is by no means "proven" that VMS is not a cipher.

For the verbose ciphering, is it ruled out? What if it is supplemented by shuffling?

RE: Edit-Distance computational attack idea - Sam G - 23-07-2016

(23-07-2016, 08:08 PM)Anton Wrote:
Quote:That the VMS is not encrypted is basically proven by the low second-order entropy of the text, since virtually all ciphers increase entropy. The main exception, verbose ciphering, is ruled out by the lack of repeated strings (no long words and no repeated sequences of short words).

Is the claim that all ciphers increase entropy mathematically proven? I think it is not, so there are no foundations for stating it. One may state that "all ciphers known to this person increase entropy", but that would not disprove the hypothesis that the VMS is enciphered, since it may well be enciphered with an unknown cipher (indeed, if it were enciphered with a known cipher, it would probably have been deciphered long ago).

So it is by no means "proven" that VMS is not a cipher.

Whether it can be mathematically proven or not I'm not sure, but there is certainly no such cipher known to modern cryptography. So you're left arguing that someone in the early 15th century invented some entropy-lowering cipher that was never subsequently rediscovered (not even with the VMS to serve as an example) and indeed that appears theoretically impossible today.

Quote:For the verbose ciphering, is it ruled out? What if it is supplemented by shuffling?

Simple verbose ciphering is surely ruled out, assuming a plaintext in a European language. Shuffling will increase the entropy at least somewhat, and at the very least there is no known procedure that does this. To a large extent verbose ciphering is like steganography, which is basically impossible to disprove for any text, though there are strong arguments against it in the case of the VMS.

RE: Edit-Distance computational attack idea - Anton - 23-07-2016

Quote:Whether it can be mathematically proven or not I'm not sure, but there is certainly no such cipher known to modern cryptography. So you're left arguing that someone in the early 15th century invented some entropy-lowering cipher that was never subsequently rediscovered (not even with the VMS to serve as an example) and indeed that appears theoretically impossible today.

Really no more impossible than the assumption that someone in the early 15th century wrote in a language in which no writings have been discovered afterwards.

RE: Edit-Distance computational attack idea - Sam G - 23-07-2016

(23-07-2016, 09:09 PM)Anton Wrote: Really no more impossible than the assumption that someone in the early 15th century wrote in a language in which no writings have been discovered afterwards.

The thing is that the number of possible languages is basically infinite. So if a language dies out, there's no reason to think that anyone is going to subsequently reinvent the exact same language. On the other hand, the number of enciphering mechanisms is inherently limited, so they can be, and have been, independently invented multiple times (even RSA encryption was independently invented twice, if I'm not mistaken). So proposing that the VMS was created using some encryption method that has never been subsequently rediscovered is a bit problematic. Cryptography was not well developed in the early 15th century, meaning that this putative cipher mechanism must have been invented before many other types of ciphers that are considered basic by today's standards. And since we have the VMS as an example of what to do, and many cryptographers have studied it, rediscovering the method should not even be as difficult as reinventing it completely from scratch.

RE: Edit-Distance computational attack idea - Anton - 23-07-2016

Quote:The thing is that the number of possible languages is basically infinite.

If we abstract away from human history it is indeed unlimited, but within the framework of the history of mankind it is limited by the number of peoples.

Quote:On the other hand, the number of enciphering mechanisms is inherently limited

I don't see why it is inherently limited. Enciphering is producing output by applying certain operators, rules or procedures to the input. Since a procedure can be a combination of procedures, and one can imagine an infinity of combinations, the number of enciphering mechanisms is inherently unlimited. It is much the same as the number of mathematical functions being unlimited.

RE: Edit-Distance computational attack idea - Sam G - 23-07-2016

(23-07-2016, 10:05 PM)Anton Wrote:
Quote:On the other hand, the number of enciphering mechanisms is inherently limited

I don't see why it is inherently limited. Enciphering is producing output by applying certain operators, rules or procedures to the input. Since a procedure can be a combination of procedures, and one can imagine an infinity of combinations, the number of enciphering mechanisms is inherently unlimited. It is much the same as the number of mathematical functions being unlimited.

Well it's not just one cipher but the entire class of ciphers (this proposed entropy-lowering kind) that I'm talking about, and general classes of ciphers are certainly more limited. For instance, I mentioned RSA cryptography in my previous post, and it was invented twice (the first instance having been kept classified by British intelligence):

[link]

Yet somehow, the alleged cipher mechanism found in the VMS was only invented once, in the early 15th century, and nobody has ever been able to come up with anything like it since then. I'd say that's substantially less likely than that it's simply a plaintext in some otherwise unknown language, given that there have surely been countless languages that once existed but died out without leaving any known record.

RE: Edit-Distance computational attack idea - Anton - 23-07-2016

Well, there were things in human history well in advance of their time (as we perceive that time), the Antikythera mechanism being perhaps the most often quoted.

To come up with the same thing one needs to work in the same direction, and with the progress of cryptography it's just possible that no one worked in that direction any more. This only suggests that the mechanism is (probably) inherently trivial; what makes it complicated is some weird combination of simple procedures.

Anyway, my own approach is to be "Voynich-theory-independent" - that is, to utilize as far as possible methodologies invariant to whether the VMS is cipher- or natural language-based, and not to make a choice between these two hypotheses earlier than one really needs to. After all, a natural language may also (although highly formally) be considered a kind of cipher.

I just stand against calling anything "proven" when it is not actually proven. In [link] there are plenty of arguments against the natural language hypothesis, for example. It is far from rock solid.

RE: Edit-Distance computational attack idea - Psillycyber - 23-07-2016

(23-07-2016, 07:23 PM)Sam G Wrote: His theory didn't really explain anything. He would basically just take two lines of similar text and assert that one was copied from the other. Of course, the similarity could be for any other reason (such as similar vocabulary or grammar in a meaningful text).

Could a similar vocabulary or grammar in meaningful text really generate the same level of similarity? To some it seems plausible, to others it doesn't. Like I said, this question really needs to be attacked rigorously and computationally in order to settle it. If all known natural languages have much larger average edit distances than the VMS, then I would find it implausible to hold out hope that some other unknown natural language might suddenly fit the bill of having the much lower edit distance to match the VMS.

(23-07-2016, 07:23 PM)Sam G Wrote: That English or other European languages don't have this property does not prove that no natural language does. Also, well-developed theories regarding what properties natural languages may or may not have basically do not exist.

One could test as many natural languages as one's heart desired, and I suspect (don't know yet, but suspect) that the same would apply to almost any plausible candidate language. From what I have read, Hawaiian and some other Polynesian languages happen to be more repetitive, so their average edit distances might be shorter, but I don't think that the rest of the evidence about the VMS matches up with it being from that part of the world.

(23-07-2016, 07:23 PM)Sam G Wrote: If you concede that a valid constructed language could have this property, then you would also need to show that such a language could not have arisen naturally.

I actually don't know whether a constructed language could have this property either. That remains to be shown. I don't rule out that someone could construct something that matched the edit distances of the VMS.

But I know that most natural languages don't have this property. A natural language with very low edit distances between words would be incredibly prone to mis-hearing and mis-interpretation when spoken. Words in languages tend to disambiguate themselves from each other (or rather, the speakers of those words tend to accentuate differences between words over time and explore the entire phonological space available to the phonemes of that language). Yes, there are homophones, but those are the exception. What you don't find among natural languages is an entire vocabulary in which every word is a slight variation of one of three roots. It would be as if all English words were close relatives of the words "sling," "apple," and "tuck," and we only found words in English like "slorng" and "agtel" and "tulk" but nothing like "question" or "language."

But that's how the VMS "language" is. Torsten Timm's paper makes a good initial argument that the VMS does not explore the entire space of word constructions that one could construct with the Voynich characters. Voynich words tend to be very similar to a couple of "roots" such as "chedy," "ol," and "aiin." How exactly similar remains to be rigorously calculated and compared with words from other languages. Perhaps the difference is not all that significant. That's why we really need to do the computational attack detailed above.
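One way to make "how exactly similar" concrete is to measure what share of word tokens lies within a few edits of the nearest root. This is a minimal sketch, assuming plain Levenshtein distance and taking the three roots named above as given. The threshold, the function names and the five-word sample are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def nearest_root_distance(word, roots):
    """Distance from `word` to whichever root is closest."""
    return min(edit_distance(word, r) for r in roots)


def share_near_roots(words, roots, max_dist=2):
    """Fraction of word tokens within `max_dist` edits of some root."""
    near = sum(1 for w in words if nearest_root_distance(w, roots) <= max_dist)
    return near / len(words)


ROOTS = ["chedy", "ol", "aiin"]  # the three roots named in the post
# Hypothetical sample for illustration only.
sample = ["chedy", "shedy", "otol", "odaiin", "qokedy"]
print([nearest_root_distance(w, ROOTS) for w in sample])
print(share_near_roots(sample, ROOTS))
```

A caveat on the design: with a root as short as "ol", any two- or three-letter word in any language scores as "near", so the same measurement on comparison texts (with roots chosen by the same rule) is needed before the number means anything.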

(23-07-2016, 07:23 PM)Sam G Wrote: That the VMS is not encrypted is basically proven by the low second-order entropy of the text, since virtually all ciphers increase entropy. The main exception, verbose ciphering, is ruled out by the lack of repeated strings (no long words and no repeated sequences of short words).

Would a verbose cipher really require long repeated strings? What if there were a significant degree of freedom in how to encode the same plaintext in multiple ways? One could have significant freedom to inject meaningless superficial variation while keeping the underlying cipher intact. That's exactly what I explored recently with my "Voynich puzzle" [link].
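To illustrate the general idea only (this is not the scheme from the linked puzzle, which is not reproduced here), a toy sketch of a verbose cipher with free choices: each plaintext letter has several interchangeable tokens, and meaningless null tokens can be dropped in at will. The table, the null list and the function names are all hypothetical and chosen arbitrarily.

```python
import random

# Hypothetical table: several interchangeable tokens per plaintext letter.
TABLE = {
    "a": ["ol", "or"],
    "b": ["chedy", "shedy"],
    "c": ["aiin", "daiin"],
}
NULLS = ["qo", "dy"]  # hypothetical tokens that carry no meaning

# Reverse lookup used for decoding; null tokens are simply absent from it.
DECODE = {tok: letter for letter, toks in TABLE.items() for tok in toks}


def encode(plain, rng, null_rate=0.3):
    """Encode letter by letter, picking a token variant at random and
    occasionally appending a null token."""
    out = []
    for letter in plain:
        out.append(rng.choice(TABLE[letter]))
        if rng.random() < null_rate:
            out.append(rng.choice(NULLS))
    return " ".join(out)


def decode(cipher):
    """Drop null tokens and map each remaining token back to its letter."""
    return "".join(DECODE[tok] for tok in cipher.split() if tok in DECODE)


# Two encodings of the same plaintext will usually differ on the surface,
# yet both decode to the same string.
first = encode("abcabc", random.Random(1))
second = encode("abcabc", random.Random(2))
print(first)
print(second)
print(decode(first), decode(second))
```

In a scheme like this the repeated plaintext "abc" need not produce a repeated ciphertext string, which is the point being made about verbose ciphers and repeats.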

(23-07-2016, 08:08 PM)Anton Wrote: I think that is mainly because Torsten's hypothesis implies that the whole stuff is meaningless, while there are many tiny obstacles to the meaninglessness of the text. To name my favourite one:

otol and odaiin are the two most frequent "Voynich stars" (labeled objects in f68r1 and f68r2), and they are both mentioned in [link] (supposed to serve for some introduction or summary).

Once again, in [link] I explore the possibility that a largely auto-copied text might not necessarily be meaningless. If there are multiple ways of encoding the same information, and if the underlying scheme is something like words encoding single letters, each letter represented by a number, then that would allow an encoder to "cut corners" by using convenient repeats copied from above that happen to convey the same underlying ciphered letter. Yet one could also easily avoid repeating long strings by adding the occasional meaningless null character or stroke here and there.