The Voynich Ninja

Full Version: Marke Fincher's Word Pair Permutation Analysis
I came across this paper (attached below), which seems to address a concern that is becoming more and more of an issue for my ongoing testing of possible approaches.

I know that this was of interest to a group including Koen and nablator back in 2019, and Koen was going to e-mail him (see this thread).

Koen -- did you ever hear back from him?

I know that you and nablator did some separate investigations that looked for repeated word strings and reached some conclusions, but that question is a little different from what Fincher's work supposedly shows -- at least as far as I could tell from this paper.

Of course, Fincher's paper maybe suffers from a lack of breadth of data -- like all work of this type, the more languages tested, the more convincing it would be.

Any other thoughts on this paper and its approach?  I continue to hone my ability to critically evaluate such work . . . 

Also, I have to ask:

Any thoughts on Fincher's conclusion that word transposition has to be involved to get the VM into a "natural language" form?
I didn't mail him since nablator made a shiny new code that kept us busy for a while Big Grin

I think I considered the case closed when we found out that the claim that real language must use a certain amount of recurring phrases did not correspond to reality. It is in fact quite common for texts to consist mostly of unique phrases.

I don't recall what was different about Fincher's paper - if there is no response from others tomorrow I will read it again.
Hi Michelle,

This article is interesting, but it is nowhere near a proof of transposition, because 1) it assumes a one-to-one mapping between vords and the words of a natural language, and 2) it is possible to find texts (mostly poetry, in my experience, but more tests are needed to really know) with few words that repeatedly link to other nearby words at the same distance, because the author was careful to use varied formulas for building sentences.
(27-09-2020, 11:10 PM)Koen G Wrote: I don't recall what was different about Fincher's paper - if there is no response from others tomorrow I will read it again.

Well, the major difference was that rather than looking for strings of matching words, Fincher looks at pairs of words separated by a distance D that is increased stepwise.  In other words, the words in between the two members of a "pair" do not need to match.

Then he looks at (1) the number of unique pairs found, plotted against D (both absolute and as a proportion of the maximum amount seen); (2) the number of unique pairs that exist in both the initially paired order AND in reverse order (again, both absolute and as a proportion of the maximum); (3) the average squared frequency plotted against D; and (4) the average squared difference plotted against D.

He gives "and the" as an example of a pair that is quite frequent at D=1 in English but quite rare in reverse order (little to no "the and").
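To make the pair counting concrete, here is a minimal toy sketch of measurements (1) and (2) -- my own code with a made-up example sentence, not Fincher's actual program:

```python
from collections import Counter

def pair_counts(words, d):
    """Count ordered pairs (words[i], words[i+d]) at distance d."""
    return Counter(zip(words, words[d:]))

def wppa_counts(words, d):
    """Two of Fincher's measurements at distance d: the number of
    unique ordered pairs, and how many of those unique pairs also
    occur somewhere in reversed order at the same distance."""
    pairs = pair_counts(words, d)
    unique = len(pairs)
    also_reversed = sum(1 for (a, b) in pairs if (b, a) in pairs)
    return unique, also_reversed

text = "the cat sat on the mat and the dog sat on the rug".split()
print(wppa_counts(text, 1))
```

The full analysis would just repeat this for D = 1, 2, 3, ... and plot the counts (and their proportion of the maximum) against D.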

Each of the plots shows a particular pattern that is fairly consistent across the natural languages (he provides 7 and states he has tested many other languages with similar results . . . I'm a little uncomfortable with that statement, given recent events).

But, in any case -- the VM does look significantly different.  Bonus points for him for only testing language B with no labels.

In particular, he provides only the "proportion of max amount" graphs for both (1) and (2), and the VM does look much different in both cases.  For (1, forward pair), there is a much lower proportion of maximum at D1 than for any of the other languages, and it peaks out (starts looking "randomly associated") much, much more quickly than any of the other languages -- basically at D2!  The natural languages have slow ramps toward flat lines that only really flatten out around D9-D10 or so -- the VM is missing that ramp completely.

For (2, reverse pair), the VM looks even stranger, again with a much lower proportion of maximum at D1 and then bizarre (but admittedly small) ups and downs, with a weird peak of no relationship at word pair distance 12 (LAAU effects?).  Again, the ramp up to "random" shown by natural languages between D1 and D6 or D7 for reverse pairs is completely missing.

Not surprisingly, the Average Squared Differences graph also shows how different the VM looks (as this plotting just amplifies the differences), with the VM showing the lowest values on the chart (it reminded me a bit of second order entropy -- but not quite so drastically lower).  Fincher states this measurement best shows "order preference".  It is interesting that Latin and the VM are closest in the ASD value they "settle on", but the really huge swoop down from D1 to around D9 or D10 that Latin (and all the other natural languages) show is of course again missing for the VM.
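The thread doesn't spell out Fincher's exact ASD formula, so the following is only my guess at what such an "order preference" score might look like: for each unique ordered pair at distance D, square the difference between its forward and reverse counts, then average. Under that hypothetical reading, a text where pairs strongly prefer one order (lots of "and the", no "the and") scores high, and randomized word order scores near zero:

```python
from collections import Counter

def avg_squared_difference(words, d):
    """Hypothetical 'order preference' score at distance d: the mean of
    (forward count - reverse count)^2 over the unique ordered pairs.
    NOTE: this is a guess at Fincher's measurement, not his formula."""
    pairs = Counter(zip(words, words[d:]))
    diffs = [(n - pairs.get((b, a), 0)) ** 2 for (a, b), n in pairs.items()]
    return sum(diffs) / len(diffs)

# A rigidly ordered toy text: every pair occurs in only one direction.
ordered = "a b c a b c a b c a b c".split()
print(avg_squared_difference(ordered, 1))
```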

He concludes that, on his measurements, the VM differs in scale, scope, and character from natural languages, and I find myself agreeing (although trying to think critically -- what is missing?).  This is the basis for the conclusion that some sort of transposition must occur to get it to "natural language."  He does go on to show that anachronistic polyalphabetic substitution ciphers like the Vigenère would only impact the scale, not the character, of the results -- so they are not the cause of the differences.

But, interestingly -- the results, in particular the ASD measurement, are NOT a perfectly flat line, which would be expected if whatever transposition process was involved had completely randomized the word order -- so it seems there is some order still there.  In this way it again reminds me of the second order entropy issue -- what transpositions would cause the VM to look like this?
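As a quick sanity check of that intuition (my own toy experiment, not from the paper): in a text with rigid local structure, only a small fraction of the D=1 pairs are unique, and shuffling the word order pushes that fraction up toward the random ceiling -- which is the general direction in which the VM's curves differ from the natural languages:

```python
import random

def unique_pair_fraction(words, d):
    """Fraction of ordered pairs at distance d that are unique
    (low = strong repetition/structure, high = random-looking)."""
    pairs = list(zip(words, words[d:]))
    return len(set(pairs)) / len(pairs)

# Toy "language" with rigid article-noun-verb clause structure.
rng = random.Random(0)
nouns = ["cat", "dog", "bird", "fish", "horse"]
verbs = ["runs", "sleeps", "eats", "jumps"]
words = []
for _ in range(2000):
    words += ["the", rng.choice(nouns), rng.choice(verbs)]

# Destroying the word order (a total transposition) raises the fraction.
shuffled = words[:]
rng.shuffle(shuffled)
print(unique_pair_fraction(words, 1), unique_pair_fraction(shuffled, 1))
```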

I appreciate you looking at the paper again and look forward to your or any other board members' thoughts.
(27-09-2020, 11:32 PM)nablator Wrote: This article is interesting but it is nowhere near a proof of transposition. Because 1) it assumes a one-to-one mapping between vords and the words of a natural language . . . 

So if it was measuring pairs of syllables instead of words, would that change the results?

My gut reaction would be that you'd still expect to see association (maybe even more than words) because of typical "structure" of prefix-midfix-suffix pieces of words in a good chunk of natural languages (at least the ones I know -- obviously there are many I do not know!).  Although the reverse measurement would be really screwed up, as I would hypothesize it would be very, very rare to see the reverses.

And there is no doubt that the VM shows less association, not more -- even in the forward direction.

In any case, thanks so much for the thoughts!
(28-09-2020, 12:27 AM)MichelleL11 Wrote: So if it was measuring pairs of syllables instead of words, would that change the results?

Certainly. I'm not interested in one-to-one mappings anymore, but if you (or anyone) want to try, I can share my Java code for the "WPPA" calculations.
(28-09-2020, 10:08 AM)nablator Wrote: Certainly. I'm not interested in one-to-one mappings anymore, but if you (or anyone) want to try, I can share my Java code for the "WPPA" calculations.

Thank you for this thread, I had earlier started to do some similar analyses. It's good that I now won't have to repeat any work Smile

I'm thinking that it would also be interesting if someone could run this WPPA analysis on Timm's generated text 
(by the algorithm in the Cryptologia article "A possible generating algorithm of the Voynich manuscript").
(28-09-2020, 05:29 PM)Alin_J Wrote: Thank you for this thread, I had earlier started to do some similar analyses. It's good that I now won't have to repeat any work Smile

I'm thinking that it would also be interesting if someone could run this WPPA analysis on Timm's generated text 
(by the algorithm in the Cryptologia article "A possible generating algorithm of the Voynich manuscript").

Hi, Alin_J:

No problem!  Glad it was helpful.  Unfortunately I don't have Java skills or currently a setup for it.  I have done some JavaScript & HTML, but only at a hobby level.  I just got AntConc to work so I can do some basic stuff, but I have to rely on others' numbers to get a full fix!

I agree that running Torsten’s generated text would be interesting - could help pinpoint where the differences in WPPA results might be coming from.  Come back and post what you get if you tackle it!  Also would love to hear your thoughts on Fincher’s work/approach to this analysis. Thanks.
(29-09-2020, 01:22 AM)MichelleL11 Wrote: I agree that running Torsten's generated text would be interesting - could help pinpoint where the differences in WPPA results might be coming from.  Come back and post what you get if you tackle it!

I will start tackling it if I get more time in the future.