The Voynich Ninja

Mathematical Attempt to Remove Impact of Alphabet Differences
I note that Jaskiewicz attempted to soften the errors brought about by the use of differing alphabets in his paper.

Specifically, he altered this equation for letter frequency comparison:

[attachment=4718]

to this (with accompanying text included to explain):

[attachment=4717]

What are people's thoughts about this kind of change? Is it a valid way to overcome the use of varied alphabets? Would an alteration such as this help the work that Darrin is doing?

I understand that letter frequency analysis is likely quite different from the hyper vector work in the other thread, but the underlying problem seems similar to me.   I wondered if the parallel issue might have a parallel solution (and also wondered what other more mathematically minded board members might think of this approach).

I apologize in advance if this is a naive question, but I am working on my ability to evaluate these kinds of publications more critically and appreciate any input others would be willing to share.
I think we can postulate that Jaskiewicz is barking up the wrong tree without the use of advanced calculus. He assumes that the Voynich is a natural language in an invented alphabet; this can be disproved by the two simple facts of LAAFU (or internal consistency of lines) and the strong placement affinity of glyphs within vords, neither of which appears in any known and discussed natural language. 
However, even if we continue with the thought experiment, the study only examines letter (glyph) distribution. The study does not clarify which transcription alphabet was used; nor whether benchmark glyphs were separated out in the transcription alphabet, a common error. This suggests that such matters did not present themselves to the author, meaning a fundamental (and very common) error crept into his study from the very beginning.

Jaskiewicz comes to the conclusion that
Quote:The language from the Voynich manuscript is based on Asian language - it is also possible that it was somehow influenced by European languages.
I would say that such a wide-ranging statement suggests that he did not have much confidence in his own findings.

The approach may very well work if you want to compare natural languages that use different alphabets - for example, Spanish written in Arabic characters. I doubt very much that it will present anything we do not already suppose in the Voynich case, as the supposition that it is a transliterated natural language can logically be discarded.
No idea about the math. It would be nice if someone could explain that.

Curious that Jaskiewicz's top 5 matches are:
– Moldavian
– Karakalpak
– Kabardian Circassian
– Kannada
– Thai

and that dvallis gets a closest match to the Caucasian languages.

Kabardian Circassian is a Northwest Caucasian language closely related to the Adyghe language (Wikipedia).
Although he states that his results would be nonsensical without the altered approach, I wonder what he would get without it. That would at least tell us whether the more complex calculations have an impact.

Also, let’s say that there is a true effect.  Then both he and Darrin got the same general results using two different methods.  Does this increase the confidence of the results?

Or should we conclude that both are measuring coincidental matches, where letter patterns in the Caucasian family happen to pair with similarly unedited input files (perhaps due to a misunderstanding of Voynichese), so that both results should be ascribed to transliteration artifacts?
(31-08-2020, 10:37 PM)RobGea Wrote: No idea about the math. It would be nice if someone could explain that.

Curious that Jaskiewicz's top 5 matches are:
– Moldavian
– Karakalpak
– Kabardian Circassian
– Kannada
– Thai

and that dvallis gets a closest match to the Caucasian languages.

Kabardian Circassian is a Northwest Caucasian language closely related to the Adyghe language (Wikipedia).

I'm not sure that those matches are anything more than purely formal. E.g. Moldavian (do they mean Romanian?) is a synthetic inflectional language, Kabardian is polysynthetic, and Karakalpak is synthetic agglutinative. So what are these matches worth if they are not even linguistically homogeneous?

I guess for many Western readers these languages sound exotic, kinda "wow, a rare language, we must give it a chance", but for Russia they are not at all exotic, e.g. some 1.5m people speak Kabardian, and of those half a million live in Russia. So if I bring a person who speaks Kabardian, would he be able to read Voynichese, or what? What's the practical outcome?
(01-09-2020, 07:57 PM)Anton Wrote: So if I bring a person who speaks Kabardian, would he be able to read Voynichese, or what? What's the practical outcome?

Although this isn't really what I was hoping to accomplish with this thread (which was just to get some opinions about the validity of the published method for comparing differing alphabets), I am happy to respond with my thoughts.

1) IF we conclude that these "matches" from Voynichese to the Caucasian family are more than mere pattern coincidence (despite having "hits" on languages that are quite diverse), I would look at what form one or more of these languages had at the time of the carbon dating. That conversation has already been started elsewhere. Certainly, one or more of these languages must have had a written tradition at the time?

2) Once a decent amount of the language as it was at the time is available, an advocate of the "natural language" approach to the VM could then attempt to match the sounds of that language to the glyphs in the manuscript. Of course, that brings with it the downsides of this approach, the greatest of which, in my view, is accounting for the lack of regular multi-word repetition (strings of whole words, repeated) that would be expected in a corpus the size of the MS if it were a natural language. I am, of course, happy to hear how I could be misinterpreting that data, as I am not a linguist.

OR

3) It could be hypothesized that a Caucasian language was the basis for a cipher in which the strong internal word patterns reflect the underlying language (e.g. it is a 1:1 substitution, which, based on research into cipher history, was the most common understanding of what a "cipher" was at the time). This is supposed to be the big payback for doing this matching work, right? The lack of whole-word pattern repetition could then be explained by some sort of in-page transposition of the output of the 1:1 substitution. I understand that such "in-page transposition" would be plausible at the time of the carbon dating (based on later discussions of its earlier existence), but unfortunately actual examples have either not survived or are yet to be discovered.

This approach does have the advantage of invoking the "cipher process" to accommodate other unusual features of the manuscript, such as the unusual line-as-a-unit and paragraph-as-a-unit statistics.

Needless to say, issues remain with explaining how a manuscript with a Caucasian language basis ended up in Europe and came to include Central European imagery (e.g. the Zodiac-like images). I suppose the easiest explanation is that one or more speakers of that language moved to Europe, where the European imagery would have been more readily available for inclusion at the manuscript's creation.

So those are my thoughts on possible "practical" outcomes.

But what I'd really like to know is whether the mathematical approach of Jaskiewicz has any validity.
Hi Michelle,
I hoped to find the time to make a few experiments about this, but I didn't. I am not an expert and I may have misunderstood something. Anyway, here are some thoughts:

1. As pointed out by Anton, the best fit found by Jaskiewicz is not a Caucasian language but a Romance one (Moldavian / Romanian).

2. I also agree with Anton about the fact that the huge differences between the five best matches show that the method is unstable. This is understandable, because the method is extremely simple (see next point).

3. Jaskiewicz has addressed one of the problems that were ignored by Darrin: we don't know how to match Voynichese characters with characters in other alphabets. His solution is quite simple: sort characters by decreasing frequency, so that you compare the most frequent character in language A with the most frequent character in language B, the second most frequent with the second most frequent, and so on. For instance, you would compare the frequency of EVA:o with Greek:ν, English:e, Latin:i (these can vary depending on the specific reference text). He then computes the overall difference between character histograms.

4. It is not clear how he treats diacritics (for Darrin 'à' and 'a' are the same). Here, no choice is optimal. One should experiment with various choices for all languages. In principle, one should also use different texts for each language.

5. As David pointed out, Jaskiewicz uses a single Voynich transliteration (v101). For a study like this, one should include several different transliterations.

6.  Jaskiewicz wrote that in computational linguistics "a distribution of pairs and triples of letters may be used to automatically recognize a language of an unknown document". Indeed Darrin's trigram-based analysis could be more robust than Jaskiewicz's approach. If Darrin incorporated the kind of character ordering proposed by Jaskiewicz, his method would probably be improved (something along these lines was suggested by Jonas in the other thread). Anyway, points 4 and 5 should also be addressed, at least by experimenting with different options. It is also true that, as others said, there are strong factors that suggest that the VMS is not a direct phonetic encoding / simple substitution, so these exercises are not very likely to find the underlying language (if there is any and if it isn't an otherwise undocumented artificial language, as Friedman concluded).
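
The rank-ordering idea in point 3 can be sketched in a few lines of Python. This is only an illustration, not the formula from the paper (the equations are not reproduced in this thread): the function names are made up, whitespace is simply dropped, and the sum of absolute differences is an assumed choice of distance.

```python
from collections import Counter

def ranked_frequencies(text):
    """Relative symbol frequencies of `text`, sorted from most to least frequent.

    Whitespace is ignored; every other character counts as one symbol
    (an assumption, since the paper's preprocessing is not known here).
    """
    symbols = [ch for ch in text if not ch.isspace()]
    counts = Counter(symbols)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)

def rank_distance(text_a, text_b):
    """Compare two texts rank by rank, without matching symbols by identity.

    The k-th most frequent symbol of one text is compared with the k-th most
    frequent symbol of the other. The shorter list is padded with zeros, and
    the distance is the sum of absolute differences (an assumed metric).
    """
    fa = ranked_frequencies(text_a)
    fb = ranked_frequencies(text_b)
    size = max(len(fa), len(fb))
    fa += [0.0] * (size - len(fa))
    fb += [0.0] * (size - len(fb))
    return sum(abs(x - y) for x, y in zip(fa, fb))
```

Under these assumptions, a text and a simple-substitution encoding of it score 0, because only the ranked frequencies matter and not which symbol carries them. The same property means that unrelated languages with similarly shaped histograms can also score close together.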
Thanks so much for all of your thoughts. BTW, Marco, I didn't expect experiments, just thoughts! :)

In my view, this whole thing can be summarized as follows:

1) In general, these matching attempts suffer from major issues that make it difficult, if not impossible, to interpret the validity of the results (even for the people doing the experiments!). 

2) This is seen, both within the Jaskiewicz paper and in a comparison of Darrin's work with Jaskiewicz's, in the lack of any underlying similarity among the matched languages (at least as similarity is normally judged between known languages). It is difficult to imagine that the matches have true "meaning" given how widely the matched languages vary, particularly for the Jaskiewicz study.

I guess this observation could be expanded to Darrin's results -- how traditionally "similar" are the identified Caucasian languages to each other?  I will try to find that out for the board. 

3) Some of these issues might be addressed through the use of multiple transliteration alphabets (as Darrin has done) or by ordering the letters of each alphabet by frequency prior to comparison (as Jaskiewicz has done). But neither is a perfect solution, and it is unlikely (though, admittedly, unknown until it is tried) that applying both solutions to both approaches would bring more concordance. (No, Marco, this isn't a hint for you to do the work; Darrin and Jaskiewicz would be the most appropriate people, and it's okay in my view for this to be left where it is, especially given (5) below.)

4) Part of the reason for the conclusion in 3) is that multiple issues with the input files and/or methods remain unsolved, such as the handling of diacritics and other aspects of language that are not taken into account and could be expected to alter the results. These details generally relate to the use and comparison of multiple alphabets with the transliteration alphabet(s) of the VM. Basically, the methods used thus far for comparison may be too simple to provide a meaningful language match.

5) But even if all the issues were solved, it remains most likely that no meaningful match would be found, because other data (obtained using methods that provide interpretable results) strongly suggests the VM is not a 1:1 substitution or other phonetically based substitution. I'm going to be gathering and figuring out what that data is; any suggested directions are much appreciated!

Thanks for all your input and putting up with my learning process on this.
The main drawback of Jaskiewicz's paper (which I just ventured to look through) seems to me not mathematical, but methodological. He says (p. 251):

Quote:The research involving linguistic studies often compare an unknown language to the well-known language by its structure and syntax.

But then he proceeds to do quite the opposite, namely to compare languages to Voynichese by frequency distribution, seeking meaningful linguistic conclusions from that kind of comparison, and noting, apropos, that

Quote:sometimes completely different languages are similar in measure

(p 257)

That he uses modern texts instead of 15th-century texts is a flaw of lesser importance.
Thanks for the additional insight. Like I said, I'm trying to be more critical about these kinds of publications, so any points I should be aware of are much appreciated.