The Voynich Ninja

Hey ninjas,
I wrote a paper on the VMS. I compared it with 80+ other languages and tried to make its characters readable in an alphabet we could understand.
Long story short, it failed.
But I'm pretty sure the paper could still interest some of you: I used a large amount of data, and even though the translation failed, I think the languages/corpora I found to be closest to the VMS are of great interest. They could be used to support or guide your theories.

The paper is here: [link hidden]

I'd welcome any feedback.
Also, if you find it of interest and think it would be worth publishing on arXiv, I need to find an endorser for cs.CL.

Have a nice day!

Thanks ElieD! I look forward to reading the paper.

Thanks for sharing this with us.

What is the advantage of t-SNE over other dimensional reduction techniques such as PCA?

One obvious way of identifying an unknown language is phonotactics. Alphabetic scripts are to a large extent phonetic, so analysing adjacent letters, or commonly occurring groups of letters, would be a promising approach. (Perhaps your N-gram approach partially addressed this.) Also, did you try to recognize parts of speech, either through their position relative to nearby words or through their inflection patterns? Finally, it would be good to exclude languages which are impossible given the manuscript's known provenance, such as Guarani, and very improbable, such as Yoruba or Japanese.

I used a phonotactic approach on the assumption it's plaintext (or a simple substitution cipher, which for an unknown script is really the same thing), and identified an unlikely though historically plausible language candidate (Abkhaz). I think it's highly unlikely that it is Abkhaz, or any other language, and didn't (indeed, can't, given Abkhaz's obscurity) follow it up to prove it either way. My analysis and reason for considering Abkhaz are here: [link hidden]

There are a few minor mistakes in the paper. Scottish Gaelic, which I can read, is a modern language (it has its own television channel, BBC Alba) and didn't even exist in the Middle Ages: it's a direct descendant of Middle Irish. It, and its close relative Irish, are unusual in two important respects: their grammar, which is superficially closer to Semitic languages than to other non-Celtic European languages, and their spelling system, so not surprisingly it's a linguistic outlier. Also, Tagalog is spoken in the Philippines, quite a long way from Australia.

Quote:What is the advantage of t-SNE over other dimensional reduction techniques such as PCA?

t-SNE tends to produce more visible clusters, whereas PCA aims to represent the distances between all elements as accurately as possible in low dimensions. The idea of t-SNE is to keep points that are close in the original space close in the low-dimensional space. It's non-linear, so as long as it keeps close points close, it can create absurdly large or small distances between clusters that aren't close (it's regularized to avoid being too absurd). If my data fall into N clusters and I don't care whether the distances between those clusters are consistent, I'll use t-SNE. More here: [link hidden]
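To make the "close points stay close, far distances don't matter much" point concrete, here is a minimal sketch of the first step of t-SNE: turning distances from one point into neighbour probabilities with a Gaussian kernel. The function name `neighbor_probabilities` and the fixed `sigma` are assumptions for illustration; real t-SNE tunes the bandwidth per point from the perplexity setting.

```python
import math

def neighbor_probabilities(distances, sigma=1.0):
    """Gaussian neighbour probabilities p(j|i) for one point i.

    `distances` holds the distances from point i to every other point.
    Assumption: a single fixed sigma, instead of the per-point bandwidth
    that t-SNE derives from the perplexity.
    """
    weights = [math.exp(-d * d / (2 * sigma * sigma)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# One close neighbour (distance 1) and two far points (distances 10 and 100).
probs = neighbor_probabilities([1.0, 10.0, 100.0])
for d, p in zip([1.0, 10.0, 100.0], probs):
    print(f"distance {d:6.1f} -> probability {p:.3e}")
```

Both far points end up with a probability that is practically zero, so the cost function hardly distinguishes "10 away" from "100 away". That is why the gaps between clusters in a t-SNE plot should not be read as real distances.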

Quote:Also, did you try to recognize parts of speech, either through their position relative to nearby words or through their inflection patterns?

No. I could try, but a universal/European POS tagger on tokens with only character-level information would probably not give great results. Even if it worked, I'm not sure how I would use the fact that one word is a verb and another a noun. It would be interesting but not necessarily usable for translation. Usually words that appear only once are proper nouns. It could be interesting for further research.

Quote:Finally, it would be good to exclude languages which are impossible given the manuscript's known provenance, such as Guarani, and very improbable, such as Yoruba or Japanese.

Actually, I "know" that Voynichese is not historically related to many of the languages I used, but what's interesting is to understand why it appears close to some of them, like Yoruba or Guarani. That's why I used all the languages I had available. I'm not saying that Voynichese is in fact Yoruba; I'm saying that the way Yoruba's writing system was created has similarities with the way Voynichese's writing system was created. Moreover, this approach lets me sanity-check my metric, because the proximities between the other languages seem consistent.
There are a lot of papers claiming that the VMS is in fact [insert your theory here] because [insert one statistic and apply it to 5 languages here], but to be more robust they need to use many more languages and many more statistics.

Quote:There are a few minor mistakes in the paper. Scottish Gaelic, which I can read, is a modern language (it has its own television channel, BBC Alba) and didn't even exist in the Middle Ages: it's a direct descendant of Middle Irish. It, and its close relative Irish, are unusual in two important respects: their grammar, which is superficially closer to Semitic languages than other non-Celtic European languages, and their spelling system, so not surprisingly it's a linguistic outlier. Also, Tagalog is spoken in the Philippines, quite a long way from Australia.

Thanks, I'll try to correct these errors. For Scottish Gaelic, I'm not sure how it has developed since the Middle Ages (probably the difference is similar to that between Old French and French?). I'll remove this statement. The ScotGaelic corpus seems to be composed of oral narratives ([link hidden]), so instead I'll probably put it in the same category as "FrenchSpoken" and "NorwegianNynorskLIA". The closest corpora to Irish are Welsh and Breton, so the Celtic languages are close to each other... when they're not spoken. Maybe the main common feature of the corpora close to the VMS is that they are transcripts of spoken language.

Quote:Perhaps your N-gram approach partially addressed this

For the (N, M-gram) statistic, take N=10 and M=2 as an example: I list all bigrams (pairs of adjacent letters) and compute what share of all bigram occurrences is accounted for by the 10 most frequent bigrams. I compute this for every N in Ns = [1,2,3,5,7,10,15,50,100] and every M in Ms = [1,2,3,4].
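For anyone who wants to try the statistic on their own transcription, here is a rough sketch of how it could be computed. It is not the code from the paper: the function names `top_ngram_share` and `feature_vector` are made up for illustration, and counting M-grams within words only (never across spaces) is an assumption.

```python
from collections import Counter

NS = [1, 2, 3, 5, 7, 10, 15, 50, 100]
MS = [1, 2, 3, 4]

def top_ngram_share(text, n_top, m):
    """Share of all m-gram occurrences covered by the n_top most frequent m-grams."""
    # Assumption: m-grams are taken inside each whitespace-separated word.
    counts = Counter(
        word[i:i + m]
        for word in text.split()
        for i in range(len(word) - m + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # text has no m-grams of this length
    top = sum(count for _, count in counts.most_common(n_top))
    return top / total

def feature_vector(text):
    """One value per (N, M) pair, in a fixed order, for comparing corpora."""
    return [top_ngram_share(text, n, m) for m in MS for n in NS]

# Toy input, not real data: bigrams are "ab" x3 and "ba" x1.
sample = "abab ab"
print(top_ngram_share(sample, 1, 2))  # share of the single most frequent bigram
print(len(feature_vector(sample)))    # 9 values of N times 4 values of M
```

Each corpus then becomes a vector of 36 numbers, which is the kind of input a distance metric or a t-SNE plot can work with.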

elieD

davidjackson

DonaldFisk

elieD