The Voynich Ninja

Full Version: How to decipher the MS?
Hello all,

I would like to know the latest ways on how to decipher the MS. I mean, what are the key points, in what order, and what should not be left out or forgotten...

René Zandbergen says

"Some time in the past, somebody or some group of people sat down and generated the text of the Voynich MS using some method.

This may seem trivial but, in reality, it is fundamental. There may or may not be a decoding method (step 1 above), but there certainly was an encoding or rather text generation method.

Anybody who wants to present a Voynich MS solution should present the method how the text that we see in the MS was generated. The main advantages of this approach are:

this method certainly exists and was really used by someone in the fifteenth century;
this approach works both in the case that the text is meaningful, and in case it is meaningless."


Now, for example, magnesium's Naibbe cipher has been published: a cipher that matches the MS's entropy and statistics very well. So... how should we continue? Is this a start? Should we look for other ciphers?

According to the most expert people here... What should we do? Which direction should we take? Which steps?

Thank you!
If you already know the mechanics of the cipher (for example, the Naibbe cipher) and have some hypothesis about the plaintext language, then you can perform some form of statistical analysis (for example, simulated annealing) using the known bigram and trigram statistics of the plaintext language. This is not particularly complicated, and setting it up won't take much time; the latest versions of ChatGPT/Claude/DeepSeek/Gemini should be able to create the code, and you can run it with some test encoding scheme to ensure it works. Finding good sources for the statistics of the plaintext languages can be problematic, but is in principle solvable.
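For concreteness, here is a minimal sketch of the annealing setup described above, applied to the simplest possible case: a monoalphabetic substitution scored with bigram log-probabilities. The training corpus, iteration count, and cooling schedule are all placeholder choices, and a real Voynich pipeline would swap in the candidate cipher's actual decoding step in place of the simple letter translation.

```python
import math
import random
from collections import Counter

ALPHA = "abcdefghijklmnopqrstuvwxyz"

def bigram_logprobs(corpus, floor=0.01):
    """Smoothed bigram log-probabilities estimated from a training corpus."""
    text = [c for c in corpus.lower() if c in ALPHA]
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bg: math.log((counts.get(bg, 0) + floor) / (total + floor * 676))
            for bg in ((a, b) for a in ALPHA for b in ALPHA)}

def score(text, lp):
    """Sum of bigram log-probabilities: higher means more language-like."""
    letters = [c for c in text if c in ALPHA]
    return sum(lp[bg] for bg in zip(letters, letters[1:]))

def decrypt(ciphertext, key):
    """Map the i-th key letter back to the i-th plaintext letter."""
    return ciphertext.translate(str.maketrans(key, ALPHA))

def anneal(ciphertext, lp, iters=20000, temp=1.0, cooling=0.9995, seed=0):
    """Simulated annealing over substitution keys, swapping two letters
    per step and accepting worse keys with a temperature-dependent chance."""
    rng = random.Random(seed)
    key = list(ALPHA)
    rng.shuffle(key)
    best = cur = score(decrypt(ciphertext, "".join(key)), lp)
    best_key = key[:]
    for _ in range(iters):
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]
        cand = score(decrypt(ciphertext, "".join(key)), lp)
        if cand > cur or rng.random() < math.exp((cand - cur) / temp):
            cur = cand
            if cur > best:
                best, best_key = cur, key[:]
        else:
            key[i], key[j] = key[j], key[i]  # revert the swap
        temp *= cooling
    return "".join(best_key), best
```

On short texts the annealer may need restarts with different seeds; for a verbose or homophonic scheme the `decrypt` step and the scoring unit (glyph groups rather than single letters) would have to change accordingly.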

The hard part is identifying the details of the encoding and the plaintext language. Since there are no strong arguments that point exclusively towards any particular plaintext language, it makes sense to create a multilingual pipeline that could cover many medieval languages at the same time.

I tried to collect a list of plausible plaintext languages for this task in the following thread: [link]
(15-08-2025, 09:40 PM)quimqu Wrote: the latest ways on how to decipher the MS

Are you sure you believe that the manuscript can be deciphered? All attempts to do so have failed, including many by professionals in the fields of cryptography and linguistics using deep computational algorithms. Almost every language has been put forward as a candidate for being the text in the invented manuscript alphabet. Attempts to use AI seem only to have produced laughable results. No luck so far, and probably no luck in the foreseeable future.
Here are some necessary, but certainly insufficient, steps involved in solving the MS.

1) Be aware of all assumptions you are making, including any unconscious ones. The last part is, by its very nature, not easy. 
Example: if you assume it is a cipher, you must be aware that you probably won't find the solution if it is not a cipher. If you assume there is a plaintext in Latin, you may not find the solution if there is an Arabic plaintext.
It is not forbidden to make assumptions. One just has to be aware of their consequences.

2) Be aware of the various statistical analyses that have been made in the past, and what they already tell us. 

3) Make sure you have another hobby that brings occasional moments of achievement. ;-)

That's really it. The advice from my web site that was quoted in the opening post does not lead to any specific steps to be taken, but one should always keep it in mind when working on it.
How to decipher the VM? Simple answer: don't. It's been done hundreds of times, nobody cares.

Design a system that mimics the properties of Voynichese, is simple and straightforward to use as a cipher or generator, not as cumbersome (and unlikely) as a code book. Make it make sense. I don't mean it ironically: it has to make sense somehow, even if it is not what we think it is (a ciphered text), because the authors/scribes composed the VM with a purpose in mind and used an unknown method to achieve it, a method that made sense to them.
(16-08-2025, 11:06 AM)dashstofsk Wrote: Are you sure you believe that the manuscript can be deciphered? All attempts to do so have failed, including many by professionals in the fields of cryptography and linguistics using deep computational algorithms. Almost every language has been put forward as a candidate for being the text in the invented manuscript alphabet.

There are many codes -- even simple ones that a nerd could have used in the 1400s -- that cannot be cracked by any "deep computational algorithms".  And no cryptographic code is as hard to crack as a language you don't know.

No, it is not true that "almost every language" has been seriously tried.  Not even every "European" language.  Énd iú mâst nót ónli gués d lánguaj bât ôuso d spélin sístem - uitx fór sâm lánguajis mêi bí véri dífren fróm d stândar uân.  (That is deliberately phonetic English: "And you must not only guess the language but also the spelling system - which for some languages may be very different from the standard one.")  Here is the same Mandarin sentence in four widely used romanizations:

Pinyin with diacritics:   Zhè shì yī běn shénmì de shū
Pinyin with numeric suffix tones: Zhe4 shi4 yi1 ben3 shen2 mi4 de shu1
Wade-Giles: Chih^4 shih^4 i^1 pen^3 shen^2-mi^4 te^5 shu^1
Gwoyeu Romatzyh: Jeyh shyr i been shenmih de shu

And then there are the changes that the candidate language must have undergone in the last 600 years.

And then you must account for errors - by the modern transcribers, by possible Retouchers, by the Scribe, and by the Author himself.  If the error rate is 5%, it is already a problem for some statistical analyses and cracking methods. What if 30% of the words were inconsistently misspelled?

All the best, --jorge
I should think that if 30% of words were inconsistently misspelled, this would raise entropy in comparison to a perfect example of the target language, which seems to me to be at odds with what we observe in the script?
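That intuition is straightforward to test numerically. Below is a minimal sketch (my own illustration, not code from the thread) of the two standard measures involved; one can run them on a clean sample of a target language and on a deliberately corrupted copy and compare the numbers.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(text):
    """H(c_i | c_{i-1}) in bits: average uncertainty about the next
    character given the previous one (low for Voynichese)."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    n = len(text) - 1
    h = 0.0
    for (a, b), c in pair_counts.items():
        h -= (c / n) * math.log2(c / prev_counts[a])
    return h
```

On a few hundred characters of ordinary text, injecting random letter confusions should generally push both measures up, which is exactly the tension with the manuscript's unusually low conditional entropy that the post points to.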


To the original question: if you think something has promise, try applying meaning to the steps and see where it goes. 
Likely it goes nowhere, but "journey before destination" (a Way of Kings reference) - sometimes what you learn along the way is most important.
(16-08-2025, 12:42 PM)ReneZ Wrote: Here are some necessary, but certainly insufficient, steps involved in solving the MS.

1) Be aware of all assumptions you are making, including any unconscious ones. The last part is, by its very nature, not easy. 
Example: if you assume it is a cipher, you must be aware that you probably won't find the solution if it is not a cipher. If you assume there is a plaintext in Latin, you may not find the solution if there is an Arabic plaintext.
It is not forbidden to make assumptions. One just has to be aware of their consequences.

2) Be aware of the various statistical analyses that have been made in the past, and what they already tell us. 

3) Make sure you have another hobby that brings occasional moments of achievement. ;-)

That's really it. The advice from my web site that was quoted in the opening post does not lead to any specific steps to be taken, but one should always keep it in mind when working on it.

There's a lot of wisdom in René's advice. 

My general perspective here is that it would be more probable for us to see the one and only VMS with the statistical properties it has if the underlying text generation method often and reliably created VMS-like text. So as I see it, the question is, which specific kinds of text generation methods have high probabilities of generating text with VMS-like properties? Exploring these parameter spaces will yield reference models for how the text could have been generated. None will be perfect—but a model doesn't have to be perfect to be analytically useful. What are a good model's specific modes of failure when it comes to replicating the VMS, and what does that failure tell us? What are the general predictions that a good model makes regarding the nature of the VMS, and what happens when those predictions are tested?

Some non-exhaustive ideas:

1. Gibberish hypothesis

Timm and Schinner (2020)'s self-citation algorithm elegantly shows that the iterative generation of meaningless text can go a long way toward replicating the VMS's statistical properties, but it would be interesting to probe the limits of what this kind of algorithm can do. For instance, I'd love to see someone systematically sweep through different combinations of threshold parameters and initializing lines to study the broad behavior of this class of algorithm. Under what specific conditions does this general kind of algorithm deliver especially VMS-like text? 
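A parameter sweep of this kind is cheap to prototype. The generator below is my own deliberately stripped-down caricature of the self-citation idea (Timm and Schinner's actual algorithm operates on whole lines with similarity thresholds, not single tokens), but it shows the shape of the experiment: fix a seed text, vary the mutation probability and the copy window, and measure the statistics of what comes out.

```python
import random

def self_cite(seed_words, n_words, p_modify=0.3, window=12, seed=0):
    """Toy self-citation generator: each new token copies a recently
    produced token and, with probability p_modify, mutates one character.
    A caricature of Timm & Schinner (2020), for sweeping parameters."""
    rng = random.Random(seed)
    alphabet = sorted({c for w in seed_words for c in w})
    out = list(seed_words)
    for _ in range(n_words):
        src = rng.choice(out[-window:])      # copy from the recent past
        w = list(src)
        if rng.random() < p_modify and w:    # occasionally mutate a glyph
            w[rng.randrange(len(w))] = rng.choice(alphabet)
        out.append("".join(w))
    return out[len(seed_words):]
```

Sweeping `p_modify` and `window` over a grid, and scoring each output with entropy and repetition measures, is one concrete way to map where in parameter space the output looks most VMS-like.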

2. Ciphertext hypothesis

The Naibbe cipher has the structure it has specifically because I was looking for probabilistically favorable ways to structure a homophonic substitution cipher that could generate VMS-like text when encrypting a European language. I came away from that process—especially my work with my random cipher-generation tool Voynichesque—convinced that the VMS's observed word grammar, character entropy, conditional character entropy, token length distribution, and word type length distribution jointly place major constraints on plausible cipher structures. For one, a substitution cipher must be verbose to have a prayer of exhibiting VMS-like behavior while encrypting most any European language.
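To illustrate what "verbose" means here, the sketch below shows a generic verbose homophonic substitution: each plaintext letter expands into one of several multi-glyph groups, so the ciphertext grows longer than the plaintext and the information carried per glyph drops. The homophone groups are invented placeholders, not the published Naibbe tables.

```python
import random

# Invented homophone groups for a few letters -- placeholders for
# illustration only, NOT the actual Naibbe cipher tables.
HOMOPHONES = {
    "a": ["ok", "qo"],
    "e": ["che", "ol"],
    "t": ["dy", "kain"],
    "s": ["or", "aiin"],
}

def encrypt(plaintext, table=HOMOPHONES, seed=0):
    """Verbose homophonic substitution: each plaintext letter becomes a
    randomly chosen multi-glyph group; letters without a table entry
    are silently dropped in this toy version."""
    rng = random.Random(seed)
    return "".join(rng.choice(table[c]) for c in plaintext if c in table)
```

Because every letter maps to at least two ciphertext characters, the output is guaranteed to be longer than the input, which is the basic mechanism that lets a substitution cipher push conditional character entropy down toward Voynichese levels.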

But are there other kinds of historically plausible ciphers that can evade these constraints? For example, in 2009 Nick Pelling suggested that a clever multi-step transposition cipher could explain some features of the VMS: You are not allowed to view links. Register or Login to view. One could imagine screening many randomly generated variations of Pelling's proposed cipher for VMS-like properties as they all encrypt a conserved plaintext. Under what specific conditions does a Pelling-style ciphertext do a better or worse job of mimicking the VMS?

Similarly, Matlach, Janečková, and Dostál (2022) propose a steganographic cipher in which one plaintext letter is mapped to a sequence of three numbers (111, 123, 221, etc.), several Voynichese glyphs each stand for one number, and some apparent glyphs are really ligatures that stand for sequences of two or three numbers. When screening across a wide range of plaintext languages, which number-sequence assignments, glyph assignments, and plaintext languages combine to most reliably yield VMS-like text? What are the other properties of those ciphertexts, and to what extent are they consistent with the VMS?
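A screening harness for this family of ciphers could start from something like the sketch below. The digit codes, glyph assignments, and ligatures here are invented placeholders for illustration; the actual tables in Matlach et al. (2022) differ.

```python
import random

# Invented assignments for illustration -- the paper's actual tables differ.
LETTER_TO_DIGITS = {"a": "111", "b": "123", "c": "221", "d": "312"}
DIGIT_TO_GLYPH = {"1": "o", "2": "k", "3": "d"}
LIGATURES = {"11": "ch", "23": "sh"}  # one glyph covering two digits

def encode(plaintext, p_ligature=0.5, seed=0):
    """Map each letter to a 3-digit code, then render the digit stream as
    glyphs, occasionally collapsing a digit pair into a ligature glyph."""
    rng = random.Random(seed)
    digits = "".join(LETTER_TO_DIGITS[c] for c in plaintext
                     if c in LETTER_TO_DIGITS)
    out, i = [], 0
    while i < len(digits):
        pair = digits[i:i + 2]
        if pair in LIGATURES and rng.random() < p_ligature:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(DIGIT_TO_GLYPH[digits[i]])
            i += 1
    return "".join(out)
```

Randomizing the three tables over many trials per plaintext language, and scoring each resulting ciphertext for VMS-likeness, would be the screening step the paragraph above asks about.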

3. Chinese language hypothesis

Jorge Stolfi and others have proposed that the unusual properties of Voynichese stem from a one-off attempt to develop a phonetic writing system for an East Asian language. Can we define a representative list of possible phoneticization schemes and then randomly iterate phoneme-glyph mappings for a variety of East Asian languages, to see under what conditions we get more VMS-like text?
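One way to set up that iteration: represent each syllable as an (initial, final, tone) triple and randomly assign a short glyph string to every phonetic unit, then score the transcribed output against the VMS. A minimal sketch follows; the unit inventories and glyph alphabet are placeholders, not a claim about any real romanization.

```python
import random

def random_mapping(units, glyphs, seed=0):
    """Assign each phonetic unit (initial, final, or tone) a random glyph
    string of length 1-2: one candidate phoneticization scheme."""
    rng = random.Random(seed)
    return {u: "".join(rng.choices(glyphs, k=rng.randint(1, 2)))
            for u in units}

def transcribe(syllables, initial_map, final_map, tone_map):
    """Render (initial, final, tone) syllables as candidate words,
    with the tone marker fixed in word-final position here."""
    return [initial_map[i] + final_map[f] + tone_map[t]
            for i, f, t in syllables]
```

Re-seeding `random_mapping` many times, and also varying where the tone marker lands (initial, final, or split), would generate the space of schemes to screen for VMS-like statistics.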

The point here is, we need to treat the VMS's statistical properties as constraints and then figure out which kinds of text generation methods often evade those constraints.
(17-08-2025, 02:26 AM)magnesium Wrote: Jorge Stolfi and others have proposed that the unusual properties of Voynichese stem from a one-off attempt to develop a phonetic writing system for an East Asian language. Can we define a representative list of possible phoneticization schemes and then randomly iterate phoneme-glyph mappings for a variety of East Asian languages, to see under what conditions we get more VMS-like text?

It sounds easy, but...

The candidate East Asian languages -- those where basic words are single syllables -- would include Tibetan, Thai, Burmese, Vietnamese, Lao, Khmer, Hmong, Mandarin, Cantonese, and a couple dozen other languages, mostly in China.  At least half a dozen of these have documented cases of European travelers or merchants, like Marco Polo and Willem van Ruysbroeck, spending years in their domains before 1400; and there must have been hundreds of others whom we don't know of, and hundreds of travelers from Arabia, Turkey, Persia, etc.  Even if we restrict the list to languages that are likely to have been learned by a European, we still have at least half a dozen candidates.

All those languages are more different from each other than Swedish is from Spanish.  Each has its own syllable structure, with its own set of tones.  The tone of a syllable is a pattern of variation in pitch (or other sound quality) as the syllable is spoken; the same consonants and vowels said with different tones are completely different words.  Mandarin has only four tones (or five, depending on how one counts); Cantonese and Vietnamese have six; some of those languages may have eight or more. The tone of a syllable is not a property of a particular vowel but of the syllable as a whole; thus different phonetic renderings may place the tone marker in different places, like hỏi or hoi4 or 4hoi, or use two or more symbols to indicate pitch, e.g. 3h1oi3 to mean a mid-then-low-then-mid pitch pattern.

And then there are complex rules that change the "normal" tone of a syllable depending on the adjacent syllables.  For example:  
  • In Taiwanese, within a tone sandhi domain, all non-checked syllables save for the last undergo tone sandhi. Among the unchecked syllables, tone 1 becomes 7, 7 becomes 3, 3 becomes 2, and 2 becomes 1. Tone 5 becomes 7 or 3, depending on dialect. Stopped syllables ending in ⟨-p⟩, ⟨-t⟩, or ⟨-k⟩ take the opposite tone (phonetically, a high tone becomes low, and a low tone becomes high), whereas syllables ending in a glottal stop (written as ⟨-h⟩ in Pe̍h-ōe-jī) drop their final consonant to become tone 2 or 3.
A phonetic script can stabilize the pronunciation of a language for many centuries. That is the case with Italian, for instance: most native speakers can still understand Dante's 13th-century poems. But several of those East Asian languages used to be written with Chinese characters, which are not phonetic and therefore allow the pronunciation to change radically, both geographically and over time.  And we do know that those languages have changed a lot in the past 600 years.  There are old Chinese poetry manuals that give examples of syllables that were supposed to rhyme -- but they don't rhyme anymore.

And now suppose that you are a European merchant who has lived in Myanmar or Shanghai for a few years and you have learned the local language well enough to order food and haggle about the exchange rate of muskets vs rubies, but not much beyond that.  And, before going back home, you remember the promise you made to your physician uncle to bring him some medical and herbal books from the place.  Just copying them would be pointless, since neither you nor your uncle can read the native script.  For the same reason you cannot translate the books into Latin or your vernacular.  Thus the best you can do is devise a phonetic script for the local language and pay a local to read the books aloud while you take dictation.  You only understand some of the text, and have never heard all those medical terms and plant names; but you hope that, back home, you can use what you know of the language to figure out the rest. 

Then how many mistakes would you make (like encoding only 3 tones instead of 5, or conflating -ng with -n) when devising the script, and while taking dictation? (I lived in the US for 13 years, and still cannot hear the difference between the vowels in "man" and "men", unless they are spoken next to each other...)

And now suppose that the local you hired was not as literate as he pretended to be, so that whenever he got to a Chinese character he did not know, rather than say so he would make up a reading at random...

Considering all those complications, I don't see how I could implement your program. I don't know any of those languages, so I would not know how to choose a phonetic encoding that the Author might have used, nor which statistics could tell whether the guess is right...

So you see why I lost enthusiasm 20 years ago.  I still think that one can make some progress with that theory, or extract some useful properties from the text, even without knowing the language and the encoding.  But as for really "cracking the code", I believe it would have to be someone who happens to know the correct language as it was spoken 600 years ago, and who can be motivated to spend a couple of years deciphering the phonetic encoding...

All the best, --jorge
Just to give some idea of what this might look like, I passed a piece of the previous post through a script that mimics a scribe who, with 50% probability, confuses the following pairs of letters: a/e, m/n, p/b, t/s, l/r, and that also removes 25% of the spaces.

emd now suppotethes you ele an eurobaen malchent whohet rivad in myemmal or thangheifolefaw yaels andyou heveraarnad tha locarremguagewerr enough so ordal food emd heggla epoutshaaxchamga lesa of nuskets vslupiet, put notnuch beyomd shat. amd, pefola going pack hona, you remenpeltha blonite younede toyoul bhysiciam umcla so blimg hinsona nadicel emdharpel bookt from sha bleca. just cobying them would ba poinsrast tince neithal you nor your uncle cem laad thenasive tcribs. fol sha tana leatom you cemnot tlamsresa shebookt imso rasimor youlvelnecurel.  thussha bets youcamdo it devite e phometic tcribs fol tha rocal remguegaand bay a rocel toreadshe bookteloudwhila you seka dictation.  youomry undalttamd tomeof the sext, end mavel haeldarl thotenadical talmsemd plans nenas; butyou hopa thas, pack homa, you cem uta whet you kmow of the renguaga sofigure out the rats.

It's possible to match it if you know the plaintext, but I'd say it's extremely hard to read. Decoding even a simple substitution (that is, a custom phonetic script) on top of this doesn't look very straightforward.

Edit: I know that confusing letters and confusing sounds would produce a somewhat different effect, but I still think the above is a useful approximation.
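For readers who want to reproduce the experiment, here is a sketch of the scribal-noise model described above (my own reconstruction; the original script was not posted, so the exact handling of case and randomness here is an assumption):

```python
import random

# Confusable letter pairs from the post: a/e, m/n, p/b, t/s, l/r.
PAIRS = [("a", "e"), ("m", "n"), ("p", "b"), ("t", "s"), ("l", "r")]

def scramble(text, p_confuse=0.5, p_dropspace=0.25, seed=0):
    """Careless-scribe model: swap members of each confusable pair with
    probability p_confuse and drop spaces with probability p_dropspace.
    Confused letters are emitted in lowercase."""
    swap = {}
    for a, b in PAIRS:
        swap[a], swap[b] = b, a
    rng = random.Random(seed)
    out = []
    for c in text:
        if c == " ":
            if rng.random() < p_dropspace:
                continue  # scribe ran the words together
            out.append(c)
        elif c.lower() in swap and rng.random() < p_confuse:
            out.append(swap[c.lower()])
        else:
            out.append(c)
    return "".join(out)
```

Feeding a clean passage through `scramble` and then measuring its character statistics (as discussed earlier in the thread) is a quick way to quantify how much noise of this kind raises entropy relative to the clean text.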