This post grew very long when I first wrote it back in March of last year. I never dared to post it, assuming it was probably obviously wrong in some way, but I'm posting it now in case something here sparks ideas in others. I am completely open to this being nothing or irrelevant. I've tried to format the post as nicely as possible given its length, with some sections spoilered to make it easier to read, but if you don't want to read the whole thing I've added a Tl;dr at the end.
Hello! While researching the history of the VM, I came across "Polygraphie" by Johannes Trithemius. After auto-translating page after page, I found an interesting technique that he describes as "enn'agrammaton". It consists of splitting the alphabet across a nine-square grid, allocating letters to each square. A letter is then encoded as its square's symbol, with 1, 2, or 3 dots under it to indicate which letter within the square is meant.
Clearly, if this were the case with the VM, each symbol would denote a single character, making it a simple substitution cipher. As we all know by this point, a simple substitution cipher cannot accurately describe the VM text in any known language. I had a thought, however:
what if the dots were missing?
If you were to write using this simple cipher system, but without the dots (or other clear marker), leaving just the square symbol itself, what characteristics would the encoded text have?
To illustrate this, here is a sentence in English that I encoded in this way:
"This is undoubtedly due to the fact that English uses more combinations of two or more letters to represent single phonemes than Latin does"
"7336 36 75257172248 272 75 732 2117 7317 2534363 7626 4562 154135173556 52 775 56 4562 4277266 75 625626257 635342 53552426 7315 41735 2526" (abc=1,def=2 etc)
With dots:
As you can see, with the dots under the symbols and knowledge of the system/key, the text is very easy to decipher. Without the dots (the key), however, it becomes much harder, even when the system itself is known. This is mainly because instead of a 1-to-1 substitution, it is now a 3-to-1 substitution: one cipher symbol represents any of 3 plaintext letters. When deciphering, a 1-to-3 expansion must therefore happen somehow. The encoded text has some characteristics which may be relevant to the VM:
- It looks and feels like natural language (because it is encoded plaintext), with spaces and word lengths conserved
- Converting plaintext to these symbols is very easy and can be done with only a few minutes' practice, even if you do not understand what the plaintext means
- Symbols can repeat in sequence, such as in the word "represent" ("625626257")
- Some common words/letter groups are represented using the same symbols (to/up, lo/mo, to/un, th/ti, ile/ime/ike, ne/nd/oe/of/pe)
- Entropy has changed dramatically from the plaintext (I'll discuss this later, as it's important)
- Normal frequency analysis fails to detect an obvious plaintext
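For anyone who wants to reproduce this, here is a minimal Python sketch of the encoding. It uses the full grid given under method 4 below; merging J into the GHI square (as was common in Latin-script texts) is my own assumption, since my key omits J.

```python
# The grid used throughout this post: ABC=1, DEF=2, GHI=3, KLM=4,
# NOP=5, QRS=6, TU(VW)X=7, YZ=8. Merging J into square 3 is my assumption.
GROUPS = {1: "ABC", 2: "DEF", 3: "GHIJ", 4: "KLM",
          5: "NOP", 6: "QRS", 7: "TUVWX", 8: "YZ"}
TO_DIGIT = {ch: str(d) for d, letters in GROUPS.items() for ch in letters}

def encode(plaintext: str) -> str:
    """Replace each letter with its square's digit; keep spaces, drop the rest."""
    out = []
    for ch in plaintext.upper():
        if ch in TO_DIGIT:
            out.append(TO_DIGIT[ch])
        elif ch == " ":
            out.append(" ")
    return "".join(out)

print(encode("This is undoubtedly due to the fact"))
# -> 7336 36 75257172248 272 75 732 2117
```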
Note that treating this as a simple substitution cipher would likely produce nonsense, with something like "renrerent" or "sepsesept". If you do what many solvers do, picking out likely common words and applying a mono-alphabetic substitution based on them, the rest of the sentence becomes nonsense.
Using the substitution for common English words:
THE, THAT, TO (732, 7317, 75)
(7=T, 3=H, 1=A, 2=E, 5=O)
and then the most frequent English letters for the others:
(4=L, 6=S, 8=Y) you get:
"THHS HS TOEOTATEELY ETE TO THE EAAT THAT EOHLHSH TSES LOSE AOLAHOATHOOS OE TTO OS LOSE LETTESS TO SEOSESEOT SHOHLE OHOOELES THAO LATHO EOES"
Having some fun, I'll now interpret this in the same way many do.
"THHS HS" -> Early middle english shorthand for "The HouseHouse" -> modern english "The (noble family)houses"
"TOEOTATEELY" -> A dialectal mispelling of the 14th century latin "totalium" -> modern english "Totally/completely"
"ETE" -> french "été" meaning "was" -> modern english "were"
"TO THE EAAT" -> "to the eating", a common phrase meaning "ready to serve (food) to"
"THAT" -> "that"
"EOHLHSH" -> The name of a lord, possibly "Éowyn"
"TSES" -> "towers" -> with a missing a' likely meaning "during the" -> "during the two towers"
"LOSE" -> "lose/losing"
"AOLAHOATHO" -> An allemanic-hebrew hybrid borrowed word from latin "altercationis" ->"altercation" (battle)
The noble family houses were completely ready to serve that food to Éowyn during the (battle of the) two towers, losing ....
Clearly, with this many degrees of freedom, you can make literally any string of letters into any word in any language you choose.
Decoding back into plaintext reliably
My first intuition was that, with 3 degrees of freedom per symbol, you would be presented with an over-abundance of word choices, leading to the same issues as above and far too much room for interpretation. The number of permutations scales tremendously, at 3^n where n is the word length. This quickly creates thousands, sometimes hundreds of thousands, of possible variations.
Method 1
Use a program to output every possible permutation, one word at a time, and sift out the plausible solutions
Sifting through these would take forever (and nobody would be willing or able to do this in the 15th century), so solving it this way is not reasonable. I did try to do this manually in Excel, which was obviously silly. You could write some code to do it relatively easily, but it seems inefficient if other methods exist.
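For completeness, a minimal sketch of what such a program could look like (reusing the grid from the earlier sketch); it makes the blow-up very concrete:

```python
from itertools import product

GROUPS = {1: "ABC", 2: "DEF", 3: "GHIJ", 4: "KLM",
          5: "NOP", 6: "QRS", 7: "TUVWX", 8: "YZ"}  # same grid as before

def all_readings(cipher_word: str):
    """Method 1: yield every candidate plaintext for one cipher word."""
    pools = [GROUPS[int(d)] for d in cipher_word]
    for combo in product(*pools):
        yield "".join(combo)

candidates = list(all_readings("75257172248"))  # "undoubtedly"
print(len(candidates))
# 546750 readings for this one word; more than 3^11 = 177147
# because some squares in this grid hold more than 3 letters
```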
Method 2
Manually write out the permutations for the first two letters, remove the incorrect ones, then continue to the next letter
This is plausible, but very time-consuming to do for a whole manuscript. For most words, say 5 letters or fewer, it's not too difficult. A big assumption here is that the word lengths are correct, and that spaces in the plaintext are preserved. So for a 4-symbol cipher word, being left with "TIGR" is not a valid answer, even though "TIGRE" and "TIGRESSES" are words with that prefix.
For a word such as "this":
This took around 5 minutes, and would have taken far less if I had prioritised the green variations first. Still, what are the chances that someone would go through such an effort just to read their text?
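Here is a sketch of how this prefix-pruning could be automated, assuming you have a plain word list to check prefixes against:

```python
GROUPS = {1: "ABC", 2: "DEF", 3: "GHIJ", 4: "KLM",
          5: "NOP", 6: "QRS", 7: "TUVWX", 8: "YZ"}  # same grid as before

def decode_pruned(cipher_word: str, words: set) -> list:
    """Method 2: extend candidates one symbol at a time, discarding any
    prefix that no dictionary word starts with."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    candidates = [""]
    for d in cipher_word:
        candidates = [c + ch for c in candidates
                      for ch in GROUPS[int(d)] if c + ch in prefixes]
    return [c for c in candidates if c in words]

words = {"THIS", "THUS", "TIGRESSES", "UNDOUBTEDLY"}  # toy dictionary
print(decode_pruned("7336", words))
# -> ['THIS']  (TIGR survives as a prefix but is rejected: not a word)
```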
Method 3
Use an online dictionary with filters for word length and excluded characters
This isn't a tool they would have had in the 15th century, but it works really well for us now and reliably returns results with relatively little interpretation required. Using wordfinderx.com, I entered the length of the word, entered all of the excluded letters represented by symbols not in that word, and then started with words beginning with the first letter. This also assumes correct word length, spacing and spelling.
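The same filtering is easy to reproduce offline; a sketch, again assuming a plain word list:

```python
def filter_words(words, length: int, excluded: str):
    """Method 3: keep words of the right length containing none of the
    letters whose symbols do not appear in the cipher word."""
    banned = set(excluded.upper())
    return [w for w in words if len(w) == length and not set(w) & banned]

words = ["UNFOUNDEDLY", "UNDOUBTEDLY", "OUTSTANDING"]  # toy dictionary
print(filter_words(words, 11, "GHIQRSJ"))
# -> ['UNFOUNDEDLY', 'UNDOUBTEDLY']  (OUTSTANDING contains S, I and G)
```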
Here is an example process for "undoubtedly":
Exclude GHIQRSJ, length 11:
beginning with T -> no 11-letter words beginning with TN, TO, TP.
beginning with U -> no 11-letter words beginning with UO, UP; many with UN.
beginning with W -> no 11-letter words beginning with WN, WO, WP.
beginning with X -> no 11-letter words at all.
Continue with UN:
4 words beginning with UNE -> none match UNE(N/O/P)
3 words beginning with UNF -> 1 matches UNF(N/O/P) (unfoundedly) -> a manual check shows the word fails at letter 6, UNFOU(A/B/C)
10 words beginning with UND -> 3 match UND(N/O/P) (undoubtably, undoubtedly, undoubtable)
Undoubtably/undoubtable fail at letter 8: UNDOUBT(D/E/F)
Undoubtedly matches completely!
After doing this for the whole sentence, this is what the results were:
Method 4
Turn every word in the dictionary into a number set and crosscheck against that
Method 4 is effectively a more efficient and easier version of method 3, generating the same results.
For this system: ABC=1, DEF=2, GHI=3, KLM=4, NOP=5, QRS=6, TU(VW)X=7, YZ=8
I did this with the original sentence above, but for example:
Voynich = 7585313
Manuscript = 4157616357
More = 4562
Lose= 4562
Ordered as strings, i.e. lexicographically (I wasn't certain what to call this at first; quasi-numerically?), these number-words would be easy to find. It would effectively be as difficult as finding a word in a dictionary. To be clear, in this system
43780,
448492,
448578235648,
449,
45,
is a correctly ordered sequence, while:
45,
449,
43780,
448492,
448578235648,
is not correctly ordered. This effectively orders them the same way that decimals after a decimal point are ordered.
Doing this meaningfully would require some code, which I am planning to write but have not yet had time for.
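For the record, here is a sketch of what that code could look like. Building the index once makes every lookup instant, and sorting the digit keys as strings (not as numbers) reproduces exactly the decimal-style ordering described above:

```python
from collections import defaultdict

GROUPS = {1: "ABC", 2: "DEF", 3: "GHIJ", 4: "KLM",
          5: "NOP", 6: "QRS", 7: "TUVWX", 8: "YZ"}  # same grid as before
TO_DIGIT = {ch: str(d) for d, letters in GROUPS.items() for ch in letters}

def build_index(words):
    """Method 4: map each word's digit string to every word producing it."""
    index = defaultdict(list)
    for w in words:
        key = "".join(TO_DIGIT[ch] for ch in w.upper() if ch in TO_DIGIT)
        index[key].append(w)
    return index

index = build_index(["more", "lose", "voynich", "manuscript"])
print(index["4562"])   # -> ['more', 'lose']  (same digits, as noted above)
print(sorted(index))   # string sort gives the decimal-style ordering above
```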
If you don't know which language the ciphertext is in, can it be deciphered?
A way to do this might be as follows. First, create a number-dictionary for every likely language. Enter the ciphertext and have the program determine, for each word, whether it has at least one possible reading in that language. The languages can then be ranked from "most likely" to "least likely".
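A sketch of that ranking step, reusing build_index from the method 4 sketch; the per-language word lists are placeholders you would need to supply:

```python
def match_rate(cipher_words, index) -> float:
    """Fraction of cipher words with at least one reading in this language."""
    return sum(1 for w in cipher_words if w in index) / len(cipher_words)

# wordlists = {"english": [...], "dutch": [...], "latin": [...]}  # to be supplied
# cipher = "7336 36 75257172248 272 75 732 2117".split()
# scores = {lang: match_rate(cipher, build_index(wl))
#           for lang, wl in wordlists.items()}
# for lang, score in sorted(scores.items(), key=lambda kv: -kv[1]):
#     print(f"{lang}: {score:.0%}")
```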
For example, using my example result from earlier, the program gave a variation for 100% of the words (no word was left without options). Skimming the text with some proficiency in Dutch, it's also obvious that words such as "is/letters/of" would have counted as hits in Dutch. In fact, I will now do the process manually in Dutch to check:
"
???? is ??????????? duf to wie dabt ???? ??????? tres korf ???????????? of vwo ns korf letters to ????????? single ???????? ???? latin does"
Doing this gives a 66.66% match rate. Frankly, many of these "hits" are not really Dutch words, but either English words (single instead of singel) or acronyms like VWO (Voorbereidend Wetenschappelijk Onderwijs) or NS (Nederlandse Spoorwegen). The dictionary I used was from woordvinder.com, and it is generous to say the least. Either way, it is clear that even with a generous word pool to draw from, there is no grammatically correct or natural-sounding Dutch sentence to be found here.
This is great!
I'm sure there would be multiple hits in many languages. I would be surprised however, if there were grammatically correct and natural sounding sentences in multiple languages. Please feel free to do the same process in another language to see if it spits out anything correct!
What about entropy?
This is something I could certainly use advice and input on. It seems to me that encoding in this way should have a significant impact on entropy. Asking "if the first letter is R, what is the second letter likely to be?" clearly has many possible answers; in any case, the theoretical upper bound for any letter is the number of letters in the alphabet.
When asking the same question for one of the symbols on this grid, the upper bound is 8. There are only 8 symbols in use, so there are only 8 possible answers. The chance of guessing correctly is therefore far higher. In reality the chance is even greater, because of the structure of English.
Let's ask the same question for R (q,r,s):
Rq/Rr/Rs and Rk/Rl/Rm are both unlikely letter combinations, which leaves a likely spread of 6 possible symbols. I do recognise that this does not reach the level of predictability seen in the VM, but it's definitely different from plaintext.
But what if the grid was not set up as the in-order Latin alphabet? If a grid grouped all vowels (just as an example) into a single glyph, the chance that this "vowel glyph" follows a consonant glyph would be higher than the chance of a second consonant glyph. After a symbol representing (b,c,d), there would be a relatively high chance that the next glyph is the vowel glyph (a,e,i,o,u), and for the 3rd glyph, a decent chance of either a vowel or a consonant. After 2 consecutive vowels? Almost no chance of a 3rd vowel. After 2 consecutive consonants? Almost no chance of a 3rd consonant.
This isn't even to mention systems that may use a glyph per 2 letters, per 4 letters, or a mix. The entropy of the same plaintext would vary with the system used.
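To put numbers on this, here is a sketch that estimates the conditional entropy H(next symbol | current symbol) of any text, so a plaintext can be compared against its encoded form under different grids; for an 8-symbol alphabet the ceiling is log2(8) = 3 bits:

```python
import math
from collections import Counter

def conditional_entropy(text: str) -> float:
    """H(next symbol | current symbol) in bits, estimated from bigram
    counts; word-boundary (space) bigrams are excluded."""
    pairs = [(a, b) for a, b in zip(text, text[1:]) if a != " " and b != " "]
    bigrams = Counter(pairs)
    firsts = Counter(a for a, _ in pairs)
    total = sum(bigrams.values())
    return -sum((n / total) * math.log2(n / firsts[a])
                for (a, _), n in bigrams.items())

# compare e.g. conditional_entropy(plaintext) against
# conditional_entropy(encode(plaintext)) using encode() from the first
# sketch; a long text sample is needed for stable estimates
```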
Biggest Issues with this idea
There are more than 8 voynichese characters
This certainly appears to be the case. In a system such as this, however, multiple characters could be variants of the same square glyph. Tentatively looking at the 3rd ring on f57v, there is a possibility that several characters are actually variants of one another (k and m, for example); I'm sure this has been discussed and brought up many times. It would mean that although there are more than 8 or 9 characters, there may be 8 or 9 groups of characters, with each group representing a single square glyph.
If there are multiple characters per square glyph (and no 1-to-1 substitution is happening), why do the characters vary? How would the author know which one to write?
The assumption would have to be either that:
1) The writer had to follow a set of predetermined rules
2) The writer chose one of the symbols in the group based on personal preference
If 1) is the case, how many rules are needed to produce Voynichese-like text, and how easy is it to do?
I tried this a few times by taking Voynich text, transcribing it into square glyphs with the grid system (with groupings I chose), and then writing it back into Voynichese from the grid using a few basic rules. With simple rules that fit Voynichese, I had relatively good success at reproducing the words correctly.
Here is an example ruleset and grid that I used. To be very clear: I am NOT saying this is a solution. This is simply a test of whether this kind of Voynichese -> square glyph -> Voynichese round trip can work without complicated rules.
Here is the example ruleset/guidelines:
And the process from the 2nd line of f58v
Does this work 100%? No. Does it work a lot better than random chance? Yes. Sometimes it works really well, sometimes less well. Mind you, this was not an extensively thought-out and analysed set of rules/guidelines, but rather my attempt based on some basic patterns I saw in the text.
There are essentially an infinite number of ways this could be constructed; how do we know which one to use?
We don't. Maybe someone smarter than me has a way to construct this type of system that fits the VM, but I can't think of one beyond long-winded computer programs, luck, or some kind of narrowing process. We would need to agree on a transliteration (or test every permutation of every possible grouping of every variation of every transliteration... YIKES), while taking into account that the text may be a mix of languages or use inconsistent spelling (also YIKES). That's a lot of effort if the VM doesn't use this type of system.
It may be a simple ruleset and character set, but without a key or something to tell us which sets are correct, it's simply one of millions of possibilities. The only bright side is that this may explain why no one has cracked it yet. I suppose another bright side may be that only a few systems will produce meaningful text, if the Dutch example above holds across more languages.
Conclusion/Reasons to continue research in this area
As mentioned earlier, text presented in this way has some very relevant and promising parallels to Voynichese.
- It looks and feels like natural language, with spaces and word lengths being conserved
- Writing plaintext into Voynichese would have been quite easy
- The system could have been known and used in the presumed time period of writing
- Aspects such as repeated letter clusters can be reasonably explained, unlike normal substitution
- Many common letter groups can be represented using the same glyphs, explaining why many words/word endings appear the same
- There are few degrees of freedom in the interpretation of the text into plaintext for a given system (there is either a coherent sentence, or not)
- Entropy is lowered by this system, to a degree depending on the exact system used
- The potential key/system making decoding reasonable would be easily demonstrable using a couple of pages (which may have been removed at some point)
The main issues being:
- There are an infinite number of systems that could have been used and we don't know which one
- Different hands may have had different keys/systems, or may have been writing from a different plaintext language
- For this system to be deciphered into plaintext, the reader of the text would likely need either a dictionary, another key, or many years of free time
Tl;dr There is a potential system of encoding text which would have been possible at the time, would be difficult to crack, and shares some properties with Voynichese text. I don't exactly know why such a system would be used, or whether such a system was intended (even if one was used). I tested this a little by encoding an English sentence and attempting to decode it. Decoding back to English was reliable, while decoding to Dutch yielded no relevant results or coherent sentences. Entropy and other aspects were discussed, but could very much use input from others.