The Voynich Ninja

Full Version: Scoring artefact for 45% entry collapse in i/e/a
Hey all, fairly new here; I've been a lurker for a while and am starting to test things to help me understand the problem better.

I'm treating the manuscript as a homophonic cipher over Italian and trying to optimise against a codebook. The basic idea: take a Voynich word, assign it an Italian word, then check whether the decoded line produces trigrams (three-word sequences) that actually appear in real Italian text. If swapping in a different Italian word improves the count, keep it, then repeat thousands of times. I don't know Italian, so it's a lot of "copy, paste, check, translate, repeat". I'm basically checking whether the decoded text produces word sequences that look like real Italian. No secrets here... I'm comparing against this specific book for Italian: [link].
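The optimise-against-a-codebook loop described above can be sketched roughly as below. All names here are hypothetical, and the scorer is a bare-bones stand-in for whatever reference trigram set you build from the book; it's a sketch of the hill-climbing idea, not the actual pipeline.

```python
import random

def trigram_score(words, reference_trigrams):
    """Count decoded three-word sequences that appear in the reference corpus."""
    return sum(1 for i in range(len(words) - 2)
               if tuple(words[i:i + 3]) in reference_trigrams)

def hill_climb(voynich_tokens, codebook, candidates, reference_trigrams, iters=10000):
    """Re-assign one codebook entry per step; keep the change if the score doesn't drop."""
    best = trigram_score([codebook[t] for t in voynich_tokens], reference_trigrams)
    for _ in range(iters):
        token = random.choice(voynich_tokens)
        old = codebook[token]
        codebook[token] = random.choice(candidates)
        score = trigram_score([codebook[t] for t in voynich_tokens], reference_trigrams)
        if score >= best:
            best = score
        else:
            codebook[token] = old  # revert the swap
    return codebook, best
```

Accepting equal-score swaps (the `>=`) lets the search drift across plateaus, which is also exactly what lets high-frequency filler words creep in unchecked.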

So it felt like it was working, scores were climbing, but I'm stuck: roughly 45% of my entries are collapsing into i, e, and a. They seem to help form valid trigrams with almost anything, so the optimiser keeps picking them. The trigram score looked great, but the decoded text reads like garbage, just "i e a i" etc. with the occasional real word.

Anyone else run into this? Is this just a flaw in my method, or is there a better dataset based on 1400s/1500s Italian? Or does anyone have a better idea of how to pull out of this rut? I think I've hit a wall. Is there by any chance an existing source of n-grams for Italian of that period? I can't find one, or perhaps I don't know how to look for it; I don't mind if it's a paid resource I need to purchase.

It does make me wonder: if we had access to a supercomputer, could we basically brute-force our way to success here, if we could assume other things such as shorthand and exclude those entries, for example?
Hi chrisj,

It doesn't work well because Voynichese and Italian are very different. You need other ciphering techniques (not the ones you think could work!) to have any chance of closing the gap between Voynichese and any European language.

Homophonic ciphers increase character conditional entropy (because of the unpredictability of the choice of alternate symbols that could be used for a given letter) and need more symbols than the alphabet. The same is usually true for abbreviations. This is not what Voynichese text looks like: it has a low character conditional entropy (the next character is too predictable) and a short list of common "characters" (symbols/glyphs), so you can try grouping some of them, like common word endings, to create a verbose substitution cipher. This could work in theory, maybe with a one-to-many mapping between plaintext letters and chunks of Voynichese text: the search space is huge but the analysis of VMS "languages" and "dialects" may help.
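The character conditional entropy mentioned above can be estimated from bigram counts. A minimal sketch (my own illustration, not from the papers; the function name is made up) of H(next char | current char):

```python
import math
from collections import Counter

def conditional_entropy(text):
    """Estimate H(next char | current char) in bits from bigram counts."""
    bigrams = Counter(zip(text, text[1:]))      # counts of adjacent character pairs
    context_totals = Counter(text[:-1])         # how often each char appears as context
    total = len(text) - 1
    h = 0.0
    for (a, b), n in bigrams.items():
        p_pair = n / total                      # P(current=a, next=b)
        p_cond = n / context_totals[a]          # P(next=b | current=a)
        h -= p_pair * math.log2(p_cond)
    return h
```

A perfectly predictable sequence like "ababab" scores 0 bits; Voynichese famously sits well below the values typical of European plaintexts, which is the mismatch nablator is pointing at.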

I would suggest reading some good papers like these by René Zandbergen to better understand these issues (the third one attempts to match the conditional entropy of Italian with a verbose substitution cipher):
[link]
[link]
[link]
Thanks @nablator. After reading through, I realised that the 45% i/e/a consumption was clearly wrong, considering that in most languages the equivalent filler words (the, of, a) sit at around 6.5%. After capping these at 6.5% and running again, the count of "guessed words" is climbing back up, at the cost of enormous amounts of CPU; I'm using modal.com to assist with this, similar to Google Batch. It's still essentially brute-forcing combinations of candidate words. I have now tested the same method across most of the other potential languages, using Google's Ngram books as the reference corpus, and none of them score nearly as well.
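The 6.5% cap can be enforced as a constraint check inside the optimiser: reject any swap that would push one plaintext word over the cap. A minimal sketch (the cap value comes from the post; the function name is hypothetical):

```python
from collections import Counter

FILLER_CAP = 0.065  # ~6.5% per word, the cap described in the post

def respects_cap(decoded_tokens, cap=FILLER_CAP):
    """True if no single plaintext word exceeds the cap share of all decoded tokens."""
    counts = Counter(decoded_tokens)
    limit = cap * len(decoded_tokens)
    return all(n <= limit for n in counts.values())
```

In the hill-climbing loop, a candidate swap would only be accepted when the resulting decoding both improves the trigram score and passes this check.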

Will keep you posted on this approach. Note that this assumes a word-for-word replacement, and since the Voynich words don't come from any known language, the method currently implies the vocabulary is invented, like a one-time codebook. That may also align with the theory of why there are so few mistakes: the text was first written "in Italian", then converted into Voynichese, with words made up along the way, and copied into this manuscript.

It's still busily attempting as many possibilities as it can; very curious if anyone else has gone as deeply down this route.
(15-03-2026, 09:42 AM)chrisj Wrote: very curious if anyone else has gone as deeply down this route.

I've been running a similar multilingual pipeline over a somewhat different set of ciphers, with nothing of interest so far. On the other hand, the space of ciphers to explore is huge, so it's a very long-term project. To make it practical I run a pre-filter that discards all enciphering schemes that would produce statistically unrealistic texts for the list of languages I consider; simple homophonic ciphers just won't pass it.
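One way such a pre-filter could work, purely as an illustration: score a sample of each scheme's output by character conditional entropy and discard schemes outside a target band. The band values below are assumptions for the sketch, not nablator's actual thresholds.

```python
import math
from collections import Counter

# Assumed target band (bits) for Voynichese-like conditional entropy;
# illustrative values only, not the real filter's parameters.
H_LO, H_HI = 1.5, 2.5

def bigram_cond_entropy(text):
    """H(next char | current char) in bits, from bigram counts."""
    bigrams = Counter(zip(text, text[1:]))
    ctx = Counter(text[:-1])
    total = len(text) - 1
    return -sum((n / total) * math.log2(n / ctx[a])
                for (a, _), n in bigrams.items())

def passes_prefilter(sample):
    """Keep a candidate cipher only if its output statistics land in the band."""
    return H_LO <= bigram_cond_entropy(sample) <= H_HI
```

A real filter would likely combine several statistics (entropy, symbol inventory size, word-length distribution), but a cheap scalar check like this already prunes most schemes before any expensive search.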
Oh yes, this. ^^