The Voynich Ninja

Hello folks

I recently put together a little tool to aid with analysing the text - a little project I'd wanted to do for a long while, and finally leveraged some AI coding tools to speed up the process. The tool inputs the ZL transcription - along with configuration as to how you want to pre-process the glyphs - and outputs a ton of aggregations for interrogation.


Currently, it shows the following:

* Basic transition probabilities (including word/line/paragraph/page boundaries)
* Ngram analyser (1-grams, 2-grams, 3-grams)
* Word position preference of glyphs
* Page position preference of glyphs (by character number / line number)
* Page position preference of glyphs (by physical position - data from Voynichese.com)
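
For readers who want to see what the first item amounts to, here is a minimal sketch of glyph transition probabilities with word boundaries included. It is not the tool's code: the `#` boundary marker, the function name, and the sample words are assumptions made for illustration, and each character is treated as one glyph with no preprocessing.

```python
from collections import Counter, defaultdict

WORD_BOUNDARY = "#"  # hypothetical marker standing in for a word edge


def transition_probabilities(lines):
    """Estimate P(next symbol | current symbol), word edges included.

    `lines` is a list of strings whose words are separated by spaces.
    Each character is treated as one glyph (an assumption of this sketch).
    """
    pair_counts = defaultdict(Counter)
    for line in lines:
        for word in line.split():
            symbols = [WORD_BOUNDARY, *word, WORD_BOUNDARY]
            for current, following in zip(symbols, symbols[1:]):
                pair_counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in pair_counts.items()
    }


# Illustrative sample only, not real transcription data.
sample = ["daiin chol daiin", "chol shol"]
probs = transition_probabilities(sample)
```

Line, paragraph, and page boundaries could be handled the same way by adding further marker symbols.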

While most of this has been done before, the more novel part of the tool is that it also produces these visualisations for a large number of subsets of the manuscript: by language (A, B), by scribe (1, 2, 3, 4, 5), and by illustration. The cool part is that you can then compare charts side by side (e.g., Currier A vs B, or Hand 2 vs Hand 4), and even "diff" them to see at a glance where the large differences lie. This was largely a response to the common critique of analyses in which interesting signal may be lost by aggregating over non-homogeneous pages of text.
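
The "diff" idea can be sketched roughly as follows, assuming each subset has already been reduced to a list of words. The function names and the two tiny word lists are made up for illustration and are not taken from the tool.

```python
from collections import Counter


def glyph_frequencies(words):
    """Relative frequency of each glyph (one character each) in a word list."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {glyph: n / total for glyph, n in counts.items()}


def diff_frequencies(freq_a, freq_b):
    """Return (glyph, freq_a - freq_b) pairs, largest absolute difference first."""
    glyphs = set(freq_a) | set(freq_b)
    deltas = {g: freq_a.get(g, 0.0) - freq_b.get(g, 0.0) for g in glyphs}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))


# Made-up stand-ins for two subsets of the manuscript.
subset_a = ["chol", "chor", "daiin"]
subset_b = ["qokedy", "chedy", "daiin"]
ranked = diff_frequencies(glyph_frequencies(subset_a), glyph_frequencies(subset_b))
```

Positive values mark glyphs that are relatively more frequent in the first subset, negative values those more frequent in the second.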

If you are keen to change the preprocessing applied: this can't be done in the hosted web app itself, but it can be done by cloning the repo and making the changes. I may be happy to take requests here. There are a ton of other things it could show, which could be added in future releases: entropy, LAAFU stuff, word boundaries, glyph equivalence. Again, it's not that these things haven't been done before, but I believe the ability to interact with them and to compare sections quickly and visually is the most valuable part.

Do let me know if it's useful and if you find any bugs, or have any suggestions.

This was inspired by writings by Nick Pelling, Rene Zandbergen, Patrick Feaster, Sean Palmer, Emma May Smith, Marco Ponzi, and many others.

Commendable set of visualization tools! Well done.

(04-03-2026, 10:40 AM) zodiac_killer Wrote: I recently put together a little tool to aid with analysing the text - a little project I'd wanted to do for a long while, and finally leveraged some AI coding tools to speed up the process. The tool inputs the ZL transcription - along with configuration as to how you want to pre-process the glyphs - and outputs a ton of aggregations for interrogation....
Currently, it shows the following:

* Basic transition probabilities (including word/line/paragraph/page boundaries)
* Ngram analyser (1-grams, 2-grams, 3-grams)
* Word position preference of glyphs
* Page position preference of glyphs (by character number / line number)
* Page position preference of glyphs (by physical position - data from Voynichese.com)

Great tool! But, again, character and n-gram statistics are more confusing than illuminating. Can you extend that tool to show the distribution of a specific word (or word pattern, word prefix, word suffix) on a page? Or in a set of, say, 5-10 selected pages, side by side?
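
A rough sketch of what such a word-level lookup could involve, assuming a page is available as a list of text lines with space-separated words. The function name, the sample page, and the use of regular expressions for patterns are assumptions for illustration.

```python
import re


def find_pattern(page_lines, pattern):
    """Locate words matching a regex; returns (line number, word number, word)."""
    regex = re.compile(pattern)
    hits = []
    for line_no, line in enumerate(page_lines, start=1):
        for word_no, word in enumerate(line.split(), start=1):
            # fullmatch: the whole word must match, so prefixes need a trailing ".*"
            if regex.fullmatch(word):
                hits.append((line_no, word_no, word))
    return hits


# Illustrative page only, not real transcription data.
page = ["daiin chol qokedy", "chedy daiin shol"]
prefix_hits = find_pattern(page, r"ch.*")  # words starting with "ch"
suffix_hits = find_pattern(page, r".*dy")  # words ending with "dy"
```

Running the same call over several pages and laying the hit positions out next to each other would give the side-by-side view.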

All the best, --stolfi

(06-03-2026, 02:40 PM) Jorge_Stolfi Wrote: Great tool! But, again, character and n-gram statistics are more confusing than illuminating

I'm not sure that's true. Can you point to a scenario in which such statistics would inherently yield a misleading result, as opposed to one that's just challenging to explain, or that runs contrary to expectations, or that isn't a good fit for a particular hypothesis?

(07-03-2026, 12:04 AM) pfeaster Wrote:
(06-03-2026, 02:40 PM) Jorge_Stolfi Wrote: character and n-gram statistics are more confusing than illuminating
Can you point to a scenario in which such statistics would inherently yield a misleading result, as opposed to one that's just challenging to explain, or that runs contrary to expectations, or that isn't a good fit for a particular hypothesis?

Suppose you have a herbal written entirely in English by one person, but divided into two sections, one covering grasses and one covering bushes. Digraph statistics may detect that the digraph 'rb' is rather common in the first half but nearly absent in the second, while the opposite is true for the digraphs 'bu', 'ba', and 'rk'. This may lead Martian cryptologists (who can't read English) to conclude that the two sections are in different languages or were written by two different people. But a word analysis would show that those differences are due entirely to the word 'herb' being heavily used only in the first section, while the words 'bush' and 'bark' are heavily used only in the second.
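
The herbal scenario can be played out on a toy scale. The two "sections" below are invented sentences, and the helper names are assumptions; the point is only to show digraph counts diverging and then being traced back to the words that carry them.

```python
from collections import Counter


def digraph_counts(words):
    """Count adjacent character pairs inside each word."""
    counts = Counter()
    for word in words:
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


def words_containing(words, digraph):
    """Count the word tokens that contain the given digraph."""
    return Counter(word for word in words if digraph in word)


# Invented stand-ins for the two sections of the herbal.
grasses = "the herb grows tall and the herb is green".split()
bushes = "the bush has rough bark and the bush is green".split()

rb_by_section = (digraph_counts(grasses)["rb"], digraph_counts(bushes)["rb"])
bu_by_section = (digraph_counts(grasses)["bu"], digraph_counts(bushes)["bu"])
rb_sources = words_containing(grasses, "rb")
```

Here 'rb' appears only in the first section and 'bu' only in the second, and `rb_sources` shows that a single word accounts for all of the 'rb' count.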

Or suppose that the Martians analyze digraph statistics of an "encrypted" book as a function of position within the line, and notice that the frequency of 'th' is lower than average within the first 10 letters and higher than average in the last 10 letters of each line. They may conclude that the "encryption algorithm" is restarted at the beginning of each line, and that the end of each line is padded with random "filler" words. But in fact the anomaly is due to the fact that the trivial line breaking algorithm causes the first word of each line to be longer than average, and the last 2-3 words to be shorter -- and the frequency of 'th' in English texts is dominated by its occurrence in many common short words.
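
The line-breaking effect can be checked on any plain text with a sketch like the one below. It assumes the "trivial" algorithm is greedy wrapping (start a new line whenever the next word does not fit); the function names, the sample sentence, and the width are illustrative choices, and a real test would need a much longer text.

```python
def greedy_wrap(words, width):
    """Fill each line with words until the next one would exceed `width`."""
    lines, current, length = [], [], 0
    for word in words:
        needed = len(word) if not current else length + 1 + len(word)
        if current and needed > width:
            # The word does not fit, so it becomes the first word of a new line.
            lines.append(current)
            current, length = [word], len(word)
        else:
            current.append(word)
            length = needed
    if current:
        lines.append(current)
    return lines


def mean_edge_lengths(lines):
    """Mean length of first and last words over lines with at least two words."""
    multi = [line for line in lines if len(line) > 1]
    if not multi:
        return None
    first = sum(len(line[0]) for line in multi) / len(multi)
    last = sum(len(line[-1]) for line in multi) / len(multi)
    return first, last


text = ("the extraordinary cat sat on the mat with considerable "
        "enthusiasm and then it slept").split()
wrapped = greedy_wrap(text, width=20)
edge_lengths = mean_edge_lengths(wrapped)
```

Comparing the two means over a long text, and then counting 'th' by position in the wrapped lines, would show whether the positional anomaly appears without any cipher being involved.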

Trying to understand an unreadable text by looking at character and digraph statistics is like trying to understand the ecology of different regions by counting animals with tails and without tails. The counts will lump aardvarks with alligators, bears with beetles, caterpillars with chimpanzees, ... The counts will probably show that the Sahara is somewhat different than the Amazon --- but not much beyond that. One may stare at those numbers for a whole life and never understand why they are different.

Sure, character and digraph statistics are the first thing one should look at when presented with an "encrypted" text. They can give useful clues, e.g., whether it is a simple substitution cipher or something more complicated, like a Vigenère or codebook cipher; and, in the first case, they will suggest some letter assignments for each possible language. But they cannot help much if the text turns out to be in an unknown language like Etruscan or proto-Elamite or the Rohonc script. To make progress in those cases one must look at words, not just characters...

All the best, --stolfi

(07-03-2026, 10:26 AM) Jorge_Stolfi Wrote:
But they cannot help much if the text turns out to be in an unknown language like Etruscan or proto-Elamite or the Rohonc script. To make progress in those cases one must look at words, not just characters...

Thank you! Or, a forgotten dialect of the Slavic language family.

(06-03-2026, 02:40 PM) Jorge_Stolfi Wrote:
(04-03-2026, 10:40 AM) zodiac_killer Wrote: I recently put together a little tool to aid with analysing the text - a little project I'd wanted to do for a long while, and finally leveraged some AI coding tools to speed up the process. The tool inputs the ZL transcription - along with configuration as to how you want to pre-process the glyphs - and outputs a ton of aggregations for interrogation....

Great tool! But, again, character and n-gram statistics are more confusing than illuminating. Can you extend that tool to show the distribution of a specific word (or word pattern, word prefix, word suffix) on a page? Or in a set of, say, 5-10 selected pages, side by side?

All the best, --stolfi

Implemented it.

Let me know if you need anything else... 30 second job.

zodiac_killer

asteckley

Jorge_Stolfi

pfeaster

Jorge_Stolfi

pjburkshire

DG97EEB