The Voynich Ninja

Full Version: The Naibbe cipher
(04-08-2025, 08:57 AM)oshfdk Wrote:
(04-08-2025, 02:24 AM)magnesium Wrote: The Naibbe cipher isn't perfect, but it's a place to start. I'd love to collaborate with folks and further investigate whether and how the Naibbe cipher can be extended/modified to accommodate the VMS's line-level properties. Part of this work, I suspect, will involve screening for plaintext properties that make those line-level statistics more or less likely.

Thank you for sharing your work! The Naibbe cipher is a bit at odds with what I would consider a good candidate for Voynichese (for the labels to make sense, I would expect the verbosity not exceeding something like ~1.5-2.5 glyphs per plaintext character on average), but overall I think this is the most thought-through attempt at replicating the statistics of Voynichese I've seen so far.

Thanks! I agree that through the lens of the Naibbe cipher, the labels in the VMS look weirdly short and uninformative (see Section 4.4 of the paper). One potential workaround is that at least some sets of labels are meant to be read as single interspersed messages. Consider, for example, the star chart on f68r2, whose 24 star labels can be theoretically read left-to-right as 8 rows of text:

[Image: rREQ3rj.png]

I freely admit that this is not a complete solution. 

I should also note: If memory serves, most labels are uncommon word types. Within the Naibbe cipher, the overwhelming majority of word types outside the 100 most common represent plaintext bigrams, with an average verbosity of ~2.5 glyphs/letter (though some are considerably more verbose), consistent with the upper bound of your suggested verbosity range. The ultimate reason the cipher encrypts unigrams as entire words is that doing so makes it much easier to achieve Voynich B's anomalously flat frequency-rank distribution of word types (see Bowern and Lindemann (2021)). And to most easily reconcile the entropy of Voynichese with a natural-language plaintext, the 1.5-2.5 glyphs/letter verbosity cannot be an upper bound but should instead be treated as a median value.
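
To make that flatness target concrete, here's a minimal sketch of the statistic in question; the file name and plain whitespace tokenization are placeholder assumptions, not a real transliteration spec:

Code:
from collections import Counter
import math

# Placeholder input: a whitespace-tokenized transliteration of Voynich B.
with open("voynich_b.txt") as f:
    tokens = f.read().split()

freqs = sorted(Counter(tokens).values(), reverse=True)

# Crude flatness measure: the log-log slope of the frequency-rank curve.
# A classically Zipfian text gives a slope near -1; a flatter distribution
# of word types sits closer to 0.
r1, r2 = 1, min(1000, len(freqs))
slope = (math.log(freqs[r2 - 1]) - math.log(freqs[r1 - 1])) / (math.log(r2) - math.log(r1))
print(f"{len(freqs)} word types; slope over ranks {r1}-{r2}: {slope:.2f}")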

I don't know whether it's been done, but if it hasn't, it would be interesting to study the word-level statistics of the labels specifically and see how much they differ from the rest of the VMS. Any which way, the labels pose challenges for the ciphertext hypothesis: Assuming for the moment that the token and type length distributions of labels are consistent with the rest of the manuscript, well more than half of labels would have to be <5 letters long given your suggested verbosity ranges, which in many cases would still imply a weirdly short label.
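
If anyone wants to try that comparison, a bare-bones sketch could look like the following; the two input files are hypothetical, and any split of a transliteration into label tokens and running-text tokens would do:

Code:
from statistics import mean

def token_lengths(path):
    # One whitespace-separated Voynichese token stream per file (assumed format).
    with open(path) as f:
        return [len(t) for t in f.read().split()]

labels = token_lengths("labels.txt")      # hypothetical file of label tokens
body = token_lengths("running_text.txt")  # hypothetical file of body tokens

print("mean label token length:", round(mean(labels), 2))
print("mean body token length: ", round(mean(body), 2))
print("share of labels under 5 glyphs:", round(sum(n < 5 for n in labels) / len(labels), 2))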
(04-08-2025, 10:37 AM)magnesium Wrote: I should also note: If memory serves, most labels are uncommon word types. Within the Naibbe cipher, the overwhelming majority of word types outside the 100 most common represent plaintext bigrams, with an average verbosity of ~2.5 glyphs/letter (though some are considerably more verbose), consistent with the upper bound of your suggested verbosity range. The ultimate reason the cipher encrypts unigrams as entire words is that doing so makes it much easier to achieve Voynich B's anomalously flat frequency-rank distribution of word types (see Bowern and Lindemann (2021)).

I don't know whether it's been done, but if it hasn't, it would be interesting to study the word-level statistics of the labels specifically and see how much they differ from the rest of the VMS. Any which way, the labels pose challenges for the ciphertext hypothesis: Assuming for the moment that the token and type length distributions of labels are consistent with the rest of the manuscript, well more than half of labels would have to be <5 letters long given your suggested verbosity ranges, which in many cases would still imply a weirdly short label.

Yes, this is true. But I said "verbosity not exceeding", so I'm not suggesting a range, more like a fuzzy upper bound. I agree that for the labels to have comfortable lengths, there should be closer to a 1:1 correspondence. There is, of course, a chance that the labels are only indices (fig. A, fig. B) referenced in the texts, in which case they can be as short as needed.

(04-08-2025, 10:37 AM)magnesium Wrote: I agree that through the lens of the Naibbe cipher, the labels in the VMS look weirdly short and uninformative (see Section 4.4 of the paper). One potential workaround is that at least some sets of labels are meant to be read as single interspersed messages. Consider, for example, the star chart on f68r2, whose 24 star labels can be theoretically read left-to-right as 8 rows of text...

There are cases where it's not obvious how to parse the labels sequentially, for example:

[attachment=11133]

(04-08-2025, 10:37 AM)magnesium Wrote: I freely admit that this is not a complete solution.

However, do you consider it actually possible that some similar scheme was used for the Voynich Manuscript? If so, what would you call the strongest hints pointing in this direction?
Hello @magnesium, I found the presentation really interesting and started reading the paper as soon as the meeting ended. Honestly, this is one of the best theories I've ever read about a possible encryption for natural language plaintext. As far as I know, nobody has ever been able to replicate so many properties (entropy, clustering, average length of words, positioning, differences in suffixes and prefixes, etc.) of the VMS together like this.

There are still some problems, such as the labels, and the fact that the predominance of rare glyphs in the top line is not at all explained by the Naibbe cipher. Could those lines/parts of the VMS text have been encrypted using a completely different type of cipher that still uses the same encoded glyphs? Who knows?

Fact is, I think you're surely onto something with the bigram/unigram plaintext idea.
(04-08-2025, 10:56 AM)oshfdk Wrote:
(04-08-2025, 10:37 AM)magnesium Wrote: I should also note: If memory serves, most labels are uncommon word types. ...

Yes, this is true. But I said "verbosity not exceeding", so I'm not suggesting a range, more like a fuzzy upper bound. I agree that for the labels to have comfortable lengths, there should be closer to a 1:1 correspondence. There is, of course, a chance that the labels are only indices (fig. A, fig. B) referenced in the texts, in which case they can be as short as needed.

(04-08-2025, 10:37 AM)magnesium Wrote: I agree that through the lens of the Naibbe cipher, the labels in the VMS look weirdly short and uninformative (see Section 4.4 of the paper). One potential workaround is that at least some sets of labels are meant to be read as single interspersed messages. Consider, for example, the star chart on f68r2, whose 24 star labels can be theoretically read left-to-right as 8 rows of text...

There are cases where it's not obvious how to parse the labels sequentially, for example:

[attachment=11133]

(04-08-2025, 10:37 AM)magnesium Wrote: I freely admit that this is not a complete solution.

However, do you consider it actually possible that some similar scheme was used for the Voynich Manuscript? If so, what would you call the strongest hints pointing in this direction?

I don't have a strong opinion on any particular labeling scheme, though I do like your suggestion of an indexing "Fig. A" approach. My observations on the labels stem from what I find needs to work globally to get Latin and Italian n-grams to take on the word-level properties of the VMS very reliably. To achieve VMS-like character entropy and conditional character entropy very reliably, a cipher benefits from being verbose, in the neighborhood of ~2.5 glyphs/letter. But the second you do that, it becomes easy to encrypt trigrams and longer n-grams as ciphertext tokens of length ≥8, and the VMS places tight constraints on how many unique Voynichese word types get that long. As a result, within this particular construction, the plaintext has to be mostly unigrams and bigrams, including the labels.
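
To spell out the arithmetic behind that constraint (the ~2.5 glyphs/letter figure is the only input here, and it's an average, not a fixed rate):

Code:
# Toy calculation: expected ciphertext token length for a plaintext n-gram
# at an assumed average verbosity of ~2.5 glyphs per plaintext letter.
VERBOSITY = 2.5

for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram"), (4, "tetragram")]:
    print(f"{name}: ~{VERBOSITY * n:.1f} glyphs per ciphertext token")
# Trigrams already land around 7-8 glyphs, and the VMS has very few word
# types that long, which is what squeezes the plaintext units down to
# unigrams and bigrams.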

Zooming out, this is exactly the kind of discussion I was hoping the Naibbe cipher would create. I don't assert that the Naibbe cipher precisely reflects how the VMS was created, nor do I assert that Voynichese even contains meaning. But as a reference model for VMS text generation, the Naibbe cipher implies that individual labels don't contain much plaintext at all. Given that, are there viable abbreviation schemes or alternate ways of interpreting the labels? If not, then maybe that's a point in favor of the VMS being gibberish.
(04-08-2025, 11:03 AM)Yavernoxia Wrote: Honestly, this is one of the best theories I've ever read about a possible encryption for natural language plaintext. ... There are still some problems, such as the labels, and the fact that the predominance of rare glyphs in the top line is not at all explained by the Naibbe cipher. ...

Thanks so much! I really appreciate it. And I agree: there are still problems! The next phase of work is to do more fine-tuning and attempt to replicate more line-, paragraph-, and page-level properties. 

For example, the current version of the Naibbe cipher contains no nulls, and while the construction of the cipher is meant to replicate the frequency and pair frequency of a rare glyph like p, it doesn't endow p with any special properties. But these are conditions we could, of course, change. I didn't optimize the tables for this, but one could imagine a world in which only 1 of the 6 Naibbe tables features bigram prefixes and suffixes that contain p, and for whatever reason, the cipher's ruleset dictates that that particular table is the one that's used to encrypt the opening lines of paragraphs. As I mentioned elsewhere, one could also imagine treating the paragraph-opener gallows glyph (e.g., p) or prefix (e.g., pch) as a null, simply meant to denote the start of a paragraph.
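
As a minimal sketch of that hypothetical ruleset (the table index and the uniform fallback draw are inventions for illustration, not the published Naibbe tables or card frequencies):

Code:
import random

N_TABLES = 6
P_TABLE = 0  # invented index for the one table whose affixes contain p

def pick_table(opens_paragraph: bool) -> int:
    # Hypothetical rule: paragraph-opening lines must use the p-bearing table;
    # everything else falls back to a random draw standing in for the card deck.
    if opens_paragraph:
        return P_TABLE
    return random.randrange(N_TABLES)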
(04-08-2025, 11:47 AM)magnesium Wrote: You are not allowed to view links. Register or Login to view.As I mentioned elsewhere, one could also imagine treating the paragraph-opener gallows glyph (e.g., p) or prefix (e.g., pch) as a null, simply meant to denote the start of a paragraph.
But shouldn't we then find the p glyph at the start of every paragraph? Why would p be a null that denotes the start of a paragraph yet only be used some of the time? What could be the rule for choosing when to use a null paragraph opener and when not to?

I suppose it could certainly be a graphical-stylistic choice, but to be honest, browsing the manuscript, I would have used a null paragraph opener much more frequently than the scribe(s) did, if we accept the null-paragraph-opener hypothesis.
(03-08-2025, 10:24 PM)oshfdk Wrote: Can this approach explain line-as-a-functional-unit properties, such as the tendency of certain characters and combinations to appear near/at the beginning or end of lines?

Not as it stands, but maybe a variant on it could.  The playing-card mechanism is designed to impose a frequency ratio among choices from the different tables, and it's good at accomplishing that -- but, applied strictly as described, it would also create flat ciphertexts with none of the regional variation we know and love from our holidays on Tavie's island.

There could be arbitrary rules such as (1) when encoding the first line of a paragraph, draw from this table; (2) when encoding the first vord of a line, draw from that table; (3) when encoding the last vord of a line, draw from that other table.  But that doesn't strike me as a very satisfying solution, since it doesn't offer any real explanation for such a practice.

Michael hinted that an explanation for repeated vords (or strings of similar vords) could be that the same table was used repeatedly.  And I suppose the same table would occasionally have been used several times in a row through sheer chance in a manuscript of this length.  But often enough to produce the patterns of repetition we see, without some further contributing factor?  I'm not sure.

An alternative mechanism consistent with Michael's approach (I think) would be for tables to be chosen dynamically based on the previous output of the ciphering mechanism.  As long as such a mechanism produced the same ratios as the playing cards, it could preserve the statistical advantages of the Naibbe system while simultaneously opening up an entry point for higher-level patterning to emerge.  If the previous vord contains x, switch to the next table when encoding the next vord -- that sort of thing.  Or it could even be based on choosing the table(s) used to encode a nearby vord -- a meaning-bearing variant on Timm-Schinner self-citation?
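
Here's a minimal sketch of the sort of thing I mean, with invented weights and an invented hashing trigger -- any deterministic rule with the right long-run frequencies would serve:

Code:
import hashlib

# Invented weights standing in for the card deck's table ratios.
TABLE_WEIGHTS = [10, 10, 5, 5, 5, 5]

def next_table(prev_vord: str) -> int:
    # Derive the next table from the previous ciphertext vord. Because the
    # hash is effectively uniform over vords, the long-run table frequencies
    # match the weights, while the same preceding vord always selects the
    # same next table -- an opening for higher-level patterning.
    h = int(hashlib.sha256(prev_vord.encode()).hexdigest(), 16)
    r = h % sum(TABLE_WEIGHTS)
    for table, weight in enumerate(TABLE_WEIGHTS):
        if r < weight:
            return table
        r -= weight
    raise AssertionError("unreachable")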

Another possibility would be for the switch of tables itself to convey information.

When it comes to LAAFU properties, something similar could be in play.  The choice of table for the start of a line could be constrained by different factors than subsequent choices.  Certain choices could naturally become more or less probable as a line progresses (say, the cumulative effect of a repeated 45% or 55% chance).  And the last choice might likewise have special constraints.  This could be worth playing around with.
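
As a quick illustration of that cumulative effect, here's a toy two-table model; the 55% stay-probability and the line-start rule are invented for the sake of the example:

Code:
# Two tables, A and B; each choice stays with the current table with
# probability 0.55. If every line starts in table A by rule, the chance of
# still being in A decays toward the 50/50 stationary mix slot by slot, so
# line-initial positions look statistically different from later ones.
p_stay = 0.55
p_a = 1.0  # line-start rule: always begin in table A
for slot in range(1, 11):
    p_a = p_a * p_stay + (1 - p_a) * (1 - p_stay)
    print(f"slot {slot}: P(table A) = {p_a:.3f}")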

I'm not sure it's essential for the plaintext units in a system like the one Michael describes to be unigrams and bigrams, specifically -- it seems to me that they could be any values, so long as they occur primarily in units encoded by one ciphertext vord or by two ciphertext affixes.  With that in mind, there might be relevant sequences that are more or less likely to occur at the beginning or end of a line, such that LAAFU effects would be artifacts of line chunking.  I've speculated about this myself in terms of syllabic encoding (for example, NTI might be common mid-line but not line-start); but it might also be applicable to some variant of Naibbe.  That could be interesting to model.

I thought the ground rules Michael laid out at the start of his presentation were excellent, and I'd recommend keeping them in force when exploring solutions to line and paragraph patterning.  Ideally, enciphered lines should show LAAFU effects with no special effort on the encipherer's part to impose them and no special plaintext characteristics behind them.
(04-08-2025, 01:31 PM)pfeaster Wrote: Not as it stands, but maybe a variant on it could.  The playing-card mechanism is designed to impose a frequency ratio among choices from the different tables, and it's good at accomplishing that -- but, applied strictly as described, it would also create flat ciphertexts with none of the regional variation we know and love from our holidays on Tavie's island.

There could be arbitrary rules such as (1) when encoding the first line of a paragraph, draw from this table; (2) when encoding the first vord of a line, draw from that table; (3) when encoding the last vord of a line, draw from that other table.  But that doesn't strike me as a very satisfying solution, since it doesn't offer any real explanation for such a practice.

I think there is a problem with adding more and more rules. But first I have to say that what follows is in no way an attempt to devalue the work on the Naibbe cipher; it's just my perspective.

The general methodological problem I sense in the whole approach: if one sets out to replicate particular features, one will likely end up replicating those features, nothing less, nothing more. To give a simple analogy: if I take an F1 car and set myself on a mission to replicate its appearance as closely as possible using modeling clay and scrap metal, then, if I'm careful and accurate, I will end up with a very good replica, quite suitable for photo shoots, but I won't expect to learn a lot about what makes the F1 car a racetrack marvel.

It would be, for me personally, a much more interesting finding if the features of Voynichese emerged from some internal logic and the simple constraints of an efficient encoding system. magnesium's cipher is a very good approximation, and excellent work at that, but it comes at the cost of quite high verbosity and a still quite complicated encoding/decoding process. Totally achievable with the tools available in the 15th century, yes, but what would be the motivation to use this scheme?

So, yes, it's possible to add LAAFU rules, nulls, etc., and in the end it is quite possible to achieve a perfect replica of Voynichese. But as long as that is done by arbitrarily adding rules, I'm not sure one will learn much about the actual Voynichese.
(04-08-2025, 01:55 PM)oshfdk Wrote: The general methodological problem I sense in the whole approach: if one sets out to replicate particular features, one will likely end up replicating those features, nothing less, nothing more. ... But as long as that is done by arbitrarily adding rules, I'm not sure one will learn much about the actual Voynichese.

IMHO, the process isn't that much more complicated than other ciphers available in the first half of the 1400s, but I agree that adding nulls and specific rules just to prove a point (e.g., the top-line 'gallows' appears because of this specific type of encoding) would not be useful.

What I find most interesting about this whole paper is the division of the plaintext into bigrams and unigrams, as well as the use of six different tables of glyphs. This makes total statistical sense and explains many of the properties of the vord distribution.
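
For anyone who wants to play with that idea, here's a minimal sketch of just the chunking step; the 50/50 bigram probability is an arbitrary stand-in, not the paper's figure:

Code:
import random

def chunk_plaintext(text: str, p_bigram: float = 0.5) -> list[str]:
    # Split a plaintext into the unigram/bigram units that would each be
    # enciphered as one vord (or as a prefix/suffix pair of affixes).
    letters = [c for c in text.lower() if c.isalpha()]
    units, i = [], 0
    while i < len(letters):
        if i + 1 < len(letters) and random.random() < p_bigram:
            units.append(letters[i] + letters[i + 1])
            i += 2
        else:
            units.append(letters[i])
            i += 1
    return units

print(chunk_plaintext("rosa canina"))  # e.g. ['ro', 's', 'ac', 'an', 'in', 'a']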
Very interesting indeed!