Last year, M. Greshko's Naibbe cipher marked a turning point in VMS research. Thanks to his work, and to Zattera's, we are now in a position to formulate testable criteria for the key features that a script simulator (or script generator) like Naibbe must reproduce to replicate the behavior of the Voynich structure.
This new paper explores and proposes 4 such criteria.
Best regards
This is some complicated stuff

Would you like to make some summary of your results in simple words?
Quote: We make our code and data available for independent verification.
That would be helpful.
I couldn't find a different repository on Kaggle than the one for the directionality calculation.
(23-04-2026, 01:12 PM)nablator Wrote: Quote: We make our code and data available for independent verification.
That would be helpful.
I couldn't find a different repository on Kaggle than the one for the directionality calculation.
You're right, the file is missing. I'm adding it now to the Kaggle repo, and it is also attached below. Thanks for noticing!
(23-04-2026, 11:30 AM)Rafal Wrote: This is some complicated stuff
Would you like to make some summary of your results in simple words?
Concretely, we propose 4 criteria that any Voynich script "generator" (like Naibbe) should meet to really look like Voynich. These are necessary criteria, not sufficient ones.
(23-04-2026, 02:30 PM)Labyrinthinesecurity Wrote: Concretely, we propose 4 criteria that any Voynich script "generator" (like Naibbe) should meet to really look like Voynich. These are necessary criteria, not sufficient ones.
I'm not sure I understand which 4 criteria you mean. These below?
Quote: Boundary concentration. The 80.6% end→start transition rate exceeds the four comparison languages by a factor of 2.3–4.1×. This means VMS word boundaries enforce a rigid alternation between positional grapheme classes that is quantitatively different from these four languages. The gap could narrow for morphologically richer languages not yet tested.
Bilateral positional extremity. Extreme positional ratios (>100:1) span both start-preferring and end-preferring grapheme classes, unlike the four comparison languages where such ratios cluster in one direction with specific orthographic explanations. This bilateral pattern is consistent with a system where word-initial and word-final positions draw from largely non-overlapping grapheme pools, but we cannot exclude the possibility that some untested natural languages exhibit similar properties.
Zipfian boundary distributions. VMS word-boundary grapheme distributions follow a Zipfian curve rather than the plateau shape observed in the four comparison languages (Parisel, 2025). This is a fundamental property of VMS word structure.
High cross-boundary MI with high structural residual. The VMS has the highest total cross-boundary MI (0.230 bits) and the highest MI retention after word-order shuffling (21%) among the five corpora tested. The high retention indicates that a substantial fraction of cross-boundary predictability is structural (order-independent).
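For readers who want to see what a cross-boundary MI measurement might look like in practice, here is a minimal Python sketch. It rests on assumptions that are not confirmed by the paper: that "cross-boundary" means pairing the last character of each word with the first character of the next word, that MI is estimated with a simple plug-in estimator, and that "retention" is the ratio of mean MI after shuffling word order to the original MI. The function names and the toy token stream are made up for illustration, and the paper's actual procedure may differ.

```python
import math
import random
from collections import Counter

def boundary_pairs(words):
    """Pair each word's last character with the next word's first character."""
    return [(a[-1], b[0]) for a, b in zip(words, words[1:])]

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    ends = Counter(x for x, _ in pairs)
    starts = Counter(y for _, y in pairs)
    # Sum p(x,y) * log2(p(x,y) / (p(x) * p(y))) over the observed pairs.
    return sum(
        (c / n) * math.log2(c * n / (ends[x] * starts[y]))
        for (x, y), c in joint.items()
    )

def shuffled_retention(words, trials=20, seed=0):
    """Return (original MI, mean shuffled MI divided by original MI)."""
    rng = random.Random(seed)
    base = mutual_information(boundary_pairs(words))
    shuffled = []
    for _ in range(trials):
        perm = list(words)
        rng.shuffle(perm)  # destroys word order, keeps the word inventory
        shuffled.append(mutual_information(boundary_pairs(perm)))
    mean_shuffled = sum(shuffled) / trials
    retention = mean_shuffled / base if base > 0 else 0.0
    return base, retention

# Toy token stream (invented, not VMS data): strictly alternating words.
toy = ["ab", "ca"] * 50
mi, kept = shuffled_retention(toy)
print(f"MI = {mi:.3f} bits, retained after shuffling = {kept:.1%}")
```

One caveat with this kind of sketch: the plug-in estimator is biased upward on small samples, so some of the MI that survives shuffling here is estimation noise and not structure.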
(23-04-2026, 02:55 PM)Labyrinthinesecurity Wrote: Yes, these are the ones
Could you help me understand what they are, in simpler terms? I know that this is all in the paper, but to me (and probably I'm not alone in this) it's a bit too technical. I could load it into Gemini Notebook, but maybe it's better to get the information from the source.
Do I understand it correctly, that all 4 criteria deal with the transitions between the last character of a word and the first character of the next word?
(23-04-2026, 02:30 PM)Labyrinthinesecurity Wrote: Concretely, we propose 4 criteria that any Voynich script "generator" (like Naibbe) should meet to really look like Voynich. These are necessary criteria, not sufficient ones.
But your criteria are based on features that distinguish Voynichese from four (4) languages - two Western Indo-European ones and two Semitic ones.
Natural languages are MUCH more varied than that. There are the East Asian monosyllabic languages, agglutinative languages, languages with vowel harmony like Turkish and Hungarian, languages with definite articles written as postfixes, languages with and without noun inflections for gender and number and noun-adjective agreements, languages with postpositions instead of prepositions... Then there is sandhi, which may be realized in writing (like "a" changing to "an" in English).
But a bigger problem is that character statistics depend on the spelling system much more than on the language. For instance, tones in romanized Chinese may be encoded as diacritics on a vowel, or as a numeric suffix 1-4. The second choice would radically change the statistics of suffixes...
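To make the spelling-system point concrete, here is a small Python sketch that spells the same toy syllable sequence two ways and compares the word-final characters. Everything in it is illustrative: the syllable list is invented, the function names are hypothetical, and the tone-mark placement rule (mark the first vowel) is a simplification that does not match real pinyin rules for every syllable.

```python
import unicodedata
from collections import Counter

# Toy sequence of (base syllable, tone number); invented, not corpus data.
SYLLABLES = [("ma", 1), ("ma", 3), ("shi", 4), ("ni", 3), ("hao", 3), ("shi", 2)]

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}

def spell_numeric(syllable, tone):
    """Encode the tone as a digit suffix, e.g. 'ma' + 3 -> 'ma3'."""
    return f"{syllable}{tone}"

def spell_diacritic(syllable, tone):
    """Encode the tone as a diacritic on the first vowel (simplified rule)."""
    for i, ch in enumerate(syllable):
        if ch in "aeiou":
            marked = unicodedata.normalize("NFC", ch + COMBINING[tone])
            return syllable[:i] + marked + syllable[i + 1:]
    return syllable

def final_char_counts(words):
    """Count how often each character appears in word-final position."""
    return Counter(w[-1] for w in words)

numeric = [spell_numeric(s, t) for s, t in SYLLABLES]
diacritic = [spell_diacritic(s, t) for s, t in SYLLABLES]
print("numeric finals:  ", final_char_counts(numeric))
print("diacritic finals:", final_char_counts(diacritic))
```

In the numeric spelling every word ends in one of four digits, while in the diacritic spelling the finals are ordinary (possibly accented) letters, even though the underlying text is identical.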
And, finally, statistics are a property of a text, not of a language. There is no such thing as "the frequency of 'e' in English" or 'the most common English word'. Someone wrote a whole novel in English without using 'e' even once -- and readers don't notice unless they are told. In a materia medica the most common word may well be "take" or "cures", and the word "the" may hardly be used...
All the best, --stolfi
(23-04-2026, 03:29 PM)Jorge_Stolfi Wrote: And, finally, statistics are a property of a text, not of a language. There is no such thing as "the frequency of 'e' in English" or 'the most common English word'. Someone wrote a whole novel in English without using 'e' even once -- and readers don't notice unless they are told. In a materia medica the most common word may well be "take" or "cures", and the word "the" may hardly be used...
All the best, --stolfi
Of course there are statistics for a language, and those statistics can be useful for comparing languages. "The frequency of 'e' in English" is effectively shorthand for "the frequency of 'e' in English texts on average". Just because a particular text can deliberately deviate from the norm does not mean the norm doesn't exist. There is still a distribution of frequencies across all existing texts, and those distributions differ meaningfully from language to language.
Put another way, the frequency of the word written "t h e" is significantly higher in an average English text than in an average French text. This is a clear statistical feature of those languages, even if there are outliers in both.
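As a rough illustration of "a distribution of frequencies across texts", the Python sketch below computes a letter's share in each text of a collection and then summarizes the collection by its mean and spread. The helper names and the toy sentences are invented for the example; a real comparison would need large, comparable corpora per language.

```python
from statistics import mean, pstdev

def letter_share(text, letter):
    """Fraction of alphabetic characters in `text` equal to `letter`."""
    letters = [c for c in text.lower() if c.isalpha()]
    return letters.count(letter) / len(letters) if letters else 0.0

def summarize(texts, letter):
    """Mean and population standard deviation of the letter's share per text."""
    shares = [letter_share(t, letter) for t in texts]
    return mean(shares), pstdev(shares)

# Toy "corpus" (invented): the second text avoids 'e' on purpose.
texts = [
    "The weather here is never extreme.",
    "A dog sat on a mat.",
    "She left the key under the green bench.",
]
avg, spread = summarize(texts, "e")
print(f"mean share of 'e' = {avg:.3f}, spread = {spread:.3f}")
```

A single outlier text shifts the spread but does not erase the average, which is the sense in which a per-language norm can coexist with texts that deviate from it.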