03-09-2025, 12:42 PM
I always enjoy reading about people's efforts to model systems that can generate text mimicking the vord structure and frequency ratios of Voynichese, and I think we stand to learn a lot from them. Sure, there's no guarantee that a system that can produce output superficially like Voynichese resembles the system actually used to produce Voynichese. But much of the time we seem to be at a loss to come up with any plausible explanation for the weird patterns we find, and in those cases the models -- if successful -- can at least help show how those patterns might have come about (which seems like an improvement on having no leads to follow at all).
On the other hand, we hardly ever see comparable efforts to model LAAFU ("Line As A Functional Unit") behavior. To summarize what's at issue for anyone who might need it: Voynichese running text displays clear patterning at the line level. The first vords of lines have distinctive statistical properties, as do the last vords of lines. But so, sometimes, do the second vords of lines (see the relevant post at Agnostic Voynich). And it has been argued that many vord features have subtler "preferences" for earlier or later positions deeper within the mid-line.
My feeling is that most proposed explanations don't stand up particularly well to scrutiny.
1. Do line-start and line-end features correspond to parts of words split across line breaks? Likely not, since line-start and line-end words aren't shorter on average than mid-line words (I don't recall offhand who studied this, but someone did).
2. Are line-end features abbreviations employed when the writer was running out of space? Maybe -- but my sense is that, in practice, abbreviations didn't typically cluster at line-end in manuscripts of the period, so this would be a stranger explanation than it might seem at first glance.
3. Do line-start and line-end patterns reflect a linguistic phenomenon, or some other patterning of underlying content (such as poetry)? That would be hard to square with line breaks seemingly inserted as necessary to fill available space around illustrations.
4. Do line-start and line-end patterns reflect contextual scribal variations -- i.e., different ways of writing the "same" glyphs at the beginnings or ends of lines? To be sure, there was plenty of contextual scribal variation in other European writing systems of the period (though not a lot specific to line ends and line starts). But that variation was conventionalized and had emerged over many generations. Unless Voynichese had a long undocumented tradition behind it, when -- and under what pressures -- would such conventions have evolved?
I don't claim that any of those explanations is weak enough that we can completely dismiss it, but at the same time, none of them strikes me as very persuasive -- certainly not enough so that we could say, "Oh, that's probably just X, so it's most likely safe to ignore."
On the other hand, I can imagine a system that would predictably produce line effects as a natural byproduct of its use, and that also falls well within the range of hypotheses people already entertain about how Voynichese might have worked (along the lines of one of Rene's proposals). Consider this set of specifications (a code sketch follows the list):
(1) Lines always break at word boundaries.
(2) Within lines, words are run together indiscriminately.
(3) Text is chunked for encoding into units consisting of one or more consonants followed by one or more vowels, with each "chunk" being encoded as a vord.
(4) It's possible to encode an isolated consonant or vowel (or isolated clusters of either), but this is done only as needed to satisfy rule (1).
(5) Vords that are similarly structured represent similarly structured "chunks," but not in a straightforward letter-by-letter way (imagine something like Naibbe encoding tables not being randomly interchangeable, but each encoding a different category of "chunk").
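To make rules (1) through (4) concrete, here's a minimal sketch of the chunking step in Python. The names are purely illustrative, and treating y as a consonant is an arbitrary choice for Latin; take it as one plausible reading of the rules, not a definitive implementation.

```python
import re

VOWELS = "aeiou"  # treating y as a consonant -- an arbitrary choice for Latin

# Rule (3): a chunk is one or more consonants followed by one or more vowels.
# Rule (4): leftover vowel runs (line-initially) and consonant runs
# (line-finally) become isolated chunks. Greedy left-to-right matching,
# with the consonant+vowel pattern tried first, handles both cases.
CHUNK_RE = re.compile(rf"[^{VOWELS}]+[{VOWELS}]+|[{VOWELS}]+|[^{VOWELS}]+")

def chunk(text: str) -> list[str]:
    """Chunk a stretch of text whose words are run together (rule 2)."""
    letters = re.sub(r"[^a-z]", "", text.lower())  # drop spaces and punctuation
    return CHUNK_RE.findall(letters)
```

For example, chunk("ex urbe") returns ['e', 'xu', 'rbe'], matching the [xu] example discussed further down.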
I've brought this idea up here before, but only as a thought exercise. Now, to try it out in practice, I've just cobbled together a little over a million characters' worth of miscellaneous transcribed medieval Latin and run a few experiments on it to see what would happen to the plaintext (prior to any further encoding) if it were "chunked" as I've described. Note: it isn't actually necessary to break the "chunked" text into lines to gather data about what characteristics different line positions would have -- presuming that line breaks are inserted arbitrarily, we just have to work out how each word would be "chunked" in each of several positions and compare the results.
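Concretely, the tallies can be gathered along these lines, building on the chunk() sketch above (position_tallies and top_percent are again illustrative names, and I'm assuming the corpus is already tokenized into words):

```python
from collections import Counter

def position_tallies(words: list[str]) -> dict[str, Counter]:
    """Chunk frequencies by hypothetical line position. Since rule (1) lets
    any word boundary be a line break, no actual lines are needed."""
    per_word = [chunk(w) for w in words]
    return {
        # mid-line: the whole word stream run together (rule 2)
        "mid": Counter(chunk("".join(words))),
        # line-start: the first chunk of each word, chunked in isolation
        "start": Counter(cs[0] for cs in per_word if cs),
        # line-end: the last chunk of each word, chunked in isolation
        "end": Counter(cs[-1] for cs in per_word if cs),
    }

def top_percent(tally: Counter, n: int = 12) -> list[tuple[str, float]]:
    """The n most common chunks, as percentages of the whole tally."""
    total = sum(tally.values())
    return [(c, round(100 * v / total, 2)) for c, v in tally.most_common(n)]
```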
Based on my sample, the top twelve most common "chunks" in the middle of the line (i.e., the units we get if we run all text together) would be:
[re] 2.53%
[te] 2.02%
[ta] 2.02%
[tu] 1.89%
[mi] 1.63%
[ne] 1.62%
[ra] 1.62%
[ri] 1.59%
[ti] 1.54%
[si] 1.45%
[ni] 1.43%
[se] 1.38%
At line-start (considering only the first "chunks" of individual words), the top twelve most common values would instead be:
[i] 8.18% †
[e] 8.16% †
[a] 7.04% †
[co] 3.12%
[re] 2.51% *
[o] 2.31% †
[se] 2.20% *
[no] 2.16%
[si] 2.10% *
[de] 2.07%
[u] 1.95% †
[pe] 1.53%
The asterisks mark cases that overlap the mid-line "top twelve," while daggers mark cases that could only occur line-initially. Meanwhile, at line-end (considering only the last "chunks" in individual words), the top twelve most common "chunks" would be:
[s] 18.10%
[m] 16.06%
[t] 11.90%
[r] 4.16%
[n] 3.94%
[d] 2.74%
[re] 2.45% *
[nt] 1.85%
[c] 1.79%
[ns] 1.38%
[ne] 1.26% *
[st] 1.22%
The two cases marked with asterisks overlap the most common mid-line "chunks," but the others would be exclusive to the end of the line.
The second "chunk" in the line -- analyzed so as to permit crossover to a new word, e.g., the second "chunk" in [ex urbe] would be [xu] -- also seems likely to have distinctive characteristics because it will tend disproportionately to represent the second syllable of a word. And indeed it does. For example, [re] is significantly less common as the second "chunk" in a line (1.03%) than as the first "chunk" in a line (2.51%) or in the mid-line as a whole (2.53%). Meanwhile, [mi] is somewhat more common as the second "chunk" (2.11%) than in the mid-line as a whole (1.63%).
As this illustrates, a syllabic encoding scheme along the lines I've described should predictably generate LAAFU effects considerably stronger than the ones we see in the Voynich Manuscript -- and they would affect not just first and last vords, but second vords as well (compare the relevant post at Agnostic Voynich). I'm less sure about it producing subtler mid-line patterns, but I wouldn't rule out that it might, in practice.
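In code terms, the second-chunk tally only needs a little lookahead so that the second chunk can cross a word boundary (a sketch as before, reusing chunk(); the three-word window is an assumption that covers ordinary cases):

```python
from collections import Counter

def second_chunks(words: list[str]) -> Counter:
    """Tally the second chunk of a hypothetical line starting at each word."""
    tally = Counter()
    for i in range(len(words)):
        # Join a few words so the second chunk can straddle a boundary,
        # as in "ex urbe" -> [e][xu][rbe], whose second chunk is [xu].
        chunks = chunk("".join(words[i:i + 3]))
        if len(chunks) > 1:
            tally[chunks[1]] += 1
    return tally
```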
If these effects seem too strong to be comparable to Voynichese, one way to weaken them would be to substitute this for rule #2:
(2) Within lines, the words that make up phrases are run together indiscriminately, but the phrases themselves are not run together.
The beginnings and ends of lines would still have heavily skewed statistical characteristics, but there would be fewer forms that could only be found there -- now limited to "chunks" that occur at beginnings and ends of individual words, but not at beginnings and ends of whole phrases.
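In terms of the earlier sketches, the revised rule only changes what gets passed to chunk(): each phrase is chunked as a unit rather than the whole line. This assumes, purely for illustration, that the plaintext arrives pre-segmented into phrases.

```python
def chunk_by_phrase(phrases: list[str]) -> list[str]:
    """Revised rule (2): run words together only within each phrase, so that
    phrase boundaries behave like miniature line boundaries mid-line."""
    out: list[str] = []
    for phrase in phrases:
        out.extend(chunk(phrase))  # chunk() itself strips the internal spaces
    return out
```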
Magnesium writes as follows about Voynichese LAAFU patterns:

(12-08-2025, 10:01 PM) magnesium Wrote:
> One of the things I want to explore is the extent to which the structure of the plaintext can create these biases within Naibbe ciphertext. For example, if the Naibbe cipher were used to encrypt a poem such as Dante's Divina Commedia, the poem's line-by-line structure would have rhyming, repeated phrases, etc. that would theoretically impose greater line-by-line positional biases in the frequencies of plaintext unigrams and bigrams relative to prose such as Pliny's Natural History. Is that sufficient to explain the full extent of the VMS's "line as a functional unit" properties? Maybe, maybe not. But maybe it becomes much easier to achieve "line as a functional unit" properties within a Naibbe-like ciphertext if the plaintext is a poem or poem-like in its structure.
There's certainly no harm in exploring that. But since one of his goals is to "(b) consistently replicate these properties [ = 'well-known VMS statistical properties' ] when encrypting a wide range of plaintexts in a well-characterized natural language," I assume he'd prefer to model a system that would reliably produce LAAFU effects when applied to any source text.
Just wondering: how difficult would it be to adapt the Naibbe approach from a unigram/bigram system to a syllabic "chunk" system? Might the frequencies of different "chunk" types result naturally in something like the frequency distributions simulated through playing cards?
![word_length_by_idx_in_line.png](https://raw.githubusercontent.com/pwspen/pykeedy/61e62c57136b52d5c9703ad41c35fe7e24963fce/examples/results/word_length_by_idx_in_line.png)