03-09-2025, 12:42 PM
I always enjoy reading about people's efforts to model systems that can generate text mimicking the vord structure and frequency ratios of Voynichese, and I think we stand to learn a lot from them. Sure, there's no guarantee that a system that can produce output superficially like Voynichese resembles the system actually used to produce Voynichese. But much of the time we seem to be at a loss to come up with any plausible explanation for the weird patterns we find, and in those cases the models -- if successful -- can at least help show how those patterns might have come about (which seems like an improvement on having no leads to follow at all).
On the other hand, we hardly ever see comparable efforts to model LAAFU ("Line As A Functional Unit") behavior. To summarize what's at issue for anyone who might need it: Voynichese running text displays clear patterning at the line level. The first vords of lines have distinctive statistical properties, as do the last vords of lines. But so, sometimes, do the second vords of lines (see the relevant post at Agnostic Voynich). And it has been argued that many vord features have subtler "preferences" for earlier or later positions deeper within the mid-line.
My feeling is that most proposed explanations don't stand up particularly well to scrutiny.
1. Do line-start and line-end features correspond to parts of words split across line breaks? Likely not, since line-start and line-end words aren't shorter on average than mid-line words (I don't recall offhand who studied this, but someone did).
2. Are line-end features abbreviations employed when the writer was running out of space? Maybe -- but my sense is that, in practice, abbreviations didn't typically cluster at line-end in manuscripts of the period, so this would be a stranger explanation than it might seem at first glance.
3. Do line-start and line-end patterns reflect a linguistic phenomenon, or some other patterning of underlying content (such as poetry)? That would be hard to square with line breaks seemingly inserted as necessary to fill available space around illustrations.
4. Do line-start and line-end patterns reflect contextual scribal variations -- i.e., different ways of writing the "same" glyphs at the beginnings or ends of lines? To be sure, there was plenty of contextual scribal variation in other European writing systems of the period (though not a lot specific to line ends and line starts). But that variation was conventionalized and had emerged over many generations. Unless Voynichese had a long undocumented tradition behind it, when -- and under what pressures -- would such conventions have evolved?
I don't claim that any of those explanations is weak enough that we can completely dismiss it, but at the same time, none of them strikes me as very persuasive -- certainly not enough so that we could say, "Oh, that's probably just X, so it's most likely safe to ignore."
On the other hand, I can imagine a system that would predictably produce line effects as a natural byproduct of its use, and that also falls well within the range of hypotheses people already entertain about how Voynichese might have worked (along the lines of one of Rene's proposals). Consider this set of specifications (a code sketch follows the list):
(1) Lines always break at word boundaries.
(2) Within lines, words are run together indiscriminately.
(3) Text is chunked for encoding into units consisting of one or more consonants followed by one or more vowels, with each "chunk" being encoded as a vord.
(4) It's possible to encode an isolated consonant or vowel (or isolated clusters of either), but this is done only as needed to satisfy rule (1).
(5) Vords that are similarly structured represent similarly structured "chunks," but not in a straightforward letter-by-letter way (imagine something like Naibbe encoding tables not being randomly interchangeable, but each encoding a different category of "chunk").
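To make rules (1) through (4) concrete, here's a minimal sketch of the chunking step in Python. The names are purely illustrative, and treating y as a consonant is an arbitrary choice for Latin; take it as one plausible reading of the rules, not a definitive implementation.

```python
import re

VOWELS = "aeiou"  # treating y as a consonant -- an arbitrary choice for Latin

# Rule (3): a chunk is one or more consonants followed by one or more vowels.
# Rule (4): leftover vowel runs (line-initially) and consonant runs
# (line-finally) become isolated chunks. Greedy left-to-right matching,
# with the consonant+vowel pattern tried first, handles both cases.
CHUNK_RE = re.compile(rf"[^{VOWELS}]+[{VOWELS}]+|[{VOWELS}]+|[^{VOWELS}]+")

def chunk(text: str) -> list[str]:
    """Chunk a stretch of text whose words are run together (rule 2)."""
    letters = re.sub(r"[^a-z]", "", text.lower())  # drop spaces and punctuation
    return CHUNK_RE.findall(letters)
```

For example, chunk("ex urbe") returns ['e', 'xu', 'rbe'], matching the [xu] example discussed further down.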
I've brought this idea up here before, but only as a thought exercise. Now, to try it out in practice, I've just cobbled together a little over a million characters' worth of miscellaneous transcribed medieval Latin and run a few experiments on it to see what would happen to the plaintext (prior to any further encoding) if it were "chunked" as I've described. Note: it isn't actually necessary to break the "chunked" text into lines to gather data about what characteristics different line positions would have -- presuming that line breaks are inserted arbitrarily, we just have to work out how each word would be "chunked" in each of several positions and compare the results.
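Concretely, the tallies can be gathered along these lines, building on the chunk() sketch above (position_tallies and top_percent are again illustrative names, and I'm assuming the corpus is already tokenized into words):

```python
from collections import Counter

def position_tallies(words: list[str]) -> dict[str, Counter]:
    """Chunk frequencies by hypothetical line position. Since rule (1) lets
    any word boundary be a line break, no actual lines are needed."""
    per_word = [chunk(w) for w in words]
    return {
        # mid-line: the whole word stream run together (rule 2)
        "mid": Counter(chunk("".join(words))),
        # line-start: the first chunk of each word, chunked in isolation
        "start": Counter(cs[0] for cs in per_word if cs),
        # line-end: the last chunk of each word, chunked in isolation
        "end": Counter(cs[-1] for cs in per_word if cs),
    }

def top_percent(tally: Counter, n: int = 12) -> list[tuple[str, float]]:
    """The n most common chunks, as percentages of the whole tally."""
    total = sum(tally.values())
    return [(c, round(100 * v / total, 2)) for c, v in tally.most_common(n)]
```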
Based on my sample, the top twelve most common "chunks" in the middle of the line (i.e., the units we get if we run all text together) would be:
[re] 2.53%
[te] 2.02%
[ta] 2.02%
[tu] 1.89%
[mi] 1.63%
[ne] 1.62%
[ra] 1.62%
[ri] 1.59%
[ti] 1.54%
[si] 1.45%
[ni] 1.43%
[se] 1.38%
At line-start (considering only the first "chunks" of individual words), the top twelve most common values would instead be:
[i] 8.18% †
[e] 8.16% †
[a] 7.04% †
[co] 3.12%
[re] 2.51% *
[o] 2.31% †
[se] 2.20% *
[no] 2.16%
[si] 2.10% *
[de] 2.07%
[u] 1.95% †
[pe] 1.53%
The asterisks mark cases that overlap the mid-line "top twelve," while daggers mark cases that could only occur line-initially. Meanwhile, at line-end (considering only the last "chunks" in individual words), the top twelve most common "chunks" would be:
[s] 18.10%
[m] 16.06%
[t] 11.90%
[r] 4.16%
[n] 3.94%
[d] 2.74%
[re] 2.45% *
[nt] 1.85%
[c] 1.79%
[ns] 1.38%
[ne] 1.26% *
[st] 1.22%
The two cases marked with asterisks overlap the most common mid-line "chunks," but the others would be exclusive to the end of the line.
The second "chunk" in the line -- analyzed so as to permit crossover to a new word, e.g., the second "chunk" in [ex urbe] would be [xu] -- also seems likely to have distinctive characteristics because it will tend disproportionately to represent the second syllable of a word. And indeed it does. For example, [re] is significantly less common as the second "chunk" in a line (1.03%) than as the first "chunk" in a line (2.51%) or in the mid-line as a whole (2.53%). Meanwhile, [mi] is somewhat more common as the second "chunk" (2.11%) than in the mid-line as a whole (1.63%).
As this illustrates, a syllabic encoding scheme along the lines I've described should predictably generate LAAFU effects considerably stronger than the ones we see in the Voynich Manuscript -- and they would affect not just first and last vords, but second vords as well (compare the relevant post at Agnostic Voynich). I'm less sure about it producing subtler mid-line patterns, but I wouldn't rule out that it might, in practice.
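In code terms, the second-chunk tally only needs a little lookahead so that the second chunk can cross a word boundary (a sketch as before, reusing chunk(); the three-word window is an assumption that covers ordinary cases):

```python
from collections import Counter

def second_chunks(words: list[str]) -> Counter:
    """Tally the second chunk of a hypothetical line starting at each word."""
    tally = Counter()
    for i in range(len(words)):
        # Join a few words so the second chunk can straddle a boundary,
        # as in "ex urbe" -> [e][xu][rbe], whose second chunk is [xu].
        chunks = chunk("".join(words[i:i + 3]))
        if len(chunks) > 1:
            tally[chunks[1]] += 1
    return tally
```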
If these effects seem too strong to be comparable to Voynichese, one way to weaken them would be to substitute this for rule #2:
(2) Within lines, the words that make up phrases are run together indiscriminately, but the phrases themselves are not run together.
The beginnings and ends of lines would still have heavily skewed statistical characteristics, but there would be fewer forms that could only be found there -- now limited to "chunks" that occur at beginnings and ends of individual words, but not at beginnings and ends of whole phrases.
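In terms of the earlier sketches, the revised rule only changes what gets passed to chunk(): each phrase is chunked as a unit rather than the whole line. This assumes, purely for illustration, that the plaintext arrives pre-segmented into phrases.

```python
def chunk_by_phrase(phrases: list[str]) -> list[str]:
    """Revised rule (2): run words together only within each phrase, so that
    phrase boundaries behave like miniature line boundaries mid-line."""
    out: list[str] = []
    for phrase in phrases:
        out.extend(chunk(phrase))  # chunk() itself strips the internal spaces
    return out
```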
Magnesium writes as follows about Voynichese LAAFU patterns:

(12-08-2025, 10:01 PM) magnesium Wrote:
> One of the things I want to explore is the extent to which the structure of the plaintext can create these biases within Naibbe ciphertext. For example, if the Naibbe cipher were used to encrypt a poem such as Dante's Divina Commedia, the poem's line-by-line structure would have rhyming, repeated phrases, etc. that would theoretically impose greater line-by-line positional biases in the frequencies of plaintext unigrams and bigrams relative to prose such as Pliny's Natural History. Is that sufficient to explain the full extent of the VMS's "line as a functional unit" properties? Maybe, maybe not. But maybe it becomes much easier to achieve "line as a functional unit" properties within a Naibbe-like ciphertext if the plaintext is a poem or poem-like in its structure.
There's certainly no harm in exploring that. But since one of his goals is to "(b) consistently replicate these properties [ = 'well-known VMS statistical properties' ] when encrypting a wide range of plaintexts in a well-characterized natural language," I assume he'd prefer to model a system that would reliably produce LAAFU effects when applied to any source text.
Just wondering: how difficult would it be to adapt the Naibbe approach from a unigram/bigram system to a syllabic "chunk" system? Might the frequencies of different "chunk" types result naturally in something like the frequency distributions simulated through playing cards?
![word_length_by_idx_in_line.png](https://raw.githubusercontent.com/pwspen/pykeedy/61e62c57136b52d5c9703ad41c35fe7e24963fce/examples/results/word_length_by_idx_in_line.png)