The Voynich Ninja
[split] What can the structural peculiarities of the VMS tell us about the nature of the underlying text




RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - dashstofsk - 10-05-2026

(09-05-2026, 12:44 AM)tavie Wrote: initial d word types are more common the closer you get to Line End

For observations like this I find it helpful to have plots of the actual variability. In Herbal A1 pages (quires 1 to 7), words starting with d become more frequent towards the end of the line. But this does not hold in every section of the manuscript. In quire 20 they are frequent at line start and then level off. In quire 13 they start high, then fall, then rise again.

   
   
   

Similarly, many other line anomalies are not uniform throughout the manuscript, but seem to differ from section to section.
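A curve of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the plots: the function name `initial_glyph_rate_by_position`, the binning scheme, and the toy lines are all assumptions. Real input would come from a transcription split into lines and words, processed one section at a time.

```python
def initial_glyph_rate_by_position(lines, glyph="d", bins=5):
    """Fraction of words starting with `glyph` in each relative line-position bin.

    `lines` is a list of lines, each a list of word strings.
    Relative position 0.0 is the first word of a line, 1.0 the last.
    """
    hits = [0] * bins
    totals = [0] * bins
    for words in lines:
        n = len(words)
        if n < 2:
            continue  # relative position is undefined for one-word lines
        for i, word in enumerate(words):
            b = min(int(i / (n - 1) * bins), bins - 1)
            totals[b] += 1
            if word.startswith(glyph):
                hits[b] += 1
    # None marks a bin that received no words at all
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy lines made up for illustration, not transcription data.
toy_lines = [
    ["daiin", "chol", "dy"],
    ["shol", "qokedy", "dar"],
]
rates = initial_glyph_rate_by_position(toy_lines, glyph="d", bins=3)
print(rates)
```

Running the same function separately on each section (Herbal A, quire 13, quire 20) would give one curve per section, which is what makes the differences between sections visible.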


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - MarcoP - 10-05-2026

Emma’s idea that there was a conscious or unconscious “rule” for treating the first position of each line in a special way focuses on the individual observation about the abnormal words at line start. It’s a puzzling scenario, but in recent years things have become even more interesting.

In his [link not shown], Patrick has shown that lines have biases that are not limited to the first and last positions. Using his visualization method, one can see, for example, that in Currier B the character sequences [link not shown] have roughly mirrored preferences, with peaks on the second and second-to-last positions respectively. (EDIT: this is the same style of visualization as in dashstofsk's post above, and I of course agree that it's extremely helpful; by developing this idea, Patrick has provided a powerful new tool for understanding the details of the text).
   
A consequence of these preferences is that lines where an occurrence of ‘she’ precedes an occurrence of ‘ot’ are twice as frequent as those where an occurrence of ‘ot’ precedes an occurrence of ‘she’. As Patrick has shown, the phenomenon is pervasive.
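The ordering count described here can be sketched as follows. This is a hedged illustration only: `count_orderings` is a hypothetical helper, the lines are plain strings in some transliteration, and the toy lines are invented, so the resulting numbers say nothing about the manuscript.

```python
def count_orderings(lines, a, b):
    """Count lines where some occurrence of `a` precedes some occurrence of `b`,
    and lines where some occurrence of `b` precedes some occurrence of `a`.

    A line containing both orders is counted on both sides.
    """
    a_first = 0
    b_first = 0
    for line in lines:
        if a not in line or b not in line:
            continue
        # earliest `a` before latest `b` means at least one a...b pair exists
        if line.find(a) < line.rfind(b):
            a_first += 1
        if line.find(b) < line.rfind(a):
            b_first += 1
    return a_first, b_first


# Toy lines made up for illustration, not transcription data.
toy_lines = [
    "shedy qotar",
    "shey otal shey",
    "otal shedy",
    "sheol otedy",
    "daiin",
]
she_then_ot, ot_then_she = count_orderings(toy_lines, "she", "ot")
print(she_then_ot, ot_then_she)
```

On real data the interesting quantity is the ratio of the two counts, computed per Currier language or per section.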
   

It also has a vertical component, where characters and bigrams show preferences for the top or bottom lines of paragraphs. As with line-start anomalies, the most blatant cases of paragraph preferences (like the behaviour of p) have been known for a long time.

In [link not shown], tavie added another piece to the puzzle by showing that line-start anomalies also follow vertical patterns, so that the character added at line N appears to depend on the character added at the start of line N+1.
As she wrote above:
(09-05-2026, 12:44 AM)tavie Wrote: the line start word appears to be impacted in many cases by the word above it

In particular, in Currier A, these line-initial (apparently added) characters are placed in such a way that they do not repeat at the start of two consecutive lines: e.g. [link not shown].

This is vaguely comparable to what can be seen in acrostics in some medieval manuscripts (where of course the first letter of line N+1 depends on the first letter of line N according to the same rules with which words are formed in that language, often with vowels alternating with consonants). Of course, Voynichese line-start characters are too limited to be simple acrostics.

Again, I believe that these phenomena may provide vital insights into how the text was created, but I strongly doubt that they have anything to do with a hypothetical underlying text.


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 10-05-2026

There are a lot of bigrams that tend to appear at the beginning of a line, and others that tend to appear at the end.


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 10-05-2026

Oh, and the line start markers are also involved; they might determine how things are filled in. Qo is more neutral in that regard, while SH is the most extreme. But I haven't been able to figure that out yet. So take that with a grain of salt.


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - Stefan Wirtz_2 - 10-05-2026

(10-05-2026, 06:14 AM)DG97EEB Wrote: Diane O'Donovan of course

Ok, I saw it now. Didn't have DoD on the radar here because she left this forum years ago, so I thought I had offended some other person with that name here.
I don't know DoD at all, and stopped reading her blog about a week ago when she again confused me with some other person(s). In general, she is "going south(east)" and promotes her theory that the VMS was written by some Arabs. I don't agree, so she dug out an older comment of mine to attack me "ad hominem".

But tavie is right, that does not belong in this thread.


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - MarcoP - 11-05-2026

At the end of his 2022 Malta Paper ([link not shown]) Patrick comments on possible explanations for the patterns he discussed:

Patrick Feaster Wrote:When it comes to explaining distribution patterns, there are various possibilities we might entertain. One is that each line of text corresponds to some unit of meaningfully patterned content, such as a grammatical sentence, a line of poetry, or an entry in a list. Exploratory studies of a few well-known works of poetry show that similar patterns can be detected in them, presumably due to a complex interplay of grammatical, metrical, and stylistic factors. In Virgil’s Aeneid, for example, if we compare homologous word pair sets ending [es] and [ibus], the [es] set has a midline average rightwardness score of 0.399 with 343 tokens, while the [ibus] set scores 0.708 with 390 tokens—a difference as stark as any presented above. Alternatively, we might hypothesize that distribution patterns arose as a byproduct of some method of encoding meaningful content rather than from the content itself. Here I’ll cite just one representative scenario. Fifteenth-century ciphers often sought to increase security by providing multiple options for encoding each plaintext character, and for this ploy to work as intended, a writer needed to alternate repeatedly among those options. One strategy for ensuring that happened would have been to favor different options in different areas of the page. Thus, there’s more than one angle from which we could try to explain distribution patterns, but the methods outlined above for identifying such patterns should be equally applicable to any and all prospective interpretations of them.

So, if the patterns do reflect structure in the underlying text, one possibility is that the line is indeed a functional unit, as in Virgil’s poetry, where sentences tend to begin at line start and end at line end (often on multiple lines, but the correlation is still highly significant). I don’t see other options for the structure we see to be an effect of structure in the underlying language. Personally, I find this scenario hard to believe, since the text is clearly formatted as medieval prose paragraphs.

The other example presented by Patrick is that of homophonic ciphers, so that an Ngram that is frequent on the left side of lines could encode the same plaintext sequence as another Ngram that is frequent on the right side: the original plaintext sequence would have a roughly constant frequency across line positions. More generally, if the scribe had different options for representing some plaintext sequences, he could have chosen among the options in a position-dependent way: we see this happen with Latin abbreviations, which in some manuscripts occur more frequently near the end of lines (though of course there are reasons to suspect that abbreviations do not play such a pervasive role in the VMS). The extreme case is that there is no underlying text: Voynichese is spontaneous gibberish (as discussed by [link not shown]) and the author had total freedom to choose what to “write” (all structure is the effect of their personal preferences). Given the overall consistency of the text, I guess that this scenario is easier to imagine with a single author whose output was copied by different scribes.
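The position-dependent homophonic scenario can be made concrete with a toy example. Everything here is an assumption made for illustration: the single plaintext letter "e", the two made-up cipher symbols "X" and "Y", and the rule "first half of the line versus second half". It is not a reconstruction of any actual fifteenth-century cipher or of Voynichese.

```python
def encode_line(plain_words, target="e", left="X", right="Y"):
    """Encode `target` as `left` in the first half of the line, `right` in the second."""
    n = len(plain_words)
    return [
        word.replace(target, left if i < n / 2 else right)
        for i, word in enumerate(plain_words)
    ]


def decode_line(cipher_words, target="e", left="X", right="Y"):
    """Both homophones map back to the same plaintext letter."""
    return [w.replace(left, target).replace(right, target) for w in cipher_words]


plain = ["tree", "seen", "here", "eel"]
cipher = encode_line(plain)
print(cipher)               # "X" only on the left, "Y" only on the right
print(decode_line(cipher))  # the original words again
```

In the ciphertext, "X" and "Y" each show a strong positional preference, while the plaintext "e" they both stand for is spread evenly across the line. The positional structure belongs to the encoding step, not to the underlying text.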


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 11-05-2026

Oh, funny, I just ended up there too... It's a position-dependent cipher, possibly monophonic... I had no idea that had already been suggested.


RE: Advice to Computational Voynichologists - Jorge_Stolfi - 12-05-2026

At the cost of sounding pretentious, please let me give some advice to all those who are studying the text.

When I was a PhD student at Stanford, I met a fellow Computational Geometer from Austria, who was absolutely shocked when he learned that, in the almost 10 years I had been there, I had never been to the famous Stanford Football Stadium to watch a game by the famous Stanford Football Team.  That, in fact, I had never watched a game of American football, not even on TV.  So he demanded that we go watch the next game there.  

I agreed to buy the tickets; but, since I had never done that, I chose what I thought were good seats -- right next to the field.

So we watched a full game of One-Dimensional Football -- where the field was just a green horizontal line, with white blobs running back and forth along it for some inscrutable reason.

So here is the advice: don't waste your time computing character-based statistics, such as character and n-gram frequencies, character and n-gram entropy, vertical or horizontal character correlations, etc.

The reason is the same reason for which you should not buy front-row tickets to a football or soccer game. Namely, what you will get is a projection of the full text statistics onto a very small numeric space.  This  projection will almost certainly mix up many distinct meaningful features of the text; and you will never be able to untangle that mess by staring at all the tables and plots.

Like, the frequency of "h" in English is the sum of the frequency of the "h" sound and the frequencies of digraphs "ch", "sh", "th", "ph", "rh", etc; and the frequencies of these sounds and digraphs were defined by multiple distinct historical and linguistic factors.  You will never  be able to separate and understand these factors by studying letter and n-gram statistics.

For one thing, all character-based statistics are highly dependent on the spelling system and transcription alphabet.  For instance, one could easily double or halve the entropy per character of a text by a trivial change of spelling (not even encryption), such as using digraphs instead of single letters for common sounds, or encoding the same sound in several different ways, randomly chosen. 
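The dependence on spelling is easy to check numerically. The sketch below is an illustration under stated assumptions, not a measurement on any real corpus: it estimates the conditional (second-order) entropy per character from adjacent pairs, and the "respelling" simply writes every character as a hypothetical two-symbol digraph (the character followed by a primed copy of itself). The sample sentence is arbitrary.

```python
import math
from collections import Counter


def conditional_entropy(symbols):
    """Estimate H(X_n | X_{n-1}) in bits from adjacent symbol pairs."""
    pairs = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, _b), n_ab in pairs.items():
        # p(a, b) * log2 p(b | a)
        h -= (n_ab / total) * math.log2(n_ab / firsts[a])
    return h


def respell_with_digraphs(text):
    """Trivial respelling: every character c becomes the two symbols c, c'."""
    out = []
    for ch in text:
        out.extend([ch, ch + "'"])
    return out


text = "the quick brown fox jumps over the lazy dog and the cat"
h_original = conditional_entropy(list(text))
h_respelled = conditional_entropy(respell_with_digraphs(text))
print(h_original, h_respelled)
```

The second symbol of each digraph is fully predictable from the first, so it contributes zero bits. For a text of n characters the respelled value is exactly (n-1)/(2n-1) times the original, that is, roughly half, although the content is unchanged.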

Apart from that, character-based statistics are highly dependent on the nature and topic of the text.  There is no such thing as "the frequency of 'e' in English" or "the most common digraph in Latin".  The statistics of a character or n-gram are determined by the statistics of the most common words that contain it; and word statistics -- even of "function" words like "the", "of", "and", "is" -- are strongly dependent on topic.  

After all, when one is writing, one is constantly choosing words -- not characters.

In a booklet that explains how to make sixty six luxury boxes for mixed toxic wax halluxes with an ax, label them with Roman numerals, and then relax, character-based statistics will reveal an abnormally high frequency of the letter "x". Further investigation may reveal a puzzling absence of the digraph "ex" in spite of "e" and "x" being very common.  But what conclusions could one draw from such data? 

One may easily draw wrong conclusions, like that the pamphlet is in an original "language A" (dialect, encoding, etc), clearly distinct from the "language B" of that other document that advises against having sex in your vexillologist ex's triplex in Texas while texting an explanation of how to extract the index of a vertex in a convex cell complex.

A concrete example of misdirection due to character statistics is the observation that, in the VMS, words that begin with qo are more common at line-start than elsewhere.  Now we know that one cause of that anomaly has nothing to do with those characters: it is just the bias towards longer words in that position that is created by the process of breaking text into right-justified lines -- and it so happens that qo-words are longer than average.  (This side effect of line breaking may not be enough to explain all the anomalies of qo-words; but it is still not known whether there is anything beyond it.)
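The line-breaking effect can be reproduced with a small simulation. This is a sketch under simple assumptions, not a model of the scribe: words are random strings with lengths drawn uniformly from 1 to 10, the line width is 40 characters, and a word moves to the next line whenever it does not fit. The `wrap` and `mean_lengths` helpers are hypothetical names.

```python
import random


def wrap(words, width):
    """Greedy line breaking: start a new line when the next word does not fit."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines


def mean_lengths(lines):
    """Mean length of line-initial words versus all other words."""
    # Skip the very first line: its first word was not placed there by a line break.
    first = [len(line[0]) for line in lines[1:]]
    rest = [len(w) for line in lines for w in line[1:]]
    return sum(first) / len(first), sum(rest) / len(rest)


rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["x" * rng.randint(1, 10) for _ in range(20000)]
lines = wrap(words, 40)
first_mean, rest_mean = mean_lengths(lines)
print(first_mean, rest_mean)
```

The words carry no content at all, yet line-initial words come out longer on average, because a long word is more likely than a short one to overflow the remaining space and be pushed to the next line. Any class of words that happens to be longer than average will inherit this bias.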

In conclusion: if you must do statistical analysis of the text, don't focus on characters, focus on words.  And if you need to merge words into word classes (say, to reduce the volume of the results or the sampling noise), try to use classes that are defined by some semantic criterion (like co-occurrence), not by the characters that occur in them (like "qo-words" or "words with gallows").

All the best, --stolfi

PS. My Austrian friend was of course furious about my seat choice, and demanded a replay, with him buying the tickets that time.  

So we watched the next football game at the Stanford Stadium from a place with superb view of the field.  

At night. 

A cold night. 

A cold and rainy night.

But I don't know what VMS-relevant advice I could draw from this part of the story.  "Don't let an Austrian friend decide which statistics you should compute?"  Duhh ...


RE: Advice to Computational Voynichologists - JoJo_Jost - 12-05-2026

(12-05-2026, 08:02 AM)Jorge_Stolfi Wrote: In conclusion: if you must do statistical analysis of the text, don't focus on characters, focus on words.  And if you need to merge words into word classes (say, to reduce the volume of the results or the sampling noise), try to use classes that are defined by some semantic criterion (like co-occurrence), not by the characters that occur in them (like "qo-words" or "words with gallows").

Well - if only we knew that the word boundaries in the VMS are actually word boundaries. You can cover up to 82 percent of all spaces with just 8 simple rules (I’ll post those rules later in another thread) - which really goes against the idea that they’re normal word boundaries....


RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - ReneZ - 12-05-2026

We don't know what the individual characters are, and we don't know what the individual words are.

Any serious progress on the meaning of the text has to look at both. The main advantage of the characters is that there are many, many more of them, so statistics are more stable. They are also easy to 'play with'.
It is possible to create substitutions whereby the unusual bigram statistics are completely normalised.

With respect to words, it is an open question to me, whether it is possible to create a Voynich dictionary, in which every Voynich MS word type can be matched to one word in a single language, such that the corresponding substitution leads to a mostly meaningful text. I rather think that this is not possible. (Please note: "I rather think").

If the answer to this question were indeed: "no", then the vast majority of proposed solutions fail, because they rely on this.
The so-called Chinese Hypothesis (which isn't a proposed solution yet) would also be a victim.