[split] What can the structural peculiarities of the VMS tell us about the nature of the underlying text

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 12-05-2026

Oh, I just noticed: Patrick Feaster (Griffonage 2020) provided a list of 7 rules that predict ~95% of mid-word breaks.

I calculated it a little differently, taking all spaces into account, but I get essentially the same results. Well, then I don’t need to publish that.

There is no language (at least none that I know of) that produces something like this. In my opinion, this is very clear evidence that spaces are not “real” word boundaries (except perhaps under Stolfi's Chinese approach).

I see two reasons for this, but perhaps others have more:

1. The actual encryption takes place at a deeper level (whether as a cipher or a generator); the spaces are a diversionary tactic.
2. The spaces are part of a cipher.

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - dashstofsk - 12-05-2026

I still think you are thinking too hard.

You have told us recently: it is Bavarian, at least two writers each spoke a different dialect of it; word boundaries are not word boundaries; spaces carry structural information; line start characters indicate how the rest of the line is encoded; it is a position-dependent cipher; it is monophonic; you are "sure" that it is a hybrid cipher-language system where letters or bigrams are substituted mechanically and rule-based.

You seem to have invented one of the most complex theories on this forum. But remember that having written it the writer ( or writers, or their patron ) would then have needed to be able to read it ( otherwise then what was it all for ? ). It would have been nice for them to be able to read it fluently. But this manuscript which took so much trouble to encrypt would now have to take a lot of effort to decrypt and read.

People have tried to be kind to you and drop hints that you might be wasting your time. But you don't seem to be listening.

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - tavie - 12-05-2026

(12-05-2026, 08:02 AM)Jorge_Stolfi Wrote: A concrete example of misdirection due to character statistics is the observation that, in the VMS, words that begin with qo are more common at line-start than elsewhere. Now we know that one cause of that anomaly has nothing to do with those characters: it is just the bias towards longer words in that position that is created by the process of breaking text into right-justified lines -- and it so happens that qo-words are longer than average. (This side effect of line breaking may not be enough to explain all the anomalies of qo-words; but it is still not known whether there is anything beyond it.)

We do not know this.

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 12-05-2026

(12-05-2026, 01:28 PM)dashstofsk Wrote: People have tried to be kind to you and drop hints that you might be wasting your time. But you don't seem to be listening.

Thanks, thanks for your help - but: :D

I’m not wasting my time! I’m having a very, very good time with the VMS. Just think of everything I’ve learned over the past 7 months, just like that, on the side: a lot about the Middle Ages, cryptography, scripts, plants, medieval recipe books, Middle High German, and 15th-century Bavaria. And above all, my English has gotten much better again.

By the way, this wouldn’t be the first major mystery that’s remained unsolved for centuries that I’ve cracked… and if I’ve learned one thing in the process, it’s

1. Persistence, and looking far away from where everyone else is looking. Because the truth is rarely found where everyone else is looking… there’s a deep logic to that.

2. The more your “fellow researchers” cast doubt, the closer you are to the truth. Well... it’s not quite that simple, that would be wrong, but maybe you get my point...

So I’m having a lot of fun, and believe me, my ideas are much closer to a possible reality of the 15th century than about 90 percent of the other theories here... especially the hoax theories... :D

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - dashstofsk - 12-05-2026

(12-05-2026, 08:02 AM)Jorge_Stolfi Wrote: words that begin with qo are more common at line-start

This is so only for Herbal A1 ( quires 1 to 7 ). Even so, in those pages qo continues to fall in frequency. It is not just a first word effect. In quire 13 the frequency seems to be level. Quire 20 just does not seem to like qo at the start of a line.

[Attached plots, 12-05-2026, 02:12 PM: "Prefix 101-4o" for Herbal A1, quires 1 to 7 ( 611 ); for quire 13 ( 1618 ); and for quire 20 ( 1829 )]

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - tavie - 12-05-2026

Exactly. And that's why it is important to look at the different sections because they show wildly different behaviour. This is the level I try to study it on. I'd go even more specific - e.g. individual pages or paragraphs - but the data samples get a bit too small to work with.

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - quimqu - 12-05-2026

(12-05-2026, 02:10 PM)JoJo_Jost Wrote: this wouldn’t be the first major mystery that’s remained unsolved for centuries that I’ve cracked…

What mysteries have you cracked? Just curious...

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - JoJo_Jost - 12-05-2026

(12-05-2026, 03:26 PM)quimqu Wrote:
(12-05-2026, 02:10 PM)JoJo_Jost Wrote: this wouldn’t be the first major mystery that’s remained unsolved for centuries that I’ve cracked…

What mysteries have you cracked? Just curious...

Oh, I just wanted to show off... :D

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - Jorge_Stolfi - 12-05-2026

(12-05-2026, 01:55 PM)tavie Wrote:
(12-05-2026, 08:02 AM)Jorge_Stolfi Wrote: words that begin with qo are more common at line-start than elsewhere. Now we know that one cause of that anomaly has nothing to do with those characters: it is just the bias towards longer words in that position that is created by the process of breaking text into right-justified lines -- and it so happens that qo-words are longer than average. (This side effect of line breaking may not be enough to explain all the anomalies of qo-words; but it is still not known whether there is anything beyond it.)
We do not know this.

We do know that the mere act of formatting a text into paragraphs enhances the frequency at line-start of words that are longer than average; and, therefore, depresses the frequency of words that are shorter than average.

This line-breaking length bias (LBB) is not a theory; it is predicted by math and has been verified experimentally. It is also independent of the language, topic, spelling system, encoding, etc. And of the characters in those words. (Provided that the decision to break is based on the physical distance between the text rails and the physical length of the next word; or at least on their character counts -- not on some fixed max count of words per line.)
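The effect can be reproduced with a small simulation. This is only a sketch under stated assumptions: words are random strings with lengths uniform in 1 to 10, character counts stand in for physical width, and lines are filled greedily to a fixed width of 40 characters. The names `wrap` and `mean_len` are made up for this illustration.

```python
import random

def wrap(words, width):
    """Greedy line filling: start a new line when the next word does not fit."""
    lines, current, used = [], [], 0
    for w in words:
        need = len(w) if not current else used + 1 + len(w)  # +1 for the space
        if current and need > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines

def mean_len(ws):
    return sum(len(w) for w in ws) / len(ws)

rng = random.Random(0)
# Assumption: word lengths are independent and uniform in 1..10.
words = ["x" * rng.randint(1, 10) for _ in range(20000)]
lines = wrap(words, width=40)

# Skip the very first line: its first word was not selected by a line break.
first_words = [ln[0] for ln in lines[1:]]
overall = mean_len(words)
at_line_start = mean_len(first_words)
print(f"mean word length overall:       {overall:.2f}")
print(f"mean word length at line-start: {at_line_start:.2f}")
```

The words carry no content at all here, so any difference between the two means comes from the line breaking alone.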

Therefore, if qo-words are indeed longer than average and increasingly common at line-start, the LBB is (as I wrote) one of the causes of that increase. In sections where qo-words are less common at line-start, perhaps they are shorter than average?

Is the LBB the only cause of line-start qo anomalies? Of all line-start anomalies? I don't know; but any investigation of those anomalies should estimate the LBB, and subtract it, before claiming that "something else" is going on.
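One possible way to estimate and subtract the LBB is sketched below. It is an illustration, not an established procedure: it assumes that character counts approximate physical width, that all lines have the same width, and that shuffling the words of a section is an acceptable null model (it destroys word-order effects and keeps only the length effect). The helper names and the toy word list are hypothetical.

```python
import random

def wrap(words, width):
    """Greedy line filling: start a new line when the next word does not fit."""
    lines, current, used = [], [], 0
    for w in words:
        need = len(w) if not current else used + 1 + len(w)
        if current and need > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines

def line_start_fraction(lines, pred):
    """Fraction of line-initial words (first line excluded) that satisfy pred."""
    firsts = [ln[0] for ln in lines[1:]]
    return sum(1 for w in firsts if pred(w)) / len(firsts)

def lbb_baseline(words, pred, width, trials=50, seed=0):
    """Expected line-start fraction when only word length decides the breaks."""
    rng = random.Random(seed)
    pool = list(words)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(pool)
        total += line_start_fraction(wrap(pool, width), pred)
    return total / trials

def is_qo(word):
    return word.startswith("qo")

# Toy data (made up): qo-words are longer than the other words.
rng = random.Random(1)
toy_words = [rng.choice(["qokeedy", "ol"]) for _ in range(2000)]
observed_lines = wrap(toy_words, width=20)

observed = line_start_fraction(observed_lines, is_qo)
baseline = lbb_baseline(toy_words, is_qo, width=20)
excess = observed - baseline  # what is left after subtracting the LBB
print(f"observed {observed:.3f}  baseline {baseline:.3f}  excess {excess:+.3f}")
```

In the toy data the qo-words make up about half of all words but well over half of the line-initial words, and the excess over the baseline is close to zero because nothing except length is at work.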

The LBB can also cause anomalies beyond the first word of the line. Suppose that, in some text, words that are longer than average tend to be followed by qo-words more than expected by chance. Then the LBB will cause qo-words to be more common than expected as the second word of each line.

All the best, --stolfi

RE: [split] What can the structural peculiarities of the VMS tell us about the nature ... - Jorge_Stolfi - 12-05-2026

(12-05-2026, 09:12 AM)ReneZ Wrote: We don't know what are individual characters, and we don't know what are individual words.

I agree, to the extent that there seems to be a lot of "random" noise in the word spaces. Much more than what is marked with commas.

(By "random" I do not mean like coin or die tossings, but like the color of the label on a bottle of wine: that is, mostly unrelated to the information that matters.)

But there is also evidence that the word spaces are mostly not random noise. Like the fact that labels are mostly one "word" and have the same internal structure as text words. Or the way that (LAAFU anomalies notwithstanding) line breaks and breaks around intruding figures generally look like word spaces.

Or the statistics of similar words. In an English text, the frequencies of the words "other", "another", "bother", "brother", and "mother" will usually not be proportional to those of "others", "anothers", "bothers", "brothers", and "mothers". While the analogous statistics of VMS words are noticeably "mushier", those asymmetries are also seen there:

| English (War of the Worlds)  | VMS (Starred Parags)   |
|  58 other     10 others      |  17 taiin   63 otaiin  |
|  63 another    0 anothers    |  27 kaiin   76 okaiin  |
|   1 bother     0 bothers     |   7 laiin   11 olaiin  |
|  91 brother    0 brothers    |  21 raiin    6 oraiin  |
|   3 mother     1 mothers     |  20 saiin    1 osaiin  |
|                              |  93 daiin   12 odaiin  |

The word counts on the left are from Wells's novel War of the Worlds; those on the right are from the Starred Parags section of the VMS. The former are not proportional because the semantics of the roots are quite different, and the meaning of the "-s" inflection is different for each word. (The main character in the novel has only one brother, and their mother does not appear in the story.)
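To make "not proportional" concrete, here is a small sketch that takes the counts from the table above and computes, for each pair, the share of the derived form ("-s" on the left, "o-" on the right). If the counts were proportional, the share would be the same in every row. The function name `derived_share` is only for this illustration.

```python
# Counts copied from the table above: word -> (base form, derived form).
english = {
    "other": (58, 10), "another": (63, 0), "bother": (1, 0),
    "brother": (91, 0), "mother": (3, 1),
}
vms = {
    "taiin": (17, 63), "kaiin": (27, 76), "laiin": (7, 11),
    "raiin": (21, 6), "saiin": (20, 1), "daiin": (93, 12),
}

def derived_share(pairs):
    """Fraction of each pair's total count taken by the derived form."""
    return {w: d / (b + d) for w, (b, d) in pairs.items()}

english_shares = derived_share(english)
vms_shares = derived_share(vms)

for label, shares in (("English, -s", english_shares), ("VMS, o-", vms_shares)):
    print(label)
    for word, share in shares.items():
        print(f"  {word:8s} {share:.3f}")
```

Rows with very small totals (such as "bother") carry little weight on their own, so the shares are only a first look, not a significance test.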

When one computes character statistics, all the semantic content of the words is discarded. Each count ends up being the sum of counts from hundreds of words of totally unrelated meanings and grammatical functions that happen to use that letter or digraph.

With an English text, even the "-s" of plural will be conflated with the "-s" of 3rd person verbs, and with the "-s" of singular words like "gas" and "thus". So, it is very unlikely that character-level statistics will ever give any useful insight into the grammar of the language.
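A tiny illustration of that summing, using only the English word counts from the table earlier in this post (a real text would of course have hundreds of contributing words). The function name `char_breakdown` is made up for this sketch.

```python
# English counts from the War of the Worlds table earlier in this post.
word_counts = {
    "other": 58, "others": 10, "another": 63, "anothers": 0,
    "bother": 1, "bothers": 0, "brother": 91, "brothers": 0,
    "mother": 3, "mothers": 1,
}

def char_breakdown(counts, ch):
    """Contribution of each word type to the total count of character ch."""
    parts = {w: n * w.count(ch) for w, n in counts.items()}
    return {w: c for w, c in parts.items() if c > 0}

b_parts = char_breakdown(word_counts, "b")
s_parts = char_breakdown(word_counts, "s")
print("b:", b_parts, "total", sum(b_parts.values()))
print("s:", s_parts, "total", sum(s_parts.values()))
```

Within this small sample the count of "b" is almost entirely the count of "brother", so it says something about one word in the story and nothing about the letter.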

Quote:Any serious progress in the meaning of the text has to look at both. The main advantage of the characters is that there are many, many more, so statistics are more stable.

But since each character count is the sum of the counts of a "random" set of words, character statistics end up having more noise than word-level ones. One could say that they are almost 100% noise...

Again: to reduce the sampling noise without losing the meaningful information, one should try to combine counts of words that are most likely to have similar grammatical and/or semantic roles. Based, for instance, on their correlations with other words. But not because they use the same characters or n-grams...
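One way such correlation-based grouping could be sketched: describe each word type by the counts of its immediate left and right neighbours, and compare word types by the cosine similarity of those profiles. This is just one possible reading of "correlations with other words"; the toy sentence, the window of one word, and the function names are all assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens):
    """For each word type, count its immediate left (L) and right (R) neighbours."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if i > 0:
            vecs[w][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens):
            vecs[w][("R", tokens[i + 1])] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy token stream (made up): "cat" and "dog" occur in the same contexts.
tokens = "the cat sat the dog sat the cat ran the dog ran".split()
vecs = context_vectors(tokens)

sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_sat = cosine(vecs["cat"], vecs["sat"])
print(f"cat~dog {sim_cat_dog:.2f}   cat~sat {sim_cat_sat:.2f}")
```

Word types with high similarity would be candidates for having their counts combined, regardless of which characters they are spelled with.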

Quote:They are also easy to 'play with'.

Yes, that is a problem. Did I tell you the Educational Joke about the drunkard and his keys?

But sure, when tackling a new text in an unknown language and script, which may be encrypted, one should start by computing character and n-gram statistics, among other things. With luck, those statistics will be useful hints leading to correct guesses about the language, script, and encoding. That, quite rightly, must have been the first thing that Friedman did with the punched cards.

But there was no such luck. In the past 80 years, all we learned from character statistics is that the solution is very unlikely to come from them. They ruled out any simple encoding of any "European" language, including Hebrew, Arabic, Turkish, and many more.

And yet other statistics (like the asymmetries above) all pointed towards a natural language, in a spelling and encoding that was mostly one-to-one on word types. It could be a codebook-cipher with a vaguely Roman-like number system. Or plaintext, but with words extremely abbreviated or split into chunks of bounded size. Or an invented "philosophical" language. Maybe a few other possibilities. Either way, I can't see how character-level statistics could help find the solution...

Quote:It is possible to create substitutions whereby the unusual bigram statistics are completely normalised.

I don't quite understand this claim. But wouldn't such "normalization" simply throw away the little useful information that survives in the character and bigram statistics?

One "normalization" that I think we should all do is to replace ir and m by iin before any analysis. Maybe even ar by ain and then ain and aiiin by aiin. And it will probably also help to map some rare glyphs like u b g to their nearest common glyphs, and Cs to Sh, and CTHh to CThe, and Ih to Ch, etc. If these glyphs are indeed separate letters, this merging will do little harm because those letters are fairly rare. If they are not, the merging will remove one source of noise, simplify the character statistic tables, and increase the chances of identifying the function and meaning of specific words.
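A minimal sketch of how these merges could be applied to a transcription, one word at a time. The rule lists are copied from the paragraph above; everything else is an assumption of this example: plain substring replacement anywhere in the word, the order of the rules, and the split into a core set and two optional sets. The mapping of u, b, g is left out because their target glyphs are not specified.

```python
# Rules from the paragraph above, applied in this order (the order is an assumption).
CORE_RULES = [("ir", "iin"), ("m", "iin")]
OPTIONAL_RULES = [("ar", "ain"), ("aiiin", "aiin"), ("ain", "aiin")]
# Glyph merges copied verbatim; they presume a case-sensitive transcription alphabet.
GLYPH_RULES = [("Cs", "Sh"), ("CTHh", "CThe"), ("Ih", "Ch")]

def normalize(word, optional=False, glyphs=False):
    """Apply the merging rules to one transcribed word by substring replacement."""
    rules = list(CORE_RULES)
    if optional:
        rules += OPTIONAL_RULES
    if glyphs:
        rules += GLYPH_RULES
    for old, new in rules:
        word = word.replace(old, new)
    return word

for w in ["dair", "dam", "dar", "dain", "daiiin", "daiin"]:
    print(f"{w:7s} core: {normalize(w):7s} core+optional: {normalize(w, optional=True)}")
```

Word counts would then be recomputed on the normalized words, so that variants merged by the rules are counted as one word type.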

Quote:With respect to words, it is an open question to me, whether it is possible to create a Voynich dictionary, in which every Voynich MS word type can be matched to one word in a single language, such that the corresponding substitution leads to a mostly meaningful text. I rather think that this is not possible. (Please note: "I rather think").

I think so too, but probably for somewhat different reasons.

I suspect that, besides the word space noise, the text has a rather large incidence of scribal errors -- including wrong spellings, omitted, duplicated, and transposed words, etc. Maybe entire lines were skipped.

One hint at this problem is the three mega-paragraphs on pages [link hidden] (bottom 2/3), [link hidden] (top 2/3) and [link hidden] (top half). The stars in the margin strongly suggest that each of these text blocks is a dozen normal parags that were smashed together by the Scribe, without the due parag breaks. To me that says that the Scribe did not put much effort into getting the text right. (I have what I think is stronger evidence of his sloppiness, but I can't discuss it here.)

Here is a scenario that could explain that sloppiness. Imagine that the Author's eyesight is very poor (like Marci's apparently was when he sent the book to Kircher). He was still able to teach the Voynichese alphabet to the Scribe, using a large enough "font". But he could not make out those small glyphs on the VMS; so he had to trust the Scribe. And the Scribe knew this, so he did not put much care into his job. Instead of going back and forth between draft and vellum one word at a time, he would quickly read maybe 5-6 words at a time, then write them down in one go. Like we are constantly tempted to do when we try to transcribe the VMS. Thus he often swapped a k with a t, added or skipped an e or i, added or skipped the plume of Sh or the ligature on Ch...

Quote:If [we cannot build a word-for-word dictionary from Voynichese to some known language], then the vast majority of proposed solutions fail, because they rely on this.

Indeed, it seems that many people here have assumed (consciously or unconsciously) that the text is mostly error-free. Maybe because they assume that it is encrypted?

Quote:The so-called Chinese Hypothesis (which isn't a proposed solution yet) would also be a victim.

The Chinese Theory in fact predicts that there will be many errors, not just by the Scribe but by the Author too. Like there would be in any text in a poorly-known language that is written down under dictation.

And that is indeed a problem for the acceptance of the SPS=SBJ theory, because the need to allow for such errors is seen by skeptics as convenient "slack" that would allow it to "work" even if the text is not the SPS and the language is not Chinese. I believe that is definitely not the case, but I see that it is hard to get this point across. I am still working on that...

All the best, --stolfi