(12-05-2026, 09:12 AM)ReneZ Wrote: We don't know what are individual characters, and we don't know what are individual words.
I agree, to the extent that there seems to be a lot of "random" noise in the word spaces. Much more than what is marked with commas.
(By "random" I do not mean like coin or die tossings, but like the color of the label on a bottle of wine: that is, mostly unrelated to the information that matters.)
But there is also evidence that the word spaces are mostly not random noise. Like the fact that labels are mostly one "word" and have the same internal structure as text words. Or the way that (LAAFU anomalies notwithstanding) line breaks and breaks around intruding figures generally look like word spaces.
Or the statistics of similar words. In an English text, the frequencies of the words "other", "another", "bother", "brother", and "mother" will usually not be proportional to those of "others", "anothers", "bothers", "brothers", and "mothers". While the analogous statistics of VMS words are noticeably "mushier", those asymmetries also are seen there:
|  58 other      10 others    |  17 taiin   63 otaiin
|  63 another     0 anothers  |  27 kaiin   76 okaiin
|   1 bother      0 bothers   |   7 laiin   11 olaiin
|  91 brother     0 brothers  |  21 raiin    6 oraiin
|   3 mother      1 mothers   |  20 saiin    1 osaiin
|                             |  93 daiin   12 odaiin
The word counts on the left are from Wells's novel War of the Worlds; those on the right are from the Starred Parags section of the VMS. The former are not proportional because the semantics of the roots are quite different, and the meaning of the "-s" inflection is different for each word. (The main character in the novel has only one brother, and their mother does not appear in the story.)
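Tabulating that kind of asymmetry takes only a few lines. Here is a minimal Python sketch; the sample sentence, the root list, and the function names are my own illustrations, not the counts used above:

```python
from collections import Counter
import re

def word_counts(text):
    """Count lowercase word tokens in a plain text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def suffix_asymmetry(counts, roots, suffix="s"):
    """Pair each root's count with the count of root + suffix."""
    return {r: (counts[r], counts[r + suffix]) for r in roots}

sample = "my brother and his brother met their mother; the others went home"
counts = word_counts(sample)
print(suffix_asymmetry(counts, ["brother", "mother", "other"]))
```

The same pairing works on a Voynichese transcription by treating a prefix (e.g. "o-") instead of a suffix.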
When one computes character statistics, all the semantic content of the words is discarded. Each count ends up being the sum of counts from hundreds of words of totally unrelated meanings and grammatical functions that happen to use that letter or digraph.
With an English text, even the "-s" of plural will be conflated with the "-s" of 3rd person verbs, and with the "-s" of singular words like "gas" and "thus". So, it is very unlikely that character-level statistics will ever give any useful insight into the grammar of the language.
Quote:Any serious progress in the meaning of the text has to look at both. The main advantage of the characters is that there are many, many more, so statistics are more stable.
But since each character count is the sum of the counts of a "random" set of words, character statistics end up having more noise than word-level ones. One could say that they are almost 100% noise...
Again: to reduce the sampling noise without losing the meaningful information, one should try to combine counts of words that are most likely to have similar grammatical and/or semantic roles. Based, for instance, on their correlations with other words. But not because they use the same characters or n-grams...
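One toy way to read "correlations with other words" is to give each word a context vector (which other words share a line with it) and compare those vectors. This is only a sketch of the idea; the sample lines and the choice of cosine similarity are my own assumptions:

```python
import math
from collections import Counter

def context_vectors(lines):
    """For each word, count which other words share a line with it."""
    vecs = {}
    for line in lines:
        words = line.split()
        for w in words:
            c = vecs.setdefault(w, Counter())
            for other in words:
                if other != w:
                    c[other] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

lines = ["daiin chol daiin", "okaiin chol shey", "daiin shey chol"]
v = context_vectors(lines)
print(round(cosine(v["daiin"], v["okaiin"]), 2))
```

Words whose vectors are highly similar are candidates for having similar grammatical or semantic roles, and their counts could then be pooled.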
Quote:They are also easy to 'play with'.
Yes, that is a problem. Did I tell you the Educational Joke about the drunkard and his keys?
But sure, when tackling a new text in unknown language and script, which may be encrypted, one should start by computing character and n-gram statistics, among other things. With luck, those statistics will be useful hints leading to correct guesses about the language, script, and encoding. That, quite rightly, must have been the first thing that Friedman did with the punched cards.
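For completeness, here is what that first step amounts to in code: a character n-gram count over a transcription, ignoring spaces. The sample string is my own, chosen only to show the shape of the output:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count character n-grams in a text, ignoring word spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

sample = "daiin okaiin daiin"
print(ngram_counts(sample, 2).most_common(3))
```

Run with n = 1 for single-character frequencies, n = 2 for digraphs, and so on.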
But there was no such luck. In the past 80 years, all we learned from character statistics is that the solution is very unlikely to come from them. They ruled out any simple encoding of any "European" language, including Hebrew, Arabic, Turkish, and many more.
And yet other statistics (like the asymmetries above) all pointed towards a natural language, in a spelling and encoding that was mostly one-to-one on word types. It could be a codebook-cipher with a vaguely Roman-like number system. Or plaintext, but with words extremely abbreviated or split into chunks of bounded size. Or an invented "philosophical" language. Maybe a few other possibilities. Either way, I can't see how character-level statistics could help find the solution...
Quote:It is possible to create substitutions whereby the unusual bigram statistics are completely normalised.
I don't quite understand this claim. But wouldn't such "normalization" simply throw away the little useful information that survives in the character and bigram statistics?
One "normalization" that I think we should all do is to replace ir and m by iin before any analysis. Maybe even ar by ain, and then ain and aiiin by aiin. And probably it will help also to map some rare glyphs like u b g to their nearest common glyphs, and Cs to Sh, and CTHh to CThe, and Ih to Ch, etc. If these glyphs are indeed separate letters, this merging will do little harm because those letters are fairly rare. If they are not, the merging will remove one source of noise, simplify the character statistic tables, and increase the chances of identifying the function and meaning of specific words.
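The common-sequence merges can be implemented as an ordered list of string substitutions. This is only a sketch: the exact table entries and the order of application are my reading of the suggestion above, and the rare-glyph merges (u, b, g, Cs, CTHh, Ih) are omitted:

```python
# Ordered merge table over an EVA-like transcription.
# Order matters: ar -> ain must run before ain -> aiin
# so that ar ultimately normalizes to aiin as well.
MERGES = [
    ("ir", "iin"),
    ("m", "iin"),       # m is usually word-final in the transcriptions
    ("aiiin", "aiin"),  # normalize the number of i strokes
    ("ar", "ain"),
    ("ain", "aiin"),
]

def normalize(word):
    """Apply each merge, in table order, everywhere in the word."""
    for pat, rep in MERGES:
        word = word.replace(pat, rep)
    return word

print([normalize(w) for w in ["okam", "okair", "daiiin", "okar"]])
```

After this step, okam, okair, and okar all collapse onto okaiin, so their counts can be pooled.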
Quote:With respect to words, it is an open question to me, whether it is possible to create a Voynich dictionary, in which every Voynich MS word type can be matched to one word in a single language, such that the corresponding substitution leads to a mostly meaningful text. I rather think that this is not possible. (Please note: "I rather think").
I think so too, but probably for somewhat different reasons.
I suspect that, besides the word space noise, the text has a rather large incidence of scribal errors -- including wrong spellings, omitted, duplicated, and transposed words, etc. Maybe entire lines were skipped.
One hint at this problem is the three mega-paragraphs, on three pages, that fill respectively the bottom 2/3, the top 2/3, and the top half of their pages. The stars in the margin strongly suggest that each of these text blocks is a dozen normal parags that were smashed together by the Scribe, without the due parag breaks. To me that says that the Scribe did not put much effort into getting the text right. (I have what I think is stronger evidence of his sloppiness, but I can't discuss it here.)
Here is a scenario that could explain that sloppiness. Imagine that the Author's eyesight was very poor (like Marci's apparently was when he sent the book to Kircher). He was still able to teach the Voynichese alphabet to the Scribe, using a large enough "font". But he could not make out those small glyphs on the VMS; so he had to trust the Scribe. And the Scribe knew this, so he did not put much care into his job. Instead of going back and forth between draft and vellum one word at a time, he would quickly read maybe 5-6 words at a time, then write them down in one go. Like we are constantly tempted to do when we try to transcribe the VMS. Thus he often swapped a k with a t, added or skipped an e or i, added or skipped the plume of Sh or the ligature on Ch...
Quote:If [we cannot build a word-for-word dictionary from Voynichese to some known language], then the vast majority of proposed solutions fail, because they rely on this.
Indeed, it seems that many people here have assumed (consciously or unconsciously) that the text is mostly error-free. Maybe because they assume that it is encrypted?
Quote:The so-called Chinese Hypothesis (which isn't a proposed solution yet) would also be a victim.
The Chinese Theory in fact predicts that there will be many errors, not just by the Scribe but by the Author too. Like there would be in any text in a poorly-known language that is written down under dictation.
And that is indeed a problem for the acceptance of the SPS=SBJ theory, because the need to allow for such errors is seen by skeptics as convenient "slack" that would allow it to "work" even if the text is not the SPS and the language is not Chinese. I believe that is definitely not the case, but I see that it is hard to get this point across. I am still working on that...
All the best, --stolfi