The 'Chinese' Theory: For and Against

RE: The 'Chinese' Theory: For and Against

Jorge_Stolfi > 27-06-2026, 03:51 AM
(27-06-2026, 01:06 AM)ReneZ Wrote:
(26-06-2026, 04:12 PM)Jorge_Stolfi Wrote: But that is not true for the Chinese script -- because it is not phonetic! In Chinese writing, each symbol (or pair of symbols) represents directly a concept.
It is not that simple. There is a large group of characters which consist of a radical plus a sound element. At some point in time, for some version of the language, these sounds helped define the character set.

Yes, originally the characters stood for words. Some 80-90% of the characters now include a phonetic element; but that element must have been added later, since it is itself an ideographic character.

However, according to GAI (Google AI) the phonetic elements were accurate only for some Old Chinese language spoken between 1200 BCE and 200 BCE, when the majority of the characters were created. But then the characters remained frozen while the spoken languages evolved and diverged.

Thus today the "phonetic" component gives the correct Mandarin pronunciation for somewhat less than 30% of the characters. For Cantonese it is a bit more than that. It gives an incorrect but still somewhat useful pronunciation hint for another 40-45% of the characters.

An example of the latter is 怕 = "to fear" or "afraid", which has semantic component 忄 = "heart" or "emotion" and phonetic component 白 = "white".
- In Mandarin, 白 is bái while 怕 is pà.
- In Cantonese, 白 is baak6 while 怕 is paa3.
GAI says that the phonetic element still gives the proper rhyme in Mandarin.

But it is notable that the phonetic element often works in different languages even though the sounds may be very different. Google AI gives the example of 煌 = "bright". It consists of the semantic element 火 = "fire", and the phonetic element 皇 = "emperor".
- In Mandarin, 皇 and 煌 are both pronounced huáng.
- In Cantonese, 皇 and 煌 are both pronounced wong4.
But the meaning of each character is exactly the same for both Mandarin and Cantonese speakers.

All the best, --stolfi
RE: The 'Chinese' Theory: For and Against

Jorge_Stolfi > 27-06-2026, 04:35 AM

(27-06-2026, 02:34 AM)pfeaster Wrote: This is an interesting case, but let's consider another one.

Thanks for the statistics! But I think they confirm my point...

Quote:For your "and", the percentage of [s] at end of preceding words is 26.1% -- still higher than average, though not by as much.

For me, the "-s" before "and" is almost three times the overall frequency of "-s". The difference may be because I included the subsection headers, like "Vertues and uses:" -- and thus "vertues" turned out to be the most common word before "and", by far.

Quote:Neither your example nor mine shows quite the kind of word profile we actually find around word-break combinations with skewed statistics in the Voynich Manuscript, in my experience. Importantly, I don't recall seeing anything like your "vertues" or my "they" that suggests whole words are driving the patterns as opposed to individual morphological elements -- glyphs, bigrams, and such.

However, the enhanced frequency of "-y" before "qo-" must be due to the word frequency distribution before "qo-" being different from the overall distribution. Especially since, as you claim, the glyphs before the "-y" have an effect too.

Quote:But even if we look just at Culpeper, that 36.1% for [s] before [are] seems to demonstrate that the kind of statistic we're considering here can be meaningful and revealing.

My point is that the statistics of the character "-s" before "are" do not tell us anything except that there is some deviation from statistical independence. That alone is neither meaningful nor revealing. We only begin to understand why the deviation exists by looking at the words that caused it.
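A minimal sketch of what "looking at the words that caused it" could mean in practice. The toy text below is made up for illustration (it is not Culpeper's wording), and the function names are hypothetical. The code reports the share of "-s" endings before a target word and then lists the preceding words that account for it.

```python
from collections import Counter

def preceding_words(tokens, target):
    """Count the words that occur immediately before `target`."""
    return Counter(prev for prev, nxt in zip(tokens, tokens[1:]) if nxt == target)

def ending_share(counts, suffix):
    """Fraction of the counted tokens whose word ends with `suffix`."""
    total = sum(counts.values())
    hits = sum(n for word, n in counts.items() if word.endswith(suffix))
    return hits / total if total else 0.0

# Hypothetical toy text with a repeated section header, as an assumption.
text = ("vertues and uses the root and the leaves are good "
        "vertues and uses the seed is hot and dry "
        "vertues and uses the flowers are cold")
tokens = text.split()

overall = ending_share(Counter(tokens), "s")
before = preceding_words(tokens, "and")
before_share = ending_share(before, "s")

print(f"-s overall:      {overall:.2f}")
print(f"-s before 'and': {before_share:.2f}")
print("words before 'and':", before.most_common())
```

In this toy case the character-level number only says that "-s" is over-represented before "and"; the word list shows that a single header word is responsible.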

In fact, the existence of such deviations is not remarkable by itself. If the text is meaningful, there will be correlations between consecutive words, and these will almost always cause correlations between the last glyph of a word and the first glyph of the next one. Or even between the first glyph of a word and the last glyph of the next one. (In my Culpeper file, I bet that the word before "and" will have an anomalous tendency to begin with "v"...)

What would be remarkable would be the absence of such deviations -- that is, if the frequency of any ending like -y were the same regardless of how the next word begins. That would be very unlikely to happen, whether the text was meaningful or gibberish. In a modern encrypted text that independence could happen if the first and last letters of each word were randomly selected nulls, or if the encryption was a Vigenère cipher using the digits of Pi as the key, or if the text was generated by a Markov process of order zero, etc. But these possibilities would be very unlikely for a text created in the early 1400s.

All the best, --stolfi
RE: The 'Chinese' Theory: For and Against

rikforto > 27-06-2026, 05:52 AM
(26-06-2026, 08:13 PM)Jorge_Stolfi Wrote: The hanzi 人 does not represent any specific string of sounds.
The way this is true cannot bear the load resting on it.

It is true that printed there, without context, I cannot say how that is pronounced. (I can tell you I heard "인" in my head, but that's not actually at all likely in the 21st century.) This is less impressive than it may first seem. How would you read "vice versa"? What language is it in? It turns out that reading an alphabetic language relies a good deal on context and convention, and one of those contexts is what language you're reading in. It is certainly interesting that Chinese orthography admits a great many more such undistinguished words between languages, but this is just a consequence of not directly representing any phonology.

Any familiarity with a character dictionary will quickly dispel the idea that they have no string of sounds associated with them. I've been making ample use of [link], and we see 人 has two etymologies, and hence two readings. Closer inspection will reveal that Wiktionary is kind of a mess because though it says "Chinese", Etymology 2 is not available outside of a group of Min languages. You'll also see it gives the various denotations, which are listed separately from the reading, unlabeled in Mandarin, and gives specific attestations in other languages with clear labels. What you cannot see are the gaps in the various languages and where the Mandarin isn't available in non-Mandarin languages. I would pay to read the oral history of how all these compromises got made, but that's not the point here. The point is you can see a number of interesting things that contradict the idea that there is some stable oversoul of ideas captured by the characters:
1. In Cantonese it is used as an impersonal pronoun
2. It is a plural marker in Hakka (but only for people)
3. It has a dedicated Beijing Mandarin use that is not found in Standard Chinese (e.g. it is probably slang)
4. In Taishanese it means paternal grandmother
5. Some Min languages use non-homophonic 儂 instead of 人, though it seems the Taiwanese government encourages the unusual measure of simply writing that as 人
6. 儂 has no listed meaning in Mandarin or Standard Chinese
I will say, if you look at the dialectal synonyms for the main Mandarin meaning of "person", that one has been fairly stable, though it is also a Swadesh word and predicted to be that way. If you want to see an example of an unholy mess at the opposite end of the spectrum, check out the dialectal synonyms listed for "paternal grandmother". The sum force of all this is that Chinese "concepts" are not immutable under language change, but rather a living language family full of semantic drift and changing senses and words being supplanted.

Rene, the kind of character you bring up illustrates this nicely. The [link] still exist and can still be used to guess (with all the uncertainty "guess" implies) the reading of a character. The essay linked by igajkgko dates the phonetic indicators to before the Late Shang, many hundreds of years before the Shennong Bencao Jing is supposed to have been written; my kanji textbook, printed in the 21st Century, still marks which characters have them and where. The characters have stood for sounds for a long, long time. Jorge, what you are discovering when you note that the same words tend to be in the same phonetic series is that the series are reflexes of each other.

Once you accept characters represent words and look at how they are used, you see a fully fledged language family. Now, for the most part you won't actually see this in writing. It is already in evidence that most new writing is done in Standard Chinese, which is what most people in the West mean when they say "Mandarin", and that prior to the 20th Century it was Classical Chinese. These are fully fledged languages which must be acquired through education. There are interesting things to be said about the standard practice of giving Mandarin Cantonese readings in Guangzhou, but they aren't all that different or all that much more interesting than [link] in the Latin West.

To keep this on thread, let me briefly reiterate: Since Classical Chinese is not the vernacular of the 1400s, and since we are assuming that The Author avoided learning it, the fact that it is an independent language implies one of two things:
1. The Author was unequipped to parse the manuscript, even if he had a phonetic translation
2. The Dictator translated the base text and the identification must account for the fact that the base text would be substantially different from the one Jorge used to make the identification
RE: The 'Chinese' Theory: For and Against

JoJo_Jost > 27-06-2026, 06:01 AM

@ Stolfi

The problem I see is that you have to make more and more assumptions to support this theory.

Culpeper’s herb book showed a relationship between a word and the word “and”, or between an ending and the word “and.” This relationship exists.

If you look at the structure of the “edy” families, it is broader and deeper.
For example, I could remove all standalone “shedy”s, and the effect would roughly remain the same. Precisely because there are other cases with a glyph preceding them: “dshedy,” “olshedy,” even “qokshedy,” and other variants. However, the hit rate is low. But even here, one or two additional assumptions could call this problem into question.

Not to mention that the edy family itself is very flexible and, strangely enough, has varying degrees of connection to qo. That’s also tricky.

I could argue that the other token-ending families—namely, the aiin, air, ol, and ar families (roughly one-third of all endings)—have only an extremely low correlation with qo (between 4 and 8%), meaning they essentially reject qo. And that, in turn, naturally argues against such a flexible word as “and.” But there are certainly possible explanations here as well. Although it gets more difficult here, since “and” is supposed to be "very" flexible.

I could then argue that the sequence—that is, everything that comes after “qo”—is also predetermined to a high degree (for example, that “k” follows “qo” 64% of the time, but not just “k”—even “kee” , still qokeed follows, which in turn is followed by “y” and thus “qo”). And that, too, contradicts a word as flexible as “and.”

[Attached graphs not shown.]

Note: The figures in these graphs were calculated as a continuous stream without spaces and may therefore vary slightly.

And here, too, you need yet more assumptions to argue this within the context.

So you can make the assumption that qo = “and,” but then you run into problems on other levels, which you then have to support with further assumptions. If you make an assumption on those other levels, you in turn run into problems with the remaining ones...

In my opinion, you’ll find yourself drowning more and more in further assumptions that, in principle, cannot be substantiated—the structure of the VMS is too complex. Explaining all these structures within the context of a language in this natural form, however, will be difficult—very, very difficult...
RE: The 'Chinese' Theory: For and Against

Jorge_Stolfi > 27-06-2026, 01:40 PM

(27-06-2026, 06:01 AM)JoJo_Jost Wrote: @ Stolfi, The problem I see is that you have to make more and more assumptions to support this theory.

Please clarify -- which theory and which assumptions?

If a word-final glyph like -y is more common in a certain context than in the whole book, then the frequency distribution of words in that context must be different from their distribution in the whole book. This is not an assumption or conjecture; it is a hard and simple mathematical deduction.
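The deduction can be spelled out as a short sketch. The notation is introduced here for illustration only: $c$ is the context (for example, "the next word begins with qo-"), $w$ ranges over word types, and $\mathbf{1}[\cdot]$ is 1 when the condition holds and 0 otherwise.

```latex
P(\text{-y} \mid c) = \sum_{w} P(w \mid c)\,\mathbf{1}[w \text{ ends in -y}],
\qquad
P(\text{-y}) = \sum_{w} P(w)\,\mathbf{1}[w \text{ ends in -y}].
```

If $P(w \mid c) = P(w)$ for every word $w$, the two sums are term-by-term identical. So whenever $P(\text{-y} \mid c) \neq P(\text{-y})$, at least one word must have $P(w \mid c) \neq P(w)$.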

Quote:If you look at the structure of the “edy” families, it is broader and deeper.
For example, I could remove all standalone “shedy”s, and the effect would roughly remain the same. Precisely because there are other cases with a glyph preceding them: “dshedy,” “olshedy,” even “qokshedy,” and other variants.

The striking anomaly for "-s before and" in my example was caused in good part by a single word "vertues", from the field title "vertues and uses". But, in general, the difference in word distributions need not be dominated by a single word.

In your example of "-s" before "are", the difference is spread over many words -- namely, the word "they" and many words ending in "-s" are more frequent before "are", while other words are less frequent.

In your example, it was still the case that the words that were enhanced were highly correlated with the ending "-s". But the feature that actually determined the enhanced frequency was not that letter per se, but whether the first word was plural or singular (or a second-person pronoun). It so happened that, in English, words that end with "-s" are often plural, and vice-versa. But there are many exceptions, like "they", "men", "women", "teeth", "feet", "people", "children", "mice", "dice", "sheep", "fish", "deer", "phenomena", "fungi", ...

If one focuses on the statistics of characters, like the final "-s" and initial "a-" one will never realize that a certain set of words -- including many that do not end in "-s" -- are the real cause of the statistical anomaly. One would get forever stuck at puzzling "why is the ending -s attracted to a leading a-?"

To be sure, sometimes it is the case that character correlations across a word gap are actually due to the characters, independently of the words. An example in English would be the (guessed) increased frequency of "-n" before words that begin with "a-". That would be due to the pronunciation quirk that changes the indefinite article "a" into "an" before such words.

This statistical "anomaly" should not depend on the rest of the second word; so, in that side of the gap, it would indeed be a character-level phenomenon. However, even in this example, if one looked at the first word, instead of just its final letter, one would notice that the effect is almost entirely due to the two words "a" and "an" -- not to the final letter being "-n".

Quote: And here, too, you need yet more assumptions to argue this within the context. So you can make the assumption that qo = “and,”

The proposal that qo means "and" is a separate, specific guess that is motivated by other observations. It could help explain the anomalous frequencies of certain endings before qo-words. But I don't see how one could disprove or prove that guess with that sort of character-level statistics. Or any statistics. If qo indeed means "and", it may affect the distribution of the previous word -- and hence of its final glyph -- in arbitrarily weird ways, for all sorts of peculiar reasons. Like how "and" enhanced the preceding final "-s" in my "vertues" example.

And, again, there is nothing remarkable in the fact that the frequency of -y endings before qo-words is different from their frequency in the book as a whole. It would be remarkable if there was no difference.

Quote:Explaining all these structures within the context of a language in this natural form, however, will be difficult—very, very difficult...

Yes indeed. And that is one of the reasons why many attempts at "deciphering" the VMS have failed: because they were doomed from the start by the assumption that the solution would be "easy" -- like a Caesar cipher or the Enigma code. But any "exotic" natural language is much, much more complicated than that...

All the best, --stolfi
RE: The 'Chinese' Theory: For and Against

JoJo_Jost > 27-06-2026, 02:35 PM

@ stolfi

So, sorry, but this is NOT just about “y” before “qo.” It’s about more specific relationships. And it’s not about individual cases either, but about the overall structure behind it. And that can’t be explained simply by plurals or by one letter following another—and certainly not by a / n. The entire structure is the problem, and trying to view this structure as separate parts, at least in my opinion, misses the point.

And I did write that I removed “shedy,” and the relationships remain the same—try doing that in your word example, because that aspect gets lost there. But I know it is hard to compare...

But let’s just leave it at that. We’re talking past each other. You see these as more or less normal word or letter effects; I see it as an overall structure that needs to be explained. We don't know who is right.

Ultimately, your assumptions need to be backed up by clear evidence: what “kedy” is, what “shedy” is, what the “y” after it is, what “qo” is, and why something starting with “k” follows “qo” so urgently, etc.

Let’s wait and see… And I'll still be keeping my fingers crossed for you that it works out.

All the best!!!

Jost
RE: The 'Chinese' Theory: For and Against

pfeaster > 27-06-2026, 04:43 PM

(27-06-2026, 04:35 AM)Jorge_Stolfi Wrote: My point is that the statistics of the character "-s" before "are" do not tell us anything except that there is some deviation from statistical independence. That alone is neither meaningful nor revealing. We only begin to understand why the deviation exists by looking at the words that caused it.

Here's a playful analogy:

In a base-10 system, all multiples of five end with the digit 0 or 5.  For example, 12345 and 43210 are both multiples of five.

My position: Investigating that final digit looks like it might be worthwhile.

Your position: No, the study of individual digits is a dead end -- the only unit worthy of attention is the entire number as a unitary whole. After all, any patterns found among digits are caused by the numbers in which they appear. This alleged pattern involving numbers ending in 0 or 5 will probably turn out to be an illusion on further investigation -- or else there will be so many potentially complicating factors that we won't be able to distinguish the noise from the signal. Besides, we know from the following multiples of seven -- 7, 14, 21, 28, 35, 42, 49 -- that it's possible for the final digit of a number not to reveal in any meaningful way whether that number is a multiple of some other number.
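The arithmetic behind this analogy is easy to check. A small sketch (the function name is made up for illustration) that collects the final decimal digits of the first few multiples of a number:

```python
def final_digits(step, count=20):
    """Set of last decimal digits among the first `count` positive multiples of `step`."""
    return {(step * k) % 10 for k in range(1, count + 1)}

print(sorted(final_digits(5)))  # [0, 5]
print(sorted(final_digits(7)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Multiples of five are confined to two final digits, while multiples of seven run through all ten, so the final digit is informative in one case and not in the other.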

And another playful analogy:

Consider a hypothetical text of unknown character in an unknown script and language. The characters found most often at the ends of words in general are @ (12%), $ (10%), # (8%), & (7%), and ^ (6%). But if we limit our study to the words that appear before one particular word -- ~+*/% -- the character most often found at the end of those words is, by far, ^ (60%), although no specific word or words predominate among them; we just find a lot of different words that end in ^.

My position: Golly, that's remarkable! What kinds of explanation(s) could we find that would be consistent with a pattern like that? Could this give us a clue as to the structure of a language? Or the mechanism of a cipher?

Your position: That statistic alone tells us nothing. Not only that; the type of statistic is all wrong as well. We instead need to be looking at the words that most frequently precede ~+*/%. Anything else is just going to cause confusion.

Returning now from analogy-land:

If we were trying to "decipher" Culpeper (and had no knowledge of the English language), the observation that words ending in [s] are so much more common than usual before [are] would actually be a useful crib in practice -- more useful, I'd say, than any statistics about the specific words that precede [are]. It may reveal nothing in itself, I suppose, but for someone seeking a linguistic solution, I believe it would suggest the hypothesis that [s] is a morphological marker that correlates somehow with the word [are]. That hypothesis would be correct. The [s] is the regular plural marker on nouns; [are] is a plural copula. A decipherer wouldn't know that yet. But noticing the pattern (and pondering possible explanations for it) would be a step in the right direction, and enough steps in the right direction might eventually lead to a solution.
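As a rough illustration of the statistic under discussion, here is a sketch that computes, for each word-final character, how much more common it is before a target word than in the text overall. The toy sentence is invented for illustration (it is not from Culpeper) and the function name is hypothetical.

```python
from collections import Counter

def final_char_lift(tokens, target):
    """For each word-final character seen before `target`, return
    P(final | next word is target) / P(final)."""
    overall = Counter(w[-1] for w in tokens)
    before = Counter(prev[-1] for prev, nxt in zip(tokens, tokens[1:]) if nxt == target)
    n_all, n_before = sum(overall.values()), sum(before.values())
    return {c: (before[c] / n_before) / (overall[c] / n_all) for c in before}

# Hypothetical toy text, as an assumption.
text = ("the roots are hot the leaves are dry the seed is cold "
        "they are good the flowers are sweet the herb is mild")
tokens = text.split()

lift = final_char_lift(tokens, "are")
for char, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"-{char}: {value:.2f}x")
```

A lift well above 1 flags an ending as over-represented before the target word. In this toy text both "-s" and "-y" stand out, the latter only because of the single word "they".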

On the other hand, with an analytical/isolating language like Chinese, I suppose no such clues should exist -- which I suspect may be why you're eager to rule them out in the case of the Voynich Manuscript. That motive wouldn't invalidate your reasoning, of course, but it's probably worth acknowledging openly, if only to clarify what's at stake.

(27-06-2026, 01:40 PM)Jorge_Stolfi Wrote: In your example, it was still the case that the words that were enhanced were highly correlated with the ending "-s". But the feature that actually determined the enhanced frequency was not that letter per se, but whether the first word was plural or singular (or a second-person pronoun). It so happened that, in English, words that end with "-s" are often plural, and vice-versa. But there are many exceptions, like "they", "men", "women", "teeth", "feet", "people", "children", "mice", "dice", "sheep", "fish", "deer", "phenomena", "fungi", ...

If one focuses on the statistics of characters, like the final "-s" and initial "a-" one will never realize that a certain set of words -- including many that do not end in "-s" -- are the real cause of the statistical anomaly. One would get forever stuck at puzzling "why is the ending -s attracted to a leading a-?"

Of course, there are many words that can precede "are" and yet are not plurals either. Those cases might outnumber the irregular plurals.

Still, this researcher you posit who gets "forever stuck" strikes me as rather a naive kind of straw man. I could equally posit a researcher who somehow identifies and collects hundreds of English plurals without ever noticing that they tend overwhelmingly to end in [s] -- remaining convinced a priori that "statistics about characters tell us nothing."

I guess each reader of this thread can decide independently: in Culpeper, is the statistic that words preceding [are] are a lot more likely than usual to end in [s] a useful data point or not? If you could have an equally revealing data point about Voynichese -- no more revealing, no less revealing -- would you want to have it?
RE: The 'Chinese' Theory: For and Against

Jorge_Stolfi > 28-06-2026, 02:27 AM

(27-06-2026, 04:43 PM)pfeaster Wrote: Here's a playful analogy: In a base-10 system, all multiples of five end with the digit 0 or 5. For example, 12345 and 43210 are both multiples of five. My position: Investigating that final digit looks like it might be worthwhile. ...

But note that, from character-level statistics, you will never learn that numbers that end in '5' are divisible by 5. Looking for numbers that end in '5' may be useful only if you already know that fact, and you suspect that quantities divisible by 5 have a special role in that text.

In fact, given a text with lots of numbers, I cannot imagine any useful insight one could obtain about those numbers from digit statistics alone.

And note also that the statistics for tokens that end in '5' would be contaminated by counts of tokens like "6.25" and "x^5" and "9-to-5" and "10:35" and "[link not shown]".

(By the way, I intended to add to that list a YouTube link ending with '5'. But after scanning a couple hundred items in my list of funny videos in vain, I began to suspect that no such thing exists. And indeed Google AI explained that YouTube assigns a 64-bit ID number to each video, then encodes that with 11 digits in base-64; and since 11 x 6 = 66, the last two bits of this code will be zero. So there are only 16 possibilities instead of 64 for the last base-64 character in the code; and the only digits in that set are 0, 4, and 8.

So I guess that this would be an example of a puzzling fact that would stand out in character-based statistics ... but would hardly ever be explained by them...)
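The bit-counting in that aside can be checked with a short sketch. Two assumptions are made here that the post does not state: that the IDs use the URL-safe base-64 alphabet (A-Z, a-z, 0-9, '-', '_'), and that the two spare bits are the low bits of the last character, as in standard base-64 encoding. Whether YouTube actually works this way is taken from the description above, not verified.

```python
import base64
import string
import struct

# Assumption: URL-safe base-64 alphabet, in the standard index order.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def last_char_options():
    """Possible final characters when 64 bits fill 11 base-64 characters.

    The 11th character carries 4 data bits; its 2 low bits are always zero.
    """
    return [ALPHABET[v << 2] for v in range(16)]

def encode_id(n):
    """Encode a 64-bit unsigned integer as 11 URL-safe base-64 characters."""
    raw = struct.pack(">Q", n)  # 8 bytes, big-endian
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

options = last_char_options()
print(len(options), "".join(options))
print("digits:", [c for c in options if c.isdigit()])
print(encode_id(12345))
```

Under these assumptions the final character is always one of 16 symbols, and the only decimal digits among them are 0, 4, and 8.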

Quote:And another playful analogy: Consider a hypothetical text of unknown character in an unknown script and language. The characters found most often at the ends of words in general are @ (12%), $ (10%), # (8%), & (7%), and ^ (6%). But if we limit our study to the words that appear before one particular word -- ~+*/% -- the character most often found at the end of those words is, by far, ^ (60%), although no specific word or words predominate among them; we just find a lot of different words that end in ^.

My position: Golly, that's remarkable! What kinds of explanation(s) could we find that would be consistent with a pattern like that? Could this give us a clue as to the structure of a language? Or the mechanism of a cipher?

Your position: That statistic alone tells us nothing. Not only that; the type of statistic is all wrong as well. We instead need to be looking at the words that most frequently precede ~+*/%. Anything else is just going to cause confusion.

Yes, and I stand by the latter statement. What insight could we possibly get out of the observation "[^] is more common than usual before [~+*/%]"?

To make further progress, we must look at the words that have different frequency before [~+*/%]. We will probably find that not just words that end with "^", but other words as well are "attracted" to [~+*/%], and on the other hand some words that end with "^" are "repelled" by it. Then we will realize that the important feature is not "ends with [^]", but may be something else, like "it is a transitive verb" or "it is a word related to horse management"; and the fact that many of those words end with [^] is a mostly meaningless coincidence. In any case, we would have identified a set of words that are somewhat related, possibly semantically or syntactically.

Quote:If we were trying to "decipher" Culpeper (and had no knowledge of the English language), the observation that words ending in [s] are so much more common than usual before [are] would actually be a useful crib in practice -- more useful, I'd say, than any statistics about the specific words that precede [are]. It may reveal nothing in itself, I suppose, but for someone seeking a linguistic solution, I believe it would suggest the hypothesis that [s] is a morphological marker that correlates somehow with the word [are].

Yes, we could make the hypothesis that "-s is a morphological marker that correlates with [are]". But that hypothesis in fact is nothing more than the observation "the word-final -s is more common than usual before the word [are]". It is just stated in a more sophisticated way...

Quote: But noticing the pattern (and pondering possible explanations for it) would be a step in the right direction, and enough steps in the right direction might eventually lead to a solution.

Not necessarily. Character statistics may be just as likely to lead you astray. For instance, in my earlier post I noted that the word final "-s" is also 3 times more common before "and" than in Culpeper's book as a whole. Should we take that as a hint that "and" and "are" have similar meanings or syntactic functions? That would be a mistake, because those numbers for "and" are due mostly to one word ("vertues") and one peculiarity of the structure of the text (that it has the phrase "vertues and uses" in every recipe).
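One way to test whether an anomaly is driven by a single word is to recompute the statistic with that word excluded, which is also the kind of check described earlier in the thread for "shedy". A minimal sketch, using an invented toy text and a hypothetical function name:

```python
def suffix_share_before(tokens, target, suffix, exclude=()):
    """Share of words ending in `suffix` among the words preceding `target`,
    ignoring any preceding word listed in `exclude`."""
    prev_words = [p for p, n in zip(tokens, tokens[1:])
                  if n == target and p not in exclude]
    if not prev_words:
        return 0.0
    return sum(w.endswith(suffix) for w in prev_words) / len(prev_words)

# Hypothetical toy text with a repeated header phrase, as an assumption.
text = ("vertues and uses hot and dry vertues and uses roots and leaves "
        "vertues and uses cold and moist")
tokens = text.split()

with_header = suffix_share_before(tokens, "and", "s")
without_header = suffix_share_before(tokens, "and", "s", exclude={"vertues"})
print(f"-s before 'and', all words:         {with_header:.2f}")
print(f"-s before 'and', without 'vertues': {without_header:.2f}")
```

If the share drops sharply once the word is excluded, the anomaly was mostly that word; if it stays roughly the same, the effect is spread over many words.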

Quote:On the other hand, with an analytical/isolating language like Chinese, I suppose no such clues should exist -- which I suspect may be why you're eager to rule them out in the case of the Voynich Manuscript.

On the contrary. In any text in any language, and for any pair of symbols X and Y, we expect that the frequency of word-final "-X" before words starting with "Y-" will be either higher or lower than the frequency of word-final "-X" in the whole text. That happens because the frequency of word pairs is never just the product of the individual frequencies.

Check [link] in modern Mandarin pinyin, for example. (This is the older version I was using six months ago, not the new version that I am still extracting from the Zhenghe Bencao. You must download the file; opening it in the browser and copy-pasting will not work because of a UTF-8/ISO-Latin screwup by our WWW server.) Consider these counts:
2676 g.   273 n.g   304 ī.m   201 n.w   511 g.s   103 n.p   436 ǔ.z
2235 n.    85 i.g   109 g.m   122 g.w   204 n.s    62 g.p   340 g.z
1297 ǔ.    61 g.g    62 n.m    66 ì.w   161 ì.s    49 ǔ.p   305 n.z
1276 ì.    35 ì.g    26 ù.m    54 ǔ.w    67 i.s    22 ì.p    94 ú.z
 823 i.    28 ú.g    24 ì.m    48 í.w    63 o.s    13 è.p    86 ì.z
 615 ī.    18 ā.g    14 ǔ.m    46 o.w    56 ù.s    12 i.p    64 í.z
 569 ú.    13 è.g    13 i.m    37 ǐ.w    46 è.s    10 ú.p    62 ù.z
 568 ù.    13 ù.g    13 r.m    33 i.w    45 ǔ.s     6 ù.p    50 ǐ.z
 479 è.    10 é.g    12 è.m    25 ù.w    44 ú.s     4 ǐ.p    33 i.z
 413 ǐ.     9 í.g     7 o.m    24 ī.w    41 í.s     4 o.p    31 ǚ.z
The first column above is the count of words that end with each letter (10 most common only). That is, there are 2676 words that end with "-g" (actually "-ng") , 2235 that end in "-n", 1297 that end with "-ǔ", etc.  The second column is the same counts but only for words that occur before a word that begins with "g-". That is, there are 273 word spaces where "-n" faces "g-", 85 where "-i" faces "g-", etc. The other columns are similar for word breaks before "m-" words, "w-" words, etc.

As you can see, there are plenty of strong "anomalous final-letter frequencies", comparable to the anomaly of "-y" before "qo-" in the VMS. In the whole text, the most common word ending is "-g" followed closely by "-n", then "-ǔ" and "-ì" almost tied in distant 3rd and 4th places. But before a word that begins with "g-", the ending "-n" is more than three times as common as "-i" and four times as common as "-g". Whereas before "s-" the ending "-g" is 2.5 times as common as "-n". And before "m-" the most common ending is "-ī" (almost 3x the second place), and before "z-" the most common ending is "-ǔ".
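The ratios can be recomputed directly from the counts in the table. A small sketch; the dictionary and function names are just for illustration, and only the counts needed for the comparisons are copied in.

```python
# Counts copied from the table above: word-final letter, by first letter of the next word.
overall = {"g": 2676, "n": 2235, "ǔ": 1297, "ì": 1276}
before_g = {"n": 273, "i": 85, "g": 61, "ì": 35}
before_s = {"g": 511, "n": 204}
before_m = {"ī": 304, "g": 109}

def ratio(counts, a, b):
    """How many times more common ending `a` is than ending `b`."""
    return counts[a] / counts[b]

print(f"overall  -g vs -n: {ratio(overall, 'g', 'n'):.2f}")
print(f"before g -n vs -i: {ratio(before_g, 'n', 'i'):.2f}")
print(f"before g -n vs -g: {ratio(before_g, 'n', 'g'):.2f}")
print(f"before s -g vs -n: {ratio(before_s, 'g', 'n'):.2f}")
print(f"before m -ī vs -g: {ratio(before_m, 'ī', 'g'):.2f}")
```

Overall, "-g" leads "-n" by a factor of only about 1.2, so the rankings before "g-", "s-" and "m-" differ markedly from the whole-text ranking.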

The causes of these anomalies are not some phonetic phenomenon of the language or peculiarity of the encoding that causes certain characters to attract or repel. Like the "-s before and" anomaly of Culpeper, these anomalies are due to certain words being more common or less common before certain other words, in that text.

The high frequency of "-ǔ before z-", for example, is due to the common pair "zhǔ zhì" 主治, which, in that version, is what I have identified as the source of daiin in the Starred Parags section.

Quote:this researcher you posit who gets "forever stuck" strikes me as rather a naive kind of straw man.

It is not a straw man! How many people have spent years tabulating those character frequencies, without getting any useful conclusion out of them?

Quote:I could equally posit a researcher who somehow identifies and collects hundreds of English plurals without ever noticing that they tend overwhelmingly to end in [s] -- remaining convinced a priori that "statistics about characters tell us nothing."

But the statistics for "-s" endings did not lead to the identification of "plurals" being what was attracted by "are"! We figured that out from our knowledge of English grammar. Those "-s" statistics indeed told us nothing besides the statistics themselves.

Quote:If you could have an equally revealing data point about Voynichese -- no more revealing, no less revealing -- would you want to have it?

Well, we have the statistics "-y is more common before qo-" and variations on that. What do they reveal?

All the best, --stolfi
RE: The 'Chinese' Theory: For and Against

JoJo_Jost > 28-06-2026, 04:39 AM

If it were just a matter of y and qo, but of course it isn't...
RE: The 'Chinese' Theory: For and Against

JoJo_Jost > 28-06-2026, 06:29 AM

Sorry, there was a mistake in this post and I had downloaded the Bencao file and noticed a few errors—probably from Google Translate. I've now decided to wait until Stolfi finishes the new version.