The Voynich Ninja

Full Version: The 'Chinese' Theory: For and Against
(20-05-2026, 11:26 PM)nablator Wrote:
(20-05-2026, 06:22 PM)kckluge Wrote: All of which is getting into the weeds. The point is that *if* Voynichese were a cipher that breaks up words into smaller chunks, the process that breaks them up is unlikely to be syllabification (and likely isn't deterministic in general?) due to the extremely low TTR that results.

The TTR of Voynichese can be reduced as much as you need... by simplifications, equivalences, re-spacing. However the babble-like sequences of similar words would not produce plausible Latin (or Chinese).

Let me start by saying that we are in vehement agreement that the distribution of the number of words between instances of words with a length-normalized edit distance below some threshold (which is a fancy way of saying "babble-like sequences of similar words") is an incredibly significant statistical signature of the Voynich text that any theory about how the text is generated needs to reproduce. It's for Stolfi to address that issue -- quantitatively -- in the context of explaining/defending his theory. 

As for the TTR stuff...I'm acutely aware that this is the "Chinese" theory thread (and not, for instance, the (AFAIK non-existent) "verbose cipher" theory thread), but moderators and readers please bear with me because what follows will, in fact, wrap back around to being on topic...

When you say, "The TTR of Voynichese can be reduced as much as you need... by simplifications, equivalences, re-spacing" I'll at least conditionally agree. You could, for example, assume that all the glyphs are either nulls or homophones of a single underlying glyph, in which case the TTR would be 1/NumTokens. While that may be true, it's not a terribly useful observation, and the underlying problem (at least from my POV, YMMV) is that methodologically it's a bass-ackwards way of thinking about the issue for reasons that:

1) tie into the whole "if you have a linguistic/cryptographic theory about the Voynich text, then actually applying that theory to translate/decipher the Voynich text is literally the last thing you should be doing, not the first" position that people have probably seen me rant about any number of times in the old mailing list/comments on Nick's blog/the Ninja, and

2) offers a good chance to provide an illustration of what I think methodologically is a better approach, that

3) results in a critique of Stolfi's approach here

In the context of the experiment I was talking about, the question is *not* "can you fiddle with the Voynich text to lower the TTR?" The question is "if you assume that Voynichese represents a natural language (in this case, Latin) where the words have been broken apart into syllables and then enciphered with a (probably to no one's surprise) verbose cipher, how do the statistical characteristics of that cipher text match up with the characteristics of Voynichese?" In that context, can you do things to the enciphered text that would raise the TTR? Sure, but the problem is to find a way to do that such that you both

1) raise the TTR to a level that is quantitatively consistent with the Voynich text, without 

2) (and this is the really important kicker) screwing up the quantitative agreement of the cipher text with any of the *other* characteristics of the Voynich text.

Could you raise the TTR of the enciphered Latin -- by the quantitatively necessary amount -- by (say) introducing homophones? Sure -- but in reducing the predictability of the next glyph what is that going to do to the entropy values? Is that going to mess with how Zipfian the "word" frequency distribution is? These things aren't independent variables that you can arbitrarily fiddle with in isolation, they're coupled. To the extent that the answers to those questions are (probably) "raise them unacceptably" and "yes", the low TTR -- completely independently from any other problems like "babble-like sequences of similar words" -- suggests (as a preliminary result, to the extent that Latin may or may not be typical) that we can probably rule out "it's enciphered syllables in a natural language text." Or at least "...in an Indo-European language text."
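
To make the coupling concrete, here is a toy sketch rather than a model of any real cipher. The function names, the random homophone scheme, and the sample tokens are all assumptions made up for illustration. It measures TTR and first-order character entropy before and after homophones are introduced; unigram entropy is used only as a simpler stand-in for the next-glyph (conditional) entropy discussed above.

```python
import math
import random
from collections import Counter

def ttr(tokens):
    """Type-to-token ratio: distinct word types divided by token count."""
    return len(set(tokens)) / len(tokens)

def char_entropy(tokens):
    """First-order (unigram) character entropy in bits, spaces excluded."""
    counts = Counter("".join(tokens))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def add_homophones(tokens, homophones, seed=0):
    """Replace each glyph by a randomly chosen homophone (hypothetical scheme).

    `homophones` maps a glyph to a string of alternatives; glyphs that are
    not listed are kept unchanged.
    """
    rng = random.Random(seed)
    return ["".join(rng.choice(homophones.get(ch, ch)) for ch in tok)
            for tok in tokens]

# Invented sample text: three word types repeated many times.
plain = "ba da ba ca da ba da ca ba da".split() * 20
cipher = add_homophones(plain, {"a": "a4"})

print(f"TTR     before {ttr(plain):.3f}  after {ttr(cipher):.3f}")
print(f"entropy before {char_entropy(plain):.3f}  after {char_entropy(cipher):.3f}")
```

Splitting a glyph into homophones can only add word types and can only spread the glyph distribution, so both numbers move together; that is the sense in which the knobs are not independent.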

Whether or not there is a different way of breaking words in an underlying natural language text apart and enciphering them that produces a ciphertext that is quantitatively consistent with all the "greatest hits" properties of Voynichese is an open question.

All of which leads to wrapping this back around to discussing Stolfi's theory (and tying it into the whole "babble-like sequences of similar words" text signature issue). It's not *impossible* that Stolfi has stumbled into a solution with his approach, but that's certainly not where I'd be putting my money in Kalshi or Polymarket at this point. IMHO -- and he is obviously convinced otherwise -- I think it is far more likely that he has fallen prey to the siren call of the "crib" that has lured so many mariners sailing on the treacherous seas of the Voynich Mss to their doom. What he *should* be doing -- again, IMHO -- is taking texts in one or more SE Asian languages (dealer's choice), assigning some scheme for representing them with Voynich glyphs, simulating whatever "confused ignorant scribe" error processes he thinks are there, and then showing that you wind up with something that -- quantitatively, and for all the "greatest hits" properties (including "babble-like sequences of similar words") -- looks like Voynichese.

Please excuse any typos/word skips --it's late...
(20-05-2026, 03:43 PM)nablator Wrote: I estimate it to 800-1000 in long texts (without many exotic words), more than double what Mandarin has (~400).

Mandarin has almost 1300 distinct syllables (non-compound words), as counted [link].  The number "400" is what one gets by discarding the tones.  Since tone is totally significant (it changes completely the meaning of the word), discarding them would be like counting Latin syllables after deleting all vowels except "u".

All the best, --stolfi
(20-05-2026, 03:42 PM)Stefan Wirtz_2 Wrote:
(19-05-2026, 08:25 PM)Jorge_Stolfi Wrote: [..]
The Voynichese daiin (and sometimes slight variants like dain, kaiin and laiin, and the abbreviation dam)
There is no proof for some of them being just „slight variants“ or even an „abbreviation“ for daiin or anything else.

Regardless, the point is that all attempts to identify articles in Voynichese have failed.

But I think I do have good proof that dair, dain, kaiin and laiin can be variants of daiin. It seems you still don't accept that evidence, but I hope it will eventually be irrefutable; I am working on that.

And I have also an explanation for those particular variations: the glyphs k, d and l (and, separately, r and in)  would look very similar if written in a "cursive" handwriting, such as the Author may have used in the draft.

The k may look very different from d because of its size; but even in the "print letters" of the Scribe's clean copy it can be squeezed down to the same height of a d, like this
[attachment=15665]
(f103r, line 9) and then the only real difference is whether the corner at top left is sharp or rounded.

As for dam being an abbreviation of daiin, (or, more generally, m being an abbreviation of iin) besides the evidence of the SPS=SBJ matches, there are statistical clues, like the prevalence of m at line end.

All the best, --stolfi
(21-05-2026, 10:33 AM)Jorge_Stolfi Wrote: But I think I do have good proof that dair, dain, kaiin and laiin can be variants of daiin. It seems you still don't accept that evidence, but I hope it will eventually be irrefutable; I am working on that.

And I have also an explanation for those particular variations: the glyphs k, d and l (and, separately, r and in)  would look very similar if written in a "cursive" handwriting, such as the Author may have used in the draft.

Just out of curiosity, would you also include taiin in this? Do you consider k and t to be two distinct characters?
(21-05-2026, 07:56 AM)kckluge Wrote: the distribution of the number of words between instances of words with a length-normalized edit distance below some threshold [...] is an incredibly significant statistical signature of the Voynich text that any theory about how the text is generated needs to reproduce.

Can you be more specific?

Quote:It's for Stolfi to address that issue -- quantitatively -- in the context of explaining/defending his theory.  [...] The "explanation" would be to show that the same anomalies that you see in the SPS are spelling system, and incidence of errors.

Can you see those anomalies in [link]?

Quote:...I'm acutely aware that this is the "Chinese" theory thread (and not, for instance, the (AFAIK non-existent) "verbose cipher" theory thread), but moderators and readers please bear with me because what follows will, in fact, wrap back around to being on topic...

I don't mind.  Besides, any argument you present for why the "Chinese" Origin theory (COT) is wrong will be on-topic.

Quote:3) results in a critique of Stolfi's approach here

First, note that the type-to-token ratio (TTR) of a text in any language usually depends on the text size.  Heaps' law (which may be just a consequence of Zipf's law) says that the size L of a text's lexicon (number of "word types") is related to size N of the text (number of tokens) like L ≈ K sqrt(N), where K may depend on the language, style, topic, etc.

So, when comparing TTRs of different texts, it is absolutely important to trim them to the same number of tokens.  Or compute the apparent constant K = L/sqrt(N) instead of the ratio L/N.

Second, spelling and transcription errors can inflate L while having little effect on N.   Mandarin uses only 1300 syllables (~18%) out [link] by the known phonetic constraints of the language. That means that changing one phoneme in a random syllable will, with high probability, increase the lexicon size L -- even if the change respects the phonetic constraints.

Thus, when comparing lexicon sizes, it may be prudent to exclude word types that occur only once or twice.  Or maybe even more, depending on the rate of errors.
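
A minimal sketch of the two precautions above (trimming to a common token count, and dropping rare word types), assuming the text is already split into a list of tokens. The function name `lexicon_stats` and its parameters are invented for illustration, and K is computed as L/sqrt(N) exactly as defined in this post.

```python
import math
from collections import Counter

def lexicon_stats(tokens, n_tokens=None, min_count=1):
    """Return (L, N, TTR, K) for a list of word tokens.

    n_tokens  -- if given, trim the text to its first n_tokens tokens
    min_count -- drop word types that occur fewer than min_count times
                 (their tokens are removed from N as well)
    """
    if n_tokens is not None:
        tokens = tokens[:n_tokens]
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    L = len(kept)             # lexicon size (word types)
    N = sum(kept.values())    # tokens belonging to the kept types
    if N == 0:
        raise ValueError("no tokens left after filtering")
    return L, N, L / N, L / math.sqrt(N)

# Tiny invented example: 'a' x3, 'b' x2, 'c' and 'd' once each.
sample = "a b a c a b d".split()
print(lexicon_stats(sample))                # all types
print(lexicon_stats(sample, min_count=2))   # hapax legomena excluded
print(lexicon_stats(sample, n_tokens=4))    # trimmed to the first 4 tokens
```

Excluding types seen only once or twice corresponds to `min_count=3`.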

The Starred Parags section (SPS)  of the VMS has, by my count, N = 10875 tokens and L = 2749 word types.  The TTR would then be ~0.25, and Heaps' K would be ~26.4.  If we exclude word types that occur only 1 or 2 times, we get L = 591 word types in N = 8386 tokens; the TTR drops to ~0.070 and K drops to ~6.45.

For comparison, that text above (which is Mandarin Chinese obfuscated by a weird but one-to-one spelling system, with no spelling errors) has L = 587 word types and N = 9458 tokens; the TTR is ~0.062 and Heaps' K is ~6.03.  Excluding words that occur only once or twice, we get L = 376 and N = 9170, giving TTR = ~0.041 and K =  ~3.92.
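
The quoted figures can be checked directly from the (L, N) pairs given in the two paragraphs above. This is only an arithmetic check on the stated counts, not a recount of either text; the dictionary labels are my own.

```python
import math

# (L, N) pairs exactly as quoted above; nothing is recomputed from the texts.
quoted = {
    "SPS, all types": (2749, 10875),
    "SPS, types seen 3+ times": (591, 8386),
    "Mandarin sample, all types": (587, 9458),
    "Mandarin sample, types seen 3+ times": (376, 9170),
}

def ttr_and_k(L, N):
    """Type-to-token ratio L/N and the apparent constant K = L/sqrt(N)."""
    return L / N, L / math.sqrt(N)

for label, (L, N) in quoted.items():
    ratio, k = ttr_and_k(L, N)
    print(f"{label}: TTR = {ratio:.3f}, K = {k:.2f}")
```

The results agree with the values in the post to within one unit in the last quoted digit.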

Third, one must not forget that statistics -- word and character frequencies, correlations, Zipf's and Heaps' law, etc -- are a property of a text, or a specific collection of texts -- not of a language.  There is no such thing as "the frequency of 'e' in English", "the most common word in Latin", "the Heaps K of Mandarin".  Even if the dialect and spelling are fixed, those statistics are greatly affected by topic, style, nature of the text, cost of vellum, etc.  One can have a meaningful and grammatically correct text in English with 50'000 tokens and only 100 word types -- that does not use the word "the" even once.

Quote:All of which leads to wrapping this back around to discussing Stolfi's theory [...]. It's not *impossible* that Stolfi has stumbled into a solution with his approach, but [...] I think it is far more likely that he has fallen prey to the siren call of the "crib" that has lured so many mariners sailing on the treacherous seas of the Voynich Mss to their doom.

Well, what can I say?  I think that the matches I have found are quantitatively categorical.

Note that my solution is very different from most other proposals. I don't know what is the language and the encoding, but I claim that I have identified a specific plaintext (the SBJ) whose structure  matches the SPS much better than could be expected by chance.  Specifically, I have a small dictionary (the "cribs") such that those words occur in the SBJ with spacings that are very closely proportional to the spacings of the corresponding words in the SPS.  

AFAIK there haven't been many such proposals yet.  Maybe the "Book of Enoch" theory is one.  I know of another proposal for a specific Kabbalistic book in Hebrew.  Presumably such proposals are rare because they are unlikely to work for more than a few pieces of the text, and break down when one tries to extend them beyond that scope.

Quote: What he *should* be doing -- again, IMHO -- is taking texts in one or more SE Asian languages (dealer's choice), assigning some scheme for representing them with Voynich glyphs, simulating whatever "confused ignorant scribe" error processes he thinks are there, and then showing that you wind up with something that -- quantitatively, and for all the "greatest hits" properties (including "babble-like sequences of similar words") -- looks like Voynichese.

Well, you have that text above to play with.  Do you see "babble-like sequences" in it?

All the best, --stolfi
Just about the potential cribs here, would it be worth searching for a far less common match? 

What I mean is a word that only appears once or twice in many recipes, but always in the same position when used within a recipe. For example, the word "grows" or the phrase "also known as". If you were to find a VMS match for a word like this, showing a positional match across recipes, that would be interesting.

Let's say hypothetically the word "grows" is found near the end of 80% of recipes, but is found at halfway through 20% of recipes (but NEVER both), finding a VMS word that acts the same way would be convincing. Even more convincing would be pairs of such words in recipes. If "also known as" and "grows" also happen to always be a similar distance apart in the SBJ recipes, and you find a pair that also match those attributes, it would be seriously interesting.
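
A rough sketch of how such a positional profile could be tabulated, assuming each recipe or paragraph is available as a list of word tokens. The function names and the toy data are hypothetical; this is not a method anyone in the thread has actually run.

```python
def relative_positions(paragraphs, word):
    """For each paragraph containing `word`, list its positions scaled to [0, 1]."""
    profile = []
    for tokens in paragraphs:
        if len(tokens) < 2:
            continue  # relative position is undefined in a one-token paragraph
        hits = [i / (len(tokens) - 1) for i, t in enumerate(tokens) if t == word]
        if hits:
            profile.append(hits)
    return profile

def share_in_band(profile, lo, hi):
    """Fraction of paragraphs whose first occurrence of the word lies in [lo, hi]."""
    if not profile:
        return 0.0
    return sum(lo <= hits[0] <= hi for hits in profile) / len(profile)

# Invented toy recipes, only to show the shape of the output.
recipes = [
    ["x", "y", "grows"],
    ["grows", "x", "y"],
    ["a", "b"],
    ["a", "b", "c", "d", "grows"],
]
profile = relative_positions(recipes, "grows")
print(profile)
print(share_in_band(profile, 0.8, 1.0))  # share of recipes with the word near the end
```

Running the same tabulation on both texts and comparing the band shares would be one way to look for the kind of match described above.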

[attachment=15666]

I don't know if such words and phrases exist in either text, and this is obviously far easier said than done. 

With something like "daiin", it is so common throughout the text that it will undoubtedly coincidentally match with something. I'm not saying your results show nothing, but my suspicion is that if you were to do a similar analysis between a European recipe book and the SBJ that you would also find matches given the nature of the content and the natural flow of such works.
(21-05-2026, 02:28 PM)Jorge_Stolfi Wrote: Well, you have that text above to play with.  Do you see "babble-like sequences" in it?

I don't, but I'm not going to read it entirely and I don't have a babble detector.

How to build a babble detector? Not sure. Some compression algorithm run on all (short) substrings?

Anyway, this doesn't happen in the VMS:

Repetitions of:
- 10 entire words:
6x vuef jhut quky jhokd mep bifa yitd duitd jhotd xuikd
2x reak jhut bupy gotd keity zafe vuef fet quky jhokd
- 9 entire words:
2x fep zhek nheiky jup zhek fotu rip jikd mopu
2x zhef eqef zipy xuot qup foky hitd jhuk mup
- 8 entire words:
2x vuef fet quky jhokd yitd duitd mep bifa
2x vuef fet quky jhokd mep bifa yop cuty
2x vuef fet ofg cep naky cuty quky jhokd
- 7 entire words:
2x zhef zhapy foky jhiky hitd tate tapy
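
For anyone who wants to reproduce this kind of list, a minimal sketch of counting exact repetitions of n consecutive words. It assumes the text is one flat token list (line and paragraph breaks ignored, which may differ from how the counts above were produced); the function name is made up.

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Return {n-word tuple: count} for every run of n consecutive words
    that occurs at least min_count times (overlapping runs are counted)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g: c for g, c in grams.items() if c >= min_count}

# Invented toy input.
toy = "a b c x a b c y a b".split()
for gram, count in repeated_ngrams(toy, 3).items():
    print(f"{count}x {' '.join(gram)}")
```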

[attachment=15670]

[attachment=15671]
(21-05-2026, 06:53 PM)nablator Wrote: How to build a babble detector? Not sure. Some compression algorithm run on all (short) substrings?

Okay, I wrote a quick babble detector with Byte Pair Encoding. The size of the string (with spaces) is divided by the size of the compressed string (without dictionary) to get the compression factor.
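
For readers who want to try something similar, here is a minimal sketch of such a detector. It is a reconstruction from the description above, not the actual code: the function names are invented, the window is joined with single spaces plus a trailing space, and ties between equally frequent pairs are broken by first occurrence, so factors may differ slightly from the ones reported in this post.

```python
from collections import Counter

def bpe_compress(text):
    """Greedy byte-pair encoding: merge the most frequent adjacent pair
    into a fresh symbol until no pair occurs twice. The dictionary is
    not stored, matching the 'without dictionary' size used above."""
    seq = list(text)
    next_code = 0xE000  # private-use code points serve as fresh symbols
    while len(seq) > 1:
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break
        symbol = chr(next_code)
        next_code += 1
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(symbol)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

def compression_factor(words):
    """Length of the spaced string divided by the length of its BPE encoding."""
    text = " ".join(words) + " "
    return len(text) / len(bpe_compress(text))

def find_babble(tokens, window=4, threshold=4.0):
    """List (index, text, factor) for every window of consecutive words
    whose compression factor reaches the threshold."""
    hits = []
    for i in range(len(tokens) - window + 1):
        chunk = tokens[i:i + window]
        factor = compression_factor(chunk)
        if factor >= threshold:
            hits.append((i, " ".join(chunk), factor))
    return hits

for hit in find_babble("qokeedy qokeedy qokeedy qokeedy abc def ghi jkl".split()):
    print(hit)
```

Overlapping windows are reported separately, so one long repetitive run produces several hits.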

Looking for 4 consecutive words with BPE compression factor >= 4.

3 in chovynoise:
nikd jheuf nikd jheuf
hitd hitd gop gop
ret sef ret sef

23 in VMS Q20:
qokaiin shey qokaiin shedy (error in RF1b, should be qokaiin shey qokain shedy on line f103r.20)
chedy qokeey qokeey lchedy
okaiin shedy shedy qokaiin
shey l shey lshey
qodain shedy chedy qodain
okeeey qokeeey okeey okeey
otshedy qokchdy qotshdy qokchdy
kchedy shdy kchedy chedy
oteey oteey lkain okain
okedy qokchedy chedy qokedy
okeey qokeey qokeedy qokeey
otedy qokedy qokeedy qokeedy
lkedy lkeedy chedalkedy lkeedy
okeedy oteedy qokeedy okeedy
oteedy qokeedy okeedy okeedy
aiin cheey qokeey qokeeaiin
qokeedy qokeedy chey qokeedy
qokeedy qokeedy qokeedy qotey
qokeedy okeedy qokey qokeedy
shey qokear shey qokeey
ateey qokeedy okeey qokeedy
chedy qotain chody qotain
sheey qokey qokchey qokchey

Looking for 5 consecutive words with BPE compression factor >= 4.

2 in chovynoise (actually it's twice the same "nikd jheuf nikd jheuf" string):
ro nikd jheuf nikd jheuf
nikd jheuf nikd jheuf vepd

18 in VMS Q20: 
shckhy qokaiin shey qokaiin shedy
qokeey chedy qokeey qokeey lchedy
otchedy qodain shedy chedy qodain
qokaiin chey chokaiin chear ar
aiin otal taiin qokaiin otal
okair chtl lkaiin okair chtl
qokedy okar chdy okar char
qokeedy qokeedy lchdy oteedy lchedy
qokeedy lchdy oteedy lchedy qokeedy
lchdy oteedy lchedy qokeedy qokeedy
lchedy qokeedy qokeedy qokeo lchedy
okeedy oteedy qokeedy okeedy okeedy
qokeedy qokeedy chey qokeedy qokedy
qokeedy qokeedy qokeedy qotey qokeey
qokain okeiin chey qokain okeey
ateey qokeedy okeey qokeedy qoky
qo qokain sheckhy qokain shekain
qokain sheckhy qokain shekain shkain

Looking for 6 consecutive words with BPE compression factor >= 4.

0 in chovynoise

15 in VMS Q20:
shdy qokeey chedy qokeey qokeey lchedy
odaiin otchedy qodain shedy chedy qodain
chkeiin okair chtl lkaiin okair chtl
qokchedy chedy qokedy okar chdy okar
qokeey qokeedy qokeey chedal chedy qokeey
qokeedy qokeedy lchdy oteedy lchedy qokeedy
qokeedy lchdy oteedy lchedy qokeedy qokeedy
lchedy qokeedy qokeedy qokeo lchedy qokey
qokeedy qokeedy qokeedy qotey qokeey qokeey
qokeedy qokeedy qotey qokeey qokeey otedy
qochey qokeey chey teey qokeedy qokeedy
qoaiin chedy qotaiin chety laiin chedy
shedain qokeedy chodain otedain qokeedy qokeedy
qo qokain sheckhy qokain shekain shkain
qokain sheckhy qokain shekain shkain shedy

Note: factor 4 is a bit too much, it favors exact repetitions.

For example "qokaiin shey qokain shedy " is compressed to "WiSRWSdR", factor = 26/8 = 3.25
Z = "qo"
Y = "Zk"
X = "Ya"
W = "Xi"
V = "n "
U = "Vs"
T = "Uh"
S = "Te"
R = "y "

VMS Q20: 406 occurrences of 4 consecutive words with BPE compression factor >= 3.

chovynoise: 5 occurrences of 4 consecutive words with BPE compression factor >= 3.

VMS Q20: 403 occurrences of 10 consecutive words with BPE compression factor >= 3.

chovynoise: 2 occurrences of 10 consecutive words with BPE compression factor >= 3.
(21-05-2026, 02:28 PM)Jorge_Stolfi Wrote:
(21-05-2026, 07:56 AM)kckluge Wrote: the distribution of the number of words between instances of words with a length-normalized edit distance below some threshold [...] is an incredibly significant statistical signature of the Voynich text that any theory about how the text is generated needs to reproduce.

Can you be more specific?

Sure. See Schinner, Andreas (2007) 'The Voynich Manuscript: Evidence of the Hoax Hypothesis', Cryptologia,  31:2, 95 - 107 (specifically, p 100-103). Key passage: "The Levenshtein distance of two character strings is an integer ranging from 0 (exact match) to the maximum of the two string lengths (no similarity), denoting the number of elementary edit operations necessary to make both strings equal. Mapping this number to the interval [0,100]  yields a 'percentage of dissimilarity' for two tokens. In Figure 3, the similar token repetition distance distribution Pn for the VMS compared with normal texts is presented. Here n denotes the number of other tokens between two similar ones, i.e., n = 0 corresponds to the situation of two alike tokens in immediate vicinity. Two words are considered 'similar' if their dissimilarity as defined above is less or equal to 30%; it turns out that the precise value (+/-10%) of this threshold changes Pn only quantitatively, not qualitatively."
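
A minimal sketch of that measurement, written from the quoted passage alone and not taken from Schinner's code or from my own analyzer. Two choices are assumptions: the distance is normalized by the longer token's length, and for each token only the nearest later similar token within `max_gap` is counted. The function names are invented.

```python
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def dissimilarity(a, b):
    """Edit distance mapped to [0, 100] by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 100.0 * levenshtein(a, b) / longest if longest else 0.0

def similar_gap_distribution(tokens, threshold=30.0, max_gap=50):
    """Estimate P_n, where n is the number of other tokens between a token
    and the nearest later token that is at most `threshold` percent dissimilar."""
    gaps = Counter()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if dissimilarity(tok, tokens[j]) <= threshold:
                gaps[j - i - 1] += 1
                break  # assumption: only the nearest similar token counts
    total = sum(gaps.values())
    return {n: c / total for n, c in sorted(gaps.items())} if total else {}

print(similar_gap_distribution(["qokeedy", "qokedy", "abc", "qokeey"]))
```

A strong peak at n = 0 and n = 1 relative to natural-language samples is the signature at issue.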

Quote:
Quote:It's for Stolfi to address that issue -- quantitatively -- in the context of explaining/defending his theory.  [...] The "explanation" would be to show that the same anomalies that you see in the SPS are spelling system, and incidence of errors.

Can you see those anomalies in [link]?

Not sure where the "The 'explanation'...of errors" bit came from -- probably a Ninja editing quote nesting error.

Having incorporated code that replicates Schinner's analysis into my Swiss Army text stats analyzer (although not in the version I released on the Ninja), I can check at some point. Given that I'm currently trying to help someone get a start-up off the ground it may not be very soon. Just based on eye-balling it, I'll be surprised if it has a distribution that's similar to the Voynich text. 

Quote:
Quote:...I'm acutely aware that this is the "Chinese" theory thread (and not, for instance, the (AFAIK non-existent) "verbose cipher" theory thread), but moderators and readers please bear with me because what follows will, in fact, wrap back around to being on topic...

I don't mind.  Besides, any argument you present for why the "Chinese" Origin theory (COT) is wrong will be on-topic.

Appreciated, I just want to avoid the thread getting dragged too far off topic.

Quote:
Quote:3) results in a critique of Stolfi's approach here

First, note that the type-to-token ratio (TTR) of a text in any language usually depends on the text size.  Heaps' law (which may be just a consequence of Zipf's law) says that the size L of a text's lexicon (number of "word types") is related to size N of the text (number of tokens) like L ≈ K sqrt(N), where K may depend on the language, style, topic, etc.

So, when comparing TTRs of different texts, it is absolutely important to trim them to the same number of tokens.  Or compute the apparent constant K = L/sqrt(N) instead of the ratio L/N.

Understood, and the word count of the Roger Bacon excerpt was picked to match the sample of Voynich text I was comparing it to. Given that [link] appears to show that the Voynich text follows Heaps' Law, I'll need to add testing for fit to it into my Swiss Army text analysis program...

Quote:Second, spelling and transcription errors can inflate L while having little effect on N.   Mandarin uses only 1300 syllables (~18%) out [link] by the known phonetic constraints of the language. That means that changing one phoneme in a random syllable will, with high probability, increase the lexicon size L -- even if the change respects the phonetic constraints.

Thus, when comparing lexicon sizes, it may be prudent to exclude word types that occur only once or twice.  Or maybe even more, depending on the rate of errors.

I have no problem believing there are scribal and transcription errors in the Voynich text -- "There is supposed to be a space before every EVA 'q', even though there isn't" is a hill I am more than willing to die on. I think if you're positing rates that are unusually large compared to known rates of scribal errors in manuscript texts then that's potentially a problem (and way back in this thread I suggested using rates of scribal errors in Greek mss. copied by an "illiterate" scribe as a reasonable proxy for the Voynich).

I am also fully on board with needing to try to clean the data -- in the experiments I've done with algorithms from the literature for induction of regular grammars with only positive training examples I always exclude hapax, and generally exclude types that only occur twice, even though that massively reduces the number of training samples.

As for "spelling and transcription errors can inflate L while having little effect on N", look, I'm taking a very firm methodological position here that I'm very comfortable defending. Don't tell me "can", show me "does." Look through the right end of the telescope. If you think the Voynich Mss. text is Chinese, don't start by translating Voynichese into Chinese. Start by showing how to translate Chinese into Voynichese. And yes, I understand that you don't actually think it's Chinese per se, and per my comment below I understand that you can't follow this path given the nature of your claim. From my point of view, that's a problem.

Quote:Third, one must not forget that statistics -- word and character frequencies, correlations, Zipf's and Heaps' law, etc -- are a property of a text, or a specific collection of texts -- not of a language.  There is no such thing as "the frequency of 'e' in English", "the most common word in Latin", "the Heaps K of Mandarin".  Even if the dialect and spelling are fixed, those statistics are greatly affected by topic, style, nature of the text, cost of vellum, etc.  One can have a meaningful and grammatically correct text in English with 50'000 tokens and only 100 word types -- that does not use the word "the" even once.

It's a strawman to suggest anyone doesn't understand that these are things that have some variance around mean values. I think you exaggerate the extent to which "those statistics are greatly affected by topic, style, nature of the text, cost of vellum, etc." (and the fact that hill climbing n-gram stats decrypters work as well as they do would appear to be an existence proof that they are not). If you are familiar with studies quantitatively analyzing how letter and word frequency stats differ between (say) a corpus of chemistry and physics texts and a corpus of mystery and spy novels, I'd genuinely love to see them.

Quote:
Quote:All of which leads to wrapping this back around to discussing Stolfi's theory [...]. It's not *impossible* that Stolfi has stumbled into a solution with his approach, but [...] I think it is far more likely that he has fallen prey to the siren call of the "crib" that has lured so many mariners sailing on the treacherous seas of the Voynich Mss to their doom.

Well, what can I say?  I think that the matches I have found are quantitatively categorical.

I know you do, and believe me, nothing would make me happier than if you or anyone else came up with something I saw as a compelling solution, if only because then I could stop wasting time on the damned thing. It's not like I'm making any progress.

Quote:Note that my solution is very different from most other proposals. I don't know what is the language and the encoding, but I claim that I have identified a specific plaintext (the SBJ) whose structure  matches the SPS much better than could be expected by chance.  Specifically, I have a small dictionary (the "cribs") such that those words occur in the SBJ with spacings that are very closely proportional to the spacings of the corresponding words in the SPS.  

And there's the rub, isn't it? If you don't have any idea what the language and encoding are, then you can't follow the methodological path I'm advocating. To be clear, that doesn't mean you're wrong, it just makes it harder for you to convince me you're right.

As for the (statistical) significance of the spacings of the cribs, there is a limit to how closely I've been following the thread after the initial "daiin" match, and I don't want to get into why I'm unconvinced of the significance of the "daiin" spacing. 

Quote:
Quote: What he *should* be doing -- again, IMHO -- is taking texts in one or more SE Asian languages (dealer's choice), assigning some scheme for representing them with Voynich glyphs, simulating whatever "confused ignorant scribe" error processes he thinks are there, and then showing that you wind up with something that -- quantitatively, and for all the "greatest hits" properties (including "babble-like sequences of similar words") -- looks like Voynichese.

Well, you have that text above to play with.  Do you see "babble-like sequences" in it?

At the end of the day, the question isn't "am I willing to put the time into doing it?". It's "are there other higher priority things I need to be using that time for?". Unfortunately, right now the answer is yes, there are. I've pointed you at the Schinner paper, it has figures showing the difference between the behavior of the Voynich text and the samples of natural language text he looked at. If I find myself having a half hour or so window of free time, I'll try to run the analysis.

Quote:All the best, --stolfi

Ditto. If it turns out you're right, no one will be happier for you than I will. -- Karl
(20-05-2026, 01:35 AM)JoeyB Wrote: FIRST: which SBJ or ZHB version is the right comparator. And SECOND, whether the source text could have existed in the 'right form' around 1400. And I definitely don't have anything useful to add there.

The answer will require a rather long post; it will come.  But in brief: as @richforto noted, the version of the Shennong Bencaojing (SBJ) that became the Starred Parags section (SPS) of the VMS was not the "reconstructed" file that I had been using, but a version that was embedded in some later materia medica; most likely the Zhenghe Bencao (ZHB), which was composed around 1080 CE.  On block-printed copies of the ZHB circulating in the 1300-1400 CE time frame, the SBJ text would have been clearly marked out by being printed in double-size font and white-on-black instead of black-on-white.  Thus a doctor or scholar who had a copy of the ZHB could have read it out aloud easily, excluding certain ~500 CE additions that had become as "sacred" as the SBJ itself.

Quote:BUT, the other issue seems to be more testable: if this is a positional-distance hypothesis, can the method we use to match also recover the rooster/f105v32-38 pair when we run the whole thing blind? Which files should I grab for that? EG, compare all SPS paragraphs against all SBJ/ZHB entries without preselecting the rooster pair,

Yes, and that is exactly what I am doing now for every SPS entry, including the Rooster one.  Except that I am testing only against the 243 "good" SPS parags -- excluding those which have two or more stars on the margin, which presumably are in fact two or more parags run together.  So I expect that at most 2/3 of the SBJ entries will have an identifiable match.

However, the Rooster entry is a very easy case because it is the only one with eight instances of 主. Thus f105v.32 is the only parag with enough daiin to even begin to compete, even allowing for "quillos" like dain, laiin, or kain.  So it easily comes out as the match for the Rooster entry.  I will discuss this in detail later.

Apart from the Rooster one, there are a few SBJ entries with three 主, and  a few with two. The vast majority has only one.   On these, my programs will often find two or more parags that seem to have a daiin (or quillos thereof) at the right places.  So I have to put those entries on hold for now.

On the other hand, I now have a couple more cribs that can be used in the comparison, notably 气 qì (some sort of "vital energy"; sort of like the Western "humors" but completely different).  That is a rather common character in the SBJ, and seems to correspond to chedy (or sometimes slight variations thereof, like chdy or cheda) in the SPS.  By including those cribs in the list of keywords to be matched I can often resolve those ambiguities, and make already identified matches more certain.

Quote:I started looking at and playing with files in the ic.unicamp....Notes/077 folder but before I go too far down the road, if the method is bad I don't want to keep going, and if there are files that are final versions or authoritative I'd want to use those.

Sorry, those programs and files were not meant for other people's use.  They are still messy and buggy tools that I am using to match SBJ recipes to SPS parags. A task that is progressing, but much more slowly than I hoped, for reasons that I will detail later.  Feel free to use those programs, but I cannot guarantee them.

All the best, --stolfi