The Voynich Ninja

Full Version: It is not Chinese
(17-06-2025, 09:38 PM)oshfdk Wrote: So, the Chinese text was transcribed to pinyin and then space separation of pinyin groups was treated as word breaks?

Word breaks were guessed and spaces inserted (or removed) accordingly. Google Translate is not optimized for old text: the meaning of the same characters can differ between old and recent texts, and so can the word segmentation.

I can't ask Google Translate how it does the segmentation, so I asked DeepSeek. I don't know why it assumed the text is Classical Chinese; anyway, these are its explanations:

DeepSeek Wrote:Notes on Segmentation & Pronunciation
• Word boundaries follow Classical Chinese grammar (e.g., 上藥 shàng yào = "high-grade herbs," treated as one compound).
• Tone marks are added for clarity (e.g., wéi 爲/为, yǐ 以).
• Fixed terms (e.g., 君臣佐使 jūn chén zuǒ shǐ = "sovereign, minister, assistant, envoy") are kept intact.
• Modern Pinyin is used for readability, though some archaic pronunciations may differ.


DeepSeek Wrote:In ancient Chinese texts, words are indeed written without spaces, which can make word identification challenging. However, there are several reliable methods and principles that scholars use to segment and interpret words correctly:
1. Grammar & Syntax Rules
• Classical Chinese has a relatively fixed grammatical structure. For example:
 Subjects usually precede verbs (e.g., 王曰 "The king said").
 Objects often follow verbs (e.g., 伐国 "attack a state").
 Modifiers (adjectives, possessives) come before nouns (e.g., 大国 "great state," 吾子 "my lord").
• Function words (e.g., 之, 乎, 者, 也) often mark boundaries between phrases.
2. Morpheme-Based Nature
• Most words in Classical Chinese are monosyllabic (one character = one word), but there are also disyllabic compounds (e.g., 天下 "all under heaven," 君子 "gentleman").
• Context helps determine whether two characters form a single word or are separate.
3. Context & Meaning
• The broader passage often clarifies word boundaries. For example:
 In 學而時習之, the segmentation is 學 / 而 / 時習 / 之 ("Learn and often practice it").
• Misreading can happen if context is ignored (e.g., 可以 is often 可 / 以 "can / use to" rather than the modern 可以 "can").
4. Fixed Phrases & Idioms
• Many set phrases recur in Classical Chinese (e.g., 不亦樂乎, 如之何), making them easier to identify.
5. Punctuation in Later Texts
• Many received editions (e.g., Song dynasty printed texts) include punctuation marks (句讀 judou) that guide segmentation.
6. Comparative & Philological Methods
• Comparing parallel passages in different texts.
• Analyzing rhyme schemes in poetry (e.g., Shijing) helps confirm word divisions.
Example Analysis
Take an unpunctuated sentence:
民可使由之不可使知之
Possible interpretations depend on segmentation:
1. 民可 / 使由之 / 不可 / 使知之 ("The people can be made to follow, but not to understand.")
2. 民可使 / 由之 / 不可使 / 知之 (alternative reading with different emphasis).
The correct choice depends on context and grammatical logic.
(17-06-2025, 06:28 PM)oshfdk Wrote: BTW, I'm not sure what method was used to determine the number of words for the Chinese manuscript.

I fed the Chinese text of the webpage you posted to Google Translate, and copy-pasted the pinyin that it provided (at the bottom of the left pane).

Two-word compounds are not explicitly marked in the Chinese script; each syllable is a separate character, and the spaces between characters are all the same. In pinyin transcriptions, some pairs of syllables are written as a single word. I don't know what the rule is and how rigid it is. Does anyone here know? (Answered by the previous post.)

Most entries in Chinese and Chinese-Xxx dictionaries are two-syllable compounds where the meanings of the individual syllables have only a vague relation to that of the compound, like English "pineapple", "necktie", or "typewriter". But there may be a gradation in that regard; I suppose that in English one could write either "back-scatter", "backscatter", or "back scatter" without upsetting the reader, and the same perhaps happens when one writes Chinese in pinyin (which I suppose is something that the Chinese themselves very rarely do).

So perhaps the pair joining in pinyin is defined by some standard dictionary.  But my guess is that Google Translate just gets that info from crunching a big pile of random pinyin texts, and therefore is random to some extent.
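To make the effect of pair joining concrete, here is a minimal Python sketch of how a word count taken from whitespace-separated pinyin depends on the joining convention. The `count_words` helper is hypothetical, and the joined spelling below is made up for illustration only; it is not claimed to be what Google Translate outputs.

```python
def count_words(pinyin: str) -> int:
    """Count whitespace-separated groups in a pinyin string."""
    return len(pinyin.split())

# The same four syllables (from the DeepSeek note above), written two ways.
separate = "jūn chén zuǒ shǐ"   # every syllable as its own "word"
joined = "jūnchén zuǒshǐ"       # hypothetical pair joining, illustration only

print(count_words(separate))  # 4
print(count_words(joined))    # 2
```

So any word-count statistic obtained this way inherits whatever joining rule the transcription tool happens to apply.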
(17-06-2025, 06:28 PM)oshfdk Wrote: With some numbers this looks more interesting to me. However, I don't see this as 6 near coincidences. The total number of entries is the obvious optimization parameter. You wouldn't even consider comparing Voynich stars section with a manuscript of 30 entries or a tome of 3000 folios. If I understand it correctly, the similar number of entries was one of the things that attracted your attention to the Chinese MS. If VMS had 30 entries, there would be another Chinese (or Arabic or Hindi) piece of interest with 30 entries and maybe a different origin story.

Yes, and if you find such a manuscript, please let us know.

Indeed I came to this hunch when I was surf-searching for Chinese medical texts, and noticed that the SBJ had about the right number of entries and right length of each entry.

But, besides those and the other statistical similarities, what makes the SBJ a strong candidate is that it was widely known and available in the whole Chinese cultural sphere (including countries that have never been under Chinese control); and, if someone asked a doctor anywhere in that area "what is the most important medical book you have here", the answer would very likely have been the SBJ.

Quote:Also, you decided to remove section titles from bencao, if I understand it correctly. Were they a later addition and not part of the original work?

All I know is that the SBJ that exists today is a version that was said to have been "expanded" around 1400 CE, and (IIUC) somewhat reorganized.  IIUC the expansion included separating the "recipes" into mineral (Google's "Jade"), vegetable, animal, etc.  Which is what those subsection titles seem to be.

But indeed maybe there were section titles in the original SBJ too.  Descriptions of the original only mention the division into 120/120/125 entries.  So perhaps 3 of those 4 "titles" in the SPS are the headers of those original sections.  But the first two are on the same page (f105r) only a dozen recipes apart; and there seems to be no title at the start of the SPS (page f103r)....
(17-06-2025, 11:32 PM)Jorge_Stolfi Wrote: But indeed maybe there were section titles in the original SBJ [Shennong Bencao Jing] too.  Descriptions of the original only mention the division into 120/120/125 entries.  So perhaps 3 of those 4 "titles" in the SPS are the headers of those original sections.  But the first two are on the same page (f105r) only a dozen recipes apart; and there seems to be no title at the start of the SPS (page f103r)....

Here are the lines of the Starred Parags section (SPS; from page 105r to page 116r line 30) that I am currently considering to be "titles":

<f105r.T1.9a>
[attachment=10835]

<f105r.T2.36>
[attachment=10834]

<f108v.T1.52>
[attachment=10836]

<f114r.T1.34>
[attachment=10837]

The middle two, <f105r.T2.36> and <f108v.T1.52>, which are centered on their respective lines, seem to be correct: if there are titles in that section, those two must be among them.

The creators of the interlinear file were uncertain about the last one, <f114r.T1.34>.  Three possible interpretations are
  1. Line 34 is a title too.  
  2. Line 34 is the last line of the previous paragraph.
  3. Line 34 is words that should go between lines 35 and 36.
In any case, it is very likely that line 34 was written after line 35 was started but before it was completed, because line 35 is bent and tilted with respect to line 33, as if to leave space for line 34.

An argument for (2.) is that line 34 ends with -m, a glyph that is common at end of paragraphs, while the previous line (33) has full width and does not end in -m.  

A possible explanation for the right-justification in options (1.) and (2.)  is that the Scribe copied line 33, then skipped line 34 by mistake, started to write line 35, noticed the omission, wrote line 34 at the right, then continued line 35, bending it down to avoid line 34.

For option (3.), a possible explanation is that the Scribe skipped several words at the "carriage return" from line 35 to line 36.  Several lines later, when the omission was noticed, he/she went back and added the missing words in the nearest space available, right-justified above line 35.  But this conjecture does not explain why line 35 is tilted and bent as it is.

So now I am more inclined to believe in (2.) above.  I will fix the file accordingly.

The first title <f105r.T1.9a> is even more dubious.  Here it is again for convenience:
[attachment=10835]
Here the previous paragraph ends with a half-line (9), so line 9a is unlikely to be part of it.  Line 10 starts right below line 9, with only a bit of extra space, as expected for a parag break; but it is tilted with respect to line 9, as if to leave space for line 9a.  That would suggest that line 10 was written after line 9a.  However, the way the words of line 9a avoid the tall p gallows of line 10 says that line 9a was written after line 10 was completed.

Thus, in this case, explanation (3.) above seems more likely than (1.) and (2.), except that it does not explain the extra tilt of line 10.  Anyway, I think I will accept this explanation and insert line 9a between lines 10 and 11.

Incidentally, note that the handwriting changes abruptly at the next parag break, between lines 12 and 13.  Could this coincidence be related to that anomalous line?

I have fun visualizing the Author checking on the work of the Scribe after he finished line 12, noticing the hack of line 9a, and firing him on the spot.  Then hiring another Scribe, training him on the script, and instructing him to write with smaller "font" to save vellum.

However, I don't see much difference between the two handwritings, apart from "font" size, ink color, and stroke width.  So I think the explanation for the discontinuity is more likely a banal one: simply a long pause in the work between lines 12 and 13.
(17-06-2025, 09:02 PM)Jorge_Stolfi Wrote:
(17-06-2025, 02:27 PM)Pepper Wrote:
(17-06-2025, 08:35 AM)oshfdk Wrote: There are generally two kinds of Voynich theories: the solution kind (providing some specific plaintext for specific parts of the MS, be it labels, lines, etc) and the origin story kind, of which your Chinese theory is an example.
I think the origin story is not at all convincing but that's largely irrelevant to whether the solution is correct or not, so it's a shame to get bogged down in arguments about it.

Abstractly that may be true, but in practice any attempt to decipher the VMS must make some assumption about its origin and how it was produced. That is necessary to limit the possibilities for the language and encoding, to estimate the fraction of errors, and to exclude spurious features from analysis. 

In fact, most attempts at decipherment to date have made the same assumption about the origin: the manuscript was created in Europe, and the text and diagrams (not just the script) were original creations by the Author, and either they were a nonsensical hoax, or their meaning was perfectly known to the author. In the second case, every word and every detail of the drawings was intentional; and therefore could be a clue for the decipherment, or had to be explained by it.

And I believe that those attempts failed, and were doomed to fail, because that assumption is false.  The "Chinese Theory", in contrast, provides an entirely different set of candidate languages and a very different type of "encryption"; and it implies that, while the text and diagrams had meaning, the Author himself had only a limited understanding of them.  Thus he must have made many errors, and (in the Herbal especially) made up a lot of stuff that he had failed to record.  And therefore it demands very different approaches to decipherment.

That all sounds like a convenient means to explain away the oddities of the text that don't fit the proposed translation, which is certainly a feature of nearly all Voynich theories!

(17-06-2025, 03:00 PM)oshfdk Wrote:
(17-06-2025, 02:27 PM)Pepper Wrote: The solution part of the theory IS falsifiable. Jorge has even suggested a plaintext for the recipes section. Falsifying it won't be easy but also not impossible, if somebody is sufficiently motivated.

I disagree. It may be theoretically falsifiable in the same way as a teapot orbiting the sun is theoretically falsifiable, but practically not. It has been suggested that the plaintext can be some older version of a known Chinese book possibly transcribed with mistakes from an unknown version of an Oriental language. How do you refute this? Other than providing a complete solution, which would falsify most competing theories.

On the other hand, if we assume a perfect transcription of the known text from Classical Chinese, then yes it is falsifiable and as far as I'm concerned my experiment with computing longest repeated contexts a few posts ago did falsify it.

Having read more, I agree.
(17-06-2025, 03:00 PM)oshfdk Wrote: On the other hand, if we assume a perfect transcription of the known text from Classical Chinese, then yes it is falsifiable and as far as I'm concerned my experiment with computing longest repeated contexts a few posts ago did falsify it.
Agreed, you did falsify the Chinese theory with that assumption included.

All the best, --jorge
Before merging the two dubious "titles" of the Starred Parags section (SPS) into the adjacent parags, the situation could be described by the following diagram:
Code:
  ------ start of the SPS
    |
    | 61 parags
    |
  ...... dubious title <f105r.9a> sairy.ore.daiindy.ytam
    |  8 parags
  ...... title <f105r.36> otoiis.chedaiin.otair.otaly
    |
    | 104 parags
    |
  ...... title <f108v.52> olchar.olchedy.lshy.otedy

  ****** missing 4 pages f109r-f110v
    |
    | 106 parags
    |
  ...... dubious title <f114r.34> ytain.olkaiin.ykar.chdar.alkam
    |
    | 51 parags
    |
  ------ end of the SPS
If we were to remove just the first dubious title, <f105r.9a>, we would have 69 parags between start of the SPS and the first title <f105r.36>, and 51 parags between the last title <f114r.34> and the end of the SPS. 

And 69 + 51 = 120.  :puzzled:

Between the two definite titles there are 104 parags, which is neither 120 nor 125 but not very far off.  Assuming the average of ~14 parags per page, the 4 missing pages should have ~56 parags; in that case, between the second definite title and the last (dubious) title there would be 106 + 56 = 162 parags.  That is also neither 120 nor 125, and somewhat farther off.
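The sums can be re-checked with a few lines of Python. This is only a sketch: the counts are copied from the diagram above, the 14 parags per page is the rough average assumed in this post, and the dictionary keys are labels made up for the example.

```python
# Paragraph counts between markers, copied from the diagram above.
parags = {
    ("start", "f105r.9a"): 61,      # up to the first dubious title
    ("f105r.9a", "f105r.36"): 8,
    ("f105r.36", "f108v.52"): 104,
    ("f108v.52", "f114r.34"): 106,  # surviving parags only
    ("f114r.34", "end"): 51,
}

PARAGS_PER_PAGE = 14   # rough average assumed in the post
MISSING_PAGES = 4      # f109r-f110v

before_first_title = parags[("start", "f105r.9a")] + parags[("f105r.9a", "f105r.36")]
after_last_title = parags[("f114r.34", "end")]
missing_estimate = PARAGS_PER_PAGE * MISSING_PAGES
with_missing = parags[("f108v.52", "f114r.34")] + missing_estimate

print(before_first_title)                     # 69
print(before_first_title + after_last_title)  # 120
print(missing_estimate)                       # 56
print(with_missing)                           # 162
```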

Let's suppose for a moment that indeed the SPS is the old (pre-1400) version of the SBJ (in some language, with the pronunciation of 1400), and the latter indeed had three sections of 120, 120, 125 entries, with a title at the top of each section, and these titles got transcribed by the Author as <f105r.36>, <f108v.52>, and either <f114r.34> or a lost title in the missing bifolio.

Then we would need an explanation for how the titles of the SPS ended up in their current positions among the list of paragraphs. Two possibilities are that the bifolios of the quire got scrambled before the folios were numbered (something which we know happened in the Biology section), and/or the pages of the Author's draft got scrambled before the Scribe copied them to the vellum.  

I won't try to propose a detailed scrambling scheme yet...

All the best, --jorge
If the Starred Parags section (SPS) is indeed a phonetic transcription of some version of the Shennong Bencao Jing (SBJ) in some language and with some old pronunciation, one of the smallest obstacles in exploiting that "Rosetta Stone" is the uncertainty about the paragraph breaks in the SPS.

To illustrate the problem, I picked one of the shortest parags in my SPS file: the one that starts on page f111v, line 14, which has only 12 words. Here is the relevant section of that page (image clipped from the Beinecke Library scans, contrast-enhanced):
[attachment=10841]
The green triangles indicate the paragraph breaks as marked in my SPS file.  

As I posted before, my assumed criteria for identifying a parag break are
  1. If the line has one or more one-leg gallows (p or f), it must be the first one of a parag.
  2. If the line ends well before the right margin, it must be the last one of a parag.
  3. If the line ends in -m or -g, it is likely to be the last one of a parag.
  4. If the spacing above the line is larger than average, it is likely to be the first of a parag.
In the old interlinear file (version 1.6e6) some of the line breaks above were not marked as parag breaks, but there were comments "There may be a parag break here", presumably because of the -m endings.  I promoted some of these to parag breaks by visual inspection of the Beinecke images.  All this before I started comparing the parag lengths of the SPS and SBJ, so I was not "cheating".
By these criteria, the only parag break in the image above that can be trusted is the one between lines 25 and 26, which satisfies criteria (1.) and (2.).  All other breaks are dubious guesses based only on criteria (3.) and/or (4.).
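The four criteria can be sketched as a small Python check that separates "must" evidence (1. and 2.) from "likely" evidence (3. and 4.). Everything here is an assumption for illustration: the `Line` fields, the `break_evidence` name, the 0.8 and 1.2 thresholds, and the toy lines are all made up, and testing for a bare "p" or "f" in the EVA text also catches gallows inside cph/cfh clusters.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str          # EVA transcription, words separated by "."
    width_frac: float  # line length as a fraction of full text width (hypothetical measurement)
    gap_above: float   # spacing above the line, in units of the average spacing

def break_evidence(prev: Line, cur: Line, short: float = 0.8, wide: float = 1.2):
    """Return (strong, weak) lists of criteria suggesting a parag break between prev and cur."""
    strong, weak = [], []
    if "p" in cur.text or "f" in cur.text:  # 1. one-leg gallows on the next line
        strong.append(1)
    if prev.width_frac < short:             # 2. previous line ends well before the margin
        strong.append(2)
    if prev.text.endswith(("m", "g")):      # 3. previous line ends in -m or -g
        weak.append(3)
    if cur.gap_above > wide:                # 4. extra spacing above the next line
        weak.append(4)
    return strong, weak

# Toy lines: the measurements are invented, and "pchedy.okaiin" is a made-up line.
a = Line("sairy.ore.daiindy.ytam", width_frac=1.0, gap_above=1.0)
b = Line("olchar.olchedy.lshy.otedy", width_frac=0.5, gap_above=1.0)
c = Line("pchedy.okaiin", width_frac=1.0, gap_above=1.0)

print(break_evidence(a, b))  # ([], [3])
print(break_evidence(b, c))  # ([1, 2], [])
```

A break supported only by the second list is one of the "dubious guesses" described above.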

And now I can already see some mistakes. There is an f in line 16; so, by criterion (1.), there should be a parag break between 15 and 16. That is probably a bug in the file, even though the other three criteria are not met.

The break between 13 and 14 was one of my additions.  But now I would guess that there should be a break between 12 and 13, because of the wider spacing, and the break between 13 and 14 should be removed, because that am as the next-to-last word is not convincing evidence.

And there are 23.5 pages in that section...  :sad:
(18-06-2025, 04:15 PM)Jorge_Stolfi Wrote: As I posted before, my assumed criteria for identifying a parag break are
  1. If the line has one or more one-leg gallows (p or f), it must be the first one of a parag.
  2. If the line ends well before the right margin, it must be the last one of a parag.
  3. If the line ends in -m or -g, it is likely to be the last one of a parag.
  4. If the spacing above the line is larger than average, it is likely to be the first of a parag.

I'm not sure how good these criteria are, except for 2). For example, below are the one-leg gallows from [link hidden] that don't appear to be on the first lines of paragraphs. Most of them are in cPh/cFh clusters, but a few are not, and there is one with no suspicious ligatures at all, near the center of the page.

As far as I understand, p and f do frequently appear on the first lines of paragraphs, but I don't think they define the first lines of paragraphs.

[attachment=10842]
(18-06-2025, 04:47 PM)oshfdk Wrote: As far as I understand, p and f do frequently appear on the first lines of paragraphs, but I don't think they define the first lines of paragraphs.

As I see it, they are not grammatical markers of paragraphs, but only ornate letters that (as in many other medieval manuscripts) are typically used by scribes to highlight the start of paragraphs, or other noteworthy lines or words.

I would compare them to the ornate capitals that were used in other manuscripts at the start of paragraphs or sections.  Or to the still current English custom of capitalizing every word in paper titles and newspaper headlines.

I see the ornate and "bridging" gallows in some Herbal pages as instances of the same thing: bits of fanciness added by the Scribe, by influence of the general scribal customs.

Moreover, page [link hidden] is clearly special in many ways.  The Scribe may have made more liberal use of one-leg gallows there, because of its importance.  Just as the frontispiece of books is usually much more ornate than regular pages.

Or maybe some of those ps and fs are highlighting proper names. 

Or maybe some of those lines with ps and fs are indeed top-of-parag lines.  Note the -m ending on previous lines...

All the best, --jorge