17-06-2025, 09:51 PM
(17-06-2025, 09:38 PM)oshfdk Wrote: So, the Chinese text was transcribed to pinyin and then space separation of pinyin groups was treated as word breaks?
Word breaks were guessed and spaces inserted (or removed) accordingly. Google Translate is not optimized for old text: the same characters can mean different things in old and recent texts, and the word segmentation can differ as well.
I can't ask Google Translate how it does the segmentation, so I asked DeepSeek. I don't know why it assumed the text is Classical Chinese, but these are its explanations:
DeepSeek Wrote:Notes on Segmentation & Pronunciation
• Word boundaries follow Classical Chinese grammar (e.g., 上藥 shàng yào = "high-grade herbs," treated as one compound).
• Tone marks are added for clarity (e.g., wéi 爲/为, yǐ 以).
• Fixed terms (e.g., 君臣佐使 jūn chén zuǒ shǐ = "sovereign, minister, assistant, envoy") are kept intact.
• Modern Pinyin is used for readability, though some archaic pronunciations may differ.
DeepSeek Wrote:In ancient Chinese texts, words are indeed written without spaces, which can make word identification challenging. However, there are several reliable methods and principles that scholars use to segment and interpret words correctly:
1. Grammar & Syntax Rules
• Classical Chinese has a relatively fixed grammatical structure. For example:
Subjects usually precede verbs (e.g., 王曰 "The king said").
Objects often follow verbs (e.g., 伐国 "attack a state").
Modifiers (adjectives, possessives) come before nouns (e.g., 大国 "great state," 吾子 "my lord").
• Function words (e.g., 之, 乎, 者, 也) often mark boundaries between phrases.
2. Morpheme-Based Nature
• Most words in Classical Chinese are monosyllabic (one character = one word), but there are also disyllabic compounds (e.g., 天下 "all under heaven," 君子 "gentleman").
• Context helps determine whether two characters form a single word or are separate.
3. Context & Meaning
• The broader passage often clarifies word boundaries. For example:
In 學而時習之, the segmentation is 學 / 而 / 時習 / 之 ("Learn and often practice it").
• Misreading can happen if context is ignored (e.g., 可以 is often 可 / 以 "can / use to" rather than the modern 可以 "can").
4. Fixed Phrases & Idioms
• Many set phrases recur in Classical Chinese (e.g., 不亦樂乎, 如之何), making them easier to identify.
5. Punctuation in Later Texts
• Many received editions (e.g., Song dynasty printed texts) include punctuation marks (句讀 judou) that guide segmentation.
6. Comparative & Philological Methods
• Comparing parallel passages in different texts.
• Analyzing rhyme schemes in poetry (e.g., Shijing) helps confirm word divisions.
Example Analysis
Take an unpunctuated sentence:
民可使由之不可使知之
Possible interpretations depend on segmentation:
1. 民可 / 使由之 / 不可 / 使知之 ("The people can be made to follow, but not to understand.")
2. 民可使 / 由之 / 不可使 / 知之 (alternative reading with different emphasis).
The correct choice depends on context and grammatical logic.
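To illustrate why segmentation is ambiguous in the first place, here is a minimal Python sketch that lists every way to split a string into single characters or known multi-character words. This is only a toy: the function name `segmentations` and the tiny lexicon are made up for the example, and it is not how Google Translate or DeepSeek actually segment text.

```python
def segmentations(text, lexicon, max_len=4):
    """Yield every split of `text` into single characters or lexicon words.

    `lexicon` is a set of multi-character words assumed to be known;
    a single character is always accepted as a word on its own.
    """
    if not text:
        yield []
        return
    for n in range(1, min(max_len, len(text)) + 1):
        head = text[:n]
        if n == 1 or head in lexicon:
            for rest in segmentations(text[n:], lexicon, max_len):
                yield [head] + rest


# Hypothetical toy lexicon, just the two compounds mentioned in the quote.
TOY_LEXICON = {"時習", "可以"}

for seg in segmentations("學而時習之", TOY_LEXICON):
    print(" / ".join(seg))
```

Even with a two-word lexicon the sentence already has more than one valid split, and the number of candidates grows quickly as the lexicon grows. Choosing between them is the part that needs grammar, context, or a statistical model.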