(14-05-2026, 06:59 AM)JoJo_Jost Wrote: Since I’m currently looking into word boundaries and the glyph stream, I asked myself how Chinese theory can be reconciled with the 7–8 rules that define about 90% of all spaces in VMS?... If it were written phonetically, would that mean the Chinese language implicitly follows these rules?
Sorry for missing your question.
The answer will depend on the particular "Chinese" language (the Net says that there are more than 50 of them) and on the phonetic notation used.
For Mandarin (the "Chinese" of Beijing) there is a widely used phonetic notation called pinyin, which uses Latin letters plus diacritics. It looks like this:
- léi shǔ zhǔ duò tāi lìng yì chǎn shēng píng gǔ
Each syllable (word) there corresponds to one Chinese character. In that notation, it is fairly straightforward to recover the spaces if they are deleted. In fact, compounds of two or more syllables are usually strung together without a space. There are only a few ambiguous cases where the syllables of a compound have to be separated by an apostrophe. For example, shengú could be shēng ú or shēn gú; the orthography rules say that one of them is the default split, and the other one needs an apostrophe (I forgot which one).
Sometimes, when Mandarin names and places are mentioned in English texts, the diacritics are omitted. Technically Mao Zedong (Mao Tse-Tung in older spelling) could be Máo Zédōng or Mào Zēdǒng or a few dozen other names, and only one of them is correct, but usually the context says which one is meant. Even without diacritics, the splitting of compounds into syllables (like Zedong -> Ze Dong) is almost always trivial and unambiguous.
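To make the "almost always unambiguous" point concrete, here is a minimal Python sketch that lists every way a toneless pinyin string can be split into syllables. It assumes a tiny hand-picked syllable list (the real Mandarin inventory has about 400 toneless syllables); the names SYLLABLES and segmentations are made up for this illustration.

```python
# Hypothetical, deliberately tiny syllable inventory for illustration only.
SYLLABLES = {"mao", "ze", "dong", "xi", "an", "xian", "fan", "fang", "gan"}

def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into a sequence of known syllables."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Try every split of the remainder after this candidate syllable.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results

print(segmentations("zedong"))  # a single split: the unambiguous case
print(segmentations("fangan"))  # two splits: a case where an apostrophe matters
```

With this toy inventory, "zedong" has exactly one split, while "fangan" and "xian" each have two. Those are the rare cases that the apostrophe rule is there to settle.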
Another variant of pinyin uses digit suffixes 1-4, instead of diacritics, to indicate the tones. Like this:
- lei2 shu3 zhu3 duo4 tai1 ling4 yi4 chan3 sheng1 ping2 gu3
In this notation the spaces are trivially recoverable, since every syllable ends with a digit.
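As a rough sketch of how little work that recovery takes: in the digit notation a syllable boundary falls wherever a tone digit is followed by a letter. The function name restore_spaces is made up for this example, and the digit range 1-6 is an assumption chosen so that schemes with more than four tones are covered too.

```python
import re

def restore_spaces(stream: str) -> str:
    """Insert a space wherever a tone digit (1-6) is directly followed by a letter."""
    return re.sub(r"(?<=[1-6])(?=[a-z])", " ", stream)

pinyin = "lei2 shu3 zhu3 duo4 tai1 ling4 yi4 chan3 sheng1 ping2 gu3"
stripped = pinyin.replace(" ", "")   # simulate a text with all spaces deleted
print(restore_spaces(stripped))      # should reproduce the original line
```

This only works because the tone digit acts as an explicit end-of-syllable marker; no syllable list or dictionary is needed.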
This numeric encoding of tones is more often used for tonal monosyllabic languages other than Mandarin, especially those that have more than four tones. In Cantonese, the "dialect" of Hong Kong, the same characters that correspond to that pinyin text would be read and phonetically transcribed as
- to4 syu2 zyu2 do6 toi1 ling6 ji6 caan2 saang1 ping4 guk1
Vietnamese is another monosyllabic tonal language. It has an official script and spelling system based on Latin letters with lots of diacritics. The literal translation of that text into Vietnamese, as would be read by a scholar from the same Chinese characters, would be written down as something like this:
- đà thử chủ đọa thai linh dị sản sinh bình cốc.
(At least that is what Google says...) A more natural translation, that tries to respect the grammar and modern vocabulary of Vietnamese, would be
- chuột đà chủ trị phá thai giúp dễ đẻ sống ở thung lũng bằng phẳng.
Either way, I bet that recovering deleted spaces would be almost as easy as with pinyin.
All the best, --stolfi