Thanks for the numbers! You write:
(15-03-2026, 01:30 PM)nablator Wrote: All 4 or 5 tones (there is a neutral tone in Mandarin, not in Cantonese) don't exist for each syllable, so it's less than ~400*5.
IIUC Mandarin's "neutral" tone is not a fifth tone. Rather, syllables with a neutral tone will have their tone determined by the context. Thus in a precise phonetic rendering any such syllables would be rendered in two or more different ways, but with one of the four main tones.
That said, it is true that not all possible letter and tone combinations have meaning in Mandarin, and only some of the meaningful combinations will show up in a 20'000 word corpus.
But that can be said of any language.
I found some of those numbers rather surprising. I will have to check his thesis.
Offhand I would say that 20'000 words is a rather small corpus. If the corpus includes only one or two texts of each language, it could easily have an unusually small vocabulary, and therefore an unusually small number of syllables.
I recall that two translations of the Pentateuch into Chinese (or Vietnamese, not sure now) that I got had very different vocabulary sizes, apparently because one was created by a European or American missionary, the other by a native priest.
The number for Japanese seems too high. If I counted correctly, Japanese has only ~15 consonant sounds (including the voiced/unvoiced variants) and 5 vowel sounds plus three glides. With consonant doubling and vowel lengthening, that would give less than 500 possible syllables. Finding 643 in that small corpus seems surprising. Did I miss something? Maybe he counted the final 'n' as part of the previous syllable?
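To make that estimate concrete, here is a minimal sketch of one way to arrive at a bound below 500. The counting rules are my assumptions for illustration, not a description of the thesis: every consonant combines with 5 plain vowels or 3 glide+vowel nuclei, each of those can have a long vowel and/or a doubled consonant, and vowel-only syllables can only be short or long. The function name is hypothetical.

```python
def japanese_syllable_upper_bound(consonants=15, vowels=5, glides=3):
    """Rough upper bound on distinct Japanese syllables (assumed model)."""
    # Assumption: a consonant onset may be followed by a plain vowel
    # or by one of the glide+vowel combinations.
    nuclei = vowels + glides
    # Each consonant+nucleus pair may have a short or long vowel (x2)
    # and a single or doubled consonant (x2).
    with_onset = consonants * nuclei * 2 * 2
    # Assumption: vowel-only syllables have no consonant to double,
    # so only short vs. long vowel applies.
    vowel_only = vowels * 2
    return with_onset + vowel_only


bound = japanese_syllable_upper_bound()
print(bound)  # 15*8*4 + 5*2 = 490
```

Under these assumptions the bound is 490. Counting the final 'n' as part of the preceding syllable would roughly double it, which would be enough to make 643 plausible.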
On the other hand, the numbers for Mandarin and Cantonese seem way too low. Maybe because his corpus of "20'000 words" for those languages was actually 20'000 syllables?
The fact that he got almost exactly the same number for Mandarin and Cantonese suggests that he used the same text in ideographic characters but rendered phonetically with Mandarin or Cantonese readings. The difference of 20 would then be due to homophones (different characters that have the same pronunciation).
And I would guess that his very high number for English must be the result of counting syllables of the written language, according to the traditional hyphenation rules, rather than of the spoken one with some phonetic definition of syllable. So that 'to', 'too', 'two' would count as three distinct syllables, and 'squirrelled' would be a single distinct syllable...
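A toy illustration of how much the counting convention matters, using only the three words above. The pronunciation table is a hand-made stand-in (a rough "tu" for all three), and the helper name is hypothetical; a real count would need a proper pronouncing dictionary and a phonetic definition of syllable.

```python
def count_distinct(tokens, key=lambda t: t):
    """Count distinct items after mapping each token through `key`."""
    return len({key(t) for t in tokens})


# The three one-syllable words from the example.
tokens = ["to", "too", "two"]

# Hand-made pronunciation table (assumption, for illustration only).
pronunciation = {"to": "tu", "too": "tu", "two": "tu"}

written_types = count_distinct(tokens)                    # by spelling
spoken_types = count_distinct(tokens, pronunciation.get)  # by sound

print(written_types, spoken_types)  # 3 1
```

Counted by spelling there are three distinct syllables, counted by sound only one. The same kind of inflation applied across a whole English text could easily account for a very high number.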
All the best, --stolfi