Anton > 05-02-2018, 01:56 PM
Koen G > 05-02-2018, 03:39 PM
ReneZ > 05-02-2018, 04:50 PM
davidjackson > 05-02-2018, 06:20 PM
Quote:Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmical form, and eventually saturates.
Linyuan Lü et al.
Anton > 05-02-2018, 07:54 PM
(05-02-2018, 03:39 PM)Koen Gh. Wrote: On infinity, remember that proper names and numbers are also words which can appear in a text. Also, a language like Dutch theoretically allows for infinite compounding, though in practice this is just as impossible as an infinite text. Still, with these three tools, language can in fact generate an infinite amount of different words.
Quote:Anton,
there is a relationship between F and D as well.
Quote:Heaps' law suggests that the dictionary size continues to increase as the collection adds documents, and that no maximum dictionary size is ever reached.
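For reference, Heaps' law is usually written in the following form. The symbols are the conventional ones and do not come from this thread: V(n) is the number of distinct words among the first n words of a text, and K and β are empirical constants fitted per corpus.

```latex
V(n) = K\, n^{\beta}, \qquad K > 0,\quad 0 < \beta < 1
```

Because β is positive, V(n) keeps growing as n grows, only ever more slowly. That is the sense in which no maximum dictionary size is reached under this model.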
DonaldFisk > 06-02-2018, 05:31 AM
(05-02-2018, 06:20 PM)davidjackson Wrote: As for Heaps' observation with different natural languages:
Quote:Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmical form, and eventually saturates.
Linyuan Lü et al.
I skimmed through the paper and it appears that they're using hanzi when processing Chinese text, kanji or kana for Japanese, and syllables for Korean, when they should be using words (which are generally polysyllabic in all three languages, and in Korean are separated by spaces). The number of words in each of the three languages will be similar to European languages and I would be surprised if Zipf's Law and Heaps' Law don't apply to them.
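The difference between counting characters and counting words can be seen with a minimal sketch. This is an illustration only, not the paper's method: the function name `vocabulary_growth` and the toy English sentence are assumptions made up for the example, and real Chinese or Japanese text would first need a word segmenter.

```python
def vocabulary_growth(tokens):
    """Return a list whose i-th entry is the number of distinct
    tokens seen among the first i+1 tokens (a Heaps-style curve)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Toy text, assumed for illustration only.
text = "the cat sat on the mat and the dog sat on the log"

# Word-level units: split on whitespace.
word_growth = vocabulary_growth(text.split())

# Character-level units: drop the spaces and iterate over characters.
char_growth = vocabulary_growth(text.replace(" ", ""))

print("distinct words per position:     ", word_growth)
print("distinct characters per position:", char_growth)
```

With character-level units the set of possible tokens is small and fixed, so the curve has to flatten once most of them have appeared. Word-level units have no such ceiling, so the two curves need not behave the same way on a long text.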
ReneZ > 06-02-2018, 06:49 AM
Helmut Winkler > 06-02-2018, 11:05 AM
(06-02-2018, 06:49 AM)ReneZ Wrote: To David, from post nr.14:
there are some 10 or so papers about the Voynich MS in Cryptologia. This journal requires subscription, but one can
MarcoP > 07-02-2018, 01:27 PM