The Voynich Ninja
Heap's Law - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Heap's Law (/thread-2279.html)



RE: Heap's Law - Anton - 05-02-2018

I think that, given the implied range of its parametric values (which, furthermore, would be language-dependent, as far as I understand), this Heaps law is too vague to be deemed a "law", or at least a "practical law". Moreover, it implies that with an infinite volume of text the vocabulary would also be infinite, which is simply not true for a natural language, whose vocabulary is apparently finite.

I would also say that the Heaps law is somewhat redundant, since one can derive a volume-vocabulary relationship from the Zipf law, but it is not the Heaps law that one arrives at.

Consider a text utilizing a vocabulary of D words. If the count of the most frequent word is F, then according to the Zipf law the count of the second most frequent word would be F/2, the count of the third most frequent word would be F/3, et cetera, and the count of the D-th most frequent word (that is, of the least frequent word) would be F/D. The sum total, which is the volume of the text, would be V = F*(1 + 1/2 + 1/3 + ... + 1/D). What is in brackets is the so-called "harmonic series", whose sum, for large values of D, is approximated as ln(D) + g, where g is Euler's constant (approximately 0.577). Hence, V = F*[ln(D) + g].
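A quick numerical check of the approximation used above (the values of F and D are arbitrary illustrations, not fitted to any corpus):

```python
import math

# For a vocabulary of D words whose frequencies follow Zipf's law with top
# count F, the text volume is V = F * H(D), where H(D) is the D-th harmonic
# number, and H(D) ~ ln(D) + g with g the Euler-Mascheroni constant.

g = 0.5772156649  # Euler-Mascheroni constant

def volume(F, D):
    """Exact text volume V = F * (1 + 1/2 + ... + 1/D)."""
    return F * sum(1.0 / r for r in range(1, D + 1))

def volume_approx(F, D):
    """Approximation V = F * (ln D + g), accurate for large D."""
    return F * (math.log(D) + g)

F, D = 1000, 5000
print(volume(F, D))         # exact harmonic sum
print(volume_approx(F, D))  # ln-approximation, very close for large D
```

For D = 5000 the two values agree to within a small fraction of a percent, since the error of the approximation shrinks like 1/(2D).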

Now, expressing D (vocabulary) through V (volume), we get from this: ln(D) = V/F - g, that is, D = A*exp[V/F], where A = exp(-g) (approximately 0.561) is a constant I introduced for brevity; note that A is positive but less than unity.

In contrast to this, the Heaps law suggests D = K*V^b, where K and b are positive constants, with K >> 1 and b < 1. The Wikipedia article argues that the two laws are "asymptotically equivalent", referring to papers whose full texts are not available for free, so I could not check them; but offhand I can't see how they would be, since a brief check shows me that, with V > F (which of course holds by definition) and b < 1, A*exp[V/F] grows faster than K*V^b.
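The growth comparison can be illustrated numerically (all constants here, including the Heaps parameters K and b, are arbitrary choices for the sketch, not fitted values):

```python
import math

# With F held fixed, the Zipf-derived D = A * exp(V/F) eventually outgrows
# any Heaps curve D = K * V**b with b < 1, whatever the constants.

g = 0.5772156649
F = 1000
A = math.exp(-g)           # the constant A = exp(-g) derived above

K, b = 10.0, 0.6           # hypothetical Heaps parameters

def d_zipf(V):
    return A * math.exp(V / F)

def d_heaps(V):
    return K * V ** b

for V in (2_000, 10_000, 50_000):
    print(V, d_zipf(V), d_heaps(V))
# The exponential starts below the power law but overtakes it once V is a
# modest multiple of F.
```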


RE: Heap's Law - Koen G - 05-02-2018

On infinity, remember that proper names and numbers are also words which can appear in a text. Also, a language like Dutch theoretically allows for infinite compounding, though in practice this is just as impossible as an infinite text. Still, with these three tools, language can in fact generate an infinite amount of different words.


RE: Heap's Law - ReneZ - 05-02-2018

Anton,

there is a relationship between F and D as well. The least frequent word should appear once; this does not mean that F = D, but I would suggest taking F = D/2, i.e. cutting off the harmonic series when the count drops below 0.5. Basically rounding.
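Once F is coupled to D in this way, F is no longer a constant independent of V. A small sketch (the choice F = D/2 is Rene's suggestion; the sample values of D are arbitrary) estimates the local slope d(ln D)/d(ln V), which plays the role of the Heaps exponent b:

```python
import math

g = 0.5772156649

def V(D):
    """Text volume under Zipf with top count F = D/2: V = F * (ln D + g)."""
    return (D / 2) * (math.log(D) + g)

def effective_b(D):
    """Local Heaps exponent d(ln D)/d(ln V), estimated by doubling D."""
    return (math.log(2 * D) - math.log(D)) / (math.log(V(2 * D)) - math.log(V(D)))

for D in (1_000, 10_000, 100_000):
    print(D, round(effective_b(D), 3))
# The slope stays below 1: coupling F to D turns the exponential relation
# into sub-linear, Heaps-like growth of D in V.
```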

Apart from that, it is always dangerous to use the word 'infinite'. ;-)

The number of words ever written down in the history of mankind is rather large, but still finite.

10^16 seems like a reasonable upper limit.


RE: Heap's Law - davidjackson - 05-02-2018

Rene - I don't have those articles you cite. Are they available?

Anton - rather than suggesting infinity, Heaps' law suggests that the dictionary size continues to increase as the collection adds documents, and that no maximum dictionary size is ever reached. Which is logical: in natural texts the entire corpus of the OED will never be used, but the more text there is, the more likelihood there is of rare and technical terms being used. We see a massive increase at the beginning of the text, which quickly trails off.
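The "massive increase that trails off" can be reproduced with a small simulation (dictionary size, text length and the seed are arbitrary choices for illustration): draw tokens from a Zipf-like rank-frequency distribution and track how many distinct types have been seen.

```python
import random

random.seed(0)
D = 5000                                      # hypothetical dictionary size
weights = [1.0 / r for r in range(1, D + 1)]  # Zipf: weight ~ 1/rank

# Sample a 20,000-token "text" from the Zipf distribution.
tokens = random.choices(range(D), weights=weights, k=20_000)

seen, growth = set(), []
for t in tokens:
    seen.add(t)
    growth.append(len(seen))

# Types found after 1K, 10K and 20K tokens: rapid growth at the start,
# trailing off later.
print(growth[999], growth[9_999], growth[19_999])
```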

As for Heaps' observation with different natural languages:

Quote:Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmical form, and eventually saturates.
Linyuan Lü et al. [link]


Torstein - you are correct, I only used one point for each Currier language. It is entirely possible that the first few pages use the bulk of the dictionary and then repeat those words.


RE: Heap's Law - MarcoP - 05-02-2018

Not sure. Is Rene's graph discussed at [link] related?

"(Note that the figure headings say 'tokens' which by new convention should be 'types'.)"


[Image: attachment.php?aid=1823]


RE: Heap's Law - Anton - 05-02-2018

(05-02-2018, 03:39 PM)Koen Gh. Wrote: On infinity, remember that proper names and numbers are also words which can appear in a text. Also, a language like Dutch theoretically allows for infinite compounding, though in practice this is just as impossible as an infinite text. Still, with these three tools, language can in fact generate an infinite amount of different words.

Well, it possibly can, but it does not. Of course there are neologisms appearing, et cetera, but I understand that the spirit of the "Heaps law" is to consider a "snapshot" of (the vocabulary of) a language at the time when the text under test was created, isn't it?

Good point about numbers, especially if we consider them words in themselves; but even without that, one can infinitely invent new terms to designate large numbers - like thousand, million, billion, etc. The same holds for neologisms terming new objects. Philosophically speaking, the variety of material objects is infinite, and so could be the language constructs that designate them.

But alas, the lifetime of a language is necessarily limited, at least to that of mankind, so the vocabulary is fundamentally limited in terms of neologisms and internet usernames. The size of a text, meanwhile, is not fundamentally limited by anything.

But, that philosophy apart, consider an arbitrary text (of sufficient length) in a given language. Next, construct another text, say, twice as long, but using exactly the same vocabulary; the simplest way to do that would be just to copy and paste. What would the Heaps law tell us then? Nay, there is some fundamental flaw about it.
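The copy-and-paste thought experiment takes a few lines to state as code (the toy corpus and the Heaps parameters K, b are arbitrary illustrations): doubling the text adds no new types, while Heaps' law predicts vocabulary growth by a factor of 2^b.

```python
# A toy text: doubling it by copy-paste keeps the vocabulary fixed.
text = "the quick brown fox jumps over the lazy dog".split()
doubled = text + text

types_once = len(set(text))
types_twice = len(set(doubled))
print(types_once, types_twice)   # identical: copying adds no new types

K, b = 10.0, 0.6                 # hypothetical Heaps parameters
heaps_once = K * len(text) ** b
heaps_twice = K * len(doubled) ** b
print(heaps_twice / heaps_once)  # Heaps predicts a factor 2**b increase
```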

Quote:Anton,

there is a relationship between F and D as well.

Ah yes, surely a very good point. That must be where they manage to get their "asymptotic equivalence" or whatever.

Quote:Heaps' law suggests that the dictionary size continues to increase as the collection adds documents, and that no maximum dictionary size is ever reached.

Qualitatively it is logical to a certain extent (see above), but the Heaps law goes beyond that and puts forward a quantitative dependence which IMHO is not well enough defined to serve as a "law" quantitatively (with K varying by orders of magnitude!) and is somewhat flawed on the fundamental side (as I suggested above).

So maybe one can draw useful conclusions from observing that a text does not fit the Heaps law, but maybe one cannot say anything definite if a text does fit it.


RE: Heap's Law - DonaldFisk - 06-02-2018

(05-02-2018, 06:20 PM)davidjackson Wrote: As for Heaps' observation with different natural languages:

Quote:Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf's plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmical form, and eventually saturates.
Linyuan Lü et al. [link]

I skimmed through the paper and it appears that they are using hanzi when processing Chinese text, kanji or kana for Japanese, and syllables for Korean, when they should be using words (which are generally polysyllabic in all three languages, and in Korean are separated by spaces). The number of words in each of the three languages will be similar to European languages, and I would be surprised if Zipf's law and Heaps' law did not apply to them.
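The segmentation choice matters because a character inventory is finite while a word inventory keeps growing. A toy English example (purely illustrative) makes the contrast visible:

```python
# Counting distinct characters saturates at the alphabet size, while
# counting distinct words can keep growing with the text.

text = ("heaps law describes vocabulary growth while zipf law describes "
        "rank frequency and both depend on how tokens are segmented")

words = text.split()
chars = [c for c in text if c != " "]

print(len(set(words)), len(words))   # word types vs word tokens
print(len(set(chars)), len(chars))   # character types saturate quickly
```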



RE: Heap's Law - ReneZ - 06-02-2018

To David, from post nr.14:

there are some 10 or so papers about the Voynich MS in Cryptologia. This journal requires a subscription, but one can buy individual articles online, which is not inexpensive - probably similar to JSTOR prices.
Authors usually get some reprints (depending on the journal / book) which they can hand out. Apart from that, small numbers of copies may circulate in the frame of 'collaboration', which is based on the 'fair use' provision in the copyright laws.

To Marco, from post nr.15:

You are right in both cases. That page, and especially those graphs, desperately need to be redone, among other things because of the new transcriptions, and because, as it stands, it is a bit cryptic.


RE: Heap's Law - Helmut Winkler - 06-02-2018

(06-02-2018, 06:49 AM)ReneZ Wrote: To David, from post nr.14:

there are some 10 or so papers about the Voynich MS in Cryptologia. This journal requires subscription, but one can 
 

At least in Germany, journals like Cryptologia are available in the big (university) libraries.


RE: Heap's Law - MarcoP - 07-02-2018

These are the graphs for the first 400 / 2000 words in VMS A / B (Takahashi's transcription) and the two "synthetic" versions of the text posted by [link] and [link].
As always, I could have messed up something: be careful.

PS: I add the 10K words graph, which also looks interesting.
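Curves like these can be computed with a few lines of code (the tokenisation and the sample word list below are assumptions; the toy tokens merely stand in for a real transcription):

```python
def type_token_curve(tokens, limit):
    """Number of distinct word types after each of the first `limit` tokens."""
    seen, curve = set(), []
    for t in tokens[:limit]:
        seen.add(t)
        curve.append(len(seen))
    return curve

# Usage with a toy token list standing in for an EVA transcription:
sample = ["daiin", "ol", "chedy", "daiin", "qokeedy", "ol", "shedy"]
print(type_token_curve(sample, 7))  # [1, 2, 3, 3, 4, 4, 5]
```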