(11-02-2018, 12:44 PM)davidjackson Wrote: Quote: MarcoP: David, what do you mean when you say that the ratio of creation is "constant"? Do you mean that the number of unique words increases linearly with text length?
Here I was referring to Torsten's text. Words in his corpus were created at a linear rate, whereas the Heaps' law curve is a power law, and hence the two diverge.
If you look at the graph from the stars section, you see how the creation rate rises and falls (something which I assume, without proof, is due to the introduction of new topics) but still follows the power-law curve.
Thank you, David!
A straight line is of course a very particular case of f(x)=A*x^B with B=1.
The attached curves correspond to random files of words (each word 4 characters long) based on alphabets of different sizes: 8 characters, 12 characters, 26 characters.
With 8 characters, you can only have 8^4=4096 words; with 10000 random tries, we have almost the whole dictionary covered.
On the other hand, with 26 characters, the possible "dictionary" of words (26^4=456976) is much larger than the x-range of this graph: we have only covered a tiny fraction of it, and each next word is very unlikely to have occurred before. The corresponding blue line is very close to f(x)=x.
Obviously, with the 12-character alphabet, we get something intermediate between the two.
These are samples from the random files:
Code:
==> rnd8.txt <==
fbba bgfc bhgc adba bced cded dfhc aceh egag fgaa ahge ecde ebge hhed egab edah heff efaf fddg hbhc
hhhh aace eegd bcag bcaf egbc fcfh accg cggh gfdh hdgg hcfd eede ccbg cceh aabd eheh bfeg hcdg dgbh
==> rnd12.txt <==
gjbl dglh jgkc hhke cjha babg idih ejge egjf ffef klhj fbgk jldh aajc fljj lcie ljga hleh lhda jhag
bjal lbde jeha lhfd efif afah iflj hfae lfel bklj bkca dlbk eleb bdfe ebck lehi figk ldag ahie ehga
==> rnd26.txt <==
lrdb revy asdn ajsn zhmy ajrs glve lhkr pywn kdoj jijz asbs xuku cpdk vsvz uwyw aacf okdy pgxa hsik
rlvl uagf qmrr rlmi qdmf zked luvj zhcz mhoa pqgw praj icxo wzdb dbie anmv dytl dvul vkea mmxh bxds
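For anyone who wants to reproduce curves of this kind, here is a minimal Python sketch. It is not the script used for the attached graph: the function names, the fixed seed and the use of the first k lowercase letters as the alphabet are my own assumptions. It counts the distinct words seen so far, and also evaluates the standard expectation N*(1-(1-1/N)^n) for the number of distinct words after n uniform draws from a dictionary of N words.

```python
import random
import string


def unique_word_curve(alphabet_size, n_words, word_len=4, seed=0):
    """Return a list whose i-th entry is the number of distinct words
    among the first i+1 uniformly random words (hypothetical helper)."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:alphabet_size]
    seen = set()
    curve = []
    for _ in range(n_words):
        word = "".join(rng.choice(alphabet) for _ in range(word_len))
        seen.add(word)
        curve.append(len(seen))
    return curve


def expected_unique(alphabet_size, n_words, word_len=4):
    """Expected number of distinct words after n_words uniform draws
    from a dictionary of alphabet_size**word_len equally likely words."""
    dictionary_size = alphabet_size ** word_len
    return dictionary_size * (1 - (1 - 1 / dictionary_size) ** n_words)


if __name__ == "__main__":
    for k in (8, 12, 26):
        curve = unique_word_curve(k, 10000)
        # Columns: alphabet size, dictionary size, simulated count, expected count
        print(k, k ** 4, curve[-1], round(expected_unique(k, 10000)))
```

By the formula, 10000 draws are expected to cover roughly 91% of the 4096-word dictionary for 8 characters, while for 26 characters about 99% of the 10000 words are expected to be distinct, which is why that curve stays so close to f(x)=x.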
What happens with language is something similar, in my opinion. The number of legal combinations of characters is limited, not by the size of the alphabet, but by linguistic rules. Moreover (something my simple experiment doesn't capture), some words are more frequent than others (e.g. the function words you discussed a few months ago).
Now, if Torsten's algorithm produces something close to f(x)=0.5*x, it could be that it copies half of the words from the already generated text (and these words obviously do not add to the Heaps' law count of unique words) and randomly generates the other words (similarly to the blue line representing the rnd26 file). While copying already generated words is an excellent way to approximate Heaps' law, one must also find a way to appropriately constrain the generation of new words.
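To illustrate this guess, here is a small Python sketch of a copy-or-generate process. It is only a toy model of my hypothesis, not Torsten's actual algorithm: the function name, the copy_prob parameter and the uniform 26-letter, 4-character generator are all assumptions made for the example.

```python
import random
import string


def copy_or_generate(n_words, copy_prob=0.5, alphabet_size=26, word_len=4, seed=0):
    """Toy model (hypothetical): each word is either copied from the text
    generated so far or drawn uniformly at random. Returns the running
    count of distinct words."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:alphabet_size]
    text = []
    seen = set()
    curve = []
    for _ in range(n_words):
        if text and rng.random() < copy_prob:
            # A copied word already occurred, so it never adds a new type.
            word = rng.choice(text)
        else:
            # A freshly generated word is new unless it happens to collide.
            word = "".join(rng.choice(alphabet) for _ in range(word_len))
        text.append(word)
        seen.add(word)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    curve = copy_or_generate(10000)
    # Ratio of distinct words to total words at the end of the text
    print(curve[-1] / len(curve))
```

With copy_prob=0.5 and a dictionary this large, the curve stays close to a straight line of slope 0.5, because almost every generated word is new and no copied word is. Bending it toward a power law would require constraining the generator, which this sketch deliberately does not attempt.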
Have you thought of also examining [link unavailable]? My impression is that he has a good amount of "morphological rules" in place, so his dictionary should be more constrained and possibly produce a curve similar to that of the actual VMS.