Torsten > 07-02-2018, 03:18 PM
ReneZ > 07-02-2018, 03:47 PM
Koen G > 07-02-2018, 03:53 PM
MarcoP > 07-02-2018, 04:55 PM
(07-02-2018, 03:47 PM)ReneZ Wrote: The change in slope in VMS-B coincides with the transition from Herbal-B pages to Biological pages.
My guess was that it is around 3300 words.
It is a very clear change in trend, showing that the Bio-B text is more repetitive or has a smaller vocabulary.
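For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a type-token growth curve and of the average slope over a stretch of text. The function names and the toy word list are made up for illustration; this is not the script behind the plots in this thread, and a real run would read tokens from a transliteration file instead.

```python
def vocabulary_growth(tokens):
    """Return a list `curve` where curve[n] is the number of
    distinct words (types) among the first n tokens; curve[0] is 0."""
    seen = set()
    curve = [0]
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def mean_slope(curve, start, end):
    """Average number of new types per token between token
    counts `start` and `end` (with start < end)."""
    return (curve[end] - curve[start]) / (end - start)


# Toy data (hypothetical): four new words, then only repeats.
toy_tokens = ["a", "b", "c", "d", "a", "a", "b", "a"]
toy_curve = vocabulary_growth(toy_tokens)

first_half = mean_slope(toy_curve, 0, 4)   # every token is a new type
second_half = mean_slope(toy_curve, 4, 8)  # no new types at all
```

A lower slope in the second stretch means fewer new words per token, which is what "more repetitive or smaller vocabulary" looks like on such a curve.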
Torsten > 07-02-2018, 06:14 PM
davidjackson > 07-02-2018, 11:25 PM
Quote:Denote r the rank of a word according to its frequency Z(r), Zipf's law is the relation Z(r) ~ r^(-α), with α being the Zipf's exponent. Heaps' law is formulated as N_t ~ t^λ, where N_t is the number of distinct words when the text length is t, and λ ≤ 1 is the so-called Heaps' exponent.
Quote:Different probability models of text generation (under the assumption that Zipf’s law is fulfilled) result in a simple relation between the exponents α and λ:
λ = α^(-1)
But that's a project for another day.
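As a rough illustration of how the two exponents in the quote could be estimated, here is a sketch that fits a straight line in log-log space by ordinary least squares. The helper name and the synthetic numbers are assumptions for the example; they are exact power laws chosen so that the fit is easy to check, not measurements from the VMS.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).
    For y ~ x^k this slope estimates the exponent k."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Synthetic Heaps data (hypothetical): N_t = t^0.5 exactly.
lengths = [10, 100, 1000, 10000]
distinct = [t ** 0.5 for t in lengths]
heaps_lambda = loglog_slope(lengths, distinct)

# Synthetic Zipf data (hypothetical): Z(r) = 1000 * r^(-2) exactly.
ranks = [1, 2, 4, 8]
freqs = [1000 * r ** -2.0 for r in ranks]
zipf_alpha = -loglog_slope(ranks, freqs)  # Zipf slope is -alpha
```

With these made-up inputs the fitted values satisfy the quoted relation, since 0.5 = 1/2. On real text the points only follow a power law approximately, so the fit and the relation should be treated as rough.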
Emma May Smith > 08-02-2018, 12:44 AM
Quote:I repeat the first experiment with the 12,974 words from the "stars" section, and again obtain a good fit to the formula prediction. We see a decrease in the number of expected uniques in the 2200-2600 word range, then it picks up again. I have no idea why BUT I suspect that may be because of an aberrant text with a large number of repetitions sneaking in there - I'll check the data again to confirm this tomorrow.
MarcoP > 08-02-2018, 11:38 AM
Anton > 08-02-2018, 11:53 AM
davidjackson > 09-02-2018, 06:55 PM
A confession - my Excel formula was pointing to the wrong column, so the results above are wrong. (The formula prediction was running against tokens found, not total types.) So my post above was a waste of everybody's time, sorry.