NOTE: As explained in the post linked above, the original graphs in this post were incorrectly graphed. Please ignore them.
I took my original research in a new direction by taking samples from the manuscript to see if they formed a curve in accordance with Heaps' law. I used the software and web resources outlined in post 1 of this thread.
I wanted to plot a line for the real results, calculate the variables A and b for the formula of Heaps' law (which states that the total number of unique tokens = A * (total tokens)^b), then predict the expected number of unique tokens and see whether the two lines were similar.
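As a sketch of one standard way to obtain A and b (the software mentioned in post 1 may do it differently): since V = A * N^b becomes a straight line in log-log space, an ordinary least-squares fit on the logs recovers both parameters. The helper name fit_heaps and the synthetic data below are my own; the values A = 5 and b = 0.8 are used only to check that the fit recovers known parameters.

```python
import math

def fit_heaps(totals, uniques):
    """Fit V = A * N**b by least squares in log-log space.

    totals/uniques are parallel lists of corpus sizes and
    unique-token counts; returns the pair (A, b).
    """
    xs = [math.log(n) for n in totals]
    ys = [math.log(v) for v in uniques]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope of the log-log regression line is the exponent b.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    # Intercept gives log(A).
    a = math.exp(my - b * mx)
    return a, b

# Synthetic check: data generated from V = 5 * N**0.8
# should recover A = 5, b = 0.8 exactly.
totals = [500, 1000, 2000, 5000, 10000]
uniques = [5 * n ** 0.8 for n in totals]
A, b = fit_heaps(totals, uniques)
print(round(A, 2), round(b, 2))  # → 5.0 0.8
```

On real (noisy) data the recovered values will only be approximate, and the fit is dominated by the larger samples, which matches the observation below that the red line fits better at larger corpus sizes.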
Here is the graph for all of the text of the Herbal section (I split it into about 18 different files in order to get an increasing corpus size). The formula (red line) fits better for larger amounts of text than for small amounts, but this is expected. The values of A (5) and b (0.8) are extreme, but this is expected for small amounts of data, and the two lines start to converge once we get past 10,000 tokens.
This is a graph of the unique words against total words taken from:
f1r,f1v,f2r,f2v,f3r,f18r,f18v,f29r-f30v,f50r,f50v,f76r,f76v,f82r,f82v,f111r,f111v,f114v
These pages give me a total of 4874 words (tokens). I wasn't really expecting anything from this sample, because after I ran the numbers I realised it was a very small sample of text. But then I graphed it:
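As an aside, the (total tokens, unique tokens) pairs behind such a graph can be collected in a single pass over the token stream, rather than by splitting the text into separate files. A minimal sketch, with a purely artificial token stream standing in for the real transcription of these folios (the helper name heaps_points is my own):

```python
def heaps_points(tokens, step=1000):
    """Walk a token stream and record (total, unique) pairs every
    `step` tokens -- the raw data points for a Heaps'-law plot."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    # Record the final partial step too, so the sample's full
    # 4,874 tokens appear as the last point.
    if len(tokens) % step:
        points.append((len(tokens), len(seen)))
    return points

# Artificial stream: 4,874 tokens drawn from 700 distinct words.
tokens = [f"w{i % 700}" for i in range(4874)]
pts = heaps_points(tokens, step=1000)
print(pts[-1])  # → (4874, 700)
```

A real text would of course show the unique count still climbing at the last point rather than saturating as this toy stream does.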
The variables of the formula remain the same as in the first graph - the best-fit line continues. But just as interestingly, we see a slowdown in the growth of unique tokens once the total token count goes above 3,000, just as the formula prediction (red line) predicts. That's just what MarcoP's graph above shows. However, I am mixing A & B pages together here.
I repeat the first experiment with the 12,974 words from the "stars" section, and again obtain a good fit to the formula prediction. We see a dip in the growth of expected uniques in the 2,200-2,600 word range, then it picks up again. I have no idea why, but I suspect it may be because an aberrant text with a large number of repetitions sneaked in there - I'll check the data again tomorrow to confirm this.
I now run the test for the whole corpus and add it to the previous results (in order to extend the sample to its maximum testing potential). A small adjustment of b from 0.8 to 0.82 is all that is needed to correct the formula prediction line.
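To get a rough sense of why such a small change in the exponent matters at full-corpus scale, consider the ratio between the two predictions. A sketch, assuming A = 5 and a whole-corpus size of about 37,000 tokens (a round figure assumed purely for illustration, not a number from this post):

```python
def heaps_predict(n, a=5.0, b=0.8):
    """Predicted unique-token count under Heaps' law V = A * N**b."""
    return a * n ** b

# At N = 37,000 tokens (assumed round figure), nudging b from
# 0.80 to 0.82 shifts the predicted unique count by about 23%,
# because the ratio is simply N**0.02.
n = 37_000
v_old = heaps_predict(n, b=0.80)
v_new = heaps_predict(n, b=0.82)
print(round(v_new / v_old, 3))  # → 1.234
```

So an adjustment that looks tiny in the exponent is a sizeable correction at the right-hand end of the graph, which is exactly where the whole-corpus points extend the curve.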
Here's the data the above graphic was based upon:
Although these graphs by themselves do not prove anything, it is interesting to see how the variables remain constant across different sections of the book. However, these values are extreme: A is usually quoted to be in the range 30 < A < 100, and b is usually around 0.5. For extremely small samples, though (and the VMS is small compared to the corpora usually used in such tests), a value of A close to 100 is not considered abnormal.
The next step is to consider the relationship between the above variables and any information we can extract from Zipf's law:
Quote: Denote by r the rank of a word according to its frequency Z(r). Zipf's law is the relation Z(r) ~ r^(-α), with α being the Zipf exponent. Heaps' law is formulated as N_t ~ t^λ, where N_t is the number of distinct words when the text length is t, and λ ≤ 1 is the so-called Heaps exponent.
Quote: Different probability models of text generation (under the assumption that Zipf's law is fulfilled) result in a simple relation between the exponents α and λ: λ = α^(-1).
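The quoted relation λ = 1/α can be sanity-checked on a toy token stream: estimate α from the rank-frequency curve (again via a log-log fit, one standard approach), then predict λ. The helper name zipf_exponent and the artificial frequencies f(r) = 1000 // r below are my own assumptions:

```python
import math
from collections import Counter

def zipf_exponent(tokens, max_rank=100):
    """Estimate the Zipf exponent alpha from the rank-frequency
    curve Z(r) ~ r**-alpha, via least squares in log-log space."""
    freqs = sorted(Counter(tokens).values(), reverse=True)[:max_rank]
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return -slope  # Zipf slope is negative; alpha is its magnitude

# Toy stream whose frequencies follow f(r) = 1000 // r, i.e. an
# almost exact Zipf curve with alpha = 1; the predicted Heaps
# exponent is then lambda = 1/alpha, close to 1.
tokens = []
for r in range(1, 201):
    tokens += [f"w{r}"] * (1000 // r)
alpha = zipf_exponent(tokens)
lam = 1 / alpha  # the quoted relation: lambda = alpha**-1
print(round(alpha, 2), round(lam, 2))
```

Whether the VMS's fitted b of about 0.8 is consistent with its Zipf exponent under this relation is exactly the comparison the quote suggests.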
But that's a project for another day.