The Voynich Ninja
Heap's Law - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Heap's Law (/thread-2279.html)



RE: Heap's Law - Torsten - 07-02-2018

The difference between vmsA and vmsB is interesting. Even within vmsA and vmsB the rate of vocabulary growth changes from time to time. There is, for instance, a break for vmsB at x=3500. For vmsA multiple smaller breaks exist. It seems that the parts within vmsA and vmsB are not very homogeneous. Therefore it would be interesting to know which quire/part of the manuscript corresponds to which part of the graph.
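For anyone who wants to reproduce this kind of curve, here is a minimal sketch of computing cumulative vocabulary growth from a transliteration. The file name and the assumption of one whitespace-separated EVA word per token are hypothetical.

Code:
def vocab_growth(words):
    """Cumulative number of distinct word types after each token."""
    seen, growth = set(), []
    for w in words:
        seen.add(w)
        growth.append(len(seen))
    return growth

# Hypothetical input: a plain-text Currier-B word list, one EVA word
# per whitespace-separated token.
with open("vms_B_words.txt") as f:
    words_b = f.read().split()

growth_b = vocab_growth(words_b)
# Plotting growth_b against the token index makes slope breaks
# (e.g. near token 3500) visible as kinks in the curve.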


RE: Heap's Law - ReneZ - 07-02-2018

The change in slope in VMS-B coincides with the transition from Herbal-B pages to Biological pages.
My guess was that it is around 3300 words.

It is a very clear change in trend, showing that the Bio-B text is more repetitive or has a smaller vocabulary.


RE: Heap's Law - Koen G - 07-02-2018

Interesting. Is it possible to extract the "new" words from the second part of the B-sample? This might mean that those are lexical words (nouns, adjectives, verbs) related to the new subject. It might offer us a glimpse into Voynichese grammar.
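A minimal sketch of how such an extraction could be done, assuming a word list like the one loaded in the sketch above; the cutoff token index is hypothetical (the posts below put the section boundary somewhere around token 3300-3500).

Code:
def new_words_after(words, cutoff):
    """Word types attested only after the first `cutoff` tokens."""
    before = set(words[:cutoff])
    return sorted(set(words[cutoff:]) - before)

# e.g. new_words_after(words_b, 3445) would list the candidate
# "new" vocabulary of the second part of the B sample.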


RE: Heap's Law - MarcoP - 07-02-2018

(07-02-2018, 03:47 PM)ReneZ Wrote: The change in slope in VMS-B coincides with the transition from Herbal-B pages to Biological pages.
My guess was that it is around 3300 words.

It is a very clear change in trend, showing that the Bio-B text is more repetitive or has a smaller vocabulary.

Thank you, Rene!
If I counted words correctly, the Bio section starts at word 3445.
BTW, it's interesting that there is no clear "jump up" at the start of the new section, just a flattening. As if the new section, more than bringing in new words, simply marked a more consistent vocabulary, as you noticed. It is as if the "herbal" pages were more independent of one another, each "plant" introducing several new (unique?) terms, while Q13 is more uniform.


RE: Heap's Law - Torsten - 07-02-2018

Quire 13 is the most repetitive part of the VMS. 

A second effect is that the curve flattens for pages containing more text. This happens because in the VMS the vocabulary can change from page to page. In other words, a single page full of text uses a smaller vocabulary than multiple pages containing the same amount of text. Therefore it would be interesting to compare the curve for quire 13 with the curve for quire 20.

Two new words on the page in question are 'qokechdy' and 'olkeedy'. But this is less surprising if you know that the word 'qokeedy' is typical for quire 13 and that words co-occur with similar ones. In my eyes 'qokeedy' co-occurs, for instance, with words like 'qokedy', 'qoteedy' and 'qotedy' (see the linked page).
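One rough way to list such "similar" words is to collect the word types within edit distance one of a given word. A minimal sketch, assuming a word list like the one loaded in the earlier sketch:

Code:
def edit_distance_one(a, b):
    """True if a and b differ by a single substitution, insertion or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a
    return any(a == b[:i] + b[i + 1:] for i in range(len(b)))

# words_b as loaded in the sketch above (hypothetical file)
neighbours = sorted(w for w in set(words_b) if edit_distance_one(w, "qokeedy"))
# should pick up forms such as 'qokedy' and 'qoteedy' if they occur in the sample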


RE: Heap's Law - davidjackson - 07-02-2018

NOTE: See the correction posted later in this thread; the original graphs in this post were incorrectly graphed. Please ignore them.

I took my original research in a new direction by taking samples from the manuscript to see if they formed a curve in accordance with Heap's Law. I used the software and web resources outlined in post 1 of this thread.
I wanted to plot a line for the real results, calculate the variables A and B for the formula of Heap's Law (which states that the total number of unique tokens = A * (total tokens)^B), then predict the expected number of unique tokens and see if the two lines were similar.
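For reference, here is a minimal sketch of one way to estimate A and B by a least-squares fit in log-log space; this is just one possible approach, not necessarily the one used for the graphs below.

Code:
import numpy as np

def fit_heaps(total_tokens, unique_tokens):
    """Fit V = A * N**B and return (A, B)."""
    log_n = np.log(np.asarray(total_tokens, dtype=float))
    log_v = np.log(np.asarray(unique_tokens, dtype=float))
    B, log_a = np.polyfit(log_n, log_v, 1)   # slope = B, intercept = log(A)
    return np.exp(log_a), B

# total/unique counts would come from cumulative counts over growing samples:
# A, B = fit_heaps(sample_sizes, vocab_sizes)
# predicted = A * np.asarray(sample_sizes) ** B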


Here is the graph for all of the text of the Herbal section (I split it into about 18 different files in order to get an increasing corpus size). The formula (red line) fits better for larger amounts of text than for small amounts, but this is expected. The values of A (5) and B (0.8) are extreme, but this is expected for small amounts of data, and the two lines start to join up once we get past 10,000 tokens.
[attachment: Herbal section graph]


This is a graph of the unique words against total words taken from:
f1r,f1v,f2r,f2v,f3r,f18r,f18v,f29r-f30v,f50r,f50v,f76r,f76v,f82r,f82v,f111r,f111v,f114v
These pages give me a total of 4874 words (tokens). I wasn't really expecting anything from this sample, because after I ran the numbers I realised it was a very small sample of text. But then I graphed it:
[attachment: sample pages graph]
The variables of the formula remain the same as in the first graph - the best-fit line continues. But just as interestingly, we see a decrease in the total number of unique tokens once the total token count goes above 3,000, just as the formula prediction (red line) predicts. That's just what MarcoP's graph above shows. However, I am mixing A and B text together here.

I repeat the first experiment with the 12,974 words from the "stars" section, and again obtain a good fit to the formula prediction. We see a decrease in the number of expected uniques in the 2200-2600 word range, then it picks up again. I have no idea why BUT I suspect that may be because of an aberrant text with a large number of repetitions sneaking in there - I'll check the data again to confirm this tomorrow.
[attachment: stars section graph]

I now run the test for the whole corpus, and add it to the previous results (in order to extend the sample to its maximum testing potential). A small adjustment of B from 0.8 to 0.82 is all that is needed to correct the formula prediction line.
[attachment: whole corpus graph]

Here's the data the above graphic was based upon:
[attachment: data table]

Although these graphs by themselves do not prove anything, it is interesting to see how the variables remain constant across different sections of the book. However, these values are extreme. A is usually quoted to be in the range 30 < A < 100 and b is usually around 0.5. However, for extremely small samples - and the VMS is small compared to the usual corpora used in such tests - a value of A outside this range is not considered abnormal.

The next step is to consider the relationship between the above variables and any information we can extract from Zipf's law:
Quote:Denote r the rank of a word according to its frequency Z(r). Zipf's law is the relation Z(r) ~ r^(-α), with α being the Zipf exponent. Heaps' law is formulated as N_t ~ t^λ, where N_t is the number of distinct words when the text length is t, and λ ≤ 1 is the so-called Heaps exponent.
Quote:Different probability models of text generation (under the assumption that Zipf's law is fulfilled) result in a simple relation between the exponents α and λ:
λ = 1/α
But that's a project for another day.
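As a rough illustration of how that check could be set up, here is a minimal sketch of estimating the Zipf exponent α from the rank-frequency curve and comparing 1/α with a fitted Heaps exponent. It assumes a word list like the one in the earlier sketches, and the quoted relation only holds cleanly when α > 1.

Code:
from collections import Counter
import numpy as np

def zipf_exponent(words, max_rank=1000):
    """Estimate alpha in Z(r) ~ r**(-alpha) from the top-ranked frequencies."""
    freqs = np.array(sorted(Counter(words).values(), reverse=True)[:max_rank],
                     dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# alpha = zipf_exponent(words_b)
# expected_lambda = 1 / alpha   # to compare with the fitted Heaps exponent B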


RE: Heap's Law - Emma May Smith - 08-02-2018

Quote:I repeat the first experiment with the 12,974 words from the "stars" section, and again obtain a good fit to the formula prediction. We see a decrease in the number of expected uniques in the 2200-2600 word range, then it picks up again. I have no idea why BUT I suspect that may be because of an aberrant text with a large number of repetitions sneaking in there - I'll check the data again to confirm this tomorrow.

Can you pinpoint exactly where the change happens? If it's the linked page then that's very curious. The text shows signs of having been abandoned part way through and later resumed.


RE: Heap's Law - MarcoP - 08-02-2018

Hi David,
it seems I am misunderstanding this graph. From the description, I thought the orange line was V(k)=5*k^0.8.

[Image: attachment.php?aid=1938]

If I plot f(k)=5*k^0.8, I get the attached curve. The two axes are switched, but the value for k=12000 is close to 9000, while it is considerably lower in the graph above (~3000).
Also, the curve seems to be "smoothly" convex, without the inflection point at k=9000.
Could you please explain how the formula prediction works?
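For reference, the value of that formula at k=12000 can be checked directly:

Code:
k = 12000
print(5 * k ** 0.8)   # ~9170, i.e. close to 9000 as described above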


RE: Heap's Law - Anton - 08-02-2018

Yes, a power function with exponent 0.8 would not have an inflection point.


RE: Heap's Law - davidjackson - 09-02-2018

A confession - my Excel formula was pointing to the wrong column, so the graphs above are wrong. (The formula prediction was running against the unique words found, not the total word count.) So my above post was a waste of everybody's time, sorry.
Let me run the numbers afresh. I've also swapped the axes around.
The corrected graph of unique words (types) against all words (tokens) for the star section is this (spreadsheet with numbers also attached):
[attachment: corrected stars section graph]

And for herbal section
[attachment: corrected herbal section graph]

Both give very similar numbers - the exponent B is increased by 0.01 in the second graph to give a slightly better fit.

Irrespective of the Heap's Law fit, it's interesting to note that in the stars section (which is mainly text) the rate of new types drops between 7,000 and 11,000 tokens into the corpus before picking up again. This could be representative of a repetitive subject being treated in this section of the text.

This doesn't happen in the herbal section, which has a much more constant rate of growth.
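A minimal sketch of measuring this rate directly, by counting how many previously unseen word types appear in each fixed-size window of tokens; the window size and the stars-section word list are hypothetical.

Code:
def new_types_per_window(words, window=1000):
    """Number of previously unseen word types in each window of tokens."""
    seen, rates = set(), []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        fresh = len(set(chunk) - seen)
        seen.update(chunk)
        rates.append((start, fresh))
    return rates

# new_types_per_window(stars_words) would show a dip between roughly
# 7,000 and 11,000 tokens if the slowdown described above is real.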