The Voynich Ninja
Heap's Law - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Heap's Law (/thread-2279.html)



RE: Heap's Law - davidjackson - 09-02-2018

Emma - I've been thinking about your lemmatization comment in an earlier post. How could we parse the text to reduce words to their lemmas?


RE: Heap's Law - Emma May Smith - 09-02-2018

I don't know if you can for the Voynich text. That's my concern. Known languages can be lemmatized. If lemmatized and non-lemmatized texts are used, is the result significantly different? (I admit to not knowing enough about Heap's law.)


RE: Heap's Law - davidjackson - 09-02-2018

I have no idea.
If we lemmatize English without knowing the rules, what would be the result? You might be able to guess at the regular ones (laugh, laughs, laughing) but irregular ones (find, found, finding) would be far more difficult.
And as for Romance languages with all their different verb forms... (ir, voy, va, iré, irá, vaya, fuese and all the other forms of the verb "to go" in Spanish, to take one difficult example).
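To illustrate the problem, here is a minimal sketch using a naive suffix-stripping rule as a stand-in for real lemmatization (the rule and the word list are purely illustrative, not a proposed method for the Voynich text):

Code:
# Naive "lemmatizer": strip a few common English endings.
# Regular forms collapse correctly, irregular ones do not.
def naive_lemma(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

words = ["laugh", "laughs", "laughing", "find", "found", "finding"]
print({w: naive_lemma(w) for w in words})
# laughs/laughing -> laugh (fine), finding -> find (fine),
# but "found" stays "found", so find/found still count as two types.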


RE: Heap's Law - MarcoP - 10-02-2018

(09-02-2018, 06:55 PM)davidjackson Wrote: A confession - my Excel formula was pointing to the wrong column, so the above results are wrong. (The formula prediction was running against tokens found, not total types.) So my above post was a waste of everybody's time, sorry.

Hi David,
I just wanted to say that I am grateful for this discussion and I don't see your previous post as a waste of time. I like the idea that one can submit to the forum results he is not 100% sure of, expecting others to double-check and point out problems. If I should only post something when I am sure I made no errors, I would post nothing at all.


RE: Heap's Law - davidjackson - 10-02-2018

MarcoP, it's always a pleasure to be corrected by you. Possibly because I have no academic reputation to maintain, it's easier for me to do something foolish and laugh it off, but I do my little best to be rigorous. But this is what the forum is for, to advance knowledge by bouncing ideas off one another - not to present pre-packaged tidy little snippets!

Anyway, back to the thread.
I originally became interested in Heap's Law thanks to the linked post - it seems a good way of quickly proving whether a generation method could be authentic or not.

I have run a sample against the text generated by T. Timm and found the following. The graphed result is this:
[attachment: graph for Timm's generated text]
(NB: Caption "total words in texts" should read "unique word count")

If we compare this to the result found in the stars section, here:
[attachment: graph for the stars section]

We see that Timm's text is far more regular in its creation of new words than the stars section is. I have added a trend line (green) to both graphs to more clearly illustrate this.

Both samples have a similarly sized corpus: Timm's consists of 10,210 words, the stars section 12,974. I chose the stars section as it is almost illustration-free.

A further difference is in the predicted rate of unique word creation. The stars diagram above has an average difference between real and predicted creation of unique words of just 13. Timm's average difference is 114.

If I change the formula to read B=0.56 for Timm's, then I get an average difference of -16. But the lines do not fit as well.
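For reference, a sketch of how this comparison could be computed, assuming the prediction takes the usual Heap's Law form V(n) = A * n^B with the values quoted above; the 500-token sampling step and the variable names are placeholders, not necessarily what the attached spreadsheets use:

Code:
# Sketch of the real-vs-predicted comparison, assuming the prediction
# is V(n) = A * n**B with the values quoted in this thread.
# "tokens" stands for the word list of a section.
def heaps_prediction(n, A=13, B=0.57):
    return A * n ** B

def unique_word_curve(tokens, step=500):
    """Cumulative count of unique word types every `step` tokens."""
    seen, curve = set(), []
    for i, word in enumerate(tokens, start=1):
        seen.add(word)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def average_difference(tokens, A=13, B=0.57, step=500):
    """Mean of (real - predicted) type counts along the curve."""
    curve = unique_word_curve(tokens, step)
    diffs = [real - heaps_prediction(n, A, B) for n, real in curve]
    return sum(diffs) / len(diffs)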

Since the ratio of creation is constant, I get two rapidly diverging lines, as illustrated when I overlay two trend lines and expand them outwards:
[attachment: graph of the overlaid trend lines]

This graph shows the trend lines expanded outwards for the stars section. Note that the real creation line is variable, but tends to follow the Heap's Law prediction, unlike its trend line.
[attachment: graph]

To an extent, Timm's result could be predicted: it is computer-generated and as such follows a constant generation algorithm.

The human-generated text of the VMS does not follow a constant algorithm. Instead, it appears to create an inconstant flow of new unique words, which may (and only may) be linked with the introduction of new topics within the text. Alternatively, this could be down to the scribe falling into a pattern when creating random text and repeating previous work.

I attach the two spreadsheets used to generate the graphs for reference.


RE: Heap's Law - Torsten - 11-02-2018

Why did you use different predictions, A=12, B=0.56 vs. A=13, B=0.57? In the first case the prediction value for a text corpus of 10,000 words is about 2,000 and in the second case about 2,500.
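For reference, evaluating the two predictions at 10,000 words is nothing more than computing A * n^B:

Code:
# The two predictions at n = 10,000 tokens, i.e. A * n**B:
# 12 * 10000**0.56 is about 2085, 13 * 10000**0.57 is about 2477.
for A, B in [(12, 0.56), (13, 0.57)]:
    print(A, B, round(A * 10000 ** B))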

The computer simulation uses a constant generation algorithm. Moreover, for this implementation some aspects are simplified. Therefore a constant graph is expected.

The result demonstrates again that the text of the VMS is not homogeneous. Therefore Heap's law doesn't apply here.


RE: Heap's Law - davidjackson - 11-02-2018

In the second graph of your result, A=13 gave less variance once trended out than A=12. However, with [12, 0.56] the two power lines intersect at approximately 5,600 total words and then diverge more sharply.

Quote: The computer simulation uses a constant generation algorithm. Moreover, for this implementation some aspects are simplified. Therefore a constant graph is expected.

As I suggested. However, this shows that the current implementation of your software is not generating true pseudo-Voynich text, as it does not generate unique words in the same way as the exemplar did.

Quote: The result demonstrates again that the text of the VMS is not homogeneous. Therefore Heap's law doesn't apply here.
Actually, I would argue that Heap's law does apply here, as we see in the stars graph (and the herbal graph in a previous post). Remember we are dealing with a statistical law, not an exact power law, and as such variation is to be expected. But the prediction line and the real plot show a weak but consistent association.


RE: Heap's Law - Torsten - 11-02-2018

It only shows that my simulation doesn't vary some parameters. It is a simulation of the text generation mechanism, not a reconstruction of the VMS. I didn't claim that a computer was used to generate the text of the VMS. The text of the VMS corresponds to its container: the length of a line and the number of lines per page matter. Both parameters vary in the VMS and are constant in the simulation.


Heap's law applies to a homogeneous text. You can't put a name list and a page full of text together and say the resulting graph follows Heap's law. The same is true in the case of the VMS: there are different graphs for each section of the manuscript, and therefore the order of the sections matters. If you reordered the sections to Q13, Q20 and then the herbal pages in Currier B, you would get a different graph.

You have to split the text of the VMS into different parts. If you then compare the graphs for the different sections, this can be interesting. It seems, for instance, that the graph for the herbal pages in Currier A is similar to the graph for the herbal pages in Currier B. This is an interesting result: even if they use different word types, they have something in common.
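A sketch of that per-section comparison, assuming the transcription has already been split into word lists per section; the log-log least-squares fit is just one simple way to estimate A and B, not necessarily the method used for the graphs in this thread:

Code:
# Fit V(n) = A * n**B separately for each section by a least-squares
# line in log-log space.  "sections" is a placeholder to be filled
# with real word lists.
import numpy as np

def fit_heaps(tokens, step=500):
    seen, ns, vs = set(), [], []
    for i, word in enumerate(tokens, start=1):
        seen.add(word)
        if i % step == 0:
            ns.append(i)
            vs.append(len(seen))
    B, logA = np.polyfit(np.log(ns), np.log(vs), 1)
    return np.exp(logA), B

# Placeholder, e.g. {"Herbal A": [...], "Herbal B": [...], "Q13": [...]}
sections = {}

for name, tokens in sections.items():
    A, B = fit_heaps(tokens)
    print(f"{name}: A = {A:.1f}, B = {B:.2f}")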


RE: Heap's Law - MarcoP - 11-02-2018

Thank you David and Torsten, I find the discussion of the synthetic text very interesting!

(10-02-2018, 09:12 PM)davidjackson Wrote: Since the ratio of creation is constant, I get two rapidly diverging lines, as illustrated when I overlay two trend lines and expand them outwards:

David, what do you mean when you say that the ratio of creation is "constant"? Do you mean that the number of unique words increases linearly with text length?

This would be extremely interesting. The absence of fluctuations in the curve for the synthetic text is easy to understand, but if unique words grow linearly, then Heap's law doesn't apply and I would be curious to understand why.


RE: Heap's Law - davidjackson - 11-02-2018

Torsten, I appreciate that. I'm not criticising your generation theory, simply pointing out that the algorithm you used to create the corpus of text in the referenced post creates words in a linear fashion. Nothing more.

Quote: You have to split the text of the VMS into different parts. If you then compare the graphs for the different sections, this can be interesting. It seems, for instance, that the graph for the herbal pages in Currier A is similar to the graph for the herbal pages in Currier B. This is an interesting result: even if they use different word types, they have something in common.
This is what I have been trying to do, in a very roundabout way. It will be interesting to see how the labels respond, for example.


Quote:MarcoP: David, what do you mean when you say that the ratio of creation is "constant"? Do you mean that the number of unique words increases linearly with text length?


Here I was referring to Torsten's text. Words in his corpus were created on a linear basis, whereas the Heap's Law prediction is a power curve, and hence the two diverge.
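Purely as a schematic illustration of that divergence (the constants below are made up, not fitted to either text), compare a linear creation rate with a Heap's-style power curve:

Code:
# Schematic only: illustrative constants, not fitted values.
# A linear creation rate and a power curve pull apart as the
# corpus grows.
for n in (2000, 5000, 10000, 20000):
    linear = 0.25 * n          # constant creation rate
    heaps = 13 * n ** 0.57     # power-law prediction
    print(n, round(linear), round(heaps))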

If you look at the graph for the stars section you see how the creation rate rises and falls (something which I assume, without proof, is due to the introduction of new topics) but still has an association with the power curve.