Heap's Law - Printable Version

Heap's Law - davidjackson - 04-02-2018

In linguistics, Heaps' law (also called Herdan's law) is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type-token relation).

Heaps' law means that as more instance text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.

To put this another way, the vocabulary size in any textual stream grows according to Heaps' law: it is roughly proportional to a power of the total number of tokens in the stream, with an exponent usually somewhere around one half (i.e. close to the square root).
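As a rough sketch of what the law says: in its usual form it is written V(n) = K * n^beta, where V is the number of distinct words seen after n tokens, and K and beta are free parameters fitted per corpus (the square root corresponds to beta = 0.5). The function names below are made up for illustration and are not taken from any tool mentioned in this thread.

```python
import math

def heaps_prediction(n_tokens, k, beta=0.5):
    """Predicted number of distinct words after n_tokens tokens: V = K * n**beta."""
    return k * n_tokens ** beta

def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log n.

    points: list of (n_tokens, n_distinct) pairs, with at least two
    different values of n_tokens. Returns (k, beta).
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Fitting needs several (n, V) measurements taken from the same text as it grows; a single total count per corpus only pins down K once beta has been assumed.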

So if the Voynich is based on an underlying natural language text - if the word tokens we observe actually have unique meanings - then it should follow Heaps' law, which has been extensively shown to hold in many different natural languages.

If the word tokens do not have unique meanings - i.e. this is gibberish, or the tokens are actually composed of morphemes with different meanings - then we should not observe Heaps' law.

I used the edu-texts-analyzer to run a test on the Voynich.

I prepared two distinct corpora using the Voynich freie literature, one of Currier A and the other of Currier B. I attach the transcripts, which were cleaned up by turning . into spaces and removing line end / paragraph markers.
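The clean-up described above could look something like the following minimal sketch. The post does not say which characters mark line and paragraph ends, so the `line_end` and `para_end` defaults here are assumptions and should be changed to match the transcription file actually used.

```python
def clean_transcription(raw, line_end="-", para_end="="):
    """Turn a dotted transcription into a list of word tokens.

    line_end and para_end are ASSUMED marker characters, not confirmed
    by the post; pass the ones your transcription really uses.
    """
    text = raw.replace(".", " ")           # word separators become spaces
    text = text.replace(line_end, " ")     # drop line-end markers
    text = text.replace(para_end, " ")     # drop paragraph-end markers
    return text.split()                    # split on any whitespace

def count_tokens_and_types(tokens):
    """Return (total words, unique words)."""
    return len(tokens), len(set(tokens))
```

For example, `count_tokens_and_types(clean_transcription(open(path).read()))` would give the two numbers of the kind reported for each corpus, where `path` is whatever file holds the transcript.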

CurrierA contained 11,558 total words of which 3,487 were unique.
CurrierB contained 25,489 total words of which 4,710 were unique.
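As a quick arithmetic check on these counts, here is a small sketch that works out the type-token ratio for each corpus and the constant K that a square-root law V = K * sqrt(n) would imply. Taking beta = 0.5 is an assumption made only for this illustration; the analyzer used above may fit its curve differently.

```python
# Counts reported in the post above: (total words, unique words).
corpora = {
    "CurrierA": (11558, 3487),
    "CurrierB": (25489, 4710),
}

def implied_k(n_tokens, n_distinct, beta=0.5):
    """K in V = K * n**beta for one observed (n, V) point, with beta assumed."""
    return n_distinct / n_tokens ** beta

def type_token_ratio(n_tokens, n_distinct):
    """Fraction of tokens that are distinct word types."""
    return n_distinct / n_tokens

for name, (n, v) in corpora.items():
    print(name, round(implied_k(n, v), 1), round(type_token_ratio(n, v), 3))
```

With these counts the implied K comes out at roughly 32 for Currier A and roughly 30 for Currier B, and the type-token ratio falls from about 0.30 to about 0.18 as the text gets longer, which is the direction Heaps' law predicts.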

Both of these results correspond closely with the result predicted by Heaps' law, as shown in these two charts (in both cases the short line running from zero to a point marks the position of our result, against the curve that Heaps' law plots):

Filename: CurrierA-Heaps.png Size: 11.39 KB 04-02-2018, 09:44 PM

Filename: CurrierB-Heaps.png Size: 11.51 KB 04-02-2018, 09:44 PM

I used the opportunity to run a Zipf's law test. As expected, both texts show the characteristics of a Zipf chart.

Filename: CurrierA-Zipfs.png Size: 8.13 KB 04-02-2018, 09:52 PM

Filename: CurrierB-Zipfs.png Size: 8.57 KB 04-02-2018, 09:52 PM

I attach two spreadsheets listing the words in both Currier variants, sorted by frequency of occurrence.
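A frequency-sorted word list of this kind, and a rough number for how Zipf-like it is, can be sketched as below. This is an illustrative sketch, not the method used by the analyzer above; the slope of log(frequency) against log(rank) is one common summary and is close to -1 for text that follows Zipf's law.

```python
import math
from collections import Counter

def zipf_table(tokens):
    """Return [(rank, word, frequency)], most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

def zipf_slope(table):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```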

Comparison with random text
I used an online generator to produce gibberish to run the same tests against. The Heaps results fitted exactly with the prediction. I assume this is because the webpage, like most of its ilk, attempts to mimic the English language when producing its gibberish.

So I then put that random text into a Markov chain randomizer. This produced truly random text. I ran several attempts and none produced any results that were anywhere near a Heaps regression line.
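The randomizer actually used is not identified here, so the following is only a generic sketch of a character-level Markov chain text generator, to show the kind of scrambling involved. The order of 2 and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` characters to the characters that follow it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain

def generate(chain, length, order=2, seed=None):
    """Generate `length` characters by walking the chain at random.

    The source text must be longer than `order`, otherwise the chain is empty.
    """
    rng = random.Random(seed)
    states = sorted(chain)            # sorted so a given seed is reproducible
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: restart from a randomly chosen state.
            state = rng.choice(states)
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = "".join(out[-order:])
    return "".join(out[:length])
```

The output can then be split on spaces and fed through the same word counts as the real text.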

However, all the texts produced very jerky Zipf-like results. Here's a typical chart.

Filename: LoremIpsum-Zipfs.png Size: 8.38 KB 04-02-2018, 10:18 PM

Brief conclusions

These rapid tests should not be relied upon for any real conclusions, but they do seem to point towards Voynichese tokens having unique individual identities. There is more work to be done in analysing random and cipher texts, but I hope the tools I include in this post may help other researchers carry out their own tests.

RE: Heap's Law - Koen G - 04-02-2018

Well found, David! You have gained entry into the so far rather exclusive graph thread.

RE: Heap's Law - Emma May Smith - 04-02-2018

Does lemmatization make a difference to the applicability of the law?

RE: Heap's Law - davidjackson - 04-02-2018

Emma, if you have a plaintext formatted corpus upload it and I'll run the tests. If not, define the parameters and I'll try to create the corpus.

RE: Heap's Law - -JKP- - 04-02-2018

Very good idea for a thread.

I don't have time to run tests right now (I wish I did), but I would like to see the statistics for semi-random text.

By that, I mean (as possible examples):

create a big block of random text that consists only of consonants
now insert vowels at reasonable intervals, as might apply to natural languages

Run the stats.

Adjust it so the vowels are at the same intervals as the VMS and run the stats again.

Now try the same essential approach as follows:

create a big block of random text that consists only of consonants
sort the characters in each token (there are a number of ways this can be done, such as by alphabetical order or by frequency)
now insert vowels at reasonable intervals

Run the stats.
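One possible reading of the two recipes above, as a minimal sketch: the consonant and vowel alphabets, the token lengths and the fixed vowel interval are all assumptions chosen for illustration, and matching the vowel spacing to the VMS would need intervals measured from a transcription.

```python
import random

CONSONANTS = "bcdfghjklmnpqrstvwxz"   # assumed alphabet
VOWELS = "aeiou"                      # assumed alphabet

def random_consonant_tokens(n_tokens, rng, min_len=2, max_len=5):
    """Random tokens made only of consonants."""
    return [
        "".join(rng.choice(CONSONANTS) for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n_tokens)
    ]

def insert_vowels(token, rng, interval=2):
    """Insert a random vowel after every `interval` consonants."""
    out = []
    for i, ch in enumerate(token, start=1):
        out.append(ch)
        if i % interval == 0:
            out.append(rng.choice(VOWELS))
    return "".join(out)

def semi_random_text(n_tokens, interval=2, sort_chars=False, seed=None):
    """First recipe by default; sort_chars=True gives the second recipe."""
    rng = random.Random(seed)
    tokens = random_consonant_tokens(n_tokens, rng)
    if sort_chars:
        # Alphabetical sort of the consonants inside each token.
        tokens = ["".join(sorted(t)) for t in tokens]
    return [insert_vowels(t, rng, interval) for t in tokens]
```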

RE: Heap's Law - ReneZ - 05-02-2018

One of the key points in Schinner's paper is that this law is not observed: the number of unique words increases faster than the law predicts. He argues that this indicates that the MS has meaningless text.

Schinner, Andreas: The Voynich Manuscript. Evidence of the Hoax Hypothesis, in: Cryptologia. Philadelphia Vol. 31.2007 (April), pp. 95 ff.

RE: Heap's Law - Koen G - 05-02-2018

Did he take into account Currier A and B?

Also, since the VM appears to treat wildly different subjects, I'd expect specialized vocabulary to skew the results. Imagine combining medical instructions with a car's manual and a legal document into one book with three sections. I bet you'd get a different curve.

RE: Heap's Law - -JKP- - 05-02-2018

(05-02-2018, 07:52 AM)Koen Gh. Wrote: Did he take into account Currier A and B?

Also, since the VM appears to treat wildly different subjects, I'd expect specialized vocabulary to skew the results. Imagine combining medical instructions with a car's manual and a legal document into one book with three sections. I bet you'd get a different curve.

It might be worth doing for exactly that reason. If it doesn't produce different curves, what does that say about the text?

RE: Heap's Law - ReneZ - 05-02-2018

In my opinion what he was seeing was the effect of the 'Line as a functional unit', made famous by Currier.
This is not easily explained in specific detail, but seems to manifest itself mostly in 'something unusual' with first words of lines.

Gabriel Landini clearly noticed this in his Cryptologia paper:

Landini, Gabriel: Evidence of linguistic structure in the Voynich MS using spectral analysis, in Cryptologia Vol. 25 Issue 4, 2001, pp. 275-295.

RE: Heap's Law - Torsten - 05-02-2018

If I understand David correctly he only has one point for Currier A and one point for Currier B. One point is not enough to decide if the number of new words decreases over time.

It is known that the text in the VMS is changing from Currier A to Currier B. Therefore Schinner is probably correct and the number of new words didn't decrease for additional text as expected.
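To get more than one point per corpus, the vocabulary size can be sampled repeatedly as the text is read in order. The sketch below is illustrative only (the step size and function names are assumptions); it records the running count of distinct words and how many new words each successive block contributes.

```python
def vocabulary_growth(tokens, step=1000):
    """Return [(tokens seen, distinct words seen)] every `step` tokens and at the end."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

def new_words_per_block(points):
    """Number of previously unseen words contributed by each successive block."""
    previous = 0
    increments = []
    for _, distinct in points:
        increments.append(distinct - previous)
        previous = distinct
    return increments
```

If the text behaves as Heaps' law predicts, the increments should shrink as more text is read; increments that stay flat or rise, for instance across the change from Currier A to Currier B, would show up directly in this list.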