In linguistics, Heaps' law (also called Herdan's law) is an empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the so-called type-token relation).
Heaps' law means that as more text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn.
To put this another way, the vocabulary size in any textual stream grows according to Heaps' law: it is roughly proportional to the square root of the total number of tokens in the stream.
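For reference, the usual statement of the law (the square-root version is the special case β = 0.5; empirically, for English corpora, K typically falls between 10 and 100 and β between 0.4 and 0.6):

```latex
V(n) = K \, n^{\beta}
```

where V(n) is the number of distinct word types after n tokens, and K and β are constants fitted to the text.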
So if the Voynich is based on an underlying natural language text - if the word tokens we observe actually have unique meanings - then it should conform to Heaps' law, which has been extensively shown to hold in many different natural languages.
If the word tokens do not have unique meanings - i.e., this is gibberish, or the tokens are actually composed of morphemes with different meanings - then we should not observe Heaps' law.
I used the edu-texts-analyzer to run a test on the Voynich.
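For anyone who wants to reproduce the Heaps' test without that tool, here is a minimal Python sketch of the kind of calculation involved (the filename and the whitespace tokenization are my assumptions, not necessarily what the analyzer actually does):

```python
import math

def vocab_growth(tokens):
    # Cumulative count of distinct types after each token.
    seen = set()
    curve = []
    for t in tokens:
        seen.add(t)
        curve.append(len(seen))
    return curve

def fit_heaps(tokens):
    # Least-squares fit of log V = log K + beta * log n over the growth curve.
    curve = vocab_growth(tokens)
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(v) for v in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - beta * mx), beta

# Hypothetical input: one cleaned transcript, whitespace-separated tokens.
tokens = open("currier_a.txt").read().split()
K, beta = fit_heaps(tokens)
print(f"tokens={len(tokens)}, types={len(set(tokens))}, K={K:.1f}, beta={beta:.3f}")
```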
I prepared two distinct corpora using the freie-literatur Voynich transcription, one of Currier A and the other of Currier B. I attach the transcripts, which were cleaned up by turning '.' into spaces and removing line-end / paragraph markers.
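The clean-up amounts to something like the following (treating trailing '-' and '=' as the line-end / paragraph markers is my assumption; check the legend of whichever transcription you use):

```python
import re

def clean_transcription(text):
    # EVA-style transcriptions separate words with '.'; turn them into spaces.
    text = text.replace(".", " ")
    # Strip line-end / paragraph markers (assumed here to be trailing '-' or '=').
    text = re.sub(r"[-=]+$", "", text, flags=re.MULTILINE)
    return text

tokens = clean_transcription(open("currier_a_raw.txt").read()).split()
```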
Currier A contained 11,558 total words, of which 3,487 were unique.
Currier B contained 25,489 total words, of which 4,710 were unique.
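As a quick sanity check on those numbers, you can solve V = K·n^β through the two reported points (this is my own back-of-envelope arithmetic, not output from the analyzer, and it treats both corpora as samples of one underlying process):

```python
import math

n_a, v_a = 11558, 3487   # Currier A: total words, unique words
n_b, v_b = 25489, 4710   # Currier B: total words, unique words

beta = math.log(v_b / v_a) / math.log(n_b / n_a)   # exponent through both points
K = v_a / n_a ** beta                              # prefactor
print(f"beta = {beta:.3f}, K = {K:.1f}")           # beta ~ 0.38, K ~ 100
```

A β of about 0.38 with K near 100 sits at the low end of, but is broadly compatible with, the ranges usually quoted for natural languages.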
Both of these results strongly agree with the prediction of Heaps' law, as shown in these two charts (in both cases the short line running from zero to a point marks the position of our result, set against the regression line that Heaps' law plots):
I used the opportunity to run a Zipf's law test as well. As expected, both texts show the characteristics of a Zipf chart.
I attach two spreadsheets listing the words in both Currier variants, sorted according to their frequency of occurrence.
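For completeness, the Zipf test is just a rank-frequency plot on log-log axes; a minimal version (matplotlib assumed available, filename hypothetical):

```python
from collections import Counter

import matplotlib.pyplot as plt

tokens = open("currier_b.txt").read().split()
freqs = sorted(Counter(tokens).values(), reverse=True)

# On log-log axes, a roughly straight line of slope ~ -1 is the Zipfian signature.
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()
```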
Comparison with random text
I used an online generator to produce gibberish to run the same tests against. The Heaps' results fitted the prediction exactly. I assume this is because the webpage, like most of its ilk, attempts to mimic the English language when producing its gibberish.
So I then put that random text into a Markov chain randomizer found online. This produced truly random text. I ran several attempts, and none produced results that were anywhere near a Heaps' regression line.
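To see why genuinely structureless text should land nowhere near the natural-language regression line, here is the simplest baseline I can think of - not the Markov randomizer I actually used, just an illustration. Drawing fixed-length words uniformly at random means almost every token is a brand-new type, so the vocabulary grows nearly linearly (β ≈ 1) instead of flattening out the way natural language does:

```python
import random
import string

random.seed(0)

# Draw fixed-length "words" uniformly at random. With 26**8 possible strings,
# repeats are vanishingly rare, so nearly every token is a new type.
tokens = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(25489)]

print(f"tokens={len(tokens)}, types={len(set(tokens))}")  # types ~ tokens, i.e. beta ~ 1
```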
However, all the texts produced very jerky Zipf-like results. Here's a typical chart.
Brief conclusions
These rapid tests should not be relied upon for any real conclusions, but they do seem to point towards Voynichese tokens having unique individual identities. There is more work to be done in analysing random and cipher texts, but I hope the tools I include in this post may help other researchers carry out their own tests.