The Voynich Ninja

The rate of vocabulary growth as the content type marker
A few days ago I made a suggestion that the speed at which the vocabulary stabilizes in a text might indicate the type of the text's contents. In a fiction novel, where the plot can take any turn and where artistic descriptions are the essence of the writing, each new page can add notably to the cumulative vocabulary. In contrast, a narrow-topic professional text would probably have a limited base vocabulary, supplemented by special terminology that may keep arriving up to the last page, but the rate of vocabulary growth would be slow towards the end of the text.

Hence, by plotting the vocabulary growth curve of the VMS we can potentially draw conclusions about the nature of its contents. Of course, we should separate sections with apparently different topics, because for the purposes of this analysis they count as different texts.
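
For anyone who wants to try it, the curve itself is cheap to compute. Below is a minimal sketch in Python, assuming the text has already been split into word tokens; the function name `vocabulary_growth` and the toy English sample are only illustrative, not taken from any existing tool or from a VMS transcription.

```python
def vocabulary_growth(tokens):
    """Return a list whose i-th entry is the number of distinct
    words seen among the first i + 1 tokens."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


# Toy input (an assumption for illustration, not real VMS data).
sample = "the cat sat on the mat the cat".split()
print(vocabulary_growth(sample))  # [1, 2, 3, 4, 4, 5, 5, 5]
```

A curve that keeps climbing steeply would correspond to the "novel" case described above, and one that flattens early to the "narrow-topic" case. Each section would need its own token list and its own curve.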

Just an idea. I have not run any tests. What do you think?
Cool idea. Even if it doesn't turn out as one might guess, patterns like this are worth documenting.
There was already a similar thread about Heaps' law.

One effect for the VMS is that the curve flattens for pages containing more text. This happens because the vocabulary changes from page to page. In other words, a page full of text has a smaller vocabulary than multiple pages containing the same amount of text. Another reason is that the resonance effect needs a certain amount of text to take hold. Therefore a page of continuous text will contain more repeated words than the same amount of text written as separate labels.
Ah yes, Heaps' law. I had already forgotten about that discussion. I'll have to read it afresh.
Upon consideration (and this was missed in the Heaps' law discussion), some VMS pages are missing and some have been, or may have been, reordered. Unfortunately, this renders the "running vocabulary" results unreliable.
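
One way to gauge how much the page order matters would be to recompute the page-by-page curve under random reorderings and look at the spread. The following is only a sketch under assumptions: each page is taken to be a list of word tokens, and the names `growth_by_page` and `shuffled_growth` are hypothetical. Note that the final vocabulary size is the same for every ordering; only the path of the curve changes. Missing pages are a separate problem that this does not address.

```python
import random


def growth_by_page(pages):
    """Cumulative vocabulary size after each page, in the given order."""
    seen = set()
    sizes = []
    for page in pages:
        seen.update(page)
        sizes.append(len(seen))
    return sizes


def shuffled_growth(pages, trials=100, seed=0):
    """Growth curves for `trials` random page orders (seeded for repeatability)."""
    rng = random.Random(seed)
    curves = []
    for _ in range(trials):
        order = list(pages)   # copy, so the caller's page order is untouched
        rng.shuffle(order)
        curves.append(growth_by_page(order))
    return curves


# Toy pages (an assumption for illustration, not real VMS data).
pages = [["a", "b", "a"], ["b", "c"], ["d", "a", "e"]]
print(growth_by_page(pages))  # [2, 3, 5]
```

If the curves from shuffled orders stay close to the curve from the current foliation, the reordering concern is less serious for that section.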
Anton, that may be true, but if there are folio-specific patterns, then the order might matter less.