The Voynich Ninja

Full Version: Word Entropy
Yes, precisely. Since TTR depends on text size (tokens keep increasing while types increase much more slowly after a while), it's better to use MATTR, which averages TTR over a sliding window of fixed size. I don't know if there is a standard notation for MATTR, so I just came up with "m500" for a sliding window of 500 and so on.
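
For reference, a minimal Python sketch of the m500 idea (my own illustration, not a standard implementation; it assumes the text is already split into word tokens):

[code]
from collections import Counter

def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio: the mean TTR over every
    sliding window of `window` consecutive tokens (m500 = window 500)."""
    n = len(tokens)
    if n < window:
        return len(set(tokens)) / n  # fall back to plain TTR for short texts
    counts = Counter(tokens[:window])  # type frequencies in the current window
    type_sum = len(counts)             # running sum of per-window type counts
    for i in range(window, n):
        counts[tokens[i]] += 1         # token entering the window
        old = tokens[i - window]       # token leaving the window
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
        type_sum += len(counts)
    return type_sum / (window * (n - window + 1))
[/code]

Because every window has the same length, texts of very different sizes can be compared on an equal footing, which plain TTR does not allow.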

I also attach the file with the highest h1-h2 in the set I used for the last graph (9.817801344).
I added word and type count (thanks Nablator):

(WordEntropy sheet)
(14-09-2019, 02:22 PM)RenegadeHealer Wrote: Out of curiosity, if I'm addressing someone directly in Russian, what case do I use for their name?


You use the nominative. There was once a vocative case, but it began to fade away some 1,000 years ago and had almost disappeared from common speech some 500 years ago. With the Soviet language reform of 1918 it was formally abolished. Traces of it survive in several archaisms, though.
(14-09-2019, 03:40 PM)Koen G Wrote: I added word and type count (thanks Nablator):

(WordEntropy sheet)

Well, from most of these text sizes I wouldn't expect reliable results. Computing word entropy on a text of 1,693 words, of which as many as 667 are unique, is like computing English character entropy on a text of 70 characters.

And yes, the VMS itself is not much better in that respect, unfortunately. So perhaps word entropy is not a good metric in our case, after all.
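
To make the small-sample problem concrete, here is an illustrative Python sketch (not from the thread): even for a perfectly uniform 1000-word vocabulary, whose true entropy is log2(1000) ≈ 9.97 bits, a plug-in estimate from a 1693-token sample comes out noticeably low, because most types are seen only once or twice.

[code]
import math
import random
from collections import Counter

def h1(tokens):
    """Plug-in (maximum-likelihood) unigram word entropy in bits."""
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in Counter(tokens).values())

vocab = [f"w{i}" for i in range(1000)]               # 1000 equiprobable types
sample = [random.choice(vocab) for _ in range(1693)]  # a short sample
print(h1(sample))  # typically a few tenths of a bit below the true 9.97
[/code]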
Where would you draw the line?
Thanks, that is very helpful.

I think that something has gone wrong with the creation of this file, but only in some parts.

The word-h0 is a direct reflection of the number of word types in the text, namely the base-2 logarithm of this number. For some texts, the lines seem to have been swapped.

One such group is lines 67 to 76. Another one is lines 181 to 198.

Another problem is that there are many texts that have quite an anomalous value for the ratio of word types over word tokens. In some cases these still have a normal m500.

I attach a modified version of the Excel file where I added a few columns to the page 'WordEntropy'.

I compute 2 ** h0, which should be equal to the number of word types.

I also compute 2 ** (h1+h2), which should be close to, but less than, the number of word tokens.

You can see the problem areas.
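
For illustration, the two checks could be scripted like this. This is a sketch under common plug-in definitions of h0, h1 and h2 (h2 as the conditional entropy of a word given its predecessor); the exact definitions used for the spreadsheet may differ, and the function names are mine.

[code]
import math
from collections import Counter

def entropies(tokens):
    """Plug-in word entropies in bits: h0 = log2(vocabulary size),
    h1 = unigram entropy, h2 = pair entropy minus h1 (i.e. the
    conditional entropy of a word given the previous word)."""
    n = len(tokens)
    uni = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    h0 = math.log2(len(uni))
    h1 = -sum(c / n * math.log2(c / n) for c in uni.values())
    m = n - 1  # number of consecutive word pairs
    h_pair = -sum(c / m * math.log2(c / m) for c in pairs.values())
    return h0, h1, h_pair - h1

def sanity_check(tokens):
    h0, h1, h2 = entropies(tokens)
    # 2 ** h0 equals the number of word types by construction, so a
    # mismatch against a spreadsheet row points at swapped lines.
    assert round(2 ** h0) == len(set(tokens))
    # h1 + h2 is the entropy of word pairs, and an entropy can never
    # exceed the log of the number of distinct outcomes; a text of n
    # tokens has at most n - 1 distinct pairs, so 2 ** (h1 + h2) must
    # stay below the token count.
    assert 2 ** (h1 + h2) <= len(tokens)
[/code]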

[attachment=3305]
(14-09-2019, 04:27 PM)Koen G Wrote: Where would you draw the line?

What do you mean?
Rene: thanks, I'll have a look.
Anton: what would you say is the minimum number of words required for the calculations to be worth something?
I would have said 10,000 but that eliminates the vast majority of texts.
We could strike a deal at 4096?
(14-09-2019, 05:14 PM)Koen G Wrote: Anton: what would you say is the minimum number of words required for the calculations to be worth something?

As I said yesterday, in I don't remember which statistics thread, I think you need at least two orders of magnitude more tokens than the vocabulary size. But, once again, you can explore this by taking a large text of, say, 100,000 words and calculating entropy over a fragment of it, say the first 10,000 words (note that its vocabulary will probably be smaller than that of the whole text). Then take a larger fragment of, say, 20,000 words, and so on, and observe how the values change.
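
For illustration, that growing-fragment experiment could be scripted as follows (a sketch only; entropies is the helper from the sanity-check sketch above, and the step sizes are just examples):

[code]
def convergence_curve(tokens, start=10_000, step=10_000):
    """Recompute word statistics on growing prefixes of one long text
    to see at what size the values stabilise."""
    for size in range(start, len(tokens) + 1, step):
        prefix = tokens[:size]
        h0, h1, h2 = entropies(prefix)
        print(f"{size:>7} tokens  {len(set(prefix)):>6} types  "
              f"h0={h0:.3f}  h1={h1:.3f}  h2={h2:.3f}")
[/code]

Once the curve flattens, the text is long enough for the entropy values to be trusted; where it is still climbing, the estimate mostly reflects sample size.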