(14-09-2019, 05:35 PM)Anton Wrote: As I said yesterday in I don't remember which statistics thread, I think at least 2 orders of magnitude more than the vocabulary size. But, once again, you can explore that by taking a large text of, say, 100000 words, and calculate entropy over a fragment thereof, say, of first 10000 words (note that its vocabulary would probably be less than that of the whole text). Then take a larger fragment of, say, 20000 words. And so on. And observe how the values change.
Okay I'll test this in a bit.
Rene: I redid everything, copy-pasting directly from the command prompt, so it should be better now. There are still the same high h1-h2 values though.
So text size does matter after all. I don't think there is any ideal size for h2, though: it keeps increasing even up to 70k words. I tested two texts at the first 1k, 2k, 5k, 10k, 35k and 70k words. This shows that h2 by itself is unreliable unless some kind of uniformization or normalization takes place. However, h1-h2 is less variable, and there is no huge difference between 5k and 10k.
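For what it's worth, here is a minimal sketch of how such prefix entropies could be computed. This is not the actual script used for the numbers below; it assumes h0 = log2 of the number of word types, h1 = the unigram word entropy, h2 = the conditional entropy of a word given the previous word, and "text.txt" is just a placeholder file name.
[code]
# Minimal sketch, not the script actually used for the numbers in this thread.
# Assumes: h0 = log2(number of word types), h1 = unigram word entropy,
# h2 = conditional entropy of a word given the previous word.
from collections import Counter
from math import log2

def entropies(words):
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    h0 = log2(len(unigrams))
    h1 = -sum(c / n * log2(c / n) for c in unigrams.values())
    nb = n - 1
    h_joint = -sum(c / nb * log2(c / nb) for c in bigrams.values())
    h2 = h_joint - h1  # approximates H(next word | previous word)
    return h0, h1, h2

# "text.txt" is a placeholder: one text, whitespace-separated words
words = open("text.txt", encoding="utf-8").read().lower().split()
for size in (1000, 2000, 5000, 10000, 35000, 70000):
    if size <= len(words):
        print(size, *(f"{h:.3f}" for h in entropies(words[:size])))
[/code]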
[attachment=3306]
The top graph is h2, isn't it?
Could you also post how h0 changes along with it?
Sorry, it is indeed h2.
Here's a chart for each text:
[attachment=3307]
And the values:
Text1:
size   h0           h1           h2
1k     9.319672121  8.761989719  1.0836556
2k     10.18115226  9.513244511  1.341051273
5k     11.1792871   10.18085782  1.938496171
10k    11.87382857  10.52704111  2.478757416
35k    13.18936129  11.21249127  3.388827059
70k    13.83041617  11.48206278  3.944103731
Text2:
size   h0           h1           h2
1k     8.851749041  8.089312357  1.73445866
2k     9.566054038  8.499076048  2.249659701
5k     10.37177664  8.817176607  3.084718418
10k    10.96144969  9.024066872  3.695144981
35k    12.02583167  9.329523064  4.737718001
70k    12.54109662  9.414111121  5.238267741
OK, this is what Rene spoke about: the vocabulary continues to grow along with the text size. With h0 = 13.83, the vocabulary is roughly 15k word types (2^13.83 ≈ 14,600), and compared to that, a text size of 70k words is of course not sufficient.
I guess that's because the texts under test are some kind of fiction, where there is no quick limit to the vocabulary. In contrast, a professional text would probably have a more rapidly stabilizing vocabulary.
Yeah, maybe this will be too dependent on text type to derive some formula for normalization.
Another problem is that texts aren't uniform. Q13 is a good example.
h1-h2 seems somewhat stable though; wouldn't it be best to just rely on that?
Edit: on the other hand, I just tried a line chart with a log scale and the three lines are pretty straight. Doesn't that mean one could normalize the data somehow?
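One quick, purely illustrative way to check how straight those lines are is to fit each series against log2 of the sample size. The sketch below uses the Text1 values from the table above and assumes numpy is available:
[code]
# Purely illustrative: fit each entropy series against log2(sample size).
# The values are the Text1 numbers from the table above; numpy is assumed.
import numpy as np

sizes = np.array([1000, 2000, 5000, 10000, 35000, 70000])
series = {
    "h0": [9.3197, 10.1812, 11.1793, 11.8738, 13.1894, 13.8304],
    "h1": [8.7620, 9.5132, 10.1809, 10.5270, 11.2125, 11.4821],
    "h2": [1.0837, 1.3411, 1.9385, 2.4788, 3.3888, 3.9441],
}
x = np.log2(sizes)
for name, h in series.items():
    slope, intercept = np.polyfit(x, np.array(h), 1)
    print(f"{name}: slope {slope:.3f}, intercept {intercept:.3f}")
[/code]
If the slopes turned out to be roughly the same across different texts, that would be the kind of regularity a normalization could exploit.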
I've also checked Pliny's Natural History:
[attachment=3308]
size   h0           h1           h2
1k     9.422064766  8.955524195  0.9827503151
2k     10.23959853  9.592929606  1.311693669
5k     11.19660213  10.17381079  2.003107139
10k    12.03823313  10.73489704  2.415263961
35k    13.62798978  11.820678    3.029928693
70k    14.42101286  12.2592404   3.499844451
(14-09-2019, 09:40 PM)Koen G Wrote: Doesn't that mean one could normalize the data somehow?
If this is the general behaviour of all texts in all languages, then yes it does, but I'm not sure if it is.
Pliny's text is an interesting model text, and perhaps not 'typical'.
By its encyclopedic nature, it has more word types (different words) than the average text.
The h0 value shows that at 70,000 words it has 21,936 word types (2^14.421 ≈ 21,936). I don't think that many texts will even come near this. Effectively, h0 and h1 grow faster than would be typical.
Also, I've come to the conclusion that the h2 that I estimated before, namely:
h2_max = log2(Ntokens) - h1
represents a theoretical maximum.
For the Pliny samples, the actual h2 is very close to this maximum.
At 1000 words it is 97% of the maximum and at 70,000 words it is 91%.
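A quick back-of-the-envelope check of that maximum against the Pliny numbers quoted above, reading the formula as h2_max = log2(Ntokens) - h1:
[code]
# Check of h2_max = log2(Ntokens) - h1 against the Pliny samples above.
from math import log2

samples = {1000: (8.955524195, 0.9827503151),   # Ntokens: (h1, h2)
           70000: (12.2592404, 3.499844451)}
for n, (h1, h2) in samples.items():
    h2_max = log2(n) - h1
    print(f"{n} words: h2_max = {h2_max:.2f}, actual h2 is {h2 / h2_max:.0%} of it")
[/code]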
For this text, h1-h2 is relatively stable. Its value is higher than I originally expected, which is clearly due to the Zipf's law model that I applied: it doesn't fit this text well, and perhaps not many other texts either.
(14-09-2019, 08:51 AM)nablator Wrote: (Values for entire TT with unclear "?" vords removed):
h0 h1 h2
12.97 10.45 4.36
When vords are randomly shuffled h2 increases:
12.97 10.45 4.52
Also using the numbers in Koen's table, I find that for this version of the TT transcription the number of word tokens was 37,886 and the number of word types 8,078.
This gives a theoretical maximum h2 of 4.76.
The shuffled value is closer to it, at 95% of the maximum.
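The same check as for Pliny, applied to these TT figures:
[code]
# Same check for the TT figures quoted above.
from math import log2

n_tokens, h1 = 37886, 10.45
h2_max = log2(n_tokens) - h1          # about 4.76
for label, h2 in (("original", 4.36), ("shuffled", 4.52)):
    print(f"{label}: {h2 / h2_max:.0%} of the maximum {h2_max:.2f}")
[/code]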