Word Entropy - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Word Entropy (/thread-2928.html)
RE: Word Entropy - Koen G - 14-09-2019

(14-09-2019, 05:35 PM)Anton Wrote: As I said yesterday in one of the statistics threads (I don't remember which), I think at least 2 orders of magnitude more than the vocabulary size. But, once again, you can explore that by taking a large text of, say, 100,000 words, and calculating entropy over a fragment thereof, say, the first 10,000 words (note that its vocabulary would probably be smaller than that of the whole text). Then take a larger fragment of, say, 20,000 words. And so on. And observe how the values change.

Okay, I'll test this in a bit. Rene: I redid everything by copy-pasting directly from the command prompt, so it should be better now. There are still the same high h1-h2 values though.


RE: Word Entropy - Koen G - 14-09-2019

So apparently text size does matter. However, I don't think there is any ideal size for h2, since even up to 70k words it keeps increasing. I tested two texts at the first 1k, 2k, 5k, 10k, 35k and 70k words. This shows that h2 by itself is completely unreliable unless some kind of uniformization or normalization takes place. However, h1-h2 is less variable, and there is no huge difference between 5k and 10k.


RE: Word Entropy - Anton - 14-09-2019

The top graph is h2, isn't it? Could you post how h0 changes along with it?


RE: Word Entropy - Koen G - 14-09-2019

Sorry, it is indeed h2. Here's a chart for each text, and the values:

Text1:
        h0            h1            h2
1k      9.319672121   8.761989719   1.0836556
2k      10.18115226   9.513244511   1.341051273
5k      11.1792871    10.18085782   1.938496171
10k     11.87382857   10.52704111   2.478757416
35k     13.18936129   11.21249127   3.388827059
70k     13.83041617   11.48206278   3.944103731

Text2:
        h0            h1            h2
1k      8.851749041   8.089312357   1.73445866
2k      9.566054038   8.499076048   2.249659701
5k      10.37177664   8.817176607   3.084718418
10k     10.96144969   9.024066872   3.695144981
35k     12.02583167   9.329523064   4.737718001
70k     12.54109662   9.414111121   5.238267741


RE: Word Entropy - Anton - 14-09-2019

OK, this is what Rene spoke about: the vocabulary continues to grow along with the text size. With h0 = 13.83, this is roughly a 15k vocabulary, and compared to that, the 70k text size is of course not sufficient. I guess that's because the texts under test are some kind of fiction, where there is no quick limit to the vocabulary. In contrast, a professional text would probably have a more rapidly stabilizing vocabulary.


RE: Word Entropy - Koen G - 14-09-2019

Yeah, maybe this will be too dependent on text type to derive a formula for normalization. Another problem is that texts aren't uniform; Q13 is a good example. h1-h2 seems somewhat stable though, so wouldn't it be best to just rely on that?

Edit: on the other hand, I just tried a line chart with a log scale and the three lines are pretty straight. Doesn't that mean one could normalize the data somehow?


RE: Word Entropy - Koen G - 14-09-2019

I've also checked Pliny's Natural History:

        h0            h1            h2
1k      9.422064766   8.955524195   0.9827503151
2k      10.23959853   9.592929606   1.311693669
5k      11.19660213   10.17381079   2.003107139
10k     12.03823313   10.73489704   2.415263961
35k     13.62798978   11.820678     3.029928693
70k     14.42101286   12.2592404    3.499844451


RE: Word Entropy - Anton - 14-09-2019

(14-09-2019, 09:40 PM)Koen G Wrote: Doesn't that mean one could normalize the data somehow?

If this is the general behaviour of all texts in all languages, then yes, it does, but I'm not sure that it is.
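The prefix experiment discussed above is easy to reproduce. Below is a minimal sketch, assuming a whitespace-tokenized plain-text file ("text.txt" is a hypothetical placeholder) and the definitions used in this thread: h0 = log2 of the vocabulary size, h1 = unigram word entropy, h2 = conditional entropy of a word given the preceding word. The actual scripts used by the posters are not shown in this thread, so this is an illustration, not their method.

Code:
# Prefix-size experiment: compute h0, h1 and the conditional word entropy h2
# for increasing prefixes of a text. Sketch only: the file name, whitespace
# tokenization and exact entropy definitions are assumptions.
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a frequency distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def word_entropies(words):
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    h0 = math.log2(len(unigrams))         # log2 of vocabulary size
    h1 = entropy(unigrams.values())       # unigram word entropy
    h2 = entropy(bigrams.values()) - h1   # ~ conditional entropy H(w_n | w_n-1)
    return h0, h1, h2

if __name__ == "__main__":
    with open("text.txt", encoding="utf-8") as f:   # hypothetical input file
        words = f.read().split()
    for n in (1000, 2000, 5000, 10000, 35000, 70000):
        h0, h1, h2 = word_entropies(words[:n])
        print(f"{n:>6}  h0={h0:.3f}  h1={h1:.3f}  h2={h2:.3f}  h1-h2={h1 - h2:.3f}")

Run on successive prefixes, this should show the same pattern as the tables above: h0 and h2 keep climbing with sample size, while h1-h2 settles down more quickly.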
RE: Word Entropy - ReneZ - 15-09-2019

Pliny's text is an interesting model text, and perhaps not 'typical'. By its encyclopedic nature, it has more word types (different words) than the average text. The h0 value shows that at 70,000 words it has 21,936 word types. I don't think that many texts will even come near this. Effectively, h0 and h1 grow faster than would be typical.

Also, I've come to the conclusion that the h2 that I estimated before, namely log2(Ntokens) - h1, represents a theoretical maximum. For the Pliny samples, the actual h2 is very close to this maximum: at 1,000 words it is 97% of the maximum, and at 70,000 words it is 91%.

For this text, h1-h2 is relatively stable. Its value is higher than I originally expected, and this is clearly due to the Zipf law model that I applied, which doesn't fit this text, and perhaps doesn't fit many texts.


RE: Word Entropy - ReneZ - 15-09-2019

(14-09-2019, 08:51 AM)nablator Wrote: (Values for entire TT with unclear "?" vords removed):

Using also the numbers in Koen's table, I find that for this version of the TT transcription, the number of word tokens was 37,886 and the number of word types 8,078. This gives a maximum theoretical h2 of 4.76. The shuffled value is closer to it (at 95%).
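One way to see Rene's maximum: the bigram entropy cannot exceed log2 of the number of bigram tokens (roughly the number of word tokens), so the conditional h2 is bounded by log2(Ntokens) - h1. A short sketch of the arithmetic, using only numbers quoted in the posts above; the TT h1 is not quoted in this excerpt, so it is backed out from the stated maximum rather than measured.

Code:
# Rene's bound: conditional h2 <= log2(N_tokens) - h1.
# All numbers below are taken from the posts above; this just redoes the
# arithmetic, it is not a new measurement.
import math

def h2_max(n_tokens, h1):
    return math.log2(n_tokens) - h1

# Pliny, 1,000-word and 70,000-word samples (values from Koen's table)
for n, h1, h2 in [(1000, 8.955524195, 0.9827503151),
                  (70000, 12.2592404, 3.499844451)]:
    bound = h2_max(n, h1)
    print(f"N={n:>6}: h2_max={bound:.3f}, actual h2={h2:.3f} ({h2 / bound:.0%} of max)")

# Takahashi transcription: 37,886 word tokens; h1 is not given in this
# excerpt, so infer it from Rene's stated maximum of 4.76.
print(f"implied TT h1 ~ {math.log2(37886) - 4.76:.2f}")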