[split] Percentage of word types that occur more than once - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: [split] Percentage of word types that occur more than once (/thread-3255.html)
RE: [split] Percentage of word types that occur more than once - Koen G - 23-06-2020

Thread split. I think Emma's remarks must certainly be taken into account. Word count must be the same for each text (this is exactly the same as with TTR research) and ideally should be limited to one VM section / dialect.

RE: [split] Percentage of word types that occur more than once - bi3mw - 23-06-2020

(23-06-2020, 01:12 PM) Stephen Carlson Wrote:
5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?

It comes from total word types - non unique word types: 8078 - 2507 = 5571 [link]

The VMS text file is discussed [link]

RE: [split] Percentage of word types that occur more than once - Stephen Carlson - 23-06-2020

(23-06-2020, 02:55 PM) bi3mw Wrote:
(23-06-2020, 01:12 PM) Stephen Carlson Wrote:
5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?
It comes from total word types - non unique word types: 8078 - 2507 = 5571

Oh, you're referring to hapax legomena. I'm not talking about those, but that number is also surprisingly high in comparison with natural language texts.

RE: [split] Percentage of word types that occur more than once - Alin_J - 23-06-2020

(21-06-2020, 08:33 PM) bi3mw Wrote:
During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

If using the 101 transliteration (Glen Claston), the percentage of unique words (word types) that occur more than once in the Voynich manuscript is about 28%. So, 72% of the word types in the VMS are hapax legomena.
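The subtraction in bi3mw's reply can be checked with a short Python sketch. It only uses the two counts quoted above (8078 word types in total, 2507 of them occurring more than once); the function name `type_shares` is made up for illustration and is not taken from anyone's actual script.

```python
def type_shares(total_types: int, repeated_types: int) -> dict:
    """Split a vocabulary into hapax legomena and repeated word types."""
    # Word types that occur exactly once are whatever is left over.
    hapax_types = total_types - repeated_types
    return {
        "hapax_types": hapax_types,
        "hapax_pct": 100 * hapax_types / total_types,
        "repeated_pct": 100 * repeated_types / total_types,
    }


# Counts quoted in the thread: 8078 word types, 2507 occurring more than once.
shares = type_shares(total_types=8078, repeated_types=2507)
print(shares["hapax_types"])             # 5571
print(round(shares["hapax_pct"], 1))     # 69.0
print(round(shares["repeated_pct"], 1))  # 31.0
```

With these counts roughly 31% of word types occur more than once. Alin_J's figure of about 28% is based on the 101 transliteration, so the two values need not agree.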
RE: [split] Percentage of word types that occur more than once - bi3mw - 23-06-2020

(23-06-2020, 03:57 PM) Stephen Carlson Wrote:
Oh, you're referring to hapax legomena. I'm not talking about those, but that number is also surprisingly high in comparison with natural language texts.

Yeah, that's kind of the flip side of my observation in the opening post. Thanks @Alin_J, I would not have thought that the value is even lower.

RE: [split] Percentage of word types that occur more than once - Alin_J - 23-06-2020

(23-06-2020, 04:16 PM) Alin_J Wrote:
(21-06-2020, 08:33 PM) bi3mw Wrote:
During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

But then again, this is IMO nothing unusual for natural language texts. For example, the corresponding number for the Finnish translation of Hamlet (total word count: 23448 tokens) is 26%, i.e. 74% hapax legomena.

RE: [split] Percentage of word types that occur more than once - Alin_J - 23-06-2020

(23-06-2020, 11:16 AM) Stephen Carlson Wrote:
(22-06-2020, 04:07 PM) ReneZ Wrote:
For these two points, I can recommend looking at Table 3 on [link], which shows a great spread in the number of unique words (word types).
From the linked page:

This is also not unusual, at least not for non-English works... I found that the Swedish novel Inferno by August Strindberg has a number of 4.38 (total number of words about 46 000).

RE: [split] Percentage of word types that occur more than once - bi3mw - 23-06-2020

(23-06-2020, 04:30 PM) Alin_J Wrote:
But then again, this is IMO nothing unusual for natural language texts.

Hmm, I would have rather thought that a ratio of 45% / 55% is the "normal case" in longer texts, but surely it depends strongly on the text genre and language.

RE: [split] Percentage of word types that occur more than once - Alin_J - 23-06-2020

(23-06-2020, 05:02 PM) bi3mw Wrote:
(23-06-2020, 04:30 PM) Alin_J Wrote:
But then again, this is IMO nothing unusual for natural language texts.
Hmm, I would have rather thought that a ratio of 45% / 55% is the "normal case" in longer texts, but surely it depends strongly on the text genre and language.

Yeah, it does seem to vary a lot depending on both language and type of work, e.g. fiction, encyclopedic, poetic etc.

RE: [split] Percentage of word types that occur more than once - RobGea - 23-06-2020

After wrestling a bit with the web and terminology, here are some quickly put together stats (errors and omissions included).

Code: VoynichTT

Here you can see that Pliny's Natural History, an encyclopedic work, has lots of hapax, and almost every 3rd word is a new addition to the vocabulary. Dickens' 'Tale Of Two Cities' is at the other end of the scale (perhaps explaining some of his appeal), where a new word is introduced only every 13.4 words. 'Tale Of Two Cities' is a popular book with stats comparable to 'The Adventures of Tom Sawyer'. Dickens also has the lowest percentage of hapax, but nearly half of this book's vocabulary is still unique words (hapax legomena). Genre could be an influence on these numbers, as noted by Alin_J and bi3mw.

Interestingly, we can see that Dante has the closest numbers to the Voynich MS. The divisions of The Divine Comedy perhaps affect the statistics of that text in a similar manner to the way the 6 sections of the VMS are possibly culpable for its (the VMS's) stats.
Further investigation is required, as noted by Emma May Smith. Hapax legomena are the other side of the coin to the concept of 'word types that occur more than once', as noted by bi3mw.

Ref: https://en.wikipedia.org/wiki/Hapax_legomenon

And this looks quite interesting as well: [link]

Edit: 24/06/20 bi3mw pointed out the Dante stats are wrong; see thread page 4.
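For anyone wanting to reproduce numbers of this kind, here is a minimal Python sketch of how such stats could be computed from a tokenised text. It is not RobGea's script: the function name `vocab_stats`, the `limit` parameter and the toy sentence are all made up for illustration. It also assumes that "a new word every N words" means total tokens divided by word types, which is one possible reading. The `limit` argument follows Koen G's point that the word count should be the same for every text being compared.

```python
from collections import Counter


def vocab_stats(tokens, limit=None):
    """Return hapax and vocabulary statistics for a list of word tokens.

    limit: if given, only the first `limit` tokens are used, so that
    texts of different length can be compared at the same word count.
    """
    if limit is not None:
        tokens = tokens[:limit]
    if not tokens:
        raise ValueError("need at least one token")

    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: word types that occur exactly once.
    n_hapax = sum(1 for c in counts.values() if c == 1)

    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapax_pct": 100 * n_hapax / n_types,
        "repeated_pct": 100 * (n_types - n_hapax) / n_types,
        # Assumed reading of "a new word every N words".
        "tokens_per_type": n_tokens / n_types,
    }


# Toy sample, not taken from any of the texts discussed in the thread.
sample = "the cat saw the dog and the cat".split()
stats = vocab_stats(sample)
print(stats["types"], stats["hapax_pct"], stats["tokens_per_type"])

# Same text truncated to its first 4 tokens.
short_stats = vocab_stats(sample, limit=4)
print(short_stats["types"], round(short_stats["hapax_pct"], 1))
```

Splitting on whitespace is the simplest possible tokenisation. Real texts would also need punctuation and case handled consistently before the percentages can be compared across languages.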