[split] Percentage of word types that occur more than once - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: [split] Percentage of word types that occur more than once (/thread-3255.html)

[split] Percentage of word types that occur more than once - bi3mw - 21-06-2020

During a study of the VMS, I found that only a relatively small percentage of Word Types occur more than once. Can anyone confirm this?

31.034909631% of all Word Types in the VMS occur more than once.
44.928611163% of all Word Types in the comparison text (Regimen Sanitatis) occur more than once.

RE: General chat thread - RobGea - 21-06-2020

From voynichese.com, which uses the T. Takahashi transliteration, though I am not sure which version.

Vocabulary: 8078
Hapax legomena: 5571
More than once: (8078 - 5571) = 2507
(100 / 8078) * 2507 = 31.034909631096806% of words occur more than once.

From regimem_sanitatis_corpus (not sure where I got this from), using my parser.

Vocabulary: 5399
Hapax legomena: 2978
More than once: 2421
(100 / 5399) * 2421 = 44.84163734024819% of words occur more than once.

Close. I imagine that I've somehow lost a few words in processing the Regimen text file. I haven't installed OpenOffice yet, so I can't check your files.

RE: General chat thread - bi3mw - 21-06-2020

Addendum: I used Torsten Timm's generator to create text.

46.51611658% of all Word Types in the generated text occur more than once.

RE: General chat thread - Emma May Smith - 22-06-2020

The size of the vocabulary and the number of one-token words might be misaligned with the apparent length of the text. The language of each scribe/section varies, so we're effectively looking at multiple texts, each with only a fraction of the words from the manuscript as a whole. Given that one-token word types will be more common in shorter texts, we need to compare each section with a text of equal length.
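
The percentages quoted above, and the equal-length comparison Emma May Smith suggests, can be sketched in a few lines of Python. This is a minimal illustration, not the method any poster actually used: the whitespace tokenizer, the lowercasing, the function names, and the tiny sample text are all assumptions made here for the example.

```python
from collections import Counter


def percent_repeated(n_types, n_hapax):
    """Percentage of word types that occur more than once, from raw counts."""
    return 100.0 * (n_types - n_hapax) / n_types


def repeated_type_share(tokens):
    """Return (word types, hapax legomena, percent of types seen more than once)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    if n_types == 0:
        return 0, 0, 0.0
    return n_types, n_hapax, percent_repeated(n_types, n_hapax)


def tokenize(text):
    # Assumption: lowercase everything and split on whitespace only.
    return text.lower().split()


def compare_equal_length(tokens_a, tokens_b):
    """Truncate both token lists to the shorter length before comparing."""
    n = min(len(tokens_a), len(tokens_b))
    return repeated_type_share(tokens_a[:n]), repeated_type_share(tokens_b[:n])


# Counts reported in the thread (vocabulary, hapax legomena).
vms_percent = percent_repeated(8078, 5571)
regimen_percent = percent_repeated(5399, 2978)

# Hypothetical toy sample, only to show the token-level function.
sample = tokenize("daiin ol chedy daiin qokeedy ol shedy daiin")
sample_stats = repeated_type_share(sample)

if __name__ == "__main__":
    print(vms_percent, regimen_percent)
    print(sample_stats)
```

Truncating to the shorter text is only one way to equalise length; sampling several windows of the same size from the longer text and averaging would be a reasonable alternative.
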

RE: General chat thread - Ruby Novacna - 22-06-2020

(21-06-2020, 09:55 PM) RobGea Wrote: From voynichese.com

You read my mind, speaking of voynichese. I use it regularly, but the only thing I can do is display a word in color and see its frequency. Can we search for two neighboring words at the same time?

RE: General chat thread - ReneZ - 22-06-2020

There are several reasons why the results for the Voynich text may be skewed. We don't know, of course, how much impact each of these possible reasons has, though one could try to simulate this if one wanted.

1. There are errors in the transliteration. These are likely to create (some) additional unique words.
2. Specifically, word spaces are likely to be misinterpreted. This has an impact on the word count.

For these two points, I recommend looking at Table 3 on [link not shown], which shows a great spread in the number of unique words (word types).

3. The comparison texts are from printed sources and do not contain errors. Any errors in the handwritten Voynich text would increase the number of unique words (even when transliterated correctly).

RE: General chat thread - Ruby Novacna - 22-06-2020

(22-06-2020, 04:07 PM) ReneZ Wrote: I can recommend to look at Table 3 on [link not shown]

With the number of unique words between 8400 and 9800 and my reading speed of 1-2 words per week, it will take me between 90 and 190 years. I'd better go on a diet if I want to get through it.

RE: General chat thread - Stephen Carlson - 23-06-2020

(22-06-2020, 04:07 PM) ReneZ Wrote: For these two points, I can recommend to look at Table 3 on [link not shown], that shows a great spread in the number of unique words (word types).

From the linked page:

Quote: A representative number of word types may be 9,000 - 10,000.

This strikes me as on the high side, at least for certain languages and genres. It's a kind of obvious test, but has anyone compared the unique word count of the VM to that of other works in various languages? This page here [link not shown] puts the number of words per unique word for 7 different English-language novels at between 9 and 16.5. If I understand the VM stats right, it comes in between 3.6 and 4.3, depending on the transcription. It seems that the number of unique words in the denominator is about three or four times too high, but I'm curious about non-English works.

RE: General chat thread - bi3mw - 23-06-2020

Maybe I have an error in my thinking somewhere, but I get a value of 6.8 for the VMS:

Unique words VMS: 5571
VMS words total: 37886
Words per unique word: 37886 / 5571 = 6.8006

P.S.: I probably only come up with 8078 Word Types because all uppercase letters have been converted to lowercase.

RE: General chat thread - Stephen Carlson - 23-06-2020

(23-06-2020, 12:24 PM) bi3mw Wrote: Unique words VMS: 5571

5571 is more reasonable, but that's not the number on Rene's page. Where is the 5571 coming from?
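
The words-per-unique-word figure depends heavily on which count goes in the denominator. The short sketch below reruns the arithmetic with the numbers quoted in this thread. It only observes that 5571 coincides with the hapax legomena count RobGea reported from voynichese.com, while 8078 is the vocabulary size both posters mention; combining bi3mw's token total with that vocabulary figure is an assumption made here for illustration, and the function name is hypothetical.

```python
def tokens_per_type(total_tokens, n_types):
    """Average number of running words per distinct word type."""
    return total_tokens / n_types


# Figures quoted in the thread.
TOTAL_TOKENS = 37886  # bi3mw's total word count for the VMS
VOCABULARY = 8078     # word types (bi3mw, and RobGea via voynichese.com)
HAPAX = 5571          # hapax legomena count reported by RobGea

# bi3mw's calculation, which divides by 5571.
ratio_with_5571 = round(tokens_per_type(TOTAL_TOKENS, HAPAX), 4)

# The same total divided by the full vocabulary instead.
ratio_with_vocabulary = round(tokens_per_type(TOTAL_TOKENS, VOCABULARY), 4)

if __name__ == "__main__":
    print(ratio_with_5571)        # 6.8006
    print(ratio_with_vocabulary)  # 4.69
```

With the vocabulary of 8078 in the denominator the ratio is about 4.7, closer to the 3.6 to 4.3 range Stephen Carlson cites than the 6.8 obtained with 5571.
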