Words get more unusual as they get longer (from 2 characters) - Printable Version
The Voynich Ninja (https://www.voynich.ninja), Forum: Voynich Research, Analysis of the text, Thread: /thread-4502.html
Words get more unusual as they get longer (from 2 characters) - Addsamuels - 22-02-2025

Jaccard Similarity

===For 1-grams
Average 1-gram similarity among least common words: 0.25570049296453057
Average 1-gram similarity among most common words: 0.21682838534999466
===For 2-grams
Average 2-gram similarity among least common words: 0.06423834336923202
Average 2-gram similarity among most common words: 0.06670467758507431
===For 3-grams
Average 3-gram similarity among least common words: 0.016802481467017006
Average 3-gram similarity among most common words: 0.02217942907311669
===For 4-grams
Average 4-gram similarity among least common words: 0.004504037054205818
Average 4-gram similarity among most common words: 0.007278612262000938
===For 5-grams
Average 5-gram similarity among least common words: 0.000987486273694194
Average 5-gram similarity among most common words: 0.0019374143958861578
===For 6-grams
Average 6-gram similarity among least common words: 0.0002422555676235787
Average 6-gram similarity among most common words: 0.0003965759779713268

Levenshtein Distance

Length 2:
Overall average Levenshtein distance: 1.8191
Least common words (freq = 2) -> Avg distance: 1.8333
Top 40 most common words -> Avg distance: 1.8115
Top 10 most common words -> Avg distance: 1.7778

Length 3:
Overall average Levenshtein distance: 2.6262
Least common words (freq = 2) -> Avg distance: 2.6473
Top 40 most common words -> Avg distance: 2.4744
Top 10 most common words -> Avg distance: 2.4667

Length 4:
Overall average Levenshtein distance: 3.4458
Least common words (freq = 2) -> Avg distance: 3.5212
Top 40 most common words -> Avg distance: 3.2308
Top 10 most common words -> Avg distance: 3.1556

Length 5:
Overall average Levenshtein distance: 4.1917
Least common words (freq = 2) -> Avg distance: 4.2958
Top 40 most common words -> Avg distance: 3.8141
Top 10 most common words -> Avg distance: 3.8667

Length 6:
Overall average Levenshtein distance: 4.8464
Least common words (freq = 2) -> Avg distance: 4.9734
Top 40 most common words -> Avg distance: 4.3154
Top 10 most common words -> Avg distance: 3.9778

Length 7:
Overall average Levenshtein distance: 5.4544
Least common words (freq = 2) -> Avg distance: 5.6087
Top 40 most common words -> Avg distance: 4.7000
Top 10 most common words -> Avg distance: 4.5333

Length 8:
Overall average Levenshtein distance: 5.9280
Least common words (freq = 2) -> Avg distance: 6.1097
Top 40 most common words -> Avg distance: 5.3397
Top 10 most common words -> Avg distance: 5.1333

N.B.: Hapax legomena were ignored.

Note that this is consistent with natural languages, but I believe the effect is more pronounced in the Voynich.

RE: Words get more unusual as they get longer (from 2 characters) - bt2901 - 24-02-2025

I think this should be analyzed using histograms instead of average values. I'd expect Voynichese to have a bimodal distribution: some vords are more similar than in natural languages (think about sequences like qokeedy.qotedy.qokeedy.qokeedy.qokeey), while some are less similar because vords do not seem to have identifiable morphology (roots, prefixes, etc.).

RE: Words get more unusual as they get longer (from 2 characters) - obelus - 27-02-2025

The main "more unusual" trend is indeed found in all texts, including random ones.

For "overall average Levenshtein distance" I suppose that you calculated the average edit distance of all pairwise permutations of word types within each EVA length class. I get values similar to [link] using paragraph text in IT2a-n.txt (chosen for its alphabet-sized character set). Each increment of word length adds approximately 0.8 edits to the "overall average":

[Plot not shown: average edit distance against word length]

We can convince ourselves (by thought experiment) that as the number of available characters increases, the plotted points will converge on the grey line of 1:1 slope.
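For anyone who wants to reproduce the per-length averages, here is a minimal sketch of the calculation as obelus supposes it was done: the mean edit distance over all pairs of distinct word types within each length class. The function names and the toy word list are illustrative assumptions, not taken from either poster's code, and word length is simply counted in transcription characters.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def mean_pairwise_distance(types):
    """Average edit distance over all unordered pairs of distinct word types."""
    pairs = list(combinations(sorted(set(types)), 2))
    if not pairs:
        return float("nan")  # fewer than two types: no pairs to average
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def distance_by_length(tokens):
    """Group word types by length; return {length: mean pairwise distance}."""
    by_length = defaultdict(set)
    for token in tokens:
        by_length[len(token)].add(token)
    return {n: mean_pairwise_distance(w) for n, w in sorted(by_length.items())}


# Toy input (hypothetical); a real run would read tokens from a transcription file.
toy = "daiin dain chol chor shol qokeedy qokedy qotedy ol or ar".split()
for length, avg in distance_by_length(toy).items():
    # Columns: word length, average distance, average distance per character
    print(length, round(avg, 4), round(avg / length, 4))
```

Because the distance is symmetric, averaging over unordered pairs gives the same value as averaging over ordered ones. The pairwise loop is quadratic in the number of types per length class, which is fine for a manuscript-sized vocabulary but slow for large corpora.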
Dividing the average edit distance by word length yields average edit distance on a per-character basis (the local slope):

[Plot not shown: average edit distance per character against word length]

Here the VMS transcription in grey is compared with "natural" texts of similar length: early modern English (red, 1611 King James Gospels), and medieval Latin (blue, Speculum Humanae Salvationis). Open symbols represent a version of each text in which the character order has been pseudorandomly scrambled, preserving the word length and character frequency distributions. The three scrambled variants are barely distinguishable in the plot. Thus, the dominant contribution to edit-distance-unusualness is a combinatorial property unrelated to any fine structure of the vocabularies.

On closer inspection:

The "natural" text samples KJB and SHS show a small offset toward lower average edit distance from their scrambled counterparts, representing some kind of letter-order and -frequency correlations, not strongly dependent on word length.

Surprise! Voynichese is the outlier. Its longer "words" are progressively less mutually unusual (on a per-character basis) than the shorter or scrambled ones. Does it mean that longer words contain more longer-range correlations?

This topic may have been addressed on the forum previously, from a different perspective. It has been a slow week in Analysisland.

RE: Words get more unusual as they get longer (from 2 characters) - Addsamuels - 28-02-2025

(27-02-2025, 10:43 PM) obelus wrote: The main "more unusual" trend is indeed found in all texts, including random ones.

Actually, my title for the thread was terrible. It should have been "rarer (less common) words are more unusual (according to Levenshtein and Jaccard similarity) (from 2 characters)". Please review the data in this light.
Also, I dwell insularly.

RE: Words get more unusual as they get longer (from 2 characters) - Addsamuels - 28-02-2025

It should follow nicely from the data that rarer words are indeed more unusual (as measured by Jaccard similarity or Levenshtein distance) for lengths 2 to 8. This can also be seen simply by looking at one of my other posts about the most common words in the manuscript. There (even when ignoring hapax legomena), it can be seen to be true.

This fits a theme that the rarer words are derived from (shorter) common words. Further empirical evidence should illuminate this, once enough thought and formatting have gone into it to make for more fitting posts.

RE: Words get more unusual as they get longer (from 2 characters) - Addsamuels - 28-02-2025

N.B.: See [link] for some ideas.
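The rare-versus-common comparison can be sketched for the n-gram figures too. The thread does not say how the Jaccard groups were chosen or whether n-grams were treated as sets, so the following is only one plausible reading: n-grams as sets, "least common" as frequency exactly 2 (hapax legomena excluded), and "most common" as the top 40 types. All names and parameters here are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def ngrams(word: str, n: int) -> set:
    """Set of contiguous character n-grams (empty if the word is shorter than n)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection size over union size; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_jaccard(words, n: int) -> float:
    """Average n-gram Jaccard similarity over all unordered pairs of words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return float("nan")
    return sum(jaccard(ngrams(a, n), ngrams(b, n)) for a, b in pairs) / len(pairs)


def rare_vs_common(tokens, n: int, top_k: int = 40, rare_freq: int = 2):
    """Return (mean similarity among rare types, mean similarity among common types).

    Assumption: 'rare' means frequency == rare_freq, 'common' means the top_k types.
    """
    counts = Counter(tokens)
    common = [w for w, _ in counts.most_common(top_k)]
    rare = sorted(w for w, c in counts.items() if c == rare_freq)
    return mean_jaccard(rare, n), mean_jaccard(common, n)


# Toy token list (hypothetical); a real run would use the full transcription.
tokens = ["daiin"] * 5 + ["chol"] * 4 + ["chor"] * 2 + ["shol"] * 2 + ["qo"]
for n in (1, 2, 3):
    print(n, rare_vs_common(tokens, n, top_k=2))
```

Note that this version mixes word lengths within each group, so a length effect and a frequency effect can be confounded; running it separately per length class, as in the Levenshtein figures, would avoid that.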