The Voynich Ninja

Full Version: Words get more unusual as they get longer (from 2 characters)
Jaccard Similarity
===For 1-grams
Average 1-gram similarity among least common words: 0.25570049296453057
Average 1-gram similarity among most common words: 0.21682838534999466
===For 2-grams
Average 2-gram similarity among least common words: 0.06423834336923202
Average 2-gram similarity among most common words: 0.06670467758507431
===For 3-grams
Average 3-gram similarity among least common words: 0.016802481467017006
Average 3-gram similarity among most common words: 0.02217942907311669
===For 4-grams
Average 4-gram similarity among least common words: 0.004504037054205818
Average 4-gram similarity among most common words: 0.007278612262000938
===For 5-grams
Average 5-gram similarity among least common words: 0.000987486273694194
Average 5-gram similarity among most common words: 0.0019374143958861578
===For 6-grams
Average 6-gram similarity among least common words: 0.0002422555676235787
Average 6-gram similarity among most common words: 0.0003965759779713268
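The post does not say how these figures were computed. A minimal sketch of one plausible reading, assuming each word is reduced to its set of character n-grams and the Jaccard similarity is averaged over all unordered pairs of word types in a group (the function names are hypothetical, not from the thread):

```python
from itertools import combinations

def char_ngrams(word, n):
    """Set of character n-grams in a word (empty if the word is shorter than n)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 when both sets are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_jaccard(words, n):
    """Average n-gram Jaccard similarity over all unordered pairs of distinct word types."""
    sets = [char_ngrams(w, n) for w in sorted(set(words))]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Two vords quoted later in the thread share 3 of their 8 distinct 2-grams.
print(jaccard(char_ngrams("qokeedy", 2), char_ngrams("qotedy", 2)))  # 0.375
```

How pairs with no n-grams at all are scored (here 0.0) affects the averages for the longer n-grams, so that choice would need to match the original script before comparing numbers.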

Levenshtein Similarity
Length 2:
  Overall average Levenshtein distance: 1.8191
  Least common words: (freq = 2) -> Avg distance: 1.8333
  Top 40 most common words: -> Avg distance: 1.8115
  Top 10 most common words:  -> Avg distance: 1.7778

Length 3:
  Overall average Levenshtein distance: 2.6262
  Least common words: (freq = 2) -> Avg distance: 2.6473
  Top 40 most common words: -> Avg distance: 2.4744
  Top 10 most common words:  -> Avg distance: 2.4667

Length 4:
  Overall average Levenshtein distance: 3.4458
  Least common words: (freq = 2) -> Avg distance: 3.5212
  Top 40 most common words: -> Avg distance: 3.2308
  Top 10 most common words:  -> Avg distance: 3.1556

Length 5:
  Overall average Levenshtein distance: 4.1917
  Least common words: (freq = 2) -> Avg distance: 4.2958
  Top 40 most common words: -> Avg distance: 3.8141
  Top 10 most common words:  -> Avg distance: 3.8667

Length 6:
  Overall average Levenshtein distance: 4.8464
  Least common words: (freq = 2) -> Avg distance: 4.9734
  Top 40 most common words: -> Avg distance: 4.3154
  Top 10 most common words:  -> Avg distance: 3.9778

Length 7:
  Overall average Levenshtein distance: 5.4544
  Least common words: (freq = 2) -> Avg distance: 5.6087
  Top 40 most common words: -> Avg distance: 4.7000
  Top 10 most common words:  -> Avg distance: 4.5333

Length 8:
  Overall average Levenshtein distance: 5.9280
  Least common words: (freq = 2) -> Avg distance: 6.1097
  Top 40 most common words: -> Avg distance: 5.3397
  Top 10 most common words:  -> Avg distance: 5.1333


N.B.: Hapax legomena were ignored.
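For anyone wanting to reproduce the table, here is a sketch under stated assumptions: word types are grouped by length, types seen only once are dropped, "least common" means frequency 2, and each average runs over all unordered pairs in the group. The "Top 10" figures are at least consistent with 45 unordered pairs (for example 1.7778 is about 80/45). The helper names and the tie-breaking rule are my own.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def mean_pairwise_distance(words):
    """Average edit distance over all unordered pairs; None if fewer than two words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return None
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)

def summarize_length_class(tokens, length, top_k=10):
    """Averages for one word length, ignoring hapax legomena (types seen once)."""
    counts = Counter(t for t in tokens if len(t) == length)
    counts = {w: c for w, c in counts.items() if c > 1}
    # Most frequent first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {
        "overall": mean_pairwise_distance(ranked),
        "least_common": mean_pairwise_distance([w for w in ranked if counts[w] == 2]),
        "top_k": mean_pairwise_distance(ranked[:top_k]),
    }
```

Calling `summarize_length_class(tokens, 5, top_k=40)` on a tokenized transcription would give the three averages for length 5; the transcription file and tokenization are left open here because the post does not name them.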

Note that this is consistent with natural languages, but I believe it is more common in the Voynich.
I think this should be analyzed using histograms instead of average values? I'd expect Voynichese to have a bimodal distribution: some vords are more similar than in natural languages (think about sequences like qokeedy.qotedy.qokeedy.qokeedy.qokeey) while some are less similar because vords do not seem to have identifiable morphology (roots, prefixes, etc).
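A rough sketch of the histogram idea, assuming it means tabulating every pairwise edit distance within a group of word types instead of keeping only the mean. The function names are illustrative, and the text bars are just a stand-in for a proper plot.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def distance_histogram(words):
    """Map each edit distance to the number of unordered word-type pairs at that distance."""
    types = sorted(set(words))
    return Counter(levenshtein(a, b) for a, b in combinations(types, 2))

def print_histogram(hist, width=40):
    """Print one row per distance: distance, pair count, proportional bar."""
    total = sum(hist.values())
    for d in sorted(hist):
        bar = "#" * round(width * hist[d] / total)
        print(f"{d:2d} {hist[d]:6d} {bar}")

# The sequence quoted above has three distinct vords: one pair at distance 1, two at distance 2.
print_histogram(distance_histogram(["qokeedy", "qotedy", "qokeedy", "qokeedy", "qokeey"]))
```

A bimodal distribution would show up as two separate peaks in the counts, which an average alone cannot reveal.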
The main "more unusual" trend is indeed found in all texts, including random ones.

For "overall average Levenshtein distance" I suppose that you calculated the average edit distance of all pairwise permutations of word types within each EVA length class.  I get values similar to [link] using paragraph text in IT2a-n.txt (chosen for its alphabet-sized character set).  Each increment of word length adds approximately 0.8 edits to the "overall average":
[attachment=10080]
We can convince ourselves (by thought experiment) that as the number of available characters increases, the plotted points will converge on the grey line of 1:1 slope.
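The thought experiment can also be checked numerically. The sketch below is my own construction, not taken from the post: it draws uniformly random "words" of a fixed length from alphabets of increasing size and reports the mean pairwise edit distance per character, which should move toward 1 as the alphabet grows.

```python
import random
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def mean_distance_random_words(alphabet_size, length, n_words=40, seed=0):
    """Mean pairwise edit distance among random words drawn uniformly from an alphabet.

    Words are tuples of integer symbols, so any alphabet size can be simulated.
    """
    rng = random.Random(seed)
    words = [tuple(rng.randrange(alphabet_size) for _ in range(length))
             for _ in range(n_words)]
    pairs = list(combinations(words, 2))
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)

for k in (2, 5, 25, 1000):
    per_char = mean_distance_random_words(k, length=6) / 6
    print(f"alphabet size {k:4d}: {per_char:.3f} edits per character")
```

Uniform random symbols are a simplification; real texts have skewed character frequencies, which lowers the average distance somewhat.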

Dividing the average edit distance by word length yields average edit distance on a per-character basis (the local slope):
[attachment=10078]
Here the VMS transcription in gray is compared with "natural" texts of similar length:  early modern English (red, 1611 King James Gospels),  and medieval Latin (blue, Speculum Humanae Salvationis).  Open symbols represent a version of each text in which the character order has been pseudorandomly scrambled, preserving the word length and character frequency distributions.  The three scrambled variants are barely distinguishable in the plot.  Thus, the dominant contribution to edit-distance-unusualness is a combinatorial property unrelated to any fine structure of the vocabularies.
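The scrambling procedure is not spelled out. One simple way to get a control with the same word lengths and character frequencies is to pool every character in the text, shuffle the pool, and pour it back into the original word-length slots. This is an assumed method, and the function name and seed are illustrative.

```python
import random
from collections import Counter

def scramble_text(tokens, seed=0):
    """Shuffle all characters across the text, then refill the original word lengths."""
    rng = random.Random(seed)
    pool = [ch for tok in tokens for ch in tok]
    rng.shuffle(pool)
    scrambled, pos = [], 0
    for tok in tokens:
        scrambled.append("".join(pool[pos:pos + len(tok)]))
        pos += len(tok)
    return scrambled

tokens = ["qokeedy", "qotedy", "qokeedy", "qokeedy", "qokeey"]
shuffled = scramble_text(tokens)

# Both invariants hold by construction: same word lengths, same character counts.
assert [len(t) for t in shuffled] == [len(t) for t in tokens]
assert Counter("".join(shuffled)) == Counter("".join(tokens))
```

This keeps the token-level length distribution exactly. The number of distinct word types per length will generally differ from the original, which may matter when averaging over types.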

On closer inspection:
[attachment=10079]
The "natural" text samples KJB and SHS show a small offset toward lower average edit distance from their scrambled counterparts, representing some kind of letter-order and -frequency correlations, not strongly dependent on word length.  Surprise! — Voynichese is the outlier.  Its longer "words" are progressively less mutually unusual (on a per character basis) than the shorter or scrambled ones.  Does it mean that longer words contain more longer-range correlations?  This topic may have been addressed on the forum previously, from a different perspective.

It has been a slow week in Analysisland.
(27-02-2025, 10:43 PM)obelus Wrote: The main "more unusual" trend is indeed found in all texts, including random ones. [...]
Actually, my title for the thread was terrible. It should have been "Rarer (less common) words are more unusual (according to Levenshtein and Jaccard similarity) (from 2 characters)". Please review the data in this light. Also, I dwell insularly.
It should follow nicely from the data that rarer words are indeed more unusual (as measured by Jaccard similarity or Levenshtein distance) for lengths 2 to 8. This can also be seen simply by looking at one of my other posts about the most common words in the manuscript; even with hapax legomena ignored, it holds there. This fits the theme that the rarer words are derived from (shorter) common words, which further empirical evidence should illuminate once I have given it enough thought and formatting to make more fitting posts.
N.B.: See [link] for some ideas.