22-02-2025, 06:33 PM
Jaccard Similarity
===For 1-grams
Average 1-gram similarity among least common words: 0.25570049296453057
Average 1-gram similarity among most common words: 0.21682838534999466
===For 2-grams
Average 2-gram similarity among least common words: 0.06423834336923202
Average 2-gram similarity among most common words: 0.06670467758507431
===For 3-grams
Average 3-gram similarity among least common words: 0.016802481467017006
Average 3-gram similarity among most common words: 0.02217942907311669
===For 4-grams
Average 4-gram similarity among least common words: 0.004504037054205818
Average 4-gram similarity among most common words: 0.007278612262000938
===For 5-grams
Average 5-gram similarity among least common words: 0.000987486273694194
Average 5-gram similarity among most common words: 0.0019374143958861578
===For 6-grams
Average 6-gram similarity among least common words: 0.0002422555676235787
Average 6-gram similarity among most common words: 0.0003965759779713268
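For reference, here is a minimal sketch of how averages like the ones above could be computed. It assumes that each word is broken into its set of character n-grams and that the Jaccard similarity is averaged over all unordered word pairs in a group; the post does not say how the original script does this, so the function names and the pairing scheme are illustrative assumptions.

```python
from itertools import combinations


def char_ngrams(word, n):
    """Return the set of character n-grams in `word` (empty if the word is shorter than n)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_ngram_similarity(words, n):
    """Mean Jaccard similarity of n-gram sets over all unordered pairs of words.

    Assumption: every pair in the group is compared exactly once.
    """
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(char_ngrams(u, n), char_ngrams(v, n)) for u, v in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    sample = ["daiin", "dain", "chol"]  # illustrative word list, not the real groups
    print(average_ngram_similarity(sample, 2))
```

Because a word of length k has at most k - n + 1 distinct n-grams, the chance of two words sharing any of them drops quickly as n grows, which fits the steep fall from 1-grams to 6-grams in the figures above.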
Levenshtein Distance
Length 2:
Overall average Levenshtein distance: 1.8191
Least common words: (freq = 2) -> Avg distance: 1.8333
Top 40 most common words: -> Avg distance: 1.8115
Top 10 most common words: -> Avg distance: 1.7778
Length 3:
Overall average Levenshtein distance: 2.6262
Least common words: (freq = 2) -> Avg distance: 2.6473
Top 40 most common words: -> Avg distance: 2.4744
Top 10 most common words: -> Avg distance: 2.4667
Length 4:
Overall average Levenshtein distance: 3.4458
Least common words: (freq = 2) -> Avg distance: 3.5212
Top 40 most common words: -> Avg distance: 3.2308
Top 10 most common words: -> Avg distance: 3.1556
Length 5:
Overall average Levenshtein distance: 4.1917
Least common words: (freq = 2) -> Avg distance: 4.2958
Top 40 most common words: -> Avg distance: 3.8141
Top 10 most common words: -> Avg distance: 3.8667
Length 6:
Overall average Levenshtein distance: 4.8464
Least common words: (freq = 2) -> Avg distance: 4.9734
Top 40 most common words: -> Avg distance: 4.3154
Top 10 most common words: -> Avg distance: 3.9778
Length 7:
Overall average Levenshtein distance: 5.4544
Least common words: (freq = 2) -> Avg distance: 5.6087
Top 40 most common words: -> Avg distance: 4.7000
Top 10 most common words: -> Avg distance: 4.5333
Length 8:
Overall average Levenshtein distance: 5.9280
Least common words: (freq = 2) -> Avg distance: 6.1097
Top 40 most common words: -> Avg distance: 5.3397
Top 10 most common words: -> Avg distance: 5.1333
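The per-length averages above can be reproduced along the following lines. This is a sketch under the assumption that the distance is the standard Levenshtein edit distance (insertions, deletions and substitutions each cost 1) and that it is averaged over all unordered pairs of words within a group; the helper names are hypothetical, not taken from the original script.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def average_distance(words):
    """Mean Levenshtein distance over all unordered pairs of words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    sample = ["ab", "ac", "bc"]  # illustrative length-2 group
    print(average_distance(sample))
```

For two words of the same length the distance can never exceed that length, so each average above is bounded by its "Length" heading; the lower the average sits below that bound, the more the words in the group resemble each other.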
N.B.: Hapax legomena (words that occur only once) were ignored.
Note that the most common words being closer to one another than the least common words is consistent with natural languages, but I believe the effect is more pronounced in the Voynich.
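As a sketch of how the word groups might be built for a given length: count the tokens, drop hapax legomena, take the words with frequency 2 as the "least common" group and the top k by frequency as the "most common" group. The function name, its parameters and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def frequency_groups(tokens, length, top_k=10, min_freq=2):
    """Return (least_common, most_common) word types of the given length.

    Words occurring fewer than `min_freq` times (hapax legomena when
    min_freq is 2) are dropped before either group is formed.
    """
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if len(w) == length and c >= min_freq}
    # Least common group: words sitting exactly at the minimum frequency.
    least_common = sorted(w for w, c in kept.items() if c == min_freq)
    # Most common group: highest counts first, ties broken alphabetically.
    ranked = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    most_common = [w for w, _ in ranked[:top_k]]
    return least_common, most_common


if __name__ == "__main__":
    tokens = "ol ol ol or or ar ar al dy daiin".split()  # toy transcription
    print(frequency_groups(tokens, length=2, top_k=2))
```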