The Voynich Ninja

Jaccard Similarity
===For 1-grams
Average 1-gram similarity among least common words: 0.25570049296453057
Average 1-gram similarity among most common words: 0.21682838534999466
===For 2-grams
Average 2-gram similarity among least common words: 0.06423834336923202
Average 2-gram similarity among most common words: 0.06670467758507431
===For 3-grams
Average 3-gram similarity among least common words: 0.016802481467017006
Average 3-gram similarity among most common words: 0.02217942907311669
===For 4-grams
Average 4-gram similarity among least common words: 0.004504037054205818
Average 4-gram similarity among most common words: 0.007278612262000938
===For 5-grams
Average 5-gram similarity among least common words: 0.000987486273694194
Average 5-gram similarity among most common words: 0.0019374143958861578
===For 6-grams
Average 6-gram similarity among least common words: 0.0002422555676235787
Average 6-gram similarity among most common words: 0.0003965759779713268
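
For readers who want to reproduce figures of this kind, here is a minimal sketch of an average pairwise Jaccard similarity over character n-gram sets. The post does not show its script, so the function names, the use of sets rather than multisets, and the convention of scoring two empty n-gram sets as 0.0 are all assumptions.

```python
from itertools import combinations

def char_ngrams(word, n):
    """Set of character n-grams in word (empty if the word is shorter than n)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; assumed 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_ngram_similarity(words, n):
    """Mean Jaccard similarity over all unordered pairs of words."""
    grams = [char_ngrams(w, n) for w in words]
    pairs = list(combinations(grams, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    sample = ["qokeedy", "qotedy", "qokeey", "daiin"]
    for n in range(1, 4):
        print(n, average_ngram_similarity(sample, n))
```

Passing the list of least common or most common word types to `average_ngram_similarity` would give one value per group, as in the listing above. How words shorter than n are treated affects the averages for larger n, so that convention should be stated with any results.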

Levenshtein Distance
Length 2:
Overall average Levenshtein distance: 1.8191
Least common words: (freq = 2) -> Avg distance: 1.8333
Top 40 most common words: -> Avg distance: 1.8115
Top 10 most common words: -> Avg distance: 1.7778

Length 3:
Overall average Levenshtein distance: 2.6262
Least common words: (freq = 2) -> Avg distance: 2.6473
Top 40 most common words: -> Avg distance: 2.4744
Top 10 most common words: -> Avg distance: 2.4667

Length 4:
Overall average Levenshtein distance: 3.4458
Least common words: (freq = 2) -> Avg distance: 3.5212
Top 40 most common words: -> Avg distance: 3.2308
Top 10 most common words: -> Avg distance: 3.1556

Length 5:
Overall average Levenshtein distance: 4.1917
Least common words: (freq = 2) -> Avg distance: 4.2958
Top 40 most common words: -> Avg distance: 3.8141
Top 10 most common words: -> Avg distance: 3.8667

Length 6:
Overall average Levenshtein distance: 4.8464
Least common words: (freq = 2) -> Avg distance: 4.9734
Top 40 most common words: -> Avg distance: 4.3154
Top 10 most common words: -> Avg distance: 3.9778

Length 7:
Overall average Levenshtein distance: 5.4544
Least common words: (freq = 2) -> Avg distance: 5.6087
Top 40 most common words: -> Avg distance: 4.7000
Top 10 most common words: -> Avg distance: 4.5333

Length 8:
Overall average Levenshtein distance: 5.9280
Least common words: (freq = 2) -> Avg distance: 6.1097
Top 40 most common words: -> Avg distance: 5.3397
Top 10 most common words: -> Avg distance: 5.1333

N.B.: Hapax legomena were ignored.
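
A rough sketch of how per-length figures like these could be computed. The grouping rules are assumptions, since the post does not give them: word types are taken within one length class, hapax legomena (frequency 1) are dropped, "least common" means frequency exactly 2, and "top k" means the k most frequent types of that length. The names `distance_report` and `top_k` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def mean_pairwise_distance(words):
    """Mean edit distance over all unordered pairs (nan if fewer than two words)."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return float("nan")
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)

def distance_report(tokens, length, top_k=10):
    """Averages for one length class, with hapax legomena excluded."""
    counts = Counter(t for t in tokens if len(t) == length)
    types = [w for w, c in counts.most_common() if c >= 2]  # most frequent first
    return {
        "overall": mean_pairwise_distance(types),
        "least_common": mean_pairwise_distance([w for w in types if counts[w] == 2]),
        "top_k": mean_pairwise_distance(types[:top_k]),
    }
```

Calling `distance_report(tokens, 5, top_k=40)` on a tokenised transcription would give the three averages for five-character words under these assumptions.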

Note that this is consistent with natural languages, but I believe it is more pronounced in the Voynich manuscript.

I think this should be analyzed using histograms instead of average values. I'd expect Voynichese to have a bimodal distribution: some vords are more similar than in natural languages (think about sequences like qokeedy.qotedy.qokeedy.qokeedy.qokeey) while some are less similar because vords do not seem to have identifiable morphology (roots, prefixes, etc).
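
The histogram idea could be sketched as follows: count how many word pairs fall at each edit distance instead of reducing them to one mean. This is only an illustration; the function names are hypothetical and the sample words are the ones quoted in the sequence above.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_histogram(words):
    """Number of unordered word pairs at each edit distance."""
    return Counter(levenshtein(a, b) for a, b in combinations(words, 2))

def print_histogram(hist, width=40):
    """Plain-text bar chart, one row per distance value."""
    total = sum(hist.values())
    for d in sorted(hist):
        bar = "#" * round(width * hist[d] / total)
        print(f"{d:2d} {hist[d]:6d} {bar}")

if __name__ == "__main__":
    print_histogram(distance_histogram(["qokeedy", "qotedy", "qokeey"]))
```

A bimodal distribution would show up as two separate peaks in the output, which an average alone cannot reveal.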

The main "more unusual" trend is indeed found in all texts, including random ones.

For "overall average Levenshtein distance" I suppose that you calculated the average edit distance of all pairwise permutations of word types within each EVA length class. I get values similar to [link] using paragraph text in IT2a-n.txt (chosen for its alphabet-sized character set). Each increment of word length adds approximately 0.8 edits to the "overall average":
[attachment=10080]
We can convince ourselves (by thought experiment) that as the number of available characters increases, the plotted points will converge on the grey line of 1:1 slope.
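
One way to make the thought experiment concrete is to bound the expected distance for a simplified model. The model is an assumption and not something stated in the thread: a and b are independent words of equal length L, each character drawn uniformly from an alphabet of k symbols. The real calculation averages over distinct word types, so this is a sketch of the limiting behaviour only.

```latex
% Upper bound: substituting every mismatched position is always a valid edit script,
% so the Levenshtein distance cannot exceed the Hamming distance.
% Lower bound: if a and b share no character at all, no alignment contains a match,
% so the distance is exactly L; a union bound over the L^2 position pairs
% limits how often a shared character occurs.
\begin{align*}
\mathbb{E}\,[d_{\mathrm{Lev}}(a,b)]
  &\le \mathbb{E}\,[d_{\mathrm{Ham}}(a,b)] = L\left(1 - \frac{1}{k}\right), \\
\Pr[\text{$a$ and $b$ share a character}]
  &\le \sum_{i=1}^{L}\sum_{j=1}^{L} \Pr[a_i = b_j] = \frac{L^2}{k}, \\
\mathbb{E}\,[d_{\mathrm{Lev}}(a,b)]
  &\ge L\left(1 - \frac{L^2}{k}\right).
\end{align*}
```

Both bounds tend to L as k grows, so under this model the average distance approaches the word length, which is the 1:1 line. With unequal character frequencies p_c, the term 1/k becomes the sum of p_c squared in both bounds.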

Dividing the average edit distance by word length yields average edit distance on a per-character basis (the local slope):
[attachment=10078]
Here the VMS transcription in gray is compared with "natural" texts of similar length: early modern English (red, 1611 King James Gospels), and medieval Latin (blue, Speculum Humanae Salvationis). Open symbols represent a version of each text in which the character order has been pseudorandomly scrambled, preserving the word length and character frequency distributions. The three scrambled variants are barely distinguishable in the plot. Thus, the dominant contribution to edit-distance-unusualness is a combinatorial property unrelated to any fine structure of the vocabularies.
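
A scramble that preserves the word length and character frequency distributions could look like the sketch below. The post does not say exactly how the scrambling was done; shuffling all characters across the whole text and re-cutting them into the original word lengths is one assumed reading, and `scramble_text` and `seed` are hypothetical names.

```python
import random

def scramble_text(words, seed=0):
    """Shuffle every character across the text, then re-cut into the original word lengths."""
    rng = random.Random(seed)              # seeded so the scramble is reproducible
    chars = [c for w in words for c in w]  # pool of all characters, word breaks removed
    rng.shuffle(chars)
    out, pos = [], 0
    for w in words:
        out.append("".join(chars[pos:pos + len(w)]))
        pos += len(w)
    return out

if __name__ == "__main__":
    print(scramble_text(["qokeedy", "qotedy", "qokeedy", "qokeey"], seed=1))
```

The sequence of word lengths and the overall character counts are unchanged, while any letter-order structure inside words is destroyed. Comparing a text with its scrambled version then isolates what the vocabulary's structure adds beyond the combinatorial baseline.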

On closer inspection:
[attachment=10079]
The "natural" text samples KJB and SHS show a small offset toward lower average edit distance from their scrambled counterparts, representing some kind of letter-order and -frequency correlations, not strongly dependent on word length. Surprise! Voynichese is the outlier. Its longer "words" are progressively less mutually unusual (on a per character basis) than the shorter or scrambled ones. Does this mean that longer words contain more longer-range correlations? This topic may have been addressed on the forum previously, from a different perspective.
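
As a quick arithmetic check, the per-character normalisation can be applied to the "overall average" values listed in the first post. Those numbers come from a different transcription and exclude hapax legomena, so they are not the data behind the plots; the helper names are hypothetical.

```python
# "Overall average Levenshtein distance" per word length, copied from the first post.
overall = {2: 1.8191, 3: 2.6262, 4: 3.4458, 5: 4.1917,
           6: 4.8464, 7: 5.4544, 8: 5.9280}

def per_character(avg_by_length):
    """Average edit distance divided by word length."""
    return {n: d / n for n, d in avg_by_length.items()}

def increments(avg_by_length):
    """Change in average distance for each one-character step in word length."""
    lengths = sorted(avg_by_length)
    return {n: avg_by_length[n] - avg_by_length[m]
            for m, n in zip(lengths, lengths[1:])}

if __name__ == "__main__":
    steps = increments(overall)
    for n, v in per_character(overall).items():
        step = f", step from previous length {steps[n]:.4f}" if n in steps else ""
        print(f"length {n}: {v:.4f} edits per character{step}")
```

With these figures the per-character value falls steadily from about 0.91 at length 2 to about 0.74 at length 8, and the step between lengths shrinks from about 0.81 to about 0.47. That matches the direction of the trend described above, although without a scrambled baseline these numbers alone cannot separate the combinatorial effect from anything specific to Voynichese.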

It has been a slow week in Analysisland.

(27-02-2025, 10:43 PM) obelus wrote: The main "more unusual" trend is indeed found in all texts, including random ones.

Actually, my thread title was terrible. It should have been "rarer (less common) words are more unusual (according to Levenshtein and Jaccard similarity), from 2 characters up". Please review the data in this light. Also I dwell insularly.

It should follow nicely from the data that rarer words are indeed more unusual (as measured by Jaccard similarity or Levenshtein distance) for lengths 2 to 8. This can also be seen simply by looking at one of my other posts about the most common words in the manuscript; there, even with hapax legomena ignored, it holds. This fits a theme that the rarer words are derived from (shorter) common words, which further empirical evidence should illuminate once I have given it enough thought and formatting to make for more befitting posts.

N.B.: See [link] for some ideas.

Addsamuels

bt2901

obelus

Addsamuels

Addsamuels

Addsamuels