Addsamuels > 22-02-2025, 06:33 PM
bt2901 > 24-02-2025, 11:57 PM
obelus > 27-02-2025, 10:43 PM
Addsamuels > 28-02-2025, 12:56 AM
(27-02-2025, 10:43 PM)obelus Wrote: The main "more unusual" trend is indeed found in all texts, including random ones.
Actually, my title for the thread was terrible. It should have been: rarer (less common) words are more unusual (according to Levenshtein and Jaccard similarity) (from 2 characters). Please review the data in this light. Also, I dwell insularly.
For "overall average Levenshtein distance" I suppose that you calculated the average edit distance over all pairs of word types within each EVA length class. I get values similar to [link] using paragraph text in IT2a-n.txt (chosen for its alphabet-sized character set). Each increment of word length adds approximately 0.8 edits to the "overall average":
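A minimal sketch of that calculation, assuming "word types" means distinct words and that every unordered pair within a length class is counted once. The function names and the short sample word list are illustrative only; this is not the actual transcription data or necessarily the code behind the plot.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = curr
    return prev[-1]


def mean_distance_by_length(words):
    """Map word length -> mean edit distance over all unordered pairs of word types."""
    by_length = defaultdict(set)
    for w in words:
        by_length[len(w)].add(w)      # a set keeps each word type once
    result = {}
    for length, types in sorted(by_length.items()):
        pairs = list(combinations(sorted(types), 2))
        if pairs:                     # a class with a single type has no pairs
            result[length] = sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
    return result


# Illustrative EVA-like tokens only (an assumption, not the real corpus).
sample = "daiin dain chol chor shol shor daiin qokedy qokeey".split()
averages = mean_distance_by_length(sample)
per_character = {n: d / n for n, d in averages.items()}
```

The pairwise loop is quadratic in the number of word types per length class, which is fine for a text of this size but would need sampling for much larger vocabularies.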
We can convince ourselves (by thought experiment) that as the number of available characters increases, the slope of the plotted points will converge on the 1:1 grey line.
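The thought experiment can also be checked numerically. The sketch below assumes uniformly random "words" of one fixed length over an alphabet of k symbols (a simplification, not the real character frequencies) and estimates the mean edit distance between random pairs. With a very large alphabet two words almost never share a symbol, so the distance approaches the word length, which is the 1:1 line. The function name, sample size, and seed are arbitrary choices.

```python
import random


def levenshtein(a, b):
    """Edit distance between two sequences (works on lists as well as strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mean_random_distance(alphabet_size, length, pairs=200, seed=0):
    """Mean edit distance between pairs of uniformly random words of a given length."""
    rng = random.Random(seed)         # fixed seed keeps the estimate reproducible
    total = 0
    for _ in range(pairs):
        a = [rng.randrange(alphabet_size) for _ in range(length)]
        b = [rng.randrange(alphabet_size) for _ in range(length)]
        total += levenshtein(a, b)
    return total / pairs


LENGTH = 6
small = mean_random_distance(2, LENGTH)         # binary alphabet
medium = mean_random_distance(26, LENGTH)       # alphabet-sized character set
large = mean_random_distance(10**6, LENGTH)     # effectively unlimited characters
```

Comparing `small`, `medium`, and `large` against `LENGTH` shows the per-length average rising toward the word length as the alphabet grows.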
Dividing the average edit distance by word length yields average edit distance on a per-character basis (the local slope):
Here the VMS transcription in gray is compared with "natural" texts of similar length: early modern English (red, 1611 King James Gospels), and medieval Latin (blue, Speculum Humanae Salvationis). Open symbols represent a version of each text in which the character order has been pseudorandomly scrambled, preserving the word length and character frequency distributions. The three scrambled variants are barely distinguishable in the plot. Thus, the dominant contribution to edit-distance-unusualness is a combinatorial property unrelated to any fine structure of the vocabularies.
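One plausible way to build such a scrambled control is sketched here: pool every character in the text, shuffle the pool, then cut it back into words using the original sequence of word lengths. This is an assumption about the procedure, which the post does not spell out, and `scramble_characters` is a hypothetical name. It preserves the token word-length distribution and the overall character frequencies exactly while destroying letter order within and across words.

```python
import random
from collections import Counter


def scramble_characters(words, seed=0):
    """Shuffle all characters across the text, then re-cut using the original word lengths."""
    rng = random.Random(seed)                     # pseudorandom but reproducible
    pool = [ch for w in words for ch in w]        # every character in the text
    rng.shuffle(pool)
    scrambled, pos = [], 0
    for w in words:
        scrambled.append("".join(pool[pos:pos + len(w)]))
        pos += len(w)
    return scrambled


# Illustrative tokens only (an assumption, not the real corpus).
original = "daiin dain chol chor shol shor qokedy qokeey".split()
control = scramble_characters(original)
```

Because distinct original words can collapse into the same scrambled word (or the reverse), the number of word types per length class may differ slightly between a text and its control.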
On closer inspection:
The "natural" text samples KJB and SHS show a small offset toward lower average edit distance relative to their scrambled counterparts, representing some kind of letter-order and -frequency correlations that do not depend strongly on word length. Surprise: Voynichese is the outlier. Its longer "words" are progressively less mutually unusual (on a per-character basis) than the shorter or scrambled ones. Does this mean that longer words contain more longer-range correlations? This topic may have been addressed on the forum previously, from a different perspective.
It has been a slow week in Analysisland.
Addsamuels > 28-02-2025, 01:37 PM