(21-02-2026, 11:43 PM)quimqu Wrote:
I am confused about what you mean by "word shuffled" and "token shuffled". Is that a random permutation of the tokens on each line, in each paragraph, or over the whole text?
Either way, if you shuffle the words and then measure the mutual information between characters at distance d, you are basically computing the mutual information of characters at distance d within the word, and then mixing that function with shifted copies of itself.
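To make the comparison concrete, here is a minimal Python sketch. It assumes a plain plug-in (frequency-count) estimate of mutual information and a shuffle of the words over the whole text. The names char_mutual_information and shuffle_words are made up for illustration, and this is not necessarily how the plots under discussion were computed.

```python
import math
import random
from collections import Counter

def char_mutual_information(text, d):
    """Plug-in estimate, in bits, of the MI between characters d positions apart."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal counts of the first character
    right = Counter(b for _, b in pairs)   # marginal counts of the second character
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi

def shuffle_words(text, seed=0):
    """Assumed shuffle: one random permutation of all whitespace-separated words."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)
```

Calling char_mutual_information on the original text and on shuffle_words(text) for the same range of d gives the two curves to compare. Keep in mind that the plug-in estimate is biased upward on small samples, so the absolute values matter less than the difference between the curves.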
More generally, again, statistics about characters are more confusing than illuminating. For one thing, they depend primarily on the spelling/encoding system, secondarily on the topic, genre, and style, and only lastly on the language. But also the distance d between characters is not a very meaningful variable, because words and syllables have different lengths.
For instance, suppose that you were trying to see whether the language has gender/number agreement of adjectives and nouns, as in Indo-European languages. To properly test that theory you would need statistics on suffixes of consecutive words. If that theory were true, you would notice a signal in that fixed-distance correlation plot that would disappear when words were scrambled. But that signal would be spread out over all distances d between ~3 and ~15, and thus might be hard to see.
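A direct test on suffixes of consecutive words could look roughly like the Python sketch below. It assumes that the "suffix" is simply the last k characters of each word and that agreement is measured as plug-in mutual information between the suffixes of adjacent words. Both choices, and the names pair_mutual_information and suffix_agreement, are illustrative assumptions only.

```python
import math
from collections import Counter

def pair_mutual_information(pairs):
    """Plug-in estimate, in bits, of the MI between the two members of each pair."""
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())

def suffix_agreement(words, k=1):
    """MI between the last k characters of each word and of the word after it."""
    suffixes = [w[-k:] for w in words if w]
    return pair_mutual_information(zip(suffixes, suffixes[1:]))
```

Computing suffix_agreement on the words in their original order and again after shuffling them gives a single number for the effect, instead of a signal smeared over many values of d.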
A similar comment would apply if you wanted to test for vowel harmony, a feature of Turkish and Hungarian. That feature would manifest itself as a correlation between vowels inside a word. In the fixed-distance correlation plots, it would show up as a fuzzy peak at distance ~2, which does not disappear when words are scrambled, becomes broader when characters are scrambled within each word, and disappears when characters are scrambled across words. A better way to test for such a feature would be to look for correlations between characters inside each word independently of distance.
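A distance-independent test of that kind might be sketched in Python as follows. It assumes you supply a candidate set of "vowel" characters (the charset parameter is a hypothetical input, since for an unknown script that set is itself a guess) and it pairs every such character with every later one in the same word. The name within_word_pair_mi is made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def within_word_pair_mi(words, charset):
    """Plug-in MI, in bits, between charset characters co-occurring in a word.

    Every ordered pair (earlier, later) inside the same word is counted,
    whatever the distance between the two characters.
    """
    pairs = []
    for w in words:
        selected = [c for c in w if c in charset]
        pairs.extend(combinations(selected, 2))  # keeps left-to-right order
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())
```

Since pairs never cross word boundaries, shuffling the words leaves this statistic unchanged, while shuffling characters across words should bring it down to the small-sample noise level.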
And another suggestion: use log-log scales in your plots. That would expand both the left side, where all plots now get squeezed into a narrow vertical band, and the longer range, where all plots get squeezed together near the bottom. A justification for using a log scale on the horizontal axis is that the difference between d = 2 and d = 3 is expected to be much more dramatic than the difference between d = 21 and d = 22. Indeed, maybe you should somehow combine the correlations for ranges of large distances, say 20-29 and 30-39, into single statistics.
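One possible way to combine distances into ranges is sketched below in Python. It assumes the per-distance values are held in a dict keyed by d and that a plain average within each range is an acceptable summary. The names log_spaced_edges and bin_average, and the growth ratio of 1.5, are arbitrary illustrative choices.

```python
import math

def log_spaced_edges(d_max, ratio=1.5):
    """Integer bin edges starting at 1 and growing roughly geometrically past d_max."""
    edges = [1]
    while edges[-1] <= d_max:
        # Always advance by at least 1 so that small distances get their own bin.
        edges.append(max(edges[-1] + 1, math.ceil(edges[-1] * ratio)))
    return edges

def bin_average(values_by_d, edges):
    """Average the values over half-open distance bins [edges[i], edges[i+1]).

    Returns (geometric midpoint of the bin, mean value) for each non-empty bin.
    """
    out = []
    for lo, hi in zip(edges, edges[1:]):
        vals = [values_by_d[d] for d in range(lo, hi) if d in values_by_d]
        if vals:
            out.append((math.sqrt(lo * (hi - 1)), sum(vals) / len(vals)))
    return out
```

With d up to 40 this keeps d = 1 and d = 2 in bins of their own and merges 27 through 40 into one. Passing edges = [20, 30, 40] instead gives exactly the 20-29 and 30-39 ranges. The resulting points can then be drawn on log-log axes, for example with matplotlib's loglog function.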
All the best, --stolfi