quimqu > Yesterday, 12:33 PM
(Yesterday, 12:28 PM)nablator Wrote: (Yesterday, 12:17 PM)quimqu Wrote: It is this formula. It is essentially the mutual information between characters separated by exactly d positions, averaged over all such pairs in the corpus.
That's what worries me (I don't know for certain, but I suspect a possible issue there): how do you average the MI of all the pairs at distance d? A simple mean?
import math
from collections import defaultdict

import numpy as np

# =========================
# MI(d) estimator (sparse joint, smoothed marginals)
# =========================
def mutual_information_lag_sparse(ids, lag, alpha=0.5):
    # Plug-in mutual information (in nats) between symbols `lag` positions apart:
    # sparse joint counts, additively smoothed marginals.
    x = ids[:-lag]            # symbol at position i
    y = ids[lag:]             # symbol at position i + lag
    n = len(x)
    if n <= 20:               # too few pairs for a usable estimate
        return np.nan
    V = int(max(ids)) + 1     # alphabet size
    cx = np.bincount(x, minlength=V).astype(np.float64)
    cy = np.bincount(y, minlength=V).astype(np.float64)
    px = (cx + alpha) / (n + alpha * V)   # smoothed marginal of x
    py = (cy + alpha) / (n + alpha * V)   # smoothed marginal of y
    joint = defaultdict(int)              # sparse joint counts over observed pairs
    for a, b in zip(x, y):
        joint[(int(a), int(b))] += 1
    mi = 0.0
    inv_n = 1.0 / n
    for (a, b), cnt in joint.items():
        pxy = cnt * inv_n                 # unsmoothed joint probability
        mi += pxy * (math.log(pxy) - math.log(px[a]) - math.log(py[b]))
    return float(mi)
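
A minimal driver sketch for the function above (the toy text, the character-to-id mapping and the lag range are illustrative, not part of the posted code). Note that nothing here is a mean over per-pair MI values: all pairs at lag d are pooled into a single joint distribution and one MI value is computed from it.

# Map characters to integer ids, then sweep MI(d) over a range of lags.
text = "qokeedy qokedy qokeey shedy chedy daiin " * 200   # toy stand-in for a transliteration
char_to_id = {c: i for i, c in enumerate(sorted(set(text)))}
ids = np.array([char_to_id[c] for c in text])

mi_curve = {d: mutual_information_lag_sparse(ids, d) for d in range(1, 51)}
for d in (1, 2, 5, 10, 25, 50):
    print(d, mi_curve[d])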

nablator > Yesterday, 01:29 PM
Quote: Count number of occurrences of each value in array of non-negative ints.

Quote: The zip() function returns a zip object, which is an iterator of tuples where the first item in each passed iterator is paired together, and then the second item in each passed iterator are paired together etc.
quimqu > 11 hours ago
(Yesterday, 01:29 PM)nablator Wrote: Thank you for the answer and code. Again, I'm sorry, I have near-zero knowledge of Python and its libraries, so someone better qualified should comment.
Fontanellean > 11 hours ago
Mauro > 11 hours ago
quimqu > 10 hours ago
(11 hours ago)Mauro Wrote: Interesting, and a lot of data to digest before an answer. Welcome back, @quimqu

quimqu > 10 hours ago
(11 hours ago)Fontanellean Wrote: Do we have a graph of the predictability of a character if you know the n characters immediately preceding it?
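
The quantity behind such a graph would be the conditional entropy of a character given the n characters immediately preceding it. A minimal plug-in sketch (Python; the file name "voynich_eva.txt" is a placeholder, and the raw estimate is biased downward once contexts become sparse at larger n):

from collections import Counter
import numpy as np

def conditional_entropy_order_n(text, n):
    # Plug-in estimate of H(next char | previous n chars), in bits.
    contexts = Counter()     # counts of each n-character context
    joint = Counter()        # counts of (context, next char) pairs
    for i in range(len(text) - n):
        ctx, nxt = text[i:i + n], text[i + n]
        contexts[ctx] += 1
        joint[(ctx, nxt)] += 1
    total = sum(joint.values())
    return -sum((c / total) * np.log2(c / contexts[ctx])
                for (ctx, nxt), c in joint.items())

text = open("voynich_eva.txt", encoding="utf-8").read()   # placeholder transliteration file
for n in range(0, 6):        # n = 0 reproduces plain Shannon entropy (h1)
    print(n, conditional_entropy_order_n(text, n))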
nablator > 8 hours ago
(10 hours ago)quimqu Wrote: Even if entropy is not predictability, they are closely linked.
quimqu > 8 hours ago
(8 hours ago)nablator Wrote: (10 hours ago)quimqu Wrote: Even if entropy is not predictability, they are closely linked.
For a well-known measure of the predictability of characters separated by distance d, I could slightly modify my very simple code (Java or JavaScript) that calculates the conditional character entropy (h2) so that it calculates h2(d), using the conditional probability of the character at index + d (modulo the size of the text, to avoid border effects) given the character at a given index, instead of d = 1 as in h2.
Isn't MI(d) = h1 - h2(d)? Not sure...
h1 = first-order character entropy = Shannon entropy
h2 = second-order character entropy = conditional entropy
I would be surprised if h1 - h2(d) doesn't drop to near zero (an asymptote at zero) very quickly (approximately when d > 25), and even more quickly (approximately when d > 10) when the text is word-shuffled.
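
For what it's worth, the identity holds exactly when both quantities are computed from the same raw (unsmoothed) counts with the circular indexing described above, because the marginal distribution of the character at index + d is then exactly the overall character distribution; any log base works as long as it is used consistently. A minimal Python check (the toy text and lag values are illustrative; the smoothed-marginal estimator posted earlier will only match approximately):

from collections import Counter
import numpy as np

def h1(text):
    # First-order (Shannon) character entropy, in bits.
    n = len(text)
    return -sum((c / n) * np.log2(c / n) for c in Counter(text).values())

def h2_lag(text, d):
    # H(char at i+d | char at i), circular indexing, in bits.
    n = len(text)
    pairs = Counter((text[i], text[(i + d) % n]) for i in range(n))
    left = Counter(text)    # marginal of the conditioning character
    return -sum((c / n) * np.log2(c / left[a]) for (a, b), c in pairs.items())

def mi_lag(text, d):
    # Plug-in MI between characters d apart, circular indexing, in bits.
    n = len(text)
    pairs = Counter((text[i], text[(i + d) % n]) for i in range(n))
    marg = Counter(text)    # same marginal for both sides under circular indexing
    return sum((c / n) * np.log2(c * n / (marg[a] * marg[b])) for (a, b), c in pairs.items())

text = "the quick brown fox jumps over the lazy dog " * 50
for d in (1, 2, 5, 10, 25):
    print(d, round(mi_lag(text, d), 9), round(h1(text) - h2_lag(text, d), 9))
# Both columns agree (up to floating-point rounding), so MI(d) = h1 - h2(d)
# for the plug-in, circular-indexing estimates.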
nablator > 7 hours ago
