The Voynich Ninja
Measuring Long-Range Structure in the Voynich Manuscript - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Measuring Long-Range Structure in the Voynich Manuscript (/thread-5380.html)



RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 12:28 PM)nablator Wrote:
(19-02-2026, 12:17 PM)quimqu Wrote: It is this formula. It is essentially the mutual information between characters separated by exactly d positions, averaged over all such pairs in the corpus.

That's what worries me (potentially, I don't know, but I suspect a possible issue there): how do you average the MI of all the pairs at distance d? A simple mean?

For each distance d, I do not average individual MI values. I treat all character pairs at that distance as samples of two random variables: the character at position t and the character at position t+d. I then compute the mutual information between those two variables in the standard way, using their empirical joint and marginal distributions. So there is only one MI(d) per distance, not an average of many MIs. This is the Python code for the calculation:

Code:
import math
from collections import defaultdict

import numpy as np

# =========================
# MI(d) estimator (sparse joint, smoothed marginals)
# =========================
def mutual_information_lag_sparse(ids, lag, alpha=0.5):
    x = ids[:-lag]
    y = ids[lag:]
    n = len(x)
    if n <= 20:
        return np.nan  # too few samples for a stable estimate

    V = int(max(ids)) + 1

    cx = np.bincount(x, minlength=V).astype(np.float64)
    cy = np.bincount(y, minlength=V).astype(np.float64)

    px = (cx + alpha) / (n + alpha * V)
    py = (cy + alpha) / (n + alpha * V)

    joint = defaultdict(int)
    for a, b in zip(x, y):
        joint[(int(a), int(b))] += 1

    mi = 0.0
    inv_n = 1.0 / n
    for (a, b), cnt in joint.items():
        pxy = cnt * inv_n
        mi += pxy * (math.log(pxy) - math.log(px[a]) - math.log(py[b]))
    return float(mi)



RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 19-02-2026

Thank you for the answer and code. Again, I'm sorry, I have near-zero knowledge of Python and its libraries so someone better qualified should comment. Smile

Quote:Count number of occurrences of each value in array of non-negative ints.

Maybe I understand what it means. Progress! Smile

Quote:The zip() function returns a zip object, which is an iterator of tuples where the first item in each passed iterator is paired together, and then the second item in each passed iterator are paired together etc.

This is so confusing to me... It makes Python very powerful but every time I see something that has no equivalent in the programming languages I know, I struggle to understand it. OK, I get it. You iterate on x and y at the same time, keeping the distance constant.

What is alpha? Always 0.5?


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 01:29 PM)nablator Wrote: Thank you for the answer and code. Again, I'm sorry, I have near-zero knowledge of Python and its libraries so someone better qualified should comment. Smile

Let me explain the code:

First, it creates two aligned sequences:
x = all characters except the last d
y = all characters except the first d

zip(x, y) simply pairs elements from the two lists:

If:
x = [A, B, C, D]
y = [D, E, F, G]
then:
zip(x, y) gives:
(A, D)
(B, E)
(C, F)
(D, G)
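The alignment and pairing above can be reproduced directly; a minimal toy demonstration (the sequence is invented for illustration):

```python
# Toy demonstration of the x/y alignment and zip pairing described above.
seq = ["A", "B", "C", "D", "E", "F", "G"]
d = 3
x = seq[:-d]              # all characters except the last d
y = seq[d:]               # all characters except the first d
pairs = list(zip(x, y))   # [('A', 'D'), ('B', 'E'), ('C', 'F'), ('D', 'G')]
```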

np.bincount counts how many times each character appears (so we have how many times a character appears in x and in y)

joint[(a,b)] += 1 counts how often each pair appears. That gives us the joint frequency distribution.

Alpha is just a small constant added to the character counts (in this code, to the marginal counts used for px and py). If a character never appears in one of the positions, its count would be zero. But since I later take logarithms, and the logarithm of zero is undefined, I add a small constant (like 0.5) to every count before turning counts into probabilities. This is called additive (Laplace) smoothing. It does not significantly change the results for large texts.
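A minimal sketch of that smoothing step, with a made-up count vector:

```python
# Minimal sketch of additive (Laplace-style) smoothing as described above.
# The count vector is invented for illustration.
import numpy as np

counts = np.array([5.0, 0.0, 3.0])             # one character never appears
n, V, alpha = counts.sum(), len(counts), 0.5

p_raw = counts / n                             # contains an exact zero
p_smooth = (counts + alpha) / (n + alpha * V)  # every probability is now > 0
```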

Then I just calculate mi with mi += pxy * (log(pxy) - log(px[a]) - log(py[b]))

It compares: “How often does this pair actually occur?” with “How often would it occur if the two positions were independent?” If the actual joint frequency differs from what independence predicts, MI becomes positive.

The function calculates one single mutual information value for a given distance d, based on all pairs at that distance. So, in the code I iterate for each d in a loop out of the function to get the plot data.
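A self-contained sketch of that outer loop (the estimator is a condensed copy of the one posted above; the toy integer sequence is made up, a real run would use the transliterated manuscript text):

```python
import math
from collections import defaultdict

import numpy as np

def mutual_information_lag_sparse(ids, lag, alpha=0.5):
    # Same estimator as in the earlier post, condensed.
    x, y = ids[:-lag], ids[lag:]
    n = len(x)
    if n <= 20:
        return np.nan  # too few samples for a stable estimate
    V = int(max(ids)) + 1
    px = (np.bincount(x, minlength=V) + alpha) / (n + alpha * V)
    py = (np.bincount(y, minlength=V) + alpha) / (n + alpha * V)
    joint = defaultdict(int)
    for a, b in zip(x, y):
        joint[(int(a), int(b))] += 1
    return sum((c / n) * (math.log(c / n) - math.log(px[a]) - math.log(py[b]))
               for (a, b), c in joint.items())

# Toy integer-encoded sequence; for an i.i.d. random text MI(d) stays near zero.
rng = np.random.default_rng(0)
ids = rng.integers(0, 5, size=2000)

mi_curve = {d: mutual_information_lag_sparse(ids, d) for d in range(1, 11)}
```

The dictionary maps each distance d to its single MI(d) value, which is what gets plotted.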


RE: Measuring Long-Range Structure in the Voynich Manuscript - Fontanellean - 19-02-2026

Do we have a graph of the predictability of a character if you know the n characters immediately preceding it?


RE: Measuring Long-Range Structure in the Voynich Manuscript - Mauro - 19-02-2026

I cannot comment on the Python code (same problem as Nablator), but from your description of the algorithm (provided np.bincount and joint[] do what you say they do, which I don't doubt), it is correct and implements Nablator's formula. I'm used to calling it (or rather, I was) self-correlation instead of mutual information, but it looks the same.

Interesting, and a lot of data to digest before an answer. Welcome back, @quimqu  Smile


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 02:07 PM)Mauro Wrote: Interesting, and a lot of data to digest before an answer. Welcome back, @quimqu  Smile

Thank you Mauro. Yes, back into the puzzle game.  Huh

I really don't know if we can extract any information from the data I posted. As with many other things, the MS is in the middle of nowhere. I tried to find a signal of long-range relations in the Voynich, but... is there really any at all?


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 02:01 PM)Fontanellean Wrote: Do we have a graph of the predictability of a character if you know the n characters immediately preceding it?

Maybe you can take a look at the Entropy analysis page on Rene's website.

There are plenty of links to studies. Even if entropy is not predictability, they are very linked.


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 19-02-2026

(19-02-2026, 03:31 PM)quimqu Wrote: Even if entropy is not predictability, they are very linked.

For a well-known measure of predictability of characters separated by distance d, I could modify slightly my very simple code (Java or JavaScript) that calculates the conditional character entropy (H2), to calculate H2(d), using the conditional probability of the character at index + d (modulo the size of the text to avoid border effects), knowing the character at a given index, instead of d = 1 for H2.

Isn't MI(d) = H1 - H2(d) ? Not sure...

H1 = First order character entropy (Shannon's entropy)
H2 = Second order character entropy (conditional entropy)

I would be surprised if H1 - H2(d) doesn't drop to near zero (asymptote at zero) very quickly (approximately when d > 25) and even more quickly (approximately when d > 10) when the text is word-shuffled.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 05:01 PM)nablator Wrote:
(19-02-2026, 03:31 PM)quimqu Wrote: Even if entropy is not predictability, they are very linked.

For a well-known measure of predictability of characters separated by distance d, I could modify slightly my very simple code (Java or JavaScript) that calculates the conditional character entropy (h2), to calculate h2(d), using the conditional probability of the character at index + d (modulo the size of the text to avoid border effects), knowing the character at a given index, instead of d = 1 for h2.

Isn't MI(d) = h1 - h2(d) ? Not sure...

h1 = First order character entropy = Shannon's entropy
h2 = Second order character entropy = Conditional entropy

I would be surprised if h1 - h2(d) doesn't drop to near zero (asymptote at zero) very quickly (approximately when d > 25) and even more quickly (approximately when d > 10) when the text is word-shuffled.

Yes, formally it is, but in real texts H2(d) does not approach H1 as quickly as intuition suggests, because long-range statistical correlations keep a small but measurable dependence even at larger distances.
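The identity can be checked numerically. With plug-in (unsmoothed) estimates computed from the same set of pairs, MI(d) = H1 - H2(d) holds exactly; the toy string below is invented for illustration:

```python
# Toy numerical check of MI(d) = H1 - H2(d), using plug-in (unsmoothed)
# estimates so the identity holds exactly. The string is invented.
import math
from collections import Counter

def entropy_bits(counter, n):
    return -sum((c / n) * math.log2(c / n) for c in counter.values())

text = "abababbaabbaababab"
d = 2
pairs = list(zip(text[:-d], text[d:]))
n = len(pairs)

cj = Counter(pairs)                 # joint counts of (char at t, char at t+d)
cx = Counter(a for a, _ in pairs)   # marginal counts at position t
cy = Counter(b for _, b in pairs)   # marginal counts at position t + d

h1 = entropy_bits(cy, n)                                # H1 of char at t + d
h2 = entropy_bits(cj, n) - entropy_bits(cx, n)          # H2(d) = H(X,Y) - H(X)
mi = h1 - h2

# Direct definition of mutual information, for comparison:
mi_direct = sum((c / n) * math.log2((c / n) / ((cx[a] / n) * (cy[b] / n)))
                for (a, b), c in cj.items())
```

The two values agree to floating-point precision, since both are algebraic rearrangements of the same plug-in estimate.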


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 19-02-2026

I changed h to H because it is actually a Greek eta and the purists (like ReneZ) are shocked by the improper romanization. Smile

I'll try it and see what happens. Sometimes intuition is totally wrong...