The Voynich Ninja

Full Version: Binomial distribution in VMS
Pages: 1 2 3 4 5 6 7
(18-08-2024, 04:14 PM)RobGea Wrote: Thanks for the graphs bi3mw, interesting that a section shows a binomial distribution and a folio subset does not.

Honestly, I don't know what to make of this binomial distribution business.

1. Statistical corollary of the method used to generate voynichese.
2. As postulated in post#1, it was done deliberately.
3. Coincidence.
4. Other ?

I think it is quite possible that the shortening of longer words in VMS plaintext and the lengthening of short words with "filler syllables" represent a form of ciphering. If it were possible to identify these "filler syllables", then the actual words could be reconstructed in the first step. However, it would not be possible to restore the shortened words. The content would be irretrievably lost. This only makes sense if the creator of the VMS considered the appearance (with glyphs) to be more important than the possibility of ever being able to read the content again in full.
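To make the point about recoverability concrete, here is a minimal sketch. It assumes a single known filler character 'X' (as in the generator code in this thread) that is only ever appended at the end of a word; the helper names and the sample words are made up for illustration. Real "filler syllables" in the VMS, if they exist, would of course not be this easy to spot.

```python
def pad_or_truncate(word, target_length, filler="X"):
    """Shorten or lengthen a word to target_length (hypothetical helper)."""
    if target_length < len(word):
        return word[:target_length]  # the cut-off letters are discarded for good
    return word + filler * (target_length - len(word))

def strip_filler(word, filler="X"):
    """Remove trailing filler characters; truncated letters cannot be restored."""
    return word.rstrip(filler)

plain = ["sanitatis", "et", "regimen"]
targets = [5, 4, 7]  # arbitrary target lengths chosen for the demo
cipher = [pad_or_truncate(w, t) for w, t in zip(plain, targets)]
recovered = [strip_filler(w) for w in cipher]

print(cipher)     # ['sanit', 'etXX', 'regimen']
print(recovered)  # ['sanit', 'et', 'regimen']
```

The lengthened word comes back intact, the shortened one does not, which is the asymmetry described above.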

However, I do not believe that the distribution is a coincidence, it is simply too conspicuous for that.
(18-08-2024, 01:44 PM)bi3mw Wrote: I have picked out the balneological section (Quire 13) as a test. The binomial distribution is clearly recognizable here.

Really? A comparison to a binomial distribution would help recognize something, maybe... :)

Quote:And here the complete VMS:

The word token length distribution that you plotted is the one that is not binomial according to Stolfi.
(18-08-2024, 05:06 PM)nablator Wrote: The word token length distribution that you plotted is the one that is not binomial according to Stolfi.

Where did Stolfi make such a statement? To me the result looks pretty OK, but I was only comparing it with my generated text.

Here again the complete VMS with overlay:
[attachment=9044]

And here is the original regimen sanitatis for comparison:
[attachment=9045]

Am I missing something ?
(18-08-2024, 05:36 PM)bi3mw Wrote: Where did Stolfi make such a statement ?

Quote:Figure 1 - Token length distributions.
Figure 2 - Word length distributions.
Figure 3 - Voynichese word length distribution, compared to the binomial one.

It is clearly the word [types] length distribution that is matched to the binomial one, i.e. the distribution of "distinct VMS words of each length, ignoring their frequencies in the text".
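For anyone unsure about the difference, here is a small sketch of the two distributions. The sample string is a made-up toy line of EVA-style words, not a real transcription excerpt, and `length_distribution` is just a helper name I picked.

```python
from collections import Counter

def length_distribution(words):
    """Return {length: relative frequency} for an iterable of words."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

text = "daiin ol daiin chedy ol qokeedy daiin"  # toy sample, assumption
tokens = text.split()   # every occurrence counts
types = set(tokens)     # each distinct word counts once

token_dist = length_distribution(tokens)
type_dist = length_distribution(types)

print("tokens:", token_dist)
print("types: ", type_dist)
```

Frequent words weigh heavily in the token distribution but count only once in the type distribution, so the two histograms can have quite different shapes for the same text.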

You used the length of words in EVA (by Takeshi Takahashi) and Stolfi did not, so it's not surprising that you got different results.

Congratulations. :)
(17-08-2024, 06:38 PM)bi3mw Wrote: regimen_sanitatis_original.txt
About the transcription of the Regimen sanitatis Magnini Mediolanensi that you uploaded:

- You should remove the many supra (numbered references).

- It starts at the third part (tertia pars) of the book, L600009B.
Source: CELT: Corpus of Electronic Texts: a project of University College, Cork.
Credit where credit is due. ;)
I have extended the code in the opening post #1 so that it clearly shows the binomial distribution in the generated text (plot). The distribution seems to be correct.
[attachment=9053]
[attachment=9054]

Code:
import sys
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

def calculate_binomial_distribution(n, max_length):
    """Calculates a binomial distribution for word lengths."""
    k_values = np.arange(1, max_length + 1)
    # Unnormalized binomial weights comb(n, k) * 0.5**n for k = 1..max_length
    # (k = 0 is excluded; comb returns 0 for k > n)
    probabilities = np.array([comb(n, k) * (0.5 ** n) for k in k_values], dtype=float)

    # Normalize so the probabilities sum to 1 over k = 1..max_length
    probabilities /= np.sum(probabilities)

    return k_values, probabilities

def plot_binomial_distribution(n, max_length):
    """Visualizes the binomial distribution and overlays the theoretical curve."""
    k_values, probabilities = calculate_binomial_distribution(n, max_length)
   
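    # Recompute the same target distribution for the overlay curve.
    # Note: bars and curve come from the same formula, so they coincide by
    # construction; this plot does not measure the generated text itself.
    theoretical_probabilities = np.array([comb(n, k) * (0.5 ** n) for k in k_values], dtype=float)
    theoretical_probabilities /= np.sum(theoretical_probabilities)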
   
    plt.figure(figsize=(10, 6))
   
    # Plot the calculated probabilities
    plt.bar(k_values, probabilities, width=0.6, edgecolor='black', alpha=0.6, label='Calculated Distribution')
   
    # Plot the theoretical binomial curve
    plt.plot(k_values, theoretical_probabilities, 'r--', marker='o', label='Theoretical Distribution')
   
    plt.xlabel('Word Length')
    plt.ylabel('Probability')
    plt.title(f'Binomial Distribution for n={n}')
    plt.legend()
    plt.grid(True)
    plt.show()

def adjust_word_lengths(words, target_distribution):
    """Adjusts word lengths to fit the target distribution by truncating or extending words."""
    adjusted_words = []
    max_word_length = len(target_distribution)

    length_bins = np.arange(1, max_word_length + 1)
    length_probs = np.array(target_distribution)

    for word in words:
        current_length = len(word)
        target_length = np.random.choice(length_bins, p=length_probs)
   
        # If the target word length is shorter, truncate the word
        if target_length < current_length:
            adjusted_words.append(word[:target_length])
        # If the target word length is longer, extend the word with 'X'
        elif target_length > current_length:
            adjusted_words.append(word + 'X' * (target_length - current_length))
        else:
            adjusted_words.append(word)  # If the length matches, keep the word as is

    return adjusted_words

def process_text(file_path, output_path):
    """Reads the text from the file, adjusts word lengths, and writes the modified text to an output file."""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except FileNotFoundError:
        print(f"Error: The file {file_path} was not found.")
        sys.exit(1)
    except IOError as e:
        print(f"Error: An error occurred while reading the file: {e}")
        sys.exit(1)

    max_word_length = 15  # Set maximum word length
    n = 9  # Number of trials for the binomial distribution

    # Calculate the binomial distribution
    _, target_distribution = calculate_binomial_distribution(n, max_word_length)

    adjusted_lines = []
    for line in lines:
        words = line.split()
        adjusted_words = adjust_word_lengths(words, target_distribution)
        adjusted_lines.append(' '.join(adjusted_words))

    # Write the modified text to the output file
    try:
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write('\n'.join(adjusted_lines))
        print(f"Modified text has been written to {output_path}.")
    except IOError as e:
        print(f"Error: An error occurred while writing the file: {e}")
        sys.exit(1)

def main():
    if len(sys.argv) != 3:
        print("Usage: python adjust_word_length.py <input_filename> <output_filename>")
        sys.exit(1)

    input_file_path = sys.argv[1]
    output_file_path = sys.argv[2]
   
    # Optional visualization of the distribution
    max_word_length = 15
    n = 9
    plot_binomial_distribution(n, max_word_length)
   
    # Process the text and write to the output file
    process_text(input_file_path, output_file_path)

if __name__ == "__main__":
    main()

As @RobGea has already noted, the distribution in the VMS seems to fit roughly section by section, but not folio by folio. I can't understand this, because the result for the entire VMS text must come out somehow. Does anyone have an explanation ?
(19-08-2024, 06:08 PM)bi3mw Wrote: I have extended the code in the opening post #1 so that it clearly shows the binomial distribution in the generated text (plot). The distribution seems to be correct.

Can you overlay the binomial distribution graph (with red dots) with the graph for the entire VMS and for a section in post #9? The fit doesn't look very good. Stolfi wins. :)

Quote:As @RobGea has already noted, the distribution in the VMS seems to fit roughly section by section, but not folio by folio. I can't understand this, because the result for the entire VMS text must come out somehow. Does anyone have an explanation ?

Small sample.
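A quick way to see the effect is a simulation. The sketch below draws samples of different sizes from the same truncated binomial target used in the code in this thread (n=9, lengths 1..15) and reports the average total variation distance between the sample histogram and the target. The sample sizes, the number of trials and the function name are arbitrary choices for illustration, not measurements of actual folios.

```python
import numpy as np
from scipy.special import comb

def mean_tv_distance(sample_size, n=9, max_length=15, trials=200, seed=0):
    """Average total variation distance between a random sample's
    length histogram and the binomial target it was drawn from."""
    rng = np.random.default_rng(seed)
    lengths = np.arange(1, max_length + 1)
    target = comb(n, lengths) * 0.5 ** n   # same weights as in the thread's code
    target = target / target.sum()
    distances = []
    for _ in range(trials):
        sample = rng.choice(lengths, size=sample_size, p=target)
        observed = np.bincount(sample, minlength=max_length + 1)[1:] / sample_size
        distances.append(0.5 * np.abs(observed - target).sum())
    return float(np.mean(distances))

for size in (50, 500, 5000):  # illustrative sizes only
    print(size, round(mean_tv_distance(size), 4))
```

Even when every word length really is drawn from the target, a sample the size of a single page deviates visibly from the curve, while a section-sized or manuscript-sized sample follows it much more closely.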
(19-08-2024, 06:43 PM)nablator Wrote: Can you overlay the binomial distribution graph (with red dots) with the graph for the entire VMS and for a section in post #9? The fit doesn't look very good. Stolfi wins.

Here are the plots:

Quire 13 ( Perhaps not the best choice )
[attachment=9055]

Voynich full (CUVA)
[attachment=9056]

Voynich full (TT)
[attachment=9057]

All right, the results are somewhat ... unsatisfactory. Nevertheless, they are far closer to the curve than any original (Latin) text.
If you want to test folios or sections in the VMS yourself, here is the code for plotting:

Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb
import argparse
import os

def calculate_binomial_distribution(n, max_length):
    """Calculates a binomial distribution for word lengths."""
    k_values = np.arange(1, max_length + 1)
    probabilities = np.array([comb(n, k) * (0.5 ** n) for k in k_values], dtype=float)
    probabilities /= np.sum(probabilities)  # Normalize the distribution
    return k_values, probabilities

def plot_binomial_distribution(word_lengths, max_length=15, n=10):
    """Visualizes the binomial distribution based on the word lengths in the text file."""
    if len(word_lengths) == 0:
        print("Error: No words found in the file.")
        return
   
    # Compute the binomial distribution
    k_values, binomial_probabilities = calculate_binomial_distribution(n, max_length)
   
    # Compute the relative frequencies for the histogram
    counts, bins = np.histogram(word_lengths, bins=np.arange(0.5, max_length + 1.5))
    probabilities = counts / np.sum(counts)  # Normalize
   
    plt.figure(figsize=(10, 6))
   
    # Histogram of the word lengths
    bin_centers = (bins[:-1] + bins[1:]) / 2
    plt.bar(bin_centers, probabilities, width=0.6, edgecolor='black', alpha=0.6, label='Calculated Distribution')
   
    # Overlay the binomial distribution curve
    plt.plot(k_values, binomial_probabilities, 'r--', marker='o', label='Theoretical Distribution')
   
    # Set the axis limits so that the curve is displayed correctly
    plt.xlim(0.5, max_length + 0.5)
    plt.xticks(np.arange(1, max_length + 1))
   
    plt.xlabel('Word Length')
    plt.ylabel('Probability')
    plt.title(f'Binomial Distribution for n={n}')
    plt.legend()
    plt.grid(True)
    plt.show()

def process_text_file(file_path):
    """Reads a text file and computes the word lengths."""
    if not os.path.isfile(file_path):
        print(f"Error: The file {file_path} was not found.")
        return None
   
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    except IOError as e:
        print(f"Error: An error occurred while reading the file: {e}")
        return None
   
    words = text.split()
    word_lengths = [len(word) for word in words]
   
    return word_lengths

def main():
    parser = argparse.ArgumentParser(description="Visualizes the binomial distribution based on a text file.")
    parser.add_argument('file', type=str, help='Path to the text file')

    args = parser.parse_args()
   
    # Process the text file
    word_lengths = process_text_file(args.file)
   
    if word_lengths is not None:
        plot_binomial_distribution(word_lengths, max_length=15, n=10)

if __name__ == "__main__":
    main()
Nice program!

Binomial distribution for n=10, p=0.5 is a better fit:

TT:
[attachment=9058]

V101 (mapped to nearest basic EVA):
[attachment=9059]

RF1a-n:
[attachment=9060]
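To compare candidate values of n with a number instead of by eye, something like the following sketch could be used. It reuses the same truncated binomial target as the plotting code above and scores each n by total variation distance. The function names, the candidate range and the choice of distance measure are my own assumptions, not anything used for the plots in this thread.

```python
import numpy as np
from scipy.special import comb

def binomial_target(n, max_length=15):
    """Binomial weights comb(n, k) * 0.5**n for k = 1..max_length, normalized."""
    lengths = np.arange(1, max_length + 1)
    probs = comb(n, lengths) * 0.5 ** n
    return probs / probs.sum()

def tv_distance(word_lengths, n, max_length=15):
    """Total variation distance between observed lengths and the target for n.
    Words longer than max_length are ignored, as in the plotting code."""
    counts, _ = np.histogram(word_lengths, bins=np.arange(0.5, max_length + 1.5))
    if counts.sum() == 0:
        raise ValueError("no word lengths in range 1..max_length")
    observed = counts / counts.sum()
    return 0.5 * np.abs(observed - binomial_target(n, max_length)).sum()

def best_n(word_lengths, candidates=range(5, 15), max_length=15):
    """Return the candidate n with the smallest distance (hypothetical helper)."""
    return min(candidates, key=lambda n: tv_distance(word_lengths, n, max_length))

# Usage with a transcription file (path is a placeholder):
# lengths = [len(w) for w in open("voynich.txt", encoding="utf-8").read().split()]
# print(best_n(lengths), tv_distance(lengths, 9), tv_distance(lengths, 10))
```

A distance of 0 would be a perfect match and 1 no overlap at all, so running this on the same file with n=9 and n=10 gives a direct comparison of the two fits.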