The Voynich Ninja

Binomial distribution in VMS
(25-08-2024, 06:11 PM)RobGea Wrote: Original chars -> replaced with
ch -> C
sh -> S
cth -> T
ckh -> K
cph -> P
cfh -> F

This has a considerable influence on the graph (here the herbal section as an example). I would not have thought so.

Here is the result without substitutions:
[attachment=9124]

The graph with substitutions:
[attachment=9125]
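For anyone who wants to reproduce the substitution, here is a minimal Python sketch of one way to do it (the file names are placeholders, and applying the longer clusters first is my assumption, not necessarily the exact procedure used):

Code:
# Sketch of the cluster substitution. Longer clusters are replaced first
# so that e.g. 'cth' is consumed as a whole. File names are placeholders.
subs = [('cth', 'T'), ('ckh', 'K'), ('cph', 'P'), ('cfh', 'F'),
        ('ch', 'C'), ('sh', 'S')]

with open('ZLall.txt', encoding='utf-8') as f:
    text = f.read()
for old, new in subs:
    text = text.replace(old, new)
with open('ZLall_subst.txt', 'w', encoding='utf-8') as f:
    f.write(text)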
Using Cryptool ngrams on an ASCII-only ZL3a-n.txt file shows that CH is the rank No. 1 most common bigram in both all words and distinct words.
That would help explain the graph differences.

Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
Descending sorted on frequency.

  No.  Substring  Frequency (in %)  Frequency
   1         CH            7.2458      10865
   2         HE            5.4565       8182
   3         DY            4.5149       6770
=====================================================================
Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
Descending sorted on frequency.

  No.  Substring  Frequency (in %)  Frequency
   1         CH            7.4569       3119
   2         HE            4.9298       2062
   3         EE            3.6508       1527
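The distinct-words variant presumably just deduplicates the token list before counting; a minimal sketch under that assumption (file names taken from the headers above):

Code:
# Build a distinct-word list so that each word type contributes its
# bigrams only once. Assumes whitespace-separated tokens.
with open('ZL2023_Clean.txt', encoding='utf-8') as f:
    words = f.read().lower().split()
with open('VMS_distinct_words.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(set(words))))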
I have different percentages but the order is the same.

Bigrams

  No.  Digram  Frequency (in %)  Frequency
   1     CH             5.7271      11123 *
   2     HE             4.2679       8289
   3     DY             3.6150       7021
   4     AI             3.4976       6793
   5     OK             3.2531       6318
   6     IN             3.1120       6044
   7     OL             3.0121       5850
   8     EE             2.7361       5314
   9     QO             2.7310       5304
  10     ED             2.6398       5127
  11     II             2.4534       4765
  12     SH             2.3340       4533 *

Trigrams

  No.  Trigram  Frequency (in %)  Frequency
   1     CHE            2.6285       5105
   2     IIN            2.2223       4316
   3     AII            2.2150       4302
   4     EDY            2.1842       4242
   5     YQO            1.8814       3654
   6     QOK            1.6116       3130
   7     CHO            1.3732       2667
   8     OKE            1.3475       2617
   9     SHE            1.3469       2616
  10     HED            1.2749       2476
  48     CTH            0.4747        922 *
  49     CKH            0.4655        904 *
 196     CPH            0.1086        211 *

(Rows marked * are the substituted glyph clusters.)

The fact is that minor changes to the corpus can have a significant effect on the distribution.
Are they minor changes, though?
If I've done this right, then:

preparation:
substrings = ['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
Total number of VMS words: 39020
VMS words containing at least one of the substrings: 16183
(16183 / 39020) * 100 = 41.47 %

Then you have changed the lengths of 41% of the words in the corpus.
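A minimal sketch of how that 41% figure can be computed (whitespace-separated tokens assumed; the input file name is a placeholder):

Code:
# Fraction of word tokens containing at least one of the clusters.
# The input file name is a placeholder.
substrings = ['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']

with open('ZLall.txt', encoding='utf-8') as f:
    words = f.read().lower().split()

hits = sum(1 for w in words if any(s in w for s in substrings))
print(f'{hits} / {len(words)} = {100 * hits / len(words):.2f} %')
# RobGea's figures: 16183 / 39020 = 41.47 %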
You should be able to get identical results. That is not an unachievable ideal. It should be possible.

These are some obvious ways in which differences can arise:
- different input files (there are even different sources for the IT or TT files)
- interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
- handling of the ? character: as just another character, word break, or ignore word altogether
- the general way in which substrings are substituted
- w.r.t. these substrings, in particular what is done with cases like ckhh or cthh (see the sketch below)
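To illustrate that last point, a small hypothetical example of what plain substitution does with such cases:

Code:
# Hypothetical illustration of the ckhh / cthh cases: plain replacement
# leaves a stray 'h' behind, and each pipeline may treat it differently.
for token in ('ckhh', 'cthh'):
    out = token.replace('ckh', 'K').replace('cth', 'T')
    print(token, '->', out)   # ckhh -> Kh, cthh -> Th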
(27-08-2024, 10:57 PM)ReneZ Wrote: You should be able to get identical results. That is not an unachievable ideal. It should be possible.

I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):

ivtt -x7 ZL3a-n.txt ZLall.txt

./prep_file2.sh ZLall.txt ZLall2.txt

Code:
#!/bin/bash

# Check if the correct number of arguments has been passed
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <inputfile> <outputfile>"
    exit 1
fi

# Input file and output file from the arguments
inputfile="$1"
outputfile="$2"

# Replace each '@...;' escape code with the placeholder 'w', remove
# empty lines, and write the result to the output file
sed -e 's/@[^;]*;/w/g' \
    -e '/^$/d' "$inputfile" > "$outputfile"

# Print success message
echo "The file has been successfully written to $outputfile."


python3 freq_chars_bigramme.py ZLall2.txt freqs.txt

Code:
import sys
from collections import Counter
import re

# Function to calculate digram (bigram) frequency within words
def calculate_digram_frequency(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read().lower()

    # Replace every character that is not a lowercase letter or whitespace with a space
    text = re.sub(r'[^a-z\s]', ' ', text)

    # Split text into words based on spaces
    words = text.split()

    # List to store bigrams
    bigrams = []

    # Generate bigrams within each word
    for word in words:
        if len(word) >= 2:
            bigrams.extend(word[i:i+2] for i in range(len(word) - 1))

    # Count the occurrences of each bigram
    counter = Counter(bigrams)

    return counter, len(bigrams)

# Main program
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python program.py <input_filepath> <output_filepath>")
        sys.exit(1)

    input_filepath = sys.argv[1]  # Input file as an argument
    output_filepath = sys.argv[2]  # Output file as an argument

    # Calculate frequencies
    counter, total_bigram_count = calculate_digram_frequency(input_filepath)

    # Sort bigrams by frequency and select the top 12
    top_bigrams = counter.most_common(12)

    # Write the top 12 bigrams to the output file
    with open(output_filepath, 'w', encoding='utf-8') as output_file:
        output_file.write("  No.  Bigram Frequency (in %) Frequency\n")
        for i, (bigram, frequency) in enumerate(top_bigrams, 1):
            bigram_upper = bigram.upper()  # Convert bigram to uppercase
            frequency_percentage = (frequency / total_bigram_count) * 100 if total_bigram_count > 0 else 0.0
            output_file.write(f"{i:>4} {bigram_upper:>8} {frequency_percentage:>13.4f} {frequency:>10}\n")

  No.  Bigram Frequency (in %) Frequency
   1       CH        7.1043      11081
   2       HE        5.3008       8268
   3       DY        4.4623       6960
   4       AI        4.3533       6790
   5       OK        3.9615       6179
   6       IN        3.8750       6044
   7       OL        3.6121       5634
   8       EE        3.4038       5309
   9       QO        3.3961       5297
  10       ED        3.2800       5116
  11       II        3.0550       4765
  12       SH        2.9062       4533
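For what it's worth, these counts sit close to, but not exactly on, the table posted earlier in the thread; a quick comparison of the shared entries (numbers copied from the two posts):

Code:
# Counts from the two posts in this thread. The small gaps are exactly
# the pipeline-dependent differences under discussion.
robgea = {'CH': 11123, 'HE': 8289, 'DY': 7021, 'II': 4765, 'SH': 4533}
bi3mw  = {'CH': 11081, 'HE': 8268, 'DY': 6960, 'II': 4765, 'SH': 4533}
for bg in robgea:
    print(bg, robgea[bg] - bi3mw[bg])   # CH 42, HE 21, DY 61, II 0, SH 0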


If anyone finds any errors, I would be grateful for the information.
(28-08-2024, 01:28 PM)bi3mw Wrote: # Remove all characters that are not letters

Maybe you should keep spaces and question marks to match Cryptool results (I have not checked).
Hmm, I used the text editor to search for the number of digrams in each case and got the correct numbers. The code that creates "ZLall2.txt" is really too simple to be wrong.
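One caveat with editor-search verification (purely illustrative): editor searches are usually non-overlapping, so repeated-letter digrams like II can come out lower than with the sliding-window count the script uses.

Code:
# Editor searches are typically non-overlapping, so they can undercount
# repeated-letter digrams such as 'ii'.
word = 'daiiin'
print(word.count('ii'))                           # 1 (non-overlapping)
print(sum(1 for i in range(len(word) - 1)
          if word[i:i+2] == 'ii'))                # 2 (sliding window)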

edit: I also checked with another tool; the results are identical.

The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.
(28-08-2024, 06:04 PM)RobGea Wrote: The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.

Can you run Cryptool again with a clean file (just to double-check)?