The Voynich Ninja - Binomial distribution in VMS

(25-08-2024, 06:11 PM)RobGea Wrote: Original chars -> replaced with
ch -> C
sh -> S
cth -> T
ckh -> K
cph -> P
cfh -> F
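
A minimal sketch of how such a substitution could be applied, assuming lower-case EVA words as input. The names `SUBSTITUTIONS` and `substitute` are made up for illustration and are not taken from the script actually used in this thread.

```python
# Hypothetical mapping, copied from the list above.
# The three-character groups are listed first so they are handled first.
SUBSTITUTIONS = {
    "cth": "T",
    "ckh": "K",
    "cph": "P",
    "cfh": "F",
    "ch": "C",
    "sh": "S",
}


def substitute(word: str) -> str:
    """Replace each listed EVA group with its single-character stand-in."""
    for group, replacement in SUBSTITUTIONS.items():
        word = word.replace(group, replacement)
    return word


if __name__ == "__main__":
    # Toy words, only to show how the word length changes.
    for w in ["chedy", "shol", "qokeedy", "cthy"]:
        s = substitute(w)
        print(f"{w} ({len(w)}) -> {s} ({len(s)})")
```

Every word that contains one of the groups gets shorter, which is why the substitution matters for a word-length distribution.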

This has a considerable influence on the graph (the herbal section is shown here as an example). I would not have expected that.

Here is the result without substitutions:
[attachment=9124]

The graph with substitutions:
[attachment=9125]

Running Cryptool's n-gram analysis on an ASCII-only ZL3a-n.txt file shows that CH is the most common bigram (rank 1) in both all words and distinct words.
That would go some way towards explaining the differences between the graphs.

Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
Descending sorted on frequency.

No.  Substring  Frequency (in %)  Frequency
1    CH         7.2458            10865
2    HE         5.4565            8182
3    DY         4.5149            6770
=====================================================================
Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
Descending sorted on frequency.

No.  Substring  Frequency (in %)  Frequency
1    CH         7.4569            3119
2    HE         4.9298            2062
3    EE         3.6508            1527

I have different percentages but the order is the same.

Bigrams

No.  Digram  Frequency (in %)  Frequency
1    CH      5.7271            11123 *
2    HE      4.2679            8289
3    DY      3.6150            7021
4    AI      3.4976            6793
5    OK      3.2531            6318
6    IN      3.1120            6044
7    OL      3.0121            5850
8    EE      2.7361            5314
9    QO      2.7310            5304
10   ED      2.6398            5127
11   II      2.4534            4765
12   SH      2.3340            4533 *

Trigrams

No.  Trigram  Frequency (in %)  Frequency
1    CHE      2.6285            5105
2    IIN      2.2223            4316
3    AII      2.2150            4302
4    EDY      2.1842            4242
5    YQO      1.8814            3654
6    QOK      1.6116            3130
7    CHO      1.3732            2667
8    OKE      1.3475            2617
9    SHE      1.3469            2616
10   HED      1.2749            2476
48   CTH      0.4747            922 *
49   CKH      0.4655            904 *
196  CPH      0.1086            211 *
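
One possible contributor to percentages that differ while the ranking stays the same is the denominator: the total number of bigrams depends on whether bigrams are counted only inside words or also across word boundaries. The sketch below illustrates this on toy data. It has not been checked against the files used in this thread, and the function names are made up for illustration.

```python
from collections import Counter


def bigrams_within_words(words):
    """Count bigrams that lie entirely inside a word."""
    counts = Counter()
    for word in words:
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def bigrams_ignoring_spaces(words):
    """Count bigrams after joining all words into one string (spaces dropped)."""
    text = "".join(words)
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def percentage(counts, bigram):
    """Share of one bigram among all counted bigrams, in percent."""
    total = sum(counts.values())
    return 100.0 * counts[bigram] / total if total else 0.0


if __name__ == "__main__":
    sample = ["chedy", "qokeedy", "chol", "daiin"]  # toy data, not the VMS
    inside = bigrams_within_words(sample)
    joined = bigrams_ignoring_spaces(sample)
    print("within words:", inside["ch"], f"{percentage(inside, 'ch'):.2f} %")
    print("spaces dropped:", joined["ch"], f"{percentage(joined, 'ch'):.2f} %")
```

In this toy sample the count for CH is the same under both methods, but its percentage is lower when bigrams across word boundaries are included, because the total grows.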

The fact is that minor changes to the corpus can have a significant effect on the distribution.

Are they minor changes, though?
If I've done this right, then:

preparation

substrings = ['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
Total number of VMS words: 39020
VMS words containing at least one of the substrings: 16183
(16183 / 39020) * 100 = 41.47 %

Then you have changed the lengths of 41% of the words in the corpus.
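
A count like this can be reproduced with a few lines of Python. This is only a sketch under the assumption that the words are available as a list of lower-case EVA strings; `count_affected` is a hypothetical helper name, not the script used for the numbers above.

```python
SUBSTRINGS = ["ch", "sh", "cth", "ckh", "cph", "cfh"]


def count_affected(words, substrings=SUBSTRINGS):
    """Return (affected, total, percent) for words containing any substring."""
    total = len(words)
    affected = sum(1 for w in words if any(s in w for s in substrings))
    percent = 100.0 * affected / total if total else 0.0
    return affected, total, percent


if __name__ == "__main__":
    toy_words = ["chedy", "daiin", "shol", "qokeedy", "cthy"]  # toy data
    affected, total, percent = count_affected(toy_words)
    print(f"{affected} of {total} words affected ({percent:.2f} %)")

    # The arithmetic quoted in the post, checked directly:
    print(f"{100.0 * 16183 / 39020:.2f} %")
```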

You should be able to get identical results. That is not an unachievable ideal. It should be possible.

These are some obvious sources of differences:
- different input files (there are even different sources for the IT or TT files)
- interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
- handling of the ? character: as just another character, word break, or ignore word altogether
- the general way in which substrings are substituted
- w.r.t. these substrings, in particular what is done with cases like ckhh or cthh
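
To make the point about the ? character concrete, here is a small sketch of the three options. The policy names and the function `handle_question_marks` are invented for illustration; they do not correspond to any ivtt option.

```python
POLICIES = ("keep_char", "word_break", "ignore_word")


def handle_question_marks(words, policy):
    """Apply one of three hypothetical policies to words containing '?'."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    result = []
    for word in words:
        if "?" not in word or policy == "keep_char":
            # Treat '?' as just another character.
            result.append(word)
        elif policy == "word_break":
            # Split the word at every '?', dropping empty pieces.
            result.extend(part for part in word.split("?") if part)
        # policy == "ignore_word": the word is dropped altogether.
    return result


if __name__ == "__main__":
    sample = ["daiin", "che?dy", "ol"]  # toy data
    for policy in POLICIES:
        print(policy, handle_question_marks(sample, policy))
```

Each policy gives a different number of words and different word lengths for the same input. Similarly, a plain replacement of ckh turns ckhh into Kh and leaves a stray h; whether that is the intended result is a separate decision.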

(27-08-2024, 10:57 PM)ReneZ Wrote: You should be able to get identical results. That is not an unachievable ideal. It should be possible.

I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):

ivtt -x7 ZL3a-n.txt ZLall.txt

./prep_file2.sh ZLall.txt ZLall2.txt

Code:
#!/bin/bash

# Check that exactly two arguments have been passed
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <inputfile> <outputfile>"
    exit 1
fi

# Input file and output file from the arguments
inputfile="$1"
outputfile="$2"

# Replace each @...; code with the placeholder 'w', remove empty lines,
# and write the result to the output file
sed -e 's/@[^;]*;/w/g' \
    -e '/^$/d' "$inputfile" > "$outputfile"

# Print success message
echo "The file has been successfully written to $outputfile."
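
For anyone who wants to cross-check the preprocessing without sed, the same two steps can be sketched in Python. The function name `prep_lines` is hypothetical; the regular expression is the one from the sed call above.

```python
import re

# Same pattern as in the sed call: '@', then anything up to the next ';'.
AT_CODE = re.compile(r"@[^;]*;")


def prep_lines(lines):
    """Replace each @...; code with 'w' and drop empty lines."""
    out = []
    for line in lines:
        line = AT_CODE.sub("w", line.rstrip("\n"))
        if line:
            out.append(line)
    return out


if __name__ == "__main__":
    # Toy input, not taken from the transliteration file.
    print(prep_lines(["qo@152;dy\n", "\n", "chedy\n"]))
```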

python3 freq_chars_bigramme.py ZLall2.txt freqs.txt

Code:
import sys
import re
from collections import Counter


# Function to calculate digram (bigram) frequency within words
def calculate_digram_frequency(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read().lower()

    # Replace every non-letter character with a space
    # (this also splits words at characters such as '?')
    text = re.sub(r'[^a-z\s]', ' ', text)

    # Split text into words based on whitespace
    words = text.split()

    # Generate bigrams within each word
    bigrams = []
    for word in words:
        if len(word) >= 2:
            bigrams.extend(word[i:i+2] for i in range(len(word) - 1))

    # Count the occurrences of each bigram
    counter = Counter(bigrams)
    return counter, len(bigrams)


# Main program
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python program.py <input_filepath> <output_filepath>")
        sys.exit(1)

    input_filepath = sys.argv[1]  # Input file as an argument
    output_filepath = sys.argv[2]  # Output file as an argument

    # Calculate frequencies
    counter, total_bigram_count = calculate_digram_frequency(input_filepath)

    # Sort bigrams by frequency and select the top 12
    top_bigrams = counter.most_common(12)

    # Write the top 12 bigrams to the output file
    with open(output_filepath, 'w', encoding='utf-8') as output_file:
        output_file.write("  No.  Bigram Frequency (in %) Frequency\n")
        for i, (bigram, frequency) in enumerate(top_bigrams, 1):
            bigram_upper = bigram.upper()  # Convert bigram to uppercase
            frequency_percentage = (frequency / total_bigram_count) * 100 if total_bigram_count > 0 else 0.0
            output_file.write(f"{i:>4} {bigram_upper:>8} {frequency_percentage:>13.4f} {frequency:>10}\n")

No.  Bigram  Frequency (in %)  Frequency
1    CH      7.1043            11081
2    HE      5.3008            8268
3    DY      4.4623            6960
4    AI      4.3533            6790
5    OK      3.9615            6179
6    IN      3.8750            6044
7    OL      3.6121            5634
8    EE      3.4038            5309
9    QO      3.3961            5297
10   ED      3.2800            5116
11   II      3.0550            4765
12   SH      2.9062            4533

If anyone finds any errors, I would be grateful to hear about them.

(28-08-2024, 01:28 PM)bi3mw Wrote: # Remove all characters that are not letters

Maybe you should keep spaces and question marks to match Cryptool results (I have not checked).

Hmm, I used the text editor to search for the number of digrams in each case and got the correct numbers. The code to create "ZLall2.txt" is actually too simple to be wrong.

edit: I also checked with this tool. The results are identical.
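
One caveat when cross-checking with a search function: a sliding-window bigram count includes overlapping matches, while a plain substring search often counts only non-overlapping ones. Whether the editor or tool used here behaves that way is an assumption; the sketch only shows the general effect with Python's own `str.count`. The helper name `sliding_count` is made up for illustration.

```python
def sliding_count(text, bigram):
    """Count occurrences of `bigram` at every position (overlaps included)."""
    return sum(1 for i in range(len(text) - 1) if text[i:i + 2] == bigram)


if __name__ == "__main__":
    word = "cheeey"  # toy word with a run of three 'e'
    print("sliding window:", sliding_count(word, "ee"))
    print("str.count:     ", word.count("ee"))  # non-overlapping matches only
```

For bigrams made of two different characters, such as CH, both methods agree; a difference can only appear for repeated-letter bigrams such as EE or II inside longer runs.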

The wrong Cryptool counts are entirely my fault: I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Sorry for the confusion.

(28-08-2024, 06:04 PM)RobGea Wrote: The wrong Cryptool counts are entirely my fault: I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Sorry for the confusion.

Can you run Cryptool again with a clean file (just to double-check)?