• Binomial distribution in VMS
  • RE: Binomial distribution in VMS

    bi3mw > 26-08-2024, 08:34 PM

    (25-08-2024, 06:11 PM)RobGea Wrote: Original chars -> replaced with
    ch -> C
    sh -> S
    cth -> T
    ckh -> K
    cph -> P
    cfh -> F

    This has a considerable influence on the graph (herbal section used here as an example). I would not have expected that.
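    A minimal sketch of the substitution step, assuming words are lowercase EVA strings. The mapping is the one quoted above; the function name `substitute` is hypothetical and not taken from anyone's actual script.

```python
# Mapping from the quoted post: EVA substring -> single replacement character.
SUBSTITUTIONS = {
    "cth": "T",
    "ckh": "K",
    "cph": "P",
    "cfh": "F",
    "ch": "C",
    "sh": "S",
}

def substitute(word):
    """Return the word with each listed substring collapsed to one character."""
    # The replacements are uppercase and the patterns lowercase,
    # so a replaced character is never matched again by a later pattern.
    for old, new in SUBSTITUTIONS.items():
        word = word.replace(old, new)
    return word

for w in ["chedy", "shol", "cthhy", "daiin"]:
    print(w, len(w), "->", substitute(w), len(substitute(w)))
```

    Every hit shortens the word by one or two characters, so any statistic based on word length shifts accordingly.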

    Here is the result without substitutions:
       

    The graph with substitutions:
       
  • RE: Binomial distribution in VMS

    RobGea > 26-08-2024, 10:45 PM

    Running Cryptool's n-gram analysis on an ASCII-only ZL3a-n.txt file shows that CH is the most common bigram, both over all words and over distinct words.
    That would help explain the graph differences.

    Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
    Descending sorted on frequency.

      No.   Substring Frequency (in %) Frequency
        1         CH         7.2458       10865
        2         HE         5.4565       8182
        3         DY         4.5149       6770
    =====================================================================
    Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
    Descending sorted on frequency.

      No.   Substring Frequency (in %) Frequency
        1         CH         7.4569       3119
        2         HE         4.9298       2062
        3         EE         3.6508       1527
  • RE: Binomial distribution in VMS

    bi3mw > 27-08-2024, 03:32 PM

    I have different percentages but the order is the same.

    Bigrams
     
      No.  Digram Frequency (in %) Frequency
       1       CH        5.7271      11123 *
       2       HE        4.2679       8289
       3       DY        3.6150       7021
       4       AI        3.4976       6793
       5       OK        3.2531       6318
       6       IN        3.1120       6044
       7       OL        3.0121       5850
       8       EE        2.7361       5314
       9       QO        2.7310       5304
      10       ED        2.6398       5127
      11       II        2.4534       4765
      12       SH        2.3340       4533 *
     
     
      Trigrams
     
      No.  Trigram Frequency (in %) Frequency
       1      CHE        2.6285       5105
       2      IIN        2.2223       4316
       3      AII        2.2150       4302
       4      EDY        2.1842       4242
       5      YQO        1.8814       3654
       6      QOK        1.6116       3130
       7      CHO        1.3732       2667
       8      OKE        1.3475       2617
       9      SHE        1.3469       2616
      10      HED        1.2749       2476
      48      CTH        0.4747        922 *
      49      CKH        0.4655        904 *
     196      CPH        0.1086        211 *

    The fact is that minor changes to the corpus can have a significant effect on the distribution.
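    For comparison, a minimal sketch of counting n-grams within words, assuming the text is already split into words. The helper name `ngram_counts` is hypothetical. The percentage column depends on the total number of n-grams in the denominator, so two runs with nearly the same raw count for CH can still report quite different percentages if the tokenisation differs.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count n-grams inside each word; n-grams never span a word boundary."""
    counts = Counter()
    for w in words:
        # range() is empty when the word is shorter than n
        counts.update(w[i:i + n] for i in range(len(w) - n + 1))
    return counts

words = ["chedy", "chol", "daiin"]
bigrams = ngram_counts(words, 2)
total = sum(bigrams.values())
for gram, freq in bigrams.most_common(3):
    print(f"{gram.upper():>4} {100.0 * freq / total:8.4f} {freq:6d}")
```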
  • RE: Binomial distribution in VMS

    RobGea > 27-08-2024, 09:41 PM

    Are they minor changes, though?
    If I've done this right, then:

    preparation
    substrings =['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
    Total number of VMS words: 39020
    VMS words containing at least 1 of the substrings: 16183
    (16183 / 39020) * 100 = 41.47 %

    Then you have changed the lengths of 41% of the words in the corpus.
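    The count above can be sketched as follows, assuming `words` is a list of all word tokens (not distinct words). The function name `share_affected` is hypothetical; the substring list is the one given in the post.

```python
SUBSTRINGS = ["ch", "sh", "cth", "ckh", "cph", "cfh"]

def share_affected(words):
    """Return (affected, total, percent) for words containing any substring."""
    affected = sum(1 for w in words if any(s in w for s in SUBSTRINGS))
    total = len(words)
    percent = 100.0 * affected / total if total else 0.0
    return affected, total, percent

# Tiny illustrative sample, not real corpus data
print(share_affected(["chedy", "daiin", "okcthy", "ol"]))

# Arithmetic check using the counts quoted in the post
print(round(100.0 * 16183 / 39020, 2))
```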
  • RE: Binomial distribution in VMS

    ReneZ > 27-08-2024, 10:57 PM

    You should be able to get identical results. That is not an unachievable ideal. It should be possible.

    These are some obvious ways in which differences can arise:
    - different input files (there are even different sources for the IT or TT files)
    - interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
    - handling of the ? character: as just another character, word break, or ignore word altogether
    - the general way in which substrings are substituted
    - w.r.t. these substrings, in particular what is done with cases like ckhh or cthh
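    To make the third point concrete, here is a rough sketch of the three ways of handling the ? character, assuming the line is already plain text with words separated by whitespace. The function `tokenize` and its policy names are hypothetical. Two tools that differ only in this policy will already report different word and n-gram totals.

```python
def tokenize(line, unreadable="keep"):
    """Split a line into words under one of three policies for '?'.

    keep  : '?' is just another character
    break : '?' acts as a word break
    drop  : any word containing '?' is ignored altogether
    """
    if unreadable == "break":
        return line.replace("?", " ").split()
    words = line.split()
    if unreadable == "drop":
        return [w for w in words if "?" not in w]
    if unreadable == "keep":
        return words
    raise ValueError(f"unknown policy: {unreadable}")

sample = "qok?dy daiin"
for policy in ("keep", "break", "drop"):
    print(policy, tokenize(sample, policy))
```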
  • RE: Binomial distribution in VMS

    bi3mw > 28-08-2024, 01:28 PM

    (27-08-2024, 10:57 PM)ReneZ Wrote: You should be able to get identical results. That is not an unachievable ideal. It should be possible.

    I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):

    ivtt -x7 ZL3a-n.txt ZLall.txt

    ./prep_file2.sh ZLall.txt ZLall2.txt

    Code:
    #!/bin/bash

    # Check if the correct number of arguments has been passed
    if [ "$#" -ne 2 ]; then
        echo "Usage: $0 <inputfile> <outputfile>"
        exit 1
    fi

    # Input file and output file from the arguments
    inputfile="$1"
    outputfile="$2"

    # Replace every @...; code with the single letter w, delete empty lines,
    # and write the result to the output file
    sed -e 's/@[^;]*;/w/g' \
        -e '/^$/d' "$inputfile" > "$outputfile"

    # Print success message
    echo "The file has been successfully written to $outputfile."


    python3 freq_chars_bigramme.py ZLall2.txt freqs.txt

    Code:
    import sys
    from collections import Counter
    import re

    # Function to calculate digram (bigram) frequency within words
    def calculate_digram_frequency(filepath):
        # Read file content
        with open(filepath, 'r', encoding='utf-8') as file:
            text = file.read().lower()

        # Replace every character that is not a-z or whitespace with a space;
        # note that this also splits a word wherever such a character (e.g. '?') occurs
        text = re.sub(r'[^a-z\s]', ' ', text)

        # Split text into words based on spaces
        words = text.split()

        # List to store bigrams
        bigrams = []

        # Generate bigrams within each word
        for word in words:
            if len(word) >= 2:
                bigrams.extend(word[i:i+2] for i in range(len(word) - 1))

        # Count the occurrences of each bigram
        counter = Counter(bigrams)

        return counter, len(bigrams)

    # Main program
    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python3 freq_chars_bigramme.py <input_filepath> <output_filepath>")
            sys.exit(1)

        input_filepath = sys.argv[1]  # Input file as an argument
        output_filepath = sys.argv[2]  # Output file as an argument

        # Calculate frequencies
        counter, total_bigram_count = calculate_digram_frequency(input_filepath)

        # Sort bigrams by frequency and select the top 12
        top_bigrams = counter.most_common(12)

        # Write the top 12 bigrams to the output file
        with open(output_filepath, 'w', encoding='utf-8') as output_file:
            output_file.write("  No.  Bigram Frequency (in %) Frequency\n")
            for i, (bigram, frequency) in enumerate(top_bigrams, 1):
                bigram_upper = bigram.upper()  # Convert bigram to uppercase
                frequency_percentage = (frequency / total_bigram_count) * 100 if total_bigram_count > 0 else 0.0
                output_file.write(f"{i:>4} {bigram_upper:>8} {frequency_percentage:>13.4f} {frequency:>10}\n")

      No.  Bigram Frequency (in %) Frequency
       1       CH        7.1043      11081
       2       HE        5.3008       8268
       3       DY        4.4623       6960
       4       AI        4.3533       6790
       5       OK        3.9615       6179
       6       IN        3.8750       6044
       7       OL        3.6121       5634
       8       EE        3.4038       5309
       9       QO        3.3961       5297
      10       ED        3.2800       5116
      11       II        3.0550       4765
      12       SH        2.9062       4533


    If anyone finds any errors, I would be grateful to hear about them.
  • RE: Binomial distribution in VMS

    nablator > 28-08-2024, 02:17 PM

    (28-08-2024, 01:28 PM)bi3mw Wrote:     # Remove all characters that are not letters

    Maybe you should keep spaces and question marks to match Cryptool results (I have not checked).
  • RE: Binomial distribution in VMS

    bi3mw > 28-08-2024, 03:03 PM

    Hmm, I used the text editor to search for the number of digrams in each case and got the correct numbers. The code to create "ZLall2.txt" is actually too simple to be wrong.

    edit: I also checked with this tool. The results are identical.

  • RE: Binomial distribution in VMS

    RobGea > 28-08-2024, 06:04 PM

    The wrong Cryptool counts are totally my bad. I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.
  • RE: Binomial distribution in VMS

    bi3mw > 28-08-2024, 06:25 PM

    (28-08-2024, 06:04 PM)RobGea Wrote: The wrong Cryptool counts are totally my bad. I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.

    Can you run Cryptool again with a clean file (just to double check)?