-
RE: Binomial distribution in VMS
bi3mw > 26-08-2024, 08:34 PM
(25-08-2024, 06:11 PM)RobGea Wrote: Original chars -> replaced with
ch -> C
sh -> S
cth -> T
ckh -> K
cph -> P
cfh -> F
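For anyone who wants to reproduce the substitution step, here is a minimal sketch in Python. The helper name `substitute` and the sample words are made up for illustration; only the six mappings come from the list above. The longer patterns are applied first as a cautious habit, although none of these six patterns contains another.

```python
# Hypothetical helper; only the six mappings are taken from the post.
SUBSTITUTIONS = [
    ("cth", "T"),
    ("ckh", "K"),
    ("cph", "P"),
    ("cfh", "F"),
    ("ch", "C"),
    ("sh", "S"),
]

def substitute(word: str) -> str:
    """Replace each multi-character group with its single-letter stand-in."""
    for old, new in SUBSTITUTIONS:
        word = word.replace(old, new)
    return word

# Made-up sample words, only to show how word lengths change.
for w in ["chedy", "shol", "qokchy", "ckhey", "cthhy", "daiin"]:
    s = substitute(w)
    print(f"{w:>8} -> {s:<8} length {len(w)} -> {len(s)}")
```

Every word that contains one of the groups gets shorter by one or two characters, and a doubled h as in cthhy leaves a stray h behind.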
This has a considerable influence on the graph (here the herbal section as an example). I would not have thought so.
Here is the result without substitutions:
The graph with substitutions:
-
RE: Binomial distribution in VMS
RobGea > 26-08-2024, 10:45 PM
Running Cryptool's n-gram analysis on an ASCII-only ZL3a-n.txt file shows that CH is the most common bigram (rank 1) in both all words and distinct words.
That would help towards explaining the graph differences.
Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
Descending sorted on frequency.
No.  Substring  Frequency (in %)  Frequency
  1  CH                   7.2458      10865
  2  HE                   5.4565       8182
  3  DY                   4.5149       6770
=====================================================================
Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
Descending sorted on frequency.
No.  Substring  Frequency (in %)  Frequency
  1  CH                   7.4569       3119
  2  HE                   4.9298       2062
  3  EE                   3.6508       1527
-
RE: Binomial distribution in VMS
bi3mw > 27-08-2024, 03:32 PM
I have different percentages but the order is the same.
Bigrams
No.  Digram  Frequency (in %)  Frequency
  1  CH                5.7271      11123  *
  2  HE                4.2679       8289
  3  DY                3.6150       7021
  4  AI                3.4976       6793
  5  OK                3.2531       6318
  6  IN                3.1120       6044
  7  OL                3.0121       5850
  8  EE                2.7361       5314
  9  QO                2.7310       5304
 10  ED                2.6398       5127
 11  II                2.4534       4765
 12  SH                2.3340       4533  *
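Since each percentage is just count divided by the total number of bigrams, the totals behind the two CH rows quoted in this thread can be backed out. A small sketch (the function name `implied_total` is my own; the numbers are the ones posted):

```python
def implied_total(count: int, percent: float) -> float:
    """Back out the denominator N from percent = 100 * count / N."""
    return count / (percent / 100.0)

# (label, CH count, CH percentage) exactly as quoted in the thread
reports = [
    ("Cryptool, ZL2023_Clean.txt", 10865, 7.2458),
    ("bigram table above", 11123, 5.7271),
]

for label, count, percent in reports:
    total = implied_total(count, percent)
    print(f"{label}: about {total:,.0f} bigrams in total")
```

The CH counts differ by only about 2 %, while the implied totals differ by roughly 30 %, so most of the gap in the percentages seems to come from what goes into the denominator rather than from the CH count itself. The arithmetic does not say which counting rule caused that.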
Trigrams
No.  Trigram  Frequency (in %)  Frequency
  1  CHE                2.6285       5105
  2  IIN                2.2223       4316
  3  AII                2.2150       4302
  4  EDY                2.1842       4242
  5  YQO                1.8814       3654
  6  QOK                1.6116       3130
  7  CHO                1.3732       2667
  8  OKE                1.3475       2617
  9  SHE                1.3469       2616
 10  HED                1.2749       2476
 48  CTH                0.4747        922  *
 49  CKH                0.4655        904  *
196  CPH                0.1086        211  *
The fact is that minor changes to the corpus can have a significant effect on the distribution.
-
RE: Binomial distribution in VMS
RobGea > 27-08-2024, 09:41 PM
Are they minor changes, though?
If I've done this right, then:
preparation
substrings = ['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
Total number of VMS words: 39020
VMS words containing at least one of the substrings: 16183
(16183 / 39020) * 100 = 41.47 %
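A count like this can be reproduced with a few lines of Python. This is only a sketch: `share_affected` is a hypothetical helper, and the sample words are made up, so the first printed line is not a corpus result.

```python
SUBSTRINGS = ["ch", "sh", "cth", "ckh", "cph", "cfh"]

def share_affected(words, substrings=SUBSTRINGS):
    """Return (affected, total, percent) for words containing any substring."""
    total = len(words)
    affected = sum(1 for w in words if any(s in w for s in substrings))
    percent = 100.0 * affected / total if total else 0.0
    return affected, total, percent

# Tiny made-up sample, only to show the call; not real corpus counts.
sample = "daiin chedy qokeedy shol okaiin ckhey ol".split()
affected, total, percent = share_affected(sample)
print(f"{affected} of {total} words affected ({percent:.2f} %)")

# The arithmetic quoted in the post, using the posted counts:
print(f"{100.0 * 16183 / 39020:.2f} %")
```

On a real run, `words` would be the full word list after whatever tokenization is used, which is itself one of the choices that can change the result.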
So you have changed the lengths of about 41% of the words in the corpus.
-
RE: Binomial distribution in VMS
ReneZ > 27-08-2024, 10:57 PM
You should be able to get identical results. That is not an unachievable ideal. It should be possible.
These are some obvious ways in which differences can arise:
- different input files (there are even different sources for the IT or TT files)
- interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
- handling of the ? character: as just another character, word break, or ignore word altogether
- the general way in which substrings are substituted
- w.r.t. these substrings, in particular what is done with cases like ckhh or cthh -
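To make the point about the ? character concrete, here is a rough sketch of the three options as tokenizer policies. The function `tokenize` and its `unknown` parameter are invented for this example, and it assumes the words on a line are already separated by whitespace.

```python
def tokenize(line: str, unknown: str = "keep") -> list[str]:
    """Split a line into words under one of three '?' policies.

    unknown="keep"  : treat '?' as just another character
    unknown="break" : treat '?' as a word break
    unknown="drop"  : discard any word that contains '?'
    """
    if unknown == "break":
        return line.replace("?", " ").split()
    words = line.split()
    if unknown == "drop":
        return [w for w in words if "?" not in w]
    if unknown == "keep":
        return words
    raise ValueError(f"unknown policy: {unknown!r}")

line = "daiin ch?dy qokeedy"  # made-up line for illustration
for policy in ("keep", "break", "drop"):
    print(policy, tokenize(line, policy))
```

The same line yields three, four or two words depending on the policy, so word counts, word lengths and n-gram totals all shift with this one choice.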
RE: Binomial distribution in VMS
bi3mw > 28-08-2024, 01:28 PM
(27-08-2024, 10:57 PM)ReneZ Wrote: You should be able to get identical results. That is not an unachievable ideal. It should be possible.
I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):
ivtt -x7 ZL3a-n.txt ZLall.txt
./prep_file2.sh ZLall.txt ZLall2.txt
Code:
#!/bin/bash
# Check if the correct number of arguments has been passed
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <inputfile> <outputfile>"
    exit 1
fi

# Input file and output file from the arguments
inputfile="$1"
outputfile="$2"

# Replace each @...; code with the single character w, remove empty lines,
# and write the result to the output file
sed -e 's/@[^;]*;/w/g' \
    -e '/^$/d' "$inputfile" > "$outputfile"

# Print success message
echo "The file has been successfully written to $outputfile."
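For those who prefer to stay in Python, the two sed expressions could be written roughly as follows. This is a sketch, not part of the posted procedure; `clean_lines` and the sample input are made up.

```python
import re

# Same pattern as the sed expression: '@', any run of non-';' characters, then ';'
EXTENDED_CODE = re.compile(r"@[^;]*;")

def clean_lines(lines):
    """Replace @...; codes with 'w' and skip empty lines."""
    for line in lines:
        line = EXTENDED_CODE.sub("w", line.rstrip("\n"))
        if line:  # mirrors sed's /^$/d
            yield line

sample = ["qo@185;dy chol", "", "daiin"]  # made-up input
print(list(clean_lines(sample)))
```

Each @...; code collapses to one character, so it contributes to bigrams like any other letter instead of being dropped or split.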
python3 freq_chars_bigramme.py ZLall2.txt freqs.txt
Code:
import sys
from collections import Counter
import re

# Function to calculate digram (bigram) frequency within words
def calculate_digram_frequency(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read().lower()

    # Replace every character that is not a-z or whitespace with a space
    # (note: this also splits a word wherever such a character occurs)
    text = re.sub(r'[^a-z\s]', ' ', text)

    # Split text into words based on whitespace
    words = text.split()

    # List to store bigrams
    bigrams = []

    # Generate bigrams within each word
    for word in words:
        if len(word) >= 2:
            bigrams.extend(word[i:i+2] for i in range(len(word) - 1))

    # Count the occurrences of each bigram
    counter = Counter(bigrams)

    return counter, len(bigrams)

# Main program
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python program.py <input_filepath> <output_filepath>")
        sys.exit(1)

    input_filepath = sys.argv[1]   # Input file as an argument
    output_filepath = sys.argv[2]  # Output file as an argument

    # Calculate frequencies
    counter, total_bigram_count = calculate_digram_frequency(input_filepath)

    # Sort bigrams by frequency and select the top 12
    top_bigrams = counter.most_common(12)

    # Write the top 12 bigrams to the output file
    with open(output_filepath, 'w', encoding='utf-8') as output_file:
        output_file.write(" No.   Bigram  Frequency (in %)  Frequency\n")
        for i, (bigram, frequency) in enumerate(top_bigrams, 1):
            bigram_upper = bigram.upper()  # Convert bigram to uppercase
            frequency_percentage = (frequency / total_bigram_count) * 100 if total_bigram_count > 0 else 0.0
            output_file.write(f"{i:>4} {bigram_upper:>8} {frequency_percentage:>13.4f} {frequency:>10}\n")
No.  Bigram  Frequency (in %)  Frequency
  1  CH                7.1043      11081
  2  HE                5.3008       8268
  3  DY                4.4623       6960
  4  AI                4.3533       6790
  5  OK                3.9615       6179
  6  IN                3.8750       6044
  7  OL                3.6121       5634
  8  EE                3.4038       5309
  9  QO                3.3961       5297
 10  ED                3.2800       5116
 11  II                3.0550       4765
 12  SH                2.9062       4533
If anyone finds any errors, I would be grateful to hear about them.
-
RE: Binomial distribution in VMS
nablator > 28-08-2024, 02:17 PM
-
RE: Binomial distribution in VMS
bi3mw > 28-08-2024, 03:03 PM
-
RE: Binomial distribution in VMS
RobGea > 28-08-2024, 06:04 PM
The wrong Cryptool counts are totally my bad. I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.
-
RE: Binomial distribution in VMS
bi3mw > 28-08-2024, 06:25 PM
(28-08-2024, 06:04 PM)RobGea Wrote: The wrong Cryptool counts are totally my bad. I just grabbed a file that had the @nnn codes as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.
Can you run Cryptool again with a clean file (just to double-check)?