(25-08-2024, 06:11 PM)RobGea Wrote: Original chars -> replaced with
ch -> C
sh -> S
cth -> T
ckh -> K
cph -> P
cfh -> F
This has a considerable influence on the graph (here the herbal section as an example). I would not have thought so.
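For anyone who wants to reproduce the substitution step, here is a minimal Python sketch. It assumes a plain lowercase EVA text. The name `substitute_glyphs` is just a placeholder, and trying the three-letter sequences before the two-letter ones is my assumption, not necessarily how the graphs above were produced.

```python
import re

# Mapping quoted above: multi-character sequences -> single placeholder letters.
SUBSTITUTIONS = {
    "cth": "T",
    "ckh": "K",
    "cph": "P",
    "cfh": "F",
    "ch": "C",
    "sh": "S",
}

# Longest keys first, so a three-letter sequence is tried before a two-letter one.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, SUBSTITUTIONS), key=len, reverse=True))
)


def substitute_glyphs(text):
    """Replace each listed sequence with its single-letter placeholder."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)
```

With this sketch, `substitute_glyphs("chedy")` gives `Cedy`, so the word shrinks from five characters to four.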
Here is the result without substitutions:
[attachment=9124]
The graph with substitutions:
[attachment=9125]
Using Cryptool n-grams on an ASCII-only ZL3a-n.txt file shows that CH is the most common bigram (rank 1) in both all words and distinct words.
That would help towards explaining the graph differences.
Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
Descending sorted on frequency.
No. Substring Frequency (in %) Frequency
1 CH 7.2458 10865
2 HE 5.4565 8182
3 DY 4.5149 6770
=====================================================================
Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
Descending sorted on frequency.
No. Substring Frequency (in %) Frequency
1 CH 7.4569 3119
2 HE 4.9298 2062
3 EE 3.6508 1527
I have different percentages but the order is the same.
Bigrams
No. Digram Frequency (in %) Frequency
1 CH 5.7271 11123 *
2 HE 4.2679 8289
3 DY 3.6150 7021
4 AI 3.4976 6793
5 OK 3.2531 6318
6 IN 3.1120 6044
7 OL 3.0121 5850
8 EE 2.7361 5314
9 QO 2.7310 5304
10 ED 2.6398 5127
11 II 2.4534 4765
12 SH 2.3340 4533 *
Trigrams
No. Trigram Frequency (in %) Frequency
1 CHE 2.6285 5105
2 IIN 2.2223 4316
3 AII 2.2150 4302
4 EDY 2.1842 4242
5 YQO 1.8814 3654
6 QOK 1.6116 3130
7 CHO 1.3732 2667
8 OKE 1.3475 2617
9 SHE 1.3469 2616
10 HED 1.2749 2476
48 CTH 0.4747 922 *
49 CKH 0.4655 904 *
196 CPH 0.1086 211 *
The fact is that minor changes to the corpus can have a significant effect on the distribution.
Are they minor changes, though ?
If I've done this right, then:
Preparation:
ivtt.exe -x7 ZL3a-n.txt ZL3_nnn.txt
Regex in Notepad++: search for \@[0-9]{3}\; and replace with 'W' (166 occurrences were replaced in the current file)
Save as 'ZL3_W.txt'
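The same replacement can be sketched in Python for anyone without Notepad++. This is only an illustration: `replace_codes` is a made-up name, and the pattern is the one given above ('@', three digits, ';').

```python
import re

# Same pattern as the Notepad++ search: '@', three digits, ';'.
HIGH_ASCII_CODE = re.compile(r"@[0-9]{3};")


def replace_codes(text, placeholder="W"):
    """Replace every @nnn; code with one placeholder.

    Returns a tuple (new_text, number_of_replacements).
    """
    return HIGH_ASCII_CODE.subn(placeholder, text)
```

The returned count can be compared with the number of replacements Notepad++ reports for the same file.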
.
substrings =['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
Total nbr. of VMS words : 39020
VMS words containing at least 1 of the substrings : 16183
(16183 / 39020)*100 = 41.47 %
Then you have changed the lengths of 41% of the words in the corpus.
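The count above can be reproduced along these lines. A rough sketch, assuming the words are already available as a list of strings; `share_of_affected_words` is a hypothetical helper name.

```python
SUBSTRINGS = ["ch", "sh", "cth", "ckh", "cph", "cfh"]


def share_of_affected_words(words, substrings=SUBSTRINGS):
    """Return (affected, total, percent) for words containing any substring."""
    affected = sum(1 for w in words if any(s in w for s in substrings))
    total = len(words)
    percent = 100.0 * affected / total if total else 0.0
    return affected, total, percent
```

With the figures quoted above, 100 * 16183 / 39020 rounds to 41.47.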
You should be able to get identical results. That is not an unachievable ideal. It should be possible.
These are some obvious ways in which differences can arise:
- different input files (there are even different sources for the IT or TT files)
- interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
- handling of the ? character: as just another character, word break, or ignore word altogether
- the general way in which substrings are substituted
- w.r.t. these substrings, in particular what is done with cases like ckhh or cthh
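To illustrate just one of these points, the three ways of handling the ? character could look like this in Python. It is only a sketch: it assumes the words are already separated by whitespace, and the function and policy names are invented for the example.

```python
def tokenize(line, unknown="keep"):
    """Split a line into words under one of three policies for '?'.

    keep  : '?' counts as an ordinary character
    break : '?' acts as a word break
    drop  : any word containing '?' is ignored altogether
    """
    if unknown == "keep":
        return line.split()
    if unknown == "break":
        return line.replace("?", " ").split()
    if unknown == "drop":
        return [w for w in line.split() if "?" not in w]
    raise ValueError(f"unknown policy: {unknown!r}")
```

Each policy gives a different word list for the same line, so word counts and n-gram totals will differ unless both sides use the same one.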
(27-08-2024, 10:57 PM)ReneZ Wrote: You should be able to get identical results. That is not an unachievable ideal. It should be possible.
I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):
ivtt -x7 ZL3a-n.txt ZLall.txt
./prep_file2.sh ZLall.txt ZLall2.txt
Code:
#!/bin/bash

# Check if the correct number of arguments has been passed
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <inputfile> <outputfile>"
    exit 1
fi

# Input file and output file from the arguments
inputfile="$1"
outputfile="$2"

# Perform character replacements and write to the output file, remove empty lines
sed -e 's/@[^;]*;/w/g' \
    -e '/^$/d' "$inputfile" > "$outputfile"

# Print success message
echo "The file has been successfully written to $outputfile."
python3 freq_chars_bigramme.py ZLall2.txt freqs.txt
Code:
import sys
from collections import Counter
import re


# Function to calculate digram (bigram) frequency within words
def calculate_digram_frequency(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read().lower()

    # Replace every character that is not a-z or whitespace with a space
    # (this also splits a word wherever such a character occurs)
    text = re.sub(r'[^a-z\s]', ' ', text)

    # Split text into words based on whitespace
    words = text.split()

    # List to store bigrams
    bigrams = []

    # Generate bigrams within each word
    for word in words:
        if len(word) >= 2:
            bigrams.extend(word[i:i+2] for i in range(len(word) - 1))

    # Count the occurrences of each bigram
    counter = Counter(bigrams)

    return counter, len(bigrams)


# Main program
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python program.py <input_filepath> <output_filepath>")
        sys.exit(1)

    input_filepath = sys.argv[1]   # Input file as an argument
    output_filepath = sys.argv[2]  # Output file as an argument

    # Calculate frequencies
    counter, total_bigram_count = calculate_digram_frequency(input_filepath)

    # Sort bigrams by frequency and select the top 12
    top_bigrams = counter.most_common(12)

    # Write the top 12 bigrams to the output file
    with open(output_filepath, 'w', encoding='utf-8') as output_file:
        output_file.write(" No. Bigram Frequency (in %) Frequency\n")
        for i, (bigram, frequency) in enumerate(top_bigrams, 1):
            bigram_upper = bigram.upper()  # Convert bigram to uppercase
            frequency_percentage = (frequency / total_bigram_count) * 100 if total_bigram_count > 0 else 0.0
            output_file.write(f"{i:>4} {bigram_upper:>8} {frequency_percentage:>13.4f} {frequency:>10}\n")
No. Bigram Frequency (in %) Frequency
1 CH 7.1043 11081
2 HE 5.3008 8268
3 DY 4.4623 6960
4 AI 4.3533 6790
5 OK 3.9615 6179
6 IN 3.8750 6044
7 OL 3.6121 5634
8 EE 3.4038 5309
9 QO 3.3961 5297
10 ED 3.2800 5116
11 II 3.0550 4765
12 SH 2.9062 4533
If anyone finds any errors I would be grateful for information.
(28-08-2024, 01:28 PM)bi3mw Wrote: # Remove all characters that are not letters
Maybe you should keep spaces and question marks to match Cryptool results (I have not checked).
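To show what keeping the spaces would change, here is a small sketch that counts bigrams in two ways: only inside words, and over the whole stream with a space treated as a character. The function names are invented for the example, and whether this actually explains the difference from the Cryptool figures is untested.

```python
from collections import Counter


def bigrams_within_words(text):
    """Count bigrams inside each whitespace-separated word only."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def bigrams_with_spaces(text):
    """Count bigrams over the whole stream, treating a space as a character."""
    stream = " ".join(text.split())  # normalise runs of whitespace
    return Counter(stream[i:i + 2] for i in range(len(stream) - 1))


def percent(counts, bigram):
    """Share of one bigram among all counted bigrams, in percent."""
    total = sum(counts.values())
    return 100.0 * counts[bigram] / total if total else 0.0
```

The raw count of a bigram such as "ch" is the same in both cases, but the total number of bigrams grows when spaces are kept, so the percentages drop while the ranking of the letter-only bigrams stays the same.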
Hmm, I used the text editor to search for the number of digrams in each case and get the correct numbers. The code to create "ZLall2.txt" is actually too simple to be wrong.
edit: I also checked with this tool. The results are identical.
The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.

(28-08-2024, 06:04 PM)RobGea Wrote: The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.

Can you run Cryptool again with a clean file (just to double-check)?