Binomial distribution in VMS - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Binomial distribution in VMS (/thread-4353.html)
RE: Binomial distribution in VMS - bi3mw - 26-08-2024

(25-08-2024, 06:11 PM) RobGea Wrote:
Original chars -> replaced with

This has a considerable influence on the graph (here the herbal section as an example). I would not have thought so.

Here is the result without substitutions:
[image]

The graph with substitutions:
[image]

RE: Binomial distribution in VMS - RobGea - 26-08-2024

Using Cryptool ngrams on an ASCII-only ZL3a-n.txt file shows that CH is the rank No. 1 most common bigram in both all_words and distinct words. That would help towards explaining the graph differences.

Digram Analysis of <ZL2023_Clean.txt>. File size 230559 bytes.
Descending sorted on frequency.

No.  Substring  Frequency (in %)  Frequency
1    CH         7.2458            10865
2    HE         5.4565            8182
3    DY         4.5149            6770

=====================================================================

Digram Analysis of <VMS_distinct_words.txt>. File size 66310 bytes.
Descending sorted on frequency.

No.  Substring  Frequency (in %)  Frequency
1    CH         7.4569            3119
2    HE         4.9298            2062
3    EE         3.6508            1527

RE: Binomial distribution in VMS - bi3mw - 27-08-2024

I have different percentages but the order is the same.

Bigrams

No.  Digram  Frequency (in %)  Frequency
1    CH      5.7271            11123  *
2    HE      4.2679            8289
3    DY      3.6150            7021
4    AI      3.4976            6793
5    OK      3.2531            6318
6    IN      3.1120            6044
7    OL      3.0121            5850
8    EE      2.7361            5314
9    QO      2.7310            5304
10   ED      2.6398            5127
11   II      2.4534            4765
12   SH      2.3340            4533   *

Trigrams

No.  Trigram  Frequency (in %)  Frequency
1    CHE      2.6285            5105
2    IIN      2.2223            4316
3    AII      2.2150            4302
4    EDY      2.1842            4242
5    YQO      1.8814            3654
6    QOK      1.6116            3130
7    CHO      1.3732            2667
8    OKE      1.3475            2617
9    SHE      1.3469            2616
10   HED      1.2749            2476
48   CTH      0.4747            922    *
49   CKH      0.4655            904    *
196  CPH      0.1086            211    *

The fact is that minor changes to the corpus can have a significant effect on the distribution.

RE: Binomial distribution in VMS - RobGea - 27-08-2024

Are they minor changes, though? If I've done this right, then:

preparation.
substrings = ['ch', 'sh', 'cth', 'ckh', 'cph', 'cfh']
Total number of VMS words: 39020
VMS words containing at least 1 of the substrings: 16183
(16183 / 39020) * 100 = 41.47 %

Then you have changed the lengths of 41% of the words in the corpus.

RE: Binomial distribution in VMS - ReneZ - 27-08-2024

You should be able to get identical results. That is not an unachievable ideal. It should be possible.

These are some obvious ways in which differences can arise:
- different input files (there are even different sources for the IT or TT files)
- interpretation of uncertain spaces (ivtt has -x7 and -x8 for the two cases)
- handling of the ? character: as just another character, as a word break, or by ignoring the word altogether
- the general way in which substrings are substituted
- w.r.t. these substrings, in particular what is done with cases like ckhh or cthh

RE: Binomial distribution in VMS - bi3mw - 28-08-2024

(27-08-2024, 10:57 PM) ReneZ Wrote:
You should be able to get identical results. That is not an unachievable ideal. It should be possible.

I have checked my procedure again and corrected it. Here are all the steps again (now with the complete VMS file):

ivtt -x7 ZL3a-n.txt ZLall.txt

./prep_file2.sh ZLall.txt ZLall2.txt

Code:
#!/bin/bash

python3 freq_chars_bigramme.py ZLall2.txt freqs.txt

Code:
import sys

No.  Bigram  Frequency (in %)  Frequency
1    CH      7.1043            11081
2    HE      5.3008            8268
3    DY      4.4623            6960
4    AI      4.3533            6790
5    OK      3.9615            6179
6    IN      3.8750            6044
7    OL      3.6121            5634
8    EE      3.4038            5309
9    QO      3.3961            5297
10   ED      3.2800            5116
11   II      3.0550            4765
12   SH      2.9062            4533

If anyone finds any errors, I would be grateful to hear about them.

RE: Binomial distribution in VMS - nablator - 28-08-2024

(28-08-2024, 01:28 PM) bi3mw Wrote:
# Remove all characters that are not letters

Maybe you should keep spaces and question marks to match the Cryptool results (I have not checked).
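To make nablator's point concrete, here is a minimal Python sketch of how the treatment of spaces and question marks can shift bigram percentages even when the raw count of a bigram stays the same. It is not the code of freq_chars_bigramme.py, which is not shown in full in the thread; the helper names (clean, ngram_counts, percent) and the sample string are hypothetical and chosen only for illustration.

```python
from collections import Counter


def clean(text, keep_extra=""):
    """Uppercase the text and keep only A-Z plus any characters in keep_extra."""
    allowed = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set(keep_extra)
    return "".join(ch for ch in text.upper() if ch in allowed)


def ngram_counts(text, n=2):
    """Count overlapping n-grams over the whole string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def percent(counts, gram):
    """Share of one n-gram among all counted n-grams, in percent."""
    total = sum(counts.values())
    return 100.0 * counts[gram] / total if total else 0.0


# Hypothetical sample line, not real transliteration data.
sample = "chedy qokeedy shedy ?chol daiin"

# Policy 1: drop everything that is not a letter (words get concatenated).
letters_only = ngram_counts(clean(sample))
# Policy 2: keep spaces and '?' as ordinary characters.
with_spaces = ngram_counts(clean(sample, " ?"))

for label, counts in (("letters only", letters_only),
                      ("spaces and ? kept", with_spaces)):
    print(label, counts["CH"], sum(counts.values()),
          round(percent(counts, "CH"), 4))
```

In this toy input, CH is counted twice under both policies, but the total number of bigrams differs, so the percentage differs too. Removing spaces also creates bigrams that straddle a word boundary (here YQ), which do not exist when spaces are kept.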
RE: Binomial distribution in VMS - bi3mw - 28-08-2024

Hmm, I used the text editor to search for the number of digrams in each case and got the correct numbers. The code to create "ZLall2.txt" is actually too simple to be wrong.

edit: I also checked with this tool (link not shown). The results are identical.

RE: Binomial distribution in VMS - RobGea - 28-08-2024

The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for.
Totally my fault, sorry for the confusion.

RE: Binomial distribution in VMS - bi3mw - 28-08-2024

(28-08-2024, 06:04 PM) RobGea Wrote:
The wrong Cryptool counts are totally my bad. I just grabbed a file that had @nnn as high-ASCII bytes, and Cryptool has a default charset of a-z, which I forgot to account for. Totally my fault, sorry for the confusion.

Can you run Cryptool again with a clean file (just to double check)?
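One way to check whether a file is "clean" before feeding it to an n-gram tool is to list every character that falls outside the expected alphabet. The following Python sketch assumes an alphabet of ASCII letters plus whitespace. The function name out_of_alphabet and the sample bytes are hypothetical, and the sketch does not reproduce anything Cryptool does internally.

```python
from collections import Counter
import string


def out_of_alphabet(data, alphabet=string.ascii_letters, ignore=" \n\r\t"):
    """Count characters in data (bytes) that are in neither alphabet nor ignore."""
    # latin-1 maps every byte to exactly one character, so high bytes stay visible.
    text = data.decode("latin-1")
    allowed = set(alphabet) | set(ignore)
    return Counter(ch for ch in text if ch not in allowed)


# Hypothetical sample: ordinary words plus one high-ASCII byte (0xE9) and a '?'.
sample = b"chedy qok\xe9dy ?chol\n"
stray = out_of_alphabet(sample)
print(stray)
```

An empty result means that every character is either a letter or whitespace. Anything else (high bytes, '?', digits, markup) is reported with its count, so it can be handled explicitly before the frequencies are compared.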