Some statistics and analysis regarding VMS (words & letters) - Printable Version

Source: The Voynich Ninja (https://www.voynich.ninja), Forum: Voynich Research (https://www.voynich.ninja/forum-27.html), Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html), Thread: /thread-4411.html
Some statistics and analysis regarding VMS (words & letters) - argo2001 - 23-11-2024

So I did some frequency analysis of the text. I'm still processing most of it, but I wanted to share some of the stats I got. It's possible that some of this data has been posted or produced before by someone else, but I couldn't find it. The data might not be 100% precise due to transcription errors. I used the Landini-Stolfi transcription. My dataset of old Latin and English is limited (only around 5000 words).

Total letters (excluding spaces): 194 771
Total words: 37 852

My entropy calculations of the text based on words:
* Voynich - 10.5957 bits
* English - 7.9068 bits
* Old Latin - 9.8516 bits

My entropy calculations of the text based on letters:
* Voynich - 3.8689 bits
* English - 4.1317 bits
* Old Latin - 4.0166 bits
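For readers who want to reproduce figures of this kind, here is a minimal sketch of how word-level and letter-level entropy could be computed. It is not argo2001's actual code: the function names `shannon_entropy` and `conditional_entropy` are hypothetical, and the sketch assumes first-order Shannon entropy over whitespace-separated words and over letters with spaces removed (matching the "excluding spaces" letter count above). The conditional variant, which comes up later in the thread, is estimated here from adjacent letter pairs and, as a simplifying assumption, lets pairs run across word boundaries.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """First-order Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(symbols):
    """Entropy of a symbol given the previous one, in bits, from adjacent pairs."""
    symbols = list(symbols)
    pairs = Counter(zip(symbols, symbols[1:]))   # counts of (previous, next)
    firsts = Counter(symbols[:-1])               # counts of each symbol as "previous"
    total = sum(pairs.values())
    return -sum(
        (c / total) * math.log2(c / firsts[prev])
        for (prev, _), c in pairs.items()
    )


# Sample string quoted later in the thread; any transcription text would do.
sample = "fachys ykal ar ataiin shol shory ctoses y kor sholdy"
words = sample.split()               # word-level symbols
letters = sample.replace(" ", "")    # letter-level symbols, spaces excluded (assumption)

print("word entropy:", shannon_entropy(words))
print("letter entropy:", shannon_entropy(letters))
print("conditional letter entropy:", conditional_entropy(letters))
```

Results depend heavily on choices such as whether spaces count as symbols and how long the text is, so values from this sketch need not match the numbers posted in the thread.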
(The bottom row lists the most common letter that precedes/follows it in all cases. If the word stands alone, I list the words that commonly precede/follow it, e.g. "chol daiin".)
Independently: the most common combinations of words that appear in the text are:
* "or aiin" - 51 times
* "s aiin" - 44 times
* "ar aiin" - 32 times
* "chol daiin" - 30 times
* "chol chol" - 21 times
* "qokeedy qokeedy" - 20 times

Some of the longest strings that appear multiple times in the text are:
* "ol shedy qokedy qokeedy" - 2 times
* "qokeedy qotedy qokeedy" - 2 times
* "qokedy qokeedy qokeedy" - 2 times
* "qokedy qokedy qokedy" - 3 times
* "sheedy qokedy chedy" - 3 times
* "chol daiin chkaiin" - 2 times

Regarding individual letters/symbols:

[Figure: frequency analysis of all letters]

[Figure: analysis of letters - do they appear standalone, initial, medial or final?]

So yeah, these are some of my results. If you have a request or are interested in more, let me know. If you have any questions, ask. Feel free to draw your own conclusions in the comments.

Please, if you have digitized transcripts of original 15th-century texts or similar (any language), share them with me.

RE: Some statistics and analysis regarding VMS (words & letters) - Koen G - 23-11-2024

I attach the corpus I made 4 years ago. Please note:
* I assembled this initially for personal use. There is no proper attribution or even notes on where I got the texts from. Do not redistribute etc.
* The texts have been pre-processed to remove punctuation, capitalization etc. However, I'm not sure whether this worked properly for non-Latin scripts like Greek and the Slavic languages. Some texts likely still contain unwanted artefacts. Treat these with caution.

RE: Some statistics and analysis regarding VMS (words & letters) - argo2001 - 23-11-2024

Thanks. I need your authorization to access it; the request should be in your email. What do you think about this analysis?

RE: Some statistics and analysis regarding VMS (words & letters) - Koen G - 23-11-2024

Ah yes, I forgot to change the permissions on the file. It should work now.
Your analysis looks good at first glance; it summarizes a lot of Voynichese's properties. Which entropy stat did you calculate exactly to get those numbers? (I am personally almost exclusively familiar with conditional character entropy, where the contrast between Voynichese and a randomly selected medieval text is quite stark.)

Something I haven't personally paid much attention to before is the longest repeated word strings. It would be interesting if you could compare the corpus to Voynichese on this. (Some form of normalization for text length will be necessary here.)

RE: Some statistics and analysis regarding VMS (words & letters) - argo2001 - 23-11-2024

(23-11-2024, 01:41 PM) Koen G wrote: Which entropy stat did you calculate exactly to get those numbers? (I am personally almost exclusively familiar with conditional character entropy, where the contrast between Voynichese and a randomly selected medieval text is quite stark.)

I used the Shannon entropy formula. It is simpler than conditional character entropy (CCE) and measures something different. Take the word string "fachys ykal ar ataiin shol shory ctoses y kor sholdy":

Shannon entropy represents the uncertainty of each letter independently. Here the entropy is 3.64 bits.

Conditional entropy represents the uncertainty of each letter given the previous one. Here the entropy is 2.75 bits.

RE: Some statistics and analysis regarding VMS (words & letters) - argo2001 - 23-11-2024

Here is a graphical representation of the distribution of individual letters within individual words of the VMS.

RE: Some statistics and analysis regarding VMS (words & letters) - Rafal - 23-11-2024

I like your presentation. It is really neat. Do you have any observations that seem valuable to you? And do you have any working hypothesis about what this text could be?

RE: Some statistics and analysis regarding VMS (words & letters) - RobGea - 23-11-2024

@argo2001 In post #6, "graphical representation of distribution of individual letters within individual words of VMS."
What is the rationale for the X-axis only going to 5?

RE: Some statistics and analysis regarding VMS (words & letters) - argo2001 - 23-11-2024

(23-11-2024, 10:33 PM) RobGea wrote: In post#6, "graphical representation of distribution of individual letters within individual words of VMS."

The rationale is the limitation of my coding skills. In my initial attempt, I tried to extract the position of each individual letter within the words. However, this approach caused issues because words vary in length. The positions of letters in shorter words were skewed when compared to longer words, which made the letter positions unreliable for analysis and led to inconsistent or misleading results.

So I created an algorithm that separates each word into five (equal) parts. This way all the words can be represented:
* If the word has 2 letters, it's split in the middle (the word dy becomes d/y) and the letters are counted in 1 and 5.
* If the word has 3 letters, the letters are counted in 1, 3 and 5 respectively.
* If the word has 4 letters (qoty -> q/o/t/y), q is counted in 1, o+t are counted in 2, 3, 4, and y in 5.
* A 5-letter word is self-explanatory.
* If the word has more letters, it's divided into fifths (te/od/a/ro/dy) and the letters are counted respectively.

Yeah, it's not perfect. But I can't come up with a better idea for how to program it. I'm open to any solutions, haha!

RE: Some statistics and analysis regarding VMS (words & letters) - argo2001 - 25-11-2024

Frequency analysis of letters in medieval European languages compared to VMS

After processing Koen's corpus, I have a total of 21,996,439 individual characters extracted from 112 medieval texts spanning various topics. The dataset for Slavic languages is the least reliable due to its relatively small size (405,778 characters), limited topical diversity, and the inclusion of mixed East Slavic languages.
The majority of the Slavic texts are written in Old Church Slavonic and Old East Slavic, primarily from the 1200s. In contrast, the largest datasets are those for Latin, English, and Italian, each exceeding 5,000,000 characters. For context, the Voynich Manuscript, as transcribed in EVA, contains 194,627 characters.

Some of my results are presented graphically below; some are in the attachments. The Y-axis represents each letter's percentage of the total character count. The X-axis displays the letters in order of frequency, from the most common on the left to the least common on the right.

[Figure: Voynich compared to Germanic languages]

[Figure: Voynich compared to Romance languages]

Based on the similarity to Romance languages, mainly Latin, I decided to determine the most common initial letters of all words in German and Latin, comparing these results both with the VMS and between German and Latin.

I'm not a linguist, but based on this striking similarity, is it safe to assume that if the VMS is some kind of substitution cipher, it is based on a Romance language?

There are a few more graphs in the attachments. Any ideas, questions or opinions?
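The rank-ordered comparison described in this post (letters sorted from most to least common, each shown as a percentage of all characters) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code behind the graphs: `letter_profile` is a hypothetical helper, the text is assumed to be already cleaned and whitespace-separated, and only alphabetic characters are counted. Setting `initial_only=True` gives the word-initial variant mentioned for the German/Latin comparison.

```python
from collections import Counter


def letter_profile(text, initial_only=False):
    """Return [(letter, percent)] sorted from most to least common.

    With initial_only=True, only the first letter of each word is counted.
    """
    words = text.lower().split()
    if initial_only:
        letters = [w[0] for w in words if w[0].isalpha()]
    else:
        letters = [ch for w in words for ch in w if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() sorts by descending count, giving the rank order of the X-axis.
    return [(ch, 100.0 * n / total) for ch, n in counts.most_common()]


# Tiny illustrative input; a real run would use a full transcription or corpus file.
sample = "daiin daiin ol"
for letter, percent in letter_profile(sample):
    print(letter, round(percent, 2))

# Word-initial letters only.
print(letter_profile(sample, initial_only=True))
```

Because a substitution cipher would change which letter sits at each rank but not the shape of the curve, comparing only the percentages by rank (ignoring the letter labels) is the natural way to line up Voynichese against a candidate language with a profile like this.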