23-11-2024, 01:02 PM
So I did some frequency analysis of the text. I'm still processing most of it, but I wanted to share some stats I got. It's possible some of this data has been posted or computed before by someone else, but I couldn't find it.
The data might not be 100% precise due to transcription errors. I used the Landini-Stolfi transcription. I only have a limited dataset of old Latin and English (around 5000 words).
Total letters (excluding spaces): 194 771
Total words: 37 852
My entropy calculations of text based on words:
Voynich - 10.5957 bits
English - 7.9068 bits
Old Latin - 9.8516 bits
My entropy calculations of text based on letters:
Voynich - 3.8689 bits
English - 4.1317 bits
Old Latin - 4.0166 bits
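For anyone who wants to reproduce numbers like these, here is a minimal sketch. It assumes plain first-order Shannon entropy, H = -sum(p * log2(p)), over word or letter frequencies; the post does not say which estimator was actually used, so treat the function names and the choice of formula as assumptions.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """First-order Shannon entropy, in bits per symbol, of an iterable."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def word_entropy(text):
    # Words are whitespace-separated tokens (an assumption about the input).
    return shannon_entropy(text.split())

def letter_entropy(text):
    # Spaces and line breaks are excluded, matching "excluding spaces" above.
    return shannon_entropy(ch for ch in text if not ch.isspace())

sample = "chol daiin chol chol daiin qokeedy"
print(word_entropy(sample))
print(letter_entropy(sample))
```

Keep in mind that word entropy can never exceed log2 of the number of words in the sample, so a 5000-word corpus and a 37852-word corpus are not directly comparable on this measure.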
- The vast majority of distinct words are rare: 6126 (70,54%) of distinct words appear only once, 941 (10,84%) appear twice, and 382 (4,40%) appear three times. Together that's about 85,8% of distinct words that occur at most three times. These rare words make up 24,18% of the entire text. Truly unique words (those that appear exactly once) make up 16,18% of the entire text.
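The two kinds of percentages here are easy to mix up: some are shares of distinct words and some are shares of the running text. A small sketch of how both can be computed from a word list follows (the function name `frequency_spectrum` is made up for illustration and the word list is assumed to be non-empty), together with a cross-check that uses only the counts quoted in this post.

```python
from collections import Counter

def frequency_spectrum(words, max_count=3):
    """For k = 1..max_count, return (k, number of distinct words occurring
    exactly k times, their share of distinct words, their share of the text)."""
    word_counts = Counter(words)
    n_types = len(word_counts)             # distinct words
    n_tokens = sum(word_counts.values())   # running words
    spectrum = Counter(word_counts.values())
    rows = []
    for k in range(1, max_count + 1):
        types_k = spectrum.get(k, 0)
        rows.append((k, types_k, types_k / n_types, k * types_k / n_tokens))
    return rows

sample = "chol daiin chol chol daiin qokeedy shedy".split()
for k, n_types, type_share, token_share in frequency_spectrum(sample):
    print(k, n_types, round(type_share, 4), round(token_share, 4))

# Cross-check of the shares quoted in the post, using only its own counts.
total_words = 37852
rare_tokens = 1 * 6126 + 2 * 941 + 3 * 382
print(round(100 * rare_tokens / total_words, 2))  # 24.18
print(round(100 * 6126 / total_words, 2))         # 16.18
```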
- The most frequent standalone word is "daiin" - 820 times (2,17%). Here are some data regarding "daiin":
(The bottom row lists the most common letter that precedes/follows it in all cases. Where it occurs as a standalone word, I list the words that most commonly precede/follow it - e.g. "chol daiin".)
[attachment=9433]
- The second most frequent standalone word is "ol" - 504 times (1,33%) - but it appears over 5000 times as part of a longer word.
[attachment=9434]
- Word frequency in Voynichese (light blue), 15th-century Latin (yellow) and modern English (dark blue/purple). I believe we already knew that it resembles Zipf's law, but I'll show it anyway. Note: I don't have enough data in other languages to feed my algorithm, which is why the English and Latin curves are so low. The Voynich run worked with 37852 words, the English and Latin runs with only around 5000 words. I'd be happy if someone could provide me with some relevant old texts that I could feed into the algorithm. Let me know!
Independently:
[attachment=9436]
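A rough way to put corpora of different sizes on the same scale is to plot relative frequency (count divided by total words) against rank, and to estimate the slope of the log-log curve. The sketch below is only an illustration under that assumption, not the algorithm used for the graphs above; the function names are hypothetical, and it needs at least two distinct words. A slope near -1 is what Zipf's law predicts.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, relative frequency) pairs, most frequent word first."""
    counts = sorted(Counter(words).values(), reverse=True)
    total = sum(counts)
    return [(rank, c / total) for rank, c in enumerate(counts, start=1)]

def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    pairs = rank_frequency(words)
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy corpus whose counts are exactly proportional to 1 / rank.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(rank_frequency(toy))
print(zipf_slope(toy))
```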
The most common pairs of adjacent words in the text are:
"or aiin" - 51 times
"s aiin" - 44 times
"ar aiin" - 32 times
"chol daiin" - 30 times
"chol chol" - 21 times
"qokeedy qokeedy" - 20 times
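Counts like the ones in this list can be reproduced with a few lines of code. This is a minimal sketch that assumes the transcription is available as a list of text lines and that pairs are only counted within a line, not across line breaks; the post does not say how line breaks were handled, so the totals may differ.

```python
from collections import Counter

def top_word_pairs(lines, n=6):
    """Count adjacent word pairs within each line; return the n most common."""
    pair_counts = Counter()
    for line in lines:
        words = line.split()
        # zip(words, words[1:]) yields every pair of neighbouring words.
        pair_counts.update(zip(words, words[1:]))
    return pair_counts.most_common(n)

lines = ["or aiin chol daiin", "or aiin chol chol"]
for (first, second), count in top_word_pairs(lines):
    print(f'"{first} {second}" - {count} times')
```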
Some of the longest word sequences that appear more than once in the text are:
"ol shedy qokedy qokeedy" - 2 times
"qokeedy qotedy qokeedy" - 2 times
"qokedy qokeedy qokeedy" - 2 times
"qokedy qokedy qokedy" - 3 times
"sheedy qokedy chedy" - 3 times
"chol daiin chkaiin" - 2 times
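Repeated sequences like these can be found by counting every run of a fixed number of consecutive words. The sketch below is an assumption about the method, not a description of what was actually run: it counts overlapping runs separately (so four "qokedy" in a row contain two runs of three) and does not look across line breaks.

```python
from collections import Counter

def repeated_sequences(lines, length, min_count=2):
    """Return {sequence: count} for word runs of the given length
    that occur at least min_count times."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - length + 1):
            counts[" ".join(words[i:i + length])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

lines = [
    "qokedy qokedy qokedy chedy",
    "sheedy qokedy qokedy qokedy",
]
print(repeated_sequences(lines, length=3))
```

Running it for decreasing values of `length` and stopping at the first non-empty result gives the longest repeated sequences.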
Regarding individual letters/symbols:
Frequency analysis of all letters:
[attachment=9437]
Analysis of letter positions - does each letter appear standalone, or in initial, medial or final position?
[attachment=9438]
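A positional breakdown of this kind can be sketched as follows. Note the assumption: every character of the transcription is treated as one letter, which may not match how glyphs written with several transcription characters were handled in the table above.

```python
from collections import Counter, defaultdict

def letter_positions(words):
    """Map each letter to counts of standalone/initial/medial/final use."""
    table = defaultdict(Counter)
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            # A one-letter word counts as standalone only.
            table[word]["standalone"] += 1
            continue
        table[word[0]]["initial"] += 1
        table[word[-1]]["final"] += 1
        for ch in word[1:-1]:
            table[ch]["medial"] += 1
    return table

positions = letter_positions("daiin s ol chol".split())
for letter in sorted(positions):
    print(letter, dict(positions[letter]))
```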
So yeah, these are some of my results - if you have a request or are interested in more, let me know.
If you have any questions, ask.
Feel free to draw your own conclusions in the comments.
Please, if you have any digitized transcripts of original 15th-century texts or similar (any language), share them with me.
