The Voynich Ninja

Full Version: Some statistics and analysis regarding VMS (words & letters)
So I did some frequency analysis of the text. I'm still processing most of it, but I wanted to share some stats I got. It's possible some of this data has been posted or computed before by someone else, but I couldn't find it.

The data might not be 100% precise due to transcription errors. I used the Landini-Stolfi transcription. I have only a limited dataset of old Latin and English (only around 5000 words).

Total letters (excluding spaces): 194 771
Total words: 37 852

My entropy calculations of text based on words:
Voynich - 10.5957 bits
English - 7.9068 bits
Old Latin - 9.8516 bits

My entropy calculations of text based on letters:
Voynich - 3.8689 bits
English - 4.1317 bits
Old Latin - 4.0166 bits
  • The vast majority of distinct words are rare: 6126 (70,54%) appear only once, 941 (10,84%) appear twice, and 382 (4,40%) three times. That's 85% of distinct words that appear at most three times. These rare words make up 24,18% of the entire text. Words that appear exactly once make up 16,18% of the entire text.
  • The most frequent standalone word is "daiin" - 820 times (2,17%). Here is some data regarding "daiin":
(The top row shows whether the string "daiin" appears as a standalone word or as part of a different word. If it's part of a different word, I list whether it is in initial, medial or final position.
The bottom row shows the most common letter that precedes/follows it in each case. If the word is standalone, I list the words that commonly precede/follow it - e.g. "chol daiin".)
[attachment=9433]
  • The second most frequent standalone word is "ol" - 504 times (1,33%) - but it appears over 5000 times as part of a longer word.

[attachment=9434]
  • Word frequency in Voynichese (light blue), 15th-century Latin (yellow), and Modern English (dark blue/purple). I believe we already knew that it resembles Zipf's law, but I'll show it anyway. Note: I don't have sufficient data for other languages to feed my algorithm, which is why English and Latin are so low. My Voynich algorithm worked with 37852 words, English and Latin with around 5000 words. I'd be happy if someone provided me with some relevant old texts that I could feed into the algorithm. Let me know!
[attachment=9435]
Independently:
[attachment=9436]
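
For anyone who wants to reproduce the once/twice/thrice counts and the rank-frequency curve above, here is a minimal Python sketch. It is not the code used for the figures; the function names are made up for illustration, and it assumes the transcription has already been split into a plain list of words.

```python
from collections import Counter

def frequency_spectrum(words):
    """Return (word -> count, count -> number of distinct words with that count)."""
    type_counts = Counter(words)
    spectrum = Counter(type_counts.values())
    return type_counts, spectrum

def rare_word_share(words, max_count=3):
    """Share of distinct words, and of running text, made up by words
    that occur at most max_count times."""
    type_counts, spectrum = frequency_spectrum(words)
    rare_types = sum(n for c, n in spectrum.items() if c <= max_count)
    rare_tokens = sum(c * n for c, n in spectrum.items() if c <= max_count)
    return rare_types / len(type_counts), rare_tokens / len(words)

# Toy input, not real transcription data.
sample = "daiin ol daiin chol shedy ol daiin qokeedy".split()
type_counts, spectrum = frequency_spectrum(sample)
print(type_counts.most_common())   # rank-frequency list (what a Zipf plot shows)
print(rare_word_share(sample, max_count=1))
```

The two shares correspond to the two kinds of percentage quoted above: one relative to the number of distinct words, the other relative to the total word count.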

The most common combinations of words that appear in the text are:
"or aiin" - 51 times
"s aiin" - 44 times
"ar aiin" - 32 times
"chol daiin" - 30 times
"chol chol" - 21 times
"qokeedy qokeedy" - 20 times

Some of the longest strings that appear multiple times in the text are:
"ol shedy qokedy qokeedy" - 2 times
"qokeedy qotedy qokeedy" - 2 times
"qokedy qokeedy qokeedy" - 2 times
"qokedy qokedy qokedy" - 3 times
"sheedy qokedy chedy" - 3 times
"chol daiin chkaiin" - 2 times
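
Counts like the word pairs and repeated strings above can be obtained by sliding a window of n words over the text. The sketch below is only an illustration with a hypothetical function name; it treats the text as one continuous stream of words, so sequences may run across line or page breaks, which may or may not match how the numbers above were produced.

```python
from collections import Counter

def repeated_ngrams(words, n, min_count=2):
    """Return (sequence, count) for every run of n consecutive words
    that occurs at least min_count times, most frequent first."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_count]

# Toy input, not real transcription data.
words = "or aiin chol daiin or aiin chol chol".split()
print(repeated_ngrams(words, 2))
print(repeated_ngrams(words, 3))
```

To look for the longest repeated strings, one can call it with increasing n until the result comes back empty.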



Regarding individual letters/symbols:


Frequency analysis of all letters:
[attachment=9437]

Analysis of letters - do they appear standalone, initial, medial or as final?
[attachment=9438]
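
A minimal sketch of how such a standalone/initial/medial/final tally could be computed, for anyone who wants to try it on another transcription. The function name is hypothetical, and it assumes every transcription character counts as one letter (so EVA "ch" or "sh" would be two letters here).

```python
from collections import Counter, defaultdict

def letter_positions(words):
    """Count, for each letter, how often it is standalone, initial, medial or final."""
    table = defaultdict(Counter)
    for word in words:
        last = len(word) - 1
        for i, letter in enumerate(word):
            if last == 0:
                slot = "standalone"   # one-letter word
            elif i == 0:
                slot = "initial"
            elif i == last:
                slot = "final"
            else:
                slot = "medial"
            table[letter][slot] += 1
    return table

# Toy input, not real transcription data.
table = letter_positions(["y", "dy", "daiin"])
print(dict(table["y"]))
print(dict(table["d"]))
```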

So yeah, these are some of my results - if you have requests or are interested in more, let me know.

If you have any questions, ask.

Feel free to draw some of your conclusions in the comments.

Please, if you have some digitized transcripts of original 15th century texts or similar (any language), share them with me.
I attach the corpus I made 4 years ago. Please note:

* I assembled this initially for personal use. There is no proper attribution or even notes of where I got the text from. Do not redistribute etc etc.
* The texts have been pre-processed to remove punctuation, capitalization etc. However, I'm not sure if this happened properly for non-Latin scripts like Greek and Slavic languages. Some texts likely still contain unwanted artefacts. Treat these with caution.

Thanks. I need your authorization to open the file; the request should be in your email.

What do you think about this analysis?
Ah yes, I forgot to change permissions on the file, it should work now. 

Your analysis is good at first glance, it summarizes a lot of Voynichese's properties. 

Which entropy stat did you calculate exactly to get those numbers? (I am personally almost exclusively familiar with conditional character entropy, where the contrast between Voynichese and a randomly selected medieval text is quite stark.)

Something I haven't personally paid much attention to before is the longest repeated word strings. It would be interesting if you could compare the corpus to Voynichese. (Some form of normalization for text length will be necessary here).
(23-11-2024, 01:41 PM)Koen G Wrote: Which entropy stat did you calculate exactly to get those numbers? (I am personally almost exclusively familiar with conditional character entropy, where the contrast between Voynichese and a randomly selected medieval text is quite stark.)

I used the Shannon entropy formula. It's simpler than conditional character entropy (CCE) and measures something different.

Take the word string "fachys ykal ar ataiin shol shory ctoses y kor sholdy":

Shannon entropy represents the uncertainty of each letter independently. Here it is 3.64 bits.
Conditional entropy represents the uncertainty of each letter given the previous one. Here it is 2.75 bits.
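
To make the difference concrete, here is a minimal Python sketch of both measures using the standard plug-in estimates (relative frequencies in place of probabilities). The function names are illustrative, not taken from any existing script. Choices such as whether spaces are counted as symbols, or whether letter pairs may cross word boundaries, are assumptions here and change the result, so the output need not match the figures quoted above exactly.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """H = -sum(p * log2(p)) over the relative frequencies of the symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(symbols):
    """H(next | previous) = H(pair) - H(previous), estimated from adjacent pairs."""
    symbols = list(symbols)
    pairs = list(zip(symbols, symbols[1:]))
    firsts = [a for a, _ in pairs]
    return shannon_entropy(pairs) - shannon_entropy(firsts)

text = "fachys ykal ar ataiin shol shory ctoses y kor sholdy"
letters = text.replace(" ", "")        # assumption: spaces are not symbols

print(shannon_entropy(letters))        # letter-level entropy
print(conditional_entropy(letters))    # letter given the previous letter
print(shannon_entropy(text.split()))   # same formula applied to whole words
```

The same `shannon_entropy` function works on letters or on words, since it only needs a sequence of hashable symbols.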
Here is a graphical representation of distribution of individual letters within individual words of VMS.

[attachment=9441]
I like your presentation. It is really neat.

Do you have any observations that seem valuable to you?

And do you have any working hypothesis what this text could be?
@argo2001
In post#6, "graphical representation of distribution of individual letters within individual words of VMS."

What is the rationale for the X-axis only going to 5 ?
(23-11-2024, 10:33 PM)RobGea Wrote: In post#6, "graphical representation of distribution of individual letters within individual words of VMS."

What is the rationale for the X-axis only going to 5 ?
The rationale is the limitation of my coding skills.

In my initial attempt, I tried to extract the position of each individual letter within the words. However, this approach caused issues because words vary in length. As a result, the positions of letters in shorter words were skewed when compared to longer words, which led to inconsistent or misleading results. This discrepancy in word lengths made the letter positions unreliable for analysis.

So I created an algorithm that separates each word into five (roughly equal) parts. This way all the words can be represented.
If the word has 2 letters, it's split in the middle (the word dy becomes d/y) and the letters are counted in 1 and 5.
If the word has 3 letters, the letters are counted in 1, 3 and 5 respectively.
If the word has 4 letters (qoty -> q/o/t/y), q is counted in 1, o+t are counted in 2, 3, 4, and y in 5.
5 letters is self-explanatory.
If the word has more letters, it's divided into fifths (te/od/a/ro/dy) and the parts are counted in 1 to 5 respectively.

Yeah, it's not perfect. But I can't come up with a better idea for how to program it.

I'm open to any solutions, haha!
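
One possible way to program this kind of binning is to scale each letter's index to the range 1 to 5 and round to the nearest bin. This is only a sketch of an alternative, not the algorithm used for the graph: the names are hypothetical, it reproduces the 2-, 3- and 5-letter cases described above, but it places letters differently for other lengths (for example, a 4-letter word lands in bins 1, 2, 4, 5). Skipping one-letter words is also an assumption.

```python
from collections import Counter, defaultdict

N_BINS = 5  # number of position bins on the X-axis

def position_bins(word):
    """Map each letter of a word to a bin 1..N_BINS by relative position.
    First letter -> bin 1, last letter -> bin N_BINS, the rest rounded to nearest."""
    n = len(word)
    if n < 2:
        return []                      # assumption: one-letter words are skipped
    span = N_BINS - 1
    # Integer form of round-half-up(span * i / (n - 1)) + 1, avoiding float issues.
    return [(2 * span * i + (n - 1)) // (2 * (n - 1)) + 1 for i in range(n)]

def bin_profile(words):
    """Count how often each letter falls into each position bin."""
    table = defaultdict(Counter)
    for word in words:
        for letter, b in zip(word, position_bins(word)):
            table[letter][b] += 1
    return table

for w in ["dy", "qoty", "daiin", "teodarody"]:
    print(w, position_bins(w))
```

Changing `N_BINS` gives a finer or coarser X-axis, although words shorter than the number of bins will then leave some bins empty.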
Frequency analysis of letters in medieval European languages compared to VMS

After processing Koen's corpus, I have processed a total of 21,996,439 individual characters extracted from 112 medieval texts spanning various topics.

The dataset for Slavic languages is the least reliable due to its relatively small size (405,778 characters), limited topical diversity, and the inclusion of mixed East Slavic languages. The majority of these texts are written in Old Church Slavonic and Old East Slavic, primarily from the 1200s. In contrast, the largest datasets pertain to Latin, English, and Italian, each exceeding 5,000,000 characters.

For context, the Voynich Manuscript, transcribed in EVA, contains 194,627 characters.

Some of my results are presented graphically below, some are in attachments. The Y-axis represents the percentage of each letter's contribution to the total character count. The X-axis displays the most frequently occurring letters, arranged from the most common on the left to the least common on the right.
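
For reference, the percentages plotted this way can be computed as in the following minimal sketch. The function name is made up, and filtering with `str.isalpha()` after lowercasing is an assumption about the pre-processing, not a description of what was actually done to the corpus.

```python
from collections import Counter

def letter_percentages(text):
    """Return (letter, percent of all letters), most common letter first."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return []
    counts = Counter(letters)
    return [(ch, 100 * c / total) for ch, c in counts.most_common()]

# Toy input, not real corpus data.
for letter, pct in letter_percentages("daiin ol chol daiin"):
    print(letter, round(pct, 2))
```

Running it on only the first letter of each word gives the initial-letter ranking in the same format.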

Voynich compared to Germanic languages.
[attachment=9452]

Voynich compared to Romance languages.
[attachment=9454]

Based on the similarity to Romance languages, mainly Latin, I've decided to determine the most common initial letters of all words in German and Latin, comparing these results both with the VMS and between German and Latin.
[attachment=9456]

I'm not a linguist, but based on this striking similarity, is it safe to assume that if the VMS is some kind of substitution cipher, it is based on a Romance language?

There are a few more graphs in the attachments.

Any ideas, questions or opinions?