The Voynich Ninja

Full Version: Working my way to a semantic word analysis
Pages: 1 2
Hey folks,
I've been binge-watching Koen's videos on YouTube over the last week, great stuff!
Obviously that means I'm a Voynich novice, but I did use computational linguistics during my PhD at the MPI in Nijmegen, Netherlands, so I couldn't help but dig my fingers into the data :)

I'm not claiming any novelty but I haven't seen the different analysis steps put together in one place so I figured I might as well publish it here. (However, I do think in the end, I have some interesting results that I didn't see anywhere else... but more about this below and in the next post.)
I put the data, scripts and a small analysis report in a dedicated GitHub repository, if somebody wants to have a deeper look: [link]

My main idea for this round is to perform a TF-IDF analysis. This is a statistical tool where you compare how often words occur across the whole text vs. in the individual text segments (pages). With this method, one can (approximately) distinguish "content words" from "function words". Content words are those that are specific to a particular topic, like "Voynich" or "Quantum", while function words are words that show up everywhere, like "the", "of", "and", etc. I think it would be tantalizing to produce a list of Voynich words where we can guess, from section and illustration cues, what they might mean, given where they show up. (Although I don't think it would bring us closer to deciphering the text, it would be fun.)
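For illustration, the core TF-IDF computation can be sketched in a few lines of Python. The page contents below are made-up toy data (the EVA words are real, their placement is not), not actual transcription counts:

```python
import math
from collections import Counter

def tf_idf(pages):
    """pages: dict page_id -> list of words. Returns page_id -> {word: score}."""
    n_pages = len(pages)
    df = Counter()                      # document frequency: pages containing each word
    for words in pages.values():
        df.update(set(words))
    scores = {}
    for page_id, words in pages.items():
        tf = Counter(words)             # term frequency within this page
        scores[page_id] = {w: (tf[w] / len(words)) * math.log(n_pages / df[w])
                           for w in tf}
    return scores

# Toy data: "daiin" occurs on every page (function-word-like, score 0),
# "okedy" only on f1r (content-word-like, positive score).
pages = {
    "f1r": ["daiin", "okedy", "daiin"],
    "f1v": ["daiin", "chol"],
    "f2r": ["daiin", "shedy"],
}
scores = tf_idf(pages)
```

A word like "daiin" that appears on every page gets a score of exactly zero (log of 1), while a page-specific word gets a positive score, which is exactly the content/function separation the method is after.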

Down the line, maybe I have time to produce a visual tool where people can explore how words cluster in certain portions of the text. Not quite as fancy as the amazing tool on voynichese.com but in the same spirit.
I'm currently getting into working with LLMs (building them, not talking to ChatGPT) and I am very curious whether one can use these tools to identify semantic clusters of Voynich words. TBD.

I obviously haven't read everything there is on the Voynich, but I did my best to go through voynich.nu and Bowern and Lindemann (2020), as well as the latest and pinned posts on this forum, to get a base understanding of what's commonly known and what's currently under discussion. I'm looking forward to learning more from you.

I want to start by stating my base preconceptions/assumptions when I went into the analysis, as well as some questions that maybe you can help me with.

Assumptions:

1. The text is real in the sense that somebody in the 15th century wrote something down to communicate information to somebody else.
2. The transcription is reasonably good and conveys the textual content of the VMS to an overwhelming degree, so we can base an analysis on it.
3. The words are words in the sense that they can, through translation, combination, compression, augmentation by auxiliary information, or some other process, be rendered into a language that someone at some time spoke. If there is a cipher, it did not jumble words by moving word boundaries or similar shenanigans.
4. Letters are only meaningful with regards to the manuscript itself. They cannot be identified in a one-to-one manner with any language.
5. The manuscript was written by several scribes/authors, possibly at different times, possibly without knowing each other. The known separations are Currier A and B as well as the 5 hands (Lisa Fagin Davis, 2020).
6. There is no hope of me ever decrypting the text by myself since I have none of the necessary skills to actually understand any language that the authors spoke, even less the manuscript itself.


Questions:
1. The whole analysis is based on IT2a-n.txt from [link]. Is this the correct choice? As far as I understand, it's a version of the TT transcription, but I don't know what the state of the art is. I noticed that the transcription used on voynichese.com is different and in some places more complete.
2. How is the interaction between "fan groups" like this one and the academic community? From what I saw in the videos, there is a fairly good collaboration, but I still wonder. I know that for some topics that garner so much public interest, there can be a lot of tension.
3. I had trouble finding a "definitive" distinction of pages into Currier A and B. I don't know if that's because it is not fully defined, if there is disagreement or if I just didn't look at the right places.

Base results:
Before we get to the good stuff, I want to post the base analyses as a sanity check, but also so they are all in one spot. As I said, these are all well-known results, but it's good for me to see the data myself, and maybe for others too.

I split up all the analysis steps by Currier A and B. Going in, I did not have any idea how close both languages are. My initial assumption was actually that they are as different as German and Latin. These stats helped me understand it better.

[EDIT: I made a mistake in my Currier A/B separation for these plots. Corrected plots in my reply on page 2.]

1. The word length plots for Currier A and B with the distributions for 4 reference languages. (I just chose 4 languages that were easily accessible to me.)
[Image: word_length_stats.png?raw=true]
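The word-length distributions behind plots like this reduce to a few lines; a minimal sketch with a handful of EVA words as toy input:

```python
from collections import Counter

def length_distribution(words):
    """Relative frequency of word lengths, comparable across corpora."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}

print(length_distribution(["daiin", "chol", "shedy", "ol"]))
# {2: 0.25, 4: 0.25, 5: 0.5}
```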

2. The known Zipf distribution of word frequencies with reference languages
[Image: Zipf_stats.png?raw=true]
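For reference, the rank-frequency pairs behind such a Zipf plot are cheap to compute; a minimal sketch (the actual log-log plotting, e.g. with matplotlib, is left out):

```python
from collections import Counter

def zipf_ranks(words):
    """(rank, frequency) pairs sorted by descending frequency.
    Natural-language text gives a roughly straight line on log-log axes."""
    counts = Counter(words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

print(zipf_ranks("daiin chol daiin shedy chol daiin".split()))
# [(1, 3), (2, 2), (3, 1)]
```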

3. Bigram heatmap.
This one was interesting to me because it shows a very close correspondence between Currier A and B. I expected a much bigger variation.
[Image: bigram_heatmap_a_b.png]

As a reference, I looked at the bigram statistics of the reference languages, and one can see that they vary much more from each other than Currier A and B do.
[Image: bigram_heatmap_ref_shared.png]
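A sketch of how such a character-bigram matrix can be built; the alphabet and the comparison metric here are illustrative choices, not necessarily the ones used for the plots above:

```python
import numpy as np

def bigram_matrix(words, alphabet):
    """Character-bigram count matrix: rows = first char, cols = second char."""
    idx = {c: i for i, c in enumerate(alphabet)}
    m = np.zeros((len(alphabet), len(alphabet)))
    for w in words:
        for a, b in zip(w, w[1:]):
            if a in idx and b in idx:
                m[idx[a], idx[b]] += 1
    return m

# Two matrices normalized to sum 1 can then be compared, e.g. via
# np.abs(m_a / m_a.sum() - m_b / m_b.sum()).sum()  (total-variation-style distance)
m_demo = bigram_matrix(["chol", "chor"], "chlor")
```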

4. Word start and end bigrams/trigrams
The bigrams and word-initial trigrams did not show that much irregularity, but the word-end trigrams clearly show the famous -edy ending for Currier B.
[Image: word_end_trigrams_a_b.png]

What did surprise me is that the -edy ending is also among the most common endings in Currier A. From what I read and saw, I assumed that it is almost exclusive to Currier B. Does that mean that I (a) simply misunderstood, or (b) chose the wrong page split between Currier A and B?
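Counting word-final trigrams is essentially a one-liner; a small sketch with a few EVA words as toy input:

```python
from collections import Counter

def end_trigrams(words):
    """Count word-final trigrams; words shorter than 3 characters are skipped."""
    return Counter(w[-3:] for w in words if len(w) >= 3)

print(end_trigrams(["chedy", "shedy", "qokeedy", "daiin", "ol"]).most_common(2))
# [('edy', 3), ('iin', 1)]
```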

My current split is defined by the code below. Input is very welcome.

Code:
CURRIER_A_RANGES = [("f1r", "f24v"), ("f31r", "f31v"), ("f88r", "f90v1"), ("f100r1", "f116r")]
CURRIER_A_SINGLES = ["f25r", "f25v", "f32r", "f32v", "f33r", "f34r", "f34v", "f67r2", "f67v1", "f67v2", "f91v"]
CURRIER_B_RANGES = [("f26r", "f30v"), ("f35r", "f39v"), ("f75r", "f84v"), ("f93r", "f96v")]
CURRIER_B_SINGLES = ["f68r1", "f68r2", "f68v1", "f68v2"]
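A small helper that resolves a folio to A/B from ranges like these could look as follows. FOLIO_ORDER is a hypothetical tiny excerpt for illustration; in practice it would be the full folio sequence taken from the transliteration file:

```python
# Illustrative subset of the ranges above, plus a made-up folio ordering.
CURRIER_A_RANGES = [("f1r", "f3v")]
CURRIER_B_RANGES = [("f26r", "f30v")]
FOLIO_ORDER = ["f1r", "f1v", "f2r", "f2v", "f3r", "f3v", "f26r", "f26v", "f30v"]

def currier_language(folio):
    """Return 'A', 'B', or None for a folio, using positional range checks."""
    pos = {f: i for i, f in enumerate(FOLIO_ORDER)}
    for ranges, lang in ((CURRIER_A_RANGES, "A"), (CURRIER_B_RANGES, "B")):
        for start, end in ranges:
            if pos[start] <= pos.get(folio, -1) <= pos[end]:
                return lang
    return None  # unassigned or unknown folio

print(currier_language("f2r"))  # A
```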



I'll leave it at that for now. I'm curious how the interaction in the community here works and whether I'll hear from anybody. I'm still preparing the plots for the TF-IDF analysis; as I said, I actually think they are quite interesting. I will add them as a reply when I'm done.
Until then, cheerio,
Marvin.
This is cool, I have seen these graphs individually around the forum but never put together like this, so it's good to have them all in one place. The binomial distribution of Voynichese is well shown by Graph 1; I think any "solution" needs to explain this phenomenon.

On the A vs B split, I have usually used the scribal hands to identify the language; there's the "How many scribes" paper by Lisa Fagin Davis that has a complete list of folios by language and scribe. Scribe 1 writes in Voynichese A, the others in B, if I recall correctly. Good luck with the work, it looks very interesting!
Hi and welcome!

Just a few comments about the assumptions you listed.

(12-12-2025, 01:51 PM)mxv456 Wrote: 3. The words are words in the sense that they can, through translation, combination, compression, augmentation by auxiliary information, or some other process, be rendered into a language that someone at some time spoke. If there is a cipher, it did not jumble words by moving word boundaries or similar shenanigans.

Ouch. Especially given that the word boundaries in the manuscript vary a lot, I don't think this is a reasonable assumption at all. Even if the cipher didn't jumble words, the scribe appeared to anyway.

(12-12-2025, 01:51 PM)mxv456 Wrote: 6. There is no hope of me ever decrypting the text by myself since I have none of the necessary skills to actually understand any language that the authors spoke, even less the manuscript itself.

Even if you don't understand the language the authors spoke, it's very likely that ChatGPT will. In transcription/translation tasks, modern AIs appear quite competent. Not super professional, but certainly good enough to help with a well-known historical language.
(12-12-2025, 02:45 PM)oshfdk Wrote: Even if you don't understand the language the authors spoke,

But we don't know that the language the authors spoke is also reflected in the text of the MS....
(12-12-2025, 02:50 PM)ReneZ Wrote:
(12-12-2025, 02:45 PM)oshfdk Wrote: Even if you don't understand the language the authors spoke,

But we don't know that the language the authors spoke is also reflected in the text of the MS....

True. But what I mean is that ChatGPT will likely be able to help with the language part, if there is some well-known historical language lurking in the MS.
(12-12-2025, 01:51 PM)mxv456 Wrote: My current split is defined by the code below. Input is very welcome.

You can get the Currier "language" from the $L header in the IT, ZL, RF, etc. transliterations:
Code:
f1r  A
f1v  A
f2r  A
f2v  A
f3r  A
f3v  A
f4r  A
f4v  A
f5r  A
f5v  A
f6r  A
f6v  A
f7r  A
f7v  A
f8r  A
f8v  A
f9r  A
f9v  A
f10r A
f10v A
f11r A
f11v A
f13r A
f13v A
f14r A
f14v A
f15r A
f15v A
f16r A
f16v A
f17r A
f17v A
f18r A
f18v A
f19r A
f19v A
f20r A
f20v A
f21r A
f21v A
f22r A
f22v A
f23r A
f23v A
f24r A
f24v A
f25r A
f25v A
f26r B
f26v B
f27r A
f27v A
f28r A
f28v A
f29r A
f29v A
f30r A
f30v A
f31r B
f31v B
f32r A
f32v A
f33r B
f33v B
f34r B
f34v B
f35r A
f35v A
f36r A
f36v A
f37r A
f37v A
f38r A
f38v A
f39r B
f39v B
f40r B
f40v B
f41r B
f41v B
f42r A
f42v A
f43r B
f43v B
f44r A
f44v A
f45r A
f45v A
f46r B
f46v B
f47r A
f47v A
f48r B
f48v B
f49r A
f49v A
f50r B
f50v B
f51r A
f51v A
f52r A
f52v A
f53r A
f53v A
f54r A
f54v A
f55r B
f55v B
f56r A
f56v A
f57r B
f57v
f58r A
f58v A
f65r
f65v
f66r B
f66v B
f67r1
f67r2
f67v2
f67v1
f68r1
f68r2
f68r3
f68v3
f68v2
f68v1
f69r
f69v
f70r1
f70r2
f70v2
f70v1
f71r
f71v
f72r1
f72r2
f72r3
f72v3
f72v2
f72v1
f73r
f73v
f75r B
f75v B
f76r B
f76v B
f77r B
f77v B
f78r B
f78v B
f79r B
f79v B
f80r B
f80v B
f81r B
f81v B
f82r B
f82v B
f83r B
f83v B
f84r B
f84v B
f85r1 B
f85r2 B
fRos B
f86v4 B
f86v6 B
f86v5 B
f86v3 B
f87r A
f87v A
f88r A
f88v A
f89r1 A
f89r2 A
f89v2 A
f89v1 A
f90r1 A
f90r2 A
f90v2 A
f90v1 A
f93r A
f93v A
f94r B
f94v B
f95r1 B
f95r2 B
f95v2 B
f95v1 B
f96r A
f96v A
f99r A
f99v A
f100r A
f100v A
f101r A
f101v A
f102r1 A
f102r2 A
f102v2 A
f102v1 A
f103r B
f103v B
f104r B
f104v B
f105r B
f105v B
f106r B
f106v B
f107r B
f107v B
f108r B
f108v B
f111r B
f111v B
f112r B
f112v B
f113r B
f113v B
f114r B
f114v B
f115r B
f115v B
f116r B
f116v
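For anyone who wants to pull this out programmatically, a sketch that extracts $L from IVTFF-style page headers. The header layout assumed here (e.g. "<f1r>  <! $Q=A $P=A $L=A $H=1>") follows the IVTFF convention of variable assignments after the page tag, but the exact variable set varies between files, so treat the regex as an assumption:

```python
import re

# Match a page tag like <f1r> followed by a comment containing $L=A or $L=B.
PAGE_HEADER = re.compile(r"^<(f[^>]+)>\s+<!.*\$L=([AB])")

def currier_from_ivtff(lines):
    """Map each folio to its Currier language, skipping pages without $L."""
    langs = {}
    for line in lines:
        m = PAGE_HEADER.match(line)
        if m:
            langs[m.group(1)] = m.group(2)
    return langs

sample = ["<f1r>  <! $Q=A $P=A $L=A $H=1>",
          "<f26r>  <! $Q=D $P=B $L=B $H=2>"]
print(currier_from_ivtff(sample))  # {'f1r': 'A', 'f26r': 'B'}
```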
Welcome to the forum! 
Currier A vs. B is not one of my areas of expertise, but I just want to compliment you on the clear presentation.
[attachment=12921]

Quote:But we don't know that the language the authors spoke is also reflected in the text of the MS....

That was good.
We never know that in a patent specification either.
Excerpt from a patent description I'm currently working on.
(12-12-2025, 01:51 PM)mxv456 Wrote: 1. The word length plots for Currier A and B with the distributions for 4 reference languages. (I just chose 4 languages that were easily accessible to me.)
[Image: word_length_stats.png?raw=true]

2. The known Zipf distribution of word frequencies with reference languages
[Image: Zipf_stats.png?raw=true]

Nice plots! 

I agree with your assumptions about word spaces etc. But maybe you should widen the range of languages a bit:

[link]

My [link]

You will find texts in a few other languages [link]. Look for files "main.wds", and take the lines that begin with "a ". The comments at the top of each file describe the source, language, spelling, encoding, etc. If you need help, do ask.

All the best, --stolfi
Hi and welcome to the forum!

Everything that you write makes sense. Your assumptions may or may not be correct, but that's exactly how assumptions work: you assume them, run some tests, and they are either confirmed or denied. Or there is the third case where the results are inconclusive, which is unfortunately quite common in the area of the VM.

Quote:With this method, one can (approximately) distinguish "content words" from "function words". Content words are those that are specific to a particular topic, like "Voynich" or "Quantum", while function words are words that show up everywhere, like "the", "of", "and", etc.
Here you are making a hidden assumption that Voynichese has content and function words. In the general case this doesn't have to be true, for example if the text is gibberish or if function words are glued to content words (agglutination).

I believe similar tests were done before. Such methods will always return "something", like two lists of words. But will these lists make sense, or will they be a statistical artifact? Hard to say. Unfortunately, this approach hasn't led us any further so far, but maybe your results will be different.

Quote:Down the line, maybe I have time to produce a visual tool where people can explore how words cluster in certain portions of the text.
I would like to have something like this. Make it user-friendly please: not some script run from a black window with a dozen parameters, but a true windowed application with buttons and menus.