The Voynich Ninja

I spent some time with the manuscript over the festive break. I have an idea as to what I think is happening with the text, but I'll set that aside for the end. This write-up is a byproduct of the overall investigation; on its own it might be useful to people following their own research paths.

The experiment:

Can we establish a correlation between folios and sections based on the number of words they share?

The method:

I wrangled a Python script that goes through each folio, takes each individual word (longer than 3 characters) and counts how many times it appears in each other folio. Because different folios have different numbers of words, I divide the final score by the number of words in the comparison page. The idea is to balance the count against the event opportunity (although I admit this is likely not a *clean* scoring method).
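
A minimal sketch of the scoring as described above, for anyone who wants to experiment. It is not the actual script: the function name `folio_score`, the toy word lists, and the choice to use each distinct word of the base folio once are all assumptions.

```python
from collections import Counter

MIN_LEN = 3  # only words longer than this are used (assumed reading of the rule)

def folio_score(base_words, comp_words):
    """Count how often the base folio's distinct long words occur in the
    comparison folio, divided by the comparison folio's word count."""
    if not comp_words:
        return 0.0
    comp_counts = Counter(comp_words)
    base_vocab = {w for w in base_words if len(w) > MIN_LEN}
    hits = sum(comp_counts[w] for w in base_vocab)
    return hits / len(comp_words)

# Toy word lists for illustration only, not real transcription data.
page_a = ["daiin", "chol", "shol", "ar"]
page_b = ["daiin", "daiin", "chor", "shol"]

print(folio_score(page_a, page_b))
print(folio_score(page_b, page_a))
```

Under this reading the score for (a, b) need not equal the score for (b, a), because each direction is divided by a different page length.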

The output & processing:

The output is a big CSV file with one row and one column per folio, holding the score for each folio pair. I imported this into a spreadsheet so that I could do further analysis, and formatted the cells so that they are colored on a gradient based on how far above the average they are.
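
A rough sketch of this output step, assuming a stand-in score (`shared_fraction`) and toy word lists. The function names and the choice to leave the diagonal at zero are assumptions, and the above-average flags only approximate what the spreadsheet gradient shows.

```python
import csv
import io

def shared_fraction(a, b):
    # Assumed stand-in score: shared distinct words over combined distinct words.
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return len(sa & sb) / total if total else 0.0

def score_matrix(folios):
    # One row and one column per folio; a folio is not scored against itself.
    names = list(folios)
    matrix = [[0.0 if a == b else shared_fraction(folios[a], folios[b])
               for b in names] for a in names]
    return names, matrix

def to_csv(names, matrix):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["folio"] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + [f"{v:.3f}" for v in row])
    return buf.getvalue()

def above_average(matrix):
    # Flag cells whose score is above the mean of all off-diagonal cells.
    off_diag = [v for i, row in enumerate(matrix)
                for j, v in enumerate(row) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[i != j and v > mean for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]

# Toy word lists for illustration only, not real transcription data.
folios = {
    "page_a": ["daiin", "chol", "shol"],
    "page_b": ["daiin", "chor", "shol"],
    "page_c": ["qokedy", "shedy", "daiin"],
}
names, matrix = score_matrix(folios)
csv_text = to_csv(names, matrix)
flags = above_average(matrix)
print(csv_text)
```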

Some advance warning: this method of displaying the results creates an implicit symmetry between an entry and its counterpart. Basically everything is mirrored across the diagonal, and this will create patterns that don't exist. The output spreadsheet looks like the following:

[Image: AP1GczNAoeoFK9D-S74M8UNov2z0ridcCM9zV1EO...authuser=0]


Starting from the top left corner F1r.
The darker the red color the higher above the average score.
The bordered lines are quires.
The gaps in the column and row headers are the missing folios.

Observations:

We can see a very clear correlation between two sections in the bottom right; these are exactly the bathing and recipe sections.
Both the bathing section and the recipe section have a very distinct correlation with plants in the second half of the herbal section.
These same plant folios have a strong correlation with the preparation section as well as the bathing and recipe sections.
The second clear block is the plant section itself. We see clear correlation between all entries, and perhaps an indication that the last 3 quires of this section are somehow more tightly correlated than the previous 4.

The quires wrap very nicely to these... "islands of correlation", we can see the borders of the quires outlining the sections.

A nice example of this is the "vine" plant that appears in f17v, f96v, f99. It's correlated through theme and word correlation.

How does this help:

It helps us isolate sections to work on. There is a lot of text in the manuscript and reducing the attack surface can help.
We can also use this to validate folio and quire order. For example, I've always been confused by the location of quire 19 and suspected it belonged somewhere else, but looking at its position on the sheet it shares the same correlations as its neighbors, the bathing section and the recipe section, meaning it likely belongs where it is.

The speculation part:

The current status of the "is it a cipher or not" debate boils down to either "no, it's the product of a generative method that produces pseudo-language" or "yes, but we don't know how or what" or "something else". If this research shows anything, it shows that the manuscript text shares a logical correlation with its pictographic themes and even its physical structure (the quires). If this is pseudo-language then it follows themes and relations. Also, I don't think these concepts are mutually exclusive; I think you absolutely can use a generative method as an encoding or enciphering mechanism. If I were to state it in a simple way it would be, "you hide a needle in a bunch of haystacks, but you're going to need to manufacture those haystacks". I think the generative mechanism takes words as a seed and produces pseudo-language as an obscured output. I also think there is a sort of pun going on here: it's not just the text being generated, the characters and plants are too, which is why we see so many unrecognisable plants.

The plants are gathered, cut up, thrown into a device and turned into something new, like the words.

If I had to say anything, my theory is that a generative mechanism turns a word into many words, perhaps letter by letter. It adds prefixes and suffixes, modifies letters, adds false roots, etc., so that we end up with what we have: a text poor in individual characters and overburdened with words.

I think the generative mechanism is shown in [link] and is based on a solar quadrant ([link]).
I think the female characters in the center are showing how a word is modified, what that process is.
I think the second outermost circle (containing the repeating pattern of single characters) manages the substitutions.
I think the contents of the generative mechanism are defined by the astral/zodiac section, allowing it to be adjusted based on themes.
I think some settings for the mechanism are indicated by the specific female character and their association to a star, or the star present.
I think we have some working notes in the margins of [link] and [link] that are further clues to the process.
I think sections sharing an above average correlation are encrypted using the same scheme/settings.

What next:
I'd like to focus on identifying the core word roots and re-scoring everything based on that. I think that would help identify how the prefixes and suffixes are defined based on the mechanism state/settings.
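
A very rough sketch of what root-based re-scoring could look like. The affix lists below are hypothetical placeholders, not an established analysis of the text, and `root` and `root_score` are made-up names; the real work is deciding what the affixes actually are.

```python
# Hypothetical affix lists, for illustration only.
PREFIXES = ("qo", "o")
SUFFIXES = ("aiin", "dy", "y")

def root(word):
    """Strip at most one assumed prefix and one assumed suffix."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            word = word[:-len(s)]
            break
    return word

def root_score(base_words, comp_words):
    # Shared distinct roots over the combined number of distinct roots.
    base = {root(w) for w in base_words}
    comp = {root(w) for w in comp_words}
    total = len(base) + len(comp)
    return len(base & comp) / total if total else 0.0

# Toy word lists for illustration only, not real transcription data.
page_a = ["qokedy", "chol"]
page_b = ["okedy", "shol"]

print(root("qokedy"), root("okedy"))
print(root_score(page_a, page_b))
```

In this toy case the two pages share no whole words, but they do share a root once the assumed affixes are stripped.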

Notes:

I added the missing folios and quires based on [link].
The solar quadrant was taken from a reference in the interesting academic paper "The Voynich Manuscript as a Manual for the Habsburgs" by [link].
You can get a PDF of the spreadsheet here: [link]
You can get the spreadsheet itself here: [link]
As is the way with such things, it turns out ReneZ had done this research prior. Head to [link] for more correlation goodness.

Your result looks very much like mine!

Which transliteration (text) of the MS did you use?

(28-01-2025, 11:11 AM)ReneZ Wrote: Your result looks very much like mine!

Which transliteration (text) of the MS did you use?

Thank you, getting confirmation from someone else eases my fears that I screwed something up. Can you link me your research? I'm super interested in others' approaches and interpretations.

I used the dataset from [link], which credits:
"The EVA transcriptions used in the Voynichese project were obtained from the [link] with contributions from different authors, including Takeshi Takahashi and Jorge Stolfi.

The classification of folios into [link] is derived from the work of Capt. Prescott Currier, also extracted using the [link]."

(28-01-2025, 10:26 AM)008348dc760f858fd668476b75fb6f Wrote: I wrangled a python script that goes through each folio, takes each individual word and counts how many times it appears in any other folio. To balance the fact that different folios will have a different number of words I divide the final score by the number of words in the comparison page. The idea is that we are balancing the count against the event opportunity (although I admit this is likely not a *clean* scoring method).

I wonder whether the result is largely just the geometric mean of the number of word tokens on both pages.

The manuscript is dominated by words like "daiin", "chol", etc., and, as far as I understand, the relative frequencies of these words are mostly determined by the "hand" (A/B, etc). So, unless there is some filter that removes the effect from top words, I would expect the chart to mostly show the (weighted) number of daiin's on page 1 times the number of daiin's on page 2.
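
One way to test this concern is to drop the most frequent words before scoring and see whether the pattern survives. The sketch below is only an illustration: `top_words`, `shared_fraction`, the cutoff `n` and the toy word lists are assumptions, not part of the original script.

```python
from collections import Counter

def top_words(folios, n):
    """The n most frequent word tokens across all folios."""
    counts = Counter(w for words in folios.values() for w in words)
    return {w for w, _ in counts.most_common(n)}

def shared_fraction(a, b, stop=frozenset()):
    # Shared distinct words over combined distinct words, ignoring `stop`.
    sa, sb = set(a) - stop, set(b) - stop
    total = len(sa) + len(sb)
    return len(sa & sb) / total if total else 0.0

# Toy word lists for illustration only, not real transcription data.
folios = {
    "page_a": ["daiin", "daiin", "chol", "otedy"],
    "page_b": ["daiin", "daiin", "daiin", "shedy"],
    "page_c": ["chol", "otedy", "daiin"],
}
stop = top_words(folios, 1)

print(stop)
print(shared_fraction(folios["page_a"], folios["page_b"]))
print(shared_fraction(folios["page_a"], folios["page_b"], stop))
print(shared_fraction(folios["page_a"], folios["page_c"], stop))
```

If a block of high scores disappears once the top words are removed, it was probably driven by those words rather than by page-specific vocabulary.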

Yep, that's my main concern: the scoring function, and whether I'm inadvertently introducing a pattern where one doesn't exist. I've tried to mitigate this as explained in the post, using the set of words in the first folio and dividing the score by the total words in the 2nd folio.

I think I'll run this again with a modified scoring function that doesn't count how many times a word appears in the 2nd folio, just whether it appears or not. That way I don't need to remove the over-represented words (my opinion is that they represent *something* even if it's just a null word).

I'm not super convinced that differing output based on the "hand" would contradict the idea; I would expect that the scribes are using different settings on the encoding mechanism. Still, I think it would be good for me to add some indication of the likely scribe to the sheet to see what, if anything, it brings up.
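
A small sketch of how the hand labels could be brought in: average the pairwise scores separately for same-hand and cross-hand folio pairs. The labels, word lists and function names here are toy assumptions; real labels would have to come from a classification such as Currier's.

```python
from itertools import combinations
from statistics import mean

def shared_fraction(a, b):
    # Assumed stand-in score: shared distinct words over combined distinct words.
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return len(sa & sb) / total if total else 0.0

def mean_by_hand_pair(folios, hands):
    """Mean pairwise score per hand combination, e.g. 'A-A', 'A-B', 'B-B'."""
    buckets = {}
    for a, b in combinations(folios, 2):
        key = "-".join(sorted((hands[a], hands[b])))
        buckets.setdefault(key, []).append(shared_fraction(folios[a], folios[b]))
    return {key: mean(values) for key, values in buckets.items()}

# Toy word lists and hand labels for illustration only.
folios = {
    "p1": ["daiin", "chol"],
    "p2": ["daiin", "chor"],
    "p3": ["qokedy", "shedy"],
    "p4": ["qokedy", "chedy"],
}
hands = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

result = mean_by_hand_pair(folios, hands)
print(result)
```

If same-hand pairs score much higher than cross-hand pairs everywhere, the hand is a likely driver of the blocks on the sheet.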

==update==
It looks like I was using the updated scoring idea already:

Code:
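      # Number of distinct base-folio words that also occur in the comparison folio
      # (each count() on a de-duplicated list is 0 or 1, so this is the set intersection size).
      folioScore = len(set(baseFolioWords) & set(compFolio))

      # Normalise by the combined number of distinct words on both folios.
      ret.append(str(folioScore / (len(set(baseFolioWords)) + len(set(compFolio)))))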
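
For anyone who wants to run the snippet above on its own, here is a self-contained sketch. The wrapper names `presence_score` and `original_count` and the toy word lists are assumptions; only the two expressions come from the snippet.

```python
def original_count(base_folio_words, comp_folio):
    # The counting expression from the snippet, unchanged.
    return sum([list(set(comp_folio)).count(w) for w in set(base_folio_words)])

def presence_score(base_folio_words, comp_folio):
    # Same quantity written as a set intersection, then normalised
    # by the combined number of distinct words on both folios.
    base, comp = set(base_folio_words), set(comp_folio)
    total = len(base) + len(comp)
    return len(base & comp) / total if total else 0.0

# Toy word lists for illustration only, not real transcription data.
base = ["daiin", "daiin", "chol"]
comp = ["daiin", "shol", "shol", "chor"]

print(original_count(base, comp))
print(presence_score(base, comp))
print(presence_score(base, base))
```

With this normalisation, two folios with identical vocabularies score 0.5, so that is the ceiling of the scale. The score is also the same in both directions.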

I just read the article linked in that voynichese credit and it contains... exactly the same research as I've written up.
That's hilarious to me, oh dear.

==update==
aaaaaaaand I've just realised you wrote that article, now I'm really laughing my ass off.

(28-01-2025, 12:10 PM)008348dc760f858fd668476b75fb6f Wrote: I'm not super convinced that the differing output based on the "hand" writing would contradict the idea...

I'm not sure I understand what exactly "the idea" is. Sorry, I can be quite slow when it comes to understanding various concepts :)

If you don't mind, could you explain in more detail the following:

Quote:Can we establish correlation between folios and sections based on the number of words shared.

Is it mainly about the correlation between folios and folios, between sections and sections or correlation between sections and folios (as in "this page correlates with balneology, etc")?

Quote:It helps us isolate sections to work on, there is a lot of text in the manuscript and reducing the attack surface can help.

I guess the sections (if you are referring to herbal, balneology, etc) are more or less self evident in the MS?

Quote:We can also use this to validate folio and quire order, for example I've always been confused by location of quire 19 and suspected it belonged somewhere else, but looking at its position on the sheet it shares the same correlations as its neighbors, the bathing section and the recipe section, meaning it likely belongs where it is.

To me this is not very obvious from the chart. Unless I'm reading it wrong, it correlates a bit with balneology and with a number of herbal pages, but doesn't correlate particularly well with the end section.

Then I'm lost completely, because I don't understand how the following is related to the chart of correlations :)

Quote:The current status of the "is it a cipher or not" debate boils down to either "no its the product of a generative method that produces pseudo-language" or "yes but we don't know how or what" or "something else". If this research shows anything it shows that the manuscript text shares a logical correlation to its pictographic themes and even the physical structure (the quires). If this is pseud-language then it follows themes and relations. Also, I don't think these concepts are mutually exclusive, I think you absolutely can use a generative method as an encoding or enciphering mechanism. If I was to state it in a simple way it would be, "you hide a needle in a bunch of haystacks, but you're going to need to manufacture those haystacks". I think the generative mechanism takes words as a seed and produces pseudo-language as an obscured output. I also think there is a sort of pun going on here, it's not just the text being generated, the characters and plants are too, hence why we see so many unrecognisable plants.

The plants are gathered, cut up thrown into a device and turned into something new, like the words.

I think this is a cool idea (that plants and figures are generated according to some mechanistic process). It is reminiscent of various theories that "plants are not actually plants, but some schematic representation", but also explains why the plants look weird and many nymphs are in strange poses. Not sure if there is any relation to the study of correlations here.

Quote:If I'd say anything, my theory is that a generative mechanism turns a word into many words, perhaps letter by letter. It adds pre/suff-ixes, modifies letters, adds false root etc so that we end up with what we have, a text poor in individual characters and overburdened with words.

As far as I know, this is a very popular idea and over the years a lot of systems and devices explaining how the words were produced have been proposed. Again, does the study of correlations play any particular role here?

Quote:I'm not sure I understand what exactly "the idea" is. Sorry, I can be quite slow when it comes to understanding various concepts

No worries and no reason to apologize. The idea is that a reusable encoding/enciphering mechanism is used to generate text as a way to obscure the underlying message. This mechanism can be set up to produce different output based on different settings. What I'm saying is that different hands using different words is not an issue in this scheme.

Quote:Is it mainly about the correlation between folios and folios, between sections and sections or correlation between sections and folios (as in "this page correlates with balneology, etc")?

It kind of ends up being the same thing. The starting point was correlation between folios, but as it turns out this confers correlation between sections and also quires.

Quote:To me this is not very obvious from the chart. Unless I'm reading it wrong, it correlates a bit with balneology and with a number of herbal pages, but doesn't correlate particularly well with the end section.

I think this is an issue of resolution, i.e. the correlation between the bathing and recipe sections looks so strong that it blots out everything else. Essentially anything that isn't white is an above average correlation and is notable.

Quote:I think this is a cool idea (that plants and figures are generated according to some mechanistic process). It is reminiscent of various theories that "plants are not actually plants, but some schematic representation", but also explains why the plants look weird and many nymphs are in strange poses. Not sure if there is any relation to the study of correlations here.

You're right, it's not linked to the main body of the write-up, just something I threw in the speculation section.

Quote:As far as I know, this is a very popular idea and over the years a lot of systems and devices explaining how the words were produced have been proposed. Again, does the study of correlations play any particular role here?

Yes, the correlation suggests that the text adheres to the themes of the folio. If so, the mechanism that generates the text is not producing a random stream but text generated to a theme. That theme could be text related to that specific folio or section. The "vine" is a good example: it appears on 3 different folios, and these 3 folios have above average word correlation.

The question around generative text is usually positioned as "the manuscript is either pseudo-language or a cipher...", my point is that it can easily be both.

Thank you for the thoughtful and inquisitive reply!

(28-01-2025, 01:08 PM)008348dc760f858fd668476b75fb6f Wrote: Yes, the correlation suggests that the text adheres to the themes of the folio. If so, the mechanism that generates the text is not a random stream of generated text but text generated to a theme. That theme could be text related to that specific folio or section. The "vine" is a good example, it appears on 3 different folios and these 3 different folios have above average word correlation.

Quote:A nice example of this is the "vine" plant that appears in f17v, f96v, f99. It's correlated through theme and word correlation.

This could be interesting. I suppose your next move would be to try to identify which words specifically cause this correlation? I vaguely remember there was a very similar attempt at linking various sections via statistical analysis a few years ago. I remember impressive graphs connecting parts of the vocabulary, color-coded by section.
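
A possible first step toward identifying those words, sketched under assumptions: list the words common to a chosen group of folios, rarest first, so that words found almost everywhere sink to the bottom. The function name, folio names and word lists are all made up for illustration.

```python
from collections import Counter

def shared_words(folios, names):
    """Words present on every folio in `names`, paired with the number of
    folios (in the whole set) that contain them, rarest first."""
    common = set.intersection(*(set(folios[n]) for n in names))
    doc_freq = Counter(w for words in folios.values() for w in set(words))
    return sorted(((w, doc_freq[w]) for w in common),
                  key=lambda item: (item[1], item[0]))

# Toy word lists for illustration only, not real transcription data.
folios = {
    "vine_a": ["okal", "daiin", "cthor"],
    "vine_b": ["okal", "daiin", "shey"],
    "vine_c": ["okal", "daiin", "qoky"],
    "other_1": ["daiin", "chedy"],
    "other_2": ["daiin", "shedy"],
}

result = shared_words(folios, ["vine_a", "vine_b", "vine_c"])
print(result)
```

Words at the top of the list occur on the chosen folios and on few others, which makes them better candidates for a theme-specific link than words that occur on nearly every page.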

(28-01-2025, 01:08 PM)008348dc760f858fd668476b75fb6f Wrote: The question around generative text is usually positioned as "the manuscript is either pseudo-language or a cipher...", my point is that it can easily be both.

I'm not sure there is a "usual" wording for any question about the manuscript, I think you can find a few distinct mutually incompatible opinions on almost any aspect of the MS :)

But your conclusion sounds reasonable to me. It seems to be a common idea that the cipher, if this is a cipher, was specifically designed to mimic an unknown language. As far as I understand, your specific suggestion is that this effect was achieved by using a certain language-like sequence generation method, underlying the actual cipher? I don't know if this is a novel idea, I'm not very much concerned with specific sequence generation scenarios. I know that a lot of people have been doing in-depth research in this area.

008348dc760f858fd668476b75fb6f

ReneZ

008348dc760f858fd668476b75fb6f

oshfdk

008348dc760f858fd668476b75fb6f

008348dc760f858fd668476b75fb6f

oshfdk

008348dc760f858fd668476b75fb6f

oshfdk