A mathematical approach to double words, conditional logic and the missing pages


Vuk88 > Yesterday, 03:46 PM
Hi everyone,
I'm Alfredo. My background is actually in social science data analysis, but like many of you, the Voynich Manuscript has been a long-time obsession of mine. I recently had an intuition regarding its structure that I really wanted to test out.
I started wondering if the manuscript is less like standard prose and more like a highly structured technical manual or data ledger. I know that's not a brand-new concept, but I wanted to see if I could find hard statistical proof of its internal grammar. Building on the Latent Semantic Analysis (LSA) research by Bowern, Layfield, and Davis (which the mods here kindly shared!), I decided to run some computational models on the EVA transliteration to hunt for deterministic rules.
I specifically focused on one of the most famous statistical anomalies: the "double repetitions" (like chol chol). While it's common to dismiss these as dittography or scribal copying errors, I hypothesized something different. What if, in a highly structured document with zero punctuation, these consecutive repetitions actually act as mechanical syntactic markers? Essentially, logic gates or section delimiters.
To test this, I applied Machine Learning algorithms (Logistic Regression and LSA) to analyze the words that heavily co-occur on the same pages as these doubles, looking for a mathematical correlation.
The dependency turned out to be massive. These double words aren't isolated or random; they are strictly bound to specific associated words. I've attached a clustering graph below (voynich_syntax_constellation_en.png) where you can see a distinct separation.

The blue and red dots represent two entirely separate grammatical families. The t-SNE algorithm segregated the vocabulary into two distinct functional 'clouds', showing a strong structural dependency. This rigid separation is the fingerprint of a highly structured system that rigorously distinguishes between different functional categories of words.
To find out exactly which words do this, I looked at the Logistic Regression coefficients. In the attached bar chart logistic_bar_chart_en.png, you can see these specific "syntactic triggers" isolated. The red bars highlight the exact words that heavily co-occur on the exact same pages as the double sequences, while the blue bars show the ones that mathematically inhibit them (they almost never appear together).

To prove this wasn't just a coincidence, I ran a Monte Carlo Permutation Test with 1,000 iterations. Basically, I had the script completely scramble the text 1,000 times. This destroys the original word order, but keeps the raw word counts exactly the same. As you can see in the attached graph (permutation_violin_en.png), the randomly shuffled text completely failed to recreate the pattern (p-value = 0.0010). This proves the manuscript follows rigid conditional rules.
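For readers who want to reproduce the idea without the full framework, here is a minimal sketch of such a shuffle test. It assumes the test statistic is simply the number of adjacent identical word pairs; the statistic actually used in the paper may differ, and the function names and toy word list are hypothetical.

```python
import random

def count_doubles(words):
    """Count positions where a word is immediately followed by itself.
    A triple such as 'chol chol chol' counts as two doubles."""
    return sum(1 for a, b in zip(words, words[1:]) if a == b)

def permutation_p_value(words, n_iter=1000, seed=0):
    """Shuffle the word order n_iter times (word counts stay fixed) and
    return (observed statistic, one-sided Monte Carlo p-value)."""
    rng = random.Random(seed)
    observed = count_doubles(words)
    shuffled = list(words)
    at_least = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if count_doubles(shuffled) >= observed:
            at_least += 1
    # Add-one estimator, so the p-value is never exactly zero.
    return observed, (at_least + 1) / (n_iter + 1)

# Toy input only, not manuscript data.
toy = "chol chol daiin shedy qokedy chol chol daiin chedy".split()
print(permutation_p_value(toy, n_iter=200))
```

With this add-one estimator and 1,000 shuffles, the smallest value the test can return is 1/1001, about 0.001, which is what one gets when no shuffled text reaches the observed count.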

But here is the part with the biggest codicological implications. By mapping the spatial density of these syntactic markers across the whole manuscript (see the attached graph: voynich_all_missing_density.png), a striking coincidence emerges: the absolute highest peaks of density are concentrated exactly on the pages right next to the codicological gaps, that is, the 14 missing folios (f12, f59-f64, f74, f91-f92, f97-f98, f109-f110).
- A quick note on the attached graph (lsa_vs_syntax.png): This chart visually summarizes the intersection between semantic and syntactic data. The X-axis represents the manuscript folios (timeline from f1 to f116). The colored background blocks show the different thematic sections identified by the LSA algorithm (Herbal, Astronomical, etc.), separated by vertical dashed lines indicating where the topic abruptly changes. The red line represents my data: the density of the double words. As you can clearly see, almost every time there is a semantic transition (a change in topic), there is a massive red spike in syntactic double words. They act as physical boundaries between sections.
For example, one of the highest peaks in the entire book is at folio f108, right before the extraction of the final folios. This strongly supports the historical hypothesis that those removed pages contained the conversion matrices or "Tabulae" required to decode the frequent logical transitions in those sections.

I recently published the full paper, the dataset, and the Python framework I used for the analysis.
I would love for the experts here to take a close look, test the Python framework, and offer some constructive criticism. I'm really curious to hear your thoughts on how we might use this syntactic mapping moving forward.
Thanks for your time!
P.S. For those who prefer a less technical read, I have also attached a simplified, non-academic summary of the theory below in PDF format.

Mauro > Yesterday, 04:06 PM

If I understood, what you're saying is that you found a group of words which consistently precede a 'double repetition', and another group of words which instead are consistently not followed by a double repetition. Is this correct?

I don't understand how to interpret the t-SNE map. Could you explain more? What do the red and blue colours stand for, in the light of your hypothesis? Why are some words labelled, and what does that mean?

And, to have an idea of the statistical power, how many 'double repetitions' did you identify and analyze in the text?

You may have found something interesting, but it's hard to say before understanding more!

One thing I surely would not agree with is this:

Quote: For example, one of the highest peaks in the entire book is at folio f108, right before the extraction of the final folios. This strongly supports the historical hypothesis that those removed pages contained the conversion matrices or "Tabulae" required to decode the frequent logical transitions in those sections.

Frankly, it looks like a non sequitur to me.

nablator > 11 hours ago

(Yesterday, 03:46 PM)Vuk88 Wrote: To test this, I applied Machine Learning algorithms (Logistic Regression and LSA) to analyze the words that immediately precede these doubles, looking for a mathematical correlation.

Either I don't understand what "immediately precede these doubles" means or something is seriously wrong with the "triggers" sy, chaiin, dchy, qokchol: they never precede a reduplication.

In paragraphs of RF1b the most frequent words preceding a reduplication are daiin (8, all in Currier A), qokedy (8, all in Currier B), chedy (6, all in Currier B)...

Vuk88 > 7 hours ago

Mauro, even though I replied to you in Italian in private, I am realizing that posting this analysis in the forum in summary form could generate misunderstandings. Therefore, I am replying publicly in English, trying to make the text more conversational so that others can read it and get a more detailed idea.
I want to point out that I published the full paper and the raw Python scripts on Zenodo in the hope that people will grab the code and validate the numbers themselves.
Regarding your first point... yes, you grasped the core concept perfectly. The Logistic Regression model flagged specific words that basically act as precursors. They heavily co-occur on the exact same pages where double repetitions are found. Other words do the exact opposite and mathematically inhibit them from appearing on the same page.
As for the t-SNE map, think of it as an automatic organizer. If you fed this algorithm all the words of the Divine Comedy and told it to "group together words that behave similarly in a sentence", it would spontaneously create one cloud with all the verbs, another with the adjectives, and so on. What did the algorithm do with the Voynich? I gave it all the words. If the Voynich were a normal text, or a random hoax, the double words (the red dots) would be mixed in with everything else. Instead, the algorithm literally tore them away and isolated them in a cloud of their own.
What does this prove? It shows that double words are not normal vocabulary repeated by mistake by a tired scribe. They belong to a completely different functional category from the rest of the text. They behave according to their own mathematical rules, almost as if they were mathematical symbols or code commands sitting in the middle of a text. And the blue dots (the triggers we just talked about) sit in the other cloud. So, it's not really me speculating here. It's pure spatial mathematics showing us that double words are structurally an "alien species" compared to the words that trigger them. They are separate entities with separate functions.
Just to give you an idea of the sample size, I pulled exactly 298 consecutive double repetitions distributed across 135 pages. Obviously, the Takahashi transliteration (based on the EVA alphabet) has some illegible characters, because the manuscript is faded or stained in places or has strange accents that seem to be scribal errors. So, when the algorithm read the manuscript, I set it up to keep only the "legible" sequences. So it's a pretty robust dataset to work on.
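As a rough illustration of that extraction step, the sketch below scans each page for adjacent identical tokens and skips any token that is not purely alphabetic. The legibility rule (anything containing `?`, `*` or other non-letters is treated as illegible), the page dictionary layout, and the function names are assumptions made for this example, not the paper's actual filter.

```python
def is_legible(token):
    """Assumed rule: a token is legible if it contains letters only."""
    return token.isalpha()

def find_doubles(pages):
    """pages maps a page label to its tokens in reading order.
    Returns (page, index, word) for every legible adjacent repetition."""
    hits = []
    for page, tokens in pages.items():
        for i in range(len(tokens) - 1):
            if tokens[i] == tokens[i + 1] and is_legible(tokens[i]):
                hits.append((page, i, tokens[i]))
    return hits

# Toy input only, not manuscript data.
toy_pages = {
    "page_a": "fachys.ykal.chol.chol.daiin".split("."),
    "page_b": ["qo?y", "qo?y", "shedy"],  # uncertain reading, skipped
}
hits = find_doubles(toy_pages)
print(len(hits), "doubles on", len({page for page, _, _ in hits}), "pages")
```

Whether repetitions that straddle a line break or a paragraph break count as doubles is a separate decision that changes the totals, so it is worth stating explicitly alongside the 298 figure.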
My initial intuition was honestly pretty simple. I just wanted to look at the math behind how these double words interact. I used a metaphor in the paper that might help: if the Voynich is a building, the standard text blocks are the walls, and these double words are basically the doors. Calling them "operators" or "data blocks" is definitely me speculating, I'll admit that. But the mathematical correlations dictating where they show up are strong.
This makes more sense when you look at the Latent Semantic Analysis (LSA) study by Bowern's team. They were trying to answer one specific question: where does the text change topics? Their algorithm found these sharp transitions where the scribe jumps, say, from herbology to astronomy. My analysis asked a different question: what physically separates these sections? When I cross-referenced their semantic breaks with my syntactic data, the topic changes lined up exactly where my double words are concentrated.
To make this visually immediate, I've attached a new clean graph (lsa_vs_syntax.png). In this chart, the background colored blocks represent the different thematic sections found by the LSA algorithm. The red line represents the density of my double words. As you can see, the red spikes align almost surgically with the exact boundaries where the text changes topic.
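The comparison described above can be checked numerically as well as visually. This is a minimal sketch, assuming density means doubles per adjacent word pair on a page and that the topic boundaries are given as page indices; both choices, the function names, and the toy numbers are hypothetical.

```python
def double_density(tokens):
    """Fraction of adjacent word pairs on a page that are repetitions."""
    if len(tokens) < 2:
        return 0.0
    doubles = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return doubles / (len(tokens) - 1)

def density_near_boundaries(densities, boundaries, window=1):
    """Mean density on pages within `window` of a boundary index,
    versus the mean density on all other pages."""
    near = {
        i
        for b in boundaries
        for i in range(b - window, b + window + 1)
        if 0 <= i < len(densities)
    }
    inside = [d for i, d in enumerate(densities) if i in near]
    outside = [d for i, d in enumerate(densities) if i not in near]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(inside), mean(outside)

# Toy input only: five pages, one assumed topic boundary at index 2.
toy_densities = [0.0, 0.0, 0.5, 0.0, 0.0]
print(density_near_boundaries(toy_densities, boundaries=[2], window=0))
```

Normalizing by page length matters here, since raw counts would spike on long pages regardless of topic. The same two means can also be recomputed with randomly placed boundaries to see how often chance alone produces a gap of the same size.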
You are right to be skeptical about the whole f108 codicology stuff. I'm the first to admit that assuming the missing pages held conversion matrices is a massive speculative leap. But the underlying data isn't speculation. It's a verified mathematical fact that the absolute highest density of these double words happens right next to the missing folios.
From the perspective of medieval and Renaissance codicology, placing a decoding matrix at the end of the manuscript is historically the most logical choice. There are two established reasons for this: First, in 15th-century manuscript production, analytical indices or reference tables (the tabulae) were almost always bound in the final quire, for the trivial practical necessity that an index can only be compiled after the text is finished. (The Codebreakers: The Story of Secret Writing – David Kahn – 1967). Second, in the late Middle Ages, complex cipher systems (like nomenclators) required a physical key. Scribes and diplomats often placed these keys at the end of the codex, or kept them as a separate booklet for security reasons (think of the Tabula Recta by Trithemius, from roughly 1500). So, if the Voynich contains a structured syntax requiring a conversion table, the final folios (like the missing f109-f110) are historically exactly the place we would expect to find it. My interpretation of why it happens might be wrong. But the correlation is there in the data.
Even if my hypotheses turn out to be garbage, the mathematical structure really points to a text that acts like a technical manual or a ledger, not just prose. Other people have reached similar conclusions by looking at ink chemistry or handwriting analysis over the years. The fact that I ended up at the same conclusion purely through raw statistics tells me that a cross-disciplinary approach is probably the best way forward here. Computational data can either back up or debunk the traditional historical analysis.
Thanks again for your time.

Vuk88 > 6 hours ago

(11 hours ago)nablator Wrote:
(Yesterday, 03:46 PM)Vuk88 Wrote: To test this, I applied Machine Learning algorithms (Logistic Regression and LSA) to analyze the words that immediately precede these doubles, looking for a mathematical correlation.

Either I don't understand what "immediately precede these doubles" means or something is seriously wrong with the "triggers" sy, chaiin, dchy, qokchol: they never precede a reduplication.

In paragraphs of RF1b the most frequent words preceding a reduplication are daiin (8, all in Currier A), qokedy (8, all in Currier B), chedy (6, all in Currier B)...

You are absolutely right, and I thank you for pointing this out.
I used the word 'precede', which was a very poor choice of words on my part and caused this misunderstanding.
The Logistic Regression model I used operates at the entire page level (a Bag-of-Words approach), not at the physically adjacent n-gram level. So, what the algorithm actually found is that words like cthom, shees, saiin (the red 'triggers') represent the 'macro-dialect' of those specific sections. They heavily co-occur on the exact same pages where double words are found, while words like sy, chaiin, dchy (the blue 'inhibitors') almost never appear on those pages.
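To make the page-level setup concrete, here is a small self-contained sketch: each page becomes a binary bag-of-words vector, the label says whether the page contains at least one double, and a logistic regression is fitted by plain gradient descent. The toy pages are invented (only the word forms come from this thread), and the learning rate, regularization and function names are assumptions. A library such as scikit-learn would normally do the fitting; plain Python is used only to keep the example dependency-free.

```python
import math

def has_double(tokens):
    """True if any word on the page is immediately repeated."""
    return any(a == b for a, b in zip(tokens, tokens[1:]))

def page_features(tokens, vocab):
    """Binary bag-of-words: 1.0 if the vocab word occurs on the page."""
    present = set(tokens)
    return [1.0 if w in present else 0.0 for w in vocab]

def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """L2-regularized logistic regression via batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            gb += err
            for j, xj in enumerate(xi):
                gw[j] += err * xj
        b -= lr * gb / n
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, gw)]
    return w, b

# Invented toy pages, not manuscript data.
toy_pages = [
    ["cthom", "chol", "chol", "shees"],
    ["cthom", "daiin", "daiin", "saiin"],
    ["sy", "chaiin", "dchy"],
    ["sy", "shedy", "chaiin"],
]
vocab = ["cthom", "sy"]
X = [page_features(p, vocab) for p in toy_pages]
y = [1.0 if has_double(p) else 0.0 for p in toy_pages]
w, b = fit_logistic(X, y)
weights = dict(zip(vocab, w))
print(weights)  # positive weight = "trigger", negative = "inhibitor"
```

Note that a positive or negative coefficient in this setup only says that a word tends to share pages with doubles or to avoid them. Any vocabulary that differs between sections of the manuscript will produce such coefficients if doubles are also unevenly spread across sections.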
Your manual check was of great help. If my algorithm had only found a specific grammatical word physically attached to the double, it would have just been a local syntax rule. But having found an entire vocabulary that turns 'on' and 'off' depending on whether or not double words are present on the page, the model essentially demonstrates mathematically that the Voynich is written in completely isolated and modular blocks. The text is rigidly partitioned.
I was wrong to call them words that 'precede' the doubles; in reality, they are words that simply 'co-occur on the same page'. Your manual count of the immediately adjacent words (like daiin and chedy) is correct for n-grams. You looked at the local syntax (what is physically attached to the double), while my algorithm mapped the global modularity of the page. Apologies again for the confusion with my wording, and many thanks for your watchful eye.

Mauro > 6 hours ago

(7 hours ago)Vuk88 Wrote: Mauro, even though I replied to you in Italian in private,

I got no incoming messages from you. Are you sure you sent it to me?

(7 hours ago)Vuk88 Wrote: I am realizing that posting this analysis in the forum in summary form could generate misunderstandings. Therefore, I am replying publicly in English, trying to make the text more conversational so that others can read it and get a more detailed idea.
I want to point out that I published the full paper and the raw Python scripts on Zenodo in the hope that people will grab the code and validate the numbers themselves.

Yes, I noticed that. But it takes some time to read an article, so I hoped for some quick info first.

(7 hours ago)Vuk88 Wrote: Regarding your first point... yes, you grasped the core concept perfectly. The Logistic Regression model flagged specific words that basically act as precursors. They constantly show up right before a double repetition. Other words do the exact opposite and mathematically inhibit them.
As for the t-SNE map, think of it as an automatic organizer. If you fed this algorithm all the words of the Divine Comedy and told it to "group together words that behave similarly in a sentence", it would spontaneously create one cloud with all the verbs, another with the adjectives, and so on. What did the algorithm do with the Voynich? I gave it all the words. If the Voynich were a normal text, or a random hoax, the double words (the red dots) would be mixed in with everything else. Instead, the algorithm literally tore them away and isolated them in a cloud of their own.
I know what a t-SNE map is, but I cannot make sense of your diagram. If the red-coded words are the group of words which precede a 'double repetition', why are the double words (red dots) mixed among them? And why is 'chol' not marked (see also nablator's post)? Why are there also marked words in the blue cloud?

(7 hours ago)Vuk88 Wrote: What does this prove? It shows that double words are not normal vocabulary repeated by mistake by a tired scribe. They belong to a completely different functional category from the rest of the text. They behave according to their own mathematical rules, almost as if they were mathematical symbols or code commands sitting in the middle of a text. And the blue dots (the triggers we just talked about) sit in the other cloud. So, it's not really me speculating here. It's pure spatial mathematics showing us that double words are structurally an "alien species" compared to the words that trigger them. They are separate entities with separate functions.

And this would be quite an interesting result. Now, I don't believe (and not many do, I think) that the duplicated words are a mistake by a tired scribe; there are innumerable reasons why the text might have duplicated words, so this is not the main point. What would be interesting is to actually demonstrate that duplicated words have a special meaning (at least statistically).

(7 hours ago)Vuk88 Wrote: Just to give you an idea of the sample size, I pulled exactly 298 consecutive double repetitions distributed across 135 pages. Obviously, the Takahashi transliteration (based on the EVA alphabet) has some illegible characters due to the condition of the manuscript, which is faded or stained, or has strange accents that seem to be scribe errors. So, when the algorithm read the manuscript, I set it up to exclusively filter out the "legible" sequences. So it's a pretty robust dataset to work on.

This, I fear, could be a problem. 298 repetitions are not many in the whole text, so I think there's the possibility the results are due to chance. For instance, according to your data 'chees' strongly suppresses word duplication. But there are only 36 'chees' in the whole text so I don't find it that surprising that none of the 36 'chees' is followed by an instance of the 298 duplicated words. I think this should be addressed.
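A back-of-the-envelope version of this concern can be sketched as follows. It assumes doubles start at random, independent positions, and it assumes a total of about 38,000 word tokens in the transliteration; that total is an assumption for illustration and should be replaced with the count from the actual dataset. The function name is hypothetical.

```python
def prob_zero_hits(n_occurrences, n_doubles, n_tokens):
    """Chance that none of n_occurrences of a word is immediately followed
    by a double, if doubles were placed independently at random."""
    q = n_doubles / n_tokens  # per-position chance that a double starts
    return (1.0 - q) ** n_occurrences

# 36 and 298 are the figures quoted in this thread; 38000 is an assumed total.
p_none = prob_zero_hits(n_occurrences=36, n_doubles=298, n_tokens=38000)
print(round(p_none, 2))
```

With the assumed total of 38,000 tokens this comes out at about 0.75, so under pure chance a word with 36 occurrences would show no adjacent double roughly three times out of four. The page-level version of the question needs the number of pages containing the word and the share of pages containing doubles instead.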

By the way, what do you mean exactly by 'words preceding a duplicated word'? The word immediately preceding the duplicated pair?

(7 hours ago)Vuk88 Wrote: My initial intuition was honestly pretty simple. I just wanted to look at the math behind how these double words interact. I used a metaphor in the paper that might help: if the Voynich is a building, the standard text blocks are the walls, and these double words are basically the doors. Calling them "operators" or "data blocks" is definitely me speculating, I'll admit that. But the mathematical correlations dictating where they show up are strong.
This makes more sense when you look at the Latent Semantic Analysis (LSA) study by Bowern's team. They were trying to answer one specific question: where does the text change topics? Their algorithm found these sharp transitions where the scribe jumps, for say, from herbology to astronomy. My analysis asked a different question: what physically separates these sections? When I cross-referenced their semantic breaks with my syntactic data, the topic changes lined up exactly where my double words are concentrated.
To make this visually immediate, I've attached a new clean graph (lsa_vs_syntax.png). In this chart, the background colored blocks represent the different thematic sections found by the LSA algorithm. The red line represents the density of my double words. As you can see, the red spikes align almost surgically with the exact boundaries where the text changes topic.

This is interesting, but it'll have to wait until tomorrow before I can think it through, sorry.

(7 hours ago)Vuk88 Wrote: You are right to be skeptical about the whole f108 codicology stuff. I'm the first to admit that assuming the missing pages held conversion matrices is a massive speculative leap. But the underlying data isn't speculation. It's a verified mathematical fact that the absolute highest density of these double words happens right next to the missing folios.
From the perspective of medieval and Renaissance codicology, placing a decoding matrix at the end of the manuscript is historically the most logical choice. There are two established reasons for this: First, in 15th-century manuscript production, analytical indices or reference tables (the tabulae) were almost always bound in the final quire, for the trivial practical necessity that an index can only be compiled after the text is finished. (The Codebreakers: The Story of Secret Writing – David Kahn – 1967). Second, in the late Middle Ages, complex cipher systems (like nomenclators) required a physical key. Scribes and diplomats often placed these keys at the end of the codex, or kept them as a separate booklet for security reasons (think of the Tabula Recta by Trithemius, from roughly 1500). So, if the Voynich contains a structured syntax requiring a conversion table, the final folios (like the missing f109-f110) are historically exactly the place we would expect to find it. My interpretation of why it happens might be wrong. But the correlation is there in the data.
Even if my hypotheses turn out to be garbage, the mathematical structure really points to a text that acts like a technical manual or a ledger, not just prose. Other people have reached similar conclusions by looking at ink chemistry or handwriting analysis over the years. The fact that I ended up at the same conclusion purely through raw statistics tells me that a cross-disciplinary approach is probably the best way forward here. Computational data can either back up or debunk the traditional historical analysis.
Thanks again for your time.

However, we don't know how the VMS was originally bound (there is also a study by Lisa Fagin Davis suggesting the VMS was never meant to be bound), so it's hard to be sure how to interpret your observation on duplicated words vs. missing folios (which is, nonetheless, an interesting observation).

Vuk88 > 5 hours ago

(6 hours ago)Mauro Wrote: You are not allowed to view links. Register or Login to view.
(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.Mauro, even though I replied to you in Italian in private,

I got no incoming messages from you. Sicuro di averlo mandato a me?

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.I am realizing that posting this analysis in the forum in summary form could generate misunderstandings. Therefore, I am replying publicly in English, trying to make the text more conversational so that others can read it and get a more detailed idea.
I want to point out that I published the full paper and the raw Python scripts on Zenodo in the hope that people will grab the code and validate the numbers themselves.
Yes I noticed that. But it takes some time to read an article, so I hoped for some quick info first

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.Regarding your first point... yes, you grasped the core concept perfectly. The Logistic Regression model flagged specific words that basically act as precursors. They constantly show up right before a double repetition. Other words do the exact opposite and mathematically inhibit them.
As for the t-SNE map, think of it as an automatic organizer. If you fed this algorithm all the words of the Divine Comedy and told it to "group together words that behave similarly in a sentence", it would spontaneously create one cloud with all the verbs, another with the adjectives, and so on. What did the algorithm do with the Voynich? I gave it all the words. If the Voynich were a normal text, or a random hoax, the double words (the red dots) would be mixed in with everything else. Instead, the algorithm literally tore them away and isolated them in a cloud of their own.
I know what a t-SNE map is, but I cannot make sense of you diagram. If the red-coded words are the group of words which precede a 'double repetition', why the double words (red dots) are mixed among them? And why 'chol' is not marked (see also nablator's You are not allowed to view links. Register or Login to view.). Why are there also marked words in the blue cloud?

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.What does this prove? It shows that double words are not normal vocabulary repeated by mistake by a tired scribe. They belong to a completely different functional category from the rest of the text. They behave according to their own mathematical rules, almost as if they were mathematical symbols or code commands sitting in the middle of a text. And the blue dots (the triggers we just talked about) sit in the other cloud. So, it's not really me speculating here. It's pure spatial mathematics showing us that double words are structurally an "alien species" compared to the words that trigger them. They are separate entities with separate functions.

And this would be quite an interesting result. Now I don't believe (and not many do, I think) that the duplicated words are a mistake by a tired scribe, there are innumerable reasons why the text might have duplicated words, so this is not the main point. What would be interesting would be to actually demonstrate that duplicated words have a special meaning (at least statistically).

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.Just to give you an idea of the sample size, I pulled exactly 298 consecutive double repetitions distributed across 135 pages. Obviously, the Takahashi transliteration (based on the EVA alphabet) has some illegible characters due to the condition of the manuscript, which is faded or stained, or has strange accents that seem to be scribe errors. So, when the algorithm read the manuscript, I set it up to exclusively filter out the "legible" sequences. So it's a pretty robust dataset to work on.

This, I fear, could be a problem. 298 repetitions are not many in the whole text, so I think there's the possibility the results are due to chance. For instance, according to your data 'chees' strongly suppresses word duplication. But there are only 36 'chees' in the whole text so I don't find it that surprising that none of the 36 'chees' is followed by an instance of the 298 duplicated words. I think this should be addressed.

By the way, what do you mean exactly by 'words preceding a duplicated word'? The word immediately preceding the duplicated pair?

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.My initial intuition was honestly pretty simple. I just wanted to look at the math behind how these double words interact. I used a metaphor in the paper that might help: if the Voynich is a building, the standard text blocks are the walls, and these double words are basically the doors. Calling them "operators" or "data blocks" is definitely me speculating, I'll admit that. But the mathematical correlations dictating where they show up are strong.
This makes more sense when you look at the Latent Semantic Analysis (LSA) study by Bowern's team. They were trying to answer one specific question: where does the text change topics? Their algorithm found these sharp transitions where the scribe jumps, for say, from herbology to astronomy. My analysis asked a different question: what physically separates these sections? When I cross-referenced their semantic breaks with my syntactic data, the topic changes lined up exactly where my double words are concentrated.
To make this visually immediate, I've attached a new clean graph (lsa_vs_syntax.png). In this chart, the background colored blocks represent the different thematic sections found by the LSA algorithm. The red line represents the density of my double words. As you can see, the red spikes align almost surgically with the exact boundaries where the text changes topic.

This is interesting, but it'll have to wait tomorrow before I can reason on it, sorry

(7 hours ago)Vuk88 Wrote: You are not allowed to view links. Register or Login to view.You are right to be skeptical about the whole f108 codicology stuff. I'm the first to admit that assuming the missing pages held conversion matrices is a massive speculative leap. But the underlying data isn't speculation. It's a verified mathematical fact that the absolute highest density of these double words happens right next to the missing folios.
From the perspective of medieval and Renaissance codicology, placing a decoding matrix at the end of the manuscript is historically the most logical choice. There are two established reasons for this: First, in 15th-century manuscript production, analytical indices or reference tables (the tabulae) were almost always bound in the final quire, for the trivial practical necessity that an index can only be compiled after the text is finished. (The Codebreakers: The Story of Secret Writing – David Kahn – 1967). Second, in the late Middle Ages, complex cipher systems (like nomenclators) required a physical key. Scribes and diplomats often placed these keys at the end of the codex, or kept them as a separate booklet for security reasons (think of the Tabula Recta by Trithemius, from roughly 1500). So, if the Voynich contains a structured syntax requiring a conversion table, the final folios (like the missing f109-f110) are historically exactly the place we would expect to find it. My interpretation of why it happens might be wrong. But the correlation is there in the data.
Even if my hypotheses turn out to be garbage, the mathematical structure really points to a text that acts like a technical manual or a ledger, not just prose. Other people have reached similar conclusions by looking at ink chemistry or handwriting analysis over the years. The fact that I ended up at the same conclusion purely through raw statistics tells me that a cross-disciplinary approach is probably the best way forward here. Computational data can either back up or debunk the traditional historical analysis.
Thanks again for your time.

However, we don't know how the VMS was originally bound (there is also a study by Lisa Fagin Davis suggesting the VMS was never meant to be bound), so it's hard to be sure how to interpret your observation on duplicated words vs. missing folios (which is, nonetheless, an interesting observation).

You are totally right to be confused by the first graph. The error is entirely mine! Since this is my very first day on this forum, I struggled quite a bit with the formatting interface, and in my rush, I completely messed up the text. I accidentally described the t-SNE map using the logic and colors of the Logistic Regression.
Just to clear up the mess, the t-SNE map has nothing to do with triggers. It's just a global Word2Vec embedding showing that words like shedy and daiin belong to two separate grammatical families (the red and blue clouds). Regarding chol, it is indeed in the dataset, but the script probably just skipped its label to avoid overlapping. The actual 'trigger vs inhibitor' analysis is the Logistic Regression, which is the second graph. I'm going to edit my original post as soon as possible to fix this confusing mix-up.
Moving on to your methodological points, you caught my poor choice of words with 'preceding'. The Logistic Regression doesn't use an n-gram model, it uses a Bag-of-Words approach at the page level. So it's not looking at physical adjacency, but rather at page-level co-occurrence. It found that the presence or absence of words like shedy and chees is strongly tied to the pages where the doubles appear.
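To make the page-level setup concrete, here is a minimal sketch of the kind of table such a model works from: one row per page, a set of words present, and a flag for whether the page contains a double. It is not the Logistic Regression itself. As a simpler stand-in it computes a smoothed log odds ratio for a single word. The toy pages are invented and say nothing about which real words act as triggers or inhibitors.

```python
import math


def has_double(words):
    """True if any word on the page is immediately followed by itself."""
    return any(a == b for a, b in zip(words, words[1:]))


def page_table(pages):
    """Return one (vocabulary set, double flag) pair per page.

    Word order is discarded for the vocabulary, which is what makes
    this a bag-of-words view rather than an n-gram view.
    """
    rows = []
    for text in pages.values():
        words = text.split()
        rows.append((set(words), has_double(words)))
    return rows


def log_odds(rows, word, smoothing=0.5):
    """Smoothed log odds ratio for `word` vs. pages with doubles.

    Positive: the word favours pages with doubles. Negative: it avoids them.
    The smoothing constant (an assumption) keeps empty cells from dividing by zero.
    """
    a = b = c = d = smoothing
    for vocab, flag in rows:
        present = word in vocab
        if present and flag:
            a += 1
        elif present and not flag:
            b += 1
        elif not present and flag:
            c += 1
        else:
            d += 1
    return math.log((a * d) / (b * c))


# Hypothetical toy pages, NOT real transliteration data.
toy_pages = {
    "p1": "chol chol shedy daiin",
    "p2": "shedy qokeedy qokeedy daiin",
    "p3": "chees daiin okaiin",
    "p4": "chees okaiin chedy",
}

rows = page_table(toy_pages)
print("shedy", log_odds(rows, "shedy"))
print("chees", log_odds(rows, "chees"))
```

A logistic regression fitted on the same table estimates one such weight per word jointly rather than one word at a time, so its coefficients can differ from these single-word ratios.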
On the sample size issue with the 36 chees vs 298 doubles, I had the exact same concern about statistical chance. That's precisely why I ran a Monte Carlo Permutation Test with 1000 iterations. The empirical p-value came out to p=0.0010, and the mathematical distance between the permuted random distribution and the true manuscript data is what confirms this isn't just a random fluke.
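For readers unfamiliar with the method, a permutation test of this kind can be sketched as follows. This is an illustration under assumptions, not the code that produced the figure above: the test statistic (number of pages where a word and a double co-occur), the one-sided comparison, the add-one correction, and the toy flags are all choices made for the example.

```python
import random


def co_occurrence(word_flags, double_flags):
    """Number of pages where both the word and a double appear."""
    return sum(1 for w, d in zip(word_flags, double_flags) if w and d)


def permutation_p_value(word_flags, double_flags, iterations=1000, seed=0):
    """One-sided empirical p-value from shuffling the double flags across pages.

    Uses the add-one correction (hits + 1) / (iterations + 1), so the
    smallest value it can return is 1 / (iterations + 1).
    """
    rng = random.Random(seed)
    observed = co_occurrence(word_flags, double_flags)
    shuffled = list(double_flags)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)
        if co_occurrence(word_flags, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (iterations + 1)


# Hypothetical toy data: 20 pages, word and doubles on exactly the same 10 pages.
word_flags = [True] * 10 + [False] * 10
double_flags = [True] * 10 + [False] * 10

p_value = permutation_p_value(word_flags, double_flags)
print(p_value)
```

One thing to keep in mind when reading the result: with 1000 iterations the resolution of the test is about 0.001. With the add-one correction the smallest possible value is 1/1001, which rounds to 0.0010, so a reported p of 0.0010 sits at or near the floor of what 1000 shuffles can show. More iterations would be needed to say how far below that the true value lies.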
Finally, regarding the codicology, I completely agree with you and Lisa Fagin Davis's work. The 'missing tabulae' theory is absolutely a speculative historical leap on my part to try and explain the data. The only hard fact is the massive density anomaly exactly at f108, and how we historically interpret that binding is definitely up for debate.