Vuk88 > Yesterday, 03:46 PM
Mauro > Yesterday, 04:06 PM
Quote:For example, one of the highest peaks in the entire book is at folio f108, right before the extraction of the final folios. This strongly supports the historical hypothesis that those removed pages contained the conversion matrices or "Tabulae" required to decode the frequent logical transitions in those sections.
nablator > 11 hours ago
(Yesterday, 03:46 PM) Vuk88 Wrote: To test this, I applied Machine Learning algorithms (Logistic Regression and LSA) to analyze the words that immediately precede these doubles, looking for a mathematical correlation.
Vuk88 > 7 hours ago
(Yesterday, 04:06 PM) Mauro Wrote: If I understood, what you're saying is that you found a group of words which consistently precede a 'double repetition', and another group of words which instead are consistently not followed by a double repetition. Is this correct?
I don't understand how to interpret the t-SNE map. Could you explain more? What do the red and blue colours stand for, in the light of your hypothesis? Why are some words labelled, and what does that mean?
And, to have an idea of the statistical power, how many 'double repetitions' did you identify and analyze in the text?
You may have found something interesting, but it's hard to say before understanding more!
One thing I surely would not agree with is this:
Quote:For example, one of the highest peaks in the entire book is at folio f108, right before the extraction of the final folios. This strongly supports the historical hypothesis that those removed pages contained the conversion matrices or "Tabulae" required to decode the frequent logical transitions in those sections.
Frankly, it looks like a non sequitur to me.
Vuk88 > 6 hours ago
(11 hours ago) nablator Wrote: (Yesterday, 03:46 PM) Vuk88 Wrote: To test this, I applied Machine Learning algorithms (Logistic Regression and LSA) to analyze the words that immediately precede these doubles, looking for a mathematical correlation.
Either I don't understand what "immediately precede these doubles" means or something is seriously wrong with the "triggers" sy, chaiin, dchy, qokchol: they never precede a reduplication.
In paragraphs of RF1b the most frequent words preceding a reduplication are daiin (8, all in Currier A), qokedy (8, all in Currier B), chedy (6, all in Currier B)...
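A count like the one above can be reproduced with a few lines of Python. This is only a minimal sketch, not the script from the Zenodo paper: it assumes "immediately precede" means the single token right before a consecutive repetition on the same line, and the function name `words_before_reduplication` and the sample line are invented for illustration.

```python
from collections import Counter

def words_before_reduplication(tokens):
    """Count the word immediately preceding each consecutive repetition.

    A reduplication here means tokens[i] == tokens[i + 1]; the preceding
    word is tokens[i - 1]. Longer runs (x x x) are counted once, at the
    start of the run. A repetition at the very start has no preceding
    word and is skipped.
    """
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        starts_run = tokens[i] == tokens[i + 1] and tokens[i - 1] != tokens[i]
        if starts_run:
            counts[tokens[i - 1]] += 1
    return counts

# Toy line of EVA-like tokens, invented for illustration only.
line = "daiin chol chol qokedy shedy shedy shedy daiin chol chol".split()
print(words_before_reduplication(line).most_common())
```

Whether repetitions that straddle a line or paragraph break should count is a separate choice, and it may explain part of the disagreement about which words are "triggers".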
Mauro > 6 hours ago
(7 hours ago) Vuk88 Wrote: Mauro, even though I replied to you in Italian in private,
(7 hours ago) Vuk88 Wrote: I am realizing that posting this analysis in the forum in summary form could generate misunderstandings. Therefore, I am replying publicly in English, trying to make the text more conversational so that others can read it and get a more detailed idea.
Yes, I noticed that. But it takes some time to read an article, so I hoped for some quick info first.
I want to point out that I published the full paper and the raw Python scripts on Zenodo in the hope that people will grab the code and validate the numbers themselves.

(7 hours ago) Vuk88 Wrote: Regarding your first point... yes, you grasped the core concept perfectly. The Logistic Regression model flagged specific words that basically act as precursors. They constantly show up right before a double repetition. Other words do the exact opposite and mathematically inhibit them.
I know what a t-SNE map is, but I cannot make sense of your diagram. If the red-coded words are the group of words which precede a 'double repetition', why are the double words (red dots) mixed among them? And why is 'chol' not marked (see also nablator's post)? Why are there also marked words in the blue cloud?
As for the t-SNE map, think of it as an automatic organizer. If you fed this algorithm all the words of the Divine Comedy and told it to "group together words that behave similarly in a sentence", it would spontaneously create one cloud with all the verbs, another with the adjectives, and so on. What did the algorithm do with the Voynich? I gave it all the words. If the Voynich were a normal text, or a random hoax, the double words (the red dots) would be mixed in with everything else. Instead, the algorithm literally tore them away and isolated them in a cloud of their own.
(7 hours ago) Vuk88 Wrote: What does this prove? It shows that double words are not normal vocabulary repeated by mistake by a tired scribe. They belong to a completely different functional category from the rest of the text. They behave according to their own mathematical rules, almost as if they were mathematical symbols or code commands sitting in the middle of a text. And the blue dots (the triggers we just talked about) sit in the other cloud. So, it's not really me speculating here. It's pure spatial mathematics showing us that double words are structurally an "alien species" compared to the words that trigger them. They are separate entities with separate functions.
(7 hours ago) Vuk88 Wrote: Just to give you an idea of the sample size, I pulled exactly 298 consecutive double repetitions distributed across 135 pages. Obviously, the Takahashi transliteration (based on the EVA alphabet) has some illegible characters due to the condition of the manuscript, which is faded or stained, or has strange accents that seem to be scribe errors. So, when the algorithm read the manuscript, I set it up to keep only the "legible" sequences. So it's a pretty robust dataset to work on.
(7 hours ago) Vuk88 Wrote: My initial intuition was honestly pretty simple. I just wanted to look at the math behind how these double words interact. I used a metaphor in the paper that might help: if the Voynich is a building, the standard text blocks are the walls, and these double words are basically the doors. Calling them "operators" or "data blocks" is definitely me speculating, I'll admit that. But the mathematical correlations dictating where they show up are strong.
This makes more sense when you look at the Latent Semantic Analysis (LSA) study by Bowern's team. They were trying to answer one specific question: where does the text change topics? Their algorithm found these sharp transitions where the scribe jumps, say, from herbology to astronomy. My analysis asked a different question: what physically separates these sections? When I cross-referenced their semantic breaks with my syntactic data, the topic changes lined up exactly where my double words are concentrated.
To make this visually immediate, I've attached a new clean graph (lsa_vs_syntax.png). In this chart, the background colored blocks represent the different thematic sections found by the LSA algorithm. The red line represents the density of my double words. As you can see, the red spikes align almost surgically with the exact boundaries where the text changes topic.
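How "surgical" the alignment is can be checked with a simple randomization test: compare the mean distance from each density peak to its nearest section boundary against the same statistic for randomly placed peaks. The sketch below is an illustration only and not taken from the paper; the page indices are invented, the function names are hypothetical, and it assumes the red line has already been reduced to a short list of peak positions.

```python
import random

def mean_gap_to_boundary(peaks, boundaries):
    """Mean distance (in pages) from each peak to its nearest boundary."""
    return sum(min(abs(p - b) for b in boundaries) for p in peaks) / len(peaks)

def alignment_p_value(peaks, boundaries, n_pages, n_trials=10_000, seed=0):
    """Fraction of random peak placements at least as close to the
    boundaries as the observed peaks (one-sided randomization test)."""
    rng = random.Random(seed)
    observed = mean_gap_to_boundary(peaks, boundaries)
    hits = 0
    for _ in range(n_trials):
        # Place the same number of peaks on distinct random pages.
        shuffled = rng.sample(range(n_pages), len(peaks))
        if mean_gap_to_boundary(shuffled, boundaries) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_trials + 1)

# Hypothetical page indices, invented for illustration only.
boundaries = [20, 57, 84, 102]
peaks = [21, 56, 85, 101]
print(alignment_p_value(peaks, boundaries, n_pages=135))
```

A small value would suggest the peaks sit closer to the boundaries than chance placement would explain. It says nothing about why they do, and the result depends on how peaks are extracted from the density curve.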

(7 hours ago) Vuk88 Wrote: You are right to be skeptical about the whole f108 codicology stuff. I'm the first to admit that assuming the missing pages held conversion matrices is a massive speculative leap. But the underlying data isn't speculation. It's a verified mathematical fact that the absolute highest density of these double words happens right next to the missing folios.
From the perspective of medieval and Renaissance codicology, placing a decoding matrix at the end of the manuscript is historically the most logical choice. There are two established reasons for this: First, in 15th-century manuscript production, analytical indices or reference tables (the tabulae) were almost always bound in the final quire, for the trivial practical necessity that an index can only be compiled after the text is finished. (The Codebreakers: The Story of Secret Writing – David Kahn – 1967). Second, in the late Middle Ages, complex cipher systems (like nomenclators) required a physical key. Scribes and diplomats often placed these keys at the end of the codex, or kept them as a separate booklet for security reasons (think of the Tabula Recta by Trithemius, from roughly 1500). So, if the Voynich contains a structured syntax requiring a conversion table, the final folios (like the missing f109-f110) are historically exactly the place we would expect to find it. My interpretation of why it happens might be wrong. But the correlation is there in the data.
Even if my hypotheses turn out to be garbage, the mathematical structure really points to a text that acts like a technical manual or a ledger, not just prose. Other people have reached similar conclusions by looking at ink chemistry or handwriting analysis over the years. The fact that I ended up at the same conclusion purely through raw statistics tells me that a cross-disciplinary approach is probably the best way forward here. Computational data can either back up or debunk the traditional historical analysis.
Thanks again for your time.
Vuk88 > 5 hours ago
(6 hours ago) Mauro Wrote: (7 hours ago) Vuk88 Wrote: Mauro, even though I replied to you in Italian in private,
I got no incoming messages from you. Are you sure you sent it to me?
(7 hours ago) Vuk88 Wrote: I am realizing that posting this analysis in the forum in summary form could generate misunderstandings. Therefore, I am replying publicly in English, trying to make the text more conversational so that others can read it and get a more detailed idea.
Yes, I noticed that. But it takes some time to read an article, so I hoped for some quick info first.
I want to point out that I published the full paper and the raw Python scripts on Zenodo in the hope that people will grab the code and validate the numbers themselves.
(7 hours ago) Vuk88 Wrote: Regarding your first point... yes, you grasped the core concept perfectly. The Logistic Regression model flagged specific words that basically act as precursors. They constantly show up right before a double repetition. Other words do the exact opposite and mathematically inhibit them.
I know what a t-SNE map is, but I cannot make sense of your diagram. If the red-coded words are the group of words which precede a 'double repetition', why are the double words (red dots) mixed among them? And why is 'chol' not marked (see also nablator's post)? Why are there also marked words in the blue cloud?
As for the t-SNE map, think of it as an automatic organizer. If you fed this algorithm all the words of the Divine Comedy and told it to "group together words that behave similarly in a sentence", it would spontaneously create one cloud with all the verbs, another with the adjectives, and so on. What did the algorithm do with the Voynich? I gave it all the words. If the Voynich were a normal text, or a random hoax, the double words (the red dots) would be mixed in with everything else. Instead, the algorithm literally tore them away and isolated them in a cloud of their own.
(7 hours ago) Vuk88 Wrote: What does this prove? It shows that double words are not normal vocabulary repeated by mistake by a tired scribe. They belong to a completely different functional category from the rest of the text. They behave according to their own mathematical rules, almost as if they were mathematical symbols or code commands sitting in the middle of a text. And the blue dots (the triggers we just talked about) sit in the other cloud. So, it's not really me speculating here. It's pure spatial mathematics showing us that double words are structurally an "alien species" compared to the words that trigger them. They are separate entities with separate functions.
And this would be quite an interesting result. Now I don't believe (and not many do, I think) that the duplicated words are a mistake by a tired scribe, there are innumerable reasons why the text might have duplicated words, so this is not the main point. What would be interesting would be to actually demonstrate that duplicated words have a special meaning (at least statistically).
(7 hours ago) Vuk88 Wrote: Just to give you an idea of the sample size, I pulled exactly 298 consecutive double repetitions distributed across 135 pages. Obviously, the Takahashi transliteration (based on the EVA alphabet) has some illegible characters due to the condition of the manuscript, which is faded or stained, or has strange accents that seem to be scribe errors. So, when the algorithm read the manuscript, I set it up to keep only the "legible" sequences. So it's a pretty robust dataset to work on.
This, I fear, could be a problem. 298 repetitions are not many in the whole text, so I think there's the possibility the results are due to chance. For instance, according to your data 'chees' strongly suppresses word duplication. But there are only 36 'chees' in the whole text so I don't find it that surprising that none of the 36 'chees' is followed by an instance of the 298 duplicated words. I think this should be addressed.
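The size of this effect can be estimated with a back-of-the-envelope null model in which a double repetition is equally likely to follow any token. In the sketch below, 36 and 298 are the figures from the thread, while the total of 37,000 tokens is an assumed round number rather than a value from the paper, and the function name is hypothetical.

```python
def prob_never_followed(word_count, n_doubles, n_tokens):
    """P(none of `word_count` occurrences precedes a double) under a null
    model where doubles are placed independently of the preceding word."""
    p_double = n_doubles / n_tokens
    return (1.0 - p_double) ** word_count

# 36 and 298 come from the thread; 37_000 is an ASSUMED rough token count.
p = prob_never_followed(word_count=36, n_doubles=298, n_tokens=37_000)
print(f"{p:.2f}")
```

Under these assumptions the probability comes out at roughly three in four, so a word with 36 occurrences never preceding a double is what chance alone would usually produce. A word would need several hundred occurrences before zero hits became surprising.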
By the way, what do you mean exactly by 'words preceding a duplicated word'? The word immediately preceding the duplicated pair?
(7 hours ago) Vuk88 Wrote: My initial intuition was honestly pretty simple. I just wanted to look at the math behind how these double words interact. I used a metaphor in the paper that might help: if the Voynich is a building, the standard text blocks are the walls, and these double words are basically the doors. Calling them "operators" or "data blocks" is definitely me speculating, I'll admit that. But the mathematical correlations dictating where they show up are strong.
This makes more sense when you look at the Latent Semantic Analysis (LSA) study by Bowern's team. They were trying to answer one specific question: where does the text change topics? Their algorithm found these sharp transitions where the scribe jumps, say, from herbology to astronomy. My analysis asked a different question: what physically separates these sections? When I cross-referenced their semantic breaks with my syntactic data, the topic changes lined up exactly where my double words are concentrated.
To make this visually immediate, I've attached a new clean graph (lsa_vs_syntax.png). In this chart, the background colored blocks represent the different thematic sections found by the LSA algorithm. The red line represents the density of my double words. As you can see, the red spikes align almost surgically with the exact boundaries where the text changes topic.
This is interesting, but it'll have to wait until tomorrow before I can reason on it, sorry.
(7 hours ago) Vuk88 Wrote: You are right to be skeptical about the whole f108 codicology stuff. I'm the first to admit that assuming the missing pages held conversion matrices is a massive speculative leap. But the underlying data isn't speculation. It's a verified mathematical fact that the absolute highest density of these double words happens right next to the missing folios.
From the perspective of medieval and Renaissance codicology, placing a decoding matrix at the end of the manuscript is historically the most logical choice. There are two established reasons for this: First, in 15th-century manuscript production, analytical indices or reference tables (the tabulae) were almost always bound in the final quire, for the trivial practical necessity that an index can only be compiled after the text is finished. (The Codebreakers: The Story of Secret Writing – David Kahn – 1967). Second, in the late Middle Ages, complex cipher systems (like nomenclators) required a physical key. Scribes and diplomats often placed these keys at the end of the codex, or kept them as a separate booklet for security reasons (think of the Tabula Recta by Trithemius, from roughly 1500). So, if the Voynich contains a structured syntax requiring a conversion table, the final folios (like the missing f109-f110) are historically exactly the place we would expect to find it. My interpretation of why it happens might be wrong. But the correlation is there in the data.
Even if my hypotheses turn out to be garbage, the mathematical structure really points to a text that acts like a technical manual or a ledger, not just prose. Other people have reached similar conclusions by looking at ink chemistry or handwriting analysis over the years. The fact that I ended up at the same conclusion purely through raw statistics tells me that a cross-disciplinary approach is probably the best way forward here. Computational data can either back up or debunk the traditional historical analysis.
Thanks again for your time.
However we don't know how the VMS was originally bound (there is also a study by Lisa Fagin Davis suggesting the VMS was never meant to be bound), so it's hard to be sure how to interpret your observation on duplicated words vs. missing folios (which is, nonetheless, an interesting observation).