08-01-2024, 09:41 AM
At the Voynich 2022 conference, Massimiliano Zattera presented an intriguing paper with the title “A New Transliteration Alphabet Brings New Evidence of Word Structure and Multiple "Languages" in the Voynich Manuscript”.
Zattera’s main objective, as I understood it, was to demonstrate that most of the “words” in the Voynich manuscript conformed to what he called the “slot alphabet”. This alphabet was based on the following concepts:
- twelve “slots”, which Zattera numbered 0 to 11;
- twenty-six distinct glyphs, based on the EVA transliteration (Zattera ignored the profuse variations and “rare glyphs” that occur in the v101 transliteration);
- the allocation of each glyph to one or more slots;
- a rule that in most cases, a given glyph could occupy only one slot; in a few cases it could be in either of two slots; and in one case (EVA-d or v101-8) it could be in any of three slots.
In this post, my interest is in what Zattera called “separable words”. By this he meant “words” which could be divided into two parts, each of which was a Voynich “word”. In his paper, he stated that “separable words” accounted for 10.4% of the text, and for 37.1% of the vocabulary.
For example, the v101 “word” {2coehcc89} (which occurs only once) can be separated into two parts: {2coe} which occurs 50 times as a “word”, and {hcc89} with 18 occurrences as a “word”. This is not the only way to separate the “word”: {2co} occurs 26 times and {ehcc89} occurs 14 times. In such cases we might conjecture that there is a word break, but it is not a space; it is invisible, or implied (as in "battlefield" or any other compound word in English).
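The check for a single “word” can be sketched in a few lines of Python. This is only an illustration, not Zattera's procedure: the function name `two_part_splits` is made up for this post, each character is assumed to be one v101 glyph, and the counts are just the figures quoted above rather than a full count of the transliteration.

```python
from collections import Counter

def two_part_splits(word, counts):
    """Return every way to cut `word` into two parts that both occur as "words"."""
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        # Both halves must be attested as "words" in their own right.
        if counts.get(left, 0) > 0 and counts.get(right, 0) > 0:
            splits.append((left, right))
    return splits

# Toy counts copied from the example above; a real run would count
# every "word" in the transliteration file.
counts = Counter({"2coe": 50, "hcc89": 18, "2co": 26, "ehcc89": 14, "2coehcc89": 1})

print(two_part_splits("2coehcc89", counts))
# [('2co', 'ehcc89'), ('2coe', 'hcc89')]
```

A “word” with an empty result is not separable in this sense; a “word” with several results, like this one, leaves the position of the implied break open.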
Zattera did not provide a list of the separable words that he had identified. But we can have a shot at doing so.
I approached this task with several basic assumptions, as follows:
- that, rather than use the v101 transliteration with its many variations, I should use my own v211 transliteration, which aggregates several v101 glyphs that seem to be similar, such as {1} and {2}, and also disaggregates some v101 glyphs that seem to be glyph strings, such as {m} and {n};
- that I should regard certain glyphs as having different meanings (or mappings) in the initial, interior, final and isolated positions; in other words, for example, an initial v101 {9} was not the same as a final v101 {9};
- that the most probable building blocks of Zattera’s “separable words” were the most frequent “words”, that is, glyph strings delimited by spaces or by line breaks;
- that, rather than use the whole Voynich manuscript with its (probable) multiple languages, I should work with a single thematic section that seemed to have a homogeneous language.
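The second assumption, that a glyph may mean different things in different positions, can be made concrete with a small sketch. The function name `tag_positions` and the position labels are illustrative only; the v211 transliteration uses its own symbols for position-dependent glyphs, and the sketch assumes one character per glyph.

```python
def tag_positions(word):
    """Label each glyph of a "word" as isolated, initial, interior or final."""
    if len(word) == 1:
        return [(word, "isolated")]
    tagged = []
    last = len(word) - 1
    for i, glyph in enumerate(word):
        if i == 0:
            position = "initial"
        elif i == last:
            position = "final"
        else:
            position = "interior"
        tagged.append((glyph, position))
    return tagged

print(tag_positions("8am"))
# [('8', 'initial'), ('a', 'interior'), ('m', 'final')]
print(tag_positions("s"))
# [('s', 'isolated')]
```

Treating each (glyph, position) pair as a separate symbol is what allows an initial {9} to be counted and mapped independently of a final {9}.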
To this end, I selected the “herbal” section as defined by René Zandbergen. This section is visually identifiable by the full-page drawings of plants. I take no position on whether the text is related to the drawings; however, the text appears to have a uniform vocabulary which is different from that in other parts of the manuscript.
In the v211 transliteration, the "herbal" section has 12,679 “words” with an average “word” length of 3.69 glyphs. The vocabulary consists of 3,213 distinct “words”. The ten most common “words”, plus the isolated {s} which may or may not be a “word”, are as follows:
[attachment=8077]
Here, it seemed noteworthy that the text did not have a brilliant fit with Zipf’s Law. The correlation with the expected Zipf sequence (for a natural language) was just 81.9 percent.
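For readers who want to try a comparable test, here is a minimal sketch. It rests on an assumption, since this post does not spell out the calculation: the “expected Zipf sequence” is taken to be the top frequency divided by the rank (f1, f1/2, f1/3, ...), and the fit is measured as the Pearson correlation between the observed and expected frequencies. The function names are made up for the illustration.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # undefined if either sequence is constant

def zipf_correlation(words, top_n=None):
    """Correlate observed "word" frequencies with an ideal f1/rank sequence."""
    observed = sorted(Counter(words).values(), reverse=True)
    if top_n is not None:
        observed = observed[:top_n]
    expected = [observed[0] / rank for rank in range(1, len(observed) + 1)]
    return pearson(observed, expected)

# Toy text whose frequencies (12, 6, 4, 3) follow f1/rank exactly.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(round(zipf_correlation(toy), 6))
# 1.0
```

Other choices (a log-log regression, or a different cut-off for `top_n`) would give different percentages, so figures from this sketch are not directly comparable with the ones quoted in this post.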
The first step in the process was to open the v211 transliteration in Microsoft Word and, using the “find-and-replace” function, replace each of the top ten “words” with the same string preceded and followed by a space. It seemed advisable to leave the isolated {s} alone, since it was already a single-glyph “word”. This step converted all occurrences of the selected strings to “words”. For example, any occurrence of {8am}, even when interior to a glyph string, became the “word” {8am}. This step produced a new transliteration, with a smaller vocabulary.
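The same find-and-replace step can be reproduced outside Microsoft Word. The following is a rough Python equivalent, not the exact procedure used here: the function name `separate_words` is hypothetical, the strings are replaced one after another in the order given, and runs of spaces are collapsed afterwards.

```python
def separate_words(text, top_words):
    """Put a space before and after every occurrence of each string in top_words."""
    for w in top_words:
        text = text.replace(w, f" {w} ")
    # Collapse the extra spaces line by line, so that line breaks are kept.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

sample = "o8am 8am x8amy"
print(separate_words(sample, ["8am"]))
# o 8am 8am x 8am y
```

One caveat: the order of replacement matters when one frequent “word” is contained in another. Once the shorter string has been split off, the longer one can no longer be found, so replacing longer strings first is probably the safer order.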
The second step was to make alternative conversions of frequent glyph strings to “words”, as follows:
- the top ten “words”, including the isolated {s};
- the top twenty “words”, excluding isolated glyphs {s}, {9} and {y}
- the top thirty “words”, excluding isolated glyphs {s}, {9} and {y}.
The results of the first and second steps are shown below.
[attachment=8079]
Of these tests, the one that gave the best fit with Zipf’s Law was the first, in which the top ten “words” (whether occurring as strings or as “words”, but excluding the isolated {s}) were converted to “words”. Here the correlation with the Zipf sequence was 98.8 percent.
Thus, it appears that by converting the top ten glyph strings to “words”, but ignoring single-glyph “words”, we can identify a vocabulary which is highly consistent with Zipf’s Law. This conversion made the “herbal” section resemble much more closely a text in a natural language.
The first three lines
We can illustrate the process of word separability by applying it to an extract from the text of the “herbal” section: for example, the first three lines on the first page of the section (f001v). In the v101 transliteration, these lines are as follows:
[attachment=8075]
In the v211 transliteration with word separability, these three lines become as follows:
[attachment=8078]
Vowel recognition
The next step was to identify vowels, using the Sukhotin algorithm as implemented by Dr Mans Hulden’s Python code. In the v101 transliteration, the algorithm identifies the following v101 glyphs as the five most probable vowels, in descending order of probability:
{o},{a},{9},{c},{A}.
The Sukhotin algorithm identifies vowels and consonants by the frequency with which they alternate. In the v211 transliteration with position-dependent glyphs and "word" separability, the "word" breaks have changed; so, not all the glyphs have the same neighbors as before. Therefore, the vowel-glyphs will not necessarily be the same. Applied to the “herbal” section, the five most probable vowels are now identified as follows:
interior o [ô], interior a [â], interior 1 [1], final 9 [⁹], initial o [ó].
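For anyone unfamiliar with the Sukhotin algorithm, here is a compact sketch of the textbook version. It is an independent illustration and not Dr Hulden's code. Each “word” may be a string (one character per glyph) or a list of symbols, which is how position-dependent glyphs could be fed in.

```python
from collections import defaultdict

def sukhotin(words):
    """Return the symbols classified as vowels, most vowel-like first."""
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for w in words:
        symbols.update(w)
        for a, b in zip(w, w[1:]):
            if a != b:  # the diagonal of the adjacency matrix stays at zero
                adj[a][b] += 1
                adj[b][a] += 1
    # Every symbol starts as a consonant, scored by its total adjacency count.
    sums = {s: sum(adj[s].values()) for s in symbols}
    vowels = []
    while True:
        consonants = [s for s in symbols if s not in vowels]
        if not consonants:
            break
        # Highest score wins; ties are broken by symbol for repeatability.
        best = max(consonants, key=lambda s: (sums[s], s))
        if sums[best] <= 0:
            break
        vowels.append(best)
        # Neighbours of the new vowel become less vowel-like.
        for s in consonants:
            if s != best:
                sums[s] -= 2 * adj[s][best]
    return vowels

print(sukhotin(["baba", "dada", "bada"]))
# ['a']
```

Because the scores depend only on which symbols stand next to each other inside a “word”, moving the “word” breaks changes the adjacency counts, and that is why the vowel list can differ after separation.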
In a natural language written in the Latin alphabet, especially a medieval European language, we generally expect only five vowel letters. If so, we may deem that these glyphs in all other positions, and all other glyphs, represent consonants in the presumed precursor languages.
Mapping to precursor languages
The logical next step is to select a chunk of the “herbal” section, or alternatively the top ten “words” of the “herbal” section, and map the glyphs to letters in potential precursor languages, as follows:
- vowel-glyphs to vowels in the precursor language, in order of frequency;
- consonant-glyphs to consonants in the precursor language, in order of frequency;
- as for precursor languages, to my mind we have nothing to lose by starting with medieval Italian and medieval Latin.
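The rank-for-rank mapping in the list above can be sketched as follows. The function name `rank_mapping` is hypothetical, and the target letter order in the example is a placeholder rather than a measured frequency order for medieval Italian or Latin; a real test would need letter frequencies counted from a suitable corpus.

```python
from collections import Counter

def rank_mapping(glyph_stream, glyph_set, target_letters):
    """Pair the glyphs in glyph_set with target_letters, both in frequency order."""
    counts = Counter(g for g in glyph_stream if g in glyph_set)
    ranked = [glyph for glyph, _ in counts.most_common()]
    # zip stops at the shorter sequence, so surplus glyphs stay unmapped.
    return dict(zip(ranked, target_letters))

# Toy stream: 'o' occurs four times, 'a' twice, '9' once; 'k' is not a vowel-glyph.
stream = list("okaoao9o")
placeholder_vowel_order = "eai"  # placeholder only, not a real frequency ranking
print(rank_mapping(stream, {"o", "a", "9"}, placeholder_vowel_order))
# {'o': 'e', 'a': 'a', '9': 'i'}
```

The same function would be called a second time with the consonant-glyphs and a consonant ordering for the chosen precursor language.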
This step requires a detailed recalculation of glyph frequencies in the word-separated Voynich text: not because the gross frequencies have changed, but because the position-dependent frequencies have changed. For example, within the separated words, some interior glyphs may have become initial or final glyphs. More on this in another post.