The Voynich Ninja

Full Version: Separable "words"
At the Voynich 2022 conference, Massimiliano Zattera presented an intriguing paper with the title “A New Transliteration Alphabet Brings New Evidence of Word Structure and Multiple "Languages" in the Voynich Manuscript”.

Zattera’s main objective, as I understood it, was to demonstrate that most of the “words” in the Voynich manuscript conformed to what he called the “slot alphabet”. This alphabet was based on the following concepts:
  • twelve “slots”, which Zattera numbered 0 to 11;
  • twenty-six distinct glyphs, based on the EVA transliteration (Zattera ignored the profuse variations and “rare glyphs” that occur in the v101 transliteration);
  • the allocation of each glyph to one or more slots;
  • a rule that in most cases, a given glyph could occupy only one slot; in a few cases it could be in either of two slots; and in one case (EVA-d or v101-8) it could be in any of three slots.
Zattera’s principal thesis was that in a Voynich “word”, each glyph could only be preceded by a glyph in a lower “slot”, and could only be followed by a glyph in a higher “slot”. He demonstrated that this was the case in 86.6 percent of all the “words” in the manuscript.
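
For anyone who wants to experiment with the slot rule, here is a minimal Python sketch. The glyph-to-slot table is a made-up toy, not Zattera's actual allocation (which is not reproduced in this post), and the names `TOY_SLOTS` and `fits_slots` are hypothetical. It assumes one character per glyph. Because a glyph may be allowed in more than one slot, the check always takes the lowest allowed slot above the previous one, which leaves the most room for the glyphs that follow.

```python
# Toy allocation for illustration only; NOT Zattera's real slot table.
TOY_SLOTS = {
    "q": {0},
    "o": {1, 9},
    "k": {3},
    "e": {5},
    "d": {2, 8, 10},   # one glyph allowed in three slots, as described for EVA-d
    "y": {11},
}

def fits_slots(word, slots_of=TOY_SLOTS):
    """Return True if the glyphs of `word` can be placed in strictly rising slots."""
    previous = -1
    for glyph in word:
        # Slots this glyph may occupy that lie above the last slot used.
        candidates = [s for s in slots_of.get(glyph, ()) if s > previous]
        if not candidates:
            return False
        previous = min(candidates)
    return True

def share_conforming(words, slots_of=TOY_SLOTS):
    """Fraction of word tokens that satisfy the slot ordering."""
    return sum(fits_slots(w, slots_of) for w in words) / len(words)
```

With a real slot table and a real word list, `share_conforming` would give a figure comparable in kind to the 86.6 percent quoted above, though the exact value depends on the transliteration used.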

In this post, my interest is in what Zattera called “separable words”. By this he meant “words” which could be divided into two parts, each of which was a Voynich “word”. In his paper, he stated that “separable words” accounted for 10.4% of the text, and for 37.1% of the vocabulary.

For example, the v101 “word” {2coehcc89} (which occurs only once) can be separated into two parts: {2coe} which occurs 50 times as a “word”, and {hcc89} with 18 occurrences as a “word”. This is not the only way to separate the “word”: {2co} occurs 26 times and {ehcc89} occurs 14 times. In such cases we might conjecture that there is a word break, but it is not a space; it is invisible, or implied (as in "battlefield" or any other compound word in English).
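
The search for such splits is easy to sketch in Python. The snippet below is only an illustration under two assumptions: each character stands for one glyph, and the vocabulary is available as a word-to-count table. The function name `two_part_splits` is hypothetical, and the counts are the four quoted in the example above.

```python
def two_part_splits(word, counts):
    """List every way to cut `word` into two parts that both occur as words."""
    found = []
    for cut in range(1, len(word)):
        left, right = word[:cut], word[cut:]
        if left in counts and right in counts:
            found.append((left, right))
    return found

# Occurrence counts quoted in the example above (v101 transliteration).
counts = {"2coe": 50, "hcc89": 18, "2co": 26, "ehcc89": 14}

for left, right in two_part_splits("2coehcc89", counts):
    print(left, counts[left], right, counts[right])
```

Run over the whole vocabulary, this kind of loop would produce a candidate list of separable "words", although not necessarily the same list that Zattera had in mind.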

Zattera did not provide a list of the separable words that he had identified. But we can have a shot at doing so.

I approached this task with several basic assumptions, as follows:
  • that, rather than use the v101 transliteration with its many variations, I should use my own v211 transliteration, which aggregates several v101 glyphs that seem to be similar, such as {1} and {2}, and also disaggregates some v101 glyphs that seem to be glyph strings, such as {m} and {n};
  • that I should regard certain glyphs as having different meanings (or mappings) in the initial, interior, final and isolated positions; in other words, for example, an initial v101 {9} was not the same as a final v101 {9};
  • that the most probable building blocks of Zattera’s “separable words” were the most frequent “words”, that is, glyph strings delimited by spaces or by line breaks;
  • that, rather than use the whole Voynich manuscript with its (probable) multiple languages, I should work with a single thematic section that seemed to have a homogeneous language.

To this end, I selected the “herbal” section as defined by René Zandbergen. This section is visually identifiable by the full-page drawings of plants. I take no position on whether the text is related to the drawings; however, the text appears to have a uniform vocabulary which is different from that in other parts of the manuscript.

In the v211 transliteration, the "herbal" section has 12,679 “words” with an average “word” length of 3.69 glyphs. The vocabulary consists of 3,213 distinct “words”. The ten most common “words”, plus the isolated {s} which may or may not be a “word”, are as follows:

[attachment=8077]
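
Counts of this kind can be reproduced from any plain-text transliteration. The following is a rough Python sketch, assuming that "words" are delimited by spaces or line breaks and that each character is one glyph; the function name `summarize` and the sample string are made up for illustration.

```python
from collections import Counter

def summarize(text, top_n=10):
    """Token count, vocabulary size, mean word length and the most common words."""
    tokens = text.split()          # splits on spaces and on line breaks
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "mean_length": sum(len(t) for t in tokens) / len(tokens),
        "top": counts.most_common(top_n),
    }

# Made-up sample, not real manuscript text.
sample = "8am oe 8am\n1oe s"
print(summarize(sample, top_n=3))
```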

Here, it seemed noteworthy that the text did not have a brilliant fit with Zipf’s Law. The correlation with the expected Zipf sequence (for a natural language) was just 81.9 percent.
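
One way to compute such a correlation is sketched below in Python. This is an assumption about the method, since the calculation is not spelled out here: it takes the Pearson correlation between the observed frequencies of the top-ranked "words" and an ideal Zipf sequence in which the word at rank r has frequency f1/r. The function names are hypothetical, and the result will not necessarily reproduce the percentages quoted in this post.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def zipf_correlation(tokens, top_n=10):
    """Correlate the top_n observed frequencies with the ideal sequence f1/rank."""
    observed = [count for _, count in Counter(tokens).most_common(top_n)]
    expected = [observed[0] / rank for rank in range(1, len(observed) + 1)]
    return pearson(observed, expected)
```

The sketch needs at least two distinct frequencies among the top-ranked words; a completely flat list would give a division by zero.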

The first step in the process was to take the v211 transliteration in Microsoft Word, and using the “find-and-replace” function, replace each of the top ten “words” with the same string preceded and followed by a space. It seemed advisable to leave v101-s {s} alone, since it was already an isolated (single-glyph) “word”. This step converted all occurrences of the selected strings to “words”. For example, any occurrence of {8am}, even when interior to a glyph string, became the “word” {8am}. This step produced a new transliteration, with a smaller vocabulary.
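
For readers who prefer a script to Microsoft Word, the same find-and-replace step can be imitated in Python. This is a sketch of one possible reading of the procedure, not an exact record of it: the function name `separate` is hypothetical, each character is assumed to be one glyph, and the strings are replaced one after another in the order given. When two of the selected strings overlap, the outcome depends on that order.

```python
def separate(text, frequent_words):
    """Pad every occurrence of each frequent string with spaces, then tidy up."""
    for word in frequent_words:
        text = text.replace(word, f" {word} ")
    # Collapse the runs of spaces created above, keeping the line breaks.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

# Made-up sample line, not real manuscript text.
print(separate("o8am 8am1oe", ["8am", "1oe"]))
```

Here the string {8am} inside {o8am} and inside {8am1oe} becomes a free-standing "word", so the vocabulary of the sample shrinks from two distinct strings to three short ones that recur.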

The second step was to make alternative conversions of frequent glyph strings to “words”, as follows:
  • the top ten “words”, including the isolated {s}
  • the top twenty “words”, excluding isolated glyphs {s}, {9} and {y}
  • the top thirty “words”, excluding isolated glyphs {s}, {9} and {y}.

The results of the first and second steps are shown below.

[attachment=8079]

Of these tests, the one which gave the best fit with Zipf’s Law was the first, in which the top ten “words” (whether occurring as strings or as “words”, but excluding the isolated {s}) were converted to “words”. Here the correlation with the Zipf sequence was 98.8 percent.

Thus, it appears that by converting the top ten glyph strings to “words”, but ignoring single-glyph “words”, we can identify a vocabulary which is highly consistent with Zipf’s Law. This conversion made the “herbal” section resemble much more closely a text in a natural language.

The first three lines

We can illustrate the process of word separability by applying it to an extract from the text of the “herbal” section: for example, the first three lines on the first page of the section (f001v). In the v101 transliteration, these lines are as follows:

[attachment=8075]

In the v211 transliteration with word separability, these three lines become as follows:

[attachment=8078]

Vowel recognition

The next step was to identify vowels, using the Sukhotin algorithm as implemented by Dr Mans Hulden’s Python code. In the v101 transliteration, the algorithm identifies the following v101 glyphs as the five most probable vowels, in descending order of probability:

{o},{a},{9},{c},{A}.

The Sukhotin algorithm identifies vowels and consonants by the frequency with which they alternate. In the v211 transliteration with position-dependent glyphs and "word" separability, the "word" breaks have changed; so, not all the glyphs have the same neighbors as before. Therefore, the vowel-glyphs will not necessarily be the same. Applied to the “herbal” section, the five most probable vowels are now identified as follows:

interior o [ô], interior a [â], interior 1 [1], final 9 [⁹], initial o [ó].
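
For reference, the core of Sukhotin's method fits in a few lines of Python. The sketch below is a generic textbook-style version and is not Dr Hulden's code. Each "word" can be a string (one character per glyph) or a list of glyph labels, which is how position-dependent glyphs such as "interior o" could be fed in. The function name `sukhotin` is hypothetical.

```python
from collections import defaultdict

def sukhotin(words):
    """Return glyphs classified as vowels, in the order the algorithm picks them."""
    # Symmetric table of how often two different glyphs stand next to each other.
    adjacent = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:
                adjacent[a][b] += 1
                adjacent[b][a] += 1
    # Every glyph starts as a consonant, scored by its total adjacency count.
    score = {glyph: sum(row.values()) for glyph, row in adjacent.items()}
    vowels = []
    while score:
        best = max(score, key=score.get)
        if score[best] <= 0:
            break
        vowels.append(best)
        del score[best]
        # Glyphs that often sit next to the new vowel become less vowel-like.
        for glyph in score:
            score[glyph] -= 2 * adjacent[glyph][best]
    return vowels

# English-like toy input, only to show the call.
print(sukhotin(["banana", "kata", "tomato"]))
```

The algorithm stops when no remaining glyph has a positive score, so the number of vowels it returns is not fixed in advance; taking the first five is a separate choice.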

In a natural language, especially a medieval European language, we generally expect only five vowels. If so, we may deem that these glyphs in all other positions, and all other glyphs, represent consonants in the presumed precursor languages.

Mapping to precursor languages

The logical next step is to select a chunk of the “herbal” section, or alternatively the top ten “words” of the “herbal” section, and map the glyphs to letters in potential precursor languages, as follows:
  • vowel-glyphs to vowels in the precursor language, in order of frequency;
  • consonant-glyphs to consonants in the precursor language, in order of frequency;
  • as for precursor languages, to my mind we have nothing to lose by starting with medieval Italian and medieval Latin.

This step requires a detailed recalculation of glyph frequencies in the word-separated Voynich text: not because the gross frequencies have changed, but because the position-dependent frequencies have changed. For example, within the separated words, some interior glyphs may have become initial or final glyphs. More on this in another post.
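
The recalculation and the frequency-rank mapping can be sketched in Python as follows. Everything here is illustrative: the function names are hypothetical, each character is assumed to be one glyph, and the list of target letters is a placeholder that would have to be replaced by a measured frequency ranking for the chosen precursor language.

```python
from collections import Counter

def positional_counts(words):
    """Count each glyph separately as initial, interior, final or isolated."""
    counts = Counter()
    for word in words:
        last = len(word) - 1
        for i, glyph in enumerate(word):
            if last == 0:
                position = "isolated"
            elif i == 0:
                position = "initial"
            elif i == last:
                position = "final"
            else:
                position = "interior"
            counts[(glyph, position)] += 1
    return counts

def rank_map(glyph_counts, letters_by_frequency):
    """Pair glyphs with letters, most frequent with most frequent."""
    ranked = [glyph for glyph, _ in glyph_counts.most_common()]
    return dict(zip(ranked, letters_by_frequency))

# Separation changes positions: the {8} in {o8am} is interior, but initial in {8am}.
before = positional_counts(["o8am"])
after = positional_counts(["o", "8am"])
print(before[("8", "interior")], after[("8", "initial")])
```

In practice `rank_map` would be called twice, once for the vowel-glyphs and once for the consonant-glyphs, each with its own ranked letter list.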
Hi
Here I attach a little work on the manuscript
I hope you enjoy it

Regards
CS
I think both you and Zattera overestimate the complexity of the Voynich Manuscript. The problem is in the transliteration alphabet. While the EVA alphabet can generate many readable, meaningful words in Slovenian, it needs improvement, because some letters, such as u, z, v, p, f, q, n, m, are misidentified, and some, like w, are missing. Since these letters have high frequency in the VM text, they could have a great effect on computer analysis.
The Sukhotin algorithm also missed the vowel u. In Slavic, there were also a few consonants that had in certain words a vowel-like property, such as w, y, l and r.
With a proper transliteration alphabet the phonetic Slovenian writing is quite readable; however, many vowels are still missing. The reason is that all semi-vowels, for which the Old Church Slavonic Glagolitic script had separate letters, were dropped, since the Latin alphabet had no equivalent letters for them.
It was also pointed out by several researchers that the letters are difficult to differentiate because of the similarity of their shapes, or unclear writing.

Considering the grammatical rules as a CODE, the first rule would be to insert the missed semi-vowels. This would require the knowledge of Slovenian language for the words to be similar to the medieval Slovenian, where the non-written semi-vowels were already replaced with full vowels, although there is no distinction between different pronunciation of the vowels.

The second rule would be to de-compose the so-called word blocks which were characteristic of Slavic writing up to the invention of printing presses. The word blocks look like words, but they are composed of two or three short words pronounced together as one word. An example of that is the first word in the text you use - the EVA KCHSY. This is a frequently used phrase in Slovenian dialectal speech, meaning 'if you did', which sounds quite strange in English. After inserting vowels (KICHESY) we get a word block composed of three words:
KI ČE SI (in contemporary Slovenian alphabet). SI in this case is an auxiliary verb that relates to the verb DAIW (daiu - now: dajal/dajau).
Because the letter z is not properly identified with EVA, although it looks like Latin z, the next word is read as CHY, although it looks more like OZI (OŽI - (R)OŽI, dative of 'flower').
EVA  daiin could be read as DAM or DAIW (give - gave, singular.)  
OL is Slovenian for 'oil'. The word OL is repeated again for the emphasis. 
EVA TCHEY looks more like TOUY (your, yours), since the Latin  U is very clearly written. 
 
EVA  CHAR means 'charm', 'incantation', and even if the word is read as OZAR (vision) it would be fitting to the verb SVECHARAIL. 
The last word in the line contains a ligature which I read as SVCH (SVČ), which is a root for light, illumination, blessing. SVEČAR was an Old Slovenian word, before it was changed to SVEČENIK (a pagan priest). The suffix -ail could be cognate to English 'do', 'made'.
Translation: IF YOU WERE GIVING TO FLOWER OIL - OIL (will) YOUR VISION (CHARM) MAKE DIVINE.

There you have the reading of the first line. There is no need to search for abbreviated, Vulgar, or coded Latin. There are some good Slovenian grammar books in Latin, German or Slovenian that explain the rules and even show the diacritic markers over the vowels. It took me a while to get digital copies of those books.
Dear Cvetka,

this is not so much a matter of complexity.
One may argue that human written language is by itself complex, and each language has its own different complexities.
The Voynich writing is perhaps not even more complex. I think one could argue that it is less complex than written Slavonic languages.

The main point is not whether it is more or less complex, but that it is very different.

What Massimiliano Zattera and yourself have in common is a misunderstanding of the meaning of a transliteration alphabet. It is not meant to be a translation. But never mind that point.

The Voynich MS text becomes complex only as soon as one tries to map it to any known language.

Zattera's proposed analysis method also isn't complex, though it may be confusing for people with a more arts/humanities based background.
M. Zattera may have over-fitted the data (his transliteration) with his 12-slot model to match as many words as possible, i.e. minimize the number of "separable" words. There is a visible wrap-around effect (slots 7-11 have 6 glyphs in common with slots 0-2) resulting in multiple possibilities of re-parsing lines into words with the same sequence of slots.

If Voynichese has been intentionally obfuscated, with spaces removed and inserted semi-randomly, or just carelessly copied with random-sized spaces as it seems, re-parsing lines with a simpler, unambiguous slot sequence may be a better choice, and an interesting experiment to try with some statistics that rely on word identification like word pair statistics, type-token ratio, etc.
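
As a starting point for that kind of experiment, the two statistics named above are simple to compute. This is a minimal Python sketch with hypothetical function names; it assumes each line is already split into "words" by spaces, so the same functions can be run on the original parsing and on any re-parsed version for comparison.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Number of distinct words divided by the total number of words."""
    return len(set(tokens)) / len(tokens)

def word_pairs(lines):
    """Count adjacent word pairs, without pairing across line breaks."""
    pairs = Counter()
    for line in lines:
        tokens = line.split()
        pairs.update(zip(tokens, tokens[1:]))
    return pairs

# Made-up lines, only to show the calls.
lines = ["ol ol dar", "ol dar"]
tokens = " ".join(lines).split()
print(type_token_ratio(tokens), word_pairs(lines).most_common(1))
```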

---

Making a direct transcription to Italian (like Csan99, in the post above) or semi-direct with additional steps toward whatever language anyone prefers (Nahuatl, Latin, Farsi, Turkic, to name a few languages in recent "solutions") is possible, but nobody cares anymore, because it's been done a hundred times already and it has become increasingly frequent with the ease of generating a seemingly meaningful translation from a word salad with AI. All natural language theories have zero credibility anyway because they don't solve any of the real, objectively demonstrable problems.
Quote:In this post, my interest is in what Zattera called “separable words”. By this he meant “words” which could be divided into two parts, each of which was a Voynich “word”

That's an interesting observation. I noticed it myself too, although without using any statistics.

Most Voynich words (or "vords" as some people call them) have a very rigid structure:

beginning part - middle part - end part

You can skip some parts so you can have for example "middle-end" word or just "end" word. But you can't have "middle-middle" for example.

Voynich words are generally short, especially if we consider stuff like "aiiin" as a single symbol. But there are some longer words which indeed seem to have the structure:

beginning-middle-end-beginning-middle-end

My guess is that Voynich uses some method that makes it possible to encode, let's say, 3 letters in a row. If you want to encode more, then you have to repeat it, which gives this form to longer words.

And another guess - words are very radically abbreviated. It works well with common words but stuff like names of plants cannot be abbreviated because you wouldn't guess what they mean. So longer words are things like plant names.

Please notice that the longest words are also the rarest words. It would make perfect sense if some plant is mentioned only on its own page.
(12-01-2024, 08:30 AM) ReneZ wrote: What Massimiliano Zattera and yourself have in common is a misunderstanding of the meaning of a transliteration alphabet
Hi, Rene, thank you for responding. Ever since you explained to me that EVA was only a transliteration alphabet, I have been studying it in order to see why I recognized so many Slovenian words when using EVA as a transcription alphabet. The reason is simple: there are enough Slovenian words that can be transcribed into Voynich using EVA transcription, because for many letters, the transliteration and translation alphabets overlap. And when the Slovenian transcription alphabet overlaps with Latin, the EVA letters could be considered useful for transcription and translation from Voynich to Slovenian.
An example of all three alphabets (EVA transliteration, transcription, translation) is the word DAR, which is transliterated the same way with the Slovenian transliteration alphabet into the Latin word DAR, which happens to be the same transcription for the Slovenian word DAR, which means 'gift, offering'.
Adding a suffix y , we get the word DARY (EVA transliteration into Latin, which is also Slovenian transliteration into Latin alphabet). Since some medieval Slovenians used exactly the same spelling DARY for the plural of DAR, Y could also be used for transcription into medieval Slovenian. However, with the appearance of letter-shape j, the use of y in Slovenian medieval writing was abandoned. Instead of y, the letters i or j (in some cases ij or ji) were used.
The combination of EVA words DAL DAR (transliterated the same way with Slovenian transliteration alphabet) means 'gave offering', 'gave gift'. That means, that EVA L can also be considered proper letter for transliteration, transcription into Latin letters, and transcription into Slovenian from Latin, and consequently for the translation.
This means that the words DAL DAR could be just as well written in numbers or any different signs, as long as they are consistently used. I did not see the EVA equivalent of the VM glyph that looks exactly like Latin u. While it is true that any symbol (letter, number, or shape) can be used for transliteration, when there is no distinction between the VM letter shape that looks like u and ee, and consequently between eu and eee, or uu and eeee, this gets problematic. In this case, the insufficient transliteration only affects the number of vowels. Those who regard EVA e as their transliteration letter c would end up with the same incorrect number of consonants: cc for EVA ee, ccc for EVA eee, cccc for EVA eeee. In this case, the number of consonants would increase.
For me, it would not make any difference if the Voynich U (with a solid curved connecting line at the bottom) were designated as EVA +, $, or whatever, as long as it would be distinct. In my transliteration, I can see over 6100 u-like shapes. I was also not able to find an equivalent for W.
I suppose all that reference to DAIN, DAIIN and similarity of the Voynich glyphs to Latin letters is confusing, since our brain tends to make sense out of nonsense, and compares EVA to Latin equivalents. The closer these equivalents are to Latin alphabet, the closer the transliteration alphabet overlaps with transcription alphabet into Latin letters. Since Latin letters were used in most medieval European languages, transcription into Latin should be the first step that would enable us to recognize the writing convention - the adaptation of Latin letters to different vernacular languages. The Voynich alphabet is unique for Slovenian sounds. It was written in expectation of Church reforms and the use of vernacular letters in liturgy, and it ended with the invention of the printing presses, since printing presses were adopted for Italian, German or Hungarian. Slovenian medieval books are written in all three of those writing conventions, that is why it is almost impossible to get comparative text, particularly when computer analysis is based on contemporary Slovenian articles from Wikipedia.

What I am doing in my research is two-fold: My transliteration is matched with ZL transliteration, to avoid self-bias. After some basic adjustments to transform phonetic to written Slovenian, the spelling is matched to written medieval Slovenian. That is when real problems start, because often, many different Latin letters are used for the same Slovenian sound, or one Latin letter is used for different Slovenian sounds. It was not uncommon that the same word was spelled in five different ways.

However, when over 100 of the most frequently used Voynich words, transliterated and transcribed with the Slovenian alphabet (in many of these words, EVA transliteration, Latin and Slovenian transcription overlap), can generate meaningful Slovenian words, with no or only minor adjustments according to Slovenian grammatical rules, it is more likely that the language is Slovenian as it was spoken in the 15th century than any other. The other clues that Voynich researchers have found in the VM also support this theory.
Correction to table in previous post:

[attachment=8080]