The Voynich Ninja

Voynich2Vec: Using FastText Word Embeddings for Voynich Decipherment
There is an unpublished paper about the VMS by William Merrill and Eli Baum, available on William Merrill's homepage.

The authors used word vectors to visualize Voynich words: "The idea behind this approach is to encode each word as a high-dimensional vector of real numbers such that similar words have similar vector representations."

The graphs illustrate that morphologically similar words do receive similar word vectors (i.e. they appear in similar contexts): "A variety of presumptive affixes are very apparent: [qol] vs [ol] vs [sol], for example. We can also see that certain letters are closely related and perhaps interchangeable with others. It is important to remember that these vectors were computed based on the word’s context; the fact that [qolchedy] aligns with [olchedy] suggests that they appear in similar syntactic (or even semantic) positions and not just that they have similar spellings" (Merrill & Baum).
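For anyone who wants to try this at home, embeddings of this kind can be trained with the fastText Python bindings. A minimal sketch, assuming a whitespace-tokenized transliteration file (the file name is hypothetical; the authors' repository has its own scripts):

[code]
import fasttext

# Train unsupervised skip-gram embeddings on a transliteration of the VMS,
# one transliterated word per whitespace-separated token.
model = fasttext.train_unsupervised(
    "voynich_transliteration.txt",  # hypothetical file name
    model="skipgram",
    dim=100,      # dimensionality of the word vectors
    minCount=2,   # skip words that occur only once
)

# Words used in similar contexts get similar vectors, so the nearest
# neighbours of a word often share affixes, e.g. qolchedy ~ olchedy.
print(model.get_nearest_neighbors("olchedy", k=10))
[/code]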

The authors conclude: "While our morphological and topical analyses picked up on structural properties of suffixes and important words in the text, this does not necessarily mean that the text is written in a natural language. Formal language theory tells us that many sequences of structured text can be described by a grammar. Therefore, it is always possible that the properties our embeddings associate with specific suffixes or folia reflect non-natural-language structures or gibberish" (Merrill & Baum).
I would have liked to have seen data similar to Table 1 from a text whose meaning we know. They say that the program does more than just link words that are spelled similarly, but there’s nothing in the data presented that shows me anything more than that.

But assuming that the program is more than a fancy way to match similar-looking words, what do these results mean for the Voynich? That every single pair of linked words looks so similar to each other? It makes me suspicious that either this program doesn’t really find “meaning” or, on the flip side, there is no meaning to be found in the manuscript beyond some sort of visual patterning. Perhaps a patterning made to look like language but that isn’t. I really, really don’t want such a result, but I’m having trouble seeing another way to interpret findings like this. If someone can provide a different way to see it, I’d love to hear their thoughts.
(10-07-2020, 12:26 AM)MichelleL11 Wrote: I would have liked to have seen data similar to Table 1 from a text whose meaning we know.

The codebase, together with more images, is available on GitHub.

There are also graphs for texts with known meaning (the analyzed texts can be found in the repository).


(10-07-2020, 12:26 AM)MichelleL11 Wrote: They say that the program does more than just link words that are spelled similarly, but there’s nothing in the data presented that shows me anything more than that.

Other researchers describe similar results:

"As a general characteristic for all the clusters shown, the words that are more strongly connected have an evident morphological similarity." (You are not allowed to view links. Register or Login to view.)

"The closer two words are (with respect to their edit distance), the more likely these words also can be found written in close vicinity (i.e. on the same page)" (You are not allowed to view links. Register or Login to view.).


(10-07-2020, 12:26 AM)MichelleL11 Wrote: But assuming that the program is more than a fancy way to match similar-looking words, what do these results mean for the Voynich? That every single pair of linked words looks so similar to each other?

There are also visualizations for word clusters and for a whole network of words.
There seems to be no text reference to Figure 8, 'Document vectors'.
The first 3 paragraphs of Section 4.2, 'Topic Modeling', seem to be the most relevant. Would that be correct?
(10-07-2020, 02:25 AM)RobGea Wrote: There seems to be no text reference to Figure 8, 'Document vectors'.

Indeed, they describe Figure 8 in the text but don't give the reference: "we see similar important words in each section (where 'importance' is as measured by tf-idf). While we would be remiss to call this a topic in the semantic sense, we nonetheless observe a kind of morphosyntactic 'topic' (for example, in the bath section, qo- words have very high tf-idf scores)" (Merrill & Baum, p. 7).
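Tf-idf scores of this kind are easy to recompute with standard tools. A minimal sketch with placeholder section texts (in practice one would concatenate the transliterated words of each section):

[code]
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per manuscript section (placeholder strings).
sections = {
    "herbal": "daiin chol chor daiin sho",
    "bath":   "qokeedy qokedy qokain shedy qol",
}

vec = TfidfVectorizer(token_pattern=r"\S+")
tfidf = vec.fit_transform(sections.values())
terms = vec.get_feature_names_out()

# Top-scoring words per section; per the paper, qo- words should
# rank highly in the bath section.
for name, row in zip(sections, tfidf.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:3]
    print(name, [t for _, t in top])
[/code]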


(10-07-2020, 02:25 AM)RobGea Wrote: The first 3 paragraphs of Section 4.2, 'Topic Modeling', seem to be the most relevant. Would that be correct?

Topic modeling (or context-dependent self-similarity) is indeed relevant for the Voynich manuscript (see Timm & Schinner 2020, p. 2f). For instance, Currier described distinct languages A and B. Gabriel Landini described "unevenness of the local character density along the text" (Landini 2001, p. 292). "It is possible to distinguish Currier A and B based on frequency counts of tokens containing the sequence <ed>. ... if <chedy> is used more frequently, this also increases the frequency of similar words, like <shedy> or <qokeedy> .... At the same time, also words using the prefix <qok-> are becoming more and more frequent, whereas words typical for Currier A like <chol> and <chor> vanish gradually" (Timm & Schinner 2020, p. 6).
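The <ed> criterion in particular is trivial to compute per page. A minimal sketch, with a hypothetical folio-to-tokens mapping (in practice one would parse a transliteration file):

[code]
def ed_share(tokens):
    """Fraction of a page's tokens that contain the EVA sequence 'ed';
    per Timm & Schinner, this separates Currier A from Currier B."""
    return sum("ed" in t for t in tokens) / len(tokens) if tokens else 0.0

# Hypothetical folio -> EVA tokens mapping.
pages = {
    "f1r":  ["daiin", "chol", "chor", "sho"],        # Currier A-like
    "f75r": ["qokeedy", "shedy", "chedy", "qokain"], # Currier B-like
}
for folio, toks in pages.items():
    print(folio, round(ed_share(toks), 2))  # A pages near 0, B pages high
[/code]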

There are a number of related observations. One interesting outcome is that the MUSE "algorithm was largely unable to align our Voynich vectors with those of many other works in a number of languages. Oddly, we often found that MUSE aligned many Voynich words to the same word in another language ..." (Merrill & Baum, p. 7). Since words in the VMS group into similarity clusters, the algorithm was effectively trying to map whole clusters of Voynich words onto single words in natural languages: "When we looked closely at many of the alignments, it seemed like similar Voynich words were being mapped to similar (or the same) target language word" (Merrill & Baum, p. 8).
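This many-to-one collapse is easy to quantify once two sets of vectors live in a shared space. A minimal sketch with synthetic data (not MUSE's actual output format): tightly clustered source vectors produce a high collision rate, spread-out ones don't.

[code]
import numpy as np

def collision_rate(src, tgt):
    """Map each source word to its nearest target word by cosine
    similarity and measure how often distinct source words collapse
    onto the same target word."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return 1 - len(set(nearest.tolist())) / len(nearest)

rng = np.random.default_rng(0)
# 10 tight clusters of 10 near-duplicate "Voynich-like" vectors each...
src = np.repeat(rng.normal(size=(10, 50)), 10, axis=0)
src += 0.01 * rng.normal(size=src.shape)
tgt = rng.normal(size=(100, 50))  # ...versus a spread-out target vocabulary
print(collision_rate(src, tgt))   # high: many-to-one mappings
[/code]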
Many thanks to Torsten for posting this!
I was wondering why Merrill and Baum seem to be unaware of the Zandbergen-Landini transliteration; then I noticed on GitHub that this work dates to 2018: at the time, I guess, Takahashi's was still seen as the best option.
I tried running their code, but their Python scripts apparently reference a "voynich.bin" file that is not in the GitHub repository, or maybe I missed something. Anyway, I think this is a very interesting line of research. As the authors say, the ms is too short to make the most ambitious tasks realistic, but maybe similar software tools can help in researching possible mappings between the different Voynich dialects.

This 2016 paper is referenced in their article. I guess the other references are also worth reading. BTW, it's nice to see that these authors find Julian Bunn's work so relevant.
Will Merrill writes on his homepage that the paper goes back to a Voynich project at Yale University. He probably attended Claire Bowern's Voynich course in 2018. Claire Bowern wrote about her course: "We are also doing a number of experiments to test the likelihood of various Voynich theories."
(10-07-2020, 05:38 PM)MarcoP Wrote: I tried running their code, but their Python scripts apparently reference a "voynich.bin" file that is not in the GitHub repository, or maybe I missed something.
I'm not a wizard at this kind of thing, but the voynich.bin file seems to have to do with the fastText library. Quote from the fastText docs:

"Saving and loading a model object
You can save your trained model object by calling the function save_model.
model.save_model("model_filename.bin")"

Maybe the get_vectors.py script creates the model?
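If so, recreating and reloading it should just be a matter of the following (paths are my guess based on the repository layout):

[code]
import fasttext

# Train a model and save it as a .bin file, as in the fastText docs.
model = fasttext.train_unsupervised("texts/voynich.txt", model="skipgram")
model.save_model("models/voynich.bin")

# Later runs can restore the full model object from that file.
model = fasttext.load_model("models/voynich.bin")
print(model.get_word_vector("daiin")[:5])
[/code]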
(10-07-2020, 08:31 PM)RobGea Wrote: Maybe the get_vectors.py script creates the model?

Thank you. I looked at that script also, and it seems to create files with a "-" in the name. It could have created models/voynich-2.vec and models/voynich-3.vec, but I don't believe it can create models/voynich.bin. I am not a software wizard either: if I decide to look more into these methods, I will search for something with more documentation.
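For what it's worth, the .vec files that are in the repository are plain-text word2vec format and can be loaded without fastText at all, e.g. with gensim (my suggestion, not something the repository does; the query word must be in the model's vocabulary):

[code]
from gensim.models import KeyedVectors

# .vec files have a header line followed by one word + vector per line.
wv = KeyedVectors.load_word2vec_format("models/voynich-2.vec")
print(wv.most_similar("olchedy", topn=5))
[/code]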



(10-07-2020, 12:26 AM)MichelleL11 Wrote: I would have liked to have seen data similar to Table 1 from a text whose meaning we know.

The following table is based on the models/james1.vec file in the repository linked by Torsten. The vectors are based on texts/james1.txt, which should be The Political Works of James I by Charles Howard McIlwain. The text is much longer than the VMS (243,677 words).


I simply computed the cosine distance between all pairs of vectors and picked the 100 lowest values.
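In case anyone wants to reproduce this, a sketch of the computation (loading as above; I normalize the vectors so the dot product equals the cosine similarity):

[code]
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("models/james1.vec")
vecs = wv.vectors / np.linalg.norm(wv.vectors, axis=1, keepdims=True)
sims = vecs @ vecs.T  # cosine similarity for every pair of words

# Take the 100 most similar pairs, counting each unordered pair once.
rows, cols = np.triu_indices_from(sims, k=1)
best = np.argsort(sims[rows, cols])[::-1][:100]
for idx in best:
    i, j = rows[idx], cols[idx]
    print(wv.index_to_key[i], wv.index_to_key[j], 1 - sims[i, j])
[/code]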

[Table: the 100 closest word pairs for models/james1.vec]

Though the main text is in English, it appears to contain some Latin. Most of the pairs pick up the minority "languages" in the text: Arabic numerals and Latin.

The following table corresponds to the Spanish Picatrix (models/picatrix.vec), a text which is also much longer than the VMS (136,127 words).

[Table: the 100 closest word pairs for models/picatrix.vec]

It seems to me that the results are dominated by similar-looking pairs where the relationship appears to be morphological/grammatical rather than semantic ("magnesio bórax" near the bottom could be an interesting exception).
Marco,

Thanks for putting these lists together. I'm not sure they give a full understanding of what this program does, but they put the results in context. The program certainly seems better at matching similar-looking words in the Voynich than in other languages, but I imagine there could be many reasons for that, not least the incredible rigidity of Voynich word structure. This program seems really good at targeting those kinds of patterns.