The Voynich Ninja
Vectoring Words (Word Embeddings) - Computerphile - Printable Version




Vectoring Words (Word Embeddings) - Computerphile - radapox - 24-10-2019

Just stumbled upon this Computerphile clip. I don't exactly follow the process, but it's a method of mapping word similarities (in the "semantic" sense), based purely on their occurrence within comparable contexts in a given corpus, not on some understanding of their actual "meaning". The results are surprisingly accurate. At the risk of sounding like a total noob here, could an approach like this be at all useful for the VM text? If so, it's probably already been done. In which case I'd love to learn more about it.
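If I understood the gist, the whole training step boils down to something like this in Python (a very rough sketch using the gensim library's Word2Vec, assuming gensim 4.x is installed; the toy "sentences" below are just placeholders for a real tokenised corpus such as an EVA transliteration):

Code:
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens. In a real experiment this
# would be a tokenised text (e.g. an EVA transliteration of the VM).
sentences = [
    ["daiin", "chedy", "qokeedy", "shedy"],
    ["daiin", "chedy", "okaiin", "shedy"],
    ["qokeedy", "okaiin", "chedy", "daiin"],
]

# Train a small skip-gram model: words that occur in similar contexts
# end up with similar vectors.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Ask for the nearest neighbours of a word in the learned vector space.
print(model.wv.most_similar("daiin", topn=3))

On a corpus of a few hundred thousand tokens or more, the nearest neighbours start to look "semantic"; on something the size of the VM the statistics are obviously much thinner.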




RE: Vectoring Words (Word Embeddings) - Computerphile - MarcoP - 24-10-2019

A while ago I played around with one or two implementations of similar methods. I had no success. Even if Voynichese should one day prove to be meaningful, there are good reasons why these methods do not work:
  • These methods require huge data-sets: likely, orders of magnitude larger than the VMS.
  • Voynichese has more word-types per N tokens than English (see the linked thread). These methods work by assuming that words X and Y are similar if they both tend to appear near words Z and K. But in Voynichese you will likely have Z1 and Z2, K1 and K2, etc. If two words are not identical, they are treated as entirely different words, and the algorithms cannot exploit their similarity (see the sketch after this list).
  • The two "languages" Currier A and Currier B most likely are different ways to write / encipher a single language. It is possible that someday someone will manage to map Currier A into Currier B or vice-versa (I think the importance of this task has been mentioned by Nick Pelling not long ago). That would increase the size of our database and could possibly also help us understand that Z1 and Z2 really are two instances of Z. Until then, Currier A and B result in further fragmentation of the already small text.
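To illustrate the word-type point, a quick way to compare vocabulary growth is simply to count distinct word types in an equal number of running tokens (a rough sketch; the file names are placeholders for a whitespace-tokenised VM transliteration and an English comparison text):

Code:
def types_in_first_n(tokens, n=10000):
    # Number of distinct word types among the first n running tokens.
    return len(set(tokens[:n]))

# Placeholder file names: one whitespace-separated token stream per file.
with open("vms_transliteration.txt") as f:
    vms_tokens = f.read().split()
with open("english_sample.txt") as f:
    english_tokens = f.read().split()

for label, tokens in [("Voynichese", vms_tokens), ("English", english_tokens)]:
    print(label, types_in_first_n(tokens), "types in the first 10,000 tokens")

The more types you have for the same number of tokens, the fewer times each type is seen in context, and co-occurrence statistics are exactly what these algorithms feed on.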
More recently, I superficially explored something that is vaguely similar but much less ambitious: part-of-speech (POS) tagging (see the linked thread). I think there is more hope of making progress here, but much more time and skill would need to be applied to the problem. An advantage of POS tagging is that words that look similar are more likely to be related: the similarity between "running" and "petting" is easier to detect than that between "dog" and "pet". Also, very frequent "function words" are meaningful for POS tagging, but almost useless for "semantic" algorithms like the one described in the video.
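The "running"/"petting" point can be seen with a crude surface measure such as character-bigram overlap (a toy sketch, nothing more):

Code:
def char_bigrams(word):
    # Set of overlapping character bigrams in a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    # Jaccard similarity between two sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

for w1, w2 in [("running", "petting"), ("dog", "pet")]:
    print(w1, w2, round(jaccard(char_bigrams(w1), char_bigrams(w2)), 2))

The morphologically related pair shares its suffix bigrams ("in", "ng"), while "dog" and "pet" share nothing at the surface level; that second kind of similarity is exactly what needs a large corpus of contexts to recover.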


RE: Vectoring Words (Word Embeddings) - Computerphile - radapox - 25-10-2019

Thanks for your reply, MarcoP! Ah, yes, those are definitely complicating factors, to the point of rendering this approach useless for the VM. Pity.