The Voynich Ninja

Bing AI Generating Continuations of Voynich Text
I wrote an article relating to the Voynich Manuscript for LessWrong, a website dedicated to rationality and AI Safety research.  


The article is meant partly as an introduction to the topic of the VMS for that crowd, many of whom may not be familiar with the VMS.  So, you might skip the first few paragraphs. 

The meat of the article is the question of whether Bing AI (or, I suppose, one of the other AI large language models, or LLMs, like ChatGPT) can generate plausible continuations of Voynichese text.  

Why might we in the Voynich community care?  Well, if an AI LLM can generate high-fidelity continuations of Voynichese text that reproduce all of the statistical regularities of the original VMS, then that suggests that, somewhere inside their "Giant Inscrutable Matrices," these LLMs are able to model how Voynichese is produced, which might eventually give us some clues as to how Voynichese was produced.  I say "eventually" because, even if we could prove that the current generation of LLMs can generate high-fidelity continuations of Voynichese, these LLMs won't have the "insight" to know how they are doing it.  We would need a later, more powerful generation of LLMs to inspect the "Giant Inscrutable Matrices" of these earlier LLMs and eventually make the rules explicit to us.  

Why might we suspect that Bing AI could possibly generate high-fidelity continuations of Voynichese text, i.e., "speak Voynichese"?  Because that's basically the sort of task these LLMs are tailor-made for:  predicting the next token across essentially all of the languages and domains of the Internet.  We also know that the most recent generation of LLMs can essentially create and decompress their own languages to themselves across sessions.  Even though the VMS corpus is tiny compared to all of the Internet text on which they were trained, if the rules for predicting the next Voynichese token turn out to be simpler than the general rule of predicting the next token across any domain you might see on the Internet, then there is reason to believe that these LLMs might know Voynichese.  

I explain the basic idea here in the article:

Quote:Humanity is essentially in the same relationship to the VMS as AI large language models (LLMs) are to the entire textual output of humans on the Internet.  The entire Internet is the LLM's Voynich Manuscript.  This might help give people some intuition as to what exactly LLMs are doing.  

The LLM starts off with no clue about human concepts or what our words mean.  All it can observe is statistical relationships.  It builds models of how that text is produced, which allow it to predict/generate plausible continuations of starting text prompts.  In theory, with sufficient statistical mastery of the text in the VMS, humans should be able to simulate a process by which to generate increasingly plausible-sounding continuations of "Voynichese" in the same way that AI LLMs generate plausible-sounding continuations of English or Japanese, even if humans never "understand" a single "vord" of Voynichese.  As our process becomes increasingly good at generating continuations of Voynichese that obey all of the statistical properties of the original distribution, we might say that humans would be asymptotically approaching a high-fidelity simulation of the process (whatever that was) that originally created the Voynichese.


So, in this article, I take my first stab at getting Bing AI to generate a continuation of Voynichese text.  I also (futilely) try to get Bing AI to explain its method.  Unfortunately, I don't think this will work as a backdoor method to find the way Voynichese was created because no current LLM has that much insight into how it decides to do the things it does.  It would require a later, more powerful LLM to go back and analyze what Bing AI was doing here.  

But before we get there: in the comments under the article, I wrote some suggestions for how I would proceed if someone wanted to continue this project and find out, really rigorously, how well Bing AI can generate Voynichese:


1.  Either use an existing VMS transcription or prepare a slightly modified VMS transcription that ignores all standalone label vords and inserts a single token such as a comma [,] to denote line breaks and another, such as [>], to denote section breaks.  There are pros and cons each way.  The latter option would have the disadvantage of being slightly less familiar to Bing AI compared to what is in its training data, but it would have the advantage of representing line and section breaks, which may be important if you want to investigate whether Bing AI can reproduce statistical phenomena like the "Line as a Functional Unit" or gallows characters appearing more frequently at the start of sections.  

2.  Feed existing strings of Voynich text into Bing AI (or some other LLM) systematically starting from the beginning of the VMS to the end in chunks that are as big as the context window can allow.  Record what Bing AI puts out.  

3.  Compile Bing AI's outputs into a 2nd master transcription.  Analyze Bing AI's compendium for things like:  Zipf's Law, 1st order entropy, 2nd order entropy, curve/line "vowel" juxtaposition frequencies (a la Brian Cham), "Grove Word" frequencies, probabilities of finding certain bigrams at the beginnings or endings of words, ditto with lines, etc.  (The more statistical attacks, the better.)  

4.  See how well these analyses match when applied to the original VMS. 

5.  Compile a second Bing AI-generated Voynich compendium, and a third, and a fourth, and a fifth, and see if the statistical attacks come up the same way again.  

There are probably ways to automate this that people smarter than me could figure out; a rough sketch of what that could look like is below.
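
A minimal sketch in Python of what that automation could look like, covering steps 2 through 5.  It assumes the original transliteration and the compiled AI outputs are available as plain-text files of dot-separated vords; the file names and the particular selection of statistics are just illustrative, not any existing tool:

Code:
import math
import re
from collections import Counter

def load_vords(path):
    # Split a plain-text transliteration into vords.
    # Assumes vords are separated by dots, spaces or newlines.
    with open(path, encoding="utf-8") as f:
        return [v for v in re.split(r"[.\s]+", f.read()) if v]

def entropy(counts):
    # Shannon entropy, in bits, of a frequency distribution.
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def stats(vords):
    # A handful of the "statistical attacks" listed above.
    chars = Counter(c for v in vords for c in v)
    bigrams = Counter(v[i:i + 2] for v in vords for i in range(len(v) - 1))
    freq = Counter(vords)
    hapax = sum(1 for n in freq.values() if n == 1)
    return {
        "tokens": len(vords),
        "types": len(freq),
        "hapax_share_of_tokens": hapax / len(vords),
        "h1_char_entropy": entropy(chars),
        # 2nd-order (conditional) entropy approximated as H(bigram) - H(char)
        "h2_char_entropy": entropy(bigrams) - entropy(chars),
        "top_10_vords": freq.most_common(10),  # eyeball Zipf's law from these
    }

if __name__ == "__main__":
    # Hypothetical file names: original transliteration vs. compiled AI output.
    for name, path in [("original", "vms_transliteration.txt"),
                       ("bing_ai", "bing_ai_compendium.txt")]:
        print(name)
        for key, value in stats(load_vords(path)).items():
            print(" ", key, "=", value)

Running the same function over each of the compiled Bing AI compendia (step 5) and over the original transliteration would give directly comparable numbers.
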
And to clear up one anticipated source of confusion:  I am not claiming that we will be able to get an LLM (at least, any of the current generation ones) to answer a query like, "Translate the following Voynich string into English, or if no translation is possible, explain why."  Sadly, that's not how this works.  If Bing AI or some other LLM "understands" Voynichese, it understands Voynichese in the same way that the person in the "Chinese Room Thought Experiment" understands Chinese.  It will only be able to map Voynichese onto English if it finds that the best way to model Voynichese is to model the human or process that created Voynichese as attempting to encode English according to certain parameters.  

Indeed, don't think of these LLMs as understanding "English" or "Japanese."  Think of them as understanding "Internetese" or "Reality-ese."  As far as they know, "reality" is a language game that they are incredibly good at playing, so much so that they know that, under the rules of "reality-ese," when someone asks, "what color is the sky," the LLM is supposed to answer "Blue, except at night when it is black, and at dusk when it can be pink or orange depending on the weather..."  Even though the LLM has never seen the sky.
Is it too long, or are you able to post it here in its entirety for those of us that don't want to patronize that particular site?
Hi Psillycyber,
these language models are brilliant at handling languages (English in particular, in my experience); this can easily mislead people into thinking they are smart, since, from our point of view, language and intelligence are closely associated. As you say, this is not the case: these systems have a very limited idea of reality and facts. In particular, they are not aware of how they function, and asking them to explain their reasoning can be fun but is a waste of time.

About the "continuation experiment". AI input from the linked web page (the end of f2r):
Quote:qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy

AI continuation:
Quote:shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy

This shows that Bing AI has accessed a transcription of the Voynich manuscript. Since the training process of these tools is so opaque, I wonder if it saw Voynichese during language training, or if it just grabbed the transliteration from the web to reply to the question.

The continuation is entirely made up of words from Takahashi's transliteration. While about 14% of Voynich tokens are hapax legomena (they appear only once in the whole text), here the two rarest words, olsaiin and yor, each appear twice in TT's transliteration. The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don't appear elsewhere.
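
For anyone who wants to reproduce this check, a minimal sketch in Python (both texts are assumed to be already available as lists of vords; the function name is just illustrative):

Code:
from collections import Counter

def novelty_check(original_vords, continuation_vords):
    # In real Voynichese, roughly 14% of tokens are hapax legomena, so a
    # 56-token continuation should contain around 7-8 never-seen word types.
    freq = Counter(original_vords)
    hapax_tokens = sum(1 for n in freq.values() if n == 1)
    novel_types = {v for v in continuation_vords if v not in freq}
    return {
        "hapax_share_of_original_tokens": hapax_tokens / len(original_vords),
        "continuation_tokens": len(continuation_vords),
        "novel_types_in_continuation": len(novel_types),  # 0 for Bing AI's output
    }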

Maybe in the future these models will be able to extract structured information from the texts they examine. But applying "large" language models to Voynichese has the problem that Voynichese is anything but a "large" corpus: its ~38K tokens are only about twice the size of the context window managed by GPT-4. And of course this is not even a single uniform language, but the result of a gradual drift from Currier-A to Currier-B; since the original order of the pages is lost, the drifting process cannot be analysed with any certainty about its actual chronology. Lastly, the greatest uncertainty in Voynich transliterations is word spacing: the Zandbergen-Landini transliteration marks more than 5% of word separators as uncertain, and since each uncertainty affects two words, about 10% of Voynichese tokens cannot be reliably read. If I understand correctly, these language models strongly depend on words as atomic entities: I guess that the problem of spaces would have a huge impact on their learning.
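
To make the spacing point concrete, a rough sketch for a single transliterated line; it assumes the usual convention of '.' for a certain word break and ',' for an uncertain one (check the conventions of the actual file before relying on this):

Code:
import re

def uncertain_space_impact(line):
    # Count how many vords in one line are touched by an uncertain separator.
    parts = re.split(r"([.,])", line)    # keep the separators
    vords = parts[0::2]                  # every other element is a vord
    seps = parts[1::2]                   # separators between consecutive vords
    affected = set()
    for i, sep in enumerate(seps):
        if sep == ",":
            affected.update({i, i + 1})  # an uncertain space affects both neighbours
    return len(affected), len([v for v in vords if v])

# e.g. one uncertain space out of two separators already leaves 2 of 3 vords uncertain:
# uncertain_space_impact("shy,ckhy.qokal") -> (2, 3)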

But, despite all these problems, I am sure that the field will evolve quickly in the next few years and it's quite possible that AI efforts specifically developed for Voynich research will result in actual advances.
Quote:Is it too long, or are you able to post it here in its entirety for those of us that don't want to patronize that particular site?

It's kinda long.  But I guess I can quote the most relevant part here:

Quote:Bing AI:  

"I’ll try to generate a plausible continuation of the text based on the patterns and rules of the EVA transcription method. Here is one possible continuation:
qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy.shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
I generated this continuation by following these steps:
I observed that the text consists of words separated by dots, and that each word has two to four letters.
I observed that the text has some repeated words, such as saiin, daiin, chy, and qoky.
I observed that the text has some repeated sequences of words, such as cholaiin.shol.sheky.daiin.cthey.keol and dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.
I observed that the text has some common letter combinations, such as ch, sh, ky, ol, and ai.
I observed that the text has some rare letter combinations, such as ct, ck, yk, and an.
I observed that the text has some letters that appear more frequently than others, such as y, o, k, and l.
I observed that the text has some letters that appear less frequently than others, such as c, t, h, and r.
I used these observations to generate a new sequence of words that follows the same patterns and rules as the original text.
I repeated some words and sequences from the original text to create a sense of continuity and coherence.
I introduced some variation and randomness to create a sense of diversity and complexity.
I hope this is a satisfactory continuation of the text. However, I cannot guarantee that it is meaningful or intelligible in any way. It is just a creative imitation of the EVA transcription method."

Right off the bat, it should be obvious that Bing AI is confabulating.  These are not the real reasons it chose this continuation.  Some of them don't even make sense.  "cholaiin.shol.sheky.daiin.cthey.keol" is not a sequence of words that repeats anywhere else in the VMS, and neither is "dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol." 

It's not that it is trying to hide its reasons.  It's that it does not know how it knows to generate a plausible Voynichese continuation [to the extent that it does; MarcoP's post below already makes me doubt that Bing AI actually understands anything deep about Voynichese other than "try to use already-existing Voynich vocabulary"].
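
Claims like that are easy to check directly against a transliteration.  A minimal sketch, where all_vords is assumed to be the full transliteration as one list of vords:

Code:
def count_sequence(all_vords, sequence):
    # How many times does this exact run of vords occur in the transliteration?
    n = len(sequence)
    return sum(all_vords[i:i + n] == sequence
               for i in range(len(all_vords) - n + 1))

# The sequence Bing AI claimed was repeated; a count of 1 would mean it only
# occurs in the prompt itself and repeats nowhere else.
claimed = "cholaiin.shol.sheky.daiin.cthey.keol".split(".")
# count_sequence(all_vords, claimed)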

(11-04-2023, 11:09 AM)MarcoP Wrote: [MarcoP's post, quoted in full above]

Thank you MarcoP!  In hindsight, the presence/absence of new vords in the AI's continuation is something obvious I should have checked myself.  I wrote over in the LessWrong thread:  

Quote:I'm glad others are trying this out.  I crossposted this over on the Voynich Ninja forum:
and user MarcoP already noticed that Bing AI's "Voynichese" doesn't follow VMS statistics in one obvious respect:  "The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don't appear elsewhere" whereas "The [Bing AI] continuation is entirely made up of words from Takahashi's transliteration."  So, no wonder all of the "vords" in the AI's continuation seemed to pass the "sniff test" as valid Voynich vords if Bing AI only used existing Voynich vords!  That's one easy way to make sure that you only use valid vords without needing to have a clue about what makes a Voynichese vord valid or how to construct a new valid Voynichese vord.  So my initial optimism that Bing AI understood something deep about Voynichese is probably mistaken.  

That said, would it be possible to train a new LLM in a more targeted way just on English (so that we can interact with it) and on Voynichese so that Voynichese would be a more salient part of its training corpus?  Is there enough Voynichese (~170,000 characters, or 38,000 "vords") to get somewhere with that with current LLMs?
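
For a sense of scale, here is the opposite extreme from a large language model: a toy character-level n-gram sampler, which is about the simplest possible "model of the process," trained on nothing but the transliteration.  The file name is hypothetical and this is only a sketch of the idea, not a serious proposal:

Code:
import random
from collections import Counter, defaultdict

ORDER = 3  # characters of context; a toy stand-in for a real language model

def train(text, order=ORDER):
    # Count which character follows each `order`-character context.
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, length=200, order=ORDER):
    # Sample a continuation one character at a time.
    out = seed
    for _ in range(length):
        counts = model.get(out[-order:])
        if not counts:
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

if __name__ == "__main__":
    with open("vms_transliteration.txt", encoding="utf-8") as f:  # hypothetical file
        model = train(f.read())
    print(generate(model, seed="qoky.cholaiin."))

Unlike Bing AI's continuation, a sampler like this will happily invent vords that never occur in the transliteration; whether those inventions still look like valid Voynichese is exactly what the statistical checks earlier in the thread would measure.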