Bing AI Generating Continuations of Voynich Text - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Bing AI Generating Continuations of Voynich Text (/thread-3977.html)
Bing AI Generating Continuations of Voynich Text - Psillycyber - 10-04-2023

I wrote an article relating to the Voynich Manuscript for LessWrong, a website dedicated to rationality and AI-safety research. The article is meant partly as an introduction to the topic of the VMS for that crowd, many of whom may not be familiar with it, so you might skip the first few paragraphs.

The meat of the article is the question of whether Bing AI (or, I suppose, one of the other AI large language models, or LLMs, like ChatGPT) can generate plausible continuations of Voynichese text. Why might we in the Voynich community care? Well, if an LLM can generate high-fidelity continuations of Voynichese text that reproduce all of the statistical regularities of the original VMS, then that suggests that, somewhere inside the "Giant Inscrutable Matrices" of these LLMs, they are able to model how Voynichese is produced, which might eventually give us some clues as to how Voynichese was produced. I say "eventually" because, even if we could prove that the current generation of LLMs can generate high-fidelity continuations of Voynichese, these models won't have the "insight" to know how they are doing it. We would need a later, more powerful generation of LLMs to inspect the "Giant Inscrutable Matrices" of these earlier LLMs and eventually make the rules explicit to us.

Why might we suspect that Bing AI could possibly generate high-fidelity continuations of Voynichese text, i.e., "speak Voynichese"? Because that is basically the sort of task these LLMs are tailor-made for: predicting the next token across essentially all of the languages and domains of the Internet. We also know that the most recent generation of LLMs can essentially create and decompress their own private languages across sessions. Even though the VMS corpus is small compared to the entire body of Internet text on which they were trained, if the rules for predicting the next Voynichese token turn out to be simpler than the rules for predicting the next token in any domain on the Internet, then there is reason to believe that these LLMs might know Voynichese. I explain the basic idea in the article:

Quote: Humanity is essentially in the same relationship to the VMS as AI large language models (LLMs) are to the entire textual output of humans on the Internet. The entire Internet is the LLM's Voynich Manuscript.

This might help give people some intuition as to what exactly LLMs are doing.

So, in this article, I take my first stab at getting Bing AI to generate a continuation of Voynichese text. I also (futilely) try to get Bing AI to explain its method. Unfortunately, I don't think this will work as a backdoor method to find the way Voynichese was created, because no current LLM has that much insight into how it decides to do the things it does. It would require a later, more powerful LLM to go back and analyze what Bing AI was doing here.

But before we get there: in the comments under the article, I wrote some suggestions for how I would proceed if someone wanted to continue this project and really rigorously find out how well Bing AI can generate Voynichese:

1. Either use an existing VMS transcription or prepare a slightly modified VMS transcription that ignores all standalone label vords and inserts a single token such as a comma [,] to denote line breaks and a [>] to denote section breaks. There are pros and cons each way. The latter option would have the disadvantage of being slightly less familiar to Bing AI compared to what is in its training data, but it would have the advantage of representing line and section breaks, which may be important if you want to investigate whether Bing AI can reproduce statistical phenomena like the "Line as a Functional Unit" or gallows characters appearing more frequently at the start of sections.

2. Feed existing strings of Voynich text into Bing AI (or some other LLM) systematically, from the beginning of the VMS to the end, in chunks as big as the context window allows. Record what Bing AI puts out.

3. Compile Bing AI's outputs into a second master transcription. Analyze this compendium for things like Zipf's Law, 1st-order entropy, 2nd-order entropy, curve/line "vowel" juxtaposition frequencies (a la Brian Cham), "Grove Word" frequencies, probabilities of finding certain bigrams at the beginnings or endings of words, ditto with lines, etc. (The more statistical attacks, the better.)

4. See how well these analyses match when applied to the original VMS.

5. Compile a second Bing AI-generated Voynich compendium, and a third, and a fourth, and a fifth, and see if the statistical attacks come out the same way again.

There are probably ways to automate this that people smarter than me could figure out; a rough sketch of what steps 3-4 might look like follows below.
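A minimal sketch of automating steps 3-4, assuming a plain-text EVA transliteration with dots as word separators; the file names and the particular choice of statistics are illustrative assumptions, not anything prescribed in the thread:

```python
# Sketch only: compares a handful of the statistics from steps 3-4 between
# the original transliteration and a compiled AI continuation.
# File names below are hypothetical placeholders.
import math
from collections import Counter

def tokens_from_file(path):
    """Split an EVA transliteration into word tokens (dots and newlines = separators)."""
    with open(path, encoding="utf-8") as f:
        return [t for t in f.read().replace("\n", ".").split(".") if t]

def hapax_fraction(tokens):
    """Fraction of tokens whose word-type occurs exactly once (hapax legomena)."""
    counts = Counter(tokens)
    return sum(1 for t in tokens if counts[t] == 1) / len(tokens)

def char_entropy(tokens, order=1):
    """1st-order entropy, or 2nd-order conditional entropy H(X2|X1), in bits."""
    chars = "".join(tokens)
    def h(counts, n):
        return -sum(c / n * math.log2(c / n) for c in counts.values())
    if order == 1:
        return h(Counter(chars), len(chars))
    n = len(chars) - 1
    # H(X2 | X1) = H(X1, X2) - H(X1)
    return h(Counter(zip(chars, chars[1:])), n) - h(Counter(chars[:-1]), n)

def zipf_profile(tokens, top=10):
    """Rank/frequency pairs for a quick Zipf's-law comparison."""
    return Counter(tokens).most_common(top)

if __name__ == "__main__":
    for name, path in [("original", "voynich_eva.txt"),
                       ("AI compendium", "ai_compendium.txt")]:
        toks = tokens_from_file(path)
        print(f"{name}: {len(toks)} tokens, "
              f"hapax {hapax_fraction(toks):.1%}, "
              f"h1 {char_entropy(toks, 1):.2f} bits, "
              f"h2 {char_entropy(toks, 2):.2f} bits")
        print("  top words:", zipf_profile(toks, 5))
```

Matching numbers across several independently generated compendia (step 5) would be much stronger evidence than a single run.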
RE: Bing AI Generating Continuations of Voynich Text - Psillycyber - 10-04-2023

And to clear up one anticipated source of confusion: I am not claiming that we will be able to get an LLM (at least, any of the current-generation ones) to answer a query like, "Translate the following Voynich string into English, or if no translation is possible, explain why." Sadly, that's not how this works. If Bing AI or some other LLM "understands" Voynichese, it understands Voynichese in the same way that the person in the "Chinese Room" thought experiment understands Chinese. It will only be able to map Voynichese onto English if it finds that the best way to model Voynichese is to model the human or process that created Voynichese as attempting to encode English according to certain parameters.

Indeed, don't think of these LLMs as understanding "English" or "Japanese." Think of them as understanding "Internetese" or "Reality-ese." As far as they know, "reality" is a language game that they are incredibly good at playing, so much so that they know that, under the rules of "Reality-ese," when someone asks "what color is the sky," the LLM is supposed to answer "Blue, except at night when it is black, and at dusk when it can be pink or orange depending on the weather..." even though the LLM has never seen the sky.

RE: Bing AI Generating Continuations of Voynich Text - tavie - 11-04-2023

Is it too long, or are you able to post it here in its entirety for those of us who don't want to patronize that particular site?

RE: Bing AI Generating Continuations of Voynich Text - MarcoP - 11-04-2023

Hi Psillycyber,
these language models are brilliant at handling languages (English in particular, in my experience). This can easily mislead people into thinking they are smart, since, from our point of view, language and intelligence are strictly associated. As you say, this is not the case: these systems have a very limited idea of reality and facts. In particular, they are not aware of how they function, and asking them to explain their reasoning can be fun but is a waste of time.

About the "continuation experiment". AI input from the linked web page (the end of f2r):

Quote: qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy

AI continuation:

Quote: shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy

This shows that Bing AI has accessed a transcription of the Voynich manuscript. Since the training process of these tools is so opaque, I wonder whether it saw Voynichese during language training, or whether it just grabbed the transliteration from the web to reply to the question.

The continuation is entirely made up of words from Takahashi's transliteration. While about 14% of Voynich tokens are hapax legomena (they appear only once in the whole text), here the two rarest words (olsaiin, yor) appear twice in Takahashi's transliteration. The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that do not appear elsewhere. (A quick way to run this check is sketched below.)
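A minimal sketch of this novel-word check, assuming the Takahashi transliteration and the AI continuation are available as plain dot-separated EVA text (the file names are hypothetical):

```python
# Sketch only: counts word-types in the AI continuation that never occur in
# the source transliteration, and what the corpus hapax rate would predict.
from collections import Counter

def words(path):
    with open(path, encoding="utf-8") as f:
        return [w for w in f.read().replace("\n", ".").split(".") if w]

corpus = words("takahashi_eva.txt")          # hypothetical file name
continuation = words("ai_continuation.txt")  # hypothetical file name

known = set(corpus)
novel = [w for w in continuation if w not in known]
# Each hapax type contributes exactly one token, so this is the hapax token rate.
hapax_rate = sum(1 for c in Counter(corpus).values() if c == 1) / len(corpus)

print(f"{len(continuation)} tokens, {len(novel)} never seen in the corpus")
print(f"expected at the ~{hapax_rate:.0%} hapax rate: "
      f"{hapax_rate * len(continuation):.1f} unseen tokens")
```

A continuation that is statistically Voynichese-like should produce a novel-word count near the predicted figure, not zero.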
Maybe in the future these models will be able to extract structured information from the texts they examine. But applying "large" language models to Voynichese has the problem that Voynichese is anything but a "large" corpus: its ~38K tokens are only about twice as large as the context window managed by GPT-4. And of course this is not even a single uniform language, but the result of a gradual drift from Currier A to Currier B; the original order of the pages is lost, so the drifting process cannot be analysed with any certainty about its actual chronology. Lastly, the greatest uncertainty in Voynich transliterations is word spacing: the Zandbergen-Landini transliteration marks more than 5% of word separators as uncertain, and since each uncertainty affects the two words on either side of it, about 10% of Voynichese tokens cannot be reliably read. If I understand correctly, these language models strongly depend on words as atomic entities: I guess that the problem of spaces would have a huge impact on their learning.

But, despite all these problems, I am sure that the field will evolve quickly in the next few years, and it's quite possible that AI efforts specifically developed for Voynich research will result in actual advances.

RE: Bing AI Generating Continuations of Voynich Text - Psillycyber - 11-04-2023

Quote: Is it too long, or are you able to post it here in its entirety for those of us who don't want to patronize that particular site?

It's kinda long. But I guess I can quote the most relevant part here:

Quote: Bing AI:

(11-04-2023, 11:09 AM) MarcoP Wrote: Hi Psillycyber,

Thank you MarcoP! In hindsight, the presence/absence of new vords in the AI's continuation is something obvious I should have checked myself. I wrote over in the LessWrong thread:

Quote: I'm glad others are trying this out. I crossposted this over on the Voynich Ninja forum: