Psillycyber > 10-04-2023, 11:04 PM
Quote:Humanity is essentially in the same relationship to the VMS as AI large language models (LLMs) are to the entire textual output of humans on the Internet. The entire Internet is the LLM's Voynich Manuscript. This might help give people some intuition as to what exactly LLMs are doing.
The LLM starts off with no clue about human concepts or what our words mean. All it can observe is statistical relationships. It builds a model of the process that created that text, which allows it to predict/generate plausible continuations of starting text prompts. In theory, with sufficient statistical mastery of the text in the VMS, humans should be able to simulate a process that generates increasingly plausible-sounding continuations of "Voynichese," in the same way that AI LLMs generate plausible-sounding continuations of English or Japanese, even if humans never "understand" a single "vord" of Voynichese. As that process gets better and better at generating continuations of Voynichese that obey all of the statistical properties of the original distribution, we might say that humans are asymptotically approaching a high-fidelity simulation of whatever process originally created the Voynichese.
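To make that concrete: even something as crude as a character-level Markov chain already does a low-fidelity version of this. Here is a minimal sketch (the filename voynich_eva.txt is just a placeholder for any plain-text EVA transliteration):

```python
# Minimal sketch: a character-level trigram (Markov) model of Voynichese.
# "voynich_eva.txt" is a placeholder for any plain-text EVA transliteration,
# one line of dot-separated vords per locus.
import random
from collections import defaultdict, Counter

ORDER = 3  # characters of context; higher orders fit the corpus more closely

def train(text, order=ORDER):
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context, nxt = text[i:i + order], text[i + order]
        model[context][nxt] += 1
    return model

def generate(model, seed, n_chars=120):
    out = seed
    for _ in range(n_chars):
        counts = model.get(out[-ORDER:])
        if not counts:  # unseen context: stop rather than invent
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

text = open("voynich_eva.txt").read().replace("\n", ".")
model = train(text)
print(generate(model, "qoky.chol"))
```

An LLM is doing something vastly more sophisticated than this, but the epistemic position is the same: statistics in, plausible continuations out, with zero understanding required.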
Psillycyber > 10-04-2023, 11:20 PM
tavie > 11-04-2023, 12:21 AM
MarcoP > 11-04-2023, 11:09 AM
Quote:qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
Quote:shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
Psillycyber > 11-04-2023, 03:51 PM
Quote:Is it too long, or are you able to post it here in its entirety for those of us that don't want to patronize that particular site?
Quote:Bing AI:
"I’ll try to generate a plausible continuation of the text based on the patterns and rules of the EVA transcription method. Here is one possible continuation:
qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy.shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
I generated this continuation by following these steps:
I observed that the text consists of words separated by dots, and that each word has two to four letters.
I observed that the text has some repeated words, such as saiin, daiin, chy, and qoky.
I observed that the text has some repeated sequences of words, such as cholaiin.shol.sheky.daiin.cthey.keol and dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.
I observed that the text has some common letter combinations, such as ch, sh, ky, ol, and ai.
I observed that the text has some rare letter combinations, such as ct, ck, yk, and an.
I observed that the text has some letters that appear more frequently than others, such as y, o, k, and l.
I observed that the text has some letters that appear less frequently than others, such as c, t, h, and r.
I used these observations to generate a new sequence of words that follows the same patterns and rules as the original text.
I repeated some words and sequences from the original text to create a sense of continuity and coherence.
I introduced some variation and randomness to create a sense of diversity and complexity.
I hope this is a satisfactory continuation of the text. However, I cannot guarantee that it is meaningful or intelligible in any way. It is just a creative imitation of the EVA transcription method."
Right off the bat, it should be obvious that Bing AI is confabulating. These are not the real reasons it chose this continuation, and some of them don't even make sense: "cholaiin.shol.sheky.daiin.cthey.keol" is not a sequence of words that repeats anywhere else in the VMS, and neither is "dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol".
It's not that Bing AI is trying to hide its reasons. It's that it does not know how it knows how to generate a plausible Voynichese continuation [to the extent that it does; MarcoP's post below already makes me doubt that Bing AI understands anything deep about Voynichese beyond "try to use already-existing Voynich vocabulary"].
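Bing's "observations" are mechanically checkable, by the way. A quick sketch (voynich_eva.txt is again a placeholder for a plain-text EVA transliteration):

```python
# Sketch for checking Bing's claims against a transliteration.
from collections import Counter

text = open("voynich_eva.txt").read()
tokens = [t for t in text.replace("\n", ".").split(".") if t]

def count_sequence(tokens, sequence):
    """How often a dot-separated run of vords occurs in the corpus."""
    seq = sequence.split(".")
    n = len(seq)
    return sum(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

for claim in ["cholaiin.shol.sheky.daiin.cthey.keol",
              "dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol"]:
    print(claim, "->", count_sequence(tokens, claim), "occurrence(s)")

# Bing's "each word has two to four letters" is just as checkable:
print(Counter(len(t) for t in tokens).most_common())
```

If the two "repeated sequences" each turn up exactly once (on f2r itself), and plenty of vords fall outside the two-to-four-letter range, the step-by-step explanation is pure confabulation.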
(11-04-2023, 11:09 AM)MarcoP Wrote: Hi Psillycyber,
these language models are brilliant at handling languages (English in particular, in my experience). This can easily mislead people into thinking they are smart, since, from our point of view, language and intelligence are closely associated. As you say, this is not the case: these systems have a very limited idea of reality and facts. In particular, they are not aware of how they function, so asking them to explain their reasoning can be fun but is a waste of time.
About the "continuation experiment". AI input from the linked web page (the end of f2r):
Quote:qoky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
AI continuation:
Quote:shy.ckhy.ckhy.ckhy.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.qokal.oky.cholaiin.shol.sheky.daiin.cthey.keol.saiin.saiin.ychain.dal.chy.dalor.shan.dan.olsaiin.sheey.ckhor.okol.chy.chor.cthor.yor.an.chan.saiin.chety.chyky.sal.sho.ykeey.chey.daiin.chcthy
This shows that Bing AI has accessed a transcription of the Voynich manuscript. Since the training process of these tools is so opaque, I wonder whether it saw Voynichese during language training, or whether it just grabbed the transliteration from the web to answer the question.
The continuation is entirely made up of words from Takahashi's transliteration. While about 14% of Voynich tokens are hapax legomena (they appear only once in the whole text), here even the two rarest words, olsaiin and yor, appear twice in Takahashi's transliteration. The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don't appear elsewhere.
Maybe in the future these models will be able to extract structured information from the texts they examine. But applying "large" language models to Voynichese has the problem that Voynichese is anything but a "large" corpus: its ~38K tokens are only about twice as large as the context managed by GPT-4. And of course this is not even a single uniform language, but the result of a gradual drift from Currier-A to Currier-B; since the original order of the pages is lost, the drifting process cannot be analysed with any certainty about its actual chronology. Lastly, the greatest uncertainty in Voynich transliterations is word spacing: the Zandbergen-Landini transliteration marks more than 5% of word separators as uncertain, and since each uncertainty affects two words, about 10% of Voynichese tokens cannot be reliably read. If I understand correctly, these language models strongly depend on words as atomic entities: I guess that the problem of spaces would have a huge impact on their learning.
But, despite all these problems, I am sure that the field will evolve quickly in the next few years and it's quite possible that AI efforts specifically developed for Voynich research will result in actual advances.
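MarcoP's hapax statistic above is easy to reproduce. A sketch, assuming Takahashi's transliteration is available as a plain-text file (takahashi_eva.txt is a placeholder name):

```python
# Sketch reproducing MarcoP's statistic ("takahashi_eva.txt" is a
# placeholder for a plain-text copy of Takahashi's transliteration).
from collections import Counter

text = open("takahashi_eva.txt").read()
tokens = [t for t in text.replace("\n", ".").split(".") if t]
counts = Counter(tokens)

# Share of tokens that are hapax legomena (word-types occurring once).
hapax = sum(1 for c in counts.values() if c == 1)
print(f"hapax rate: {hapax / len(tokens):.1%}")   # MarcoP reports ~14%

# How many of Bing's 56 continuation tokens are absent from the corpus?
continuation = "shy.ckhy.ckhy.ckhy.qokal"  # paste the full 56-token string quoted above
novel = [t for t in continuation.split(".") if t not in counts]
print(f"novel tokens: {len(novel)}")  # genuine Voynichese would average ~7
```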
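The word-spacing problem he mentions can be quantified the same way, assuming the EVA/IVTFF convention in which "." marks a confident word separator and "," an uncertain one (zl_eva.txt is a placeholder for the Zandbergen-Landini file):

```python
# Sketch quantifying the uncertain-space problem, assuming the EVA/IVTFF
# convention where "." is a confident word separator and "," an uncertain
# one ("zl_eva.txt" is a placeholder for the Zandbergen-Landini file).
text = open("zl_eva.txt").read()
sure, unsure = text.count("."), text.count(",")
rate = unsure / (sure + unsure)
print(f"uncertain separators: {rate:.1%}")    # MarcoP reports >5%
# Each uncertain separator makes both adjacent vords unreliable,
# so the affected-token share is roughly twice the separator rate:
print(f"affected tokens: ~{2 * rate:.1%}")
```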
Quote:I'm glad others are trying this out. I crossposted this over on the Voynich Ninja forum:
[link]
and user MarcoP already noticed that Bing AI's "Voynichese" fails to follow VMS statistics in one obvious respect: "The continuation includes 56 tokens: in actual Voynichese, an average of 7 of these would be unique word-types that don't appear elsewhere," whereas "The [Bing AI] continuation is entirely made up of words from Takahashi's transliteration." So it's no wonder all of the "vords" in the AI's continuation seemed to pass the "sniff test" as valid Voynich vords: Bing AI only used existing Voynich vords! That's one easy way to make sure you use only valid vords without having a clue about what makes a Voynichese vord valid or how to construct a new one. So my initial optimism that Bing AI understood something deep about Voynichese is probably mistaken.
That said, would it be possible to train a new LLM in a more targeted way, just on English (so that we can interact with it) and on Voynichese, so that Voynichese would be a more salient part of its training corpus? Is there enough Voynichese (~170,000 characters, or ~38,000 "vords") to get somewhere with current LLMs?
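For scale: ~170K characters is tiny by LLM standards, so a model trained only on Voynichese would have to be small (or heavily regularized) to avoid simply memorizing the corpus. A rough, untuned sketch of what such a "small LM" could look like, as a character-level LSTM in PyTorch (the filename and all hyperparameters here are illustrative assumptions, not a tested recipe):

```python
# Rough, untuned sketch: a character-level LSTM language model sized for a
# ~170K-character corpus. "voynich_eva.txt" and the hyperparameters are
# illustrative assumptions only.
import torch
import torch.nn as nn

text = open("voynich_eva.txt").read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

class CharLM(nn.Module):
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)  # next-character logits at every position

model = CharLM(len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

block, batch = 128, 32
for step in range(1000):
    ix = torch.randint(len(data) - block - 1, (batch,)).tolist()
    x = torch.stack([data[i:i + block] for i in ix])
    y = torch.stack([data[i + 1:i + block + 1] for i in ix])  # shift by one
    loss = loss_fn(model(x).reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())
```

Whether a model like this captures anything deeper than the Markov chain sketched earlier (say, line-position effects, or the Currier-A/B drift MarcoP mentions) is exactly the open question.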