The Voynich Ninja
GPT Models Fail to Find Language Structure in the Voynich Manuscript - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: GPT Models Fail to Find Language Structure in the Voynich Manuscript (/thread-4785.html)

Pages: 1 2 3


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - Koen G - 02-07-2025

(02-07-2025, 07:05 PM)quimqu Wrote: This also makes me wonder: are you suggesting that Voynichese might behave similarly because it encodes something like a “morphologically rich” or non-repetitive learned language — perhaps even analogous to medieval Latin?

I don't think so. If you remove all potential suffixes from Voynichese, there's not enough left in terms of roots. At least that's my intuitive feeling about it. Maybe it's worth an experiment.


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - quimqu - 02-07-2025

magnesium Wrote: This could be interesting to test systematically with different kinds of ciphertexts. Are there classes of substitution ciphers, for instance, that produce readily decipherable ciphertexts that exhibit anomalously low, VMS-like GPT predictability? If a given type of cipher, encrypting a wide range of plaintexts, consistently appears to be more GPT-predictable than the VMS, then that kind of cipher probably isn't consistent with the VMS. But this method probably would be overkill: monoalphabetic substitution ciphers would probably show up as much more predictable in this analysis than the VMS, but we can rule out monoalphabetic substitution ciphers using much less computationally expensive techniques.

Hello! Check this!

I will work with the ciphers and the GPT

(02-07-2025, 08:26 PM)Koen G Wrote:
(02-07-2025, 07:05 PM)quimqu Wrote: This also makes me wonder: are you suggesting that Voynichese might behave similarly because it encodes something like a “morphologically rich” or non-repetitive learned language — perhaps even analogous to medieval Latin?

I don't think so. If you remove all potential suffixes from Voynichese, there's not enough left in terms of roots. At least that's my intuitive feeling about it. Maybe it's worth an experiment.

Interesting!


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - quimqu - 02-07-2025

(02-07-2025, 07:20 PM)oshfdk Wrote: I just think that using a black box like GPT for analysis should be accompanied with very strict definition of what the inputs mean exactly and how exactly we interpret the outputs and why. Black boxes already introduce a lot of uncertainty by themselves, when we multiply uncertainties, weird things can happen.

I agree, but the target of my experiment was just this: to check what a black box like a transformer can learn. It has been demonstrated that transformers "learn" grammatical structures, vocabulary, etc. of any language, so I tested what happens with the Voynich, and it failed. Of course I will try and push for more results, but at least for now, the outputs of those black boxes tell us that they cannot "understand" that "language" in terms of a natural language (the target of the transformers).


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - Jorge_Stolfi - 02-07-2025

(02-07-2025, 04:10 PM)quimqu Wrote: I trained several nanoGPT models (roughly 1.1M parameters each) on corpora limited to 11,000 words each. The corpora included:

In case you want to try a wider sample, here are some texts that I collected a while ago:


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - ReneZ - 03-07-2025

Interesting!
You may want to try this on a sample of Torsten's generated text...


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - Jorge_Stolfi - 03-07-2025

(02-07-2025, 04:10 PM)quimqu Wrote: The GPT model is a type of neural network trained to predict the next token (e.g., word or character) in a sequence, based on the context of the previous ones.

I am struggling to understand what was tested and what the results mean.

It would help to see a sample of the "three-word sequences of the original text" that were reproduced by the trained model.   "Off with her head!"?

I think that a neural network is overkill for that purpose. Generating random words according to a frequency distribution that depends on the last k words is a Markov chain of order k. "Training" it is simply computing those probability distributions. It is a rather simple programming exercise (I did it when I was in college), and with k = 2 or k = 3 it already generates nonsense that sounds very much like the original.

The problem for such automatic gibberish generators (of whatever method -- Markov, neural net, etc.) is that you need a LOT of sample text to get it to learn the features of the language. If the sample is too small, it will instead just learn the sample, and produce an output that is longish pieces of the sample spliced together at a few switch points. Specifically, for k = 2, if a pair of words like "rear view" occurs only once in the sample, the generator will always output the word that followed them in the sample, say "mirror". And then if "view mirror" occurred only once, it will always output the next word in that occurrence, say "was". Randomness would be used only when the last two words generated occurred twice or more in the sample.
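The order-k generator Jorge describes can be sketched in a few lines (a generic illustration, not his original college code). Note how a context counted only once forces the next word, which is exactly the copy-paste splicing behaviour he points out:

```python
import random
from collections import Counter, defaultdict

def train_markov(words, k=2):
    """'Training' is just counting which word follows each k-word context."""
    model = defaultdict(Counter)
    for i in range(len(words) - k):
        model[tuple(words[i:i + k])][words[i + k]] += 1
    return model

def generate(model, k=2, length=20, seed=0):
    """Walk the chain, picking successors in proportion to their counts.
    A context seen only once has a single successor, so the output there
    is a verbatim copy of the training sample."""
    rng = random.Random(seed)
    out = list(rng.choice(list(model.keys())))
    while len(out) < length:
        followers = model.get(tuple(out[-k:]))
        if not followers:  # context never seen in training: dead end
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out
```

With a large enough sample the counts spread over many successors and the output becomes genuinely novel; with a tiny sample nearly every context is unique and the generator just replays the source.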

How small is "too small"? Suppose that the language has a vocabulary of 1000 distinct words, and is such that every three-word combination occurs with equal frequency. (That would make the language seem totally random, but it is in fact what a maximally efficient language -- one that conveys the most information per word, with that vocabulary -- would look like.) For a Markov chain (or neural network) to learn that fact, you would need a sample with at least 1000^3 = 1 billion words. Actually more like 30 billion, because of a thing called the "coupon collector's problem". And much more than that for a natural language with a Zipf-like word frequency distribution.
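The coupon-collector estimate is easy to rough out: to see each of the n = 1000³ equally likely trigrams at least once, the expected number of draws is about n·H_n ≈ n(ln n + γ), which lands in the tens of billions, the same order as the figure above:

```python
import math

n = 1000 ** 3  # distinct, equally likely three-word combinations
# Coupon collector: expected draws to see all n items is n * H_n ~ n * (ln n + gamma)
gamma = 0.5772156649  # Euler-Mascheroni constant
expected = n * (math.log(n) + gamma)
print(f"about {expected / 1e9:.0f} billion trigram samples needed")
```

And that is the best case; under a Zipf-like distribution the rare trigrams dominate the waiting time, pushing the requirement far higher still.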

So it is no wonder that your neural network generated text which reproduced many of the 3-word sequences of the original.  

Another issue, that every Voynichologist should be aware of, is that any VMS transcription, no matter how careful, will contain a lot of errors, because in many cases there is no way to tell whether the thing that the Scribe wrote is an a or an o, an r or an s, a ch or an ee, etc. And the Scribe himself may have made many such errors when reading the draft. I would guess that at least 10% of the o's in the file were intended by the Author to be a's and vice versa, and ditto for r and s.

Another place where any transcription may be wrong is in the word spaces. Again, the Scribe may accidentally have run two words together or split a word in two. For instance, as others have noted, words that begin with y are more common at the start of a line. But some of those y's are isolated, well separated from the next word, while others are part of a longer word. Is that difference intentional? If not, a large fraction of the lines will have an error there -- a missing space or a spurious space.

Those errors will make the sample seem more random, hence will reduce the tendency of the gibberish generator to merely copy-paste long substrings of the sample. That may be the reason why your Voynichese numbers were so low.

It would be interesting to see what your network would do if you mapped every a to o, every s to r, every ch to ee...
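That mapping is a one-line preprocessing pass over the transcription; a minimal sketch using only the pairs named above (the two-character glyph is handled first so the single-character rules cannot split it):

```python
def merge_ambiguous_glyphs(eva_text):
    """Collapse glyph pairs that are hard to tell apart on the page:
    ch -> ee, a -> o, s -> r (multi-character glyph replaced first)."""
    for old, new in [("ch", "ee"), ("a", "o"), ("s", "r")]:
        eva_text = eva_text.replace(old, new)
    return eva_text

print(merge_ambiguous_glyphs("chol daiin shey"))  # -> "eeol doiin rhey"
```

Running the same training on the merged text would show whether the low predictability survives once those visually ambiguous distinctions are erased.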

All the best, --jorge


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - quimqu - 03-07-2025

(02-07-2025, 11:10 PM)Jorge_Stolfi Wrote:
(02-07-2025, 04:10 PM)quimqu Wrote: I trained several nanoGPT models (roughly 1.1M parameters each) on corpora limited to 11,000 words each. The corpora included:

In case you want to try a wider sample, here are some texts that I collected a while ago:

Thanks a lot! In a couple of days I will send you some results.


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - stopsquark - 03-07-2025

Did the Latin text use scribal abbreviations (and if so, how were these tokenized?) or was it first expanded into plaintext? Many folks on this site have pointed out the similarities between Voynichese letters and Latin scribal abbreviations, and I'd be very curious about whether or not a Latin text with the sigla tokenized as distinct characters behaves like Voynichese. This would effectively be a verbose substitution cipher, but one leaning strongly on historical convention.
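One way to set up that experiment is to collapse abbreviation expansions back into single siglum tokens before training; this is a hypothetical sketch -- the three expansions and the placeholder token names are illustrative only, not a real siglum inventory:

```python
# Illustrative siglum inventory: each Latin sequence a scribe would
# abbreviate is mapped to a single placeholder token.
SIGLA = {
    "que": "<QUE>",
    "per": "<PER>",
    "us": "<US>",
}

def tokenize_with_sigla(text):
    """Greedily replace abbreviation targets with single tokens, longest first,
    so shorter targets never break up longer ones."""
    for expansion, token in sorted(SIGLA.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(expansion, token)
    return text

print(tokenize_with_sigla("dominus perque secula"))
```

A character-level model trained on such output sees a verbose-substitution version of Latin, which could then be compared against Voynichese on the same predictability metrics.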

Great work on this! I'm planning on tinkering with something similar in order to try to do embedding space alignment across Currier hands. I'd love to look at your repository, if it's public!


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - stopsquark - 03-07-2025

More thoughts on this: decoder-only Transformers generally do well on many nonlinguistic seq2seq tasks (arithmetic, DNA-to-protein mapping, etc.) when trained totally from scratch. If you're using a pretrained model, though, what it's pretrained on matters a lot.

I STRONGLY suspect that what's happening here is that, if you're using a pretrained model (which GPT-nano is), the non-Voynich texts were of the same language as many others in the training corpus, so the model effectively "already knew Latin" to some extent. If it's never seen EVA, though, the model is going to need a lot of fine tuning to learn it. That doesn't necessarily mean there's no linguistic structure present, though- just that EVA is likely dissimilar to the pretraining data.

Given that Voynich has much lower entropy than language generally, I strongly suspect a small model could learn it very well if trained from scratch- I'd try that next, if you've got the compute for it.


RE: GPT Models Fail to Find Language Structure in the Voynich Manuscript - quimqu - 03-07-2025

(03-07-2025, 08:01 AM)stopsquark Wrote: if you're using a pretrained model, though, what it's pretrained on matters a lot.

Hello!

The nanoGPT models are not pretrained; they learn from zero, book by book. That's why I wanted to keep the corpora similar: I cut the natural-language books at 11k words, so I could compare similar-sized samples of the languages and of the Voynich.