(03-07-2025, 01:30 AM)ReneZ Wrote: Interesting!
You may want to try this on a sample of Torsten's generated text...
Hello René.
Here is the result for an 11k-word text generated with Torsten Timm's algorithm:
Code:
Working on Torsten Timm generated
10832 words identified.
Training model for Torsten Timm generated
number of parameters: 1.08M
Random initial word: sharo
Entropy of word distribution for 'Torsten Timm generated': 8.7303 bits
Top 10 words by entropy contribution:
word count prob entropy
chedy 17 0.016983 0.099856
cheedy 16 0.015984 0.095380
char 15 0.014985 0.090814
daiin 15 0.014985 0.090814
ar 11 0.010989 0.071514
dain 11 0.010989 0.071514
air 11 0.010989 0.071514
aiin 10 0.009990 0.066387
ol 10 0.009990 0.066387
ain 10 0.009990 0.066387
✔️ % of 2-grams found in original text: 18.20% (182/1000)
✔️ % of 3-grams found in original text: 0.10% (1/999)
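For anyone who wants to check the entropy figure, it is just the Shannon entropy of the word frequency distribution, and each word's "entropy contribution" in the table is its -p·log2(p) term. A minimal sketch (the word list below is a toy example, not the actual 1001-word sample):

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (bits) of the word frequency distribution,
    plus each word's contribution -p * log2(p)."""
    counts = Counter(words)
    total = len(words)
    contrib = {w: -(c / total) * math.log2(c / total)
               for w, c in counts.items()}
    return sum(contrib.values()), contrib

# Toy data, just to show the mechanics:
words = ["chedy", "chedy", "daiin", "ol", "chedy", "ar"]
h, contrib = word_entropy(words)
top = sorted(contrib.items(), key=lambda kv: -kv[1])  # top contributors
```

With the real sample, summing -p·log2(p) over all distinct words gives the 8.7303 bits reported above (e.g. chedy: 17/1001 ≈ 0.016983, contributing ≈ 0.0999 bits).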
At first glance, Torsten Timm’s generated text looks like the Voynich Manuscript: it has similar-looking words, character patterns, and structure. But when we analyze how well a language model like GPT can learn it, something surprising happens: the results are much higher than for the original Voynich text, and very similar to natural languages.
I think what happens is the following: Timm’s generator works by modifying previous words to create new ones. This makes the text highly repetitive and consistent in its patterns, so it is "easy" to predict from left to right. This regularity lowers the perplexity, meaning GPT finds it easier to guess the next character.
Even though the real Voynich also has repeated words and patterns, it is much less predictable: it has strange or rare characters, unusual word constructions, etc. This makes the Voynich more chaotic and harder to model, which increases perplexity. GPT struggles more with it.
Timm’s model generates text that looks like Voynich in terms of length, characters, and repetition. But it may lack real linguistic structure (if the Voynich has any), hidden rules (maybe), semantic meaning, or encoding logic... who knows? So GPT finds it easier to learn — but that doesn’t mean it’s more authentic. In fact, the very ease of learning it may suggest it’s not like the real Voynich at all.
You can find a sample of outputs with block_size=8 (so 8 previous tokens for learning and predicting) here:
To understand the html files, note the following (example: the GPT output was sharo.ydy.cheal.omororom.chol.pcheo.olkaim):
- in the html file, you will see the text bigram by bigram (or trigram by trigram), i.e. sharo.ydy ydy.cheal cheal.omororom omororom.chol chol.pcheo pcheo.olkaim
- these bigrams are compared against the original corpus and highlighted in green if they appear in the exact same order, or in red if they do not (i.e., the two words never occur together in that order in the original text).
- the GPT does not create new words; all words are valid ones. The creation of new words is studied in another thread, where I trained a GPT character by character.
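The green/red highlighting boils down to a membership test on consecutive word pairs. A minimal sketch of the idea (function names are my own, not the actual script):

```python
def ordered_bigrams(words):
    """All consecutive word pairs, in order of appearance."""
    return list(zip(words, words[1:]))

def bigram_coverage(generated, original):
    """Fraction of generated bigrams that occur, in the same order,
    anywhere in the original corpus (the 'green' bigrams)."""
    original_set = set(ordered_bigrams(original))
    gen = ordered_bigrams(generated)
    hits = sum(1 for bg in gen if bg in original_set)
    return hits / len(gen), hits, len(gen)

# Toy example:
orig = "sharo ydy cheal chol pcheo".split()
gen = "sharo ydy cheal omororom chol pcheo".split()
ratio, hits, total = bigram_coverage(gen, orig)
```

The same logic with `zip(words, words[1:], words[2:])` gives the trigram check; the 18.20% (182/1000) figure above is exactly this ratio computed over the GPT output against the original corpus.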
I think this leads us to two interpretations (nothing new, sorry):
1: The Voynich Manuscript is meaningful but complex
- The high perplexity means the text is hard to predict, which could point to deep structure or an unknown system.
- This might suggest it’s a real language, a cipher, or a sophisticated code that GPT simply hasn’t learned how to "understand".
- In this view, the Voynich Manuscript is not random, but intentionally complex and meaningful.
2: The Voynich Manuscript is meaningless or pseudo-text
- The high perplexity may reflect a lack of consistent rules or patterns, making it hard to learn.
- It could indicate the text was intentionally designed to look like language, but doesn’t actually encode meaning.
- In this view, the Voynich Manuscript is elaborate gibberish, which GPT finds difficult because there’s nothing real to model.