(03-07-2025, 01:30 AM)ReneZ Wrote: Interesting!
You may want to try this on a sample of Torsten's generated text...
Hello René.
Here is the result for an 11k-word text generated with Torsten Timm's algorithm:
Code:
Working on Torsten Timm generated
10832 words identified.
Training model for Torsten Timm generated
number of parameters: 1.08M
Random initial word: sharo
Entropy of word distribution for 'Torsten Timm generated': 8.7303 bits
Top 10 words by entropy contribution:
word count prob entropy
chedy 17 0.016983 0.099856
cheedy 16 0.015984 0.095380
char 15 0.014985 0.090814
daiin 15 0.014985 0.090814
ar 11 0.010989 0.071514
dain 11 0.010989 0.071514
air 11 0.010989 0.071514
aiin 10 0.009990 0.066387
ol 10 0.009990 0.066387
ain 10 0.009990 0.066387
✔️ % of 2-grams found in original text: 18.20% (182/1000)
✔️ % of 3-grams found in original text: 0.10% (1/999)
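For anyone who wants to check the entropy figure, it is just the Shannon entropy of the word frequency distribution, and each word's "entropy contribution" in the table is its -p·log2(p) term. A minimal sketch (the word list below is a toy example, not the actual 1001-word sample):

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (bits) of the word frequency distribution,
    plus each word's contribution -p * log2(p)."""
    counts = Counter(words)
    total = len(words)
    contrib = {w: -(c / total) * math.log2(c / total)
               for w, c in counts.items()}
    return sum(contrib.values()), contrib

# Toy data, just to show the mechanics:
words = ["chedy", "chedy", "daiin", "ol", "chedy", "ar"]
h, contrib = word_entropy(words)
top = sorted(contrib.items(), key=lambda kv: -kv[1])  # top contributors
```

With the real sample, summing -p·log2(p) over all distinct words gives the 8.7303 bits reported above (e.g. chedy: 17/1001 ≈ 0.016983, contributing ≈ 0.0999 bits).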
At first glance, Torsten Timm’s generated text looks like the Voynich Manuscript: it has similar-looking words, character patterns, and structure. But when we analyze how well a language model like GPT can learn it, something surprising happens: the results are much higher than for the original Voynich text, and very similar to natural languages.
I think what happens is the following: Timm’s generator works by modifying previous words to create new ones. This makes the text highly repetitive and consistent in its patterns, so it is "easy" to predict from left to right. This regularity lowers the perplexity, meaning GPT finds it easier to guess the next character.
Even though the real Voynich also has repeated words and patterns, it is much less predictable: it has strange or rare characters, unusual word constructions, etc. This makes the Voynich more chaotic and harder to model, which increases perplexity. GPT struggles more with it.
Timm’s model generates text that looks like Voynich in terms of length, characters, and repetition. But it may lack real linguistic structure (if the Voynich has any), hidden rules (maybe), semantic meaning, or encoding logic... who knows? So GPT finds it easier to learn — but that doesn’t mean it’s more authentic. In fact, the very ease of learning it may suggest it’s not like the real Voynich at all.
You can find a sample of outputs with block_size=8 (so 8 previous tokens for learning and predicting) here:
To understand the html files, note the following (example: the GPT output was sharo.ydy.cheal.omororom.chol.pcheo.olkaim):
- in the html file, you will see the text bigram by bigram (or trigram by trigram), i.e. sharo.ydy ydy.cheal cheal.omororom omororom.chol chol.pcheo pcheo.olkaim
- these bigrams are compared against the original corpus and highlighted in green if they appear in the exact same order, or in red if they do not (i.e., the two words never occur together in that order in the original text).
- the GPT does not create new words; all words are valid ones. The creation of new words is studied in another thread, where I trained a GPT character by character.
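The green/red highlighting boils down to a membership test on consecutive word pairs. A minimal sketch of the idea (function names are my own, not the actual script):

```python
def ordered_bigrams(words):
    """All consecutive word pairs, in order of appearance."""
    return list(zip(words, words[1:]))

def bigram_coverage(generated, original):
    """Fraction of generated bigrams that occur, in the same order,
    anywhere in the original corpus (the 'green' bigrams)."""
    original_set = set(ordered_bigrams(original))
    gen = ordered_bigrams(generated)
    hits = sum(1 for bg in gen if bg in original_set)
    return hits / len(gen), hits, len(gen)

# Toy example:
orig = "sharo ydy cheal chol pcheo".split()
gen = "sharo ydy cheal omororom chol pcheo".split()
ratio, hits, total = bigram_coverage(gen, orig)
```

The same logic with `zip(words, words[1:], words[2:])` gives the trigram check; the 18.20% (182/1000) figure above is exactly this ratio computed over the GPT output against the original corpus.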
I think this leads us to two interpretations (nothing new, sorry):
1: The Voynich Manuscript is meaningful but complex
- The high perplexity means the text is hard to predict, which could point to deep structure or an unknown system.
- This might suggest it’s a real language, a cipher, or a sophisticated code that GPT simply hasn’t learned how to "understand".
- In this view, the Voynich Manuscript is not random, but intentionally complex and meaningful.
2: The Voynich Manuscript is meaningless or pseudo-text
- The high perplexity may reflect a lack of consistent rules or patterns, making it hard to learn.
- It could indicate the text was intentionally designed to look like language, but doesn’t actually encode meaning.
- In this view, the Voynich Manuscript is elaborate gibberish, which GPT finds difficult because there’s nothing real to model.