Cut out - LLM compared to random - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Voynich Talk (https://www.voynich.ninja/forum-6.html)
+--- Thread: Cut out - LLM compared to random (/thread-5179.html)
Cut out - LLM compared to random - Ludwigvan - 27-12-2025

Hi everyone, this is not a claim, but a question about a work in progress. Is this a legitimate strategy for finding out something about the text, or is it just not helpful at all?

Algorithm:
- Cut 10 random pages out of the full text (Voynich).
- Train a base LLM on the text (with the 10 pages missing).
- Give one of the cut-out pages to the LLM with one EVA token missing; the LLM guesses a token for the missing one.
- Repeat this 20 times with different missing tokens.
- Do the same with a random text.
- Do the same with a generated text (from one of the supposed solutions, for example).
- Compare the probabilities (a sketch of this scoring step follows the post).

The result will show whether the LLM can predict the real text better than the other texts. The problem might be that the result doesn't say anything about the text, only about the model. But it might say that the text is not random, for example.

I have built the LLM setup and tested it, though not scientifically, with mixed results. I used ChatGPT mostly for Python code generation and for getting a new perspective; most of what it wrote were guesses that I dismissed. I got significantly better results for predicting the real cut-out text than for a random text drawn from the EVA vocabulary. I used Qwen/Qwen3-1.7B-Base as the LLM. Against random I got something like this:

Δ avg log P/subtoken (REAL − RANDOM): 4.1635

However, this is of course not scientific, so take it with a grain of salt, especially because I am new to this and am doing it as a hobby holiday project.

A variation is: can the LLM predict a token from both sides better than from the left side only? You just have to hide all the tokens to the right for the left-side test.

It might, however, be possible to make this scientific. What do you think? Thanks for your answers.
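This is not the poster's actual pipeline, but a minimal sketch of the scoring step described above: compare the average log-probability per subtoken that a causal LM assigns to the true held-out token versus a random token from the EVA vocabulary, given the left context. The model name is the one mentioned in the post; the fine-tuning and page cut-out steps are omitted, and the context and candidate strings are purely illustrative.

```python
# Sketch only: score a candidate EVA token by the average log-probability per
# subtoken that a causal LM assigns to it, given the left context.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Qwen/Qwen3-1.7B-Base"   # base model mentioned in the post
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def avg_logp_per_subtoken(context: str, candidate: str) -> float:
    """Average log P of the candidate's subtokens, conditioned on the left context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cand_ids = tok(" " + candidate, add_special_tokens=False,
                   return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # log-probs for each position, predicted from everything to its left
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_ctx = ctx_ids.shape[1]
    scores = [logprobs[n_ctx - 1 + i, cand_ids[0, i]].item()
              for i in range(cand_ids.shape[1])]
    return sum(scores) / len(scores)

# Illustrative comparison: real held-out token vs. a random EVA-vocabulary token.
context = "daiin shedy qokeedy chedy "                     # made-up EVA context
delta = (avg_logp_per_subtoken(context, "qokaiin")          # "real" token
         - avg_logp_per_subtoken(context, "otchor"))        # "random" token
print(f"Δ avg log P/subtoken (REAL − RANDOM): {delta:.4f}")
```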
RE: Cut out - LLM compared to random - oshfdk - 27-12-2025

Hi, I assume you mean something like "fine-tune" when you say "train"; I don't think it's possible to train an LLM from the text of the manuscript alone, it would take a few orders of magnitude more data.

I think it should be possible to create some prediction model and check how much the actual text outperforms randomly shuffled tokens. I don't know if the result will have any practical application. Suppose the result is 4.1635, as in your experiment; what are the conclusions here? After all, natural text will show some degree of unpredictability, and at the same time random sequences generated with methods available in the 15th century will show some degree of predictability.

Checking whether it's easier to predict a token from the left-side context or the right-side context looks interesting, but again I'm not sure about the implications. I don't even know what would happen if we ran this on a normal text. I guess both values would be roughly the same?

RE: Cut out - LLM compared to random - nablator - 27-12-2025

Hi, LLMs need huge datasets to be any good at predicting the next token; the VMS is too small for them. It was done recently, see [link].

(27-12-2025, 03:35 PM)Ludwigvan Wrote: But it might say that the text is not random, for example.

We already know that. Simple word-pair statistics show it clearly, no need for complex calculations.

RE: Cut out - LLM compared to random - Ludwigvan - 27-12-2025

oshfdk: yes, sorry, I meant "fine-tune"; as described, I used the base model and fine-tuned it. I am not sure about the implications either; it might only be helpful in ruling out some possibilities. With reading the context from the left side only versus bidirectionally, I thought you might be able to rule out some form of left-to-right automation, but I think I am probably wrong.

nablator: thanks for the link. Yes, this is probably right. I had better results in predicting the next token compared to a random text, but this might be flawed. Thanks, that clarifies it.

So it seems this is more of a hopeless endeavor, but I will have to think about it.

RE: Cut out - LLM compared to random - Jorge_Stolfi - 27-12-2025

(27-12-2025, 03:35 PM)Ludwigvan Wrote: - Cut 10 random pages out of the full text (Voynich).

Using an LLM to predict a missing word does not seem very different from using a second-order Markov model (one that predicts the next word W3 from the last two words W1, W2, based on the frequencies of word triples W1 W2 W3 measured on the VMS itself), or any variant thereof, e.g. predicting W2 given W1 and W3. One difference is that the LLM may use a more adaptive strategy, using more or fewer words of context depending on the local situation. Another difference is that you will have no idea what the LLM is actually doing.

In general, higher-order Markov models get better and better at reproducing the training text (the subset of the text from which the word-tuple statistics are collected). At some point they essentially memorize the training text, because (say) each triple W1 W2 W3 occurs only once in it, and thus the model has only one choice for the next word W4. Then the model will get very good at predicting that text, and very bad at predicting the other half of the text that was omitted from the training.
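For concreteness, here is a minimal sketch of such a second-order (word-triple) Markov predictor. The whitespace tokenization, the commented-out file name, and the sample EVA words are illustrative assumptions, not anyone's actual code.

```python
# Sketch of a second-order Markov word model: count triples (W1, W2, W3) and
# predict W3 from the bigram (W1, W2) by its most frequent continuation.
from collections import Counter, defaultdict

def train_order2(words):
    """Collect counts of W3 for each preceding bigram (W1, W2)."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def predict(counts, w1, w2):
    """Most frequent continuation of (w1, w2), or None if the bigram is unseen."""
    cont = counts.get((w1, w2))
    return cont.most_common(1)[0][0] if cont else None

# Illustrative usage on a whitespace-tokenized transcription, e.g.:
# words = open("voynich_eva.txt").read().split()
words = "daiin shedy qokeedy daiin shedy chedy daiin shedy qokeedy".split()
model = train_order2(words)
print(predict(model, "daiin", "shedy"))   # -> 'qokeedy' (seen twice vs. once)
```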
If the lexicon has N words, and all of them occur with roughly equal frequency, the training text for an order-k Markov model must have a few times N^(k+1) words, so that the frequency of W[k+1] given W1 W2 ... Wk can be measured from it for all (k+1)-tuples. Namely, if the lexicon has 2000 words, you need a text of tens of millions of words in order to properly train even a first-order Markov model. It is much worse if the words have highly unequal frequencies, as predicted by Zipf's law.

Said another way, an order-k Markov model has N^(k+1) internal parameters, so you need that magnitude of training data to even begin adjusting them to their proper values. LLMs have this same problem. Depending on how many internal parameters the LLM has, you may need a training text of billions of words for them to get remotely close to proper values.

Then there is the question of what "random" text you are using as a control. For instance, if the "random" text is generated by a second-order Markov model, a predictor that also uses a second-order Markov model may have the same success rate on both texts, I don't know...

All the best, --stolfi
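As a quick back-of-envelope check of the data requirement argued above, using the illustrative lexicon size from the post:

```python
# Rough data requirement for an order-k word Markov model, assuming a roughly
# uniform lexicon as in the argument above; the numbers are illustrative.
N = 2000                       # lexicon size from the example
for k in (1, 2):               # Markov order
    tuples = N ** (k + 1)      # (k+1)-tuples whose frequencies must be estimated
    print(f"order-{k}: ~{tuples:,} tuples -> a few times that many training words")
# order-1: ~4,000,000 tuples, i.e. tens of millions of training words
```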