hi everyone,
this is not a claim, but a question about a work in progress.
Is this a legitimate strategy for finding out something about the text, or is it just not helpful at all?
Algo:
- cut 10 random pages out of the full text (Voynich)
- train a base LLM on the text (with those 10 pages missing).
- give one of the cut-out pages to the LLM with one EVA token missing; the LLM guesses the missing token.
- repeat this 20 times with different tokens missing.
- do the same with a random text
- do the same with a generated text (from one of the proposed solutions, for example).
- compare the probabilities (a sketch of the scoring step follows this list).
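To make the scoring step concrete, here is a minimal sketch, assuming a Hugging Face causal LM (the base model named below, or the checkpoint after training on the Voynich text with the pages removed); the function and variable names are just illustrative, not a fixed API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B-Base"  # or the checkpoint trained on the 10-pages-removed text
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_logprob_of_token(left_context: str, target_token: str) -> float:
    """Average log P per subtoken of target_token, given only the left context."""
    ctx_ids = tokenizer(left_context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(" " + target_token, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # the logits at positions [ctx_len - 1, seq_len - 1) predict the target subtokens
    log_probs = torch.log_softmax(logits[0, ctx_ids.shape[1] - 1:-1], dim=-1)
    per_subtoken = log_probs.gather(1, tgt_ids[0].unsqueeze(1)).squeeze(1)
    return per_subtoken.mean().item()
```

The same function can then be called once with the real EVA token and once with each comparison token, and the scores averaged per text.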
The result should show whether the LLM can predict the real text better than the other texts.
The problem might be that the result says nothing about the text, only about the model. But it might at least show that the text is not random, for example.
I have built the LLM setup and tested it, though not scientifically, with mixed results. I used ChatGPT mostly for Python code generation and for getting a new perspective; most of what it wrote were guesses that I dismissed.
I got significantly better results when predicting the real cut-out text than for a random text built from the EVA vocabulary. I used Qwen/Qwen3-1.7B-Base as the LLM.
Against the random baseline I got something like this:
Δ avg log P/subtoken (REAL − RANDOM): 4.1635
However, this is of course not scientific, so take it with a grain of salt, especially since I am new to this and am doing it as a hobby holiday project.
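For what it's worth, the Δ number could be computed along these lines, reusing avg_logprob_of_token() from the sketch above and assuming the random baseline simply swaps each masked token for a word sampled uniformly from the EVA vocabulary (that detail is my simplification):

```python
import random

def delta_real_vs_random(trials, eva_vocab, seed=0):
    """trials: list of (left_context, true_eva_token) pairs for the ~20 masked positions."""
    rng = random.Random(seed)
    real, rand = [], []
    for left_context, true_token in trials:
        real.append(avg_logprob_of_token(left_context, true_token))
        rand.append(avg_logprob_of_token(left_context, rng.choice(eva_vocab)))
    # positive delta => the model predicts the real tokens better than random EVA words
    return sum(real) / len(real) - sum(rand) / len(rand)
```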
A variation: can the LLM predict the token better from both sides than from the left side only? For the left-side test you just hide all tokens to the right of the gap.
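Since a causal LM only reads left to right, the both-sides variant could be approximated by folding the right context into the score via the chain rule: rank each candidate EVA word w by log P(w | left) + log P(right | left, w) and check whether the true token ranks higher than under the left-only score from above. A sketch under that assumption, reusing the tokenizer and model from the first snippet:

```python
def both_sides_score(left_context: str, candidate: str, right_context: str) -> float:
    """Unnormalized score: log P(candidate | left) + log P(right | left, candidate)."""
    ctx_ids = tokenizer(left_context, return_tensors="pt").input_ids
    cand_ids = tokenizer(" " + candidate, add_special_tokens=False,
                         return_tensors="pt").input_ids
    right_ids = tokenizer(" " + right_context, add_special_tokens=False,
                          return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cand_ids, right_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    per_token = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # sum over the candidate subtokens plus everything to their right
    return per_token[ctx_ids.shape[1] - 1:].sum().item()
```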
It might, however, be possible to make this scientific. What do you think?
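One way it might be made more rigorous (my suggestion, not part of the plan above): treat the ~20 per-position differences (real-token log-prob minus random-token log-prob) as paired data and run a sign-flip permutation test; sign_flip_p_value below is a hypothetical helper:

```python
import random

def sign_flip_p_value(diffs, n_perm=10000, seed=0):
    """Two-sided p-value for the null hypothesis that the mean difference is zero."""
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```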
thanks for your answers