(25-04-2025, 10:18 PM)RadioFM Wrote:
Quote:No. Segmentation is not purely random. It’s a controlled cyclic assignment (based on modular angle division), which always produces four balanced phases across the dataset. It is not possible to end up with only one Phase.
Fair enough, I double-checked and you're right: the labelling is done evenly across the dataset and then shuffled, effectively assigning Phase 0 to 25% of the folios, Phase 1 to a different (random) 25%, and so on. But it is only balanced over the whole dataset, i.e. training + test combined; there is no guarantee the balance survives the 75-25 split. With just 4 Phases that's no big deal of course, but with more than 4 the imbalance can take a toll on the results and worsen the scores.
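A minimal sketch of what I mean, with made-up folio counts (the stratify= argument is the standard scikit-learn fix):

Code:
# Class balance after a plain vs. a stratified 75-25 split.
# The numbers are illustrative, not taken from the actual dataset.
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
phases = rng.permutation(np.repeat([0, 1, 2, 3], 25))  # 100 folios, 25 per Phase

_, test_plain = train_test_split(phases, test_size=0.25, random_state=1)
_, test_strat = train_test_split(phases, test_size=0.25, random_state=1,
                                 stratify=phases)

print("plain split:     ", sorted(Counter(test_plain).items()))  # counts usually drift
print("stratified split:", sorted(Counter(test_strat).items()))  # 6-7 per Phase, balanced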
Quote:No. The model is not trained to predict Phase using the Lunar Angle. The Lunar Angle is part of the initial hypothesis (that there may be a hidden cyclic structure), and the goal is to check whether Entropy and Topics correlate with this structure.
Is the 90-ish % accuracy you mention the score given in supervised_models_seed.py? That is the accuracy of the model's predictions.
Quote:“You’re overfitting by not splitting the data into Train/Validation/Test.”
No. This is not a predictive model intended for production. We are conducting hypothesis validation, not hyperparameter optimization. Cross-validation would be relevant in a different context.
It's not about models for production. A validation set is there to avoid overfitting, which is just as important in academia. And there IS hyperparameter optimization going on here.
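For reference, this is roughly the protocol I'd expect; the feature matrix, labels and the n_estimators grid below are placeholders, not anything from your repo:

Code:
# Train/validation/test protocol: hyperparameters are tuned on the validation
# split only, and the held-out test split is scored exactly once at the end.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def tune_and_evaluate(X, y, seed=0):
    # 60/20/20 split
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.25, random_state=seed, stratify=y_tmp)

    best_clf, best_val = None, -1.0
    for n_estimators in (50, 100, 200):          # hyperparameter search
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     random_state=seed).fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if score > best_val:
            best_clf, best_val = clf, score
    return best_clf.score(X_test, y_test)        # the only number worth reporting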
Quote:No. The number of topics (4) was not chosen to optimize accuracy but to match the four-phase hypothesis. We did not test multiple values of n_topics to pick the best one.
Oh yes you did:
Quote:Also, I think you’re right to question whether setting four topics is arbitrary. It’s not. We tried multiple options and observed that four is the only number that yields a strong, consistent internal pattern, not just via topic modeling, but also in entropy trends, classifier performance (96.5%), autocorrelation, FFT peaks, and alignment with agronomic cycles.
Quote:No. This assumes a continuous optimization landscape (e.g., with loss gradients), which does not apply. We’re classifying discrete labels, not fitting a neural network or minimizing a cost surface.
(...)
P.S. Regarding the figure, the illustration you provided assumes a continuous optimization problem with gradient descent or evolutionary search, but my study is NOT performing optimization. The seed was fixed BEFORE analysis, and no hyperparameter search was conducted. The model is NOT climbing any surface. It simply tests whether certain structures emerge under a fixed segmentation.
The picture was just to exemplify, but you can be sure that when accounting for all parameters and hyperparameters, there are indeed local optima in the solution space. And your study IS performing optimization, because training a machine learning model is itself an optimization process.
Still, I don't think this is the reason why you're getting good results, just a pitfall I can see you falling into.
Quote:To sum up, you are confusing hypothesis validation with model optimization.
Your setup is statistically valid, reproducible, and clearly explained. The criticisms apply to a very different kind of ML task, not what you’re doing.
You are using the fact that you trained an ML model to predict some labels as evidence that there is a pattern to learn in the first place.
My point still stands that you are using a trivial predictor: when ablating Lunar Angle you say the accuracy drops to about 28% (close to the 25% you'd expect from random prediction). Lunar Angle is a trivial predictor of Phase because, no matter how the dataset and the Lunar Angles are shuffled, if folio f1 gets a Lunar Angle of 92° (or whatever angle that may be depending on the seed), the following lines of code will execute and hand the 'correct' Phase to folio f1, which is exactly the label you end up measuring your accuracy against:
Code:
if 0 <= lunar_angle < 90:
    phase = 0   # Lluna Nova (New Moon)
elif 90 <= lunar_angle < 180:
    phase = 1   # Quart Creixent (First Quarter)
elif 180 <= lunar_angle < 270:
    phase = 2   # Lluna Plena (Full Moon)
elif 270 <= lunar_angle < 360:
    phase = 3   # Quart Minvant (Last Quarter)
(generate_lunar_angles_seed.py)
The AI models will have the entropy, the topics and the 'Lunar Angle' for each and every folio, and they'll eventually converge on the best way of predicting the correct Phase: looking at the Lunar Angle. That's how you get such an extraordinarily high accuracy.
Given the track record of this thread, you'll disagree with my remarks and claim I know jackshit and that I'm mixing up circular logic with validation and whatnot. So would you please download the code again into a separate folder and run the same study, but this time ablating all features EXCEPT Lunar Angle, and see if you still get >85% accuracy? Or try running it on some gibberish text: MarcoP once uploaded an OCR of the Codex Seraphinianus to GitHub, I think that ought to do it.
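The ablation check is only a few lines. Here is a minimal sketch with dummy data (the column names and feature values are made up; only the Phase thresholds match the script above):

Code:
# Does Lunar Angle alone predict Phase? Does everything else?
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200  # stand-in for the number of folios
df = pd.DataFrame({
    "lunar_angle": rng.uniform(0, 360, n),
    "entropy": rng.normal(4.0, 0.3, n),   # dummy linguistic feature
    "topic_0": rng.random(n),             # dummy topic weight
})
df["phase"] = (df["lunar_angle"] // 90).astype(int)  # same thresholds as above

for cols in (["lunar_angle"], ["entropy", "topic_0"]):
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], df["phase"],
                                               test_size=0.25, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(cols, "accuracy:", round(clf.score(X_te, y_te), 3))
# Expected: ~1.0 with lunar_angle alone, ~0.25 (chance) with the rest.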
I hear you, so I’m exploring an approach that might be worth testing.
To avoid any issues with overfitting, circularity, or trivial phase assignment, what I’m currently doing is the following (rough sketches of the control generation and of the core pipeline follow the list):
-Preprocessing the texts (EVA and controls) to extract tokens per page.
-Generating specific controls:
   -Simple randomization (CONTROL).
   -Randomization maintaining word length (CONTROL2).
   -Randomization maintaining the number of tokens per page (CONTROL3).
   -Randomization maintaining the token frequency distribution (CONTROL4).
-Vectorizing each page with TF-IDF.
-Applying thematic models like LDA and NMF (without using lunar angles).
-Segmenting the pages into macroblocks based on the detected thematic distribution (hierarchical clustering).
-Validating the structure through:
   -Permutation test with Random Forest to check whether macroblock assignment can be predicted from the topic distribution.
   -Ablation of topics to see if the structure holds when features are removed.
   -Bootstrap resampling of macroblocks to test structural robustness.
   -Additional validations, such as predicting macroblocks solely from page index (to rule out positional biases) and full-text shuffle tests.
-Comparing the full pipeline applied to EVA and the different control corpora.
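For concreteness, a rough sketch of two of the controls; the exact randomization in my scripts may differ, and this assumes each page is a plain whitespace-separated token string:

Code:
# Illustrative control generators (CONTROL2- and CONTROL3-style).
import random

def control_keep_tokens_per_page(pages, seed=0):
    """CONTROL3-style: shuffle all tokens globally, keep tokens-per-page counts."""
    rng = random.Random(seed)
    tokens = [t for p in pages for t in p.split()]
    rng.shuffle(tokens)
    out, i = [], 0
    for p in pages:
        n = len(p.split())
        out.append(" ".join(tokens[i:i + n]))
        i += n
    return out

def control_keep_word_length(pages, seed=0):
    """CONTROL2-style: replace each token with a random corpus token of the same length."""
    rng = random.Random(seed)
    by_len = {}
    for p in pages:
        for t in p.split():
            by_len.setdefault(len(t), []).append(t)
    return [" ".join(rng.choice(by_len[len(t)]) for t in p.split()) for p in pages]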
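And a minimal sketch of the core pipeline plus the permutation test; the number of topics and macroblocks, the linkage method and the Random Forest settings are illustrative, not necessarily what I end up using:

Code:
# TF-IDF -> NMF topics -> hierarchical clustering into macroblocks -> permutation test.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

def macroblock_pipeline(pages, n_topics=4, n_blocks=4, n_perm=200, seed=0):
    """pages: list of strings, one per folio (EVA or one of the control corpora)."""
    X = TfidfVectorizer().fit_transform(pages)                       # TF-IDF per page
    topics = NMF(n_components=n_topics, random_state=seed).fit_transform(X)
    # Hierarchical clustering of pages into macroblocks by topic distribution
    blocks = fcluster(linkage(topics, method="ward"), n_blocks, criterion="maxclust")

    # Permutation test: is macroblock assignment predictable from the topic
    # distribution better than from shuffled macroblock labels?
    clf = RandomForestClassifier(random_state=seed)
    real = cross_val_score(clf, topics, blocks, cv=5).mean()
    rng = np.random.default_rng(seed)
    null = [cross_val_score(clf, topics, rng.permutation(blocks), cv=5).mean()
            for _ in range(n_perm)]
    p_value = (np.sum(np.array(null) >= real) + 1) / (n_perm + 1)
    return real, p_value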
I’m currently testing this, and so far the results seem pretty interesting. This approach should give a methodologically clean result.
If any structures emerge this way, they would be pretty plausible.
If it holds up, it could be a way to validate the existence of an internal structure in the manuscript without relying on lunar angle information or risking circularity.
(Still working on it – if the methodology proves solid, I’ll share more details.)