Urtx13 > 25-04-2025, 09:45 AM
(25-04-2025, 09:09 AM)nablator Wrote: (24-04-2025, 11:50 AM)Urtx13 Wrote: The real breakthrough is realizing that the Voynich text, when processed this way, contains a cyclical rhythm that aligns with that specific structure — something we wouldn’t expect from a meaningless or random text.
Random text (in the case of some kind of non-uniform randomness) still has to be produced somehow, and the workflow, whatever it was, could have resulted in a cyclical rhythm, especially since the work was probably done bifolio by bifolio, with all 4 pages of each bifolio in the same Currier language and by the same scribe. If the "phase" matched the advancement of work on each bifolio, we would get a sequence of 0, 1, 0, 1, ... in the first half of quires and 2, 3, 2, 3, ... in the second half.
dashstofsk > 25-04-2025, 11:02 AM
Urtx13 > 25-04-2025, 11:33 AM
(25-04-2025, 11:02 AM)dashstofsk Wrote: Not everyone here knows Python. Are you able to attach some of your outputs?
RadioFM > 25-04-2025, 12:24 PM
Quote:The actual goal is to test whether lexical entropy and topic distributions (from LDA) correlate with that segmentation.
Quote:“How does a single decision tree perform?”
It performs well — almost too well — because it immediately splits on Lunar_Angle.
Urtx13 > 25-04-2025, 12:36 PM
(25-04-2025, 12:24 PM)RadioFM Wrote: Quote:The actual goal is to test whether lexical entropy and topic distributions (from LDA) correlate with that segmentation.
Okay, then there is no need for Lunar Angle to be a feature; otherwise the prediction is trivial...
Quote:“How does a single decision tree perform?”
It performs well — almost too well — because it immediately splits on Lunar_Angle.
Dude, how do you expect to get faithful metrics of correlations if your metrics are based on a trivial prediction?
1. You assign a random decimal number called 'Lunar Angle' from 0 to 360 to each folio (seed 1405). Different seeds can give completely different results.
2. You then discretize this number into an additional feature, 'Phase', using the following formula: Phase := ⌊LunarAngle / 90⌋
Depending on the seed you've chosen previously, maybe a given folio gets a Lunar Angle of 87.341º and Phase = 0, or maybe your dataset gets no Phase 0 folios at all! Change the seed to 1234 and maybe the whole dataset consists of all 116 folios sharing exactly Lunar Angle 201.12º and thus Phase 2. The segmentation is, as you said, random.
3. You calculate the Entropy and the topic for each folio (these numeric values are NOT random for each folio)
4. You train an AI that, looking at the Entropy, Topics and the Lunar Angle of a folio, predicts a Phase for that folio.
5. The accuracy and "correlation" are then calculated as follows: how did the AI perform with respect to the random segmentation of Step #2?
Any ML model capable of dividing the real number line into 4 intervals will quickly arrive at the solution of using Lunar Angle to predict Phase; you say so yourself: even a simple model like a single decision tree does this, with Lunar Angle as the best predictor of Phase. Accuracy, precision and all other metrics will always be high except for those few seeds where there's a clear imbalance in the dataset. When a random seed in Step #1 segments the folios in such a way that very few folios with Phase 3 are seen by the model, the model of course won't predict Phase 3 very accurately, so the results won't always be within the 89-100% range (but most of the time they will!)
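To see how trivial the prediction is, here is a minimal sketch of the setup described above. The folio count, seed, and stand-in entropy/topic features are assumptions for illustration; only the angle-to-phase construction follows the steps in the post:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1405)          # the seed mentioned in the thread
lunar_angle = rng.uniform(0, 360, 116)     # one random angle per folio (Step 1)
phase = (lunar_angle // 90).astype(int)    # discretize into 4 phases (Step 2)

# Stand-ins for the real per-folio features; random here for illustration
entropy = rng.normal(4.0, 0.3, 116)
topic = rng.integers(0, 4, 116)

X = np.column_stack([entropy, topic, lunar_angle])
tree = DecisionTreeClassifier().fit(X, phase)

# The tree recovers Phase perfectly, because Phase is a deterministic
# function of one of its input features (Lunar Angle)
print(tree.score(X, phase))  # 1.0
```

Since Phase is computed directly from Lunar Angle, any model that can place three thresholds on one feature reaches perfect accuracy; the entropy and topic columns contribute nothing.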
If you really want to measure how Entropy and Topics correlate to the segmentation, Lunar Angle should obviously NOT be a feature in the model. But even then, using only Entropy and Topics as predictors of Phase, IMO your methodology has some flaws:
- The gold standard, the 'true' segmentation/labeling from which you derive your metrics, is random (Step #1). You can change the seed and get a different segmentation altogether. Any metric derived from this will be using a random mapping of folios to Phases as a gold standard. I can change the seed and get arbitrary results, as close to 100% accuracy as I want. How? I look for a seed that, by chance, maps every folio to Phase 1: all machine learning models, even if only using Entropy and Topics as features, will converge to the default pattern of "assign every folio to Phase 1", yielding 100% accuracy.
You could maybe use some unsupervised machine learning to derive metrics from the text itself and not make Phase a random value; otherwise your accuracy/precision/recall and all other metrics will be based on random values taken as ground truth.
- You are implicitly overfitting by performing hyperparameter search without splitting your data three ways. In ML practice, the data is usually split into 3 distinct datasets: Training, Development/Validation, and Test. Why? Because when you split it into only 2 (training and testing) but then do a systematic search for just the right seed or just the right number of LDA topics (in your case, 4 topics being the sweet spot), you end up overfitting your model and building bias into it: you're tuning the model to improve accuracy on the Test dataset by changing hyperparameters, which defeats the purpose of the Test dataset (to provide unbiased data against which to compare model performance).
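The three-way split described in that last point can be sketched like this; the array sizes and split ratios are illustrative assumptions, not values from the thread:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(232, 5))        # e.g. 116 folios x 2 transcriptions
y = rng.integers(0, 4, 232)          # 4 phase labels

# First carve out a held-out Test set, then split the remainder into
# Training and Development/Validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Tune hyperparameters (number of LDA topics, tree depth, ...) against
# X_dev/y_dev only; touch X_test/y_test exactly once, after all tuning
# decisions are frozen.
print(len(X_train), len(X_dev), len(X_test))
```

The point is procedural, not numerical: any metric reported on data that influenced a tuning decision is optimistic, so the Test partition must stay untouched until the very end.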
oshfdk > 25-04-2025, 12:36 PM
(25-04-2025, 12:24 PM)RadioFM Wrote: [...]
Urtx13 > 25-04-2025, 12:44 PM
(25-04-2025, 12:24 PM)RadioFM Wrote: [...]
- You are not accounting for dataset imbalances. Even if your gold standard were not a random number or Phase, given the small number of folios for many machine learning setups (yes, even 116*2 can be small), you should try to balance classes in the training dataset; otherwise many models will tend to overfit by learning patterns typical of the classes with greater representation in the dataset.
- Finding one seed that performs best among others is almost never evidence of a latent pattern; it's evidence that the solution space is fairly irregular and fraught with local optima.
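The seed-search problem in that last bullet is easy to demonstrate with no real data at all: scan enough seeds and one random labeling will always look impressive by chance. All numbers below (folio count, phase count, number of seeds) are illustrative assumptions:

```python
import numpy as np

def majority_baseline_accuracy(seed, n_folios=116, n_phases=4):
    """Accuracy of always predicting the most common Phase
    under the random labeling produced by this seed."""
    rng = np.random.default_rng(seed)
    lunar_angle = rng.uniform(0, 360, n_folios)
    phase = (lunar_angle // 90).astype(int)
    counts = np.bincount(phase, minlength=n_phases)
    return counts.max() / n_folios

# Scan many seeds and keep the one whose random labeling is most lopsided.
accs = {seed: majority_baseline_accuracy(seed) for seed in range(500)}
best_seed = max(accs, key=accs.get)

# Cherry-picking the best of 500 seeds beats the expected 25% every time,
# even though no feature carries any signal whatsoever.
print(best_seed, round(accs[best_seed], 3))
```

Reporting only the best seed's score is a multiple-comparisons problem: the "latent pattern" is just the upper tail of a distribution you sampled 500 times.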
oshfdk > 25-04-2025, 12:44 PM
(25-04-2025, 12:36 PM)Urtx13 Wrote: The point is: why do entropy and topic structures — independently derived — line up so well with that segmentation?
Urtx13 > 25-04-2025, 12:56 PM
(25-04-2025, 12:44 PM)oshfdk Wrote: (25-04-2025, 12:36 PM)Urtx13 Wrote: The point is: why do entropy and topic structures — independently derived — line up so well with that segmentation?
If my understanding of the high-level description is accurate, I don't think this is strange or unexpected. The model probably just learns the mapping from the specific token distributions/entropies of folios to the lunar angle/phase.
oshfdk > 25-04-2025, 01:09 PM
(25-04-2025, 12:56 PM)Urtx13 Wrote: I can’t continue this discussion until you understand how this works.