quimqu > 8 hours ago
Notebook | Text Scope | Validation Loss | Perplexity |
---|---|---|---|
Voynich_char_tokenizer | Full manuscript | 1.2166 | 3.38 |
Biological_Voynich_char_tokenizer | Only biological section | 1.2845 | 3.61 |
Herbal_Voynich_char_tokenizer | Only herbal section | 1.5337 | 4.64 |
Herbal_and_pharmaceutical_Voynich_char_tokenizer | Herbal + pharmaceutical | 1.5337 | 4.64 |
Dataset | Perplexity |
---|---|
Full Voynich | 3.38 |
Biological | 3.61 |
Herbal | 4.64 |
Herbal + Pharmaceutical | 4.64 |
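The perplexity column is consistent with the usual definition of perplexity as the exponential of the mean cross-entropy loss in nats (the default PyTorch convention); a minimal sketch, assuming that is how these notebooks report it:

```python
import math

# Perplexity = exp(validation loss), assuming the loss is the mean
# negative log-likelihood per character in nats.
val_losses = {
    "Full Voynich": 1.2166,
    "Biological": 1.2845,
    "Herbal": 4 and 1.5337,
    "Herbal + Pharmaceutical": 1.5337,
}

for name, loss in val_losses.items():
    print(f"{name}: perplexity = {math.exp(loss):.2f}")
```

Running this reproduces the table's values (3.38, 3.61, 4.64, 4.64), which supports the exp(loss) reading.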
(7 hours ago) oshfdk Wrote:
Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit, simple rules that are easy to analyze and compare, as opposed to black-box GPT models.
Without some metric showing that a GPT-based approach identifies structures beyond those already identified with previous methods, it's hard for me to see whether the GPT-based approach is of any use at all.
Word | PrefixOK | SuffixOK | EndOK | NoBadDbl | AllOK |
---|---|---|---|---|---|
ROPAJ | False | False | True | True | False |
OFAEZE | False | False | True | True | False |
AEOR | False | False | True | True | False |
EZCC89R | False | False | True | True | False |
4OESCC9R | True | False | True | True | False |
ESCO8 | False | False | True | True | False |
4CFAR | False | False | True | True | False |
8AROE | False | False | True | True | False |
OEFCCC89 | False | True | True | True | False |
FAEOE9 | False | True | True | True | False |
POEZC89 | False | True | True | True | False |
EFS9 | False | True | True | True | False |
OZCC9 | False | True | True | True | False |
AEFM | False | False | True | True | False |
2OEZCC9 | False | True | True | True | False |
OEFAROR | False | False | True | True | False |
2OEZCC89 | False | True | True | True | False |
E8AN | False | True | True | True | False |
Z2AE | True | False | True | True | False |
AEAR | False | False | True | True | False |
8EAM | False | True | True | True | False |
RSCC89 | False | True | True | True | False |
8AEZC9 | False | True | True | True | False |
2AROE | False | False | True | True | False |
EOEZC9 | False | True | True | True | False |
BOEFAN | False | True | True | True | False |
EOEFCC89 | False | True | True | True | False |
4OFOEOE | True | False | True | True | False |
4OFCCOE | True | False | True | True | False |
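The flag columns above can be read as independent boolean checks per generated word. The actual prefix/suffix/ending/doubling inventories are not given in the thread, so the sets below are placeholders only; this is a hedged sketch of what such a checker might look like, not the real grammar:

```python
# Hypothetical sketch of the per-word checks behind the table above.
# PREFIXES, SUFFIXES, LEGAL_ENDS, and BAD_DOUBLES are assumptions,
# not the actual slot-grammar inventories used in the notebooks.
def check_word(word, prefixes, suffixes, legal_ends, bad_doubles):
    prefix_ok = any(word.startswith(p) for p in prefixes)
    suffix_ok = any(word.endswith(s) for s in suffixes)
    end_ok = word[-1] in legal_ends
    no_bad_dbl = not any(b in word for b in bad_doubles)
    return {
        "PrefixOK": prefix_ok,
        "SuffixOK": suffix_ok,
        "EndOK": end_ok,
        "NoBadDbl": no_bad_dbl,
        "AllOK": prefix_ok and suffix_ok and end_ok and no_bad_dbl,
    }

# Placeholder inventories for illustration:
PREFIXES = {"4O"}
SUFFIXES = {"9", "89"}
LEGAL_ENDS = set("9RMNE8")
BAD_DOUBLES = {"ZZ", "88"}

print(check_word("4OESCC9R", PREFIXES, SUFFIXES, LEGAL_ENDS, BAD_DOUBLES))
```

Under these placeholder sets the word "4OESCC9R" happens to come out as in the table (prefix passes, suffix fails, so AllOK is False), but the sets were chosen for illustration, not inferred as the true rules.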
oshfdk > 8 hours ago
quimqu > 8 hours ago
oshfdk > 7 hours ago
quimqu > 7 hours ago
quimqu > 5 hours ago
Mauro > 4 hours ago
quimqu > 3 hours ago
(4 hours ago) Mauro Wrote:
Interesting work. I agree with you (and many others) that the structure of Voynich word types is rather regular, with some underlying 'grammar', and I think the method you used is surely interesting.
I had my share of fun working on grammars (actual slot grammars, in my case). I don't have the time at hand right now to check your GitHub repository and see what I can understand; I confess I did not understand much of how your method works and what it actually does, but I'll try tomorrow.
One question: can you apply your method in 'reverse'? I mean: use your GPT model to decompose the original Voynich word types into parts ('tokens'), such as "qo", "y", "dy", "aiin" and so on, for each word (some of which may turn out not to conform to the grammar). In this case it would be possible to compare the results across different approaches (I defined a metric which ranks different grammars according to how many bits of information are needed to represent the text using optimal encoding on the words divided into 'tokens').
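Mauro's 'reverse' idea could be sketched as a two-step procedure: segment each word into tokens, then score the segmentation by the bits an optimal (entropy) code would need for the token stream. The greedy longest-match segmenter and the tiny token set below are placeholders, not quimqu's model or Mauro's actual metric:

```python
import math
from collections import Counter

def segment(word, tokens):
    """Greedy longest-match decomposition of a word into tokens;
    falls back to single characters when nothing matches (a crude
    stand-in for a model-derived segmentation)."""
    out, i = [], 0
    by_len = sorted(tokens, key=len, reverse=True)
    while i < len(word):
        for t in by_len:
            if word.startswith(t, i):
                out.append(t)
                i += len(t)
                break
        else:
            out.append(word[i])  # unmodelled character
            i += 1
    return out

def encoding_bits(words, tokens):
    """Total bits for the corpus under an optimal (entropy) code on
    the token stream -- the kind of grammar-ranking metric Mauro
    describes; lower is better for a fixed corpus."""
    stream = [t for w in words for t in segment(w, tokens)]
    counts = Counter(stream)
    total = sum(counts.values())
    return sum(-c * math.log2(c / total) for c in counts.values())

# Example with the token shapes mentioned in the question:
print(segment("qodyaiin", {"qo", "y", "dy", "aiin"}))
```

Two competing grammars (token inventories) can then be compared by `encoding_bits` on the same word list.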
davidma > 3 hours ago
(5 hours ago) quimqu Wrote:
The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system, whether artificial or natural.
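The "percentage of conformity" referred to here reduces to the share of generated words whose checks all pass (the AllOK column earlier in the thread). A minimal sketch, with a hypothetical sample rather than quimqu's actual numbers:

```python
def conformity_rate(all_ok_flags):
    """Percentage of generated words passing every grammar check.
    all_ok_flags: one boolean per generated word."""
    if not all_ok_flags:
        return 0.0
    return 100.0 * sum(all_ok_flags) / len(all_ok_flags)

# Hypothetical example, not the reported results:
sample = [True, True, False, True]
print(f"{conformity_rate(sample):.1f}% conform")  # prints "75.0% conform"
```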