Month names collection / metastudy
Posted by: Koen G - Yesterday, 07:11 PM - Forum: Marginalia - Replies (8)
I split this from the Aberil thread. Many people have found interesting month series over the years that match the VM Zodiac inscriptions in various ways. Since these are fragmented, shared in different formats and on old websites and blogs, it might be interesting to collect them all here in a unified way.
Please post/link any old or new examples in this thread. When I have time, I will add them to the spreadsheet.
The color codes are:
Green: complete match or spelling variation of the same word (letters like v-u and i-j were interchangeable).
Yellow: has one or more salient features of the VM version.
Red: far off.
It is not Chinese
Posted by: dashstofsk - 09-06-2025, 08:53 AM - Forum: Voynich Talk - Replies (65)
A number of people have suggested that the VMS might be in some Chinese language.
I think it is unlikely that this is so, but I am still curious to know what people might think.
The manuscript was written by at least three people, possibly five. And who were they writing for? For themselves, or for other Chinese speakers? Could there really have been that many of them in the whole of Europe, at the time of the manuscript, when trade routes were not that well established, to justify writing in that language? And why invent a new alphabet when they could simply have written in Chinese script, which hardly any European would have been able to read?
There is also a problem with any language that uses tones for meaning, as highlighted nicely in the linked discussion.
Wherefore art thou, aberil?
Posted by: R. Sale - 08-06-2025, 08:05 PM - Forum: Imagery - Replies (45)
The question is two-fold. Where is the word 'aberil' found and wherefore, for what reason, was it used to name the month of April, twice in the VMs Zodiac sequence?
Where is it found? Nothing relevant on Google.
'Aberil' is apparently one of several variant words that are found in various languages. April is usually found on a calendar and the "ebooks" reference contains a number of liturgical calendars, which can be sorted by language groups. In the German group, the overwhelming preference is for "Aprilis" [Latin] or an abbreviation. In the French language group, the preference is for "Avril".
The only other viable alternative so far is the Germanic-group use of "Abrell" in 1540 Appenzell.
In a ninja search, back in 2019, Anton posted a reference that connects "Aberil" with the Swiss canton of Glarus - with no further info.
Is there more on this?
What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers
Posted by: quimqu - 08-06-2025, 01:07 PM - Forum: Analysis of the text - Replies (15)
This work approaches the Voynich manuscript from a fresh angle—by applying small-scale character-level GPT models to its transliterated text. Instead of attempting direct decipherment or semantic interpretation, the focus is on investigating the internal structure and statistical patterns embedded in the glyph sequences.
By training models on different sections of the manuscript using a simplified and consistent transliteration system, this study probes how well a transformer-based language model can learn and predict the character sequences. The results provide compelling evidence that the text is far from random, showing meaningful structural regularities that a machine can capture with relatively low uncertainty.
This computational perspective offers a complementary lens to traditional Voynich research, suggesting that the manuscript’s mysterious text may follow underlying syntactic or generative rules—even if their semantic content remains unknown. It is an invitation to consider the manuscript as a linguistic system in its own right, accessible to modern machine learning tools, and to explore new paths for understanding its secrets.
Objective
The aim of this project is to explore the internal structure of the Voynich manuscript by training a small GPT model on its transliterated text. Using the Currier transliteration, which offers a simplified and consistent representation of Voynichese glyphs, the goal is to test how well a transformer model can learn and predict character sequences within this mysterious and undeciphered corpus.
Methodology
I trained four different character-level GPT models (≈0.8M parameters), each on a different subset of the manuscript:
Notebook | Text Scope | Validation Loss | Perplexity |
Voynich_char_tokenizer | Full manuscript | 1.2166 | 3.38 |
Biological_Voynich_char_tokenizer | Only biological section | 1.2845 | 3.61 |
Herbal_Voynich_char_tokenizer | Only herbal section | 1.5337 | 4.64 |
Herbal_and_pharmaceutical_Voynich_char_tokenizer | Herbal + pharmaceutical | 1.5337 | 4.64 |
Each dataset was carefully filtered to remove uncertain tokens (?), header lines, and other non-linguistic symbols. Paragraphs were reconstructed using markers from the transcription file.
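For anyone who wants to reproduce the cleaning step, here is a minimal sketch of that kind of filtering. The file name, the "<" header convention, and the function name are illustrative assumptions, not necessarily what the notebooks do:

def load_clean_words(path):
    # Read a transliteration file, drop header/locus lines, and drop any word
    # containing an uncertainty marker '?'; keep the rest as training text.
    cleaned_lines = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("<"):   # skip empty lines and header lines
                continue
            words = [w for w in line.split(".") if w and "?" not in w]
            if words:
                cleaned_lines.append(".".join(words))
    return "\n".join(cleaned_lines)

# e.g. corpus = load_clean_words("voynich_currier.txt")   # placeholder file name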
Why character-level tokenization?
Early attempts at word-level tokenization (based on dot-separated EVA words) yielded poor results, primarily due to:
- A large vocabulary size (~15,000+ unique tokens).
- Very sparse and repetitive training data per token.
- Increased perplexity and unstable loss curves.
In contrast, character-level models:
- Have a much smaller and denser vocabulary.
- Perform well with limited data.
- Naturally capture the morphological regularities of Voynichese.
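To make the contrast concrete, a character-level tokenizer for this kind of text can be as simple as the sketch below; the toy corpus is just a few EVA words for illustration, not the training data:

# Toy example: character-level vocabulary and encoding for dot-separated EVA text
corpus = "daiin.ol.chedy\nqokeedy.shedy.ol"
text = corpus.replace(".", " ")               # treat word separators as spaces

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(len(chars))                 # a handful of symbols instead of ~15,000 word types
print(decode(encode("daiin")))    # 'daiin'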
Perplexity Results & Interpretation
Perplexity measures how well a model predicts the next token — lower values mean better predictability.
Dataset | Perplexity |
Full Voynich | 3.38 |
Biological | 3.61 |
Herbal | 4.64 |
Herbal + Pharmaceutical | 4.64 |
The relatively low perplexity values (3–4.6) show that the model can learn strong internal structure from the Voynich text, particularly in the full corpus and biological section. These numbers are comparable to what we observe in natural languages at the character level, and far from what we would expect from purely random or meaningless sequences.
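For reference, these perplexities are simply the exponentials of the validation losses reported in the methodology table, which is a quick consistency check anyone can reproduce:

import math

# validation losses reported above -> perplexity = exp(loss)
val_losses = {"Full manuscript": 1.2166, "Biological": 1.2845, "Herbal": 1.5337}
for name, loss in val_losses.items():
    print(f"{name}: {math.exp(loss):.2f}")   # 3.38, 3.61, 4.64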
Why this matters
These results support the long-standing hypothesis — prominently discussed by René Zandbergen and others — that the Voynich manuscript, while undeciphered, exhibits non-random, rule-governed linguistic patterns.
Even though the GPT model has no access to semantics, its ability to predict Voynichese characters with such low uncertainty suggests that the manuscript likely follows an underlying syntax or generation process — artificial or natural.
In essence, the model behaves like a human listener hearing a foreign language repeatedly: it can’t understand the meaning, but learns to anticipate the next syllables based on structure.
Future Work
This approach opens up further directions:
- Train section-specific models (e.g., cosmological, recipes).
- Cluster generated tokens morphologically.
- Compare synthetic Voynichese to natural languages.
- Test statistical properties against controlled glossolalia or cipher texts.
To facilitate further exploration and replication, the Jupyter notebooks used in this study are available in my GitHub repository.
Feel free to download, review, and experiment with the code and data. Your feedback and insights are very welcome!
(08-06-2025, 01:44 PM)oshfdk Wrote: Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit simple rules that are easy to analyze and compare, as opposed to black-box GPT models.
Without some metric showing that a GPT-based approach identifies structures beyond those already identified with previous methods, it's hard for me to see whether the GPT-based approach is of any use at all.
I ran a validation test on a recent output of 891 generated words using a simplified slot grammar system adapted to EVA transliteration conventions.
Basic slot grammar rules:
valid_prefixes = ("qo", "ch", "sh")
valid_suffixes = ("y", "dy", "aiin", "ain", "daiin")
invalid_double_letters = ("tt", "pp", "ff")
invalid_final_glifs = ("q", "e", "ch")
The same rules adapted to the Currier-style transcription glyphs:
valid_prefixes = ("4O", "S", "Z")
valid_suffixes = ("9", "89", "AM", "AN", "8AM")
invalid_final_glifs = ("4", "C", "S")
invalid_double_letters = ("PP", "BB", "FF")
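For concreteness, here is a minimal sketch of how such a check can be run over the generated words. The rule tuples are copied from above; the helper functions and the tiny sample list are illustrative, not the exact validation code:

valid_prefixes = ("4O", "S", "Z")
valid_suffixes = ("9", "89", "AM", "AN", "8AM")
invalid_final_glifs = ("4", "C", "S")
invalid_double_letters = ("PP", "BB", "FF")

def check_word(word):
    # One boolean per slot-grammar condition, matching the table columns below
    return {
        "Prefix":   word.startswith(valid_prefixes),
        "Suffix":   word.endswith(valid_suffixes),
        "EndOK":    not word.endswith(invalid_final_glifs),
        "NoBadDbl": not any(d in word for d in invalid_double_letters),
    }

def all_ok(word):
    return all(check_word(word).values())

sample = ["SCCX9", "4OFAEO9", "ROPAJ", "4OESCC9R"]   # a few generated words
conforming = sum(all_ok(w) for w in sample)
print(conforming / len(sample))                      # fraction matching all rules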
Summary of results:
- ✅ 96.74% of the generated words match the slot grammar rules
- ✅ 95.5% of the total set correspond to real EVA words from the corpus
- ⚠️ 100% of the words have valid final glyphs and no invalid double letters (so these two conditions have been learned completely)
Below is a list of the invented words that fully satisfy the slot grammar restrictions:
Correct invented words: ['SCCX9', 'ZCCC9', '4OFAECC89', '4OFAEZC89', '4OFAEFCCC9', '4OFAEO9', 'SCCC89', '4OFAEZC89', 'SEFAN', '4ORAR9', '4OFCC889']
Below is a table of the invented words that do not fully satisfy the slot grammar restrictions, and why:
Word | Prefix | Suffix | EndOK | NoBadDbl | ✅ AllOK |
ROPAJ | False | False | True | True | False |
OFAEZE | False | False | True | True | False |
AEOR | False | False | True | True | False |
EZCC89R | False | False | True | True | False |
4OESCC9R | True | False | True | True | False |
ESCO8 | False | False | True | True | False |
4CFAR | False | False | True | True | False |
8AROE | False | False | True | True | False |
OEFCCC89 | False | True | True | True | False |
FAEOE9 | False | True | True | True | False |
POEZC89 | False | True | True | True | False |
EFS9 | False | True | True | True | False |
OZCC9 | False | True | True | True | False |
AEFM | False | False | True | True | False |
2OEZCC9 | False | True | True | True | False |
OEFAROR | False | False | True | True | False |
2OEZCC89 | False | True | True | True | False |
E8AN | False | True | True | True | False |
Z2AE | True | False | True | True | False |
AEAR | False | False | True | True | False |
8EAM | False | True | True | True | False |
RSCC89 | False | True | True | True | False |
8AEZC9 | False | True | True | True | False |
2AROE | False | False | True | True | False |
EOEZC9 | False | True | True | True | False |
BOEFAN | False | True | True | True | False |
EOEFCC89 | False | True | True | True | False |
4OFOEOE | True | False | True | True | False |
4OFCCOE | True | False | True | True | False |
The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system—whether artificial or natural.
Update 06/09/2025: I have updated the post with a heatmap of the loss per folio according to my trained GPT. This gives an insight into which folia the model finds strangest.
(08-06-2025, 05:44 PM)davidma Wrote: (08-06-2025, 03:39 PM)quimqu Wrote: The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system—whether artificial or natural.
Could you in theory test for non-conforming words already in the VM? Or, I guess, measure how much they break the internal rules? I wonder if it could be interesting to see where these pop up in the VM, whether they are predominant in "labelese" or maybe in certain sections, or with certain scribes. To my knowledge this hasn't been done yet, but I am probably wrong.
Hi!
Thanks for the thoughtful suggestion — I think you're absolutely right that identifying non-conforming words and tracking their locations in the manuscript could reveal meaningful patterns.
What I’ve done so far is train a character-level GPT model on Voynichese text (using the Currier transliteration). Then I used this model to estimate the average loss for each word in the manuscript, essentially measuring how well the model “understands” each word given its context.
I attach here a PNG file of the resulting heatmap of loss, so you can easily see which folios are the strangest according to the GPT. Surprisingly, the last folios have the lowest loss.
[Image: heatmap of average loss per folio - https://i.imgur.com/jKk1ZAA.png]
From this, I have been able to:
- Compute the average loss per folio, showing how predictable the text is in different sections.
- Visualize this data as a heatmap, coloring folios by their average word loss.
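In case it helps anyone replicate this step, the aggregation itself is straightforward once each word has a loss value. A rough sketch follows; the records structure and the helper names are my own assumptions, not the actual notebook code:

from collections import defaultdict
import statistics

# records: (folio_id, word, loss) triples produced by scoring each word with the GPT
def average_loss_per_folio(records):
    per_folio = defaultdict(list)
    for folio, _word, loss in records:
        per_folio[folio].append(loss)
    # higher average loss = the folio looks stranger to the model
    return {folio: statistics.mean(losses) for folio, losses in per_folio.items()}

def top_anomalous_words(records, n=30):
    # rank individual words by loss, as in the table further down this post
    return sorted(records, key=lambda r: r[2], reverse=True)[:n]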
This framework opens up possibilities exactly along the lines you suggested:
- Highlighting words or regions where the model struggles most.
- Investigating whether these “high-loss” zones correlate with labelese, specific sections, or particular scribes (if metadata is available).
- Zooming in on the individual words the model finds most anomalous, and seeing their frequency and distribution.
I haven’t done the full detailed analysis yet, but the infrastructure is ready and the heatmap helps guide further exploration.
Below is a table showing the top 30 most anomalous words in the Currier transcription, ranked by loss (i.e. the words the model found hardest to predict):
Top 30 most anomalous words (by loss):
Word | Loss | Freq | Len |
RH | 10.1558 | 1 | 2 |
J | 9.8573 | 3 | 1 |
E | 9.8573 | 15 | 1 |
6 | 9.8573 | 15 | 1 |
B | 9.8573 | 2 | 1 |
Q | 9.8573 | 2 | 1 |
F | 9.8573 | 4 | 1 |
9 | 9.8573 | 53 | 1 |
3 | 9.8573 | 1 | 1 |
R | 9.8573 | 35 | 1 |
4 | 9.8573 | 4 | 1 |
8 | 9.8573 | 32 | 1 |
Z | 9.8573 | 9 | 1 |
A | 9.8573 | 1 | 1 |
2 | 9.8573 | 121 | 1 |
C | 9.8573 | 1 | 1 |
D | 9.8573 | 3 | 1 |
O | 9.8573 | 18 | 1 |
9J | 9.4344 | 1 | 2 |
FT | 8.8774 | 1 | 2 |
FU | 8.5493 | 1 | 2 |
OT | 8.3373 | 1 | 2 |
8DE | 8.2584 | 1 | 3 |
OU | 8.0167 | 1 | 2 |
O3 | 7.5268 | 3 | 2 |
9EJE | 7.5207 | 1 | 4 |
P3 | 7.4110 | 1 | 2 |
8R | 7.3208 | 1 | 2 |
ON | 7.2464 | 2 | 2 |
QE | 7.0748 | 1 | 2 |
These words often consist of very short tokens, sometimes single characters or rare combinations, which the model struggles to predict confidently. Investigating where these words cluster — whether in labels, particular manuscript sections, or scribes — could provide insights into the structure or anomalies within the text.
To my knowledge, a comprehensive analysis of “non-conforming” words in the Voynich Manuscript has not yet been performed at this level of detail, so this approach offers a promising direction for further research.
If you or anyone else is interested, I’d be happy to collaborate or share the tools I’ve developed so far.
ol most likely translates to "Io" using Latin
Posted by: voynichrose - 07-06-2025, 10:46 PM - Forum: Analysis of the text - Replies (2)
ol is the most common two-letter vord in the Voynich Manuscript. Since "Io" is a name from myth, it would not rank high in a frequency list. If you run a frequency analysis, o is the most common letter in the Voynich Manuscript, while the letter i is the most common in Latin. The letter "o" is the 8th most common letter in Latin, and l is the 7th most common letter in the Voynich. So I humbly submit that ol could be the vord for "Io" in Latin.
Io was the love interest of Zeus. Could the myth of Io be in the pages of the Voynich Manuscript?