The Voynich Ninja

Full Version: Confirmation that spaces are really spaces
I believe I can now show, with objective data, that the spaces in the Voynich text are real, functional spaces.

For a long time I have seen people on the forum questioning whether the spaces in the Voynich are real or artificial, and whether they have any specific role. So I prepared a small experiment to see how coherent the spaces really are and whether they might have a clear function within the manuscript’s writing system.

What I did was apply a Byte Pair Encoding (BPE) segmentation model. This model does not need to know anything about the language. It simply takes the entire text as a sequence of characters with no spaces or punctuation, identifies which pairs of characters are most frequent, and merges them. After a certain number of merges, the model produces a set of segments that can be interpreted as statistical morphemes. With these segments, you can try to reconstruct the original words by segmenting them into the most likely units, and then compare whether these boundaries match the original spaces.
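As an illustration of the idea (a minimal sketch, not the poster's actual code), here is a from-scratch BPE trainer; the input string and merge count are toy values:

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn BPE merges from a continuous character stream (no spaces)."""
    seq = list(text)          # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair wins
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):                  # merge every occurrence
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, sorted(set(seq))          # merge rules + segment inventory

merges, segments = train_bpe("daiinolcheydaiinolchedy", 5)
```

After enough merges, the surviving unique segments play the role of the "statistical morphemes" described above.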

Before applying the method to the Voynich, I validated it with Latin. I took a Latin corpus, removed the spaces and punctuation, and trained the BPE model on the continuous sequence of characters. Once I had the set of segments, I tried to segment the original Latin words and counted how many word beginnings and endings were correctly recovered. With a simple model, the results are roughly:
  • About 82 percent correct word boundary detection.
  • About 67 percent of words completely correct (the predicted start and end boundaries match the real ones).
  • An average of about 3 segments per word.
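The boundary metrics above can be computed by comparing the character offsets of the true spaces with the offsets implied by the predicted segmentation; a minimal sketch (function names and the toy Latin example are mine):

```python
def boundary_positions(units):
    """Character offsets at which the given units end in the joined text."""
    cuts, pos = set(), 0
    for u in units:
        pos += len(u)
        cuts.add(pos)
    return cuts

def boundary_recall(true_words, predicted_segments):
    """Share of true word boundaries that the prediction also places."""
    true_cuts = boundary_positions(true_words)
    pred_cuts = boundary_positions(predicted_segments)
    return len(true_cuts & pred_cuts) / len(true_cuts)

# The prediction places an extra cut inside "virumque",
# but every true boundary is still recovered:
print(boundary_recall(["arma", "virumque", "cano"],
                      ["arma", "virum", "que", "cano"]))  # → 1.0
```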

These results show that even a very simple and completely blind method can reconstruct a large share of real word boundaries.

After that, I applied exactly the same process to the Voynich. First I concatenated all the text from paragraphs and long lines (avoiding labels), removed the spaces, and trained the BPE model on the continuous sequence using the same parameters as for Latin. The model produced a fairly compact set of about 265 unique segments. Then I returned to the original Voynich words and segmented them using the model’s dictionary.

The results are surprisingly high. Out of roughly 37 thousand words:
  • The model correctly recovers around 90 percent of word beginnings and endings.
  • About 8 out of 10 words have both boundaries correct.
  • The average is fewer than 2 segments per word, which suggests a fairly stable structure.

This suggests that the spaces in the Voynich are not random. The model, which only sees character statistics and knows nothing about meaning or how words are constructed, is still able to predict the same boundaries that the writer marked with spaces. If the spaces were decorative or had no function, the model should not be able to recover them with this level of accuracy.

It appears that the spaces in the Voynich show strong internal coherence, comparable to or even higher than that of real Latin text subjected to the same procedure. The spaces seem to mark meaningful units in the system, not arbitrary additions.


As an extra, here is the list of the statistical morphemes (segments) found:

Total number of segments: 67931
Number of unique segments: 265

Top 50 most frequent segments:
'ol' 2121
'ch' 1868
'or' 1599
'y' 1513
'ar' 1453
'aiin' 1444
'd' 1371
'che' 1361
's' 1301
't' 1233
'daiin' 1230
'al' 1201
'l' 1153
'o' 1118
'qot' 1026
'k' 1006
'sh' 1002
'chedy' 976
'e' 966
'ot' 948
'chy' 887
'she' 869
'qok' 860
'ain' 823
'ok' 814
'chey' 808
'eedy' 796
'ey' 791
'yk' 784
'dy' 779
'chol' 740
'olk' 737
'p' 736
'dar' 679
'edy' 668
'ody' 650
'yt' 639
'eey' 628
'dal' 628
'am' 611
'cth' 549
'ee' 546
'r' 542
'cho' 539
'chor' 531
'od' 520
'qo' 504
'chdy' 481
'shedy' 454
'os' 445
Out of curiosity, if you test a very simple rule that q always starts a word, n always ends a word and y ends a word unless this will create a single character word (.y.), what proportion of spaces would be correctly recovered? Just looking at the text it appears as if more than 50% of spaces can be explained by certain characters requiring a space before them, certain characters a space after them or, in the case of y, either before or after.
(14-11-2025, 11:54 PM)oshfdk Wrote: Out of curiosity, if you test a very simple rule that q always starts a word, n always ends a word and y ends a word unless this will create a single character word (.y.), what proportion of spaces would be correctly recovered?

Well, this cannot be answered with the BPE model itself, but I reused the final testing and checking logic. I checked the rule against all the words that contain q, n, or y (excluding the single-character word y), and the results are:

Results for q/n/y rule, restricted to words containing q, n or y (length > 1):
Number of selected words: 23577
Total word boundaries (start+end) in this subset: 47154
Correct word starts: 15341 (65.068 percent)
Correct word ends:  20939 (88.811 percent)
Overall boundary recall: 76.939 percent
Words with both boundaries correct: 13196 (55.970 percent)
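For reference, the q/n/y rule itself is easy to state in code. This sketch predicts boundary offsets in a spaceless stream (my own formulation of the rule as described in the question, not the checker that produced the numbers above):

```python
def rule_boundaries(chars):
    """Predict word-boundary offsets: break before 'q', after 'n',
    and after 'y' unless that would leave 'y' as a one-character word."""
    cuts = set()
    for i, c in enumerate(chars):
        if c == "q" and i > 0:
            cuts.add(i)                       # break before q
        elif c == "n" and i + 1 < len(chars):
            cuts.add(i + 1)                   # break after n
        elif c == "y" and i + 1 < len(chars):
            if i > 0 and i not in cuts:       # avoid a lone '.y.' word
                cuts.add(i + 1)
    return cuts

# "daiinqokeedyqokain" → cuts after 'daiin' and after 'qokeedy'
print(sorted(rule_boundaries("daiinqokeedyqokain")))  # → [5, 12]
```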
Which spaces were tested (from what transliteration?) Does this include "unsure spaces" and can they be more "sure" by the model?

I did something a while back just looking at which letters usually follow which other letters, if we were to assign them to only 4 groups. I found that there was no way I could traverse spaces and keep the system even remotely as successful; this wasn't really a problem with creating longer "words" but rather a problem concentrated at the letter-to-letter transitions across spaces, so I'm not surprised to hear these results. I am interested to hear others' thoughts though, as I haven't dabbled in what you have here before, so I have no idea what the potential pitfalls are.
(15-11-2025, 12:28 AM)Bluetoes101 Wrote: Which spaces were tested (from what transliteration?) Does this include "unsure spaces" and can they be more "sure" by the model?

Hi, I used the EVA transliteration and treated ',' as a space. Tomorrow I will rerun without counting the commas as spaces and see what we get.
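In EVA transcriptions '.' marks a certain word space and ',' an uncertain one, so the two readings differ only in a one-line preprocessing choice. A sketch of both options (my helper, not the actual pipeline):

```python
def eva_words(line, comma_is_space=True):
    """Split an EVA line into words; '.' is a certain space and ','
    an uncertain one that is either treated as a space or dropped."""
    line = line.replace(",", "." if comma_is_space else "")
    return [w for w in line.split(".") if w]

print(eva_words("qokeedy,qokain.daiin"))         # → ['qokeedy', 'qokain', 'daiin']
print(eva_words("qokeedy,qokain.daiin", False))  # → ['qokeedyqokain', 'daiin']
```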
Voynichese morphology is somewhat predictable, word boundaries are easy to pick up, like qok, e*dy. I don't think you can take models learning those patterns as evidence that whitespaces are "real functional spaces" or that they serve an underlying purpose in the language.
Would you like to test your model on Thai? It is written mostly without word spaces.
The model would just need to be able to read Unicode (I expect the file would be UTF-8). I would be interested to see how well it does.
(14-11-2025, 11:41 PM)quimqu Wrote: This suggests that the spaces in the Voynich are not random.

Hi quimqu, I don't remember ever seeing anyone suggest that Voynich spaces are random.
The problem is the opposite, as pointed out by Patrick:

Quote:Do spaces ever convey any information that could be required for interpreting the text, or are they wholly predictable?

Patrick also found that pairs that are hard to predict (i.e. that can appear both inside words and across word breaks) also have a higher chance of being separated by an uncertain space, e.g. r_a: 631 apart, 573 together, 248 (17%) ambiguous.


Quote:it’s likely no coincidence that the glyph pairs with the highest incidences of unexpected behavior also tend to have relatively high proportions of ambiguous word breaks represented in the Zandbergen transcription, i.e., what I call “comma breaks,”
(14-11-2025, 11:41 PM)quimqu Wrote: If the spaces were decorative or had no function, the model should not be able to recover them with this level of accuracy.

On the contrary, the more (algorithmically) predictable they are, the less information they can carry. 100% predictable = 100% useless.
(14-11-2025, 11:41 PM)quimqu Wrote: This suggests that the spaces in the Voynich are not random.

You are right, random is not the word.