14-11-2025, 11:41 PM
I believe I can now confirm with objective data that the spaces in the Voynich text are real, functional spaces.
For a long time I have seen people on the forum questioning whether the spaces in the Voynich are real or artificial, and whether they have any specific role. So I prepared a small experiment to see how coherent the spaces really are and whether they might have a clear function within the manuscript’s writing system.
What I did was apply a Byte Pair Encoding (BPE) segmentation model. The model needs to know nothing about the language: it takes the entire text as a sequence of characters with no spaces or punctuation, finds the most frequent pair of adjacent symbols, merges it into a new symbol, and repeats this until a set number of merges has been done. The resulting segments can be read as statistical morphemes. With them, you can segment the original words into their most likely units and then check whether the boundaries you get match the original spaces.
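To make the procedure concrete, here is a minimal sketch of this kind of blind BPE training in Python. It is an illustration, not my exact script; in particular the merge count (num_merges) is a placeholder, since all I can say about the real run is that it ended with about 265 unique segments.

```python
from collections import Counter

def train_bpe(text, num_merges=300):
    """Learn BPE merges from one continuous character stream (no spaces)."""
    seq = list(text)                          # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))    # frequency of adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair wins this round
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):                   # merge every occurrence of that pair
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, sorted(set(seq))           # merge rules + final segment inventory
```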
Before applying the method to the Voynich, I validated it with Latin. I took a Latin corpus, removed the spaces and punctuation, and trained the BPE model on the continuous sequence of characters. Once I had the set of segments, I tried to segment the original Latin words and counted how many word beginnings and endings were correctly recovered. With a simple model, the results are roughly:
- About 82 percent correct word boundary detection.
- About 67 percent of words completely correct (the predicted start and end boundaries match the real ones).
- An average of about 3 segments per word.
These results show that even a very simple and completely blind method can reconstruct a large share of real word boundaries.
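For clarity, this is how the boundary scoring works: concatenate the original words, concatenate the predicted segments, and compare the character offsets at which each unit ends. A sketch of that scoring (the function names are mine, for illustration only):

```python
def boundary_offsets(units):
    """Set of cumulative end offsets of each unit in the concatenated stream."""
    offsets, pos = set(), 0
    for u in units:
        pos += len(u)
        offsets.add(pos)
    return offsets

def score(words, segments):
    """Boundary recall, plus the share of words whose start AND end are recovered."""
    true_b = boundary_offsets(words)          # boundaries implied by the spaces
    pred_b = boundary_offsets(segments)       # boundaries implied by the BPE units
    recall = len(true_b & pred_b) / len(true_b)
    whole, pos = 0, 0
    for w in words:                           # a word counts as fully correct if
        start, pos = pos, pos + len(w)        # both its start and its end offsets
        if (start == 0 or start in pred_b) and pos in pred_b:
            whole += 1                        # appear among the segment boundaries
    return recall, whole / len(words)

words = ["arma", "virumque", "cano"]
segs  = ["ar", "ma", "virum", "que", "ca", "no"]
print(score(words, segs))   # (1.0, 1.0): every boundary and every word recovered
```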
After that, I applied exactly the same process to the Voynich. First I concatenated all the text from paragraphs and long lines (avoiding labels), removed the spaces, and trained the BPE model on the continuous sequence using the same parameters as for Latin. The model produced a fairly compact set of about 265 unique segments. Then I returned to the original Voynich words and segmented them using the model’s dictionary.
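The re-segmentation step itself can be done in more than one way; I am assuming here a greedy longest-match lookup against the learned inventory, which is one common choice (replaying the merges in training order is another):

```python
def segment_word(word, vocab):
    """Greedy longest-match segmentation of one word against the inventory."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest substring first
            if word[i:j] in vocab:
                units.append(word[i:j])
                i = j
                break
        else:                                 # no match at all: emit one character
            units.append(word[i])
            i += 1
    return units
```

With an inventory like the one listed at the end of this post, a word such as qokchedy would come out as something like ['qok', 'chedy'], i.e. two segments, consistent with the sub-2 average reported below.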
The results are surprisingly high. Out of roughly 37 thousand words:
- The model correctly recovers around 90 percent of word beginnings and endings.
- About 8 out of 10 words have both boundaries correct.
- The average is fewer than 2 segments per word, which suggests a fairly stable structure.
This suggests that the spaces in the Voynich are not random. The model, which only sees character statistics and knows nothing about meaning or how words are constructed, is still able to predict the same boundaries that the writer marked with spaces. If the spaces were decorative or had no function, the model should not be able to recover them with this level of accuracy.
It appears that the spaces in the Voynich show strong internal coherence, comparable to or even higher than that of real Latin text subjected to the same procedure. The spaces seem to mark meaningful units in the system, not arbitrary additions.
As an extra, here is the list of segments the model found:
Total number of segments: 67931
Number of unique segments: 265
Top 50 most frequent segments:
'ol' 2121
'ch' 1868
'or' 1599
'y' 1513
'ar' 1453
'aiin' 1444
'd' 1371
'che' 1361
's' 1301
't' 1233
'daiin' 1230
'al' 1201
'l' 1153
'o' 1118
'qot' 1026
'k' 1006
'sh' 1002
'chedy' 976
'e' 966
'ot' 948
'chy' 887
'she' 869
'qok' 860
'ain' 823
'ok' 814
'chey' 808
'eedy' 796
'ey' 791
'yk' 784
'dy' 779
'chol' 740
'olk' 737
'p' 736
'dar' 679
'edy' 668
'ody' 650
'yt' 639
'eey' 628
'dal' 628
'am' 611
'cth' 549
'ee' 546
'r' 542
'cho' 539
'chor' 531
'od' 520
'qo' 504
'chdy' 481
'shedy' 454
'os' 445