The Voynich Ninja

Pages: 1 2 3 4

(15-11-2025, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces. It would just need to be able to read unicode - I expect it would be UTF-8.

I would be interested too. Do you have a suitable text, or can you point me to one?

Thanks, and all the best, --stolfi

This is a bit off-topic, just an idea to test: how many Voynichese words are part of a larger Voynichese word?
Then compare the results with the same analysis of other languages, and repeat it in those languages with units smaller than words, at the level of bigrams, trigrams...
The idea is to test whether Voynichese words behave like the words of other languages or like units smaller than words.
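
One way to make this test concrete is sketched below in Python. It is only an illustration under assumptions not stated in the post: the function name `contained_types` is hypothetical, words are compared as distinct types rather than as running tokens, and "part of" is read as plain substring containment.

```python
def contained_types(words):
    """Return the distinct words that occur inside at least one longer word."""
    vocab = {w for w in words if w}  # distinct, non-empty word types
    # A word equal in length cannot contain a different word, so
    # "w != other and w in other" implies that `other` is longer.
    # Quadratic in the vocabulary size, which is fine for a sketch.
    return {w for w in vocab if any(w != other and w in other for other in vocab)}


# Toy word list, only to show the call; real input would be a full transliteration.
sample = ["or", "shor", "ar", "dar", "daiin"]
inside = contained_types(sample)
print(sorted(inside), len(inside) / len(set(sample)))
```

Running the same function on the word list of another language, and then on its character bigrams or trigrams, would give the comparison figures suggested above.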

(15-11-2025, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces.
It would just need to be able to read unicode - I expect it would be UTF-8. I would be interested to see how well it does.

OK, done. I took the book ThaiBhagavadGita from Sitalatma.

I cleaned the text, keeping only the paragraphs and removing the non-Thai characters. This is what it looks like:

ถึงแม้ว่าหนังสือ.ภควัตคีตาฺ.จะได้รับความนิยมจากการพิมพ์และการอ่านอย่างแพร่หลาย.เดิมทีเป็นตอนหนึ่งของ.มหาภารตะฺ.ซึ่งเป็นวรรณกรรมประวัติศาสตร์ภาษาสันสกฤตของโลกในอดีต.มหาภารตะฺ

I had to tune the model: at the beginning about 56% of the segments found were single characters, which pushed the space detection to almost 100%. The segmentation was too atomized. So I limited the model to allow at most 10% of the segments found to be single characters. With that, you can see the results here:

Number of gold words: 32413
Length of concatenated text: 916412
Stopping at iteration 1274 with char_ratio=0.100

SEGMENTATION STATS:
Number of segments: 235852
Segments per word: 7.28
Percentage of 1 character segments: 10.00%

EVALUATION RESULTS, length aware BPE:
Correct starts: 93.626% Correct ends: 93.626%
Boundary recall: 93.626% Perfect words: 87.406%

Example, first 200 characters segmented:
ถึงแม้ว่า | หน | ัง | ส | ือ | ภควัตคีตาฺ | จะ | ได้รับความ | นิ | ยม | จากการ | พิ | ม | พ | ์ | และการ | อ | ่าน | อย่าง | แ | พร | ่ | หลาย | เดิม | ที | เป็น | ตอน | หนึ่ง | ของ | มหา | ภ | าร | ต | ะฺ | ซึ่งเป็น | วรรณกรรม | ประ | วัต | ิ | ศาสตร์ | ภา | ษ | าส | ัน | ส | ก | ฤ | ต | ของ | โลก | ในอ | ดีต | มหา | ภ | าร | ต | ะฺ | กล่าว | ถึง | สถาน | การณ์ | ซึ่ง | น | ํ | า | เร | าม | าส | ู่

The Thai experiment has given me a new approach to the model: limiting the share of single-character segments. Capping single-character segments at 10% of all segments has even improved the space-detection KPI on the Voynich:

Length of Voynich text without spaces: 190728
Number of original words (Voynich): 37503

Results on original spaces (Voynich, length aware BPE):
Total word boundaries (start+end): 75006
Correct word starts: 32782 (87.412 percent) *
Correct word ends: 32782 (87.412 percent) *
Words with both boundaries correct: 28627 (76.333 percent)

Boundary precision: 59.192 percent Boundary recall: 87.411 percent Boundary F1: 70.586 percent **
Oversegmentation factor: 1.477 Segments per word: 1.48
Percentage of 1 character segments: 9.98 percent

* The identical numbers are expected, since every space marks both the end of one word and the start of the next.
** Boundary precision (59.19%) means that about six in ten predicted boundaries match real spaces, while the rest are internal splits inside words (for example fachys becomes fa | chy | s). Boundary recall (87.41%) means the model successfully recovers 87% of the real word boundaries in the text. Boundary F1 (70.59%) summarizes both effects, showing that the model captures most true boundaries while still adding some internal segment breaks.
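
As a quick arithmetic check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two. The helper name `f1_score` below is only illustrative; the two inputs are the percentages quoted above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in 0..1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Boundary precision and recall reported for the Voynich run.
f1 = f1_score(0.59192, 0.87411)
print(f"Boundary F1: {100 * f1:.3f} percent")
```

The result agrees with the 70.586 percent reported above.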

Example, first 300 characters segmented:

f | a | chys | yk | al | ar | a | taiin | shol | shor | y | cth | r | es | yk | or | shol | dy |  | sor | y | ckh | ar | or | yk | air | cht | aiin | shar | as | e | cth | ar | cth | ar | d | an | sy | aiir | sheky | or | ykaiin | shod | cth | o | ary | cth | es | dar | aiin | s | ys | oiin | oteey | ot | eos | rol | oty | cth | i | ar | daiin | okaiin | or | ok | an | sair | yche | ar | cth | aiin | cph | ar | cfh | aiin | ydar | ai | shy |  | odar | c' | y | shol | cph | o | yo | ydar | sh | s | cfh | oaiin | sh | odar | yy | shey | shody | ok | cho | yot | chol | chocthy | os | chyd | ain | chor | k | os

In my previous run, qokeedy was split in quite different ways ('qok'+'eedy', or 'eedy'+'qok' where two qokeedy are adjacent). In this run, qokedy and qokeedy appear as full "words" and qokeeedy appears as the junction of two morphemes:

Number of occurrences of 'qokedy': 269
Distinct segmentation patterns for 'qokedy': 1
Patterns and counts (most frequent first):
269 times: qokedy

Number of occurrences of 'qokeedy': 303
Distinct segmentation patterns for 'qokeedy': 1
Patterns and counts (most frequent first):
303 times: qokeedy

Number of occurrences of 'qokeeedy': 5
Distinct segmentation patterns for 'qokeeedy': 1
Patterns and counts (most frequent first):
5 times: qokee | edy

(14-11-2025, 11:41 PM)quimqu Wrote: What I did was apply a Byte Pair Encoding (BPE) segmentation model. This model does not need to know anything about the language. It simply takes the entire text as a sequence of characters with no spaces or punctuation, identifies which pairs of characters are most frequent, and merges them. After a certain number of merges, the model produces a set of segments that can be interpreted as statistical morphemes. With these segments, you can try to reconstruct the original words by segmenting them into the most likely units, and then compare whether these boundaries match the original spaces.

Using BPE in this way is a useful and interesting approach and your conclusions seem reasonable -- providing they aren't interpreted for more than what they are. I think you have to be careful with using the word "morpheme" though, since it suggests that each one carries some linguistic meaning or grammatical function, and by extension that we are dealing with words in a linguistic sense. BPE doesn't guarantee that at all.

You did initially qualify it as "statistical morphemes" (Not sure what that exactly is, but at least it distinguishes it from actual morphemes.) but after that you seem to treat them more like actual morphemes. Maybe it would be clearer and more accurate to just call them “segments”, or something like that, to avoid implying full linguistic status unless (until?) there is separate evidence for that. They could be (as an example) simply atomic components of some structured cipher system.

Thanks @quimqu for the effort!

(15-11-2025, 05:27 PM)quimqu Wrote: EVALUATION RESULTS, length aware BPE:
Correct starts: 93.626% Correct ends: 93.626%
Boundary recall: 93.626% Perfect words: 87.406%

Example, first 200 characters segmented:
ถึงแม้ว่า | หน | ัง | ส | ือ | ภควัตคีตาฺ | จะ | ได้รับความ | นิ | ยม | จากการ | พิ | ม | พ | ์ | และการ | อ | ่าน | อย่าง | แ | พร | ่ | หลาย | เดิม | ที | เป็น | ตอน | หนึ่ง | ของ | มหา | ภ | าร | ต | ะฺ | ซึ่งเป็น | วรรณกรรม | ประ | วัต | ิ | ศาสตร์ | ภา | ษ | าส | ัน | ส | ก | ฤ | ต | ของ | โลก | ในอ | ดีต | มหา | ภ | าร | ต | ะฺ | กล่าว | ถึง | สถาน | การณ์ | ซึ่ง | น | ํ | า | เร | าม | าส | ู่

Unfortunately, this is not correct. It is even very far off. I wonder how the tool decided on the 90+% percentages.

The line:

ถึงแม้ว่าหนังสือ.ภควัตคีตาฺ.จะได้รับความนิยมจากการพิมพ์และการอ่านอย่างแพร่หลาย.เดิมทีเป็นตอนหนึ่งของ.มหาภารตะฺ.ซึ่งเป็นวรรณกรรมประวัติศาสตร์ภาษาสันสกฤตของโลกในอดีต.มหาภารตะฺ

is quite an unfortunate choice as a sample Thai text. It includes several names that are Sanskrit. The full stops are also unusual and while they may be used in Thai, here they seem to also represent quotes. The character that looks like a tiny dot below some characters is normally not used.
I doubt that this was the main cause for the failure, but it will not have helped.

(16-11-2025, 02:54 PM)ReneZ Wrote: The full stops are also unusual and while they may be used in Thai, here they seem to also represent quotes. The character that looks like a tiny dot below some characters is normally not used.

Hi René, I use "." to separate the words. In the original document the "." are spaces.

I don't understand Thai, and it could very well be that there are Sanskrit words. If you can provide a text, I could give it a try.

About the percentages: if we have the word "house" and the split is "hou"+"se", the boundary recall is 100%, because both edges of the original word coincide with a split. Of course there are internal splits, but they do not count against the recall of spaces. A boundary recall of 93% means that 7% of the boundaries were not matched (for example, "these houses" could have been split into "thes" "ehou" "ses", where the middle space is not matched).
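
The counting described here can be sketched as follows. This is not the script used for the numbers in this thread; `boundary_offsets` and `boundary_scores` are hypothetical names, and the sketch assumes that the start and the end of the text count as boundaries, which is what lets "hou"+"se" score 100% recall.

```python
from itertools import accumulate


def boundary_offsets(units):
    """Character offsets at which a unit starts or ends, text edges included."""
    return {0, *accumulate(len(u) for u in units)}


def boundary_scores(gold_words, predicted_segments):
    """Return (precision, recall) of predicted boundaries against the gold ones."""
    if "".join(gold_words) != "".join(predicted_segments):
        raise ValueError("both segmentations must cover the same text")
    gold = boundary_offsets(gold_words)
    pred = boundary_offsets(predicted_segments)
    hits = len(gold & pred)  # predicted boundaries that fall on a real one
    return hits / len(pred), hits / len(gold)


# Internal splits lower precision but not recall.
print(boundary_scores(["house"], ["hou", "se"]))
# The missed middle space lowers recall.
print(boundary_scores(["these", "houses"], ["thes", "ehou", "ses"]))
```

Under this definition a model that splits every character apart always reaches 100% recall, which is why recall has to be read together with precision or with the number of segments per word.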

Let me try to parse the Thai text into words. I put the names (Sanskrit) into brackets, and definitive word breaks as two vertical bars. I put one vertical bar when it is a composite where each part could also be a word.

Example: the English word 'mainstream' would be parsed as || main | stream ||

|| ถึง || แม้ || ว่า || หนัง | สือ || [ภควัตคีตาฺ] จะ || ได้ || รับ || ความ | นิยม || จาก ||
การ | พิมพ์ || และ || การ | อ่าน || อย่าง || แพร่ | หลาย || เดิม || ที || เป็น || ตอน ||
หนึ่ง || ของ || [มหาภารตะฺ] ซึ่ง || เป็น || วรรณ | กรรม || ประวัติ | ศาสตร์ || ภาษา ||
สันสกฤต || ของ || โลก || ใน || อดีต || [มหาภารตะฺ]

Compare with:

ถึงแม้ว่า | หน | ัง | ส | ือ | ภควัตคีตาฺ | จะ | ได้รับความ | นิ | ยม | จากการ | พิ | ม | พ | ์ | และการ | อ | ่าน | อย่าง | แ | พร | ่ | หลาย | เดิม | ที | เป็น | ตอน | หนึ่ง | ของ | มหา | ภ | าร | ต | ะฺ | ซึ่งเป็น | วรรณกรรม | ประ | วัต | ิ | ศาสตร์ | ภา | ษ | าส | ัน | ส | ก | ฤ | ต | ของ | โลก | ในอ | ดีต | มหา | ภ | าร | ต | ะฺ | กล่าว | ถึง | สถาน | การณ์ | ซึ่ง | น | ํ | า | เร | าม | าส | ู่

For Stolfi: Thai has quite a number of non-separable multi-syllable words. This text has five, not counting สันสกฤต which is a loan word and says 'Sanskrit'.

(17-11-2025, 05:33 AM)ReneZ Wrote: For Stolfi: Thai has quite a number of non-separable multi-syllable words. This text has five, not counting สันสกฤต which is a loan word and says 'Sanskrit'.

Hello René.

I would like to clarify a few things about the BPE model before addressing the Thai case. BPE was originally designed for text compression, and it is now used for tokenization in LLM training. Basically, over several passes, it merges the most frequent adjacent units into increasingly larger groups (first pairs of characters, then groups of three, and so on). I thought that with this method I could determine whether the spaces in the Voynich manuscript delimit real words or whether the words were broken up by false spaces (for instance to confuse a potential “reader”).

From your answer about Thai I understand the following.

I have seen that in the original text the Sanskrit words are marked, and I have been able to remove them. Therefore I now have the text cleaned of Sanskrit.
I understand from what you say that Thai texts contain few spaces, and that multiple words are grouped between two consecutive spaces. This means that my precision and recall calculations are meaningless, because I can only compute them if I know the real word boundaries, and the original text gives me no indication.
Likewise, this means that only someone who knows how the words are actually “delimited”, independent of the spaces in the original text, can interpret the results. It will be impossible for me to do much more than show the segmentation output (as in my previous post).
I train different models and compare them with the spaces in the original text. One of the main parameters is the share of single characters. The more single characters there are, the less information the model can provide about the spaces. If the model does not merge at all, every original space is found, but obviously along with many spurious intra-word boundaries. I also think it is useful to limit the number of characters in a group in order to try to find syllables rather than full words (which was not done in the Thai experiment, which is why long segments appear).
Finally, I want to stress that the groupings depend on the original text. If the original text tends to have many repetitions or very frequent phrases, these will tend to be grouped into a single segment. Therefore it is reasonable to limit the maximum number of characters to group in order to make the model more independent of the content of the text.
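
To make the two knobs discussed above concrete, here is a minimal BPE sketch. It is a plain greedy BPE written for illustration, not the "length aware BPE" used for the results in this thread, whose details are not given. The names `bpe_segment`, `max_len` and `target_single_ratio` are assumptions: `max_len` caps the size of a merged segment, and `target_single_ratio` stops merging once the share of one-character segments has dropped to that value, which is one possible reading of the 10% limit.

```python
from collections import Counter


def bpe_segment(text, max_len=None, target_single_ratio=None, max_merges=10_000):
    """Greedy BPE over a space-free string; returns the list of segments."""
    segments = list(text)  # start from single characters
    for _ in range(max_merges):
        if not segments:
            break
        if target_single_ratio is not None:
            singles = sum(len(s) == 1 for s in segments)
            if singles / len(segments) <= target_single_ratio:
                break  # few enough one-character segments are left
        # Count adjacent pairs, skipping merges that would exceed max_len.
        pairs = Counter(
            (a, b)
            for a, b in zip(segments, segments[1:])
            if max_len is None or len(a) + len(b) <= max_len
        )
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more
        # Replace every non-overlapping occurrence of the pair, left to right.
        merged, i = [], 0
        while i < len(segments):
            if i + 1 < len(segments) and segments[i] == a and segments[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(segments[i])
                i += 1
        segments = merged
    return segments


print(" | ".join(bpe_segment("qokeedyqokeedyqokedyqokeedy", max_len=5)))
```

Without `max_len`, a phrase that repeats often enough ends up as a single long segment, which is the content dependence described above; the cap trades that away for shorter, more syllable-like units.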

With all this, and with the cleaned text (I do not know whether it is a good text to test), and limiting the maximum group size to 5 characters, this output appears:

ถึง | แม้ | ว่า | หน | ังส | ือ | จะ | ได้ | รับ | ความ | นิยม | จาก | การ | พิ | มพ | ์และ | การ | อ่าน | อย | ่าง | แพร่ | หลาย | เดิม | ที | เป็น | ตอน | หน | ึ่ง | ของ | ซึ่ง | เป็น | วรรณ | กรรม | ประ | วัติ | ศาส | ตร์ | ภา | ษ | าส | ัน | สกฤต | ของ | โลก | ในอ | ดีต | กล | ่าว | ถึง | สถาน | การ | ณ์ | ซึ่ง | น | ํา | เร | ามาส | ู่ | ยุค | ปัจ | จุ | บัน | คือ | กล | ีย | ุค | ซึ่ง | เป็น | จุด | เริ | ่ม | ต้น | ของ | ยุค | นี้ | กล | ีย | ุค | เริ | ่ม | ขึ้ | นเม | ื่อ | ประ | มาณ | ห้า | พัน | ปีก | ่อน | องค์ | ชรี | คริ | ชณะ | ตรัส | ให้ | แก | ่อาร | จุนะ | ผู้ | ทรง | เป็น | ทั้ง | สหาย | และ | สาวก | ของ | พระ | องค์ | การ | สนท | นา | คร | ั้ง | นี้ | เป็น | การ | สนท | นา | ปรัช | ญา | และ | ธรรม | ะ | อัน | ยิ่ง | ใหญ่ | ที่ | สุด | ที่ | มนุษ | ย์ | เคย | รู้

(15-11-2025, 10:57 AM)Koen G Wrote: You are not allowed to view links. Register or Login to view.The result may be exactly the same, but the underlying system and the way we may think about spaces is different.

Nicely put. My own feeling is that it's easier to account for phenomena such as Smith-Ponzi word-break combinations if we imagine spaces being inserted into fundamentally continuous sequences of glyphs. We might still not know why [dy] and [y.q] are both so common -- but these would then become two observations of the same type rather than observations of two different types (requiring two different kinds of explanation).


Jorge_Stolfi

Juan_Sali

quimqu

quimqu

asteckley

ReneZ

quimqu

ReneZ

quimqu

pfeaster