(18-11-2025, 01:05 AM)pfeaster Wrote: Nicely put. My own feeling is that it's easier to account for phenomena such as Smith-Ponzi word-break combinations if we imagine spaces being inserted into fundamentally continuous sequences of glyphs.
I lean the other way, since the word-end glyphs also look like they have been designed for word end. Look at the practice of using a special last minim in roman numerals, for example (xviij). This is similar to [iin]. Similar things can be said about [y] and for example the "-us" abbreviation. I mean, the glyphs kind of look like they belong where we find them most often.
Although gallows with prefixes may be an exception. [qoty] just looks bad.
(18-11-2025, 10:48 AM)Koen G Wrote: (18-11-2025, 01:05 AM)pfeaster Wrote: Nicely put. My own feeling is that it's easier to account for phenomena such as Smith-Ponzi word-break combinations if we imagine spaces being inserted into fundamentally continuous sequences of glyphs.
I lean the other way, since the word-end glyphs also look like they have been designed for word end. Look at the practice of using a special last minim in roman numerals, for example (xviij). This is similar to [iin]. Similar things can be said about [y] and for example the "-us" abbreviation. I mean, the glyphs kind of look like they belong where we find them most often.
Although gallows with prefixes may be an exception. [qoty] just looks bad.
These views aren't necessarily mutually exclusive. Imagine, for example, a system in which letters of the alphabet are encoded as Roman numerals, with the value of every odd-numbered letter being multiplied by 100. Thus:
mmccxvmmdxivcmiiidcccxiiicxivmmcxixcccxviiicmxvimm = VOYNICHMANUSCRIPT
This represents a continuous underlying string of text (there's a word break, but it isn't encoded), and it's unambiguous even without breaks inserted. At the same time, it would be helpful for the reader -- and respect the nature of the encoding scheme -- to separate the segments like this (note the closing "j" where appropriate):
mmccxv mmdxiv cmiij dcccxiij cxiv mmcxix cccxviij cmxvj mm
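For anyone who wants to experiment with this toy scheme, here is a minimal Python sketch of it. The function names (`to_roman`, `encode_segments`, `closing_j`) are my own, and I am assuming A=1 ... Z=26, standard subtractive Roman numerals, and that "odd-numbered" refers to a letter's position in the text (1st, 3rd, 5th, ...), which is the reading that reproduces the strings above.

```python
from string import ascii_uppercase

# Standard subtractive Roman numeral table, lowercase as in the example.
ROMAN = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
    (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
    (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def encode_segments(text: str) -> list[str]:
    """Encode letters pairwise: 100 * (odd-position letter) + (even-position letter).

    Non-letters (including the word break) are dropped, so the break is not encoded.
    """
    values = [ascii_uppercase.index(c) + 1 for c in text.upper() if c in ascii_uppercase]
    segments = []
    for i in range(0, len(values), 2):
        total = values[i] * 100              # 1st, 3rd, 5th, ... letter
        if i + 1 < len(values):
            total += values[i + 1]           # 2nd, 4th, 6th, ... letter
        segments.append(to_roman(total))
    return segments


def closing_j(segment: str) -> str:
    """Write a segment-final 'i' as 'j', as in 'xviij'."""
    return segment[:-1] + "j" if segment.endswith("i") else segment


segments = encode_segments("VOYNICH MANUSCRIPT")
print("".join(segments))                          # continuous string
print(" ".join(closing_j(s) for s in segments))   # spaced, with closing j
```

Each spaced segment here covers two plaintext letters, so the spaces mark units of the encoding rather than words of the underlying text.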
It would be interesting to see what Sukhotin's algorithm would predict are the vowels and consonants here. I'm not suggesting this is precisely how Voynichese works -- just offering it as a model for imagining how spaces might be functioning within it.
In this scheme (or in the Naibbe cipher, for that matter):
(1) Are the spaces "really spaces"?
(2) Do the spaces make the writing system more or less predictable than it would otherwise be?
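For trying the Sukhotin experiment mentioned above, here is a rough Python sketch of the algorithm as it is usually described: count adjacencies between different symbols, start with everything as a consonant, then repeatedly promote the symbol with the largest positive sum to vowel and reduce the sums of its neighbours. The function name `sukhotin_vowels` and the alphabetical tie-break are my own choices, not part of any reference implementation.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the set of symbols classified as vowels by Sukhotin's algorithm.

    `words` is an iterable of strings; adjacencies are only counted inside a word.
    """
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:                      # the diagonal is kept at zero
                adj[a][b] += 1
                adj[b][a] += 1

    # Row sums of the symmetric adjacency matrix.
    sums = {s: sum(adj[s].values()) for s in symbols}
    vowels = set()
    while True:
        candidates = {s: v for s, v in sums.items() if s not in vowels and v > 0}
        if not candidates:
            break
        # Largest sum wins; ties go to the alphabetically first symbol (my assumption).
        best = max(sorted(candidates), key=candidates.get)
        vowels.add(best)
        for s in symbols - vowels:
            sums[s] -= 2 * adj[s][best]
    return vowels


print(sukhotin_vowels(["banana"]))
```

Feeding it the spaced segments versus the single continuous string would show how much the inserted spaces change the classification, since adjacencies across a space are ignored.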
(17-11-2025, 09:43 AM)quimqu Wrote: (17-11-2025, 05:33 AM)ReneZ Wrote: For Stolfi: Thai has quite a number of non-separable multi-syllable words. This text has five, not counting สันสกฤต, which is a loan word and says 'Sanskrit'.
(Not sure what is happening with the fonts here).
[...]
With all this, and with the cleaned text (I do not know whether it is a good text to test), and limiting the maximum group size to 5 characters, this output appears:
ถึง | แม้ | ว่า | หน | ังส | ือ | จะ | ได้ | รับ | ความ | นิยม | จาก | การ | พิ | มพ | ์และ | การ | อ่าน | อย | ่าง | แพร่ | หลาย | เดิม | ที | เป็น | ตอน | หน | ึ่ง | ของ | ซึ่ง | เป็น | วรรณ | กรรม | ประ | วัติ | ศาส | ตร์ | ภา | ษ | าส | ัน | สกฤต | ของ | โลก | ในอ | ดีต | กล | ่าว | ถึง | สถาน | การ | ณ์ | ซึ่ง | น | ํา | เร | ามาส | ู่ | ยุค | ปัจ | จุ | บัน | คือ | กล | ีย | ุค | ซึ่ง | เป็น | จุด | เริ | ่ม | ต้น | ของ | ยุค | นี้ | กล | ีย | ุค | เริ | ่ม | ขึ้ | นเม | ื่อ | ประ | มาณ | ห้า | พัน | ปีก | ่อน | องค์ | ชรี | คริ | ชณะ | ตรัส | ให้ | แก | ่อาร | จุนะ | ผู้ | ทรง | เป็น | ทั้ง | สหาย | และ | สาวก | ของ | พระ | องค์ | การ | สนท | นา | คร | ั้ง | นี้ | เป็น | การ | สนท | นา | ปรัช | ญา | และ | ธรรม | ะ | อัน | ยิ่ง | ใหญ่ | ที่ | สุด | ที่ | มนุษ | ย์ | เคย | รู้
This is a lot better, but still fails in a significant way. I suspect that the input text may have been too short.
All things that are superscripts are vowels or tone marks that cannot be separated from the consonant above which they are written. Also, all vowels are strictly bound to the adjacent consonant. Some are written to the left and some to the right but the rules are strict. My guess is that with a longer text this would have been detected.....
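One way to quantify this kind of failure is to list the segments that begin with a mark that cannot stand on its own. The sketch below is only a rough check: it assumes that the Unicode general category (anything starting with "M", i.e. combining marks) is a good enough proxy for the Thai signs written above or below a consonant. It does not cover vowels written to the left or right of the consonant. The function names are hypothetical.

```python
import unicodedata


def starts_with_mark(segment: str) -> bool:
    """True if the first code point is a combining mark (Unicode category M*)."""
    return bool(segment) and unicodedata.category(segment[0]).startswith("M")


def broken_segments(segmented: str, sep: str = "|") -> list[str]:
    """Return the segments of a separator-delimited string that start with a combining mark."""
    parts = [p.strip() for p in segmented.split(sep)]
    return [p for p in parts if starts_with_mark(p)]


# First few segments of the output quoted above.
sample = "หน | ังส | ือ | จะ"
print(broken_segments(sample))
```

Every segment this returns has been cut off from the consonant its first mark belongs to, so the count gives a lower bound on the number of bad boundaries.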
(19-11-2025, 08:23 AM)ReneZ Wrote: This is a lot better, but still fails in a significant way. I suspect that the input text may have been too short.
All things that are superscripts are vowels or tone marks that cannot be separated from the consonant above which they are written. Also, all vowels are strictly bound to the adjacent consonant. Some are written to the left and some to the right but the rules are strict. My guess is that with a longer text this would have been detected.....
Thanks René. The text is quite long:
36782 "words" separated by a space
872639 "chars"
I suppose the issue is that the model detects the vowel chars and assumes that they would be better treated as single chars. My question: does Thai have single char words? If not, I could set the model not to take into consideration single chars and maybe have a better result.
These restrictions, on the other hand, are not good to apply to the Voynich. Note that for the MS, I have analysed the single char usage: in order to have a better match with the spaces, we need about 7% of the segments to be single chars. My analysis has taken into consideration only the paragraphs and long sentences. No labels.
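As a small illustration of the statistic mentioned above, the share of single-char segments can be computed like this. It is a minimal sketch, not the actual analysis code; the function name `single_char_share` and the example tokens are made up for illustration.

```python
def single_char_share(segments):
    """Fraction of non-empty segments that consist of exactly one character."""
    segments = [s for s in segments if s]
    if not segments:
        return 0.0
    return sum(1 for s in segments if len(s) == 1) / len(segments)


# Made-up tokens, only to show the call; not real transliteration data.
example = ["daiin", "s", "chol", "y"]
print(single_char_share(example))
```

Note that `len` counts Unicode code points, so for Thai a consonant plus a tone mark counts as two characters.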
That is quite a long sample indeed, so I would need to conclude empirically and objectively that this algorithm fails to detect word boundaries in a Thai text.
(19-11-2025, 09:33 AM)quimqu Wrote: My question: does Thai have single char words? If not, I could set the model not to take into consideration single chars and maybe have a better result.
No, Thai words always have at least one consonant and one vowel. In fact, not all vowels are written, in particular the default short 'a' at the end of a syllable, or the default short 'o' when the syllable has a closing consonant.
However, I only know one word that consists of a single character: ณ , pronounced 'na', which is quite infrequent.
There are several possible ways in which the algorithm should have worked, but fails to. One is that a longish text will have 60-70 different symbols. If this exceeds a limit in the code, I suppose that the output would indicate this....
Also, not all symbols are vowels or consonants. There are four tone marks, of which two appear frequently.
That shouldn't really matter though...