Jorge_Stolfi > 15-11-2025, 02:29 PM
(15-11-2025, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces. It would just need to be able to read unicode - I expect it would be UTF-8.
Juan_Sali > 15-11-2025, 02:59 PM
quimqu > 15-11-2025, 05:27 PM
(15-11-2025, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces.
It would just need to be able to read unicode - I expect it would be UTF-8. I would be interested to see how well it does.
quimqu > 15-11-2025, 06:03 PM
asteckley > 16-11-2025, 02:07 AM
(14-11-2025, 11:41 PM)quimqu Wrote: What I did was apply a Byte Pair Encoding (BPE) segmentation model. This model does not need to know anything about the language. It simply takes the entire text as a sequence of characters with no spaces or punctuation, identifies which pairs of characters are most frequent, and merges them. After a certain number of merges, the model produces a set of segments that can be interpreted as statistical morphemes. With these segments, you can try to reconstruct the original words by segmenting them into the most likely units, and then compare whether these boundaries match the original spaces.
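For readers who want to try the procedure quoted above, here is a minimal sketch of plain BPE over a space-free character sequence. It is not quimqu's actual code, and it does not implement the "length aware" variant mentioned later in the thread. The function names (`learn_bpe`, `apply_merge`, `segment`) and the stopping rule (stop when no pair occurs at least twice) are assumptions made for illustration only. Pair counts here include overlapping occurrences, which is a common simplification.

```python
from collections import Counter


def apply_merge(symbols, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged symbol a + b."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(text, num_merges):
    """Learn up to num_merges merges from text (spaces already removed).

    Returns the ordered merge list and the final symbol sequence.
    Assumption: stop early once the most frequent pair occurs only once.
    """
    symbols = list(text)  # start from single characters (Unicode code points)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append((a, b))
        symbols = apply_merge(symbols, a, b)
    return merges, symbols


def segment(text, merges):
    """Segment new text by replaying the learned merges in order."""
    symbols = list(text)
    for a, b in merges:
        symbols = apply_merge(symbols, a, b)
    return symbols
```

A typical use would be `merges, _ = learn_bpe(corpus_without_spaces, 500)` followed by `" | ".join(segment(sample, merges))` to print segments in the style shown in this thread. The corpus variable and the merge count of 500 are placeholders, not values from the posts.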
ReneZ > 16-11-2025, 02:54 PM
(15-11-2025, 05:27 PM)quimqu Wrote: EVALUATION RESULTS, length aware BPE:
Correct starts: 93.626% Correct ends: 93.626%
Boundary recall: 93.626% Perfect words: 87.406%
Example, first 200 characters segmented:
ถึงแม้ว่า | หน | ัง | ส | ือ | ภควัตคีตาฺ | จะ | ได้รับความ | นิ | ยม | จากการ | พิ | ม | พ | ์ | และการ | อ | ่าน | อย่าง | แ | พร | ่ | หลาย | เดิม | ที | เป็น | ตอน | หนึ่ง | ของ | มหา | ภ | าร | ต | ะฺ | ซึ่งเป็น | วรรณกรรม | ประ | วัต | ิ | ศาสตร์ | ภา | ษ | าส | ัน | ส | ก | ฤ | ต | ของ | โลก | ในอ | ดีต | มหา | ภ | าร | ต | ะฺ | กล่าว | ถึง | สถาน | การณ์ | ซึ่ง | น | ํ | า | เร | าม | าส | ู่
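The posts do not define how the four scores above were computed, so the following is only one plausible reading, sketched under stated assumptions: boundaries are character offsets, a word start or end counts as correct when the predicted segmentation has a boundary at that offset, a word is "perfect" when both ends match and no predicted boundary falls inside it, and boundary recall is the share of gold word-internal boundaries that were predicted. The names `boundaries` and `evaluate` are hypothetical.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def evaluate(gold_words, predicted_segments):
    """Compare predicted segments against gold words over the same text.

    Assumed definitions (not taken from the thread):
      correct_starts / correct_ends: fraction of gold words whose start / end
        offset is also a predicted boundary (offset 0 always counts).
      perfect_words: fraction of gold words matched at both ends with no
        predicted boundary strictly inside the word.
      boundary_recall: fraction of gold boundaries between words that
        appear among the predicted boundaries.
    """
    if "".join(gold_words) != "".join(predicted_segments):
        raise ValueError("gold and predicted must cover the same characters")
    if not gold_words:
        raise ValueError("gold_words must not be empty")

    pred = boundaries(predicted_segments) | {0}
    starts_ok = ends_ok = perfect = 0
    pos = 0
    for word in gold_words:
        start, end = pos, pos + len(word)
        has_start = start in pred
        has_end = end in pred
        split_inside = any(start < b < end for b in pred)
        starts_ok += has_start
        ends_ok += has_end
        perfect += has_start and has_end and not split_inside
        pos = end

    total_len = pos
    gold_inner = boundaries(gold_words) - {total_len}
    recall = len(gold_inner & pred) / len(gold_inner) if gold_inner else 1.0
    n = len(gold_words)
    return {
        "correct_starts": starts_ok / n,
        "correct_ends": ends_ok / n,
        "boundary_recall": recall,
        "perfect_words": perfect / n,
    }
```

Under these definitions a segmentation that splits words further (as in the Thai sample) can still score high on starts, ends and recall while losing points on perfect words, which is at least consistent with the pattern of the reported figures.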
quimqu > 16-11-2025, 04:20 PM
(16-11-2025, 02:54 PM)ReneZ Wrote: The full stops are also unusual and while they may be used in Thai, here they seem to also represent quotes. The character that looks like a tiny dot below some characters is normally not used.
ReneZ > 17-11-2025, 05:33 AM
quimqu > 17-11-2025, 09:43 AM
(17-11-2025, 05:33 AM)ReneZ Wrote: For Stolfi: Thai has quite a number of non-separable multi-syllable words. This text has five, not counting สันสกฤต which is a loan word and says 'Sanskrit'.
(Not sure what is happening with the fonts here).
pfeaster > 18-11-2025, 01:05 AM
(15-11-2025, 10:57 AM)Koen G Wrote: The result may be exactly the same, but the underlying system and the way we may think about spaces is different.