Jorge_Stolfi > Yesterday, 02:29 PM
(Yesterday, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces. It would just need to be able to read unicode - I expect it would be UTF-8.
Juan_Sali > Yesterday, 02:59 PM
quimqu > Yesterday, 05:27 PM
(Yesterday, 07:26 AM)ReneZ Wrote: Would you like to test your model on Thai? This is written mostly without word spaces.
It would just need to be able to read unicode - I expect it would be UTF-8. I would be interested to see how well it does.
quimqu > Yesterday, 06:03 PM
asteckley > Today, 02:07 AM
(14-11-2025, 11:41 PM)quimqu Wrote: What I did was apply a Byte Pair Encoding (BPE) segmentation model. This model does not need to know anything about the language. It simply takes the entire text as a sequence of characters with no spaces or punctuation, identifies which pairs of characters are most frequent, and merges them. After a certain number of merges, the model produces a set of segments that can be interpreted as statistical morphemes. With these segments, you can try to reconstruct the original words by segmenting them into the most likely units, and then compare whether these boundaries match the original spaces.
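For readers who want to see the merge loop described in the quote, here is a minimal sketch of plain character-level BPE. It is not quimqu's actual code, and it does not include the "length aware" variant mentioned later in the thread. The function names (learn_bpe, merge_pair, segment) and the tie-breaking rule (most frequent pair first, ties broken alphabetically) are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged token a+b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(text, num_merges):
    """Learn up to num_merges merges from a string with spaces already removed.

    Returns the ordered list of merges and the final token sequence.
    """
    tokens = list(text)  # start from single characters (code points)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Assumed tie-break: highest count first, then alphabetical order.
        a, b = min(pairs, key=lambda p: (-pairs[p], p))
        if pairs[(a, b)] < 2:
            break  # nothing left that occurs more than once
        merges.append((a, b))
        tokens = merge_pair(tokens, a, b)
    return merges, tokens


def segment(text, merges):
    """Segment new text by replaying the learned merges in order."""
    tokens = list(text)
    for a, b in merges:
        tokens = merge_pair(tokens, a, b)
    return tokens
```

A possible use: strip the spaces from the corpus, call learn_bpe on the result, then compare the offsets where the returned segments end with the offsets of the original spaces.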
ReneZ > 5 hours ago
(Yesterday, 05:27 PM)quimqu Wrote: EVALUATION RESULTS, length aware BPE:
Correct starts: 93.626%
Correct ends: 93.626%
Boundary recall: 93.626%
Perfect words: 87.406%
Example, first 200 characters segmented:
ถึงแม้ว่า | หน | ัง | ส | ือ | ภควัตคีตาฺ | จะ | ได้รับความ | นิ | ยม | จากการ | พิ | ม | พ | ์ | และการ | อ | ่าน | อย่าง | แ | พร | ่ | หลาย | เดิม | ที | เป็น | ตอน | หนึ่ง | ของ | มหา | ภ | าร | ต | ะฺ | ซึ่งเป็น | วรรณกรรม | ประ | วัต | ิ | ศาสตร์ | ภา | ษ | าส | ัน | ส | ก | ฤ | ต | ของ | โลก | ในอ | ดีต | มหา | ภ | าร | ต | ะฺ | กล่าว | ถึง | สถาน | การณ์ | ซึ่ง | น | ํ | า | เร | าม | าส | ู่
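The thread does not say exactly how these scores were computed, so the following is only a sketch of one plausible way to score a segmentation against the original word spaces. The definitions are assumptions: boundary recall is taken as the share of true word-internal boundaries that coincide with a predicted segment boundary, and a word counts as "perfect" only if it comes out as exactly one predicted segment. The names boundaries and evaluate are hypothetical.

```python
def boundaries(segments):
    """Return the set of character offsets at which a segment ends."""
    offsets = set()
    pos = 0
    for seg in segments:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def evaluate(true_words, predicted_segments):
    """Compare predicted segments with the true words of the same text."""
    if "".join(true_words) != "".join(predicted_segments):
        raise ValueError("both segmentations must cover the same text")

    total_len = sum(len(w) for w in true_words)
    pred_b = boundaries(predicted_segments) | {0}
    # The end of the text is a trivial boundary, so leave it out of recall.
    true_internal = boundaries(true_words) - {total_len}

    if true_internal:
        recall = len(true_internal & pred_b) / len(true_internal)
    else:
        recall = 1.0

    # Assumed definition: a word is perfect when its start and end are
    # predicted boundaries and no predicted boundary falls inside it.
    perfect = 0
    pos = 0
    for word in true_words:
        start, end = pos, pos + len(word)
        inside = any(p in pred_b for p in range(start + 1, end))
        if start in pred_b and end in pred_b and not inside:
            perfect += 1
        pos = end

    return {
        "boundary_recall": recall,
        "perfect_words": perfect / len(true_words) if true_words else 1.0,
    }
```

Under these definitions an over-segmented output can still reach a high boundary recall, because extra boundaries are not penalised there; they only lower the perfect-word score.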
quimqu > 4 hours ago
(5 hours ago)ReneZ Wrote: The full stops are also unusual and while they may be used in Thai, here they seem to also represent quotes. The character that looks like a tiny dot below some characters is normally not used.