oshfdk > 25-06-2025, 11:46 AM
(25-06-2025, 10:18 AM)Koen G Wrote: Ah, I see. So you have a sliding window of n, and then you see how well you can predict the next n characters? I just can't intuitively grasp what this tells us. It's too abstract - a measure of the information density of the writing system.
Quote:An n-gram model learns the likelihood of seeing a particular character given the previous n-1 characters.
quimqu > 25-06-2025, 12:08 PM
(25-06-2025, 11:46 AM)oshfdk Wrote: (25-06-2025, 10:18 AM)Koen G Wrote: Ah, I see. So you have a sliding window of n, and then you see how well you can predict the next n characters? I just can't intuitively grasp what this tells us. It's too abstract - a measure of the information density of the writing system.
Actually, maybe it was me who misunderstood what quimqu was computing. According to the intro post:
Quote:An n-gram model learns the likelihood of seeing a particular character given the previous n-1 characters.
So, for 5-gram it should be guessing the next character based on the previous 4 characters.
For example, given qoke*, what is *?
But then it's strange that the perplexity goes up with longer ngrams, it shouldn't.
So, I don't understand what is going on here.
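For readers following the "given qoke*, what is *?" question, here is a minimal sketch of a character n-gram model with a perplexity score. It is not quimqu's actual code: the add-one smoothing, the function names `train_ngram` and `perplexity`, and the toy strings are all assumptions made for illustration. Whether held-out perplexity falls or rises with n depends on corpus size and smoothing; with little data, long contexts are often unseen in training, which is one way the figure can go up.

```python
import math
from collections import Counter


def train_ngram(text, n):
    """Count every n-gram and its (n-1)-character context in text."""
    positions = range(len(text) - n + 1)
    ngrams = Counter(text[i:i + n] for i in positions)
    contexts = Counter(text[i:i + n - 1] for i in positions)
    return ngrams, contexts


def perplexity(text, ngrams, contexts, n, vocab_size):
    """Perplexity of text under an add-one-smoothed n-gram model (assumed smoothing)."""
    log_prob = 0.0
    count = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        ctx = gram[:-1]
        # P(next char | previous n-1 chars), with add-one smoothing
        p = (ngrams[gram] + 1) / (contexts[ctx] + vocab_size)
        log_prob += math.log2(p)
        count += 1
    return 2 ** (-log_prob / count)


if __name__ == "__main__":
    # Toy strings for illustration only, not a real transliteration.
    train = "qokeey qokal okeey qokar qokeey okal qokeey okeey " * 20
    held_out = "qokal okeey qokeey qokar okal "
    vocab_size = len(set(train + held_out))
    for n in range(1, 6):
        ngrams, contexts = train_ngram(train, n)
        score = perplexity(held_out, ngrams, contexts, n, vocab_size)
        print(n, round(score, 3))
```

A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k characters at each step.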
quimqu > 25-06-2025, 12:37 PM
![[Image: CYM3Nr6.png]](https://i.imgur.com/CYM3Nr6.png)
Rafal > 25-06-2025, 12:37 PM
Quote:But then it's strange that the perplexity goes up with longer ngrams, it shouldn't.
quimqu > 25-06-2025, 03:43 PM
(25-06-2025, 10:17 AM)ReneZ Wrote: Even if we don't strictly consider numbers, but rather some more generic 'enumeration' system, we will run into a problem that I see clearly in my mind, but may not be able to explain clearly.
The words okeey and qokeey can appear near each other, but differ only on the extreme left side of the word.
The words qokal and qokar are also similar and differ only on the extreme right.
Using computer terminology, if this were an enumeration system, it would appear to be neither high-endian nor low-endian, but rather both-endian.
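The "both-endian" observation can be made concrete with a small sketch that measures how many characters two words share at the left and right edges. The helper name `shared_edges` is hypothetical and only illustrates the four words quoted above.

```python
import os


def shared_edges(a, b):
    """Return (shared prefix length, shared suffix length) of two words."""
    prefix = len(os.path.commonprefix([a, b]))
    # Reverse both words so the common suffix becomes a common prefix.
    suffix = len(os.path.commonprefix([a[::-1], b[::-1]]))
    return prefix, suffix


print(shared_edges("okeey", "qokeey"))  # (0, 5): they differ on the left only
print(shared_edges("qokal", "qokar"))   # (4, 0): they differ on the right only
```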
![[Image: pSGS61S.png]](https://i.imgur.com/pSGS61S.png)
![[Image: FK69oVE.png]](https://i.imgur.com/FK69oVE.png)
Jorge_Stolfi > 26-06-2025, 03:46 AM
(25-06-2025, 10:17 AM)ReneZ Wrote: Using computer terminology, if this were an enumeration system, it would appear to be neither high-endian nor low-endian, but rather both-endian.
ReneZ > 26-06-2025, 05:11 AM
(26-06-2025, 03:46 AM)Jorge_Stolfi Wrote: This process has the result that the most frequent words of the language will tend to get the smallest code numbers. If the codes are written in a Roman-like scheme as described in that page, the most common words will tend to get shorter codes, an optimization seen in natural languages and the VMS.
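A rough sketch of the effect described in the quote: rank words by frequency, give the most frequent the smallest code number, and write each number as a numeral. The scheme on the page Jorge_Stolfi refers to is not reproduced in this thread, so ordinary Roman numerals stand in for it here as an assumption, and the names `to_roman` and `assign_codes` and the sample word list are hypothetical.

```python
from collections import Counter

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(num):
    """Standard Roman numeral for a positive integer (stand-in for the real scheme)."""
    out = []
    for value, symbol in ROMAN:
        while num >= value:
            out.append(symbol)
            num -= value
    return "".join(out)


def assign_codes(words):
    """Map each distinct word to a numeral; more frequent words get smaller numbers."""
    ranked = [word for word, _ in Counter(words).most_common()]
    return {word: to_roman(rank) for rank, word in enumerate(ranked, start=1)}


if __name__ == "__main__":
    # Made-up sample, not real word counts from any transliteration.
    sample = ["daiin"] * 3 + ["chedy"] * 2 + ["qokal"]
    for word, code in assign_codes(sample).items():
        print(word, code, len(code))
```

With standard Roman numerals the code length is not strictly increasing in the number (VIII is longer than IX), so "shorter codes for common words" holds only as a tendency, which matches the wording of the quote.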