30-03-2018, 11:12 PM
Let's denote the probability of "daiin" occurring as "p".
Now, assuming that there are no interdependencies between words, i.e. each word is independent of the preceding one, the probability of the pattern would be p*(1-p)*p*(1-p).
The value of p may be taken from some assumption (e.g. Zipf's law), or calculated from the actual count of "daiin". If taken from the actual count, it's 845/37718 = 2.240e-2, hence the pattern's probability would be 4.80e-4, which, multiplied by the ~37718 possible pattern positions in total, yields an expected count of 18.1 versus your 18.9.
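For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation under the independence assumption. The function name is my own, and using the total word count as the number of pattern positions is the same approximation as above.

```python
def expected_pattern_count(word_count: int, total_words: int) -> float:
    """Expected count of the pattern word, other, word, other,
    assuming each word is independent of the preceding one."""
    # Probability of the word, estimated from its actual count.
    p = word_count / total_words
    # Independence assumption: the pattern probability is a plain product.
    pattern_probability = p * (1 - p) * p * (1 - p)
    # Approximation: the number of pattern positions is taken as total_words.
    return pattern_probability * total_words


# Counts quoted above: 845 occurrences of "daiin" among 37718 words.
expected = expected_pattern_count(845, 37718)
print(f"{expected:.1f}")  # prints 18.1
```

Plugging in another word's count gives the corresponding independence baseline for that word.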
Assuming there are interdependencies (which one normally would expect from a meaningful text), we need to somehow estimate them.