Quote: The key question is, what verifiable features does it predict that we don't yet know about?
For me, the key feature was that rare words co-occur with similar ones. You can check this yourself: choose a rule for selecting some low-frequency word types and check whether these words co-occur with similar ones.
For instance, glyphs besides 'i' and 'e' are rarely duplicated. The bigram 'll' occurs 28 times and the bigram 'dd' occurs 23 times (see [link]). If you check these words you will probably find patterns like the three 'dd' words on page [link] (see [link]).
Even for very rare words it is easy to find this type of pattern.
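A minimal sketch of such a check, assuming the transcription is already available as a mapping from page label to a list of words. The function names and the toy data are invented for illustration and are not taken from any real transcription file:

```python
from collections import defaultdict

def words_with_bigram_by_page(pages, bigram):
    """Map each page label to the words on that page containing `bigram`."""
    hits = defaultdict(list)
    for page, words in pages.items():
        for word in words:
            if bigram in word:
                hits[page].append(word)
    return dict(hits)

def clustered_pages(pages, bigram, min_hits=2):
    """Keep only pages where at least `min_hits` words contain the bigram."""
    hits = words_with_bigram_by_page(pages, bigram)
    return {page: ws for page, ws in hits.items() if len(ws) >= min_hits}

# Toy input, invented for illustration only (not a real transcription).
toy_pages = {
    "p1": ["daiin", "oddy", "chol", "chddy", "oddar"],
    "p2": ["chol", "chor", "shol"],
    "p3": ["qokeddy", "okaiin"],
}

print(clustered_pages(toy_pages, "dd"))
```

If rare words really cluster, a rare bigram should show up several times on a few pages rather than once each on many pages.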
The bigram 'an' occurs 118 times and the bigram 'on' only five times. But this doesn't mean that 'on'-words must be errors:
[link]
It is also possible to search for the word with the highest number of similarities on each page. In this case you would frequently find pairs like 'chol' & 'chor' or 'chedy' & 'chedy':
f1r : chol
[link] : chol
f2r : chy
[link] : chor
[link] : chol
[link] : chor
[link] : chol
[link] : sho
...
f82r : chedy
[link] : chedy
[link] : chedy
[link] : shedy
...
[link] : aiin
[link] : oaiin
[link] : chey
f115v : chedy
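One way to run this per-page search is sketched below. The post does not say how "similar" is defined, so this sketch assumes an edit distance of at most 1. The helper names and the toy word list are hypothetical:

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def most_connected_word(words, max_dist=1):
    """Return (word, score): the type with the most other tokens on the
    page that are identical or within `max_dist` edits (assumed rule)."""
    counts = Counter(words)
    best, best_score = None, -1
    for w in counts:
        score = counts[w] - 1  # other tokens of the same type
        score += sum(n for v, n in counts.items()
                     if v != w and edit_distance(w, v) <= max_dist)
        if score > best_score:
            best, best_score = w, score
    return best, best_score

# Toy page, invented for illustration only.
toy_page = ["chol", "chor", "shol", "daiin", "chol", "cthol"]
print(most_connected_word(toy_page))
```

Applying `most_connected_word` to every page in turn would produce a list in the same shape as the one above. Ties are resolved in favour of the word seen first on the page.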
Or you can check similarly spelled word types. In most cases, types that contain less frequent glyphs or bigrams also occur less frequently:
bigram pair = frequencies - the most common word containing each glyph sequence
cho : cha = 2552:468 - 'chol' (396 times) : 'char' (72 times)
sho : sha = 949:127 - 'shol' (186 times) : 'shar' (34 times)
qo : qa = 5186:8 - ...
ok : ak = 5950:36
ot : at = 3767:11
op : ap = 560:6
of : af = 154:3
ar : or = 3151:2723
al : ol = 3002:5507
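Counts like these can be reproduced with a few lines of code. The sketch below assumes that a frequency means the number of word tokens containing the glyph sequence; the post does not state the exact counting rule, so treat that as an assumption. The function names and the toy word list are hypothetical:

```python
from collections import Counter

def tokens_containing(words, seq):
    """Number of word tokens that contain the glyph sequence `seq`."""
    return sum(1 for w in words if seq in w)

def frequency_pair(words, seq_a, seq_b):
    """Frequencies of two competing sequences, e.g. 'cho' versus 'cha'."""
    return tokens_containing(words, seq_a), tokens_containing(words, seq_b)

def most_common_word_with(words, seq):
    """Most frequent word containing `seq` as (word, count), or None."""
    counts = Counter(w for w in words if seq in w)
    return counts.most_common(1)[0] if counts else None

# Toy token list, invented for illustration only (not real corpus counts).
toy_words = ["chol", "chol", "chor", "char", "shol", "shar", "chol"]

print(frequency_pair(toy_words, "cho", "cha"))
print(most_common_word_with(toy_words, "cho"))
```

Run over a full transcription, `frequency_pair` gives the left-hand ratio of each row and `most_common_word_with` gives the example word on the right.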