I believe that the triplets discussed by Emma are indeed related words. This appears to be particularly likely for the qoX / oX variants.
The issue pointed out by Zhe could be especially tricky in the VMS. In particular, the Currier A/B continuum could reflect some sort of drift or evolution in spelling preferences. If so, two words A1 and A2 that are different inflections of the same stem may appear as B1 and B2 (or A3 and A4) on another page.
There are a number of approaches that could be tried. For instance:
- The examples proposed by Zhe suggest that consonants are much more resilient than vowels: one could attempt matching words (e.g. labels vs text paragraphs) in such a way that consonants have a greater weight than vowels.
- If we could find a way to map Currier A spellings onto Currier B spellings, it would become possible to recognize occurrences of the same word that are currently hidden by the different spelling preferences.
- One could check which similar words are also close in space (e.g. occur in the same bifolio). We know that this will often be the case; conversely, similar words that only occur far away from each other are more likely to be unrelated. With enough data, it might be possible to distinguish the differences that are compatible with different forms of the same stem from those that indicate genuinely different words.
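
To make the first idea a bit more concrete, here is a rough Python sketch of an edit distance in which changes to "consonants" cost more than changes to "vowels". Everything specific in it is my assumption: the set VOWEL_LIKE (which EVA characters to treat as vowel-like), the 0.3 weight, and the function names are made up for illustration, and it treats every EVA character as a separate glyph (so "ch" counts as two), which is a simplification.

```python
# Hypothetical sketch: consonant-weighted edit distance between two words
# in an EVA-like transcription. Which glyphs count as "vowel-like" is an
# assumption made only for this example, not an established fact.
VOWEL_LIKE = set("aoeiy")


def glyph_weight(glyph, vowel_weight=0.3):
    """Cost of inserting or deleting one glyph."""
    return vowel_weight if glyph in VOWEL_LIKE else 1.0


def weighted_distance(w1, w2, vowel_weight=0.3):
    """Levenshtein-style distance where vowel-like glyphs are cheap to change."""
    n, m = len(w1), len(w2)
    # dp[i][j] = minimal cost of turning w1[:i] into w2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + glyph_weight(w1[i - 1], vowel_weight)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + glyph_weight(w2[j - 1], vowel_weight)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = w1[i - 1], w2[j - 1]
            if a == b:
                substitution = 0.0
            else:
                # A substitution is cheap only if both glyphs are vowel-like.
                substitution = max(glyph_weight(a, vowel_weight),
                                   glyph_weight(b, vowel_weight))
            dp[i][j] = min(
                dp[i - 1][j] + glyph_weight(a, vowel_weight),  # delete a
                dp[i][j - 1] + glyph_weight(b, vowel_weight),  # insert b
                dp[i - 1][j - 1] + substitution,               # match/substitute
            )
    return dp[n][m]


if __name__ == "__main__":
    for pair in [("daiin", "dain"), ("chol", "chal"), ("chol", "chor")]:
        print(pair, weighted_distance(*pair))
```

With these assumed weights, a pair that differs only in a vowel-like glyph scores much closer than a pair that differs in a consonant-like glyph. The same distance could be restricted to pairs of words from the same bifolio to combine it with the third idea.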