14-12-2019, 09:35 AM
Interesting, Marco.
I am not aware that this has been tried in any serious manner before.
I wonder about your distance measure. I don't know what is "right" of course, but one can do several things.
My inclination would be to look at the bigram distribution of the result.
There is a statistical advantage: there are many more samples and fewer distinct items, so the result would be numerically more significant.
The other advantage is a bit more hypothetical.
The difference between A and B can be more about the 'rules' of the 'dialect', or it can be more about a different vocabulary (different subject matter).
Looking at bigrams will concentrate more on the former and, assuming that such rules exist, will be less affected by changing vocabulary or subject matter.
From my experiments with alternative HMMs, I found that the Bhattacharyya distance works better for comparing bigram distributions than simply taking the root-sum-square (RSS) of the frequency differences, but this may just be fine-tuning.
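To make the comparison concrete, here is a minimal Python sketch of the two measures mentioned above. It assumes character bigrams counted within words (the post does not say whether bigrams are taken within or across words), and the function names and the toy tokens are made up for illustration, not taken from any transcription. It is not the code used in the HMM experiments.

```python
from collections import Counter
from math import log, sqrt


def bigram_distribution(words):
    """Relative frequencies of character bigrams, counted within each word.

    Assumption: bigrams do not span word boundaries.
    """
    counts = Counter()
    for word in words:
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient sum(sqrt(p_i * q_i))."""
    shared = p.keys() & q.keys()
    coefficient = sum(sqrt(p[k] * q[k]) for k in shared)
    if coefficient == 0.0:
        return float("inf")  # no bigram in common
    # Clamp so rounding errors cannot produce a negative distance.
    return max(0.0, -log(min(coefficient, 1.0)))


def rss_distance(p, q):
    """Root-sum-square of the differences between bigram frequencies."""
    every = p.keys() | q.keys()
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in every))


if __name__ == "__main__":
    # Hypothetical toy samples standing in for text A and text B.
    sample_a = ["abab", "abba", "baab"]
    sample_b = ["abab", "bbbb", "abbb"]
    dist_a = bigram_distribution(sample_a)
    dist_b = bigram_distribution(sample_b)
    print("Bhattacharyya:", bhattacharyya_distance(dist_a, dist_b))
    print("RSS:", rss_distance(dist_a, dist_b))
```

One difference worth noting: the Bhattacharyya distance only rewards bigrams that occur in both samples and becomes infinite when there is no overlap at all, whereas the RSS measure is dominated by the few most frequent bigrams.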