03-10-2017, 08:14 AM
While doing some experimentation in this area, I ran into the following complication.
When working with character pair statistics, which are relevant both for entropy calculations and HMM analyses, there are three different ways of treating word spaces.
1) One counts word spaces among the characters to be analysed.
2) One deletes word spaces completely.
3) One treats word spaces as breaks in the string.
Option 2 seems the least interesting, for obvious reasons.
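For anyone who wants to try this themselves, the three options amount to roughly the following. This is just an illustrative Python sketch, with made-up function names, not the exact code I used:

def bigrams_with_spaces(text):
    # Option 1: the word space is just another character.
    return [(text[i], text[i + 1]) for i in range(len(text) - 1)]

def bigrams_no_spaces(text):
    # Option 2: delete the word spaces, then pair up whatever remains.
    s = text.replace(" ", "")
    return [(s[i], s[i + 1]) for i in range(len(s) - 1)]

def bigrams_within_words(text):
    # Option 3: word spaces break the string; only pairs inside a word count.
    pairs = []
    for word in text.split():
        pairs.extend((word[i], word[i + 1]) for i in range(len(word) - 1))
    return pairs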
If one chooses option 3, one is analysing only the internal structure of the words. The number of character pairs is limited to pairs inside words. The problem is that the character frequencies for:
- all occurrences of character X (i.e. any character)
- the occurrences of character X as the first of the pair
- the occurrences of character X as the second of the pair
are no longer the same, as the small example below shows.
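To make the mismatch concrete, the three counts under option 3 can be tallied as follows (again only a sketch):

from collections import Counter

def option3_counts(text):
    # Per-character totals under option 3 (pairs inside words only).
    all_chars = Counter()
    first_of_pair = Counter()
    second_of_pair = Counter()
    for word in text.split():
        all_chars.update(word)
        first_of_pair.update(word[:-1])   # the last letter of a word never starts a pair
        second_of_pair.update(word[1:])   # the first letter of a word never ends a pair
    return all_chars, first_of_pair, second_of_pair

For example, in "ave maria" the letter 'a' occurs 3 times in total, but only twice as the first of a pair and twice as the second of a pair.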
This is of course expected, but what surprised me is that even for a "well-behaved" language like Latin, the differences are significant.
For Voynichese, with its strong position-dependent behaviour, the differences completely overwhelm the results.
The clean way out is to treat spaces as characters throughout all calculations.
By 'circularising' the text, i.e. adding one space at the end and pretending that the very first character of the text follows it, one ensures that there are exactly as many character pairs as single characters, and also that all frequencies of character X:
- as single character
- as first of a pair
- as second of a pair
are the same (see the sketch below).
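In code, the circularisation comes down to something like this (same illustrative style as above):

def circular_bigrams(text):
    # 'Circularise': append one space so the text ends in a word break,
    # then let the last character pair with the first one.  This gives
    # exactly one pair per character, so the single-character counts,
    # the first-of-pair counts and the second-of-pair counts all coincide.
    s = text + " "
    return [(s[i], s[(i + 1) % len(s)]) for i in range(len(s))]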
While this solves the problem, the issue remains that calculations based on different treatments of spaces will yield significantly different results.
PS: the text should first have been 'cleaned up', by replacing all punctuation with word spaces and making sure that there is exactly one space between each pair of words.
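That clean-up step can be done with something like the following (a sketch only; what counts as punctuation of course depends on the transliteration used):

import re

def clean_text(raw):
    # Replace every punctuation mark by a word space, then collapse runs
    # of whitespace into a single space and trim the ends.
    no_punct = re.sub(r"[^\w\s]", " ", raw)
    return re.sub(r"\s+", " ", no_punct).strip()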