Hi Marco,
Quote:if I understand correctly "expansion" will only increase entropy if it maps to different sequences. If you always replace, say o with [ab] the result will be even lower entropy (e.g. all low-entropy qo- sequences will be converted in even lower entropy qab- sequences). Is it so?
Not necessarily.
In the first place, information entropy is a characteristic of the information source, not of any individual pattern that the source produces. In this view, it is not correct to speak of a "low-entropy" "qo"-sequence or of a "high-entropy" one. Entropy is calculated over the whole text - technically, over a sample of considerable length, so that the result of the calculation represents a characteristic of the information source.
Second, h2 is the measure of mean information per character, given that the preceding character is known. So the result will depend on the total set of "expansions", not on any individual expansion considered per se - and also on whether the symbols produced by an expansion do or do not already appear in the original text.
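For illustration, here is a minimal Python sketch of how such an h2 value can be estimated from a text sample. The function name and the direct estimation from raw bigram counts are my own choices, not a reference implementation:

```python
from collections import Counter
from math import log2

def h2(text):
    """Mean information per character (bits), given the preceding character,
    estimated from the bigram and unigram counts of the sample itself."""
    pairs = Counter(zip(text, text[1:]))  # counts of adjacent character pairs
    prev = Counter(text[:-1])             # counts of preceding characters
    n = len(text) - 1                     # total number of bigrams
    # h2 = -sum over all pairs of p(a,b) * log2 p(b|a)
    return -sum(c / n * log2(c / prev[a]) for (a, b), c in pairs.items())
```

With this estimator, a text in which every character fully determines the next one (e.g. "ababab...") gives h2 = 0, as expected.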
As an example, let's take Davidsch's signature right above. The sample is too short for the results to be characteristic of Davidsch's written speech, but the maths are the same regardless of the size of the sample, so we will still be able to observe the changes.
The original text (capitalization and punctuation removed for simplicity) is:
Code:
do with this posting what you want if you simply reply what do you mean i do not understand you i will not respond because then you did not read it well enough
For this text, the h2 calculation yields 2.15.
Let's now "expand" the letter "o" into the sequence [xz], where neither the letter "x" nor the letter "z" is present in the original text:
Code:
dxz with this pxzsting what yxzu want if yxzu simply reply what dxz yxzu mean i dxz nxzt understand yxzu i will nxzt respxznd because then yxzu did nxzt read it well enxzugh
For this text, the h2 value decreases to 1.97.
Now let us introduce another expansion - of the letter "h" into the sequence [xd], where both letters "x" and "d" are present in our preceding revision (and the letter "d" is present even in the original text):
Code:
dxz witxd txdis pxzsting wxdat yxzu want if yxzu simply reply wxdat dxz yxzu mean i dxz nxzt understand yxzu i will nxzt respxznd because txden yxzu did nxzt read it well enxzugxd
For this text, the h2 value increases to 2.04 from the previous 1.97.
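Since both expansions are plain one-to-many substitutions, the experiment is easy to repeat. Below is a Python sketch of my own; the bigram-based h2 estimator in it may differ from my figures in minor conventions (e.g. how word spaces are treated), so the exact numbers may deviate slightly, but the direction of the changes is the point:

```python
from collections import Counter
from math import log2

def h2(text):
    # conditional entropy (bits per character) of a character given the
    # character that precedes it, estimated from the sample itself
    pairs = Counter(zip(text, text[1:]))
    prev = Counter(text[:-1])
    n = len(text) - 1
    return -sum(c / n * log2(c / prev[a]) for (a, b), c in pairs.items())

original = ("do with this posting what you want if you simply reply "
            "what do you mean i do not understand you i will not respond "
            "because then you did not read it well enough")

step1 = original.replace("o", "xz")   # first expansion: o -> [xz]
step2 = step1.replace("h", "xd")      # second expansion: h -> [xd]

print(h2(original), h2(step1), h2(step2))
```

The first expansion can only lower h2 here: every "x" is followed by "z" with certainty, so the total surprisal of the text is unchanged while its length grows. The second expansion reuses letters that already occur, making contexts ambiguous again, which pushes h2 back up.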
Quote:The other point that is unclear to me is why you say that this "moves us from the plain language to cipher". For instance, in Latin abbreviation, it was common to have symbols that expanded into more than one syllable. In the attached verse (from the bottom of [link]), a “crossed p” is used for both “par” and “per”. Is it really necessary to think of a cipher to consider similar possibilities? Could we maybe say that this "moves us from a plain phonetic script to some kind of abbreviation"?
Abbreviation can be considered a cipher in its essence - a set of rules to convert original plain text into its representation.