Summary
A q-token is a token (word occurrence) whose EVA transcription begins with the EVA glyphs qo or oqo. This note examines the spacing between q-tokens in the paragraph text of the VMS.
More specifically, a plain token is a token that is not a q-token; a q-gap is the list of plain tokens in a parag before the first q-token (a BQ gap), between two successive q-tokens (a QQ gap), or after the last q-token (a QE gap).
In particular, if the first token of a parag is a q-token, we have an empty BQ gap. If the last token is a q-token, we have an empty QE gap. And if two q-tokens occur in consecutive positions, we have an empty QQ gap. This post reports statistics on the lengths of those three kinds of q-gaps.
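To make the definitions concrete, here is a minimal Python sketch of how the q-gaps of one parag could be extracted from its token list. It is not the script used for the plots; the function names `is_q_token` and `q_gaps` are made up for illustration, and the sample tokens in the usage note are only examples. A parag with no q-token at all is reported as a single BE gap.

```python
def is_q_token(token):
    """True if the EVA token begins with 'qo' or 'oqo'."""
    return token.startswith(("qo", "oqo"))


def q_gaps(tokens):
    """Split one parag's token list into (kind, plain_tokens) pairs.

    kind is 'BQ', 'QQ', or 'QE', or 'BE' when the parag
    contains no q-token at all.
    """
    gaps = []
    current = []       # plain tokens collected since the last q-token
    seen_q = False     # has any q-token been seen in this parag yet?
    for tok in tokens:
        if is_q_token(tok):
            gaps.append(("QQ" if seen_q else "BQ", current))
            current = []
            seen_q = True
        else:
            current.append(tok)
    # Whatever is left after the last q-token (or the whole parag).
    gaps.append(("QE" if seen_q else "BE", current))
    return gaps
```

For example, `q_gaps(["daiin", "qokeedy", "qokain", "shedy", "chol"])` gives a BQ gap of size 1, an empty QQ gap, and a QE gap of size 2.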
The motivation was to test the hypothesis that the qo and oqo glyphs could be start-of-sentence markers, or more generally could be clues to sentence structure like subject case markers, verbal tense markers, etc. The results were not what I had expected, but are intriguing nonetheless.
Results
The following histograms show how many q-gaps there are of each size and type, in the Starred Parags ('str') section:
Discussion
One observation we can make from these plots is that the distribution of q-tokens in a paragraph is not random. If a token could be plain or q-token with the same probability at any position, independently (a Markov process of order zero), then the number of q-gaps (of any type) of size k would be a decaying exponential function A*p**k, shown in the plots as the blue line with dots. For the 'str' section, the parameter p is 0.8222.
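The reference curve can be sketched in a few lines of Python. This assumes that p is the per-position probability of a plain token, so that a gap of size k has probability (1 - p) * p**k, and that A is chosen as n_gaps * (1 - p) so the curve sums to the total number of gaps. The truncation caused by finite parag lengths is ignored, and the actual normalization used for the blue line may differ. The name `expected_gap_counts` and the argument `n_gaps` are hypothetical.

```python
def expected_gap_counts(p, n_gaps, max_size):
    """Expected number of gaps of size 0..max_size in the zero-order model.

    Returns the list [A * p**k for k = 0..max_size] with
    A = n_gaps * (1 - p), where p is the probability of a plain token.
    """
    a = n_gaps * (1.0 - p)
    return [a * p ** k for k in range(max_size + 1)]
```

With p = 0.8222, each bar of the model is about 82% of the previous one, which is why the observed excesses and scarcities at specific gap sizes stand out against it.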
Compared to the random model, in the BQ plot we see a clear excess of parags with 2 to 6 plain tokens before the first q-token, and a scarcity of parags where the first q-token occurs at the beginning (gap size 0) or only after 7 to 12 plain tokens.
A similar pattern is visible in the QE plot. There is a relative excess of parags with 5 or 6 plain tokens after the last q-token, and a relative scarcity of parags that end with a q-token, or with exactly 2, 4, or 8 plain tokens.
On the other hand, pairs of consecutive q-tokens (QQ gaps of size 0) are much more common than expected, as are pairs separated by a single plain token, whereas pairs separated by three plain tokens are visibly less common than expected.
It is not obvious how one could reproduce these deviations from the zero-order Markov model with some other simple random generator.
Technical details
Input file
For this analysis, an EVA transcription of the parag text from selected pages was reformatted by joining all lines of each parag into a single sequence of tokens. Line breaks internal to the parag, EVA dubious space codes ',', and figure intrusion markers '-' were converted to EVA word spaces '.'.
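A rough Python sketch of this reformatting step follows. It assumes each input line is bare EVA text with any locators and comments already stripped, which may not match the actual transcription file; the function name `flatten_parag` is hypothetical. Here every ',' is simply turned into '.'; the randomized treatment used for the plots is described under "Handling of dubious spaces".

```python
import re


def flatten_parag(lines):
    """Join the lines of one parag into a flat list of EVA tokens."""
    # A line break inside the parag counts as a word space.
    text = ".".join(line.strip() for line in lines)
    # Dubious spaces ',' and figure intrusion markers '-' become '.'.
    text = re.sub(r"[,\-]", ".", text)
    # Split on word spaces, dropping empty pieces from repeated dots.
    return [tok for tok in text.split(".") if tok]
```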
For this analysis, a token was considered invalid if it contained a q glyph but did not start with qo or oqo; or if it started with '?' or 'o?', so that it could not be determined if it was a q-token or a plain one. Any q-gaps that contained invalid tokens were excluded from the plot.
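The classification rule above might be coded as follows. This is only a sketch with a hypothetical function name, `classify_token`. It assumes the unreadable-start test is applied first, and it does not decide whether an invalid token also ends a gap, since that is not specified here.

```python
def classify_token(token):
    """Return 'q', 'plain', or 'invalid' for one EVA token."""
    if token.startswith(("?", "o?")):
        # Unreadable start: cannot tell a q-token from a plain one.
        return "invalid"
    if token.startswith(("qo", "oqo")):
        return "q"
    if "q" in token:
        # Contains a q glyph but does not start with qo or oqo.
        return "invalid"
    return "plain"
```

A gap would then be kept for the plots only if every token in it classifies as 'plain'.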
BE gaps
There were a few parags with no q-tokens at all. In such cases the entire parag is a q-gap, of a separate type (a BE gap). In the zero-order Markov model, these BE gaps too have an exponentially decaying distribution, with the same exponent. However, there were too few of them to yield a meaningful plot.
Handling of dubious spaces
For the plots above, each EVA dubious space code ',' was mapped at random to either '.' or nothing, with equal probability.
This hack had relatively little impact on the statistics above. Changing the probability of conversion to dot from 50% to 0% or 100% only shifted the histograms a little, without affecting the qualitative conclusion -- that the q-gap sizes are far from random.
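For concreteness, the random mapping could look like the Python sketch below. The function name `resolve_dubious_spaces` and the parameters `p_dot` and `rng` are made up for illustration; setting `p_dot` to 0.0 or 1.0 reproduces the two extreme cases mentioned above.

```python
import random


def resolve_dubious_spaces(text, p_dot=0.5, rng=None):
    """Replace each ',' by '.' with probability p_dot, else delete it."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch == ",":
            # random() is in [0, 1), so p_dot=0 always deletes
            # the comma and p_dot=1 always turns it into a dot.
            if rng.random() < p_dot:
                out.append(".")
        else:
            out.append(ch)
    return "".join(out)
```

Passing a seeded generator, e.g. `rng=random.Random(1)`, makes a run repeatable, which helps when comparing histograms for different values of `p_dot`.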
All the best, --stolfi