I got excited about burst detection and went a little further. The logical next step, now that the bursts are detected, is to see whether there is any logic within the bursts on the same page. What I have done is analyze how the words are generated within the manuscript, not so much by looking at whether they are similar, but at whether they follow some generation mechanism.
The key idea is to look at "bursts", groups of related words, and see if there is a pattern within these groups and if this pattern is repeated within the same page.
The first important result is that within a burst the behavior is very coherent. Words do not change arbitrarily, but evolve with small modifications. This points to a local mechanism, almost as if one word is transformed from the previous one.
However, when moving from one burst to another, the behavior changes. It is not completely random, but it does not follow a simple rule either. There are preferences, but not determinism.
This becomes clear with a sequential model. I applied a Markov model, not directly on the words, but on the burst type associated with each word.
That is, I first create burst types with a clustering model (k-means), then assign each word a type (inherited from its burst), and finally analyze the sequence of these types throughout the text.
Therefore, the sequence I model is not:
word1 → word2 → word3
but:
type1 → type2 → type3
This matters because bursts are not contiguous blocks: their words are interleaved on the page. Working at the type level lets me capture the behavior of the system (or at least try to).
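The clustering-then-relabeling step can be sketched roughly as follows. The feature vectors, cluster count, and word-to-burst assignment here are hypothetical placeholders (the real features come from my burst analysis); only the pipeline shape is the point:

```python
# Sketch: cluster bursts into types with k-means, then give each word the
# type of its burst, producing a sequence of types in reading order.
# All data below is synthetic; features/sizes are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# One feature vector per burst (e.g. where edits concentrate, length stats).
burst_features = rng.normal(size=(200, 5))

# Step 1: cluster bursts into k types (k=3 here is an arbitrary choice).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
burst_type = kmeans.fit_predict(burst_features)

# Step 2: each word inherits the type of the burst it belongs to.
word_to_burst = rng.integers(0, 200, size=1000)  # hypothetical assignment
type_sequence = burst_type[word_to_burst]        # type1 -> type2 -> type3 ...
print(type_sequence[:10])
```

The Markov analysis below then runs on `type_sequence`, not on the words themselves.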
With this approach, it is seen that a Markov model clearly improves over assuming independence:
Code:
Model            | Log-loss | Perplexity
-----------------|----------|-----------
Independent      | 1.19     | 3.29
Markov (order 1) | 1.07     | 2.91
Markov (order 2) | 1.01     | 2.73
This indicates that the type of the next word depends on the type of the previous ones; it is not independent. And the fact that order 2 improves on order 1 shows that this dependence is not only on the immediate step: there is memory over more than one step.
In other words, the system does not generate words independently, but rather tends to maintain or change "mode" (burst type) following sequential patterns.
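A minimal version of this comparison can be written with plain transition counts. Note that perplexity is just exp(log-loss) when log-loss is in nats, which matches the numbers in the table (e.g. exp(1.19) ≈ 3.29). The sequence here is synthetic and the evaluation is on the same data the counts were built from, so a real analysis should use held-out text; this only sketches the mechanics:

```python
# Sketch: log-loss of order-k Markov models over a type sequence,
# with add-alpha smoothing. Order 0 is the "independent" baseline.
import numpy as np
from collections import Counter, defaultdict

def markov_logloss(seq, order, n_states, alpha=1.0):
    # Count how often each length-`order` context is followed by each type.
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[tuple(seq[i - order:i])][seq[i]] += 1
    # Mean negative log-probability (nats) of each symbol given its context.
    nll, n = 0.0, 0
    for i in range(order, len(seq)):
        c = counts[tuple(seq[i - order:i])]
        p = (c[seq[i]] + alpha) / (sum(c.values()) + alpha * n_states)
        nll -= np.log(p)
        n += 1
    return nll / n

# Hypothetical type sequence (the real one comes from the clustering step).
rng = np.random.default_rng(0)
seq = list(rng.integers(0, 3, size=2000))

for order in (0, 1, 2):
    ll = markov_logloss(seq, order, n_states=3)
    print(f"order {order}: log-loss {ll:.2f}, perplexity {np.exp(ll):.2f}")
```

On a random sequence like this one, all three orders stay near log(3); the gap in the table above is exactly the signal that the real type sequence is not independent.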
Furthermore, when I try to predict the burst type of a word, it turns out to depend on both context and position:
Code:
Feature       | Importance
--------------|-----------
line_ord      | 0.287
prev_type     | 0.283
prev2_type    | 0.255
pos_norm      | 0.097
pos_in_line   | 0.073
is_line_start | 0.004
I think this is very revealing. Sequential context is very important, but so is vertical position within the page. On the other hand, the beginning of the line does not carry much weight.
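For concreteness, importances like these can come out of a tree-based classifier. The feature names below mirror the table, but the data is synthetic and the target is artificially tied to `prev_type`, so this is only a sketch of the setup, not a reproduction of my numbers:

```python
# Sketch: predict a word's burst type from context/position features with a
# random forest, then read off feature importances. Synthetic data throughout.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "prev_type":     rng.integers(0, 3, n),   # type of the previous word
    "prev2_type":    rng.integers(0, 3, n),   # type two words back
    "line_ord":      rng.integers(0, 30, n),  # line number within the page
    "pos_in_line":   rng.integers(0, 12, n),  # word index within the line
    "pos_norm":      rng.random(n),           # normalized position on page
    "is_line_start": rng.integers(0, 2, n),   # first word of the line?
})
# Synthetic target loosely driven by prev_type so importances are non-trivial.
y = (X["prev_type"] + (rng.random(n) < 0.3)) % 3

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:14s} {imp:.3f}")
```

In the real data the interesting part is that `line_ord` competes with `prev_type`, i.e. vertical position matters almost as much as sequential context.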
When I try to predict whether there will be a type change (a change of state), the result is much weaker:
Code:
Metric   | Value
---------|------
Accuracy | 0.556
ROC AUC  | 0.574
Log-loss | 0.684
These values are only slightly above chance, indicating that the model can hardly predict when a change will occur. This suggests that the transition between states is not governed by a simple, easily observable rule such as position within the line or the immediate context. Although the system shows clear structure within the bursts, the moments when it switches from one type to another are much harder to predict and are not well explained by simple local variables.
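The change-point evaluation itself is straightforward to set up: label each position 1 if the next type differs from the current one, fit any classifier on local features, and score with the same metrics as the table. The features and model below are placeholders (the real run used the features from the importance table), and the random sequence means the scores here are just baseline noise:

```python
# Sketch: score "will the type change here?" with accuracy / ROC AUC / log-loss.
# Sequence, features, and classifier are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

rng = np.random.default_rng(0)
seq = rng.integers(0, 3, size=2000)          # hypothetical type sequence
change = (seq[1:] != seq[:-1]).astype(int)   # 1 where the type switches

# Simple local features: current type and position within a (fake) line.
X = np.column_stack([seq[:-1], np.arange(len(change)) % 10])
clf = LogisticRegression().fit(X, change)
proba = clf.predict_proba(X)[:, 1]

print("Accuracy:", accuracy_score(change, clf.predict(X)))
print("ROC AUC :", roc_auc_score(change, proba))
print("Log-loss:", log_loss(change, proba))
```

An AUC of 0.574 on the real data sits only a little above this kind of baseline, which is what makes the transitions look unpredictable from local variables alone.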
Finally, when I analyze the different types of bursts, I see that they are not arbitrary. Each type has a distinct internal behavior: some modify mainly the end of the word, others the beginning, others are more balanced. In other words, each cluster corresponds to a specific way of transforming words.
Putting all this together, the model that best explains the behavior is this:
“within a state (burst type), the text evolves in a very coherent and local way. From time to time, the system changes state. This change is not random, but not deterministic either, and depends on both context and position.”
The text appears neither random nor purely linguistic in the classical sense. It has discrete states, short memory, and local transformation rules. It is much more compatible with a generative system with rules than with a natural text without explicit structure.