Hello again (warning, this post is long!),
I have been wondering whether the method I have been using really gives us the desired information. I think the definition of the core and the way bursts were generated were perhaps not the best for trying to understand how similar words are generated, so I have reconsidered certain points. Forgive the length of this post, but I think it is quite interesting.
About the bursts: before this post, the burst detection followed a fairly natural idea: the burst starts with the first word and the following are considered variations of it. This is simple and works well for segmenting the text, but it has one major problem: it assumes that the first word is the origin of the others. And this does not have to be true.
If the text is generated by modifying words already written on the page, it is most likely that the most frequent words will be the ones that form the basis. Rare words could, on the other hand, be the end result of a transformation. This means that the core of a burst cannot be defined simply as the first word.
From here, the first decision I made was to separate two things that until now I had mixed up: detecting which words form the same group, and understanding how they are generated from each other.
New way to detect bursts: bursts are detected within each page, but with two restrictions:
- Only words with a minimum length of 2 in CUVA are considered, to avoid noise (note that a length of 2 in CUVA can easily correspond to a length of 3, 4 or 5 in EVA).
- And, above all, bursts are not defined with a fixed window, but grow as long as there is continuity, and close when the page ends.
This ensures two important things: there are no overlaps, and all tokens within a burst are in the same real visual context. In addition, each word is stored with its exact position, line and position within the line. This allows the phenomenon to be analyzed in two dimensions, not just as a linear sequence.
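A minimal Python sketch of how such a segmentation could look, not the code actually used. `Token`, `detect_bursts`, the `max_dist` threshold and the reading of "continuity" (a word joins the current burst if it is within a small edit distance of some word already in it) are all assumptions for illustration, and the words are toy strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str  # transcribed word (toy strings here)
    line: int  # line number within the page
    pos: int   # position within the line

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def detect_bursts(page_tokens, min_len=2, max_dist=1):
    """Split the tokens of ONE page into non-overlapping bursts.

    Assumption: a burst keeps growing while the next word is within
    `max_dist` edits of some word already in it. Words shorter than
    `min_len` are skipped and are assumed not to break continuity.
    """
    bursts, current = [], []
    for tok in page_tokens:
        if len(tok.word) < min_len:
            continue
        if current and any(levenshtein(tok.word, t.word) <= max_dist for t in current):
            current.append(tok)
        else:
            if len(current) > 1:      # assumption: a lone word is not a burst
                bursts.append(current)
            current = [tok]
    if len(current) > 1:              # the page ends, so the last burst closes
        bursts.append(current)
    return bursts

page = [Token("daiin", 1, 1), Token("dain", 1, 2), Token("o", 1, 3),
        Token("daiir", 2, 1), Token("chedy", 2, 2), Token("shedy", 2, 3)]
for burst in detect_bursts(page):
    print([(t.word, t.line, t.pos) for t in burst])
```

Because the function only ever sees one page and each token is appended to exactly one burst, the two properties mentioned above (no overlaps, same page context) hold by construction in this sketch.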
Redefinition and detection of the "core" of the burst: Instead of defining a single core, four different types have been defined, each with a different meaning.
- The first is the "anchor core", which is simply the first word of the burst. It serves to order and reference, but does not imply causality.
- The second is the "central core", which is the most formally representative word, the one that most resembles the rest.
- The third is the "frequent core", which is the most repeated form within the burst. This is interesting because it could represent the "base vocabulary" that the copyist reuses.
- The fourth is the most important: the "generative core". This is not defined by position or frequency, but by seeing which word best explains the rest if we assume that words are generated by small modifications.
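The four definitions can be sketched roughly as follows. This is only an illustration under assumptions: the function name `cores`, the use of Levenshtein distance as the measure of formal similarity, and above all the scoring of the generative core (the word from which the largest number of other tokens can be reached with at most `max_edit` modifications) are hypothetical choices, and the burst is made of toy strings.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cores(burst, max_edit=1):
    """Return the four core types for a burst given as words in page order."""
    # Anchor core: simply the first word.
    anchor = burst[0]
    # Central core: smallest total distance to all tokens of the burst.
    central = min(burst, key=lambda w: sum(levenshtein(w, o) for o in burst))
    # Frequent core: most repeated form.
    frequent = Counter(burst).most_common(1)[0][0]
    # Generative core (assumed scoring): the token that "explains" the most
    # other tokens, i.e. has the most other tokens within max_edit edits.
    def explained(i):
        return sum(1 for j, other in enumerate(burst)
                   if j != i and levenshtein(burst[i], other) <= max_edit)
    generative = burst[max(range(len(burst)), key=explained)]
    return {"anchor": anchor, "central": central,
            "frequent": frequent, "generative": generative}

burst = ["otar", "okal", "okar", "okal", "otal", "okal", "okaly"]
print(cores(burst))
```

In this toy burst the anchor core is `otar`, while the other three cores all fall on `okal`, which illustrates how the first word need not be the form the rest is built on.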
The most important change I've made is not just redefining the core, but completely changing the model. Instead of saying "all words come from the core," what I do is, for each word, look for which other previous word is the best candidate to be its origin.
This creates a network of parent-child relationships within the burst.
Each relationship is calculated by combining three factors: formal similarity between words, spatial proximity within the page, and a slight preference for more frequent forms.
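A minimal sketch of such a parent search, assuming a simple weighted sum. The weights, the normalised edit-distance similarity, the Manhattan-style page distance and the names `Token` and `best_parents` are assumptions for illustration only, not the actual scoring used.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str  # transcribed word (toy strings here)
    line: int  # line number within the page
    pos: int   # position within the line

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_parents(burst, w_sim=0.6, w_space=0.3, w_freq=0.1):
    """For each token, pick the best-scoring EARLIER token as its parent.

    Returns {child_index: parent_index}. The first token has no parent.
    """
    counts = Counter(t.word for t in burst)
    parents = {}
    for i in range(1, len(burst)):
        child = burst[i]

        def score(j, child=child):
            p = burst[j]
            # Formal similarity: 1.0 for an exact copy, lower as edits grow.
            sim = 1 - levenshtein(child.word, p.word) / max(len(child.word), len(p.word))
            # Spatial proximity: closer on the page gives a higher value.
            prox = 1 / (1 + abs(child.line - p.line) + abs(child.pos - p.pos))
            # Slight preference for forms that are frequent in the burst.
            freq = counts[p.word] / len(burst)
            return w_sim * sim + w_space * prox + w_freq * freq

        parents[i] = max(range(i), key=score)
    return parents

burst = [Token("okal", 1, 1), Token("otar", 1, 2),
         Token("okar", 2, 1), Token("okary", 2, 2)]
print(best_parents(burst))
```

With these toy tokens, `okar` is attached to `okal` (the word directly above it) rather than to the equally similar `otar`, and `okary` is attached to `okar`, not to the first word. The result is a chain, which is the difference between a progressive, local model and a radial one.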
This model is much more realistic, because it allows generation to be progressive and local, not radial.
The results are quite clear.
[attachment=14700] [attachment=14699]
The vast majority of parent-child relationships have a very low distance. In many cases the child is an exact copy of the parent or differs from it by a single modification. This indicates that the system does not generate words arbitrarily, but rather by small variations on existing forms.
When looking at the type of modification, the dominant pattern is very clear. Almost 40% are exact copies, and 30% are single-character substitutions. Insertions and deletions are much less frequent.
This suggests a very simple rule: first a word is copied, and then, sometimes, it is modified.
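For reference, this kind of classification of a parent-child pair can be sketched in a few lines. The function name `classify` and the label "other" for anything beyond a single edit are assumptions, and the pairs are toy strings, not real data.

```python
from collections import Counter

def _one_removed(longer: str, shorter: str) -> bool:
    """True if deleting exactly one character from `longer` gives `shorter`."""
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def classify(parent: str, child: str) -> str:
    """Label the change from parent to child as a single edit type."""
    if parent == child:
        return "copy"
    if len(parent) == len(child):
        diffs = sum(a != b for a, b in zip(parent, child))
        return "substitution" if diffs == 1 else "other"
    if len(child) == len(parent) + 1 and _one_removed(child, parent):
        return "insertion"   # the child has one extra character
    if len(parent) == len(child) + 1 and _one_removed(parent, child):
        return "deletion"    # the child has lost one character
    return "other"           # more than one modification

pairs = [("okal", "okal"), ("okal", "okar"), ("okal", "qokal"),
         ("okaly", "okal"), ("okal", "chedy")]
print(Counter(classify(p, c) for p, c in pairs))
```

Tallying the labels over all parent-child pairs gives the kind of percentage breakdown quoted above.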
Thanks to having the position of each word, the direction of generation can be studied.
[attachment=14698]
A significant portion of the relationships occur within the same line or very close together, often to the right. This fits with sequential writing.
But there are also many relationships between separate lines. This indicates that the system does not depend only on the immediately preceding word (within the burst, of course), but on a wider set of forms visible on the page.
Furthermore, when the distance is greater, the probability of modification increases and the share of exact copies decreases. In other words, the further away the source is, the more it is transformed (although the effect is small).
[attachment=14696]
When comparing which type of core actually acts as the origin of words, the result is consistent. The first token of the burst (the anchor core) is not the best candidate. In contrast, the central and generative cores more often match the actual parents. The frequent core also fits well, especially from a formal point of view.
[attachment=14697]
This reinforces the idea that the system is based on already established and reused forms, not on an expansion from a single starting point.
Provisional conclusion as of today: the change in model has been to move from a simple view, where a burst is an expansion around an initial word, to a dynamic view, where words are generated from other visible words through small modifications.
The results indicate a strongly structured system, based on copying, minimal variation and local reuse. It does not appear to be a random or purely linear process.
This opens the door to going one step further: identifying the specific rules of transformation and seeing if they can be formalized as a kind of productive grammar of the text.
I've started to address this issue and I'll give you a glimpse: when you look at the specific modifications, very consistent patterns emerge. There are substitutions that are repeated many times, often in specific positions within the word. It doesn't seem like a random process.
Even more interesting is the case of insertions and deletions. Most of them are concentrated at the beginning of the word. This suggests that a significant part of the variation occurs by adding or removing elements at the beginning.
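As an illustration of how the position of an insertion or deletion could be located, here is a small sketch. The name `indel_position`, the three-way split into "start", "middle" and "end", and the choice of the leftmost position when several are possible (for example with doubled characters) are assumptions, not the actual procedure.

```python
def indel_position(parent: str, child: str):
    """Locate a single inserted or deleted character.

    Returns "start", "middle" or "end" (relative to the longer word),
    or None if the pair does not differ by exactly one insertion/deletion.
    """
    longer, shorter = (child, parent) if len(child) > len(parent) else (parent, child)
    if len(longer) != len(shorter) + 1:
        return None
    for i in range(len(longer)):
        # Removing character i from the longer word must give the shorter one.
        if longer[:i] + longer[i + 1:] == shorter:
            if i == 0:
                return "start"
            if i == len(longer) - 1:
                return "end"
            return "middle"
    return None

for parent, child in [("okal", "qokal"), ("okaly", "okal"), ("okal", "okaal")]:
    print(parent, "->", child, indel_position(parent, child))
```

Counting these labels over all insertion and deletion pairs would show whether the changes really cluster at the beginning of the word.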
Overall, this points to a reduced set of recurrent operations, not to free combinations.