(21-05-2022, 06:44 PM)bi3mw Wrote: Can you also sort the "Dots table" by rank ?
Hopefully, you mean like this. (else send me a pm)
Same data with columns reversed: [attachment=6558]
@Koen G,
Regarding the question I asked in post #1 <<"What does it mean when ...?">>
Yes, sorry, my bad, ignore it; I wasn't looking at all the data.
I was wondering if the most common, separated bigrams ( Top 50 ) are also correspondingly common in the VMS text. In other words: For example, is "yq" also the most frequently occurring bigram in percentage terms in the entire VMS text ?
The result shows that there is no recognizable connection.
[attachment=6559]
I do not quite understand, according to my question the result is clear. Could you perhaps explain your point of view in more detail ?
The tallest red spikes are to the left, so it looks like "real" bigrams vs separated bigrams are not only separate categories but also exclude each other to some extent. Just like there is only one discernable red bar among the top five blue ones.
Yes, I agree with that. - By the way, I added the dot- and comma-separated bigrams together so that a comparison is possible overall. Of course, I have recalculated the percentages.
This histogram is similar to what Matthias posted, but I split certain (green) and uncertain (yellow) spaces. Data are sorted by increasing certain-space % value. I think this can help illustrate Patrick's findings.
Many bigrams have a clear preference for appearing adjacent or separated by certain spaces: they occupy the opposite sides of the graph. When it is not clear whether a bigram should be regarded as mostly conjoint or disjoint, we can expect a peak for uncertain spaces. Uncertain spaces tend to appear in those cases where both certain spaces and adjacency have significant values. They tend to cluster at the middle of the histogram, resulting in a gradual shift from adjacency, to uncertain spaces, to certain spaces. The scribes mostly wrote uncertain spaces for those bigrams whose statistical nature is ambiguous about adjacency / spacing.
EDIT1: C and S correspond to EVA ch / sh
EDIT2: here % values are computed with respect to the totals of each category. E.g. 12% for certain-separated 'yq' means that 12% of all the bigrams separated by '.' are 'y.q'.
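For anyone who wants to reproduce this kind of per-category percentage, here is a minimal sketch of how it could be computed. It is not the script used for the histogram. It assumes a transliteration where each glyph is a single character (as with C and S above), '.' marks a certain space and ',' an uncertain space; the function name and the sample lines are made up for illustration.

```python
from collections import Counter

# Assumed convention: '.' = certain space, ',' = uncertain space.
SEPARATORS = {".": "certain", ",": "uncertain"}

def bigram_percentages(lines):
    """Return {category: {bigram: percent of that category's total}}."""
    counts = {"adjacent": Counter(), "certain": Counter(), "uncertain": Counter()}
    for line in lines:
        for i in range(len(line) - 1):
            a = line[i]
            if a in SEPARATORS:
                continue  # a bigram must start on a glyph
            b = line[i + 1]
            if b not in SEPARATORS:
                counts["adjacent"][a + b] += 1
            elif i + 2 < len(line) and line[i + 2] not in SEPARATORS:
                # glyph, separator, glyph: count the pair under the separator's category
                counts[SEPARATORS[b]][a + line[i + 2]] += 1
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        result[category] = (
            {bg: 100.0 * n / total for bg, n in counter.items()} if total else {}
        )
    return result

# Hypothetical sample lines, not real transliteration data.
sample = ["daiin.qokedy,qokey", "oly.qol"]
pct = bigram_percentages(sample)
print(pct["certain"])    # share of each bigram among all '.'-separated pairs
print(pct["uncertain"])  # share of each bigram among all ','-separated pairs
```

Because each category is normalised by its own total, the three sets of percentages each sum to 100 and can be compared side by side even though adjacent bigrams are far more numerous than separated ones.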
Thanks, Marco, this is a complete and intuitive way to visualize the issue.
I feel like we cannot ignore the role of the transliterator in this context. This is not meant as a criticism, but more of a reflection on what exactly these uncertain spaces are. For example, might a bigram like "lk" have so many uncertain spaces because it does not exclusively belong to the "adjacent" or "space" category? I think what Emma asked earlier in this thread is relevant in this regard:
(21-05-2022, 09:42 PM)Emma May Smith Wrote: I ask because I'm interested in how strongly the uncertain spaces were affected by subjectivity. I wonder if uncertain spaces became more or less common during the course of the transcription. That is, either you or Gabriel became more aware of word patterns and they influenced judgements.
(This isn't a criticism, as I know I would do the same.)
(23-05-2022, 12:49 PM)Koen G Wrote: I feel like we cannot ignore the role of the transliterator in this context.
I agree: Emma pointed out an important side of the subject. Of course it's easy to imagine that a transliterator is influenced by word structure.
Another missing bit is how spaces work in readable manuscripts: we know that there is a great variability in how spaces were written, e.g. prepositions were often attached to the following word. But I don't think any transcription handles something like "uncertain spaces".
This is an area where one might think of using image processing: e.g. the number of spaces in each line (N) is extracted from a manual transcription and given as input to the software; then the N wider spaces are classified as certain / uncertain according to a threshold on a computed measure.
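To make the idea concrete, here is a rough sketch of the classification step only. It assumes the gap widths between consecutive glyphs of a line have already been measured from the image (the segmentation itself is not shown), and it uses the raw gap width as the "computed measure". The function name, the pixel values and the threshold are all hypothetical.

```python
def classify_spaces(gap_widths, n_spaces, certain_threshold):
    """Pick the n_spaces widest gaps in a line and label each one.

    gap_widths: gap width after each glyph, in pixels (assumed pre-measured).
    n_spaces: number of spaces in the line, taken from a manual transcription.
    certain_threshold: gaps at least this wide are labelled 'certain'.
    Returns {gap index: 'certain' or 'uncertain'} in reading order.
    """
    by_width = sorted(range(len(gap_widths)),
                      key=lambda i: gap_widths[i], reverse=True)
    widest = by_width[:n_spaces]
    return {
        i: "certain" if gap_widths[i] >= certain_threshold else "uncertain"
        for i in sorted(widest)
    }

# Hypothetical measurements for one line with 3 transcribed spaces.
gaps = [2, 14, 3, 9, 1, 20]
labels = classify_spaces(gaps, n_spaces=3, certain_threshold=12)
print(labels)
```

In practice the measure would probably need normalising, for example by the typical glyph width of the line, since the handwriting changes in size from page to page.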