The Voynich Ninja

Full Version: Bigrams across uncertain spaces
Bigrams across uncertain spaces.
All errors are mine and there probably are some.

Extract the text from ZL_ivtff_2a using IVTT. From the Windows command line: "ivtt -x7 -s0 -h0 ZL_ivtff_2a.txt ZL2adot-comma.txt"
The output text file contains pure text; it keeps commas and dots, and also keeps '?', '*' and the @123; notation.

Find all the commas denoting uncertain spaces, get the EVA character on either side of the comma, and concatenate them to form a bigram, i.e. ('X,Y' = 'XY')
Find all the dots denoting certain spaces, get the EVA character on either side of the dot, and concatenate them to form a bigram, i.e. ('X.Y' = 'XY')
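
The extraction step described above could be sketched roughly like this in Python. It is only an illustration, not the script that produced the attached data: the function name `boundary_bigrams` is made up, every character is treated as one glyph, and '?', '*' and the characters of the @123; notation are counted like any other character (an assumption).

```python
from collections import Counter

def boundary_bigrams(lines, sep):
    """Count bigrams made of the characters directly left and right of `sep`.

    `sep` is ',' for uncertain spaces or '.' for certain spaces.
    Assumption: one character = one glyph; separators and whitespace
    next to `sep` are skipped, everything else is counted as-is.
    """
    skip = {".", ",", " ", "\t"}
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        for i, ch in enumerate(line):
            if ch != sep or i == 0 or i == len(line) - 1:
                continue
            left, right = line[i - 1], line[i + 1]
            if left in skip or right in skip:
                continue
            counts[left + right] += 1
    return counts

# Hypothetical two-line sample, not real transliteration data.
sample = ["daiin.chol,or.aiin", "qokeey,dar.ol"]
comma_counts = boundary_bigrams(sample, ",")
dot_counts = boundary_bigrams(sample, ".")
```

For the real file, the lines of ZL2adot-comma.txt would be passed in instead of `sample`.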

Count them, rank them, simple-stat them:
abs(Percentage divide) =
take the percentage occurrence of the comma-bigram and the percentage occurrence of the dot-bigram, then divide whichever number is higher by the lower

abs(Rank Comma - Rank Dot) =
Absolute value of the difference between the comma-bigram rank and the dot-bigram rank
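
As a minimal sketch, the two measures could be computed like this. The helper names are invented for illustration. Two things are assumptions read off the posted figures rather than stated in the post: ties share a rank (the table has two R2 rows followed by R4), and the division seems to be done on the percentages after rounding to three decimals (e.g. 0.183 / 0.003 = 61.0 for 'lg').

```python
def percent(count, total):
    """Percentage occurrence, rounded to 3 decimals as in the posted table."""
    return round(100 * count / total, 3)

def percent_divide(comma_pct, dot_pct):
    """Higher percentage divided by the lower one."""
    return round(max(comma_pct, dot_pct) / min(comma_pct, dot_pct), 3)

def competition_ranks(counts):
    """Rank bigrams by descending count; tied counts share a rank (1, 2, 2, 4, ...)."""
    ranks = {}
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    for position, (bigram, n) in enumerate(ordered, start=1):
        if position > 1 and n == ordered[position - 2][1]:
            # Same count as the previous bigram: reuse its rank.
            ranks[bigram] = ranks[ordered[position - 2][0]]
        else:
            ranks[bigram] = position
    return ranks

def rank_diff(comma_rank, dot_rank):
    """Absolute difference between the two ranks."""
    return abs(comma_rank - dot_rank)

# 'ra' from the table: 285 of 2737 comma breaks, 730 of 30890 dot breaks.
ra_ratio = percent_divide(percent(285, 2737), percent(730, 30890))
```

Dividing the higher value by the lower keeps the ratio at 1 or above in both directions, so the table alone does not show which side is over-represented; the two percentage columns have to be compared for that.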

I did the 'percent divide' and 'rank subtract' because they make it easier to spot the differences and give a simple way to compare the two lists.
Top 12 comma ranks shown here (13 rows, because two bigrams tie at R12):
Code:
Commas (X,Y): 2737 total        Dots (X.Y): 30890 total         abs(Percentage divide)  abs(Rank Comma - Rank Dot)

(Rank, count, bigram, %)        (Rank, count, bigram, %)
('R1', 285, 'ra', 10.413)       ('R14', 730, 'ra', 2.363)       4.407                   13
('R2', 147, 'lc', 5.371)        ('R10', 1146, 'lc', 3.71)       1.448                   8
('R2', 147, 'lk', 5.371)        ('R33', 201, 'lk', 0.651)       8.25                    31        --lk
('R4', 125, 'ls', 4.567)        ('R15', 672, 'ls', 2.175)       2.1                     11
('R5', 119, 'sa', 4.348)        ('R28', 245, 'sa', 0.793)       5.483                   23
('R6', 101, 'yk', 3.69)         ('R18', 443, 'yk', 1.434)       2.573                   12
('R7', 93, 'ol', 3.398)         ('R39', 118, 'ol', 0.382)       8.895                   32        --ol
('R8', 90, 'ld', 3.288)         ('R17', 569, 'ld', 1.842)       1.785                   9
('R9', 83, 'ro', 3.033)         ('R6', 1355, 'ro', 4.387)       1.446                   3
('R10', 78, 'lo', 2.85)         ('R11', 996, 'lo', 3.224)       1.131                   1
('R11', 73, 'yd', 2.667)        ('R7', 1275, 'yd', 4.128)       1.548                   4
('R12', 68, 'yt', 2.484)        ('R23', 312, 'yt', 1.01)        2.459                   11
('R12', 68, 'ok', 2.484)        ('R52', 65, 'ok', 0.21)         11.829                  40        --ok

Observations:
The 'o<character>' family turns up a lot with high abs(Rank subtract) scores: ol, ok, or, oa, ot

The bigram 'lg' has the highest abs(Rank subtract) score:
Comma                           Dot                             abs(Percentage divide)  abs(Rank Comma - Rank Dot)
('R58', 5, 'lg', 0.183)         ('R206', 1, 'lg', 0.003)        61.0                    148       --lg
One possible conclusion is that at least some of those 'l,g' bigrams are real bigrams and the apparent space is a scribal artifact.

Questions:
What does it mean when the Dot-bigram occurrence percentage is higher than the Comma-bigram occurrence percentage? e.g
('R38', 15, 'yo', 0.548)      ('R2', 2687, 'yo', 8.699)    15.874      36

Data attached:[attachment=6556]
I haven't drawn any conclusions 'cos I've now got a headache.
So feel free to post any observations, conclusions, ideas and whatnot.

What could be done better:
Use high-ASCII codes instead of the @123; notation; any '@<chr>' or '<chr>@' bigrams would then probably have slightly different counts.
Make an 'expected differences' column; I was going to and forgot.
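
The post does not say what the 'expected differences' column would contain. One possible reading, offered purely as an assumption, is the comma count each bigram would have if comma breaks were spread over bigrams in the same proportions as dot breaks. The function name below is hypothetical.

```python
def expected_comma_count(dot_count, dot_total, comma_total):
    """Comma count expected if comma breaks followed the dot-break proportions.

    This is one interpretation of an 'expected' column, not the author's definition.
    """
    return comma_total * dot_count / dot_total

# 'ra' from the table: 730 of 30890 dot breaks, 2737 comma breaks in total.
expected_ra = expected_comma_count(730, 30890, 2737)
observed_ra = 285
excess_ra = observed_ra - expected_ra  # positive: more comma breaks than expected
```

Under this reading, a large positive excess marks bigrams that attract uncertain spaces more often than their share of certain spaces would suggest.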

This experiment was derived from Juan_Salis' suggestion.
Also, related work is 'Glyph combinations across word breaks in the Voynich Manuscript', but to my shame I can't remember the details in it.
Can you also sort the "Dots table" by rank ?
Thanks, Rob, this is some great work. What it all means is a difficult question that probably won't be resolved by looking at individual examples.

Quote:What does it mean when the Dot-bigram occurrence percentage is higher than the Comma-bigram occurrence percentage? e.g

('R38', 15, 'yo', 0.548)      ('R2', 2687, 'yo', 8.699)    15.874      36

I'm a bit confused by this. Aren't there a lot of cases of "y.o", compared to only a few cases of "y,o"?

PS: Rene, in case you read this thread, could you tell us some more about the comma vs dot distinction? (This happened long before many people here were into VM research). How exactly did the "uncertain spaces" come to be? How were they judged?
(21-05-2022, 06:46 PM)Koen G Wrote: PS: Rene, in case you read this thread, could you tell us some more about the comma vs dot distinction? (This happened long before many people here were into VM research). How exactly did the "uncertain spaces" come to be? How were they judged?


I'm here...
This is something that naturally came up while transcribing. Some gaps really look like word spaces, and they have been denoted by a full stop / period.
Others seemed doubtful - it was not clear if these were word spaces or just a slightly larger gap between adjacent characters. They have been denoted by a comma.

This process was, of course, strongly subjective.
Gabriel Landini and myself did this in parallel, and we came up with different opinions.
This appears to be related to what Patrick Feaster discusses in his post, § 4 Word Breaks, Line Breaks, Paragraph Breaks, Labels. I read this post a while ago and at the moment I can't tell how close the measures are.
(21-05-2022, 07:55 PM)ReneZ Wrote:
I'm here...
This is something that naturally came up while transcribing. Some gaps really look like word spaces, and they have been denoted by a full stop / period.
Others seemed doubtful - it was not clear if these were word spaces or just a slightly larger gap between adjacent characters. They have been denoted by a comma.

This process was, of course, strongly subjective.
Gabriel Landini and myself did this in parallel, and we came up with different opinions.

Do you recall the order in which the pages were transcribed?

I ask because I'm interested in how strongly the uncertain spaces were affected by subjectivity. I wonder if uncertain spaces became more or less common during the course of the transcription. That is, whether either you or Gabriel became more aware of word patterns and they influenced judgements.

(This isn't a criticism, as I know I would do the same.)
(21-05-2022, 07:57 PM)MarcoP Wrote: This appears to be related to what Patrick Feaster discusses in his post, § 4 Word Breaks, Line Breaks, Paragraph Breaks, Labels. I read this post a while ago and at the moment I can't tell how close the measures are.

I limited my analysis to "paragraphic" text, while I suspect RobGea included labels, circles, radii, etc.  For that reason, my counts are likely to be a bit lower in each case.  But in general the ratios seem comparable.  Here are just a couple examples:

For r_a, I counted 631 apart (r.a) and 248 ambiguous (r,a), plus 573 together (ra).
RobGea has 730 apart (r.a) and 285 ambiguous (r,a).

For l_o, I counted 847 apart (l.o) and 72 ambiguous (l,o), plus 499 together (lo).
RobGea has 996 apart (l.o) and 78 ambiguous (l,o).

Beyond these raw counts, I think we're trying to measure different things, so the rest of the figures may not be comparable.

If I'm reading RobGea's post correctly, the idea there is to measure differences between the proportion of total "dot breaks" and the proportion of total "comma breaks" corresponding to each bigram.  And there do seem to be some striking differences.

My figures came out of an attempt to study the predictability of spacing -- i.e., given a string of Voynichese with all spaces removed, how accurately could the spaces be reinserted according to consistent rules about whether each "adjacency" should have a space inserted or not.  Spacing in this sense turned out to be around 95% predictable.  That remaining 5% consisted of cases where an adjacency that's usually spaced wasn't, or vice versa.  So far this analysis ignored comma breaks.  But the interesting detail, I thought, was that the same adjacencies that were most inconsistent when there was no question about there being a space or not -- for example, lots of unambiguous cases of both "r.a" and "ra" -- also made up the greatest proportion of the comma breaks ("r,a").  That is, the adjacencies that were being written inconsistently also seemed to be the ones that were being written ambiguously enough to leave doubt and trigger a comma.
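
To make the idea concrete, here is a minimal sketch of one way such a predictability figure could be computed. It is not the procedure actually used for the 95% figure: it assumes one character per glyph, applies a simple majority rule per glyph pair (always space, or never space), and skips comma breaks entirely. All names are hypothetical.

```python
from collections import defaultdict

def adjacency_tallies(text):
    """Map each glyph pair to [unspaced, spaced] counts.

    Only a direct adjacency ('ra') or a single dot ('r.a') is tallied;
    comma breaks and any other gap are ignored.
    """
    tallies = defaultdict(lambda: [0, 0])
    glyph_positions = [i for i, ch in enumerate(text) if ch not in ".,\n "]
    for p, q in zip(glyph_positions, glyph_positions[1:]):
        gap = text[p + 1:q]
        pair = text[p] + text[q]
        if gap == "":
            tallies[pair][0] += 1
        elif gap == ".":
            tallies[pair][1] += 1
    return tallies

def predictability(tallies):
    """Share of adjacencies a per-pair majority rule would get right."""
    total = sum(unspaced + spaced for unspaced, spaced in tallies.values())
    correct = sum(max(unspaced, spaced) for unspaced, spaced in tallies.values())
    return correct / total

# Hypothetical toy string: 'ra' is written together twice and apart once.
toy = adjacency_tallies("ra.ra.r.a,r,a")
toy_score = predictability(toy)
```

In the toy string the pair 'ra' is the only inconsistent adjacency, and it is also the only one that appears with a comma, which is the kind of overlap described above.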

One explanation would be that "comma spaces" aren't just an artifact of transcription, but reflect a real ambiguity in the script that might shed light on how it's structured.  But I'll admit that isn't the only possible explanation. 

(21-05-2022, 09:42 PM)Emma May Smith Wrote: Do you recall the order in which the pages were transcribed?

I ask because I'm interested in how strongly the uncertain spaces were affected by subjectivity. I wonder if uncertain spaces became more or less common during the course of the transcription. That is, whether either you or Gabriel became more aware of word patterns and they influenced judgements.

So I guess we could imagine a learning process in which as the transcribers gained more experience with which glyph pairs are typically spaced or unspaced, they would become more confident in their judgment calls and more likely to opt for a dot than for a comma.  That said, I have a feeling that Rene and Gabriel were already immersed in the patterns of Voynichese before they started their transcription, and when I've checked comma spaces against the scans, they seem for the most part to be legitimately ambiguous.  And sometimes ambiguous spacing seems even more curiously patterned:

[Image: voynich-triple.jpg?w=768&h=295]
@RobGea:
You have counted bigrams, but some of them can be part of a 3-gram. You can try to count the following most common 3-grams with 'o' in the middle (the half space will be after or before the 'o'):
chor, with all the family of 'ch' and all the family of 'r'.
chol, with all the family of 'ch'.
(21-05-2022, 09:42 PM)Emma May Smith Wrote:
Do you recall the order in which the pages were transcribed?

I ask because I'm interested in how strongly the uncertain spaces were affected by subjectivity. I wonder if uncertain spaces became more or less common during the course of the transcription. That is, whether either you or Gabriel became more aware of word patterns and they influenced judgements.

(This isn't a criticism, as I know I would do the same.)

Yes, this was strictly in order of the pages in the MS.
I really doubt that the assessment ('certain' vs. 'uncertain') changed in the course of the activity.