The Voynich Ninja

Full Version: Input data accuracy
Hi Rene,
I used the following very rough code:

Code:
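# Drop comment lines (those starting with '#').
grep -v '^#' ~/rec/voynich/software/ivtt/ZL_ivtff_1r.txt > /tmp/clean

for c in {a..z}
do
    printf '%s ' "$c"
    # '[.]' matches a certain space, ',' an uncertain space.
    for sep in '[.]' ','
    do
        # END: words ending in $c; START: words starting with $c.
        END=$(grep -o "${c}${sep}[a-z]" /tmp/clean | wc -l)
        START=$(grep -o "[a-z]${sep}${c}" /tmp/clean | wc -l)
        printf '%d %d  ' "$START" "$END"
    done
    echo
done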

These are two problems I see, but there might be others:
  • 1-character words are not correctly handled
  • I only consider pairs of consecutive words both ending/starting with [a-z]
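
One possible way around the first problem, as a minimal sketch rather than a tested replacement: grep -o returns non-overlapping matches, so in a sequence like "n.s.o" the single-character word "s" is consumed by the first match and its own ending is missed. Scanning each line character by character avoids this. The function name count_boundaries is hypothetical, and the sketch assumes the input contains only transliterated text (comments and locus identifiers already removed).

```shell
# Hypothetical helper: reads text lines on stdin and prints, per character,
# "char start_certain end_certain start_uncertain end_uncertain".
count_boundaries() {
    awk '
    {
        n = length($0)
        for (i = 2; i < n; i++) {
            sep = substr($0, i, 1)
            if (sep != "." && sep != ",") continue
            prev = substr($0, i - 1, 1)
            nxt  = substr($0, i + 1, 1)
            # Same restriction as the grep version: both neighbours in [a-z].
            if (prev !~ /^[a-z]$/ || nxt !~ /^[a-z]$/) continue
            ends[prev, sep]++      # prev is word-final before this space
            starts[nxt, sep]++     # nxt is word-initial after this space
            seen[prev] = 1
            seen[nxt] = 1
        }
    }
    END {
        for (c in seen)
            printf "%s %d %d %d %d\n", c,
                starts[c, "."], ends[c, "."], starts[c, ","], ends[c, ","]
    }' | LC_ALL=C sort
}
```

It could be run as count_boundaries < /tmp/clean. Only characters that occur at least once next to a space are listed, and spaces at line breaks are still not counted.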

If I understand correctly, another difference is that you treat line-breaks as certain spaces, while I only counted spaces between consecutive words on the same line. I guess this explains why you find fewer c-starting words (they are rare after a line break).
Hi Marco,

that looks correct. This also means that labels are not counted in your case, while they are in my case.
Since I like to get to the bottom of things, I redid my plot by removing (the counts for) the first words of all lines. This also removes essentially all of the labels.
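
For anyone who wants to try the same thing on a plain text file, the step can be sketched with sed. This is only an illustration, not the code used for the plot: the function name drop_first_words is hypothetical, and it assumes each input line holds only the transliterated text of one line or label, with '.' and ',' as word separators and no locus identifier in front.

```shell
# Hypothetical helper: removes the first word of every line read on stdin.
# Lines without any separator (single words, such as most labels) are dropped.
drop_first_words() {
    sed -e '/[.,]/!d' -e 's/^[^.,]*[.,]//'
}
```

For example, drop_first_words < /tmp/clean would print every multi-word line without its first word and the space that follows it.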

This is the result, which can barely be distinguished from Marco's first plot:

[attachment=5535]
I created plots for certain/uncertain word start/end for the 18 most frequent characters in GC and EVA transliterations.
This table shows the corresponding EVA characters for the GC (aka v101) characters considered here.

[attachment=5548]

My overall impression is that the figures are rather consistent both across different transcribers and across different scribes. Of course, stats for scribes are affected by the different dialects (so Scribe 1, Currier A, has fewer EVA:q / GC:4 than the others). Scribes 4 and 5 wrote considerably less text than the other three, so I did not give them much attention.

For instance, if one focuses on word-start characters, one can see that EVA:q and (less clearly) 'o' are frequent after certain spaces and rare after uncertain spaces. On the other hand, 'a' and EVA:k (GC:h) are more than twice as frequent after uncertain spaces as after certain spaces.

EVA start: [link]
GC start: [link]
EVA end: [link]
GC end: [link]
I think the different behaviour of different symbols could be related to their role in Voynichese word structure. The symbols that frequently appear as prefixes tend to occur after certain spaces, while those that are only rarely word-initial are more likely to occur after uncertain spaces.
The green bars represent the % of word-initial occurrences for the most frequent EVA characters (considering both certain and uncertain spaces), i.e. the number of occurrences that are word-initial divided by the total number of occurrences. The purple bars are the ratio of certain spaces to all spaces preceding each character, i.e. cert/(cert+unc), where cert and unc are the blue and orange bars in the "start" histogram above. The correlation between the two is 0.85, meaning that uncertain spaces are less likely to appear before characters that have a preference to appear word-initially.
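
The two quantities can be checked with a few lines of awk. This is a minimal sketch under stated assumptions, not the code behind the plot: the function name certain_ratio_correlation and the input layout are hypothetical. Each input line is assumed to hold four whitespace-separated fields: the character, its word-initial share (green bar), cert and unc. The sketch computes cert/(cert+unc) per character and then the Pearson correlation between that ratio and the word-initial share.

```shell
# Hypothetical helper: reads "char initial_share cert unc" lines on stdin
# and prints the Pearson correlation between initial_share and
# cert/(cert+unc), rounded to two decimals.
certain_ratio_correlation() {
    awk '
    ($3 + $4) > 0 {
        x = $2                   # word-initial share (green bar)
        y = $3 / ($3 + $4)       # share of certain spaces (purple bar)
        n++
        sx += x; sy += y
        sxx += x * x; syy += y * y; sxy += x * y
    }
    END {
        den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        if (n < 2 || den == 0) { print "undefined"; exit }
        printf "%.2f\n", (n * sxy - sx * sy) / den
    }'
}
```

Characters with no preceding spaces at all (cert+unc = 0) are skipped, since the ratio is undefined for them.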

[attachment=5550]

All this definitely deserves further investigation. As always, I may have made errors, so be careful!
I am hijacking this thread just to provide a minor update.

Many months ago, I received an E-mail from a user of the transliteration files at my site, reporting that something seemed to be wrong with the page variable $I when the page is text-only.
It has taken me until now to fix it.

Unfortunately, I have not been able to find the related E-mail again, so if that person is also reading here, a response or a DM to check that it is indeed fixed would be welcome.

Latest versions are now version 1b for the GC file and version 2b for the ZL file.
The text of the transliterations has not changed:
GC:  0c = 1a = 1b
ZL:  1r = 2a = 2b

The files are linked in the usual place:
[link] (near the bottom of the page).
(25-12-2022, 09:48 AM) ReneZ wrote: Unfortunately, I have not been able to find the related E-mail again, so if that person is also reading here, a response or a DM to check that it is indeed fixed would be welcome.

I got the confirmation, thanks very much!

In the meantime, the other three files have also been brought up to date, and now use paragraph start markers and Lisa's hand identification as a selection option.
(Unfortunately, there may be a bug in ivtt concerning specifically that selection option, which I will have to check before anything else.)