The Voynich Ninja

Input data accuracy

For all text analysis, people use one transliteration file or another, without knowing clearly how accurate this data is. The purpose of this short analysis is to give some indication of that.

This is done by looking at the occurrence of "Hapax Legomena", i.e. words that appear only once in the entire text.
The reason for selecting this statistic is that it is particularly sensitive to the transliteration quality. It depends both on the choices made by the transcriber about the alphabet, and on the decisions about where the word spaces are. With respect to the alphabet, it is not a matter of whether d is transcribed as "8" or "d", but whether slightly different-looking versions are transliterated the same, or differently.
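
For concreteness, this is all the computation involved; a minimal sketch in Python, assuming a file that already holds one word token per line (the file name is made up):

Code:
from collections import Counter

# Assumption: words.txt holds one word token per line,
# already extracted from a transliteration file.
with open("words.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]

counts = Counter(tokens)
hapax = sum(1 for n in counts.values() if n == 1)

print("word tokens:", len(tokens))
print("word types: ", len(counts))
print("hapax:      ", hapax)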

The purpose is not to analyse whether Hapax in the Voynich text are normal, or comparable to other texts.

Most people use the Takahashi transliteration, which is in Basic Eva.
There are also the ZL transliteration, which uses extended Eva, and the GC transliteration, which uses the v101 alphabet.
The last two additionally use a symbol to indicate "uncertain spaces" (the comma). Effectively, this means that one can extract two different transliterations out of each of them, namely:

- case 1, consider that the uncertain spaces are also spaces, so count all of them as spaces.
- case 2, consider that the uncertain spaces are not spaces, so count only the certain ones as spaces.

It is clear that case 1 will lead to a greater number of words (word tokens) in the text.
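
A minimal sketch of the two extraction cases, assuming '.' marks a certain space and ',' an uncertain one (the example word string is made up):

Code:
def words_case1(line):
    # case 1: uncertain spaces (',') count as spaces too
    return line.replace(",", ".").split(".")

def words_case2(line):
    # case 2: only certain spaces ('.') separate words;
    # dropping the ',' joins its two neighbours into one word
    return line.replace(",", "").split(".")

line = "qokeedy,chedy.daiin"
print(words_case1(line))   # ['qokeedy', 'chedy', 'daiin'] -> 3 tokens
print(words_case2(line))   # ['qokeedychedy', 'daiin']     -> 2 tokens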

Altogether, this leads to five different transliterations, all of which are more or less complete for the MS.

For each of the files, I computed the Hapax statistics; in the case presented below, this is done only for the "normal text in paragraphs", i.e. excluding labels, circular and radial texts. (The alternatives have also been done and lead to similar results.)

One can count word tokens, word types and Hapax, and then compute the three ratios:
types/tokens,  hapax/types,  hapax/tokens
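
These ratios follow directly once the three counts are available; a sketch with placeholder numbers (not taken from any of the five files):

Code:
tokens, types, hapax = 35000, 8000, 5500   # placeholder counts
print("types/tokens:", f"{types / tokens:.1%}")
print("hapax/types: ", f"{hapax / types:.1%}")
print("hapax/tokens:", f"{hapax / tokens:.1%}")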

The following plot shows the third ratio, which is typically in the area of 10-20%.
(Note that hapax/types is usually over 50%, and can be up to 70%).

One should keep in mind that, a priori, all five transliterations should be considered of equal quality.

[attachment=5506]

I would describe these results as "all over the place".  The number of word tokens varies between 32,500 and 36,700 while the hapax ratio is between 14.1% and 19.6%. There is also no correlation between the two.
One can clearly observe that, on average, GC "sees" far more spaces than ZL, while IT (Takahashi) is in the middle between the two ZL options.
In general, GC includes more hapax, which can be explained by the specific character set definition it uses.

In a second iteration, I have simplified all five files by translating them to a more reduced character set, in a way similar to the Cuva alphabet I have used occasionally at my web site. This results in five new observation points, which have been added to the plot below:
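
The actual mapping is not reproduced here; as a sketch of the mechanism only, it is a many-to-one character translation applied to every word (the pairs below are placeholders, not the real Cuva table):

Code:
# Sketch only: a many-to-one reduction table in the spirit of Cuva.
# These pairs are placeholders; the real mapping is different.
REDUCE = str.maketrans({"b": "p", "u": "a", "j": "g"})

def reduce_word(word):
    return word.translate(REDUCE)

print(reduce_word("qokeedy"))   # unchanged: no rare characters
print(reduce_word("cbhey"))     # -> "cphey"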

[attachment=5507]

This has almost no impact on the Takahashi transliteration, which already uses a limited character set.
It has only limited impact on the ZL transliteration(s), whose "special" characters all tend to be rare, while it has a major impact on the GC transliteration(s), bringing the points down to lie (more or less) on a straight line through all points.

The variation is still very significant, due to the definition of words, which strongly depends on how many spaces there really are.
I think many of us do look at the actual text, rather than simply using the Takahashi or whichever transliteration file and accepting its accuracy on faith alone.

However, I would rather do a text analysis based on a "pure Takahashi transliteration", for example, understanding that there may be a small number of mistakes and inaccuracies, than "mix and match" transliteration files or substitute my own personal readings of particular characters, word break spaces, etc.

There will still inevitably be errors in any such text analysis. But I would rather have such errors be consistently Takahashi errors than a mixed grab bag of various different people's transliteration errors, including my own.
(07-05-2021, 02:12 PM)geoffreycaveney Wrote: substitute my own personal readings of particular characters,

Ahem, isn't this what your Middle English reading is all about?
Need I provide any examples?
(07-05-2021, 02:31 PM)ReneZ Wrote:
(07-05-2021, 02:12 PM)geoffreycaveney Wrote: substitute my own personal readings of particular characters,
Ahem, isn't this what your Middle English reading is all about?
Need I provide any examples?

I mean that I take the Takahashi transliteration as my starting point, rather than saying, "Oh, here I dispute Takahashi's transliteration," and changing the EVA transliteration value before proceeding. I accept the Takahashi transliteration as it is for now, then apply my EVA/Currier : Yorkist cipher letter value correspondence table.
Coming back to the opening post, it is also possible to plot, instead of the Hapax ratio, the absolute numbers vs. the number of word tokens. This is done below, first for word types vs. word tokens, and then for hapax vs. word tokens.

[attachment=5524]

[attachment=5525]

We see that for both quantities, the values decrease for increasing number of spaces, i.e. increasing nr. of word tokens.
This is counter-intuitive, and tells us something (but what?) about the Voynich text.

The purple lines give the same two quantities for the text of Dante's Divina Commedia. They were generated by cutting the text after "N" word tokens from the start, and computing the two values for this partial text. These lines therefore represent the effect of a 'growing text'. The values increase, though not monotonically, which indicates some repetitiveness.
On the other hand, the individual points all refer to a text of the same length, but we just don't know for sure how many words this text has.
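
The 'growing text' curve is easy to reproduce; a sketch, assuming a plain-text file of the Commedia and simple whitespace tokenization (the file name is made up):

Code:
from collections import Counter

# Assumption: dante.txt is a plain-text Divina Commedia.
with open("dante.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

counts = Counter()
curve = []                       # (N, types, hapax) after N word tokens
for n, w in enumerate(tokens, 1):
    counts[w] += 1
    if n % 1000 == 0:            # sample the curve every 1000 tokens
        hapax = sum(1 for c in counts.values() if c == 1)
        curve.append((n, len(counts), hapax))

for point in curve[:3]:
    print(point)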

One important observation is that the variation due to this uncertainty is bigger than the variation from Dante's text length. This again underlines the importance of the transliteration errors we have. Rather than errors, we may also call them 'uncertainties'.

Another observation was already made just before: the nr. of word types and the nr. of hapax go down when the nr. of word tokens increases. Of course, this may just be a side effect of the fact that the average word length decreases, but it deserves a closer look.

Finally, for the case of word types, the different transliteration options match the Dante plain text for the case of GC's transliteration with all spaces taken into account. So, if the Voynich MS plain text were Dante-like, this would be the most accurate option. But we don't know that, and to make it worse, the cross-over point for Hapax is even further to the right.

However, that interesting question is a different one than the topic of this thread, which is specifically about transliteration errors.
(12-05-2021, 12:06 PM)ReneZ Wrote: Coming back to the opening post, it is also possible to plot, instead of the Hapax ratio, the absolute numbers vs. the number of word tokens. This is done below, first for word types vs. word tokens, and then for hapax vs. word tokens.

We see that for both quantities, the values decrease for increasing number of spaces, i.e. increasing nr. of word tokens.
This is counter-intuitive, and tells us something (but what?) about the Voynich text.

I think that the decrease in word-types and hapax when more spaces are considered is due to the presence of more short words. Long words are almost always hapax, while splitting a Voynichese word in two often results in two valid words. 

This could be connected with the method that Wentian Li (Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution, 1992) used to produce his Zipfian "random" text: instead of just treating the space like the other characters, he had to "introduce a cutoff of the maximum possible word length L-max". Without the cutoff, you get a huge number of hapax.
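
If I understand Li's construction correctly, it can be sketched like this: characters (including the space) are drawn uniformly at random, and the cutoff forces a word break whenever the current word reaches L-max (alphabet size, cutoff and text length are arbitrary here):

Code:
import random
from collections import Counter

random.seed(1)
ALPHABET = "abcdefghij "   # the space is just one more symbol
L_MAX = 9                  # cutoff on the maximum word length

words, current = [], ""
while len(words) < 38000:
    ch = random.choice(ALPHABET)
    if ch == " ":
        if current:
            words.append(current)
            current = ""
    else:
        current += ch
        if len(current) == L_MAX:   # cutoff: force a word break
            words.append(current)
            current = ""

counts = Counter(words)
hapax = sum(1 for c in counts.values() if c == 1)
print("types:", len(counts), "hapax:", hapax)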
Hi Rene,
I ran a couple of simple experiments. As always, I may have made errors.

These are % histograms of word-start and word-end characters after and before Certain (.) or Uncertain (,) spaces (ZL EVA). There are some large and interesting differences, but both types of spaces appear to respect the basics of Voynichese word structure: e.g. there are very few words ending with 'e' or 'q' even among uncertain words.

[attachment=5530]

[attachment=5529]
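
This is not the actual code I used, but a sketch of how such counts can be collected from ZL-style text, where '.' marks a certain space and ',' an uncertain one (metadata and markup handling is glossed over):

Code:
from collections import Counter

start_after = {".": Counter(), ",": Counter()}   # word-start characters
end_before  = {".": Counter(), ",": Counter()}   # word-end characters

def tally(line):
    for i, ch in enumerate(line):
        if ch in ".,":
            if i > 0 and line[i - 1] not in ".,":
                end_before[ch][line[i - 1]] += 1
            if i + 1 < len(line) and line[i + 1] not in ".,":
                start_after[ch][line[i + 1]] += 1

tally("qokeedy,chedy.daiin")   # made-up example line
print(start_after[","])        # Counter({'c': 1})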

I also modified the first ~38K words of Dante's Comedy by inserting a space after some specific characters.
E.g. the first lines are:

nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura 
che la diritta via era smarrita 

dante_e transforms them in the following way:

ne l me zzo de l cammin di nostra vita
mi ritrovai pe r una se lva oscura 
che  la diritta via e ra smarrita
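
The transformation itself is nothing more than inserting a space after every 'e' (hence the name dante_e); a minimal sketch:

Code:
def dante_e(text):
    # insert a space after every 'e': "nel" -> "ne l", "mezzo" -> "me zzo"
    return text.replace("e", "e ")

print(dante_e("nel mezzo del cammin di nostra vita"))
# -> "ne l me zzo de l cammin di nostra vita"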

Of course these replacements increase the number of tokens. Unless I messed something up, it seems that also in this case splitting words results in fewer word types and hapax words.

[attachment=5528]
Thanks Marco!

I understand that in the first two plots, the red points refer to the characters following uncertain spaces only, i.e. the increase, converted to a percentage. It's interesting to see some examples:

- Spaces preceding a "q" are felt mostly to be 'certain'.
- Spaces following "y" and "n" are felt mostly to be 'certain'.

This is almost certainly to be ascribed to the writing in the MS, where such spaces are clear, rather than a tendency of the transcriber.
The alternative is also quite interesting:
- Many spaces preceding "a" or "k" are considered 'uncertain'
- Many spaces following "o" are considered 'uncertain'

Here, it might be of interest to see if there is any tendency depending on the scribe as per Lisa's identification.
(14-05-2021, 06:03 AM)ReneZ Wrote: I understand that in the first two plots, the red points refer to the characters following uncertain spaces only, i.e. the increase, converted to a percentage.

Hi Rene,
I confirm your interpretation of the plots.

As for the scribes, I attach histograms of the % of uncertain spaces with respect to all spaces (certain+uncertain). Here it seems that there is a great difference between ZL and GC: the ratio is quite constant across scribes for ZL, but varies considerably in GC.
I can try making more detailed plots for specific characters per scribe, but comparing things will of course be complicated by the different character statistics for different "dialects".
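
The computation behind these histograms can be sketched as follows; the folio-to-scribe mapping based on Lisa's attribution is only hinted at here with made-up entries:

Code:
from collections import Counter

SCRIBE_OF = {"f1r": 1, "f26r": 2}   # made-up entries; the real map
                                    # follows Lisa's scribe attribution
certain, uncertain = Counter(), Counter()

def tally(folio, line):
    scribe = SCRIBE_OF.get(folio)
    if scribe is not None:
        certain[scribe] += line.count(".")
        uncertain[scribe] += line.count(",")

tally("f1r", "fachys,ykal.ar,ataiin")   # made-up example line
for s in sorted(certain):
    total = certain[s] + uncertain[s]
    print(f"scribe {s}: {uncertain[s] / total:.1%} uncertain spaces")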
Thanks, that is quite interesting!

I tried to reproduce your plots. The result was similar but not quite the same. I used the following commands:

Code:
# produce word lists under the two treatments of uncertain spaces,
# then count the first character of each word
ivtt -x7 -s3 ZL.txt >zl_c7.words
cut -c1 zl_c7.words | sort | uniq -c >zl_c7.sorted

ivtt -x8 -s3 ZL.txt >zl_c8.words
cut -c1 zl_c8.words | sort | uniq -c >zl_c8.sorted

This led to the following two files:
[link]
By subtracting the two, one gets the number of characters following uncertain spaces. This led to the following graph:

[attachment=5532]

It is mainly different w.r.t. the 'c'. It is entirely possible that I made a mistake, but it would be nice if we could get the same result.
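
For anyone who wants to cross-check, the subtraction step can be sketched like this, assuming the two uniq -c outputs from above (each line: a count, then a character):

Code:
def read_counts(path):
    counts = {}
    for line in open(path):
        n, ch = line.split()       # uniq -c output: count, character
        counts[ch] = int(n)
    return counts

c7 = read_counts("zl_c7.sorted")   # assumption: the case with more spaces
c8 = read_counts("zl_c8.sorted")

# word-start characters that only appear when uncertain spaces are
# treated as word breaks, i.e. characters following uncertain spaces
for ch in sorted(c7):
    diff = c7.get(ch, 0) - c8.get(ch, 0)
    if diff:
        print(ch, diff)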