07-05-2021, 08:26 AM
For all text analysis, people use one transliteration file or another, without knowing clearly how accurate this data is. The purpose of this short analysis is to give some indication of that.
This is done by looking at the occurrence of "Hapax Legomena", i.e. words that appear only once in the entire text.
The reason for selecting this statistic is that it is particularly sensitive to transliteration quality. It depends both on the choices the transcriber made about the alphabet, and on the decisions about where the word spaces are. With respect to the alphabet, it is not a matter of whether d is transcribed as "8" or "d", but whether slightly different-looking versions are transliterated the same, or differently.
The purpose is not to analyse whether Hapax in the Voynich text are normal, or comparable to other texts.
Most people use the Takahashi transliteration, which is in Basic Eva.
There are also the ZL transliteration, which uses extended Eva, and the GC transliteration, which uses the v101 alphabet.
The last two additionally use a symbol (the comma) to indicate "uncertain spaces". Effectively, this means that one can extract two different transliterations out of each of them, namely:
- case 1: consider that the uncertain spaces are also spaces, so count all of them as spaces.
- case 2: consider that the uncertain spaces are not spaces, so only count the certain ones as spaces.
It is clear that case 1 will lead to a greater number of words (word tokens) in the text.
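The two cases above can be sketched in a few lines of code. This is only an illustration: it assumes a transliteration format in which certain word spaces are written as "." and uncertain spaces as ",", and the example line is made up, not taken from any of the files.

```python
# Sketch of the two ways to read "uncertain spaces", assuming a format where
# "." marks a certain word space and "," marks an uncertain one.

def tokens_all_spaces(line: str) -> list[str]:
    # Case 1: treat uncertain spaces (commas) like real spaces.
    return line.replace(",", ".").split(".")

def tokens_certain_only(line: str) -> list[str]:
    # Case 2: ignore uncertain spaces, so the words around them merge.
    return line.replace(",", "").split(".")

line = "daiin,chedy.qokeedy"          # made-up example line
print(tokens_all_spaces(line))        # 3 word tokens
print(tokens_certain_only(line))      # 2 word tokens
```

The same input line thus yields more (and shorter) word tokens in case 1, which is why the case-1 points always sit at higher token counts.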
Altogether, this leads to five different transliterations, all of which are more or less complete for the MS.
For each of the files, I computed the Hapax statistics. In the case presented below, this is done only for the "normal text in paragraphs", i.e. excluding labels and circular and radial texts. (The alternatives have also been done and lead to similar results.)
One can count word tokens, word types and Hapax, and then compute the three ratios:
types/tokens, hapax/types, hapax/tokens
The following plot shows the third ratio, which is typically in the range of 10-20%.
(Note that hapax/types is usually over 50%, and can be up to 70%).
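The three ratios are straightforward to compute from a token list. A minimal sketch (the example token list is invented, purely to show the counting):

```python
from collections import Counter

def hapax_stats(tokens):
    # Word types are the distinct words; hapax legomena are the
    # types that occur exactly once in the whole token list.
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "types/tokens": n_types / n_tokens,
        "hapax/types": n_hapax / n_types,
        "hapax/tokens": n_hapax / n_tokens,
    }

# Invented example: 5 tokens, 4 types, 3 hapax.
print(hapax_stats(["daiin", "chedy", "daiin", "qokeedy", "shedy"]))
```

Note that splitting one word in two (an extra space) can turn a single hapax into two non-hapax words, or vice versa, which is exactly why the statistic is so sensitive to the space decisions.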
One should keep in mind that, a priori, all five transliterations should be considered of equal quality.
[attachment=5506]
I would describe these results as "all over the place". The number of word tokens varies between 32,500 and 36,700 while the hapax ratio is between 14.1% and 19.6%. There is also no correlation between the two.
One can clearly observe that, on average, GC "sees" far more spaces than ZL, while IT (Takahashi) lies midway between the two ZL options.
In general, GC includes more hapax, which can be explained by the specific character set definition it uses.
In a second iteration, I have simplified all five files by translating them to a reduced character set, similar to the Cuva alphabet I have occasionally used at my web site. This results in five new observation points, which have been added to the plot below:
[attachment=5507]
This has almost no impact on the Takahashi transliteration, which already uses a limited character set.
It has only limited impact on the ZL transliteration(s), whose "special" characters all tend to be rare, while it has a major impact on the GC transliteration(s), bringing the points down to lie (more or less) on a straight line through all points.
The variation is still very significant, due to the definition of words, which strongly depends on how many spaces there really are.