The Voynich Ninja

Full Version: sh_ and ch_ compose the same words
(17-11-2019, 02:21 PM)MarcoP Wrote: While I understand the usefulness of percentages in removing the spurious correlation induced by varying page-lengths, I second Torsten's question. I am also interested in understanding what can be the meaning of overall correlation being greater than that of the individual sections.

The reason overall correlation is greater is that the line is defined mostly by the extreme points from the biological section.
Torsten is correct in stating that the calculation of R assumes that the data is normally distributed. Of course, that rarely happens, which is one reason why this is not a good statistic to calculate in these very non-normal circumstances.
(17-11-2019, 03:25 PM)ReneZ Wrote:
(17-11-2019, 02:21 PM)MarcoP Wrote: The plot I get for chedy / shedy seems to me reasonably close (though not identical) to the one computed by Rene on the basis of Nablator's spreadsheet. I have included {0,0} pages. The red line is y=x.

I think that nablator's spreadsheet was based on the Takeshi Takahashi transcription. If you used another one, the difference would be explained.

I am sorry, I forgot to mention that I used Takahashi's transcription. There must be some other reason for the differences.
(17-11-2019, 04:31 PM)MarcoP Wrote: I am sorry, I forgot to mention that I used Takahashi's transcription. There must be some other reason for the differences.
Maybe if you used regex \b to search for word boundaries, it assumes that '?' is a word separator.
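To illustrate the \b point, here is a minimal Python sketch. The sample line is made up, and it assumes a transcription in which '.' separates words and '?' marks an unreadable glyph; the actual file format may differ.

```python
import re

# Hypothetical transcription fragment (not taken from any real file):
# '.' separates words, '?' stands for an unreadable glyph.
line = "qokeedy.?chedy.shedy.chedy?.ochedy.chedy"

# \b matches between a word character ([A-Za-z0-9_]) and a non-word
# character, so both '.' and '?' act as boundaries here.
with_b = re.findall(r"\bchedy\b", line)

# Splitting on the real separator counts only exact, fully readable words.
exact = [w for w in line.split(".") if w == "chedy"]

print(len(with_b), len(exact))
```

With \b, the partly unreadable words "?chedy" and "chedy?" are counted as chedy, while splitting on '.' counts only the clean one. A difference of this kind could explain small discrepancies between two counts made from the same transcription.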
(17-11-2019, 03:36 PM)DONJCH Wrote: Torsten is correct in stating that the calculation of R assumes that the data is normally distributed.

I would love to see a reference for that.

If two quantities are fully correlated, their observations are on a straight line, and this is completely independent of their distribution.

Even if one were to feel more comfortable about computing the correlation of two normally distributed quantities, I would (again) like to point out that the distribution of the word counts as originally proposed by Torsten is very far from normally distributed (probably the furthest away), with all values being positive and the mass of them for small counts.
On the other hand, the quantity defined by nablator is much closer to a normal distribution. The correlation values computed from it are very close to those computed from the percentages.
(17-11-2019, 08:50 PM)ReneZ Wrote:
(17-11-2019, 03:36 PM)DONJCH Wrote: Torsten is correct in stating that the calculation of R assumes that the data is normally distributed.


I would love to see a reference for that.


See [link] (Charles J. Kowalski 1972). Skip to the very short conclusion at the end of the paper for a summary.

This doesn't mean that the sample distribution must be normal, provided the samples are large enough to represent the real-world data. The reason for that is [link]. In this way it is possible to assume that the frequencies for all 'ch'/'sh' words are representative of all the words in the Voynich manuscript.
(17-11-2019, 03:36 PM)DONJCH Wrote:
(17-11-2019, 02:21 PM)MarcoP Wrote: While I understand the usefulness of percentages in removing the spurious correlation induced by varying page-lengths, I second Torsten's question. I am also interested in understanding what can be the meaning of overall correlation being greater than that of the individual sections.

The reason overall correlation is greater is that the line is defined mostly by the extreme points from the biological section.
Torsten is correct in stating that the calculation of R assumes that the data is normally distributed. Of course, that rarely happens, which is one reason why this is not a good statistic to calculate in these very non-normal circumstances.

The calculation of correlation doesn't require normality; it only requires that the data is not constant. The interpretation and analysis of the result is discussed at [link]. In our case it's perfectly applicable, I think.

The important thing, however, is that the correlation value by itself says little. Is a value of 0.5 high or low? To answer that, one must also calculate the correlation between other types of words. And if one thinks that EVA-sh and EVA-ch are the same based on correlation values, one must be very cunning to explain the almost equally high correlation coefficient for EVA-ok and EVA-qok :)
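To make the first point concrete, here is a minimal sketch of the sample Pearson coefficient in plain Python. The per-page counts are invented for illustration and are not taken from any transcription; the function name pearson_r is likewise just a label chosen here.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation; defined whenever neither series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-page counts (not real data): skewed, mostly small values.
ch = [0, 0, 1, 0, 2, 1, 0, 9, 12, 0]
sh = [0, 1, 0, 0, 1, 1, 0, 7, 10, 0]
r_ch_sh = pearson_r(ch, sh)

# Points on an exact straight line give r = 1 whatever their distribution.
r_line = pearson_r([1, 2, 4, 50], [3, 5, 9, 101])

print(round(r_ch_sh, 3), round(r_line, 3))
```

The same function can be run on any other pair of word families (for example ok- and qok- counts per page) to get a baseline against which a ch- / sh- value can be judged. Note that in the invented data the two large pages dominate the result, which is the same effect as a fitted line being defined mostly by a few extreme points.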
(17-11-2019, 09:58 PM)farmerjohn Wrote: The important thing, however, is that the correlation value by itself says little. Is a value of 0.5 high or low? To answer that, one must also calculate the correlation between other types of words. And if one thinks that EVA-sh and EVA-ch are the same based on correlation values, one must be very cunning to explain the almost equally high correlation coefficient for EVA-ok and EVA-qok :)


Hi Farmerjohn,
the thread title and poll title are different, I found that confusing.

Thread: sh_ and ch_ compose the same words
Poll: CH_ and SH_ words are similar


The poll statement is much weaker, and I feel I can answer "yes" to that. The two families of words are similar, for instance (as Koen said [link]) they can be compared with inflected words, e.g. also the families of Latin words ending -arum and -ae are similar (but I did not make any quantitative comparison). As Torsten also says, there is some kind of relationship between these two classes of words. The nature of this relationship is of course debatable, but I believe its existence is supported by considerable evidence.

About the much stronger statement in the thread title, I don't know. If I understand correctly, Davidsch believes that the difference between ch- and sh- is irrelevant, the kind of inconsistent spelling you see in medieval manuscripts:

(14-11-2019, 03:39 PM)Davidsch Wrote: (In my bowl apples and pears are both fruit. chedy = shedy)


(15-11-2019, 12:48 PM)Davidsch Wrote: If we compare the level of experience and intelligence of the author of the VMS to that of an equally medieval text, you will find that such written texts, at least in the occult genre, are very inconsistent and many words are used inconsistently.

You will very frequently find in the same text that every possible variation is used: color, colour, kolor, collor, kolore, and such.

Rene also finds that the correlation patterns could indicate arbitrariness. He also mentions that the difference between ch- and sh- could be a "trivial" one, which I interpret to mean "marginally meaningful". An example of arbitrariness in medieval scripts could be the presence of a dot above 'i': it can be present or absent, also in the same word on the same page. At the moment, I cannot think of examples of trivial differences: maybe the difference between upper-case and lower-case? This is something that conveys some meaning, but that was often treated arbitrarily in manuscripts (while there is total functional identity between dotted-i and dotless-i, the first is only easier to read).


(16-11-2019, 06:51 AM)ReneZ Wrote: The first thing that comes to mind is that this is an indication of arbitrariness. Like pulling words arbitrarily out of a hat. Or, as if the difference between ch and Sh is meaningless. This way of thinking causes other problems, e.g. in the area of entropy, and one would be pushed into the direction that Voynich words are not complete words, but verbose renditions of something smaller.

'Verbose' of course implies that there is a meaningless component in the text, but it is not all meaningless.

The difference between ch and Sh could also be 'trivial' rather than meaningless. What do I mean with that?
If the text is an encoding of some plain text, then the plain text was of course a handwritten text. The curl on top of the Sh could be a representation of a serif. Just one more idea....


From the point of view of auto-copying, the question makes little sense: all words are arbitrary variations of other words.

I agree about the relevance of the analogy between ch- / sh- and qok- / ok-. It is something that I too mentioned [link]. I believe that qok- / ok- are related too, but the relationship is not the same (and not as strong) as that between ch- and sh-.

A major difference is that for qo- / o- there is some evidence suggesting that q- is added to o- words under certain circumstances:
* q- is quite rare in labels
* qo- and o- words have preferences for following different suffixes

I am aware of two possible explanations for these phenomena:
* a phonetic sandhi effect (as discussed by Emma and myself in [link])
* a visual preference due to "graphic harmony" principles as discussed by [link] and referenced by Timm and Schinner.

The point is, we have clear differences in the behaviour of qo- vs o- (and qok- vs ok-) words. This is not so clear for ch- / sh-: even when analysing the distribution after different word endings, the two classes of words are extremely similar. These graphs illustrate the actual/expected ratios of last-first combinations at word boundaries: 1 corresponds to the actual number of occurrences being exactly as expected. One can see that there are several cases in which qok- is more frequent than expected while ok- is less frequent than expected, and vice versa. For instance (as first observed by Currier) -y.q- is much more frequent than expected: words ending with -y tend to be followed by words starting with q-, while words starting with o- are relatively rare in this position.
Nothing similar happens for ch- / sh-: their deviation from the expected is always consistent. So, it is easier to exclude the identity of qok- and ok- than that of ch- and sh-.

[attachment=3721]
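For anyone who wants to reproduce this kind of graph, here is a rough Python sketch of one way to compute actual/expected ratios. It is not necessarily the procedure behind the attachment: it assumes that "expected" means the count predicted if word endings and following word starts were independent, that only pairs within the same line are counted, and that the last EVA character stands for the word ending. The function name boundary_ratios and the toy lines are made up.

```python
from collections import Counter

def boundary_ratios(lines, prefixes):
    """Actual/expected ratio for (last glyph of a word, prefix of the next word).

    Expected count assumes independence: ends[e] * starts[s] / total.
    """
    ends, starts, pairs = Counter(), Counter(), Counter()
    total = 0
    for words in lines:
        for a, b in zip(words, words[1:]):  # consecutive words in one line
            end = a[-1]
            start = next((p for p in prefixes if b.startswith(p)), "other")
            ends[end] += 1
            starts[start] += 1
            pairs[end, start] += 1
            total += 1
    return {
        (e, s): pairs[e, s] * total / (ends[e] * starts[s])
        for (e, s) in pairs
    }

# Toy input (invented, not from the manuscript): each line is a list of words.
lines = [
    ["chedy", "qokedy", "shedy", "qokal"],
    ["okal", "chedy", "qokain", "okar"],
]
ratios = boundary_ratios(lines, ["qok", "ok", "ch", "sh"])
print(ratios[("y", "qok")])
```

A ratio above 1 means the combination occurs more often than independence would predict, below 1 less often. With real data one would compare, for each ending, the ratio for qok- against ok-, and the ratio for ch- against sh-.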

Some differences between ch- and sh- do exist (e.g. the preference of ch- words to appear at line end). Maybe they could be enough to argue against total identity, but the subject is certainly worth further investigation.
By the way: Schwerdtfeger
The Voynich Yahoo page will be deleted on 14.12.2019.
Last chance to have a look at some dates again.


I did not yet list the percentages of chedy and Shedy occurrences for the different sections.
They are:

              chedy   shedy
              -----   -----
Herbal B:     1.933   1.132
Biological:   2.979   3.485
Cosmo:        1.155   0.880
Stars (all):  1.732   1.048
Stars-Bio:    1.984   1.297
Stars (rest): 1.355   0.677

Total:        2.092   1.780

This shows that there is a general trend, but still significant variability between the different sections.
It is mainly in the Biological section that there is also an internal correlation.
I wonder if this could be related to the hypothesis that this quire is a mixture of two original quires. There has been quite some speculation about that.
(18-11-2019, 03:54 PM)ReneZ Wrote: This shows that there is a general trend, but still significant variability between the different sections.


Indeed, there is a trend, especially in the usage of <chedy>. I also fully agree with your previous statement "The group of bifolios 103+116 , 107+112, 108+111 is more like the biological text than the other three."

This way it is possible to reorder the sections of the VMS: "Reordering the sections with respect to the frequency of token <chedy> replaces the seemingly irregular mixture of two separate languages by the gradual evolution of a single system from 'state A' to 'state B.'" (Timm & Schinner 2019, p. 7)

This is the resulting table for <chedy> and <shedy> (see [link], p. 7):

             chedy shedy word count
             ----- ----- ----------
Herbal     (A)   1     0      8,087
Pharma     (A)   1     1      2,529
Astro            4     0      2,136
Cosmo           24    17      2,691
Herbal     (B)  62    35      3,233
Stars      (B) 190   113     10,673
Biological (B) 210   247      6,911
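As a quick cross-check, the counts in the table above can be turned into percentages of the section word count. This is only a small sketch using the numbers exactly as listed; the results need not match percentages computed from a different transcription or different section boundaries.

```python
# (section, chedy, shedy, word count), copied from the table above
rows = [
    ("Herbal (A)", 1, 0, 8087),
    ("Pharma (A)", 1, 1, 2529),
    ("Astro", 4, 0, 2136),
    ("Cosmo", 24, 17, 2691),
    ("Herbal (B)", 62, 35, 3233),
    ("Stars (B)", 190, 113, 10673),
    ("Biological (B)", 210, 247, 6911),
]

def pct(count, total):
    """Occurrences as a percentage of the section's word count."""
    return 100.0 * count / total

table = {name: (pct(c, n), pct(s, n)) for name, c, s, n in rows}

for name, (chedy, shedy) in table.items():
    print(f"{name:<15}{chedy:7.3f}{shedy:7.3f}")
```

In relative terms, <chedy> rises from about 0.01% in Herbal (A) to about 3% in Biological (B). Herbal (B) and Stars (B) come out close together, at roughly 1.9% and 1.8%.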