27-06-2020, 11:54 PM
Due to the discussion in another thread, I was re-reading the 20-year-old article by Stolfi on the VMS vord length distribution.
In short, its results are as follows. If we set aside the vord frequencies in the corpus, focus just on the Voynich "vocabulary" (the set of distinct vords), and calculate the distribution of vord lengths over this vocabulary, we find that a) unlike at least some natural languages (English, Latin), it is symmetrical, and b) it is neatly approximated by a binomial curve.
Of course, the results may in general depend on the exact transcription. So it must be said at once (as Stolfi does) that he used Currier's approach: EVA ch and sh are counted as single symbols, and so are EVA cth, ckh, etc.
For those not deep into probability theory, the binomial distribution is the probability distribution law which holds for a series of N statistically independent "tests", each of which yields one of two outcomes with known probabilities p and q. The values p and N are the parameters of the binomial law, and q is simply equal to 1-p. Let's call the test result with probability p a "success" for brevity. Then the probability that you will have K successes out of N tests is B(K), where B(x) is the binomial law formula, which includes N and p as parameters.
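For readers who prefer code to formulas, here is a minimal Python sketch of that law, B(K) = C(N, K) * p^K * q^(N-K). The function name and the parameter values in the example (N = 4, p = 0.5) are only illustrative and are not taken from Stolfi's article.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent tests,
    each succeeding with probability p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# Illustrative values only: 4 tests, success probability 0.5.
n, p = 4, 0.5
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probabilities):
    print(f"K = {k}: {prob:.4f}")
```

The probabilities over K = 0..N always add up to 1, and for p = 0.5 the list is symmetrical around N/2.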
So Stolfi approximated the distribution with a binomial curve with parameters p = 0.5 and N = 9 (I don't know why he took N = 9, most probably because that gave the best fit), only he needed to shift the whole curve by 1 to the right.
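Written out, and assuming that "shifted by 1 to the right" means a vord of length k is matched with K = k - 1 successes (this reading is an assumption, not a quote from the article), the fitted curve would be:

```latex
P(\text{length} = k) = \binom{9}{k-1} \left(\tfrac{1}{2}\right)^{9},
\qquad k = 1, 2, \dots, 10,
\qquad
\text{mean} = 1 + Np = 5.5,
\qquad
\text{standard deviation} = \sqrt{Np(1-p)} = 1.5 .
```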
Based on this, Stolfi put forward some considerations about what technique may have produced such a vord length distribution.
What Stolfi omitted, or maybe missed, is that for large values of N the binomial distribution becomes close to the normal (Gaussian) one. So what he observed may have been not binomial in nature but Gaussian (or maybe something that becomes close to Gaussian for large values of N); only, because the vocabulary size is quite large, the two would look the same, and he mistook it for binomial.
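How close the two curves are can be checked numerically. Below is a small Python sketch, assuming the normal curve is given the binomial's own mean Np and standard deviation sqrt(Np(1-p)); N = 9 and p = 0.5 are the values of Stolfi's fit mentioned above, and the function names are just illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of exactly k successes in n tests."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Density of the normal (Gaussian) law at point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

n, p = 9, 0.5
mu = n * p                           # mean of the binomial law
sigma = math.sqrt(n * p * (1 - p))   # its standard deviation

# One row per K: (K, binomial value, normal value at the same point).
rows = [(k, binomial_pmf(k, n, p), normal_pdf(k, mu, sigma)) for k in range(n + 1)]
max_diff = max(abs(b - g) for _, b, g in rows)

for k, b, g in rows:
    print(f"{k:2d}  binomial={b:.4f}  normal={g:.4f}")
print(f"largest absolute difference: {max_diff:.4f}")
```

With these parameters the two columns differ by less than 0.01 at every point, so on a plot the curves would be hard to tell apart.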
So here is what I did: I built the vord length distribution (using the Voynich Reader tool and the Takahashi transcription, which, mind you, uses EVA and is thus different from the Currier approach used by Stolfi; vords with dubious characters were excluded), calculated estimates of the expected value (EV = 6.2) and the RMS deviation (= 1.64), and built the normal curve with the same EV and RMS. And what do you think I got?
[attachment=4474]
What would that mean?
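For anyone who wants to repeat this kind of comparison on a transcription of their choice, here is a rough Python sketch of the steps: take the distinct vords, build the length distribution, estimate the mean and the RMS deviation, and evaluate the normal curve with the same two values. It is not the Voynich Reader procedure, the function names are made up, and the short word list is only a placeholder to make the script run, not real vocabulary counts. Length is counted here simply as the number of characters in each string.

```python
import math
from collections import Counter

def length_distribution(words):
    """Share of distinct words (the 'vocabulary') having each length."""
    vocabulary = set(words)
    counts = Counter(len(word) for word in vocabulary)
    total = len(vocabulary)
    return {length: counts[length] / total for length in sorted(counts)}

def mean_and_rms(dist):
    """Mean and RMS deviation of a {length: share} distribution."""
    mean = sum(length * share for length, share in dist.items())
    variance = sum(share * (length - mean) ** 2 for length, share in dist.items())
    return mean, math.sqrt(variance)

def normal_pdf(x, mu, sigma):
    """Density of the normal (Gaussian) law at point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Placeholder input only; replace with the words of a real transcription.
sample_words = ["daiin", "chedy", "ol", "qokeedy", "daiin", "shedy", "okaiin", "ar"]

dist = length_distribution(sample_words)
mu, sigma = mean_and_rms(dist)

print(f"mean = {mu:.2f}, RMS deviation = {sigma:.2f}")
for length, share in dist.items():
    print(f"{length:2d}  observed={share:.3f}  normal={normal_pdf(length, mu, sigma):.3f}")
```

Note that repeated words are counted once, since the distribution is over the vocabulary and not over the running text.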
