The Voynich Ninja
Vord length distribution - Printable Version




Vord length distribution - Anton - 27-06-2020

Prompted by the discussion in another thread, I was re-reading Stolfi's 20-year-old article on the VMS vord length distribution.

In short, its results are as follows. If we set aside the vord frequencies in the corpus, focus only on the Voynich "vocabulary" (the set of distinct vords), and calculate the distribution of vord lengths in this vocabulary, we find that a) unlike at least some natural languages (English, Latin), it is symmetrical, and b) it is neatly approximated by a binomial curve.

Of course, generally, the results may depend on the exact transcription. So it must be said at once (which Stolfi does) that he used Currier's approach (EVA ch and sh counted as single symbols, and so are EVA cth, ckh, etc.).

For those not deep into probability theory: the binomial distribution is the probability law that holds for a series of N statistically independent "tests", each of which yields one of two outcomes with known probabilities p and q. The values p and N are the parameters of the binomial law, and q is simply equal to 1-p. Let's call the test result with probability p a "success" for brevity. Then the probability of getting K successes out of N tests is B(K), where B(x) is the binomial law formula, with N and p as parameters.

So Stolfi approximated the distribution by a binomial curve with parameters p = 0.5 and N = 9 (I don't know why he took N = 9, most probably because that gave the best fit), except that he needed to shift the whole curve to the right by 1.
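As a rough illustration of the shifted binomial model described above, here is a minimal Python sketch. The function names are mine, and the only inputs taken from the post are N = 9, p = 0.5 and the shift of 1; it is not Stolfi's own code.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent tests."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def shifted_binomial(length, n=9, p=0.5, shift=1):
    """Model probability of a vord of the given length: B(length - shift)."""
    return binomial_pmf(length - shift, n, p)

# Vord lengths 1..10 correspond to 0..9 successes after the shift.
model = {length: shifted_binomial(length) for length in range(1, 11)}
for length, prob in model.items():
    print(f"{length:2d}  {prob:.4f}")
```

With these parameters the model is symmetric around length 5.5, and lengths 5 and 6 are equally likely.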

Based on this, Stolfi offered some considerations about what technique could have produced such a vord length distribution.

What Stolfi omitted, or maybe missed, is that for large values of N the binomial distribution becomes close to the normal (Gaussian) one. So what he observed may have been not binomial in nature but Gaussian (or something that becomes close to Gaussian for large N); because the vocabulary size is quite large, the two would look the same, and he may have mistaken it for binomial.
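To get a feeling for how close the two curves are, one can compare the binomial probabilities with the normal density of the same mean and standard deviation at each integer. This is only a sketch with helper names of my own choosing; it says nothing about the VMS data itself.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent tests."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))

def max_gap(n, p):
    """Largest absolute difference between binomial and normal values at k = 0..n."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd)) for k in range(n + 1)
    )

for n in (9, 30, 100):
    print(n, round(max_gap(n, 0.5), 5))
```

Even for N = 9 and p = 0.5 the largest gap is below 0.01, so on a plot of this kind the two curves would be hard to tell apart by eye.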

So I built the vord length distribution (using the Voynich Reader tool and the Takahashi transcription, which uses EVA and thus differs from Currier's approach used by Stolfi; vords with dubious characters excluded), calculated estimates of the EV (expected value, = 6.2) and the RMS deviation (= 1.64), and built the normal curve with the same EV and RMS. And what do you think I got?

[attachment=4474]

What would that mean?
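For anyone who wants to repeat this kind of count, the procedure can be sketched roughly as follows. The sample tokens are a made-up handful of EVA-like strings, not the Takahashi transcription, and every EVA letter is counted as one character; the function name is also just an assumption for illustration.

```python
from collections import Counter
from math import sqrt

def vocabulary_length_stats(tokens):
    """Return (histogram, mean, rms deviation) of vord lengths over distinct vords."""
    vocabulary = set(tokens)  # each distinct vord is counted once, ignoring frequency
    histogram = Counter(len(vord) for vord in vocabulary)
    size = len(vocabulary)
    mean = sum(length * count for length, count in histogram.items()) / size
    variance = sum(
        count * (length - mean) ** 2 for length, count in histogram.items()
    ) / size
    return histogram, mean, sqrt(variance)

# Illustrative tokens only; a real run would read a full transcription file.
sample_tokens = "daiin chedy daiin qokeedy ol chol shedy ol".split()
histogram, mean, rms = vocabulary_length_stats(sample_tokens)
print(sorted(histogram.items()), round(mean, 2), round(rms, 2))
```

The resulting mean and RMS deviation are the two numbers needed to draw a normal curve over the histogram.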


RE: Vord length distribution - Anton - 28-06-2020

By way of comparison, here's the word length distribution of a vocabulary of 6818 words (such is the size of the VMS vocabulary as per my test in the above post), where each word is coded with a unique binary number, starting with 1 and ending with 1101010100010.

It's self-evident that this distribution would be highly asymmetric: each subsequent word length occurs twice as often as the previous one (except the last, 13-digit length, which is cut off at 6818).

[attachment=4475]
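The binary case is easy to reproduce. A minimal sketch, assuming plain binary numerals without leading zeros:

```python
from collections import Counter

VOCAB_SIZE = 6818  # vocabulary size taken from the first post

# Length in binary digits of every number from 1 to VOCAB_SIZE.
binary_lengths = Counter(len(format(n, "b")) for n in range(1, VOCAB_SIZE + 1))

for length in sorted(binary_lengths):
    print(length, binary_lengths[length])
```

Lengths 1 to 12 occur 1, 2, 4, ..., 2048 times, and the remaining 2723 numbers have 13 digits.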

(Of course, binary numbers would not have been known to a XV century author).

On the other hand, here's the word length distribution of a vocabulary of the same size (6818 words) where each word is coded with a unique Roman numeral, starting with I and ending with MMMMMMDCCCXVIII.

Blue is the distribution itself, and red is the Gaussian curve with the same EV (8.9) and RMS (2.7).

[attachment=4476]
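The Roman numeral case can be sketched in the same way. The conversion below assumes the usual subtractive notation (IV, IX, XL, ...) and simply repeats M for the thousands, which is consistent with the last numeral quoted above; the exact convention used for the plot is otherwise an assumption.

```python
from collections import Counter
from math import sqrt

ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def to_roman(n):
    """Convert a positive integer to a Roman numeral (subtractive notation)."""
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

VOCAB_SIZE = 6818
lengths = [len(to_roman(n)) for n in range(1, VOCAB_SIZE + 1)]
roman_histogram = Counter(lengths)

roman_mean = sum(lengths) / VOCAB_SIZE
roman_rms = sqrt(sum((x - roman_mean) ** 2 for x in lengths) / VOCAB_SIZE)
print(round(roman_mean, 1), round(roman_rms, 1))
```

Under these assumptions the mean and RMS deviation come out close to the 8.9 and 2.7 given above.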


RE: Vord length distribution - Alin_J - 28-06-2020

Interesting. The idea that words might be mapped and encoded into some kind of numerical representation in the VM has also come up before, hasn't it? But perhaps with other alterations and obfuscations as well, because the VM writing seems much more complex than, for example, Roman numerals.


RE: Vord length distribution - -JKP- - 28-06-2020

(28-06-2020, 07:21 AM)Alin_J Wrote:
Interesting. The idea that words might be mapped and encoded into some kind of numerical representation in the VM has also come up before, hasn't it? But perhaps with other alterations and obfuscations as well, because the VM writing seems much more complex than, for example, Roman numerals.


Yes, the similarity of some of the glyphs to Roman numerals has been noted by quite a few researchers.

I have frequently mentioned that the VMS might be numbers, but I don't think they necessarily have to be Roman numerals. They might be, but there are other possibilities, as well. The reason I have given for the possibility of numbers is the positional characteristics of the glyphs. This is not typical of natural language, but is absolutely essential for numbers.


RE: Vord length distribution - Koen G - 28-06-2020

This is an interesting line of investigation. But I see one big problem, which you already mention yourself: it depends on the transcription. This does not mean that all transliteration systems are equally valid options: we know almost for certain that EVA inflates word length, so it simply cannot be used for this. The problem is that we don't quite know what to use instead.


RE: Vord length distribution - -JKP- - 28-06-2020

It's not difficult to write a script that converts one glyph to another.

For example, suppose cTz were to be considered one glyph (or two glyphs) instead of three: it takes a couple of minutes to generate a new transcript with all the changes made. Even a search-and-replace engine can do this.
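A conversion of this kind might look roughly like the sketch below. It uses the EVA groups named in the first post (ch, sh, cth, ckh) rather than cTz, and the single-character placeholders they are mapped to are arbitrary choices made only for this illustration.

```python
# Hypothetical mapping: multi-letter EVA groups collapsed to one placeholder symbol each.
MERGES = [("cth", "1"), ("ckh", "2"), ("ch", "C"), ("sh", "S")]

def merge_glyphs(word, merges=MERGES):
    """Replace each listed letter group with its single placeholder symbol."""
    for group, symbol in merges:
        word = word.replace(group, symbol)
    return word

for vord in ("chedy", "shckhy", "qokeedy"):
    merged = merge_glyphs(vord)
    print(vord, len(vord), "->", merged, len(merged))
```

Running the length statistics on the merged vords instead of the raw EVA strings then shows how much a given glyph definition shifts the distribution.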


RE: Vord length distribution - MarcoP - 28-06-2020

(28-06-2020, 07:21 AM)Alin_J Wrote: Interesting. The idea that words might be mapped and encoded into some kind of numerical representation in the VM has also come up before, hasn't it?

The best exploration of the subject I am aware of is [link]. In particular, that system results in systematic quasi-reduplication: as far as I know, this is the only encoding of meaningful text that reproduces that particular feature of Voynichese. I don't remember if Rene also commented on word-lengths produced by mod2, but Anton's distribution based on Roman numerals is impressive...

(28-06-2020, 07:40 AM)-JKP- Wrote: The reason I have given for the possibility of numbers is the positional characteristics of the glyphs. This is not typical of natural language, but is absolutely essential for numbers.

Obviously positional characteristics of glyphs have little to do with languages or numbers and everything to do with the writing systems used to represent them.
Half of the alphabet we use in our current writing system (upper-case characters) consists of glyphs that are basically constrained to appear in the first position of words (this has been changing recently with the increasing usage of camel-case).
Late medieval writing systems had positional constraints that were closer to Voynichese than the writing system we use now: for instance, in Gothic scripts, characters like 's' and 'r' had different variants that were positionally constrained. Some Latin abbreviation symbols were mostly used for suffixes or prefixes and were basically constrained to appear at the beginning or end of words.

Also, I am not sure that a different positional behaviour of glyphs is essential to writing numbers.
If one considers how we currently write numbers (Indo-Arabic numerals), it is easy to see that the glyphs are independent of each other. Unlike our system for writing languages, and unlike Roman numerals, there are no preferences for symbols to follow each other. One can say that vowels are often followed and preceded by consonants; for instance, 'ne' and 'en' are both frequent in written English. In Roman numerals, 'V' is often followed or preceded by 'I'. But 1, 2, 3, etc. have no preference to appear after or before any other digit. Similarly, in the binary system, the various combinations of 0 and 1 (00, 01, 10, 11) occur with basically the same frequency, with the exception that leading zeros only appear when a fixed length is needed for some reason.
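One way to look at such preferences is to count adjacent symbol pairs in a list of written numbers. The following is only a sketch: the range 1..999 is an arbitrary choice, the helper name is mine, and the same function could be fed Roman numerals or words of a text.

```python
from collections import Counter

def pair_counts(words):
    """Count every pair of adjacent symbols within each word."""
    counts = Counter()
    for word in words:
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts

# Adjacent-digit pairs in decimal and binary numerals for 1..999.
decimal_pairs = pair_counts(str(n) for n in range(1, 1000))
binary_pairs = pair_counts(format(n, "b") for n in range(1, 1000))

print(decimal_pairs.most_common(5))
print(sorted(binary_pairs.items()))
```

Comparing how evenly the counts are spread over the possible pairs gives a simple measure of how strongly symbols prefer particular neighbours in each system.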

Coming back to the subject of this thread:

(27-06-2020, 11:54 PM)Anton Wrote: What would that mean?

I don't know about the Gaussian distribution, but here is what Stolfi wrote about his observations:

Stolfi Wrote:...the East Asian monosyllabic languages [Chinese, Vietnamese, and Tibetan] *do* have symmetric, binomial-like word length distributions, just like Roman numerals...

Why binomial?

Why should those languages have a binomial-like syllable-length distribution? Well, as observed in the previous page, if you add many random variables with arbitrary distributions, you get a random variable with a binomial-like, bell-shaped distribution, which approaches a Gaussian as you add more and more terms. (Technically, the histogram of the sum of two independent variables is the convolution of their histograms; and the convolution of N arbitrary histograms, as N increases, generally becomes more and more like a Gaussian distribution.)

Now, unlike a polysyllabic word, a single syllable has only a fixed number N of phonetic "slots" (attributes), corresponding to separate muscular controls; and each slot can have a finite number of possible values. In the Chinese syllable, for instance, the initial consonant is one slot, which can have some 20 values including "silent". Another slot would be the glide before the main vowel ("i", "u", or "none", as in "lian", "luan", or "lan"). The main vowel, the secondary glide, the final consonant, and the syllable tone would be the other slots.

In principle, then, a syllable could be written as a sequence of N symbols, each corresponding to one phonetic slot. However, that would be a rather inefficient encoding, because the values of each slot have highly different frequencies in common use. (In particular, the most frequently used words will tend to use slot values that can be articulated with less time or effort.)

For that reason, almost all scripts follow the model of Roman numerals, where one value for each slot is assigned as "default" and not written, while the other values are mapped to distinctive symbols. Thus the "silent" consonant and "none" glides are omitted in pin-yin; the "a" vowel is omitted in Hindu scripts; and the mid level tone of Vietnamese is not marked in Quo^'c Ngu+~. Moreover, if a slot has many possible values, some of them are often encoded by sequences of two or more symbols, such as "ch" in Chinese or "u+" in Vietnamese.

Thus, in all those scripts, the written syllable is the concatenation of N variable-length strings. Assuming that the value of a slot is to some extent independent of other slots, the word-length histogram is therefore the convolution of N slot-length histograms, and therefore is expected to resemble a binomial distribution.
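The convolution argument in the quoted passage can be sketched in a few lines. The slot histograms here are invented for illustration: each of 9 hypothetical slots is either omitted (length 0) or written with one symbol (length 1) with equal probability, which is the special case that gives exactly a binomial with N = 9 and p = 0.5.

```python
def convolve(a, b):
    """Histogram of the sum of two independent lengths with histograms a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def word_length_histogram(slot_histograms):
    """Convolve all slot-length histograms; index k holds P(total length = k)."""
    result = [1.0]  # before any slot is written, the length is 0 with certainty
    for slot in slot_histograms:
        result = convolve(result, slot)
    return result

# Hypothetical slots: [P(length 0), P(length 1)] for each of 9 slots.
slots = [[0.5, 0.5]] * 9
histogram = word_length_histogram(slots)
for length, prob in enumerate(histogram):
    print(length, round(prob, 4))
```

Slots with unequal probabilities, or with values written as two or more symbols, can be passed in the same way, for example `[0.6, 0.3, 0.1]` for a slot that is usually omitted and occasionally takes two symbols.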



RE: Vord length distribution - -JKP- - 28-06-2020

MarcoP Wrote:Obviously positional characteristics of glyphs have little to do with languages or numbers and everything  to do with the writing systems used to represent them.


Positional characteristics of glyphs (if they are numeric) have a great deal to do with numbers.

31 is not the same as 13.  143 is not the same as 413. VI is not the same as IV. Without the position, you can't interpret the numbers.

Yes, I agree that they have to do with the writing systems used to represent them, but I am not aware of any medieval writing system for numbers (I'm talking about a typical system that can represent a large set of numbers) that is not positional.


I suppose you could invent another way to show the priority (as with a marker glyph), or create individual glyphs for each number (in other words, 12 might be represented by one glyph rather than a systematic combination of 1 & 2), but if the VMS or any other undeciphered script is designed this way, it would be very difficult to decipher and quite unusual for the time (or even for the 16th century).


RE: Vord length distribution - MarcoP - 28-06-2020

(28-06-2020, 10:33 AM)-JKP- Wrote:
MarcoP Wrote:Obviously positional characteristics of glyphs have little to do with languages or numbers and everything  to do with the writing systems used to represent them.


Positional characteristics of glyphs (if they are numeric) have a great deal to do with numbers.

31 is not the same as 13.  143 is not the same as 413. VI is not the same as IV. Without the position, you can't interpret the numbers.

Hi JKP, you wrote that "this is not typical of natural language, but is absolutely essential for numbers". What you are saying now seems to me to be the same for written languages and written numbers. 31 is different from 13 and 'no' is different from 'on'. I am still missing your point. Isn't position also typically relevant for written languages?


RE: Vord length distribution - Alin_J - 28-06-2020

(28-06-2020, 09:55 AM)Koen G Wrote: This is an interesting line of investigation. But I see one big problem, which you already mention yourself: it depends on the transcription.

This is a graph showing the word-length distribution of the VM word vocabulary for the 101 transcription. In this transcription the average word length is 5 characters. Shown together with it is a fitted normal distribution curve (as Anton stated before, a binomial distribution approaches the normal distribution when the number of trials N is large, so the normal curve acts as its continuous counterpart; as a rule of thumb, min(p, 1-p) should also be > 0.1).

[attachment=4480]

The distribution is also clearly symmetrical and shows a good fit with the normal curve, as with the EVA transcription. So I don't think the transcription format changes the shape of the distribution much, only perhaps the average length and standard deviation.
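If one wants a number for "symmetrical" rather than judging by eye, the skewness of the length histogram is one possible measure. This is only a sketch; the two histograms below are toy examples, not counts from the 101 or EVA transcriptions.

```python
def skewness(histogram):
    """Skewness of a {length: count} histogram; 0 for a perfectly symmetric one."""
    total = sum(histogram.values())
    mean = sum(length * count for length, count in histogram.items()) / total
    m2 = sum(count * (length - mean) ** 2 for length, count in histogram.items()) / total
    m3 = sum(count * (length - mean) ** 3 for length, count in histogram.items()) / total
    return m3 / m2**1.5

symmetric = {1: 1, 2: 2, 3: 1}           # toy bell-like histogram
doubling = {1: 1, 2: 2, 3: 4, 4: 8}      # toy histogram shaped like the binary case

print(round(skewness(symmetric), 3))
print(round(skewness(doubling), 3))
```

A value near zero supports the symmetry claim, while a histogram that keeps doubling with length, like the binary example earlier in the thread, gives a clearly negative value.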