Vord length distribution - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Vord length distribution (/thread-3261.html)

RE: Vord length distribution - Koen G - 28-06-2020
(28-06-2020, 04:21 PM)Anton Wrote: What strikes the eye in the Dostoevsky curve is that the left slope is neat while the right is not. I wonder how the picture would change if the text were abbreviated... Maybe it is abbreviation that "normalizes" the distribution.

I was thinking something like this too when I saw your last curve; the difference appears mostly in the longer words. What about syllabification? I think syllables in some languages may have a distribution similar to Voynichese.

RE: Vord length distribution - Anton - 28-06-2020
(28-06-2020, 02:11 PM)ReneZ Wrote: In that case, both odd features would be explained.

Ah, got it. With each successive thousand, the additional M digit increases the minimum length by one, while, at the same time, the number of different lengths within a thousand cannot be more than a thousand.

(28-06-2020, 05:16 PM)Koen G Wrote: What about syllabification? I think syllables in some languages may have a distribution similar to Voynichese.

That's what Stolfi seems to build his argumentation on, but as I said, I could not fully understand his idea. When it comes to linguistics, especially beyond Russian and English, I do not feel very comfortable.

RE: Vord length distribution - MarcoP - 28-06-2020
For what it's worth, these curves are based on the pages from the abbreviated Bonaventura manuscript I transcribed a while ago. The printed text contains 1107 word types, the transcribed text 1371. Unless I have messed something up, it seems that in this case the effect is not to make the curve closer to a Gaussian: short words are abbreviated as well, making the left shoulder even steeper.

RE: Vord length distribution - Koen G - 28-06-2020
I'm thinking more about forced splitting of words in European languages, like splitting before each vowel cluster, or something like that:

Th is is an ex ampl e s ent enc e.

A problem with this particular approach is, for example, the many single letters you'll get. But I'm surprised that in the example sentence I already got one instance of duplication and one of quasi-reduplication, and a tighter spread of word lengths.

RE: Vord length distribution - Anton - 28-06-2020
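Koen's splitting rule can be sketched in a few lines of Python. This is a hypothetical illustration (function name and regex are my own), assuming the rule is "insert a break before each maximal vowel cluster":

```python
import re

def split_before_vowel_clusters(text: str) -> str:
    """Break every word immediately before each maximal run of vowels."""
    out = []
    for word in text.split():
        # Split at positions where a vowel follows a non-vowel (or the word start);
        # the filter drops the empty leading fragment for vowel-initial words.
        parts = [p for p in re.split(r"(?<![aeiou])(?=[aeiou])", word,
                                     flags=re.IGNORECASE) if p]
        out.extend(parts)
    return " ".join(out)

print(split_before_vowel_clusters("This is an example sentence"))
# prints: Th is is an ex ampl e s ent enc e
```

The negative lookbehind keeps vowel clusters like "ou" together instead of splitting inside them.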
I think a random text (where each word is a random string constrained by length) will have a normal distribution of word type frequency. Does anybody have a sample?

RE: Vord length distribution - RobGea - 28-06-2020
Lol, I'm struggling along trying to follow. Is this the kind of thing you're after?

Attachments: text file and code to produce it. randomtext.txt (64,947 bytes)

Total number of words: 10000
Vocabulary: 8471
word length 1 :: frequency 26
word length 2 :: frequency 524
word length 3 :: frequency 958
word length 4 :: frequency 984
word length 5 :: frequency 1012
word length 6 :: frequency 978
word length 7 :: frequency 951
word length 8 :: frequency 997
word length 9 :: frequency 1007
word length 10 :: frequency 1034

[attachment=4489] [attachment=4490]

RE: Vord length distribution - Anton - 28-06-2020
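RobGea's attached script is not reproduced here, but a minimal Python sketch along the same lines (uniform random length 1-10 filled with random letters; the names and seed are my own) yields the same kind of table, counting word types per length:

```python
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def random_text(n_words=10000, max_len=10):
    """Each word: a uniformly random length 1..max_len, filled with random letters."""
    return ["".join(random.choice(ALPHABET) for _ in range(random.randint(1, max_len)))
            for _ in range(n_words)]

words = random_text()
types = set(words)
by_length = Counter(len(w) for w in types)
print("Total number of words:", len(words))
print("Vocabulary:", len(types))
for n in sorted(by_length):
    print(f"word length {n} :: frequency {by_length[n]}")
```

Note that short lengths are depressed in the type counts (at most 26 one-letter types exist), matching the small length-1 and length-2 figures in the table above.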
Thanks... yes, I get the same counts over your sample. It turns out this is not the process which leads us to a normal distribution. Indeed, with random generators which produce uniformly distributed results, the word type frequency distribution will be uniform.

To get what we want (binomial/normal) we can act in the success/failure fashion, kinda like what Stolfi discussed. Here's the word generation process. Suppose there are N targets, let's say thirteen, and we are shooting those targets in succession. Let's say the chance to hit a target is 50%. Those targets which we manage to hit, we fill with letters, one letter per target. It does not matter which letters; let's say we pick them randomly. (This will ensure that we mostly generate a new word each time, and do not repeat words already generated.) If we hit 5 of 13, this would be a 5-character word. If we hit 11 of 13, this is an 11-character word. And so on. If we hit zero targets (which will not happen very often, but is a realistic case), we don't record the word at all. Then we clear all 13 targets and repeat the process to generate the next word. In this way we generate a vocabulary of words, and the word length distribution will be binomial.

RE: Vord length distribution - Alin_J - 29-06-2020
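Anton's target-shooting process is easy to simulate. A minimal sketch (the names and the sample size are my own choices): each word's length is then Binomial(13, 0.5), with a mean length of about 6.5.

```python
import random
from collections import Counter

random.seed(1)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def shoot_word(n_targets=13, p_hit=0.5):
    """Each target is hit with probability p_hit; every hit adds one random letter."""
    return "".join(random.choice(ALPHABET)
                   for _ in range(n_targets) if random.random() < p_hit)

# Generate words, discarding the rare zero-hit (empty) ones, as described.
words = [w for w in (shoot_word() for _ in range(20000)) if w]
lengths = Counter(len(w) for w in words)
mean = sum(n * c for n, c in lengths.items()) / len(words)
print(f"mean word length: {mean:.2f}")  # close to 13 * 0.5 = 6.5
```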
(28-06-2020, 07:08 PM)Anton Wrote: I think a random text (where each word is a random string constrained by length) will have normal distribution of word type frequency.

I see now that some experiments have already been posted in this thread. But true, this will not give a normal distribution. If you treat space as just another character with a constant probability of occurrence s, this gives a probability p(n), where n is the length of a word, equal to

p(n) = s*(1-s)^(n-1), for n >= 1

This is a geometric distribution, which is asymmetrical (it has a long tail which is always decreasing). The average word length will be 1/s. If you constrain the length, you will only cut off the tail.

Edit: This is the distribution with respect to all words, not word types. Shorter words are more likely to be identical in a random string, so moving to word types will decrease the frequency of shorter words in the distribution.

RE: Vord length distribution - ReneZ - 29-06-2020
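Alin_J's model (space as an ordinary character with constant probability s) is easy to check numerically. A small sketch, with s = 0.2 chosen arbitrarily; word lengths are the run lengths of non-space characters, and the empirical mean should sit near 1/s:

```python
import random

random.seed(2)
s = 0.2  # probability that any given character is a space

# A long random character stream; split() reads off the non-space runs as words.
stream = "".join(" " if random.random() < s else "x" for _ in range(200_000))
lengths = [len(w) for w in stream.split()]

mean = sum(lengths) / len(lengths)
print(f"empirical mean length {mean:.2f}; geometric prediction 1/s = {1/s:.1f}")
```

The always-decreasing tail is visible directly: length-1 words are the single most frequent length.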
In this thread, the terms 'normal' and 'binomial' are arbitrarily mixed, which isn't a big problem because we're talking about an approximation of the real word length distribution. However, it is worthwhile to think more in terms of 'binomial' because of an interesting property of the binomial distribution.

I guess everyone is familiar with the simple trick to compute the binomial distribution for 'N' using a pyramid with the number 1 at the top, moving down to row N by adding the two numbers above, one to the left and one to the right (this is the well-known Pascal's triangle).

What does this have to do with this thread? Let me go in little steps. Let's assume that we have a vocabulary with a binomial word length distribution. Now we want to create a new vocabulary out of that, and we do this by optionally prefixing the words of the old vocabulary with the letter 'a', with a probability of 50%. (*) The new vocabulary has twice the size of the old vocabulary, and its distribution is again binomial. (Note *: this is true under some conditions; for ease of understanding, let's assume that the old vocabulary had no words starting with 'a'.)

The rule to build the pyramid can also be generalised, in a way that is not really practically useful, but makes the binomial property a bit more 'interesting'. Rather than computing row 'N' from row 'N-1' by adding the numbers above with coefficients "1 1", one can also compute it from row 'N-2' by adding the numbers from that row with coefficients "1 2 1", or from row 'N-3' with coefficients "1 3 3 1". These sequences of coefficients are themselves binomial distributions.

This means, for our vocabulary example, that if one has two separate 'short' vocabularies, each with a binomial distribution, one can make a new vocabulary in which each word consists of a prefix from one and a base from the other. This new vocabulary is then also binomial.
Since the 'base' in this example could have been the result of combining a binomial 'stem' with a binomial 'suffix', we have now found one way to create a vocabulary with a binomial distribution, namely by (arbitrarily) combining a prefix, a stem and a suffix which are each also binomial.

If we wanted to do this and end up with a distribution where the shortest length is 1, then we could, for example, impose that the stem has to have at least one character, while the prefix and suffix may also be empty (i.e. start at length 0). This can also be taken to the extreme, by having N components, in a fixed order, each of which may appear with 50% probability.

RE: Vord length distribution - Anton - 29-06-2020
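ReneZ's prefix + stem + suffix argument amounts to the fact that the sum of independent Binomial(n_i, 1/2) lengths is Binomial(sum of n_i, 1/2). A quick numerical check, convolving the length distributions of the three components (the 3 + 4 + 3 split is an arbitrary choice of mine):

```python
from math import comb

def binomial_pmf(n, p=0.5):
    """P(length = k) for k = 0..n under Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(a, b):
    """Length distribution of the concatenation of two independent parts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

prefix, stem, suffix = binomial_pmf(3), binomial_pmf(4), binomial_pmf(3)
combined = convolve(convolve(prefix, stem), suffix)
expected = binomial_pmf(10)
print(max(abs(a - b) for a, b in zip(combined, expected)))  # essentially 0
```

Note this identity needs the components to share the same success probability; summing binomials with different p values gives something close to, but not exactly, binomial.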
(29-06-2020, 06:40 AM)ReneZ Wrote: In this thread, the terms 'normal' and 'binomial' are arbitrarily mixed,

No, they are definitely not mixed, much less "arbitrarily". From the very first post it is stated that the binomial distribution becomes close to normal as N increases, which is the case with large vocabularies.

The fact is that the VMS word type frequency distribution is very close to normal. The question is then what the root of that is. Is it "natural", which would be a point of interest for those who develop a natural language theory, or is it the result of some "manipulation", which may, of course, include the case where the underlying mechanism produces something binomially distributed?

Yes, I've thought of Pascal's triangle (which is no more Pascal's than the Voynich manuscript is Voynich's), but what you describe is a very interesting idea... Let's say we decompose vords into shorter sustained subvords and look at their frequency distribution...

Yet all techniques, beginning with Stolfi's binary-decimal encoding and ending with the suggestions in this thread, imply a purely nomenclator solution. One generates a vocabulary of "invented" words, maps it to the vocabulary of his source plain text (thus creating the dictionary), and then for each word he needs to encode he refers to this dictionary. Mind that he needs to include all word forms (genders, tenses, etc.). This is possible, but extremely cumbersome.