The Voynich Ninja
Vord length distribution


RE: Vord length distribution - ReneZ - 29-06-2020

(29-06-2020, 01:27 PM)Anton Wrote: The fact is that the VMS word type frequency distribution is very close to normal.

This was exactly my point. Stolfi observed that it is close to binomial.


RE: Vord length distribution - Anton - 29-06-2020

Yes, but he could just as well have observed that it is close to normal, which he did not. I suspect he was somehow stuck with binomial from the outset due to his (9,k-1) coin-toss idea.
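
To illustrate how close the two descriptions are, here is a minimal sketch comparing a binomial distribution with the normal curve of the same mean and variance. It assumes that "(9,k-1)" means a word of length k gets weight C(9, k-1), and that the coin is fair (p = 0.5); both are readings of the post, not something stated in the thread.

```python
from math import comb, exp, pi, sqrt

N, P = 9, 0.5  # assumption: nine fair coin tosses


def binomial_pmf(k):
    """Probability of k successes in N tosses."""
    return comb(N, k) * P**k * (1 - P) ** (N - k)


def normal_pdf(x):
    """Normal density with the same mean and variance as the binomial."""
    mean = N * P
    var = N * P * (1 - P)
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


# Largest pointwise difference between the two curves.
max_gap = max(abs(binomial_pmf(k) - normal_pdf(k)) for k in range(N + 1))

# Word length is k + 1 under the assumed (9, k-1) reading.
for k in range(N + 1):
    print(k + 1, round(binomial_pmf(k), 4), round(normal_pdf(k), 4))
print("max gap:", round(max_gap, 4))
```

With these assumptions the two curves differ by well under one percentage point at every length, so data close to one will also look close to the other.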


RE: Vord length distribution - MarcoP - 29-06-2020

(28-06-2020, 05:20 PM)Anton Wrote:
(28-06-2020, 05:16 PM)Koen G Wrote: What about syllabification? I think syllables in some languages may have a distribution similar to Voynichese.

That's what Stolfi seems to build his argumentation upon, but as I said I could not fully understand his idea. When it comes to linguistics, especially beyond Russian and English, I do not feel very comfortable.

Hi Anton,
I don't know much about linguistics either, but here is how I understand what Stolfi wrote. Of course, I might have misunderstood something.
I managed to come rather close to Stolfi's results with the Vietnamese VIQR writing system. The Viet text he used is online [link].
I converted it to UTF8 and removed punctuation.
This graph shows the distribution together with Voynichese in the Currier-D'Imperio transliteration. As one can see, the two curves are quite similar, with Viet words just being a little shorter. 
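
A curve like this can be computed roughly as follows. This is only a sketch: it assumes the text is already cleaned and whitespace-separated, and that the curve is taken over distinct word types rather than running tokens. The function name and the tiny sample are made up for illustration.

```python
from collections import Counter


def type_length_distribution(text):
    """Fraction of distinct word types at each character length."""
    types = set(text.split())
    counts = Counter(len(word) for word in types)
    total = len(types)
    return {length: counts[length] / total for length in sorted(counts)}


# Toy input built from a few VIQR words; "tu'ng" is repeated on purpose
# to show that repeated tokens count as one type.
sample = "tu'ng do^'i dde^` ky` lo`i tu'ng"
dist = type_length_distribution(sample)
print(dist)
```

Modifier characters such as ' ^ ` are counted as characters here, which is why VIQR words come out longer than their spoken form would suggest.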

   

Words in the Viet file look like this (random sample):
tu'ng
do^'i
dde^`
ky`
lo`i
cheßo

The Latin characters correspond to vowels and consonants (sometimes two Latin characters form a digraph corresponding to a single sound). All the other characters (Greek letters included) are modifiers that apply to the preceding Latin character. The Greek letters were introduced by Stolfi and replace other VIQR symbols that normally represent punctuation.
In Vietnamese, each word is a syllable. A syllable must contain a group of vowels that can be preceded and followed by groups of consonants, forming the well-known CVC structure. This happens in most (all?) languages and can also be observed in Vietnamese. The first word in the list above (tu'ng) includes a Consonant (t), a Vowel (u') and two more Consonants (ng).
Each of the three components CVC has a variable length, with only the Vowel component constrained to be non-empty. These three components vary independently. In Viet, you can have words like "i" (a single V) or "thu+o+`ng" (2C, 2V with modifiers, 2C). Since the structure of a syllable is basically limited by human physiology, it cannot be arbitrarily long: the maximum length (which depends on both the language and the writing system) could correspond to N in your success/failure example.



Rene's last post and the idea that combining binomial distributions can result in a new binomial distribution suggested to me that dictionaries with a larger but still limited number of syllables per word might also be worth considering. I experimented with Shakespeare's English. I considered a,e,i,o,u,y as Vowels and all other characters as Consonants.
I created three files for 1 syllable (CVC), up to 2 syllables (CVCVC) and up to 3 syllables (CVCVCVC).
These are the longest words in each file:
1 strengths
2 heartstrings
3 transgressions
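
The filtering can be sketched like this. It assumes a "syllable" is approximated by a maximal run of the written vowels a,e,i,o,u,y, so it works on spelling and not on pronunciation; the function names are mine.

```python
import re
from collections import Counter

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def vowel_groups(word):
    """Number of maximal runs of written vowels (a rough syllable count)."""
    return len(VOWEL_GROUP.findall(word.lower()))


def length_distribution(words, max_groups):
    """Length histogram of the word types with 1..max_groups vowel groups."""
    kept = {w.lower() for w in words if 1 <= vowel_groups(w) <= max_groups}
    return Counter(len(w) for w in kept)


words = ["strengths", "heartstrings", "transgressions", "impossibilities"]
for w in words:
    print(w, vowel_groups(w))
print(length_distribution(words, max_groups=3))
```

On the words quoted in this post it gives 1, 2, 3 and 6 groups respectively, so "impossibilities" is dropped from the 3-syllable file. A silent final "e" is counted as a separate group, which is one reason this is only an approximation.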

As can be seen from the following graph, the three files show symmetrical distributions.

   

The curve for 1 syllable English words (orange) is very similar to that for Viet (but with shorter words, since English does not include vowel modifiers).
I did not expect the curve for 3 syllables (green) to be such a good match for EVA-encoded Voynichese (ZL transliteration, no uncertain spaces). 

The 3-syllable plot is not too different from the curve for the whole Shakespeare text, but the difference seems to be exactly what is needed to make the curve symmetrical. The difference is of course due to the fact that the original text contains words up to six syllables long (e.g. 'impossibilities', 'unnecessarily').
   
So, unless I made some errors in the process, it seems that limiting the number of syllables can turn the distribution of a dictionary into a symmetrical shape. We have already seen that scribal abbreviations do not seem to work (the Bonaventura text discussed above) and, as Jonas wrote, a simple threshold on word length also wouldn't work:
Alin_J Wrote:If you constrain the length you will only cut off the tail

One of the reasons why I find this result interesting is that it fits with [link]. According to her analysis, Voynichese words are constrained to include at most three syllables. I was not aware that such a constraint could result in a binomial distribution of dictionary word lengths.
Of course Stolfi's "Chinese" (or more generally "monosyllabic") hypothesis has the advantage of pointing out actual linguistic texts with symmetrical distributions. I don't know if natural languages which limit words to two or three syllables exist. But Emma's approach has the advantage of being based on the actual structure of Voynichese words.


RE: Vord length distribution - ReneZ - 29-06-2020

If words are composed of arbitrary concatenations of several parts, then the distribution of each part does not have to be exactly binomial. As long as it is largely symmetric with a maximum in the middle, the process of combining them will make the result look close to binomial.

Note that (for example) a set of 18 prefixes, 36 stems and 18 suffixes generates 11,664 different words, slightly over the number of word types in the Voynich MS.

If their lengths are 0-3, 1-5 and 0-3 respectively, with individual distributions (1,3,3,1), (1,4,6,4,1) and (1,3,3,1), the resulting word lengths would range from 1 to 11 and their distribution would be binomial.
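
This can be checked by convolving the three length distributions. The sketch below treats the tuples as relative weights per length and assumes the three parts are chosen independently; the variable names are mine.

```python
from math import comb


def convolve(a, b):
    """Distribution of the sum of two independent lengths."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


prefix = [1, 3, 3, 1]     # weights for prefix lengths 0..3
stem = [1, 4, 6, 4, 1]    # weights for stem lengths 1..5
suffix = [1, 3, 3, 1]     # weights for suffix lengths 0..3

# Index 0 of the result corresponds to total length 1 (0 + 1 + 0).
combined = convolve(convolve(prefix, stem), suffix)
binomial = [comb(10, k) for k in range(11)]

print(combined)
print(combined == binomial)
print(18 * 36 * 18)  # number of prefix/stem/suffix combinations
```

The result has 11 entries (lengths 1 to 11) and matches the binomial coefficients C(10, k) exactly, because 3 + 4 + 3 coin tosses are the same as 10 coin tosses.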

Unfortunately, the Voynich MS text does not really follow this model, but it shows how a simple model can explain a couple of things.


RE: Vord length distribution - -JKP- - 29-06-2020

MarcoP Wrote: One of the reasons why I find this result interesting is that it fits with [link]. According to her analysis, Voynichese words are constrained to include at most three syllables. I was not aware that such a constraint could result in a binomial distribution of dictionary word lengths.


I think it could be argued that otardaly, aralarar, ocfhorokear, otodaram, and oparairdly are four-syllable tokens ([link]).

One might even argue that oparairdly is five syllables (it depends on how the last three glyphs are interpreted).

Also, on the rosettes folio: otochedy, okchdarar, otodeedy, opoeesal, okalolaiin, oparodam, saralkchedy [or s aralkchedy], opodchdal, sarchcphdy, oteoteedy, and possibly qolchedarar (half-space).

..
There are, as always, some that are questionable. For example, in some languages (like English), sairodam or otaraiin might be interpreted as three syllables. In other languages (e.g., Indic languages), the "a" and "i" would be pronounced separately, thus creating four syllables.

The same consideration arises with tokens like qopcheedy. Is this qop-che-ed-y (4 syllables) or qop-chee-dy (3 syllables)? or ?
Is cheocphey interpreted as che-o-cphe-y? If so, it is 4 syllables.
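
The ambiguity can be made concrete with a small sketch. It assumes, purely for illustration, that the EVA characters a, e, i, o, y stand for vowels; nothing is known about their actual values. One count merges adjacent vowels into a single syllable nucleus, the other treats every vowel as its own syllable.

```python
import re

ASSUMED_VOWELS = "aeioy"  # illustrative assumption only, not an established reading


def count_vowel_groups(token):
    """Adjacent vowels merged into one nucleus (e.g. qop-chee-dy)."""
    return len(re.findall(f"[{ASSUMED_VOWELS}]+", token))


def count_single_vowels(token):
    """Every vowel pronounced separately (e.g. qop-che-ed-y)."""
    return sum(ch in ASSUMED_VOWELS for ch in token)


for token in ("qopcheedy", "cheocphey"):
    print(token, count_vowel_groups(token), count_single_vowels(token))
```

Under these assumptions qopcheedy comes out as 3 or 4 syllables and cheocphey as 2 or 4, so the count depends entirely on the convention chosen.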

   

..
Folio [link] and the other dense-text folios have wide spaces and half-spaces. If the narrow spaces are ignored, you get 4- and 5-syllable tokens like pchedalshdy, yteechypchy, qotalshesy, qokeolkeedy, qokeykeedy, qokeeylkeol, tcheyqokeey, qokalshedy, pcholkchdy, oteeykshy, rotailshedy, qokeodair, oieshedy, chealainor, otaryly, oteeochey, qopchedy, ocheocthey, qoctheody, okeeolkeeodain, fchedykchedy, shoefcheeykechy (some of which do not have half-spaces; they are definitely one token). This is only a sampling. There are many more.

Another complication is that dalalchdar (debatably 3 or 4 syllables) is written with half-spaces, so is it one token or three?


It is true that the majority of tokens fit within a three-syllable range. However, if aiin is interpreted as a + iin or ai + in (as it would be in a number of languages), this would increase the syllable count of a significant number of tokens, and some of them would be 5 syllables.


RE: Vord length distribution - Anton - 29-06-2020

Hi Marco,

In the context of this discussion, when we speak of "syllables" in a language, do we mean phonetic syllables or their written representation? I have an impression that these two are being mixed up, which is probably methodologically incorrect.

How can one syllable contain two (or more) vowels? In your example of "thuong" (I omit the modifiers), would not that be two syllables (thu + ong) instead of one?

How can Vietnamese contain only single-syllable words? Does that mean all their words are very short? Then how do they manage to express complex notions like "hydraulics" or "referendum"?

In short, when it comes to syllables, I'm at a loss.


RE: Vord length distribution - -JKP- - 29-06-2020

Even though I wrote a post about tokens that have a larger number of syllables than three (or even four), the whole notion of syllables in the linguistic sense concerns me when it is applied to the VMS. We don't know the value of a single glyph, or even if they represent letters.

It could be argued that something like alol is two syllables, but what if it represents a word like stet? Then it is one syllable. If it is aioi, then it could be three or four, depending on the language. If it represents 1 12 19 12, then the notion of syllables doesn't even apply.


RE: Vord length distribution - Alin_J - 29-06-2020

(29-06-2020, 04:11 PM)-JKP- Wrote: Folio [link] and the other dense-text folios have wide spaces and half-spaces. If the narrow spaces are ignored, you get 4- and 5-syllable tokens like pchedalshdy, yteechypchy, qotalshesy, qokeolkeedy, qokeykeedy, qokeeylkeol, tcheyqokeey, qokalshedy, pcholkchdy, oteeykshy, rotailshedy, qokeodair, oieshedy, chealainor, otaryly, oteeochey, qopchedy, ocheocthey, qoctheody, okeeolkeeodain, fchedykchedy, shoefcheeykechy (some of which do not have half-spaces; they are definitely one token). This is only a sampling. There are many more.

Another complication is that dalalchdar (debatably 3 or 4 syllables) is written with half-spaces, so is it one token or three?

Regarding whether to treat uncertain spaces as word separators: for the 101 transcription, treating them as real separators gives 9905 different word types, whereas not treating them as real spaces (concatenating the "word parts") gives 10641 different word types, significantly more. I have therefore subsequently treated uncertain spaces as real separators. My argument is that if these spaces were random mistakes or only apparent spaces, treating them as separators would likely have led to more different word types, not fewer, because they would chop many legitimate word types into new ones. Instead, randomly concatenating real word types should lead to more word types, which is also what is observed for the text.
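
The comparison can be sketched as below. The markers are assumptions for illustration: "." for a certain space and "," for an uncertain one. The sample line is made up and is not taken from any transcription.

```python
def count_word_types(text, uncertain_as_separator, certain=".", uncertain=","):
    """Count distinct word types under one treatment of uncertain spaces."""
    if uncertain_as_separator:
        text = text.replace(uncertain, certain)  # uncertain space splits words
    else:
        text = text.replace(uncertain, "")       # uncertain space is ignored
    return len({word for word in text.split(certain) if word})


# Hypothetical toy line, for illustration only.
sample = "daiin.ol,chedy.ol.chedy.qokeedy,dal.dal"

split_types = count_word_types(sample, uncertain_as_separator=True)
joined_types = count_word_types(sample, uncertain_as_separator=False)
print(split_types, joined_types)
```

In this toy case splitting gives 5 types and joining gives 6, because the joined forms (olchedy, qokeedydal) are new types while their parts already occur elsewhere. That is the same direction as the 9905 versus 10641 counts above.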


RE: Vord length distribution - MarcoP - 29-06-2020

Hi Anton, thank you for your comments!
I am not an expert, so my answers may not be very reliable. Anyway, here they are.

(29-06-2020, 06:24 PM)Anton Wrote: In the context of this discussion, when we speak of "syllables" in a language, do we mean phonetic syllables or their written representation? I have an impression that these two are being mixed up, which is probably methodologically incorrect.

The graphs and comments in my previous post were based on the written representation of syllables. As you say, this is most likely incorrect. Sadly, I am not knowledgeable enough to do any better. I am sorry. 

(29-06-2020, 06:24 PM)Anton Wrote: How can one syllable contain two (or more) vowels? In your example of "thuong" (I omit the modifiers), would not that be two syllables (thu + ong) instead of one?

As far as I know, in English and Italian consecutive vowels are not split into different syllables. See also what Wikipedia says about [link].
For instance, Shakespeare wrote 10-syllable verses and "doubt" (which includes two vocalic sounds) counts as a single syllable:

I doubt whether their legs be worth the sums
1 1     2       1     1    1  1     1   1


I guess it's the same in Vietnamese.

(29-06-2020, 06:24 PM)Anton Wrote: How can Vietnamese contain only single-syllable words? Means all their words are very short?

As can be seen from Stolfi's distribution curves, there are no long words in Vietnamese. The longest words I find in Stolfi's VIQR file are 9 characters long (e.g. "tru+o+`ng").


(29-06-2020, 06:24 PM)Anton Wrote: Then how do they manage to express complex notions like "hydraulics" or "referendum"?

According to this online dictionary, complex notions are expressed by multiple words:

hydraulics - động thủy học
referendum - trưng cầu ý dân

(29-06-2020, 06:24 PM)Anton Wrote: In short, when it comes to syllables, I'm at a loss.

Yes, me too! I will leave further research to linguists!


RE: Vord length distribution - Anton - 29-06-2020

I referred to the Russian Wikipedia; it says a diphthong is a sound consisting of two components, of which one is syllabic and the other is not. On the other hand, a diphthong is a group of vowels pronounced as a single syllable. A syllable, in turn, may consist of several sounds.

... There are a couple of Russian sayings that are highly appropriate here:

"the Devil himself will break his leg here" and "you won't puzzle this out without 0.5L (of vodka)"