Vord length distribution - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Vord length distribution (/thread-3261.html)

RE: Vord length distribution - -JKP- - 28-06-2020
(28-06-2020, 10:42 AM)MarcoP Wrote: In written languages, both combinations often exist (at and ta) and can generally appear in different parts of a word. For example: attack, batter, begat (beginning, middle, end). So, yes, position is important to comprehending the meaning, but the pattern itself occurs in many parts of the word.

This is not how the VMS works. Certain glyphs can precede others but almost never follow them. Also, certain glyphs have a very high propensity for certain positions in the tokens.

Some of these characteristics are also found in Roman numerals, which were the primary numeric system up until the very end of the 14th century, when Hindu-Arabic numerals began to become more popular. With Roman numerals (as one example) there is a more rigid format: you might see MMDVIII, but you are not going to see MVMII or MVIMI. Thus, positionality is more constrained (and less flexible) than in words.

Even supposing there are numeric symbols in the VMS, it doesn't have to be all numbers; letter-glyphs can be (and were) mixed with number glyphs. But if there are numbers, and it's a system that was known at the time (rather than an invented one), the numeric portions are less likely to vary as much as a purely word-based system.

The VMS is very heavy with these kinds of patterns, together with the an/ain/aiin/aiiin patterns:

[attachment=4479]

[attachment=4481]

RE: Vord length distribution - -JKP- - 28-06-2020
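The positional rigidity described above is easy to test mechanically on a transliteration. A minimal sketch in Python; the tokens below are hypothetical EVA-style examples for illustration only, not a real transliteration file:

```python
def ordered_pair_counts(tokens, a, b):
    """Count how often the glyph sequences a+b and b+a occur across tokens."""
    ab = sum(t.count(a + b) for t in tokens)
    ba = sum(t.count(b + a) for t in tokens)
    return ab, ba

# Hypothetical tokens for illustration only; real counts would come from
# a full EVA transliteration of the manuscript.
tokens = ["chedy", "shedy", "qokeedy", "chol", "okechy", "cheor"]

che, ech = ordered_pair_counts(tokens, "ch", "e")
print(f"che: {che}, ech: {ech}")  # che: 2, ech: 1
```

Run over a real transliteration, a strong asymmetry between the two counts would support the claim that certain glyphs almost never follow others.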
It's even possible that other "c shapes" such as s and ch/sh might be part of a sequence:

[attachment=4482]

The pattern che is common, but how often do you see ech?

RE: Vord length distribution - Anton - 28-06-2020
To make it clear, I was not arguing that Voynichese is based on Roman numerals. I just wanted to quickly illustrate that various approaches to encoding words in the vocabulary (mind that all this discussion is within the framework of a "pure nomenclator" solution) can lead to both symmetric and asymmetric length distributions, and that the symmetric ones can be close to normal.

Now, there are quantitative methods to assess how close a given curve is to the normal curve, such as calculating skewness and excess (kurtosis), but even without delving into that, it is apparent from the figures that the VMS fits the normal curve better than the Roman numeral system does.

Moreover, the larger the vocabulary size, the farther the "Roman" distribution is from normal, although it remains perfectly symmetric. Here's the pic for a vocabulary size of 20000. In this graph, frequency is shown in absolute, not relative, figures (to get relative figures, divide by 20000).

[attachment=4483]

It's curious, but it appears that there is an upper limit for the frequency of a particular word length, and it is 1000 (in my previous post, where the size was 6818, the highest frequency was 934, i.e. not reaching the limit yet). This means that however large the corpus of Roman numbers is, there will be no more than 1000 of the same length. Which is counter-intuitive and really curious.

About numerical form representation in general: yes, inspired by the fact that the VMS distribution is approximated by a binomial of (N, K-1), Stolfi introduced this idea. However, it has not been shown that non-numerical systems cannot yield the same result.

About the second Stolfi article: yes, I found it yesterday via Rene's website, after having written my post. I must say I fail to follow his linguistic interpretation, but it all looks superfluous to me, because he is too stuck with the binomial.
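The skewness/excess check mentioned above is straightforward to compute from a length-frequency table. A sketch in plain Python (no stats library); both values would be zero for a perfectly normal-shaped curve:

```python
def skewness_and_excess(lengths, freqs):
    """Skewness and excess kurtosis of a word-length frequency table.
    Both are 0 for a perfect normal curve; skewness measures asymmetry,
    excess kurtosis measures tail weight."""
    n = sum(freqs)
    mean = sum(l * f for l, f in zip(lengths, freqs)) / n
    m2 = sum(f * (l - mean) ** 2 for l, f in zip(lengths, freqs)) / n
    m3 = sum(f * (l - mean) ** 3 for l, f in zip(lengths, freqs)) / n
    m4 = sum(f * (l - mean) ** 4 for l, f in zip(lengths, freqs)) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Toy example: binomial(4, 1/2) frequencies are perfectly symmetric,
# so skewness is 0 (excess kurtosis is -0.5, slightly flatter than normal).
skew, excess = skewness_and_excess([1, 2, 3, 4, 5], [1, 4, 6, 4, 1])
print(skew, excess)  # 0.0 -0.5
```

Feeding in the VMS and "Roman" length-frequency tables would quantify the visual comparison made above.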
It's all the stranger given that in this article (as opposed to the first article) he explicitly states that the binomial will approach the normal for large N, so I don't know why one should invent explanations for binomiality when the more obvious case would be just normality.

So the question would be: what is there in a writing system that makes its vocabulary word-length distribution normal?

As Stolfi shows in his second article, there are some writing systems for natural languages that yield curves close to normal, such as East Asian romanizations. I'm not familiar with anything East Asian at all, but I have the intuitive feeling that a language with more inflections will have a distribution closer to normal. This can be checked against the batch of Human Rights Declaration texts in various languages. (It is important to remember that each word form in this approach must be counted as a separate distinct word.)

RE: Vord length distribution - ReneZ - 28-06-2020
Anton, what is the graph in your previous post showing? How do you get to word lengths of up to 30?

Could it be that your method of converting to Roman numerals keeps adding more M's for numbers greater than a thousand? (For example, 6001 = MMMMMMI.) In that case, both odd features would be explained.

RE: Vord length distribution - Alin_J - 28-06-2020
(28-06-2020, 01:55 PM)Anton Wrote: It's curious, but it appears that there is an upper limit for the frequency of a particular word length, and it is 1000 (in my previous post, where the size was 6818, the highest frequency was 934, i.e. not reaching the limit yet). This means that however large the corpus of Roman numbers is, there will be no more than 1000 of the same length. Which is counter-intuitive and really curious.

It just means that the limit is increasing exponentially (the number of possible combinations of words one character longer will be a multiple of the number of combinations of the shorter word length), and you are far from reaching the next step with that vocabulary size; but technically there is no limit.

RE: Vord length distribution - Anton - 28-06-2020
(28-06-2020, 02:11 PM)ReneZ Wrote: Could it be that your method of converting to Roman numerals keeps adding more M's for numbers greater than a thousand? (For example, 6001 = MMMMMMI.)

Yes, exactly. Twenty thousand something would contain twenty M's. But I don't see the explanation. Intuitively, this should flatten the right slope of the curve, tending towards uniform. Instead, it is the centre of the curve that becomes uniform.

The graph shows the distribution of word-length frequency in a vocabulary of size 20000, where each word is a distinct, non-repeating (that is, all words are different) Roman numeral from 1 to 20000.

RE: Vord length distribution - Anton - 28-06-2020
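The conversion scheme in question (one M per thousand, standard subtractive notation for the sub-1000 remainder) can be sketched as follows. This is an assumed reconstruction of the described method, not the actual script used:

```python
# Value/symbol pairs for standard subtractive Roman notation below 1000.
PAIRS = [(900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"),
         (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Additive-M Roman form: repeated M's for the thousands,
    standard subtractive notation for the remainder."""
    out = "M" * (n // 1000)
    n %= 1000
    for value, sym in PAIRS:
        while n >= value:
            out += sym
            n -= value
    return out

print(to_roman(6001))   # MMMMMMI
print(to_roman(1994))   # MCMXCIV
```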
(28-06-2020, 02:15 PM)Alin_J Wrote: It just means that the limit is increasing exponentially (the number of possible combinations of words one character longer will be a multiple of the number of combinations of the shorter word length), and you are far from reaching the next step with that vocabulary size; but technically there is no limit.

I'm not sure about that: I tried with a size of 200000 (ten times larger), and the limit is still 1000.

RE: Vord length distribution - Anton - 28-06-2020
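The 1000 ceiling actually follows from the additive-M scheme described in this thread: the length of the form for n is (n // 1000) plus the length of the sub-1000 remainder, so each of the 1000 possible remainder values contributes at most one number (one per thousand-block) to any given length bin. A quick check in Python, using a small lookup table for the symbol count of each decimal digit:

```python
from collections import Counter

# Symbols needed for decimal digit d in any position of a standard Roman
# numeral (e.g. 4 -> IV is 2 symbols, 8 -> VIII is 4 symbols); the
# pattern is the same for units, tens and hundreds.
DIGIT_LEN = [0, 1, 2, 3, 2, 1, 2, 3, 4, 2]

def roman_len(n):
    """Length of n's additive-M Roman form: one M per thousand,
    plus the subtractive form of the remainder below 1000."""
    k, r = divmod(n, 1000)
    return k + sum(DIGIT_LEN[int(d)] for d in str(r))

counts = Counter(roman_len(n) for n in range(1, 200001))
print(max(counts.values()))  # 1000
```

No length bin ever exceeds 1000 entries, however large the corpus, which matches the observation above.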
Here's a quick check of the word-type length distribution for the UDHR in Russian and Latin. I conveniently had source files left over from my former entropy experiments and used those (all headings and punctuation removed, text converted to lowercase). The results are not encouraging: besides being far from normal, both distributions are multimodal. The vocabulary size is small, though (700 for Russian, 672 for Latin). Larger texts should be taken...

[attachment=4485]

EDIT: In Stolfi, the word-type distributions are not multimodal for any language.

RE: Vord length distribution - Anton - 28-06-2020
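For repeating this experiment on other texts, the type-length counting itself is only a few lines. A sketch; the tokenization here (lowercasing and splitting on non-word characters) is an assumption and cruder than whatever preprocessing was actually used:

```python
import re
from collections import Counter

def type_length_distribution(text):
    """Word-length distribution over word TYPES: each distinct word
    form counts once, as in Stolfi's vocabulary curves."""
    types = set(re.findall(r"\w+", text.lower()))
    return Counter(len(w) for w in types)

sample = "All human beings are born free and equal in dignity and rights."
dist = type_length_distribution(sample)
print(sorted(dist.items()))  # [(2, 1), (3, 3), (4, 2), (5, 2), (6, 2), (7, 1)]
```

Python's \w matches Cyrillic letters as well, so the same function works unchanged on the Russian text.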
Here's chapter 2 of "Crime & Punishment" by Dostoevsky (in Russian). The vocabulary size is 2252.

[attachment=4487]

RE: Vord length distribution - Anton - 28-06-2020
What strikes the eye in the Dostoevsky curve is that the left slope is neat while the right is not. I wonder how the picture would change if the text were abbreviated... Maybe it is abbreviation that "normalizes" the distribution.