As with almost any other topic, good references on word-length statistics can be found [link] (4.6 Word length distribution).
In this case, the fundamental work is [link].
Stolfi's research is mentioned in Reddy and Knight's well-known 2011 paper [link]: they add some interesting observations.
Recently [link] mentioned the corpus of texts in different languages collected by [link].
I cleaned the texts by removing (most) punctuation and computed two simple statistical measures for word lengths:
- average value
- standard deviation (a measure of variability)
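For readers who want to reproduce the two measures, here is a minimal sketch of the computation. It is not the script used for the plots: the function name `word_length_stats` is made up for illustration, the cleaning step assumes that removing ASCII punctuation is enough, and it assumes the population form of the standard deviation (dividing by N), which the post does not specify.

```python
import math
import string

def word_length_stats(text):
    """Return (mean, standard deviation) of word lengths, counted over tokens."""
    # Assumption: stripping ASCII punctuation is an adequate cleaning step.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    lengths = [len(word) for word in cleaned.split()]
    if not lengths:
        raise ValueError("no words found in text")
    mean = sum(lengths) / len(lengths)
    # Assumption: population variance (divide by N, not N - 1).
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return mean, math.sqrt(variance)

mean, sd = word_length_stats("The quick, brown fox jumps.")
print(f"mean={mean:.2f} sd={sd:.2f}")
```

For samples of thousands of words, the choice between dividing by N and by N - 1 makes practically no difference to where a text lands on the plot.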
The VMS samples (blue) are:
- the Currier-D'Imperio transliteration (about half the ms is included)
- Currier A and B using the EVA transliteration by Zandbergen and Landini (ZL_ivtff_1c.txt), with and without uncertain spaces.
This is the resulting plot:
[attachment=4416]
Of course, the two measures are positively correlated: if a text has longer words it also has more room for variability.
The graph is dominated by a few extreme outliers, for instance:
KAL (Greenlandic) - From Cham's corpus description:
"Family: Eskimo-Aleut Notes: Polysynthetic language; words can be very long."
THA (Thai) does not use spaces between words, so the counts here correspond more to syllables in a sentence than to sounds in a word.
This plot restricts the average length to a range closer to the centre of the diagram:
[attachment=4415]
As one can see, considering uncertain spaces in the VMS has a much smaller effect than the transliteration system used. The Currier-D'Imperio system joins several EVA sequences into single symbols, so that daiin is encoded as 8AM and chol as SOE.
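To make the effect of the transliteration system concrete, here is a toy sketch that uses only the two word pairs quoted above. The dictionary `EVA_TO_CD` is not a real EVA-to-Currier-D'Imperio conversion table (a real one would work on glyph sequences), and the three-word sample is invented for illustration.

```python
# Only the two pairs quoted in the post; NOT a complete conversion table.
EVA_TO_CD = {"daiin": "8AM", "chol": "SOE"}

def mean_length(words):
    """Average number of characters per word token."""
    return sum(len(w) for w in words) / len(words)

# Hypothetical toy sample, not taken from the manuscript.
eva_words = ["daiin", "chol", "daiin"]
cd_words = [EVA_TO_CD[w] for w in eva_words]

eva_mean = mean_length(eva_words)  # (5 + 4 + 5) / 3
cd_mean = mean_length(cd_words)    # (3 + 3 + 3) / 3
print(f"EVA mean={eva_mean:.2f}, CD mean={cd_mean:.2f}")
```

The same words come out shorter in the more compact encoding, which is why the CD and EVA samples sit at different positions along the average-length axis.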
Several languages have the same average word-length as the VMS. For instance Italian (ITA) and Middle English (ENM) are close to VMS-CD. Ancient Greek (GRC) and Mongolian (MON) are close to the EVA samples.
As JKP showed for Latin, the difference is that the other texts with comparable average word length all have considerably greater variability (Standard Deviation): they appear to be "higher" on the plot.
The Arabic abjad (ARA) is one of the text samples that comes closest to the VMS (this is also discussed by Reddy and Knight).
Stolfi observed that some languages are really close to Voynichese in word-length behaviour: pinyin Chinese (LZH), Vietnamese (VIE) and Tibetan (not in Cham's corpus). They appear at the bottom left of the graph. Stolfi could match Voynichese to these languages by using an encoding system (which he calls OKO) even more "compressive" than CD: this has the effect of further reducing the average length of words, so that the results are perfectly comparable with Chinese and similar languages.
One of Stolfi's plots (frequencies of different word-lengths):
[EDIT: this plot appears to be based on word types (i.e. ignoring word frequencies) while the scatter-plot above was computed on word tokens]
[attachment=4417]
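The type/token distinction mentioned in the EDIT can change a word-length distribution quite a lot, because frequent words count many times over tokens but only once over types. Below is a minimal sketch of the two ways of counting; the function name `length_distribution` and the five-word sample are made up for illustration and are not Stolfi's data or code.

```python
from collections import Counter

def length_distribution(words, by_types=False):
    """Relative frequency of each word length, over tokens or over types."""
    # Types: each distinct word counts once. Tokens: every occurrence counts.
    items = set(words) if by_types else words
    counts = Counter(len(w) for w in items)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}

# Hypothetical toy sample in which one word is repeated.
sample = ["daiin", "daiin", "daiin", "chol", "ol"]
token_dist = length_distribution(sample)
type_dist = length_distribution(sample, by_types=True)
print("tokens:", token_dist)
print("types: ", type_dist)
```

In this toy sample, length 5 accounts for three fifths of the tokens but only one third of the types, so a plot of one kind should not be compared directly with a plot of the other.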