Quote:It might be unusual for natural language, but such statistics are based on the assumption that the vord breaks in the VMS are actual word breaks... they may not be.
Exactly! This is certainly an alternative. The question then arises: "if vords are not mappings of words (be they plain text or enciphered), then what are they?" However, two arguments speak against this:
a) there are the labels
b) the corpus, as transcribed along the supposed spaces, satisfies Zipf's law. This fact does not in itself show that the VMS is a meaningful text, as is often asserted. However, the fact that Zipf's law holds for natural-language texts is, if I remember correctly, explained by longer words generally being used less frequently. And Reddy & Knight, referring to Landini, report that the VMS text does indeed follow this law for word lengths.
So this at least does not disprove the assumption that vords are some kind of mappings of real words.
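For anyone who wants to check both points on a transcription of their own, here is a rough Python sketch. It assumes the text has already been split into vords along the transcribed spaces; the function names `zipf_slope` and `mean_lengths` are mine, not from any existing tool. The first fits a straight line to log(frequency) against log(rank), where a slope near -1 is what Zipf's law would predict. The second compares the average vord length over types with the average over tokens, which should come out lower for tokens if frequent vords tend to be shorter.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct vords to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def mean_lengths(tokens):
    """Return (mean length over distinct vords, mean length over all tokens)."""
    counts = Counter(tokens)
    type_mean = sum(len(w) for w in counts) / len(counts)
    token_mean = sum(len(w) * c for w, c in counts.items()) / sum(counts.values())
    return type_mean, token_mean
```

Neither number says anything about meaningfulness; they only describe the shape of the distribution.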
Quote:Let's consider a pharmaceutical recipe book. A limited set of words would be repeated very frequently, such as "for", "take", "and", "then", "until", "healthy", and some words relating to quantities, like numbers and common measurements (e.g. one, two, or three drachms), might be frequent too, but the ingredients as well as the ailments, which would represent a large number of words in the text, would tend to be much more unique.
It might yield some really interesting results to do a detailed comparison of vord frequency in specific sections of the Voynich such as the "pharma" and "recipe" sections with some actual medieval pharmaceutical recipe texts and see what happens.
I agree. Actually, that's what I proposed above. But, apart from that, not being a linguist, I am at a loss to confirm or disprove offhand that it is OK for a (long) text of any category to have, e.g., only 55 words occurring 100+ times. It seems to me that, if this is practically possible, then not only would the category of the text be quite special (like a recipe book), but the text itself would also be highly condensed, omitting in writing many words that would have been present in oral speech, much as obvious words are omitted when sending a telegram.
There is a book by Baayen on word frequency distributions; perhaps it answers some of the questions we are discussing. Unfortunately I failed to find it in open access, so it seems I will have to go to the expense of ordering it (and Bennett's book as well).
Quote:Regarding your latest observation: among these 55 most frequent vords that appear more than 100 times, are there some that stand out because they are particularly frequent? Are there vords that show up say, 1000 times in the Voynich?
What is most specific about this distribution is that there is a really huge number of unique vords: 4564 of all 6818 vords (67%) are unique, which is even more than the 50% suggested on Rene's website. Hence, on a linear scale, however narrow you make the leftmost bar of the histogram, that bar dwarfs all the others, and the plot is not very informative about what's going on in the rest of the distribution. That's the reason I didn't even attach it.
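As a quick check on the arithmetic, and as a way to tabulate this without a histogram, here is a small Python sketch. It assumes that "unique" means a vord occurring exactly once; `frequency_spectrum` and `hapax_share` are hypothetical helper names, and the token list would come from whatever transcription is used.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each occurrence count m to the number of distinct vords seen exactly m times."""
    return Counter(Counter(tokens).values())

def hapax_share(tokens):
    """Fraction of distinct vords that occur exactly once (tokens must be non-empty)."""
    spectrum = frequency_spectrum(tokens)
    return spectrum[1] / sum(spectrum.values())

# The figures quoted in the post: 4564 of 6818 distinct vords, i.e. about 67%.
print(f"{4564 / 6818:.1%}")
```

Printing the spectrum as a table (or plotting it on a log scale) avoids the problem of the first bar swamping everything else.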
As to the high-frequency vords (100+ occurrences), they are distributed in the following way:
34 vords occur 101-200 times;
9 vords occur 201-300 times;
7 vords occur 301-400 times;
3 vords (chedy, aiin, shedy) occur 401-500 times;
1 vord (ol) occurs 501-600 times;
no vords occur 601-800 times;
1 vord occurs 801-900 times, and that vord is the famous daiin with the count of 858.
I guess the count will vary with the transcription and the methodology; e.g. Job's VQP gives a count of 864 for daiin. The methodology that I used is explained above.
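The banding above can be reproduced from a table of vord counts with something like the following Python sketch. `band_counts` is a hypothetical helper, the 100-wide bands and the 100+ cutoff simply mirror the list above, and the input counts depend on the transcription.

```python
from collections import Counter

def band_counts(counts, width=100, minimum=101):
    """Count how many vords fall into each band of `width` occurrences.

    `counts` maps vord -> number of occurrences; vords below `minimum` are skipped.
    Bands are inclusive ranges such as (101, 200), (201, 300), ...
    """
    bands = Counter()
    for n in counts.values():
        if n < minimum:
            continue
        low = (n - 1) // width * width + 1
        bands[(low, low + width - 1)] += 1
    return dict(sorted(bands.items()))

# daiin with the count given in the post lands in the 801-900 band.
print(band_counts({"daiin": 858}))  # {(801, 900): 1}
```

With a count of 864, as in Job's VQP, daiin stays in the same band.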