The Voynich Ninja

Full Version: Vord frequency histogram as an indicator of the text category
(31-03-2016, 11:45 PM)Anton Wrote: ...Only 55 Voynichese words occur more than 100 times.

Is that OK for any cohesive text, even if we bear in mind that there might be different word forms?

This looks like a very limited basic vocabulary with a high amount of specific terms?!

It might be unusual for natural language, but such statistics are based on the assumption that the vord breaks in the VMS are actual word breaks... they may not be.
Anton,
Let's consider a pharmaceutical recipe book. A limited set of words would be repeated very frequently, such as "for", "take", "and", "then", "until", "healthy", and some words relating to quantities, like numbers and common measurements (e.g. one, two, or three drachms), might be frequent too, but the ingredients as well as the ailments, which would represent a large number of words in the text, would tend to be much more unique.
It might yield some really interesting results to do a detailed comparison of vord frequency in specific sections of the Voynich such as the "pharma" and "recipe" sections with some actual medieval pharmaceutical recipe texts and see what happens.
Regarding your latest observation: among these 55 most frequent vords that appear more than 100 times, are there some that stand out because they are particularly frequent? Are there vords that show up, say, 1000 times in the Voynich?
Quote:It might be unusual for natural language, but such statistics are based on the assumption that the vord breaks in the VMS are actual word breaks... they may not be.

Exactly! This is an alternative for sure. The question then arises: "if vords are not mappings of words (be they plain text or enciphered), then what are they?" However, two arguments speak against this:

a) there are the labels
b) the corpus, as transcribed along the supposed spaces, satisfies Zipf's law. This fact does not provide evidence that the VMS is a meaningful text, as is often asserted. However, the fact that natural-language texts follow Zipf's law is, if I remember correctly, explained by longer words generally being used less frequently. And Reddy & Knight, referring to Landini, report that the VMS text indeed follows this law for word lengths.

So this at least does not disprove the assumption that vords are some kind of mappings of real words.

Quote:Let's consider a pharmaceutical recipe book. A limited set of words would be repeated very frequently, such as "for", "take", "and", "then", "until", "healthy", and some words relating to quantities, like numbers and common measurements (e.g. one, two, or three drachms), might be frequent too, but the ingredients as well as the ailments, which would represent a large number of words in the text, would tend to be much more unique.

It might yield some really interesting results to do a detailed comparison of vord frequency in specific sections of the Voynich such as the "pharma" and "recipe" sections with some actual medieval pharmaceutical recipe texts and see what happens.

I agree. Actually that's what I proposed above. But, apart from that, not being a linguist, I am at a loss to confirm or disprove offhand that it is OK for a (long) text of any category to have, e.g., only 55 words occurring 100+ times. It seems to me that, if this is practically possible, then not only would the category of the text be quite special (like a recipe book), but the text flow itself would also be highly condensed, omitting in writing many words that would have been present in oral speech, much like obvious words are omitted when sending a telegram.

There is a book by Baayen on word frequency distributions; perhaps it answers some of the questions that we discuss. Unfortunately I failed to locate it in open access, so it seems I will have to spend some money and order it (and Bennett's book as well :) )

Quote:Regarding your latest observation: among these 55 most frequent vords that appear more than 100 times, are there some that stand out because they are particularly frequent? Are there vords that show up, say, 1000 times in the Voynich?

What is most specific about this distribution is the really huge number of vords that occur only once: 4564 of the 6818 distinct vords (67%), which is even more than the 50% suggested on Rene's website. Hence, on a linear scale, however narrow you make the leftmost bar of the histogram, this bar dwarfs all the others, and the plot is not very informative about what's going on in the rest of the distribution. That's the reason I didn't even attach it.

As to the high-frequency vords (more than 100 occurrences), they are distributed in the following way:

34 vords occur 101-200 times;
9 vords occur 201-300 times;
7 vords occur 301-400 times;
3 vords (chedy, aiin, shedy) occur 401-500 times;
1 vord (ol) occurs 501-600 times;
no vords occur 601-800 times;
1 vord occurs 801-900 times, and that vord is the famous daiin with the count of 858.

I guess the count will vary based on the transcription and the methodology, e.g. Job's VQP gives the count of 864 to daiin. The methodology that I used is explained above.
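
For anyone who wants to repeat this binning with another transcription, here is a minimal Python sketch. It assumes the transcription has already been reduced to a flat list of vords; the function name `frequency_bins` and the toy input are purely illustrative and are not taken from VQP or from any transcription file.

```python
from collections import Counter

def frequency_bins(tokens, width=100):
    """Count word types per frequency band (101-200, 201-300, ...)."""
    counts = Counter(tokens)                      # vord -> number of occurrences
    hapax = sum(1 for c in counts.values() if c == 1)
    bins = Counter()
    for c in counts.values():
        if c > width:                             # only vords occurring more than 100 times
            low = (c - 1) // width * width + 1    # e.g. 858 -> 801, 200 -> 101
            bins[(low, low + width - 1)] += 1
    return len(counts), hapax, dict(sorted(bins.items()))

# Toy input (invented, not real transcription data)
tokens = ["daiin"] * 250 + ["ol"] * 150 + ["chedy"] * 101 + ["qokal"]
types, hapax, bins = frequency_bins(tokens)
print(types, hapax, bins)
```

The function returns the number of distinct vords, the number of vords occurring exactly once, and the band counts, so the share of once-only vords can be read off as `hapax / types`.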
While the Zipf law model is far from perfect, it gives at least a first approximation we can work with.
One aspect that seems not to be mentioned a lot is that the inverse frequency law does not really work for the highest frequency words, i.e. the line with slope -1 on a log-log scale tends to flatten a bit at the left.
However, we can leave that out of consideration for the moment.

If the text had 37,919 words (figure from Reddy and Knight), we can predict the frequency table for all word types. The predicted count of each word type is its expected frequency rounded to the nearest integer.
In this case, the list of word types is cut off where the expected frequency of a word drops below 0.5.

We then find the following numbers:
Number of different words (word types): 7738

This is a bit higher than what Anton finds, but I could redo the stats with the number of word tokens in the Takahashi transcription, if he has that number. In any case, the frequency curve is very flat in this area, so the actual cut-off value is quite ill-defined.

Number of words that occur once: 5159
This is 67%, spot on.

101-200 times: 19
201-300 times: 7
301-400 times: 3
More than 400 times: 9
Total over 100: 38

These numbers are actually lower than in the MS.
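
The prediction can be sketched in Python as follows. This is one possible reading of the procedure, not necessarily the exact calculation behind the figures above: it assumes a pure inverse-rank law f(r) = c/r, rounds each expected frequency to the nearest integer, drops the ranks whose expected frequency is below 0.5, and picks c by bisection so that the rounded frequencies add up to the token count. The names `predicted_counts`, `fit_constant` and `summarize` are illustrative.

```python
import math

def predicted_counts(c):
    """Rounded frequencies c/r for every rank r whose expected frequency is >= 0.5."""
    max_rank = int(2 * c)                       # c/r >= 0.5  <=>  r <= 2c
    # floor(x + 0.5) rounds halves up; Python's round() would round halves to even
    return [math.floor(c / r + 0.5) for r in range(1, max_rank + 1)]

def fit_constant(n_tokens, steps=60):
    """Bisect for the smallest c whose rounded frequencies sum to at least n_tokens."""
    lo, hi = 0.5, float(n_tokens)
    for _ in range(steps):
        mid = (lo + hi) / 2
        if sum(predicted_counts(mid)) < n_tokens:
            lo = mid
        else:
            hi = mid
    return hi

def summarize(n_tokens):
    """Frequency table of the fitted model, in the same bands as used above."""
    counts = predicted_counts(fit_constant(n_tokens))
    return {
        "types": len(counts),
        "once": sum(1 for f in counts if f == 1),
        "101-200": sum(1 for f in counts if 101 <= f <= 200),
        "201-300": sum(1 for f in counts if 201 <= f <= 300),
        "301-400": sum(1 for f in counts if 301 <= f <= 400),
        "more": sum(1 for f in counts if f > 400),
        "over 100": sum(1 for f in counts if f > 100),
        "tokens": sum(counts),
    }

print(summarize(37919))
```

Because the tail of the curve is so flat, a small change in c shifts the number of word types by several units, so the output of such a sketch should only be expected to agree approximately with the figures quoted above.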
Hmm, strange to see that indeed.

The number of word tokens in Takahashi's transcription is 38045; however, I excluded "dubious" words, which leaves the 34432 tokens that I worked with.

However, there are only two 100+ "dubious" words - qokai!n and dai!n.
Is anybody willing to share a list of all first lines (or all paragraphs which start with a gallows character) in the text?

folio nr - line nr - Takahashi EVA transcription
(04-04-2016, 05:32 PM)Davidsch Wrote: Is anybody willing to share a list of all first lines (or all paragraphs which start with a gallows character) in the text?

folio nr - line nr - Takahashi EVA transcription


Sorry I do not have that. I do all extraction semi-manually through Job's VQP.

But with some programming that is not very difficult (for programmers, though not for me :) ), this could be done, because text files with different transcriptions are available out there. I suggest that you create a request in the Q2E subforum.
In the old days, when I still had a Unix machine on my desk, I found that one could do almost anything one wanted with a small set of Unix commands. "grep" was most helpful of all.
For the more complicated stuff, basic 'awk' and if necessary 'perl' scripts did the job.
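
In the same spirit, a short script in any language will do. The sketch below uses Python rather than grep or awk and assumes the transcription has already been parsed into (folio, line number, text, starts-paragraph) records. That record layout, the function name `gallows_initial_lines` and the sample strings are made up for illustration and do not reflect the layout of any actual transcription file. It keeps the first line of each paragraph when its text begins with one of the plain EVA gallows letters k, t, p or f.

```python
GALLOWS = ("k", "t", "p", "f")  # plain EVA gallows letters

def gallows_initial_lines(records):
    """Return (folio, line_no, text) for paragraph-initial lines that start with a gallows."""
    return [
        (folio, line_no, text)
        for folio, line_no, text, starts_paragraph in records
        if starts_paragraph and text.startswith(GALLOWS)
    ]

# Invented sample records, not real transcription data
records = [
    ("f1r", 1, "tchedy.daiin.ol", True),
    ("f1r", 2, "shedy.aiin", False),
    ("f1v", 1, "ol.chedy", True),
    ("f2r", 1, "pchedy.qokal", True),
]
for folio, line_no, text in gallows_initial_lines(records):
    print(folio, line_no, text)
```

Dropping the `text.startswith(GALLOWS)` condition gives the plain list of first lines of all paragraphs instead.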
(04-04-2016, 08:29 PM)ReneZ Wrote: In the old days, when I still had a Unix machine on my desk, I found that one could do almost anything one wanted with a small set of Unix commands. "grep" was most helpful of all.
For the more complicated stuff, basic 'awk' and if necessary 'perl' scripts did the job.

Since a Mac is Unix under the hood (BSD Unix) and I have a Mac... I still find those very useful tools and Perl is still a very viable and useful Web tool (and very good for parsing text).
(04-04-2016, 05:32 PM)Davidsch Wrote: Is anybody willing to share a list of all first lines (or all paragraphs which start with a gallows character) in the text?

folio nr - line nr - Takahashi EVA transcription

The attached file contains the first lines of each paragraph, excluding the astrological folios (lines numbered starting at zero). Is that what you were looking for?