The Voynich Ninja

Full Version: Vord frequency histogram as an indicator of the text category
In the course of my Voynich research I noticed that the first vords of folios, in at least some of the VMS sections, are very often unique: 66% of the first vords of botanical folios are unique, as are 60% of the first vords of balneo folios.

High as the above numbers are, the vords that are not first also exhibit quite a good deal of uniqueness. Thus, 33% of second vords, 34% of third vords and 35% of last vords of botanical folios are unique.
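
For anyone who wants to reproduce figures of this kind, here is a minimal sketch in Python. It is not the procedure actually used for the numbers above. It assumes the transcription has already been split into folios and vords, and the function name and the toy tokens are invented for illustration only (they are not real VMS readings). "Unique" is taken to mean that the vord occurs exactly once in the whole text.

```python
from collections import Counter

def hapax_share_by_position(folios, position):
    """Share of vords at a given position (0 = first, 1 = second, -1 = last)
    that occur exactly once in the whole text.

    folios: list of folios, each given as a list of vords (strings).
    """
    # Frequency of every vord over the whole text, not per folio.
    counts = Counter(v for folio in folios for v in folio)

    picked = []
    for folio in folios:
        try:
            picked.append(folio[position])
        except IndexError:
            # Folio too short to have a vord at this position: skip it.
            continue

    if not picked:
        return 0.0
    return sum(1 for v in picked if counts[v] == 1) / len(picked)

# Toy input, invented for illustration only.
toy_folios = [
    ["aaa", "bbb", "ccc", "ddd"],
    ["eee", "ddd", "fff", "ggg"],
    ["hhh", "bbb", "fff", "ddd"],
]

first_share = hapax_share_by_position(toy_folios, 0)
second_share = hapax_share_by_position(toy_folios, 1)
last_share = hapax_share_by_position(toy_folios, -1)
```

The same function can be fed paragraphs instead of folios, to check whether it is the folio beginning or the paragraph beginning that goes with uniqueness.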

The above figures and their significance for the research direction that I pursue will be explained in detail in my forthcoming article. For now, what matters is that I got the feeling that many vords are unique overall, which may be characteristic of the Voynich text as a whole. To check this feeling, I referred to Rene's website, which says:

Quote:The number of word types that appear only once in the MS (so-called hapax legomena) is rather high (about half of all word types). A detailed comparison with plain texts in other languages should be made.

What occurs to me is that the distribution of vord frequency counts would tell us about the type of the underlying message rather than about the underlying language. If analyzed and compared with other sources, it could possibly tell us two things:

1) What is the range of matching text categories? Is it a Bible, a herbal, and so on?

2) What is the range of matching text styles? Is it a narrative, a reference book, or telegraph-style notes ("a good glass in the bishop's hostel...")? Is it abbreviated in the medieval style or not abbreviated?

The analysis of the vord frequency count distribution can be further supported (and the results thus narrowed down) by a comparative analysis of vord entropy. As far as I know, word entropy depends on the text category.

Have any results been obtained on the above aspects? If not, would anyone be willing to at least build a vord frequency histogram?

Should we put this research forward as a Voynich task? What do you think?
Hi Anton,
When you say "unique" in the first paragraph, do you mean unique to the page or to the manuscript? (I'm assuming MS)
To save us looking it up, what's the average chance any word would be unique, about 50%?

Whilst I think this angle may have some merit - and so do others such as Pelling, who was recently trying to match text to Italian poetry, etc - I'm not sure how to word this for a task.

What would be the task summary and parameters?
Quote:When you say "unique" in the first paragraph, do you mean unique to the page or to the manuscript? (I'm assuming MS)

I mean unique to the manuscript, in other words its frequency count is 1.

Quote:To save us looking it up, what's the average chance any word would be unique, about 50%?

Not sure if I understood your question. If you mean the VMS, then (as Rene suggests) roughly half of all vords are unique, i.e. they are encountered only once in the VMS. If you mean "for a given text, what is the probability that a given word is unique in that text", this would depend, I guess, primarily on the category and style of the text rather than on its language. (I may be wrong about this.) Actually that's the key to what I am proposing.


For a formalized task, this would look something like:

1) Build the vord frequency histogram for the VMS
2) Select a set of contemporary documents of various categories and styles for comparison: such as herbals, astrological books, philosophical treatises, general narratives etc. Include versions with a high degree of word abbreviation, if they exist. (Each different abbreviation of the same word should be considered a different word for the sake of the calculation!) Calculate their word frequency histograms.
3) Compare the results of 1) and 2) and draw conclusions.
4) Repeat steps 1) - 3) for word entropy calculations.
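
Steps 1) and 4) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the text is taken to be already tokenized into a list of words, the function names are mine, and "word entropy" is taken to mean the Shannon entropy of the word frequency distribution (in bits per word), which may or may not be the exact measure intended.

```python
import math
from collections import Counter

def frequency_spectrum(words):
    """Histogram of frequency counts: maps a frequency f to the
    number of distinct word types that occur exactly f times."""
    counts = Counter(words)
    return Counter(counts.values())

def hapax_fraction(words):
    """Fraction of word types that occur exactly once."""
    counts = Counter(words)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

def word_entropy(words):
    """Shannon entropy of the word distribution, in bits per word."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy input, invented for illustration only.
toy_words = "a b a c a b d".split()

spectrum = frequency_spectrum(toy_words)
hapax = hapax_fraction(toy_words)
entropy = word_entropy(toy_words)
```

Both the hapax fraction and the entropy estimate depend on the length of the text, so for step 3) it would be safer to compare samples cut to the same number of words.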

Actually, I'm not sure whether anyone has done all this before, that's why I am asking.
The idea is good. However, if there are empty symbols (standing for punctuation), or if the gallows are semantic markers, this will significantly distort the results.
Thanks, I understand it better now.
I'm fairly sure I've seen this sort of research out there. I'll have to have a dig around.
Any such research will, as Wladimir points out, have to have a protocol for glyph interpretation (standardised transcription) in order to present a repeatable study.
Quote:Any such research will, as Wladimir points out, have to have a protocol for glyph interpretation (standardised transcription) in order to present a repeatable study.

That's true for any VMS text statistics study, without exception. With no better choice, one could start with the available EVA transcriptions.
One way to achieve repeatability is to extract transcription files using this tool: [link not available]
... Always assuming the ligatures in EVA are correctly used.
Well, all transcriptions are going to have errors, and the best we can hope is that they're not consistent enough to bias the statistics. This depends a lot on the exact type of statistics one is making.
Word frequency counts (as in this particular case) should not be affected so much by the alphabet used as, for example, character entropy statistics will be.
In the present case (word counts), handling of apparent word spaces is much more important, and this is a real issue.
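
To illustrate how much the treatment of spaces can matter, here is a small Python sketch. It assumes a transcription convention in which "." marks a word space and "," marks an uncertain one; please check this against the transcription file actually used. The input line and the function names are invented for the example.

```python
from collections import Counter

def tokenize(line, uncertain_as_space):
    """Split a transcribed line into words.

    Assumed convention: "." is a word space, "," is an uncertain space.
    uncertain_as_space=True  treats "," as a real word break,
    uncertain_as_space=False joins the parts on either side of ",".
    """
    line = line.replace(",", "." if uncertain_as_space else "")
    return [w for w in line.split(".") if w]

def hapax_count(words):
    """Number of word types that occur exactly once."""
    return sum(1 for c in Counter(words).values() if c == 1)

# Toy line, invented for illustration only.
toy_line = "ab,cd.ab.cd.ef"

split_words = tokenize(toy_line, uncertain_as_space=True)
joined_words = tokenize(toy_line, uncertain_as_space=False)

hapax_if_split = hapax_count(split_words)
hapax_if_joined = hapax_count(joined_words)
```

In this toy line a single uncertain space changes the number of unique words from one to four, so any word count study should state which of the two conventions it used.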
In the Recipe section folios, 73% of first vords are unique.

I think I'll go ahead and do some checks for first vords of paragraphs.

First vords of folios are, by definition, first vords of paragraphs. So it may be that it is the paragraph beginning that implies vord uniqueness, and not the folio beginning.

OK, so I checked all first vords of second paragraphs in the Recipe section, and their uniqueness is 61%. Still high.

Let's see for botanical folios...