(23-06-2020, 08:32 PM)RobGea Wrote: la divina commedia di dante alighieri
Total words: 97344
Vocabulary : 19893
Hapax : 13750
I get different values for the Divine Comedy. Can you upload the corpus?
I use this one:
[link]
Hi bi3mw.
You are quite right. I have indeed made a mistake here.

I used a different script for my Dante results; consistency is king for this kind of thing.
New results:
la divina commedia di dante alighieri
Total words: 97576
Vocabulary : 14680
Hapax : 8956
61.0% of vocab is hapax; 10.8 words per hapax; Totalwords/Vocab ratio 6.64:1.
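These counts are easy to reproduce. Here is a minimal sketch of the counting logic (my assumptions: lowercased, Unicode-letter tokenization and an example file name; the scripts actually used above may tokenize differently, which is exactly how the earlier discrepancy arose):

import re
from collections import Counter

def word_stats(path):
    # Read, lowercase, and keep runs of Unicode letters only
    # (digits, underscores and punctuation are discarded).
    text = open(path, encoding="utf-8").read().lower()
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(words)
    hapax = sum(1 for n in counts.values() if n == 1)
    print("Total words:", len(words))
    print("Vocabulary :", len(counts))
    print("Hapax      :", hapax)
    print("%.1f%% of vocab is hapax" % (100.0 * hapax / len(counts)))
    print("Totalwords/Vocab ratio %.2f:1" % (len(words) / len(counts)))

word_stats("commedia.txt")  # file name is just an example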
Hmm... sorry, but my edited version is 539 KB, 'too large to attach'.
I used the Project Gutenberg Italian version and just deleted the Gutenberg license stuff.
[link]
Pliny is from here (just in case you wanted it):
[link]
My bad, I totally screwed that one up. Parser code attached (as a text file) for completeness.
Thank you and well done for checking my stats, bi3mw.
Replication is part of good science.
Thanks RobGea, I took the liberty of rewriting your text parser. You can now specify the file name on the command line, so you don't need to edit the Python file every time you create a new text file:
python3 ./text_parser_v002.py infile.txt
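For readers who can't grab the attachment, the command-line handling amounts to something like this (text_stats is a hypothetical module name standing in for the counting code sketched earlier; the attached text_parser_v002.py is the reference version):

import sys
from text_stats import word_stats  # hypothetical module, see the sketch above

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python3 ./text_parser_v002.py infile.txt")
    word_stats(sys.argv[1])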
(24-06-2020, 12:43 PM)ReneZ Wrote: All this is based on the assumption that the Voynich MS words are the equivalents of complete words in some natural language.
This is a most natural assumption that is made almost automatically by most people (at least by most people presenting solutions), but I am not at all sure that it is correct.
There are arguments in favour of it and arguments against it, but both types are rather weak.
In favour: adherence (more or less) to Zipf's law
Against: unusual distribution of repeating word sequences.
There are far more arguments (with no claim of completeness):
In favor (somehow it looks like language):
- it is structured
- adherence to Zipf's law (see the sketch after this list)
- context dependency
Against (it doesn't behave like natural language):
- repetition, repetition and repetition / low entropy value
- binomial word length distribution
- no word order / repeated phrases are missing
- Currier A vs. Currier B
- context-dependent self-similarity / deep correlation between frequency, similarity, and spatial vicinity of tokens / one single network of similar words
- random walk results / long-range correlations
- the line works as a functional unit
- no distinguishable semantic word categories (nouns vs verbs etc.)
- function words are missing (i.e. there are no words distributed evenly throughout the text)
- no identifiable word roots
- no deleted glyph sequences
- the ends of the lines nearly always fit into the available space
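To make the Zipf point testable: a crude check is the least-squares slope of the log-log rank-frequency curve, which for natural-language text typically comes out near -1. A minimal sketch (a rigorous test would fit a power law by maximum likelihood instead, but this illustrates the idea):

import math
from collections import Counter

def zipf_slope(words):
    # Ordinary least-squares slope of log(frequency) vs. log(rank);
    # Zipf's law predicts a slope close to -1.
    freqs = sorted(Counter(words).values(), reverse=True)
    n = len(freqs)
    xs = [math.log(r) for r in range(1, n + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

Run it on the token list from the earlier sketch. Note that rough adherence to Zipf's law is a weak diagnostic on its own, since many non-linguistic processes produce it too.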
(25-06-2020, 06:35 AM)Torsten Wrote: - the ends of the lines nearly always fit into the available space
These can also be "filler words" that mean nothing, independent of the rest of the text. Perhaps this is one of the reasons for the relatively high percentage of unique words in the VMS. To check this, one would have to look at these words at the ends of lines or between plants (are they unique or not?); a sketch of such a check follows below.
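That check is easy to automate. A minimal sketch, assuming a plain-text transliteration with one manuscript line per line and dot-separated words (the usual EVA convention), with locus markers and comments already stripped; the file-format details are my assumption:

from collections import Counter

def line_final_hapax_rate(path):
    # One manuscript line per text line, words separated by dots.
    lines = [ln.strip().strip(".").split(".")
             for ln in open(path, encoding="utf-8") if ln.strip()]
    counts = Counter(w for line in lines for w in line)
    tokens = [w for line in lines for w in line]
    finals = [line[-1] for line in lines]
    overall = sum(1 for w in tokens if counts[w] == 1) / len(tokens)
    at_end = sum(1 for w in finals if counts[w] == 1) / len(finals)
    print("hapax rate, all tokens      : %.1f%%" % (100 * overall))
    print("hapax rate, line-final only : %.1f%%" % (100 * at_end))

If line-final tokens turn out to be hapax markedly more often than tokens in general, that would support the filler-word idea.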
(25-06-2020, 06:35 AM)Torsten Wrote: There are far more arguments (with no claim of completeness):
In favor (somehow it looks like language):
- it is structured
- adherence to Zipf's law
- context dependency
My question was whether the Voynich MS "words" represent complete words in some language, not whether the text is (or seems) meaningful.
As long as we have no idea how the text was composed, we cannot judge whether it is meaningful or not.
Statistics are real data, but the interpretation of these real data almost always involves significant assumptions, for example that the visible words in the MS are really words.
I just want to point out that we don't know this, even when it seems reasonable.