(23-06-2020, 08:32 PM)RobGea Wrote: la divina commedia di dante alighieri
Total words: 97344
Vocabulary : 19893
Hapax : 13750
I get different values for the Divine Comedy. Can you upload the corpus?
I use this one:
[link]
Hi bi3mw.
You are quite right. I have indeed made a mistake here.

I used a different script for my Dante results; consistency is king for this kind of thing.
New results:
la divina commedia di dante alighieri
Total words: 97576
Vocabulary : 14680
Hapax : 8956
61.0% of vocab is hapax; 10.8 words per hapax; Totalwords/Vocab ratio 6.64:1.
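These counts are easy to reproduce. Here is a minimal sketch of the counting logic (my assumptions: lowercased, Unicode-letter tokenization and an example file name; the scripts actually used above may tokenize differently, which is exactly how the earlier discrepancy arose):

import re
from collections import Counter

def word_stats(path):
    # Read, lowercase, and keep runs of Unicode letters only
    # (digits, underscores and punctuation are discarded).
    text = open(path, encoding="utf-8").read().lower()
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(words)
    hapax = sum(1 for n in counts.values() if n == 1)
    print("Total words:", len(words))
    print("Vocabulary :", len(counts))
    print("Hapax      :", hapax)
    print("%.1f%% of vocab is hapax" % (100.0 * hapax / len(counts)))
    print("Totalwords/Vocab ratio %.2f:1" % (len(words) / len(counts)))

word_stats("commedia.txt")  # file name is just an example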
Hmm... sorry, but my edited version is 539 KB, 'too large to attach'.
I used the Project Gutenberg Italian version and just deleted the Gutenberg license stuff.
[link]
Pliny is from here (just in case you wanted it):
[link]
My bad, I totally screwed that one up. Parser code attached (as a text file) for completeness.
Thank you and well done for checking my stats, bi3mw.
Replication is part of good science.
Thanks RobGea, I took the liberty of rewriting your text parser. You can now specify the file name on the command line, so you don't need to edit the Python file every time you create a new text file:
python3 ./text_parser_v002.py infile.txt
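For readers who can't grab the attachment, the command-line handling amounts to something like this (text_stats is a hypothetical module name standing in for the counting code sketched earlier; the attached text_parser_v002.py is the reference version):

import sys
from text_stats import word_stats  # hypothetical module, see the sketch above

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python3 ./text_parser_v002.py infile.txt")
    word_stats(sys.argv[1])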
(24-06-2020, 12:43 PM)ReneZ Wrote: All this is based on the assumption that the Voynich MS words are the equivalents of complete words in some natural language.
This is a most natural assumption that is made almost automatically by most people (at least by most people presenting solutions), but I am not at all sure that it is correct.
There are arguments in favour of it and arguments against it, but both types are rather weak.
In favour: adherence (more or less) to Zipf's law
Against: unusual distribution of repeating word sequences.
There are far more arguments (with no claim of completeness):
In favor (somehow it looks like language):
- it is structured
- adherence to Zipf's law (see the sketch after this list)
- context dependency
Against (it doesn't behave like natural language):
- repetition, repetition and repetition / low entropy value
- binomial word length distribution
- no word order / repeated phrases are missing
- Currier A vs. Currier B
- context-dependent self-similarity / deep correlation between frequency, similarity, and spatial vicinity of tokens / one single network of similar words
- random walk results / long-range correlations
- the line works as a functional unit
- no distinguishable semantic word categories (nouns vs verbs etc.)
- function words are missing (i.e. there are no words distributed evenly throughout the text)
- no identifiable word roots
- no deleted glyph sequences
- the ends of the lines nearly always fit into the available space
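To make the Zipf point testable: a crude check is the least-squares slope of the log-log rank-frequency curve, which for natural-language text typically comes out near -1. A minimal sketch (a rigorous test would fit a power law by maximum likelihood instead, but this illustrates the idea):

import math
from collections import Counter

def zipf_slope(words):
    # Ordinary least-squares slope of log(frequency) vs. log(rank);
    # Zipf's law predicts a slope close to -1.
    freqs = sorted(Counter(words).values(), reverse=True)
    n = len(freqs)
    xs = [math.log(r) for r in range(1, n + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

Run it on the token list from the earlier sketch. Note that rough adherence to Zipf's law is a weak diagnostic on its own, since many non-linguistic processes produce it too.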
(25-06-2020, 06:35 AM)Torsten Wrote: - the ends of the lines nearly always fit into the available space
These can also be "filler words" that mean nothing, independent of the rest of the text. Perhaps this is one of the reasons for the relatively high percentage of unique words in the VMS. To check this, one would have to look at these words at the ends of lines or between plants (are they unique or not?); a sketch of such a check follows below.
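That check is easy to automate. A minimal sketch, assuming a plain-text transliteration with one manuscript line per line and dot-separated words (the usual EVA convention), with locus markers and comments already stripped; the file-format details are my assumption:

from collections import Counter

def line_final_hapax_rate(path):
    # One manuscript line per text line, words separated by dots.
    lines = [ln.strip().strip(".").split(".")
             for ln in open(path, encoding="utf-8") if ln.strip()]
    counts = Counter(w for line in lines for w in line)
    tokens = [w for line in lines for w in line]
    finals = [line[-1] for line in lines]
    overall = sum(1 for w in tokens if counts[w] == 1) / len(tokens)
    at_end = sum(1 for w in finals if counts[w] == 1) / len(finals)
    print("hapax rate, all tokens      : %.1f%%" % (100 * overall))
    print("hapax rate, line-final only : %.1f%%" % (100 * at_end))

If line-final tokens turn out to be hapax markedly more often than tokens in general, that would support the filler-word idea.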
(25-06-2020, 06:35 AM)Torsten Wrote: There are far more arguments (with no claim of completeness):
In favor (somehow it looks like language):
- it is structured
- adherence to Zipf's law
- context dependency
My question was whether the Voynich MS "words" represent complete words in some language, not whether the text is (or seems) meaningful.
As long as we have no idea how the text was composed, we cannot judge whether it is meaningful or not.
Statistics are real data, but the interpretation of these real data almost always involves significant assumptions, for example that the visible words in the MS are really words.
I just want to point out that we don't know this, even when it seems reasonable.