Yesterday, 04:40 PM
Yesterday, 08:30 PM
(Yesterday, 04:40 PM)Mauro Wrote: I imagine the pain St. Thomas Aquinas had to endure when he wrote his Summa Theologica with such a limited vocabulary...
Just to clarify for myself: the Voynich transcript has a lot of different words. More than is usual for Latin? I mean, the whole text has 8000 different words, according to the forum.
Yesterday, 10:08 PM
(Yesterday, 08:30 PM)Kaybo Wrote: (Yesterday, 04:40 PM)Mauro Wrote: I imagine the pain St. Thomas Aquinas had to endure when he wrote his Summa Theologica with such a limited vocabulary...
Just to clarify for myself: the Voynich transcript has a lot of different words. More than is usual for Latin? I mean, the whole text has 8000 different words, according to the forum.
The VMS (*) has 38411 words (**) in total and 8424 unique words (***), that is to say one word type per ~4.56 word tokens. This is only a slightly higher proportion than, for instance, Caesar's De Bello Gallico (one word type per ~4.67 word tokens). And note that De Bello Gallico is a much longer text (~51000 word tokens in total), and with longer texts the proportion of unique word types is expected to decrease. If anything, I'd say De Bello Gallico is slightly more varied in vocabulary than the VMS.
(*) Rf1a-n transcription, words with question marks removed
(**) word tokens
(***) word types
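The counts above (38411 word tokens, 8424 word types, one type per ~4.56 tokens) are just a token count and a distinct-word count. A minimal sketch of that computation in Python (the sample words are made up, EVA-style, not real VMS data):

```python
def type_token_stats(tokens):
    """Return (word tokens, word types, tokens per type) for a token list."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    return n_tokens, n_types, n_tokens / n_types

# Tiny illustrative sample (made-up EVA-style words, not real VMS data).
sample = "daiin ol daiin chedy ol shedy daiin".split()
print(type_token_stats(sample))  # → (7, 4, 1.75)
```

Applied to the full transcription, the same count gives 38411 / 8424 ≈ 4.56 tokens per type.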
3 hours ago
(Yesterday, 10:08 PM)Mauro Wrote: The VMS (*) has 38411 words (**) in total and 8424 unique words (***), that is to say one word type per ~4.56 word tokens. This is only a slightly higher proportion than, for instance, Caesar's De Bello Gallico (one word type per ~4.67 word tokens).
If a language follows Zipf's law, the token/lexeme ratio in a sample cannot be a constant. As the number N of tokens (word occurrences) in a sample increases, the number M of lexemes (distinct words) grows roughly like K*sqrt(N); more precisely, like K*N**b, where b is typically between 0.4 and 0.6. This formula is known as Heaps' law.
So, when comparing the VMS lexicon size to that of other languages, it is important to use samples with the same number of tokens.
Assuming the exponent b is 0.5 for both languages, the interesting language parameter (independent of sample size) is K = M/sqrt(N), not M/N.
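This normalization can be sketched numerically. In the example below, b = 0.5 is assumed for both texts, and the De Bello Gallico type count is back-derived from the one-type-per-~4.67-tokens ratio quoted earlier in the thread (so it is only an illustration, not a measured figure):

```python
def heaps_k(n_tokens, n_types, b=0.5):
    """Size-independent vocabulary-richness parameter K = M / N**b (Heaps' law)."""
    return n_types / n_tokens ** b

# VMS figures quoted in this thread (Rf1a-n transcription).
k_vms = heaps_k(38411, 8424)

# De Bello Gallico: ~51000 tokens; the type count is back-derived
# from the quoted ratio of one type per ~4.67 tokens (illustrative only).
dbg_types = round(51000 / 4.67)
k_dbg = heaps_k(51000, dbg_types)

print(f"K_vms = {k_vms:.1f}, K_dbg = {k_dbg:.1f}")
```

Under this (assumed) b = 0.5 normalization, the longer text no longer looks less varied merely because it is longer.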
All the best, --stolfi
31 minutes ago
(3 hours ago)Jorge_Stolfi Wrote: If a language follows Zipf's law, the token/lexeme ratio in a sample cannot be a constant. As the number N of tokens (word occurrences) in a sample increases, the number M of lexemes (distinct words) grows roughly like K*sqrt(N); more precisely, like K*N**b, where b is typically between 0.4 and 0.6. This formula is known as Heaps' law.
So, when comparing the VMS lexicon size to that of other languages, it is important to use samples with the same number of tokens.
Assuming the exponent b is 0.5 for both languages, the interesting language parameter (independent of sample size) is K = M/sqrt(N), not M/N.
Thank you, I didn't know about Heaps' law.