20-03-2026, 12:30 AM
(19-03-2026, 09:39 PM)oeesordy Wrote: Adding to the challenge I see it has not been met as just in theory. What if it were the highest frequency word 165 in the book with the same amount of tokens as daiin for the word in the book as the suffix?
I am not sure I understood your remark. But anyway here is that Portuguese example I mentioned before. From a novel (Dom Casmurro, 1899):
| 87 antes = "before"
| 20 instantes = "instants"
| 2 interessantes = "interesting"
| 1 bastantes = "enough"
| 1 brilhantes = "brilliant"
| 1 centelhantes = "sparkling"
| 1 consoantes
| 1 constantes
| 1 dantes = "of before"
| 1 distantes
| 1 endossantes = "underwriting"
| 1 incessantes
| 1 restantes = "remaining"
| 1 semelhantes = "similar"
The numbers are only half of those of daiin, but it is a printed book that went through careful proof-reading, and it has a bigger vocabulary than the VMS.
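For anyone who wants to repeat this kind of count on another text, here is a minimal Python sketch. The function name suffix_counts and the tokenization rule (runs of letters, lowercased) are my own assumptions for illustration, not necessarily how the list above was produced, and the sample sentence is made up rather than taken from the novel.

```python
import re
from collections import Counter


def suffix_counts(text: str, suffix: str) -> Counter:
    """Count the tokens of each distinct word that ends with `suffix`."""
    # Assumed tokenization: runs of Unicode letters, lowercased so that
    # "Antes" and "antes" are counted as the same word.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if w.endswith(suffix))


# Tiny made-up sample, NOT the text of the novel.
sample = "Antes de sair, esperou uns instantes; antes disso, nada. Dias distantes."
counts = suffix_counts(sample, "antes")

# Print one line per word, most frequent first, in the same layout as the list.
for word, n in counts.most_common():
    print(n, word)
```

Note that the word itself ("antes") is counted together with the longer words that merely end in it, as in the list above. A different tokenization or spelling normalization will shift the numbers somewhat.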
But there are some things you must consider:
- Statistics are not properties of a language. There is no such thing as "the most common word in English", or "the frequency of the letter 'e' in English". Statistics are properties of a specific text (or of a corpus, a collection of specific texts). Did you know that there is a whole novel in English that does not use the letter 'e' even once? And that readers won't notice unless they are told?
- That said, languages can be amazingly different and complicated. Do not assume that something you think is true for European languages is true for all languages.
- And never just assume that something is true for any language, even for your native tongue. Check before making any such claim. You will often be surprised...
- The VMS is not a novel. It is not a treatise on philosophy or theology. It is not a historical or geographical account. It is not poetry. Thus it is pointless to compare its statistics to those of novels, treatises, histories, poems, etc. At best it makes sense to compare them to those of the Alchemist's Herbal or ancient pharmacopoeias.
- The VMS contains lots of errors, from various sources. We don't know how many yet. Currently I would guess that at least 5% of all tokens have a "spelling error" of some sort, and most if not all of the so-called "weirdo" glyphs are common glyphs that were mangled by the Scribe or other processes. Thus do not pay attention to words that occur only once. Or even twice.
All the best, --stolfi