(18-11-2025, 03:29 PM)quimqu Wrote: Pentateuch
English: 157604 tokens, 4703 unique.
French: 150940 tokens, 7570 unique.
German: 143146 tokens, 7317 unique.
Hebrew: 66311 tokens, 20976 unique.
Latin: 96870 tokens, 14001 unique.
Mandarin Other: 193335 tokens, 2267 unique.
Mandarin Union: 174380 tokens, 2178 unique.
Russian: 112011 tokens, 12443 unique.
Spanish: 138777 tokens, 8572 unique.
Vietnamese: 146634 tokens, 4213 unique.
Does this make sense?
It is expected that both numbers will vary a lot depending on the language.
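To make the comparison concrete, here is a minimal Python sketch that turns the counts quoted above into a type/token ratio per language. The counts are copied from the quoted list; the names `counts`, `type_token_ratio` and `ranked` are just illustrative, and this is not necessarily how the original numbers were computed.

```python
# Token and type (unique word) counts copied from the quoted Pentateuch list.
counts = {
    "English": (157604, 4703),
    "French": (150940, 7570),
    "German": (143146, 7317),
    "Hebrew": (66311, 20976),
    "Latin": (96870, 14001),
    "Mandarin Other": (193335, 2267),
    "Mandarin Union": (174380, 2178),
    "Russian": (112011, 12443),
    "Spanish": (138777, 8572),
    "Vietnamese": (146634, 4213),
}


def type_token_ratio(tokens, types):
    """Fraction of the running words that are distinct word forms."""
    return types / tokens


# Sort languages from the highest ratio (many distinct forms) to the lowest.
ranked = sorted(
    counts,
    key=lambda lang: type_token_ratio(*counts[lang]),
    reverse=True,
)

for lang in ranked:
    tokens, types = counts[lang]
    ratio = type_token_ratio(tokens, types)
    print(f"{lang:15s} {ratio:.3f}  ({tokens / types:.1f} tokens per type)")
```

With these numbers Hebrew comes out with the highest ratio and the two Mandarin versions with the lowest, which is the direction one would expect from the points below.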
Hebrew (like Arabic) has many [link] that require paraphrasing when translated into other languages. For instance Arabic "yatakātabūna" = "they write to each other".
Latin has no articles, and uses declensions instead of the prepositions "of", "to", "from". English and the Romance languages have articles and use prepositions instead of declensions. (And Portuguese, unlike Spanish, contracts many preposition+article pairs into single words.) German has articles and is halfway between English and Latin with respect to prepositions.
In English the subject pronoun is mandatory, whereas in most other IE languages it is implied by the verbal inflection: Italian "però canta" = "however he sings". English also uses auxiliary verbs instead of inflections for some tenses: Italian "canterà" = "[he] will sing", Portuguese "cantara" = "[he] had sung".
In Italian and Spanish, the oblique pronouns are often attached to the verb: Italian "portiamocelo" = "let's take it with us". In Portuguese they may be hyphenated to it, or sometimes inserted in the middle of it: "cantá-lo-ei" = "I will sing it".
Russian, IIUC, has no articles, and it too uses declensions instead of some prepositions.
A large fraction of the nouns and verbs that are single words in European or Semitic languages are two-word compounds in Mandarin and Vietnamese. In traditional script (and in your files) those compounds are not marked; the two parts are written as separate words/characters. (They are often hyphenated in pinyin transcriptions, but that hyphenation is based on some Western language.) Moreover each word is a single syllable with a rigid structure. Hence the number of tokens is expected to be higher, and the number of lexemes smaller.
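The effect of writing compounds as separate parts can be sketched with a toy example. The word list below is invented English, used only as an analogy (it is not Mandarin or Vietnamese data), and the helper `count` is hypothetical: the same text is counted once with each compound as a single word and once with its parts as separate words.

```python
# Toy text: each item is a two-part compound, with "-" marking the join.
# Invented English examples, used only as an analogy.
compounds = [
    "sun-rise", "sun-set", "sun-light",
    "moon-rise", "moon-set", "moon-light",
    "day-light", "sun-rise",
]


def count(tokens):
    """Return (number of tokens, number of distinct tokens)."""
    return len(tokens), len(set(tokens))


# Convention 1: each compound is written as one word ("sunrise").
merged = [c.replace("-", "") for c in compounds]

# Convention 2: each part is written as its own word ("sun", "rise").
split = [part for c in compounds for part in c.split("-")]

print("merged:", count(merged))
print("split: ", count(split))
```

In this toy text, splitting doubles the token count while the number of distinct tokens drops slightly, because the same few parts recur in many compounds. How strong the effect is in a real text depends on how often the parts are shared.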
German, on the other hand, has the habit of merging nominal phrases into single words: "Rinderkennzeichnungsfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" = "Law for the transfer of monitoring duties for the labeling of beef with information about cattle identification".
Also, in Mandarin and Vietnamese there is no clear division of roots into nouns, verbs, adjectives, and adverbs, which is a characteristic feature of Indo-European languages. They are somewhat like English (which I do not consider to be an IE language, partly for that reason), where you can say "that is a big stone", "they were going to stone him", "it is a stone building", "the floor was stone cold".
All the best, --stolfi