Philipp Harland > 18-11-2025, 12:48 AM
Is it truly research-grade, i.e. can it produce non-trivial results that couldn't be produced without it? It seems like it's working pretty well for the VMS.
Rafal > 18-11-2025, 01:42 PM
Jorge_Stolfi > 18-11-2025, 01:52 PM
(18-11-2025, 01:42 PM)Rafal Wrote: Would it mean that there are fewer patterns in languages with declension?
quimqu > 18-11-2025, 02:12 PM
(18-11-2025, 01:42 PM)Rafal Wrote: Quimqu, I have one more question. How do your methods treat languages with declension?
Take for example:
Latin: Homo homini lupus est
English: Man is a wolf to man
In English the word "man" appears twice. In Latin it is "homo" and "homini".
Do your methods know that "homo" and "homini" are in fact the same word? I guess they don't. They treat it the same as "homo" and "lupus", as totally different words, right?
Would it mean that there are fewer patterns in languages with declension?
quimqu > 18-11-2025, 03:29 PM
(17-11-2025, 02:50 PM)Jorge_Stolfi Wrote: I have the following versions of the Pentateuch
Jorge_Stolfi > 18-11-2025, 04:00 PM
(18-11-2025, 03:29 PM)quimqu Wrote: Pentateuch
English: 157604 tokens, 4703 unique.
French: 150940 tokens, 7570 unique.
German: 143146 tokens, 7317 unique.
Hebrew: 66311 tokens, 20976 unique.
Latin: 96870 tokens, 14001 unique.
Mandarin Other: 193335 tokens, 2267 unique.
Mandarin Union: 174380 tokens, 2178 unique.
Russian: 112011 tokens, 12443 unique.
Spanish: 138777 tokens, 8572 unique.
Vietnamese: 146634 tokens, 4213 unique.
Does this make sense?
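For a quick sanity check of these counts, one can compare unique forms per token (the type/token ratio). This is only a minimal sketch using the numbers listed above; the helper name `type_token_ratio` is made up for illustration and is not part of anyone's method in this thread.

```python
# Token and unique-word counts exactly as posted above for the Pentateuch versions.
counts = {
    "English": (157604, 4703),
    "French": (150940, 7570),
    "German": (143146, 7317),
    "Hebrew": (66311, 20976),
    "Latin": (96870, 14001),
    "Mandarin Other": (193335, 2267),
    "Mandarin Union": (174380, 2178),
    "Russian": (112011, 12443),
    "Spanish": (138777, 8572),
    "Vietnamese": (146634, 4213),
}


def type_token_ratio(tokens: int, unique: int) -> float:
    """Unique word forms divided by total tokens (illustrative helper)."""
    return unique / tokens


ratios = {lang: type_token_ratio(t, u) for lang, (t, u) in counts.items()}

# Print from most to fewest distinct forms per token.
for lang, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:15s} {ratio:.3f}")
```

On these numbers Hebrew, Latin and Russian come out highest and the two Mandarin versions lowest.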
Rafal > 18-11-2025, 04:19 PM
Quote:Does this make sense?
Rafal > 18-11-2025, 04:22 PM
Quote:It would be good to try to join the declined words (or plurals). I must think how to do it
quimqu > 18-11-2025, 04:49 PM
(18-11-2025, 04:22 PM)Rafal Wrote: Quote:It would be good to try to join the declined words (or plurals). I must think how to do it
I don't believe there is an easy way to do it.
A language may have several declension patterns and a lot of exceptions, so writing general rules probably wouldn't work. For each language you would basically need a database linking the forms of the same word.
I don't know if such things exist.
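To make the idea of such a database concrete, here is a minimal sketch with a tiny hand-made table. The table `LEMMAS` and the function names are invented for illustration and are not taken from any existing resource; a usable table would need an entry for every inflected form.

```python
from collections import Counter

# Hypothetical, hand-written form -> base-form table, for illustration only.
LEMMAS = {
    "homo": "homo",
    "homini": "homo",
    "lupus": "lupus",
    "est": "sum",
}


def count_forms(tokens):
    """Count tokens as written, so 'homo' and 'homini' stay separate."""
    return Counter(tokens)


def count_lemmas(tokens, table):
    """Count tokens after mapping each form to its base form.

    Forms missing from the table are left unchanged.
    """
    return Counter(table.get(tok, tok) for tok in tokens)


tokens = "homo homini lupus est".split()
surface = count_forms(tokens)
lemmas = count_lemmas(tokens, LEMMAS)
print(len(surface), "distinct forms;", len(lemmas), "distinct base forms")
```

Without the table the sentence has four different words; with it, "homo" and "homini" are counted as the same word.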
Mauro > 18-11-2025, 07:27 PM
(18-11-2025, 04:49 PM)quimqu Wrote: I was thinking of calculating word similarity and joining words above a high similarity threshold. With a threshold of one or two character changes, I could reduce the graph dimension and the number of different tokens. But this needs to be tested and proven... I really don't know if it can be that easy... (Guess not).
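The threshold idea quoted above can be sketched roughly as follows. This is an assumption about how it might be done, not quimqu's actual code: it uses plain Levenshtein distance as the similarity measure and groups words by single linkage. The names `levenshtein` and `merge_similar` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_similar(words, threshold):
    """Group words linked by pairs with edit distance <= threshold.

    Compares every pair, so it is quadratic in the vocabulary size.
    """
    vocab = sorted(set(words))
    parent = {w: w for w in vocab}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(vocab):
        for b in vocab[i + 1:]:
            if levenshtein(a, b) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in vocab:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())


words = ["homo", "homini", "lupus", "lupum", "cat", "car"]
groups = merge_similar(words, threshold=1)
print(groups)
```

In this toy vocabulary a threshold of 1 or 2 joins "lupus" and "lupum", but it also joins the unrelated "cat" and "car", and it still leaves "homo" and "homini" apart (their distance is 3).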