Very interesting work!
(11-11-2025, 09:27 PM)quimqu Wrote: After segmenting all words, I counted how often each piece appeared at the beginning, middle or end of words, and finally I grouped them into prefixes, stems, suffixes or others.
What is the definition of "others"?
I would do the same analysis after erasing all occurrences of {o, a, y}. I don't recall why exactly, but years ago I got convinced that their occurrences are somewhat independent of the arrangement of the other glyphs. That is, the rule for generating plausible Voynichese words would be to generate a string without those "circle" letters, and then insert circles between the non-circle glyphs - at most one in each spot.
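To make the circle-erasing step concrete, here is a minimal Python sketch. It assumes each word is a plain string in an EVA-like transliteration with one character per circle glyph; the function names are made up for illustration, and reading "at most one in each spot" as "no two adjacent circles" is my interpretation of the rule, not a tested claim about the text.

```python
CIRCLES = set("oay")  # the "circle" glyphs named in the post


def strip_circles(word: str) -> str:
    """Return the word with every o, a, y glyph removed."""
    return "".join(g for g in word if g not in CIRCLES)


def at_most_one_circle_per_slot(word: str) -> bool:
    """True if no two circle glyphs are adjacent in the word."""
    return not any(a in CIRCLES and b in CIRCLES for a, b in zip(word, word[1:]))


if __name__ == "__main__":
    for w in ["qokeedy", "daiin", "oaiin"]:
        print(w, strip_circles(w), at_most_one_circle_per_slot(w))
```

Counting how many word types fail the second check would give a rough idea of how well the "one circle per slot" rule holds.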
I would also map s to r, Ch and Sh and ee to Ch, k/t/p/f to k. While that may throw away information about differences between those merged letters, it would make the result less affected by transcription errors. And also simpler to understand because there would be far fewer prefixes, suffixes, etc.
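The glyph merging could be sketched like this in Python. It assumes the transliteration writes the benches with capital letters (Ch, Sh) as in this post, so that lowercase s and capital S do not collide; the `MERGE` table and `merge_glyphs` name are only illustrative. Multi-character glyphs are matched before single ones, and a run like eee is read left to right as ee followed by e, which is an arbitrary choice.

```python
import re

# Merges suggested in the post: s -> r; Sh, ee -> Ch; t, p, f -> k.
MERGE = {"Sh": "Ch", "ee": "Ch", "s": "r", "t": "k", "p": "k", "f": "k"}

# Try longer glyph sequences first so "Sh" and "ee" win over single letters.
_PATTERN = re.compile(
    "|".join(sorted((re.escape(g) for g in MERGE), key=len, reverse=True))
)


def merge_glyphs(word: str) -> str:
    """Replace each glyph listed in MERGE by its representative."""
    return _PATTERN.sub(lambda m: MERGE[m.group(0)], word)


if __name__ == "__main__":
    for w in ["Shedy", "qoteedy", "sair"]:
        print(w, "->", merge_glyphs(w))
```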
Also, I would not work with the text (stream of tokens), but with the lexicon (word types without regard for their occurrences). Word occurrence frequencies are a distracting noise when studying the morphology of a language. The word structure is usually more evident if we disregard word frequencies. For example, if in the War of the Worlds we look at the frequencies of "other", "brother", and "mother" compared with "others", "brothers", and "mothers", we get very discrepant results, arguing against "s" being a suffix. But that's because the main character in the novel has a single brother... And see also my observation about word lengths in Voynichese and Vietnamese.
So I would make a "safe" lexicon with all word types that occur at least N times (say, 3) and apply your prefix/core/suffix algorithm to that set. But be prepared to accept missing combinations (like br+others in the WoW lexicon).
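Building the "safe" lexicon is a one-liner with a counter. A small sketch, assuming the text is already available as a list of word tokens (the `safe_lexicon` name and the toy tokens are just for illustration):

```python
from collections import Counter


def safe_lexicon(tokens, min_count=3):
    """Return the sorted word types that occur at least min_count times."""
    counts = Counter(tokens)
    return sorted(w for w, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    toy = "daiin daiin daiin ol ol chedy".split()
    print(safe_lexicon(toy))     # types seen at least 3 times
    print(safe_lexicon(toy, 2))  # types seen at least 2 times
```

The output is a set of types, so each word counts once in the later prefix/core/suffix statistics no matter how often it occurs in the text.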
Finally, I assume that the ['] in the stem column means empty stem. But why not also allow empty as an option in prefix and suffix? In monosyllabic languages, the "words" (syllables) can be parsed as prefix-stem-suffix combinations; but the prefix and suffix, more than the stem, can be empty.
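To show what allowing empty prefix and suffix would mean in practice, here is a hypothetical sketch. It is not your segmentation algorithm, which I have not seen in code form; it only enumerates every way to cut a word into prefix + stem + suffix from given inventories, with the empty string always admitted as prefix and as suffix. The inventories in the example are invented.

```python
def parses(word, prefixes, stems, suffixes):
    """List every (prefix, stem, suffix) split of word.

    The prefix and the suffix may be empty; the stem must be in stems.
    """
    pre = set(prefixes) | {""}
    suf = set(suffixes) | {""}
    found = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            p, s, x = word[:i], word[i:j], word[j:]
            if p in pre and s in stems and x in suf:
                found.append((p, s, x))
    return found


if __name__ == "__main__":
    # Invented inventories, for illustration only.
    print(parses("qokedy", {"qo"}, {"k", "qok"}, {"edy"}))
    print(parses("ol", {"qo"}, {"ol"}, {"dy"}))
```

Words with more than one parse would show where the prefix/stem boundary is ambiguous under a given inventory.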
All the best, --stolfi