The Voynich Ninja

Full Version: Opinions on: line as a functional unit
(11-11-2025, 06:10 PM)Jorge_Stolfi Wrote: Sorry for missing that point. Do you mean that the first and second syllables together may not comprise the whole word? That is, they may leave some glyphs in the middle? Or even overlap?

The words are formed strictly according to the table. The above example then looks like this:

cheo/ck/hy
cheo/ct/hy
cheo/ik/hy
etc.

cheo/hy does not occur as a word at all (without middle syllable: 0).
Hello,

I also did some experiments with syllables on the Voynich EVA transcription last week. I broke each word into smaller, frequent character sequences (I call them segments). First I counted every 2-gram to 5-gram that appears often in the corpus, to get the basic "building blocks". Then I used a simple algorithm to find the best way to split each word into those pieces: it tries every possible split and chooses the one with the highest total score, where each piece's score is based on how often it appears in the whole text. After segmenting all words, I counted how often each piece appeared at the beginning, middle, or end of words, and finally I grouped them into prefixes, stems, suffixes, or others. These are the top results by category:

[attachment=12215]

You can see that there are far fewer suffixes than the rest. In total I got:

Prefixes: 69
Stems: 38
Suffixes: 22
Others: 804

There are curiosities like "qo", "qok", and "qoke" that appear as different prefixes. This happens because the algorithm looks at frequency and position, not meaning. Since “qo”, “qok”, and “qoke” each occur often at the beginnings of different words, all of them score high as starting segments. The model treats each frequent beginning separately instead of assuming one is part of another.
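
A minimal sketch of the segmentation step described above, for anyone who wants to experiment. The scoring details are assumptions, not the settings actually used: a piece scores the log of its corpus count, single glyphs get a fixed penalty so that every word can be split, and the names `count_ngrams` and `best_split` are made up for illustration. Memoized recursion gives the same result as trying every possible split.

```python
import math
from collections import Counter
from functools import lru_cache


def count_ngrams(words, n_min=2, n_max=5):
    """Count every substring of length n_min..n_max over all word tokens."""
    counts = Counter()
    for w in words:
        for n in range(n_min, n_max + 1):
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    return counts


def best_split(word, counts, min_count=2, max_len=5, single_penalty=-1.0):
    """Return the split of `word` with the highest total score.

    Assumed scoring: log(count) for pieces seen at least `min_count` times,
    `single_penalty` for a single glyph (fallback so a split always exists).
    """
    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return 0.0, ()
        best = None
        for end in range(start + 1, min(len(word), start + max_len) + 1):
            piece = word[start:end]
            if len(piece) == 1:
                score = single_penalty
            elif counts.get(piece, 0) >= min_count:
                score = math.log(counts[piece])
            else:
                continue
            rest_score, rest = solve(end)
            candidate = (score + rest_score, (piece,) + rest)
            if best is None or candidate[0] > best[0]:
                best = candidate
        return best

    return list(solve(0)[1])
```

With this kind of additive score, every frequent beginning such as "qo", "qok", or "qoke" is a separate candidate piece, which is consistent with all three showing up as distinct starting segments.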
(11-11-2025, 09:27 PM)quimqu Wrote: There are curiosities like "qo", "qok" and "qoke" that appear as different prefixes.

I have comparable results (see link to website in post #53).
(11-11-2025, 09:37 PM)bi3mw Wrote:
(11-11-2025, 09:27 PM)quimqu Wrote: There are curiosities like "qo", "qok" and "qoke" that appear as different prefixes.

I have comparable results (see link to website in post #53).

Yes, I see. But do you think it is right? I mean, we have similar results, but my logic tells me that qo should be the prefix and ke should go into the stem, shouldn't it? That's why I didn't post these results earlier. I am trying to understand them.
(11-11-2025, 10:46 PM)quimqu Wrote: Yes, I see. But do you think it is right? I mean, we have similar results, but my logic tells me that qo should be the prefix and ke should go into the stem, shouldn't it? That's why I didn't post these results earlier. I am trying to understand them.

Hmm, I only found “ke” as a stem in combination with “qo” as a prefix once. However, only with “ol” as the syllable before it, which can also be a prefix.
What often appears in my tables is “qoke/ey.” Here, “ke” is part of the prefix because it occurs frequently.  As I said, calculations are based strictly on frequency.
[attachment=12232]

Clarification: I am not attempting to define “true” syllables or morphemes in the linguistic sense. I am using the terms “prefix,” “stem,” and “suffix” only because I cannot think of any better ones.
Very interesting work!

(11-11-2025, 09:27 PM)quimqu Wrote: After segmenting all words, I counted how often each piece appeared at the beginning, middle, or end of words, and finally I grouped them into prefixes, stems, suffixes, or others.

What is the definition of "others"?

I would do the same analysis after erasing all occurrences of {o,a,y}.  I don't recall why exactly, but years ago I got convinced that their occurrences are somewhat independent of the arrangement of the other glyphs.  That is, the rule for generating plausible Voynichese words would be to generate a string without those "circle" letters, and then insert circles between the non-circle glyphs -  at most one in each spot.

I would also map s to r; Ch, Sh, and ee to Ch; and k/t/p/f to k. While that may throw away information about differences between those merged letters, it would make the result less affected by transcription errors. It would also be simpler to understand, because there would be far fewer prefixes, suffixes, etc.
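
A minimal sketch of the two normalizations suggested above, assuming lowercase EVA input (so "ch" and "sh" stand for Ch and Sh). The order of the two steps is also an assumption: here the glyph classes are merged first and the circles {o,a,y} are erased afterwards. The name `normalize` is just for illustration.

```python
import re

# Merges as listed above; multi-glyph keys must be tried before single glyphs,
# which the order of alternatives in MERGE_RE guarantees ("sh" before "s").
MERGES = {"sh": "ch", "ee": "ch", "t": "k", "p": "k", "f": "k", "s": "r"}
MERGE_RE = re.compile(r"sh|ee|[tpfs]")
CIRCLES_RE = re.compile(r"[oay]")


def normalize(word):
    """Merge similar glyph classes, then erase the circle letters o, a, y."""
    merged = MERGE_RE.sub(lambda m: MERGES[m.group(0)], word)
    return CIRCLES_RE.sub("", merged)


print(normalize("qokeedy"))  # qkchd
print(normalize("shol"))     # chl
```

Swapping the two steps can change the result for words where a circle sits between two e glyphs, so the order is worth stating when comparing runs.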

Also, I would not work with the text (stream of tokens), but with the lexicon (word types without regard for their occurrences).   Word occurrence frequencies are a distracting noise when studying the morphology of a language.  The word structure is usually more evident if we disregard word frequencies.  For example, if in the War of the Worlds we look at the frequencies of "other", "brother", and "mother" compared with "others", "brothers", and "mothers", we get very discrepant results, arguing against "s" being a suffix.  But that's because the main character in the novel has a single brother... And see also my observation about word lengths in Voynichese and Vietnamese.

So I would make a "safe" lexicon with all word types that occur at least N times (say, 3) and apply your prefix/core/suffix algorithm to that set. But be prepared to accept missing combinations (like br+others in the WoW lexicon).
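
A minimal sketch of building such a lexicon from a token stream. The function name `safe_lexicon` and the toy tokens are only for illustration.

```python
from collections import Counter


def safe_lexicon(tokens, min_count=3):
    """Return the set of word types that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}


tokens = "daiin daiin daiin chedy chedy qokal".split()
print(safe_lexicon(tokens))               # {'daiin'}
print(sorted(safe_lexicon(tokens, 2)))    # ['chedy', 'daiin']
```

Each surviving type then counts once in the morphological analysis, regardless of how often it occurs in the text.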

Finally, I assume that the ['] in the stem column means empty stem.  But why not also allow empty as an option in prefix and suffix?   In monosyllabic languages, the "words" (syllables) can be parsed as prefix-stem-suffix combinations; but the prefix and suffix, more than the stem, can be empty.
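
A minimal sketch of what allowing empty prefixes and suffixes would look like: the empty string is simply included in the prefix and suffix inventories. The inventories below are hypothetical, chosen only to show the mechanics, and `parses` is an illustrative name.

```python
def parses(word, prefixes, stems, suffixes):
    """List every (prefix, stem, suffix) reading of `word`.

    Put "" into `prefixes` or `suffixes` to allow an empty prefix or suffix.
    """
    found = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in prefixes and stem in stems and suffix in suffixes:
                found.append((prefix, stem, suffix))
    return found


# Hypothetical inventories, for illustration only.
print(parses("qokedy", {"", "qo"}, {"ke", "qoke"}, {"", "dy"}))
# [('', 'qoke', 'dy'), ('qo', 'ke', 'dy')]
```

A word can then have more than one reading, so some tie-breaking rule is still needed to choose between them.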

All the best, --stolfi
We are drifting away from the line as a functional unit, but I have another noob question. The manuscript has a lot of different words, about as many as you would expect from a text of this length in English. How can this be explained? If we look at these lists, we see a lot of fixed elements and everything looks similar, but can it be that way if we have over 6000 different words?

If we bring that back to the idea of the line as a functional unit: where in a line do we find most of the words that appear only once or very rarely, and at which positions in a line do we find repeating words?

Where can we find which prefix, stem, suffix, or other in a line context? Are specific prefixes always in the first 3 words or in the last 3 words, and so on?
(12-11-2025, 01:58 AM)Kaybo Wrote: Where can we find which prefix, stem, suffix, or other in a line context?

Emma May Smith's blog addresses the topic of letter/word positions (in a line), albeit mainly in relation to individual glyphs. For example, here:

Line Position Mapping

Linestart Words
Edit: I need to check the numbers I posted. I will come back soon.
Edit II: explanation in next post.

(11-11-2025, 11:57 PM)Jorge_Stolfi Wrote: What is the definition of "others"?

“Other” means all segments that don’t behave clearly as prefixes, stems, or suffixes. These are pieces that appear in many different positions inside words or are too rare to show a consistent pattern. In short, “other” is the general group for fragments that don’t fit any specific role.

(11-11-2025, 11:57 PM)Jorge_Stolfi Wrote: Finally, I assume that the ['] in the stem column means empty stem.

No, it is the EVA character ', which is: [attachment=12240]
In the earlier version of the segmentation model, the balance between prefixes and suffixes was off. The algorithm relied on thresholds that were too high, so many true endings were misclassified as “other.” This happened because certain segments, like “dy” or “in,” also appeared inside words, which made the system treat them as ambiguous. To fix this, I relaxed the thresholds and added a small condition on context variety, allowing the model to recognize more genuine suffixes while keeping prefixes stable.
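
A minimal sketch of position-based role assignment with a threshold and a context-variety condition, along the lines described above. The threshold values, the way context variety is measured (number of distinct companion pieces), and the name `classify_segments` are assumptions for illustration, not the settings actually used.

```python
from collections import Counter, defaultdict

ROLE_OF_SLOT = {"start": "prefix", "middle": "stem", "end": "suffix"}


def classify_segments(segmented_words, threshold=0.6, min_count=3, min_contexts=2):
    """Assign each piece a role from where it mostly occurs inside words.

    `segmented_words` is a list of words, each given as a list of pieces.
    A piece gets a role only if it is frequent enough, its dominant slot
    covers at least `threshold` of its occurrences, and it has been seen
    with at least `min_contexts` distinct companion pieces.
    """
    slots = defaultdict(Counter)
    contexts = defaultdict(set)
    for pieces in segmented_words:
        last = len(pieces) - 1
        for i, piece in enumerate(pieces):
            if last == 0:
                slot = "whole"  # unsegmented word: counts toward no role
            elif i == 0:
                slot = "start"
            elif i == last:
                slot = "end"
            else:
                slot = "middle"
            slots[piece][slot] += 1
            contexts[piece].update(pieces[:i] + pieces[i + 1:])

    roles = {}
    for piece, counter in slots.items():
        total = sum(counter.values())
        slot, top = counter.most_common(1)[0]
        clear = (
            slot in ROLE_OF_SLOT
            and total >= min_count
            and top / total >= threshold
            and len(contexts[piece]) >= min_contexts
        )
        roles[piece] = ROLE_OF_SLOT[slot] if clear else "other"
    return roles
```

In this sketch, a segment like "dy" that ends most words but also appears inside some of them becomes a suffix at a relaxed threshold and falls into "other" at a strict one.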

The table shows how these word parts behave depending on their position in a line. Each section corresponds to a different type of line (for instance, the first line of a paragraph or a single-line label), and each row represents a word’s position in that line. The “prefix_start” values show how often words begin with a prefix, while “suffix_end” shows how often they end with a suffix. By comparing these, you can see clear patterns: in paragraphs, prefixes tend to cluster at the start of lines, and suffixes are more common toward the end. This does not happen in single text lines, but it happens again in labels with more than one word. Curious...

[attachment=12243]
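
A minimal sketch of how such a per-position table could be computed for one line type. It is a simplification and an assumption: a word counts toward "prefix_start" if it begins with any string in the prefix set (rather than using the segmentation itself), and likewise for "suffix_end". The name `affix_rates_by_position` and the toy lines are for illustration only.

```python
from collections import defaultdict


def affix_rates_by_position(lines, prefixes, suffixes):
    """Return {position: (prefix_start, suffix_end, n_words)} for a set of lines.

    `lines` is a list of lines, each a list of words; positions start at 1.
    """
    stats = defaultdict(lambda: [0, 0, 0])
    for words in lines:
        for pos, word in enumerate(words, start=1):
            row = stats[pos]
            row[0] += any(word.startswith(p) for p in prefixes)
            row[1] += any(word.endswith(s) for s in suffixes)
            row[2] += 1
    return {pos: (p / n, s / n, n) for pos, (p, s, n) in sorted(stats.items())}


lines = [["qokedy", "chol", "daiin"], ["qokal", "shedy"]]
for pos, row in affix_rates_by_position(lines, {"qo"}, {"dy"}).items():
    print(pos, row)
```

Running it separately on each line type (paragraph first lines, other paragraph lines, labels) gives one section of the table per type.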