Here are some more data about prefix, suffix, stem and other segments, their position within the text, and the line types of the MS.
The tables below summarize how the word segments behave in different positions and line types. The goal of these tables is to see whether prefixes and suffixes appear in consistent places, which would indicate some kind of structure (i.e. the segmentation is not random).
Prefix frequency by line position (fraction of words whose first segment is a prefix):
[attachment=12262]
It seems that prefixes are most frequent at the start of lines (around 0.75–0.80) and that they decrease gradually towards the end.
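A minimal sketch of how such a position profile could be computed. It assumes each word is already segmented and labeled (a list of labels such as "prefix", "suffix", "stem", "other"); the input format, the function name and the binning into relative positions are illustrative assumptions, not the code behind the plot.

```python
def first_segment_prefix_rate(lines, n_bins=5):
    """Share of words whose first segment is labeled 'prefix', per position bin.

    lines: list of lines; each line is a list of words; each word is a
    non-empty list of segment labels (hypothetical input format).
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for line in lines:
        n = len(line)
        for i, labels in enumerate(line):
            # Map word index i to a relative-position bin in 0..n_bins-1.
            b = min(i * n_bins // n, n_bins - 1)
            totals[b] += 1
            hits[b] += labels[0] == "prefix"
    # Bins that received no words are reported as None.
    return [h / t if t else None for h, t in zip(hits, totals)]


toy_lines = [
    [["prefix", "suffix"], ["stem"], ["suffix"]],
    [["prefix"], ["prefix", "suffix"], ["other"]],
]
rates = first_segment_prefix_rate(toy_lines, n_bins=3)
```

The same function with `labels[-1] == "suffix"` would give the corresponding suffix profile.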
Suffix frequency by line position (fraction of words whose last segment is a suffix):
[attachment=12263]
It seems that suffixes grow stronger toward the end of the line. This pattern is stable across groups.
Global balance of segment types:
[attachment=12264]
About half the detected segments behave as suffixes, one third as prefixes, and very few as neutral stems. This seems to confirm that the segmentation algorithm identifies quite strong word boundaries.
Stability of the model (parameter sweep):
[attachment=12266]
This shows how many segments are labeled as prefix/suffix/stem/other under different thresholds. Changing the thresholds (r_prefix and r_suffix) barely changes the number of suffixes, which means suffix identification is stable; prefix and “other” counts shift slightly. It’s a basic sensitivity test (good models should be stable under moderate threshold variation).
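As an illustration of such a sweep, here is a small sketch. Only the names r_prefix and r_suffix come from the post; the labeling rule (share of occurrences at word start or word end compared against the threshold), the toy counts and the function names are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def label_segment(stats, r_prefix, r_suffix):
    """stats = (count_word_initial, count_word_final, count_total).

    Assumed rule: prefix if the word-initial share reaches r_prefix,
    else suffix if the word-final share reaches r_suffix, else other.
    """
    initial, final, total = stats
    if initial / total >= r_prefix:
        return "prefix"
    if final / total >= r_suffix:
        return "suffix"
    return "other"


def sweep(segment_stats, prefix_grid, suffix_grid):
    """Count labels for every (r_prefix, r_suffix) combination."""
    results = {}
    for rp, rs in product(prefix_grid, suffix_grid):
        results[(rp, rs)] = Counter(
            label_segment(s, rp, rs) for s in segment_stats.values()
        )
    return results


# Toy counts, not real manuscript statistics.
toy_stats = {
    "qok": (90, 2, 100),
    "dy": (3, 95, 100),
    "ee": (10, 15, 100),
    "ol": (60, 30, 100),
}
table = sweep(toy_stats, [0.5, 0.7], [0.5, 0.7])
```

In this toy data the suffix count stays the same in every cell of the grid while one segment moves between "prefix" and "other", which is the kind of behaviour the sensitivity test is looking for.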
Prefix-suffix co-occurrence:
[attachment=12268]
Most words follow a prefix–prefix or prefix–suffix pattern. This might mean that the structure is repetitive and rule-like: words often start with a fixed opening element and end with a stable closing element.
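One simple way to tabulate such patterns is to count the pair (label of first segment, label of last segment) over all words with at least two segments. This is only a sketch under that assumption; the co-occurrence plot may have been built differently.

```python
from collections import Counter


def affix_patterns(words):
    """Count (first label, last label) pairs for multi-segment words.

    words: list of words, each a list of segment labels (assumed format).
    """
    counts = Counter()
    for labels in words:
        if len(labels) >= 2:
            counts[(labels[0], labels[-1])] += 1
    return counts


toy_words = [
    ["prefix", "suffix"],
    ["prefix", "prefix"],
    ["prefix", "stem", "suffix"],
    ["suffix"],  # single-segment words are skipped
]
patterns = affix_patterns(toy_words)
```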
Suffix coverage: 0.597 (about 59.7%) of all words end with one of the learned suffix patterns, showing that the suffix model generalizes well across the text.
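Coverage in this sense can be computed as the share of word tokens that end with any learned suffix. A minimal sketch, with a toy word list and a three-item suffix set chosen only for illustration:

```python
def suffix_coverage(words, suffixes):
    """Fraction of word tokens ending with at least one of the suffixes."""
    if not words:
        return 0.0
    suffixes = tuple(suffixes)  # str.endswith accepts a tuple of candidates
    covered = sum(1 for w in words if w.endswith(suffixes))
    return covered / len(words)


toy_words = ["qokedy", "daiin", "chol", "shey", "okar"]
coverage = suffix_coverage(toy_words, ["dy", "in", "ey"])
```

The result is a fraction between 0 and 1, so it has to be multiplied by 100 to be read as a percentage.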
@quimqu: Although it's outside the scope of your current research, can you say anything about the (average) lengths of the individual word parts?
Some suffixes are predominant in certain parts of the manuscript. For example, "edy" is missing in the first 25 folios, then it is used very often in folio 26. After that its use changes from folio to folio, but mostly there is either heavy use or no use of this suffix.
This has nothing to do with the line as a function, but I want to mention it here because this imbalance also has an impact on all statistical analyses.
(12-11-2025, 11:30 PM)bi3mw Wrote: @quimqu: Although it's outside the scope of your current research, can you say anything about the (average) lengths of the individual word parts?
Hello, I think this is what you are asking for. Isn't it?
[attachment=12278]
[attachment=12279]
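For reference, average segment lengths per category can be obtained along these lines. This is a sketch that assumes the segmentation is available as (segment, label) pairs; the toy pairs below are not the real output.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_label(segments):
    """Average length in characters of the segments in each category."""
    lengths = defaultdict(list)
    for text, label in segments:
        lengths[label].append(len(text))
    return {label: mean(vals) for label, vals in lengths.items()}


toy_segments = [
    ("qok", "prefix"), ("ch", "prefix"),
    ("dy", "suffix"), ("aiin", "suffix"),
    ("ee", "stem"),
]
avg_len = mean_length_by_label(toy_segments)
```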
(13-11-2025, 11:44 AM)quimqu Wrote: Hello, I think this is what you are asking for. Isn't it?
Yes, that's exactly what I meant. Thank you. What I don't quite understand is the uneven distribution (counts) of the individual word segments. Is the number of stems really that low?
(13-11-2025, 12:01 PM)bi3mw Wrote: Yes, that's exactly what I meant. Thank you. What I don't quite understand is the uneven distribution (counts) of the individual word segments. Is the number of stems really that low?
Yes, stems are low. Here is a summary of them:
=== PREFIX_BASES ===
['edy', 'qok', 'ol', 'aiin', 'ch', 'eey', 'eo', 'che', 'qo', 'qot', 'ot', 'ok', 'qoke', 'sh', 'she', 'chol', 's', 'dar', 'cho', 'y', 'dal', 'yt', 'qoka', 'yk', 'l', 'oke', 'o', 'sho', 'ote', 'qokal', 'chor', 'shol', 'okal', 'qol', 'cheol', 'otal', 'r', 'otar', 'lche', 'dch', 'op', 'opch', 'okar', 'lk', 'dol', 'cth', 'so', 'qote', 'ched', 'okee', 'd', 'pch', 'sheol', 'qokee', 'sol', 'da', 'otol', 'qokol', 'okol', 'sar', 'do', 'shor', 'ckh', 'char', 'yche', 'qopch', 'pol', 'tch', 'to', 'lo', 'ych', 'ykee', 'oe', 'tol', 'sal', 'dor', 'cph', 'dai', 'shee', 'okeol', 'q', 'dshe', 'po', 'sor', 'qokch', 'lshe', 'lkee', 'kch', 'otch', 'kol', 'qotal', 'tsh', 'qotch', 'sa', 'chal', 'ko', 'yte', 'shed', 'okch', 'chckh', 'otee', 'tar', 'cthol', 'c', 'kee', 'ctho', 'chee', 'of', 'qe', 'chear', 'qotol', 'a', 'lol', 'ke', 'otor', 'ksh', 'oc', 'yshe', 'dsh', 'sch', 'dche', 'qop', 'lor', 'ches', 'cthor', 'yke', 'te', 'olkee', 'oteol', 'oee', 'opche', 'lch', 'p', 'ypch', 'pche', 'psh', 'ro', 'okor', 'ykar', 'kor', 'olk', 'cheal', 'tal', 'ofch', 'oto', 'tor', 'chcth', 'qockh', 'lke', 'oko', 'rch', 'olor', 'lt', 'chl', 'rol', 'tcho', 'ytch', 'daii', 'kchol', 'pched', 'ysh', 'octh', 'yo', 'rar', 'ockh', 'ral', 'ct', 'keol', 'aral', 'kai', 'ra']
=== SUFFIX_BASES (pass 2) ===
['dy', 'in', 'ey', 'hy', 'ry', 'aly', 'ees', 'om', 'oly', 'es', 'an', 'ly', 'oy', 'ed', 'ho', 'as', 'im', 'eor', 'eol', 'py']
=== STEM_BASES ===
["'", 'ee', 'fch', 'f', 'oi', 'fa', 'i', 'cfh', 'ek', 'cf', 'ld']
=== OTHER_BASES ===
['old', 'he', 'ii', 'fc', 'j', 'ph', 'kh', '\ue020', 'eee', '\ue008', 'ik', 'z', 'ec', 'yy', 'hh', 'fh', 'h', 'oda', 'th', '\ue009', 'ala', 'eeed', 'ted', 'olc', '\ue019', '\ue026', 'eckh', 'ora', 'ots', 'kaii', 'hk', 'hc', 'hep', '\ue032', '\ue03e', 'eeod', 'no', '\ue00d', '\ue01c', 'eke', '\ue02f', '\ue03f', '\ue006', '\ue00a', '\ue00b', '\ue012', '\ue01b', 'pho', 'otc', '\ue023', '\ue025', '\ue028', 'ydaii', 'kho', '\ue031', 'ola', '\ue03b', '\ue03c', 'keee', 'edaii', 'kea', 'hoda', '\ue043', 'iii', '\ue048', 'okc', '\ue04a', '\ue04b', 'orc', '\ue04c', 'raii', 'taii', '\ue04f', 'hdai', '\ue050', 'phe', 'eec', 'ytc', 'eek', 'paii', '\ue054', '\ue056']
=== SUMMARY ===
Prefix: 171
Stem: 11
Suffix: 20
Other: 82
These are the top affixes with the thresholds defined. Note that Unicode escapes like \ue04c are the special glyphs in the EVA transliteration.
The number of stems is very small because the classification rules strongly favor prefix and suffix behavior. A segment is only marked as a stem when it appears mostly in the middle of words and also shows enough contextual variety on both sides. In the Voynich data, most recurring segments either behave like fixed openings or fixed endings, while very few appear consistently in the internal position with enough contextual diversity. As a result, almost all segments are pulled into the prefix, suffix, or other categories, leaving only a small core that genuinely functions as stems.
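The stem condition described above can be sketched as follows. The thresholds (min_middle_share, min_contexts), the function name and the input format are hypothetical; the actual rule and values used in the analysis are not given in the post.

```python
def is_stem(occurrences, min_middle_share=0.5, min_contexts=3):
    """Decide whether a segment behaves like a stem.

    occurrences: list of (left_neighbor, right_neighbor) pairs, one per
    occurrence of the segment; None marks a word edge on that side.
    """
    if not occurrences:
        return False
    # Word-internal occurrences have a neighbor on both sides.
    middle = [(l, r) for l, r in occurrences if l is not None and r is not None]
    middle_share = len(middle) / len(occurrences)
    left_variety = len({l for l, _ in middle})
    right_variety = len({r for _, r in middle})
    return (
        middle_share > min_middle_share
        and left_variety >= min_contexts
        and right_variety >= min_contexts
    )


varied_middle = [("qok", "dy"), ("ch", "in"), ("sh", "ey"), (None, "dy")]
mostly_initial = [(None, "dy"), (None, "in"), (None, "ey"), ("ch", "dy")]
fixed_context = [("qok", "dy")] * 4
```

Both conditions have to hold at once, which is why a segment that is always word-internal but always sits between the same two neighbors (like `fixed_context`) is still rejected.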
I am not fully confident about the outputs, because we cannot know how genuine these prefixes and suffixes are. Since the underlying language is unknown, the algorithm may be capturing structural patterns in the text rather than true linguistic affixes.
(13-11-2025, 12:55 AM)Kaybo Wrote: Some suffixes are predominant in certain parts of the manuscript. For example, "edy" is missing in the first 25 folios, then it is used very often in folio 26. After that its use changes from folio to folio, but mostly there is either heavy use or no use of this suffix.
This has nothing to do with the line as a function, but I want to mention it here because this imbalance also has an impact on all statistical analyses.
Hello, thank you. This is right, but the algorithm does not take "edy" as a suffix:
=== SUFFIX_BASES (pass 2) ===
['dy', 'in', 'ey', 'hy', 'ry', 'aly', 'ees', 'om', 'oly', 'es', 'an', 'ly', 'oy', 'ed', 'ho', 'as', 'im', 'eor', 'eol', 'py']
This is the distribution of prefixes and suffixes per folio and side (note the peak at one page, which is only due to that page containing just 3 words):
[attachment=12281]
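A per-page table like this can be guarded against tiny pages by carrying the word count along with the rates. A minimal sketch, assuming words are given as label lists per page; the page names and the min_words cut-off are placeholders.

```python
def affix_rates_by_folio(folios, min_words=20):
    """Prefix/suffix rates per page, with a flag for pages with few words.

    folios: dict mapping a page id to a list of words, each word a
    non-empty list of segment labels (assumed format).
    """
    rows = {}
    for folio, words in folios.items():
        n = len(words)
        if n == 0:
            continue
        rows[folio] = {
            "n_words": n,
            "prefix_rate": sum(w[0] == "prefix" for w in words) / n,
            "suffix_rate": sum(w[-1] == "suffix" for w in words) / n,
            # Rates from very short pages are noisy, so mark them.
            "reliable": n >= min_words,
        }
    return rows


toy_folios = {
    "page_a": [["prefix", "suffix"]] * 3,
    "page_b": [["prefix", "suffix"]] * 10 + [["stem"]] * 10,
}
rows = affix_rates_by_folio(toy_folios)
```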
(13-11-2025, 01:47 PM)quimqu Wrote: The number of stems is very small because the classification rules strongly favor prefix and suffix behavior. A segment is only marked as a stem when it appears mostly in the middle of words and also shows enough contextual variety on both sides.
Ah, I see, that's a completely different approach than I took in my calculations. I searched for prefixes and suffixes (strictly by frequency) and interpreted everything in between as the stem. Words consisting only of a prefix and suffix were counted separately. This also occurred relatively often.
Particularly interesting are the words in which the first 2-3 possible stems make up the vast majority of the words counted.
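A rough sketch of this kind of split: strip a known prefix and a known suffix and treat whatever remains as the stem. The longest-match choice, the function name and the small affix lists are assumptions for illustration; the original calculation selected affixes by frequency and may resolve overlaps differently.

```python
def split_word(word, prefixes, suffixes):
    """Split a word into (prefix, stem, suffix); missing parts are ''."""
    prefix = max((p for p in prefixes if word.startswith(p)), key=len, default="")
    rest = word[len(prefix):]
    suffix = max((s for s in suffixes if rest.endswith(s)), key=len, default="")
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix


toy_prefixes = ["qo", "qok"]
toy_suffixes = ["dy", "edy"]

only_affixes = split_word("qokedy", toy_prefixes, toy_suffixes)
with_stem = split_word("qokeedy", toy_prefixes, toy_suffixes)
no_affix = split_word("chol", toy_prefixes, toy_suffixes)
```

Words where the stem comes back empty are the "prefix plus suffix only" cases that were counted separately.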
(13-11-2025, 02:05 PM)quimqu Wrote: Hello, thank you. This is right, but the algorithm does not take "edy" as a suffix:
Edit: I also have “edy” among the 25 most common suffixes.
(13-11-2025, 02:05 PM)quimqu Wrote: Hello, thank you. This is right, but the algorithm does not take "edy" as a suffix:
I picked "edy" because it was the most common suffix in your first table. Whether or not it is a suffix, words with "edy" appear only on some pages and are completely absent from others. I would be interested in ideas as to why. In how many different words do we find "edy"? Is it just in 10 very common words? Then the distribution makes sense, because these words are associated with a specific topic and it is just a coincidence that they cluster. But if "edy" is in more words, then it is maybe a hint that it has a specific meaning like "use" that is attached to the word for the subject that needs to be used, or something like that.
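Both questions (how many different words carry the ending, and how unevenly it is spread over pages) can be checked with a small count. A sketch, assuming the text is available as a dict of page id to word list; the page names and toy words are placeholders.

```python
from collections import Counter


def ending_profile(folios, ending):
    """Return (word types carrying the ending, share of tokens per page)."""
    types = Counter()
    per_folio = {}
    for folio, words in folios.items():
        hits = [w for w in words if w.endswith(ending)]
        types.update(hits)
        per_folio[folio] = len(hits) / len(words) if words else 0.0
    return types, per_folio


toy_folios = {
    "page_a": ["daiin", "chol", "shol"],
    "page_b": ["qokedy", "chedy", "qokedy", "ol"],
}
edy_types, edy_share = ending_profile(toy_folios, "edy")
```

`len(edy_types)` gives the number of distinct words with the ending, and `edy_types.most_common()` shows whether a handful of very frequent words accounts for most of the occurrences.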
Is it possible to check for word groups? That is, are there any words that appear several times in a fixed 3-word combination?
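Such a check amounts to counting word trigrams. A minimal sketch, assuming the text is a list of lines with each line a list of words, and that combinations are not allowed to run across a line break (that restriction is a choice made here, not something stated in the thread):

```python
from collections import Counter


def repeated_trigrams(lines, min_count=2):
    """Return the 3-word sequences that occur at least min_count times."""
    counts = Counter()
    for words in lines:
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return {tri: c for tri, c in counts.items() if c >= min_count}


toy_lines = [
    ["qokedy", "shedy", "daiin", "ol"],
    ["ol", "qokedy", "shedy", "daiin"],
]
repeats = repeated_trigrams(toy_lines)
```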
(13-11-2025, 11:40 PM)Kaybo Wrote: In how many different words we find "edy"
As mentioned, “edy” is also listed among the suffixes in my heatmap (11th from the left). By clicking on the vertical fields, you can see all possible combinations.