I've been running code to test something very specific: to know if when a line is cut in the Voynich it simply "has to" put a word of a certain type, or if there is a finer choice within that type.
First I took all the lines and converted them into pairs of words. Those that are separated by a real line break, and those that are within the same line (as a control). Two comparable groups.
Then I removed the most obvious signal: typical prefixes and suffixes that we already know mark the beginning or end of a line. This is important because otherwise the model only detects trivial things like "words that end in X tend to go at the end".
Once this strong signal was removed, I looked at what was left. This is where the idea of "family" comes in: similar words (similar length, same beginning and end, etc.) as variants of the same type.
The key question: within the same family, do all variants behave the same?
The result is clear: no.
Within a family, there are words that appear much more often than they should in line-breaking positions, and others that appear less often. For example, very common forms like "daiin" tend to avoid these positions, while less common variants of the same family appear more often.
This is not a small effect. Even after removing the obvious start/end markers, the real word still ranks near the top among ~100 candidates about 60–65% of the time (top-1), and almost always within the top 5 (~95–99%). But when you force the model to choose only within the same frequency band or family, this advantage drops sharply, and many “equally similar” variants are clearly not interchangeable.
This means that it is not enough to say "here is a word of this type". The system also decides which specific variant to put there.
And this effect does not disappear when you remove the obvious patterns. There is still structure. Not only at the level of the whole word, but also in internal fragments.
The conclusion: the text is not generated by randomly choosing words from a similar set. There is a kind of internal selection. First, a family of possible forms is chosen, and then, within this family, a specific variant is chosen according to the context—especially near line boundaries.
In other words: lines are not just closed with "words of the right type," but with specific forms within that type. This points to a richer structure than would appear if you were just looking at frequencies or overall similarities.
PS. All tests are available in the Kaggle notebook I attached in a You are not allowed to view links.
Register or
Login to view..