The structure of the Voynich text and how it may be generated

RE: The structure of the Voynich text and how it may be generated

quimqu > 12-04-2026, 09:42 PM

(12-04-2026, 09:27 PM)DG97EEB Wrote: Different Kaggle?

You can register and look for yourself.
RE: The structure of the Voynich text and how it may be generated

Jorge_Stolfi > 12-04-2026, 09:56 PM

Dear DG97EEB, thanks a lot for doing the tests. I don't mind if the programs were generated by AI. Their code may be buggy, but human-written code may be buggy too, so we must check the results anyway.

(12-04-2026, 06:38 PM)DG97EEB Wrote: I use six thematic sections (Botanical, Astrological, Balneological, Rosettes, Pharmaceutical, Stars)

As a general remark, I think that any statistical analysis would be easier to understand if it were confined to just one section, like Herbal. While in theory one could focus on the "Herbal" lines of the tables only, presenting results for all sections at the same time, as a big "wall of numbers", is rather distracting and makes it hard to draw conclusions. Once we understood the results for Herbal, we could look at the other sections, if desired.

Quote:and, separately, five scribes identified by Lisa Fagin Davis.

Ditto. That is a separate theory that is best checked separately.

Quote:The Currier A/B dialect distinction is tagged at the folio level by scribe, not by section.
## Test 1: Re-breaking the lines
[...] preserved folio boundaries (so no token crosses a folio edge) and measured four things at each break width:
- **Gallows-initial**: first character of the first word is k, t, p, or f (note: the character-level analysis below shows this is driven by p and t; k is slightly depleted at line start)
- **Hapax-initial**: first word appears only once in the corpus

Every effect weakens or vanishes when line breaks are moved:
The "gallows-initial" effect is driven by p (17.8×) and t (4.8×).

From the above I understand that the program joined the whole text of each folio into a single string of words, and then ran that through the trivial line-breaking algorithm. Is that so?

That is not a good way to do the test, because we know that the first line of each paragraph is special, and the first word of those lines is very special -- for being a sentence-initial word, for being part of a special sentence, for being subject to the "ornate letter" treatment, etc. Most likely that is the name of the plant that is described in that parag. Thus it is usually a hapax, with puffs (p/f gallows), unusual morphology, etc.

If the entire text of a folio is treated like a single string of words and re-justified, most of those special parag-initial words will end up inside lines. Thus, if that is what Claude's code does, it is not surprising that the number of line-initial puffs and hapaxes went down. Basically, what the exercise confirmed is PAAFU, not LAAFU: that the first word of each paragraph is special -- and that anomaly obviously is not due to line-breaking bias.

What LAAFU says is that the stats of the first and last words of parag body lines (excluding parag head and tail lines) are different from those of other positions in the line. And these anomalies are what, in my view, may be partly or wholly caused by the line-breaking bias.

So the proper way to test the LAAFU theory, I think, is to do that exercise for each paragraph, discarding the first line of the original parag and the first line of the re-justified parag. My conjecture is that a good part of the "LAAFU anomalies" that are seen in the body lines of the original parags will still be present in the body lines of the re-justified parags, even though the original line breaks got buried inside the middle of the lines.
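
The per-paragraph exercise described above could be sketched roughly as follows. This is only an illustration under stated assumptions: each paragraph is given as a list of lines, each line a list of EVA words; widths are counted in characters with one space between words; and the names `wrap_by_chars`, `rebreak_paragraph` and the `factor` parameter are hypothetical, not taken from any program discussed in this thread.

```python
def wrap_by_chars(words, width):
    """Trivial line breaking: append the next word if it fits in `width`
    characters (one space between words), otherwise start a new line."""
    lines, current, used = [], [], 0
    for w in words:
        need = len(w) if not current else used + 1 + len(w)
        if current and need > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines


def rebreak_paragraph(par_lines, factor=1.618):
    """Re-justify one paragraph at `factor` times its average line width.

    Returns (original body lines, re-broken body lines); the head line
    is dropped from both, so paragraph-initial words are never counted.
    """
    widths = [len(" ".join(line)) for line in par_lines]
    width = round(factor * sum(widths) / len(widths))
    words = [w for line in par_lines for w in line]
    rebroken = wrap_by_chars(words, width)
    return par_lines[1:], rebroken[1:]
```

Line-start and line-end statistics would then be computed separately on the two returned lists and compared.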

Quote: Width 4

Are those widths EVA character counts? If so, even width 20 is too small, as the line-initial words will run into the line-final ones. Try using 1.618 or 0.618 times the average line width of the original parag. Note that if the new width is a simple multiple or sub-multiple of the original one (like 50% or 200%), many of the new line breaks may fall on or near the original ones.

Quote: -m final drops from 8.6× (14.6% vs 1.7%) to roughly 1.2× [...] Line-end enrichment is concentrated in -m and -g, while -l, -r, and -o are actively depleted.

The prevalence of m at the end of lines is a known anomaly. It is not caused by the trivial line-breaking algorithm per se, but my conjecture is that m is an abbreviation (possibly of iin) that the scribe could use where space is tight. Thus the m-anomaly is not evidence of LAAFU, as this theory is generally understood.

One could perhaps include this conjecture in the test by expanding every m into iin, and modifying the trivial line breaking algorithm with the clause "if the next word does not fit in the current line, but it ends in iin, and the same word with m would fit, write down the latter instead." And occasionally abbreviate iin in the middle of lines too, with the right probability.
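
A minimal sketch of that modified line breaker, assuming (as in the conjecture) that word-final m stands for iin. The function names and the `p_mid` parameter, which sets how often iin is abbreviated regardless of position, are hypothetical; the "right probability" would have to be estimated from the text.

```python
import random


def expand_m(word):
    """Assumed expansion: word-final EVA 'm' is written out as 'iin'."""
    return word[:-1] + "iin" if word.endswith("m") else word


def abbreviate(word):
    """Inverse of expand_m: word-final 'iin' becomes 'm'."""
    return word[:-3] + "m" if word.endswith("iin") else word


def wrap_with_abbreviation(words, width, p_mid=0.0, rng=None):
    """Trivial line breaking by character width, plus the extra clause:
    if the next word does not fit but its abbreviated form would,
    write the abbreviated form on the current line."""
    rng = rng or random.Random(0)
    lines, current, used = [], [], 0
    for w in map(expand_m, words):
        if w.endswith("iin") and rng.random() < p_mid:
            w = abbreviate(w)  # occasional abbreviation anywhere in the line
        need = len(w) if not current else used + 1 + len(w)
        if current and need > width:
            short = abbreviate(w)
            short_need = used + 1 + len(short)
            if short != w and short_need <= width:
                current.append(short)
                used = short_need
                continue
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines
```

With `p_mid=0.0`, an m can only appear where the full word would not have fit, so it always ends up near the right margin.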

Quote:The AC gap (0.173 within vs 0.051 across) closes to roughly 0.15 both ways

I do not quite understand what this means, but it looks like this too could be an effect of parag boundaries being destroyed and/or the m quirk above.

Quote:These effects are tied to the real line breaks. Lines are functional units, not word-wrap.

I don't think we can make this conclusion if the test indeed destroyed parag boundaries and did not exclude parag-initial lines.

All the best, --stolfi
RE: The structure of the Voynich text and how it may be generated

quimqu > 12-04-2026, 09:58 PM

(12-04-2026, 02:30 PM)Jorge_Stolfi Wrote: But you don't have to believe the theory. In fact, the scientific method works best when you set out to check a theory that you don't. Because then you try harder to find a good test that will disprove it.

In fact, this is what I did in this [link].

The theory I don't believe: "supposing that the words were created artificially, if the text was built by creating similar words in similar contexts, then words that look alike should also have similar neighbours". As I still believe there is some meaning and maybe a natural language hidden in the Voynich, I tried to check this generative theory (which I don't believe).
RE: The structure of the Voynich text and how it may be generated

DG97EEB > 12-04-2026, 10:20 PM

(12-04-2026, 09:56 PM)Jorge_Stolfi Wrote: Dear DG97EEB, thanks a lot for doing the tests. I don't mind if the programs were generated by AI. Their code may be buggy, but human-written code may be buggy too, so we must check the results anyway.

You are right, the original test destroyed paragraph boundaries and conflated PAAFU with LAAFU. I have re-run it as you suggested: Herbal A section only (f1–f56), body lines only (excluding all paragraph-initial lines), re-broken within each paragraph. Widths are in tokens (words), not EVA characters. The average Botanical body line is 7.1 tokens. 140 of 146 paragraphs have ≥2 body lines; no data is lost.

| | Lines | Gal. init | Gal. else | Ratio | Hap. init | Hap. else | -m final |
|---|---|---|---|---|---|---|---|
| **Para-initial** | **144** | **79.2%** | **6.9%** | **11.5×** | **62.5%** | **30.0%** | **5.6%** |
| **Body (real lines)** | **1,570** | **16.1%** | **8.1%** | **2.0×** | **26.4%** | **16.1%** | **12.7%** |
| Re-broken width 5 | 2,130 | 8.4% | 9.6% | 0.9× | 17.6% | 17.5% | 3.6% |
| Re-broken width 7 | 1,504 | 8.3% | 9.6% | 0.9× | 15.4% | 18.0% | 3.9% |
| Re-broken width 11 | 931 | 10.3% | 9.4% | 1.1× | 17.7% | 17.7% | 4.4% |
| Re-broken width 14 | 716 | 9.2% | 9.4% | 1.0× | 16.3% | 18.0% | 5.2% |

PAAFU is real and strong: 79.2% of paragraph-initial words are gallows-initial (plant names). But body lines still show a 2.0× gallows ratio and 12.7% -m final, independently of PAAFU. Under re-breaking within paragraphs, all body-line effects vanish (gallows ratio → 0.9–1.1×, hapax flat, -m drops to 3.6–5.2%).
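
For anyone who wants to check the columns of the table, the rates could be computed along these lines. This is a sketch only, not the code that produced the numbers above: it assumes lines are lists of EVA words, counts hapaxes within the lines passed in rather than in the whole corpus, and `line_initial_stats` is a hypothetical name.

```python
from collections import Counter

GALLOWS = set("ktpf")


def line_initial_stats(lines):
    """Compare line-initial words with all other words of the same lines."""
    counts = Counter(w for line in lines for w in line)
    initial = [line[0] for line in lines if line]
    others = [w for line in lines for w in line[1:]]

    def rate(words, predicate):
        return sum(1 for w in words if predicate(w)) / len(words) if words else 0.0

    def is_gallows(w):
        return w[:1] in GALLOWS

    def is_hapax(w):
        return counts[w] == 1  # assumption: counted over `lines` only

    gal_init, gal_else = rate(initial, is_gallows), rate(others, is_gallows)
    return {
        "gal_init": gal_init,
        "gal_else": gal_else,
        "ratio": gal_init / gal_else if gal_else else float("inf"),
        "hap_init": rate(initial, is_hapax),
        "hap_else": rate(others, is_hapax),
        "m_final": rate([line[-1] for line in lines if line], lambda w: w.endswith("m")),
    }
```

The same function can be applied to the real body lines and to each re-broken version, so that the rows are directly comparable.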

Is that what you had in mind?
RE: The structure of the Voynich text and how it may be generated

Juan_Sali > 12-04-2026, 11:05 PM

There are more positional characteristics: words starting with 8ch, 8sh, sa and so on are the first word of a line more often than average, even more so in the herbal section.
RE: The structure of the Voynich text and how it may be generated

tavie > 12-04-2026, 11:13 PM

This week has been about clearing up a spamming invasion (thanks to those who reported posts) so it's taken me a while to get to this.

Can I ask

1. Let's please make sure we keep LAAFU/line patterns/line breaking algorithm discussion relevant to Quimqu's work, i.e. how their model deals or fails to deal with LAAFU/line patterns. If you want to discuss or test LAAFU in a way that isn't in relation to Quimqu's model, please create your own thread or use one of the many existing LAAFU threads, or at least check with Quimqu to make sure they are okay with this.

2. Please be really careful around AI usage here. Our rule is that AI assisted theories are prohibited on the forum. It's not "AI assisted theories posted only by new members are prohibited here." But we run the risk of it appearing like that when people use AI to develop their own proposed theory or to test theories. Some threads/posts have been making me nervous this week in that regard. We don't want to look like we've got a two tier system, and it can make it harder to explain to new posters why their threads are locked, especially when sometimes it's not straightforward for me to pinpoint the crucial difference between the approaches. [link] if you want to discuss this further and make - and explain - any proposals.
RE: The structure of the Voynich text and how it may be generated

Jorge_Stolfi > 12-04-2026, 11:35 PM

(12-04-2026, 10:20 PM)DG97EEB Wrote: Widths are in tokens (words), not EVA characters.

But that is still not right. The line-breaking bias is expected to occur only if the line width is defined either geometrically ("15 cm") or in terms of characters. If the limit is a fixed number of words, the decision of whether to break the line or not will not depend on the length of the next word (which is what creates the line-breaking bias).

All the best, --stolfi

RE: The structure of the Voynich text and how it may be generated

quimqu > 13-04-2026, 01:00 PM

I looked at pairs of very similar words (Levenshtein ≤1). For each pair, I compared how similar their contexts are on the left and on the right. If a system is directional, you expect differences between “prev” and “next”. If it is more balanced, both sides should behave similarly.

First result: the Voynich text is unusually symmetric. In this table:

- Mean abs asymmetry (fuzzy): the average absolute difference between how similar two words are in their left context versus their right context, allowing approximate (Levenshtein ≤1) matches.

- Median abs asymmetry: the median of those absolute left–right differences, giving the typical asymmetry while being less affected by extreme cases.

| Corpus | Mean abs asymmetry (fuzzy) | Median abs asymmetry |
|---|---|---|
| voynich | 0.055 | 0.045 |
| timm_generated | 0.103 | 0.080 |
| ambrosius_latin | 0.100 | 0.083 |
| chinese | 0.127 | 0.086 |
| docta_ignorantia | 0.142 | 0.118 |
| tirant_cat | 0.162 | 0.133 |
| culpepper_en | 0.187 | 0.157 |
| materia_medica_en | 0.220 | 0.199 |
| simplicissimus_de | 0.223 | 0.236 |
| alchemical_latin | 0.257 | 0.214 |
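
To make the measure concrete, here is one way it could be computed. This is a simplified sketch and not necessarily what produced the table: context similarity is assumed to be the Jaccard overlap of the sets of neighbouring words, the fuzzy matching of neighbours is left out, line boundaries are ignored, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, median


def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution
    return a[i:] == b[i + 1:]  # one insertion into a


def jaccard(s, t):
    return len(s & t) / len(s | t) if (s or t) else 0.0


def context_asymmetry(tokens, min_count=2):
    """Mean and median of |left similarity - right similarity| over all
    pairs of distinct words within one edit of each other."""
    left, right, count = defaultdict(set), defaultdict(set), defaultdict(int)
    for i, w in enumerate(tokens):
        count[w] += 1
        if i > 0:
            left[w].add(tokens[i - 1])
        if i + 1 < len(tokens):
            right[w].add(tokens[i + 1])
    vocab = [w for w in count if count[w] >= min_count]
    diffs = [
        abs(jaccard(left[a], left[b]) - jaccard(right[a], right[b]))
        for a, b in combinations(vocab, 2)  # quadratic in vocabulary size
        if within_one_edit(a, b)
    ]
    return (mean(diffs), median(diffs)) if diffs else (0.0, 0.0)
```

A pair such as dal/dam that always follows the same word but precedes different words gets asymmetry 1; a pair with equally similar neighbours on both sides gets 0.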

In the Voynich text, when two words look similar, their left and right environments also look similar, and to almost the same degree. The difference between left and right is small.

In natural texts, this is not the case. Similar words often behave differently depending on direction. That is normal. Grammar is directional, and words do not play the same role before and after.

Timm sits in between. It does show similar-looking words with similar contexts, but the symmetry is weaker. There is more difference between left and right. A reasonable explanation could be in how the text is generated. If you generate step by step, each new word depends more on what comes before than on what comes after. That alone can create left-right asymmetry. I think Timm is consistent with that.

The Voynich result points to something more balanced. Not necessarily language, but not a simple left-to-right (or right-to-left) process either. More like a local system where choices are constrained by both sides, or where direction matters less.

This does not prove anything on its own, but I think it is a clean difference.

RE: The structure of the Voynich text and how it may be generated

quimqu > 13-04-2026, 04:30 PM

I have seen one interesting thing: when you take a normal text and cut it into artificial lines, something simple happens. Words at the edges of the line behave differently. The first word has no left context. The last word has no right context. This alone creates a small asymmetry.

You can see it clearly if you measure how predictable a word is from the previous one versus the next one. In natural texts, once you force them into lines, a very regular pattern appears. It’s always the same shape, and it’s not very strong. It comes from the cut, not from the language itself.

The Voynich behaves differently. It also shows positional effects, but they are stronger and the shape is not the same. The beginning of the line is much more constrained than expected, and the end behaves differently from what we see in normal texts.

If the Voynich were just a flat text that we are slicing into lines, it should look like the natural texts after segmentation. But it does not.

| Position | Natural texts (segmented) | Voynich |
|---|---|---|
| First word | Moderate asymmetry (edge effect) | Strong asymmetry |
| Start | Near neutral | Clearly asymmetric |
| Middle | Almost flat | Near flat |
| End | Mild opposite effect | Strong opposite effect |
| Last word | Clear edge effect | Weaker than expected |

So the asymmetry itself is not surprising. You get it for free when you introduce line breaks in natural language texts. What matters is how it looks. In natural texts, it is weak, smooth, and very predictable. In the Voynich, it is stronger and shaped differently.

[Attached image: file_00000000805c71f6bd82b3226212a341.png]

A simple way to measure directionality is to compare how well a word can be predicted from its previous word versus its next one, using entropy or prediction accuracy and taking the difference. If you take a normal text and cut it into artificial lines, you introduce broken contexts at the edges, so the first word has no real past and the last word has no real future. This alone creates a small and very stable pattern: asymmetry at the edges and near symmetry in the middle, which looks almost identical across different natural texts. The Voynich also shows positional effects, but they are stronger and not shaped the same way.
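
As an illustration of the entropy version of this measure, the difference can be computed per line position. The sketch below is an assumption-laden simplification: only three position classes are used, a missing neighbour at a line edge is replaced by a single boundary token, lines shorter than three words are skipped, and the names are hypothetical. Note also that plug-in entropy estimates are biased low when counts are small, so texts should be compared at similar sizes.

```python
from collections import Counter, defaultdict
from math import log2

BOUNDARY = "<edge>"  # stands in for the missing neighbour at a line edge


def conditional_entropy(pairs):
    """H(target | context) in bits, from a list of (context, target) pairs."""
    joint = Counter(pairs)
    context = Counter(c for c, _ in pairs)
    n = len(pairs)
    return -sum(k / n * log2(k / context[c]) for (c, _), k in joint.items())


def positional_asymmetry(lines):
    """For each position class, H(word | previous) - H(word | next).

    Positive values mean the word is harder to predict from its left
    neighbour than from its right neighbour.
    """
    from_prev, from_next = defaultdict(list), defaultdict(list)
    for line in lines:
        if len(line) < 3:
            continue
        padded = [BOUNDARY] + line + [BOUNDARY]
        for i in range(1, len(line) + 1):
            label = "first" if i == 1 else "last" if i == len(line) else "middle"
            from_prev[label].append((padded[i - 1], padded[i]))
            from_next[label].append((padded[i + 1], padded[i]))
    return {
        label: conditional_entropy(from_prev[label]) - conditional_entropy(from_next[label])
        for label in from_prev
    }
```

Computed over a whole running text without positions, this difference is close to zero for any text, because it reduces to the difference between two nearly identical word-frequency entropies; it only becomes informative when split by position in the line.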

That makes it unlikely that the line structure is just a passive formatting layer. It looks more like something that actively constrains how the text is generated.

RE: The structure of the Voynich text and how it may be generated

Jorge_Stolfi > Yesterday, 03:19 AM

(13-04-2026, 04:30 PM)quimqu Wrote: I have seen one interesting thing: when you take a normal text and cut it into artificial lines, something simple happens. Words at the edges of the line behave differently. The first word has no left context. The last word has no right context. This alone creates a small asymmetry.

You can see it clearly if you measure how predictable a word is from the previous one versus the next one. In natural texts, once you force them into lines, a very regular pattern appears. It’s always the same shape, and it’s not very strong. It comes from the cut, not from the language itself.

As I wrote above, line breaking should produce such an anomaly -- but not just because "the previous word has no context".

Again, the trivial line-breaking algorithm is: "if the next word fits into the space that remains on the line, write it there, else break the line and write it there." This algorithm has the effect that the average length of the first word on each line is greater than the global average length, while the average length of the last few words of each line is less than average.
This effect should cause most word and character statistics to be dependent on position along the line. Including entropy and conditional entropy.

This happens with any text and any language -- as long as the maximum line width is a fixed number of millimeters or of characters (including spaces).   It does not occur if the line length limit is a fixed number of words.
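
The difference between the two kinds of limit can be checked with a small simulation. The sketch below uses a made-up text whose word lengths are independent and uniform on 1 to 9 characters (an assumption for illustration, not a model of any real text), breaks it once by character width and once by word count, and reports the mean length of the first and last word per line next to the global mean. The function names and default parameters are hypothetical.

```python
import random
from statistics import mean


def wrap_by_chars(words, width):
    """Trivial line breaking with a limit in characters (one space between words)."""
    lines, current, used = [], [], 0
    for w in words:
        need = len(w) if not current else used + 1 + len(w)
        if current and need > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines


def wrap_by_count(words, n):
    """Line breaking with a limit of n words per line."""
    return [words[i:i + n] for i in range(0, len(words), n)]


def simulate(seed=42, n_words=20000, width=40, per_line=7):
    rng = random.Random(seed)
    # Hypothetical text: i.i.d. word lengths, uniform on 1..9.
    words = ["x" * rng.randint(1, 9) for _ in range(n_words)]
    by_chars = wrap_by_chars(words, width)
    by_count = wrap_by_count(words, per_line)
    return {
        "overall": mean(len(w) for w in words),
        "first_by_chars": mean(len(line[0]) for line in by_chars),
        "last_by_chars": mean(len(line[-1]) for line in by_chars),
        "first_by_count": mean(len(line[0]) for line in by_count),
        "last_by_count": mean(len(line[-1]) for line in by_count),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Comparing the `by_chars` and `by_count` figures against `overall` shows directly whether a given breaking rule introduces a length bias at the line edges.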

So the important question is: how did you generate the blue graph? By breaking the running text after N words, or after N characters?

Quote:The Voynich behaves differently. It also shows positional effects, but they are stronger and the shape is not the same.  The beginning of the line is much more constrained than expected, and the end behaves differently from what we see in normal texts.

That makes it unlikely that the line structure is just a passive formatting layer. It looks more like something that actively constrains how the text is generated.

The line-breaking algorithm used by the scribe is probably more complicated than the trivial one above. Those extra details could contribute to the line-start and line-end anomalies of the VMS. But before we discuss them, please answer the question above. It is absolutely critical.

All the best, --stolfi