The Voynich Ninja

Full Version: The structure of the Voynich text and how it may be generated
(11-04-2026, 08:52 AM)quimqu Wrote: This is the sort of process that led me to this kind of model, which I don't think is the real thing, ... I am not saying that it is a random generation or gibberish.

Fine, but (as I noted in my Voynich Day talk) it is impossible to do "agnostic" analysis as you think you are doing.  Any analysis you do will imply certain assumptions about the nature and creation of the text.

In your case, you are implicitly assuming that the text was generated by some uniform mechanical process, hence that it is "gibberish" -- the sequence of letters does not encode some message that the Author wanted to express, but is created autonomously by the process.  And you are also assuming that there are relatively few errors. And that the author himself set the pen to the vellum.  Because you hope that the kind of analysis you are doing will give significant information about that process -- like the mutation rate and mutation process, when the copying resets, etc.

But now suppose that instead the text is indeed a herbal (serious or fantastic) in some natural language, with the spelling or encryption being basically one-to-one on the words.  That is, each word of that ur-language is always written as the same Voynichese word.

Then the type of analysis you are doing obviously makes no sense.  Its results, whatever they are, will not help decipher the manuscript.   Each parag will be an independent unit, and the order of the parags will be essentially "random".   It would make no sense to collect statistics about tokens that are d tokens apart, if they are in different parags.  

In fact, I expect that any statistic will be unchanged if the parags are randomly scrambled, keeping each parag as a rigid unit. If some statistic is affected by this scrambling, that fact will not give any useful clue about the text; we will never understand why that happens until we decipher the manuscript in some other way.

Moreover, each parag will consist of sentences that have a non-trivial heterogeneous structure.  Check [link not shown].  Say, first there will be the name(s) of the plant, then a description of it, then some conditions that it cures or benefits that it provides, then the mode of preparation, then dosage, etc.  Each of these sentences will have its own vocabulary, word frequencies, word order, etc. And each of these will have highly variable length, and may be omitted or duplicated in some parags.  Therefore, even if you limit the statistical analysis to tokens within the same parag, the counts for token pairs that are d tokens apart may be jumbling information about all those sentence types.

As another point, the use of Levenshtein or edit distance in the statistics implies the assumption that any character may be replaced, deleted, or inserted (or transposed, in the Damerau variant) with equal probability, independent of the character or of its position within the word.  That implicitly assumes that the variations are due to certain reasons (like intentional mutation, or errors in the cipher computation) but not to others (like grammatical inflection, sandhi, similar sounds, or similarity of shapes).  Which in turn implies certain assumptions about how the manuscript was created.
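
The uniform-cost assumption can be made visible with a small sketch. The function below is an ordinary dynamic-programming edit distance whose substitution and insertion/deletion costs are pluggable; with the defaults it is plain Levenshtein distance. The `SIMILAR` glyph classes and the 0.3 cost are purely hypothetical illustrations, not values taken from anyone's analysis.

```python
from typing import Callable


def edit_distance(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float] = lambda x, y: 1.0,
    indel_cost: Callable[[str], float] = lambda x: 1.0,
) -> float:
    """Edit distance with pluggable costs (defaults give plain Levenshtein)."""
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [0.0]
    for ch in b:
        prev.append(prev[-1] + indel_cost(ch))
    for ca in a:
        cur = [prev[0] + indel_cost(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost(ca),      # delete ca
                cur[j - 1] + indel_cost(cb),   # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb)),
            ))
        prev = cur
    return prev[-1]


# Hypothetical classes of glyphs treated as "easily confused" (assumption).
SIMILAR = {frozenset("kt"), frozenset("pf"), frozenset("rs")}


def shape_aware_sub(x: str, y: str) -> float:
    """Cheaper substitution inside a similarity class (illustrative cost)."""
    return 0.3 if frozenset((x, y)) in SIMILAR else 1.0


print(edit_distance("okedy", "otedy"))                            # 1.0
print(edit_distance("okedy", "otedy", sub_cost=shape_aware_sub))  # 0.3
```

Any choice of cost function, including the uniform one, encodes a view of why two words differ, which is the point made above.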

All this to show that any statistics that you choose to collect will tacitly imply some "Origin Theory".  

Thus you had better be conscious of those assumptions, and decide whether they are likely or not, before you spend any time collecting tons of statistics that may be useless.

Quote:Word choice is constrained by position in the line or paragraph

The dependency on position in the paragraph is quite expected if the text is indeed a herbal, as discussed above.  

As was pointed out many times recently, the dependency on position in a line may well be an artifact of the basic line-breaking algorithms.  Namely, the first word after a line break tends to be longer than average, while the 1-3 words  before a line break tend to be shorter than average. 

This pattern has been verified in texts in any language, independently of the nature of the text and of the distance between the "rails" (edges of the text area).  If the token length distribution is different at those places, the word frequencies will be different too, and then the character and digraph frequencies will be different too.  

Is this phenomenon enough to explain the "LAAFU" effects seen in the VMS?  I don't know, but that is a question that is worth investigating.   One way would be to join all the lines of each parag (excluding the first 1.5 lines)  into a single running text, then break it into lines with different rail spacing.  Say, if the original line has N glyphs on average, then re-format the paragraph into lines with at most c*N glyphs, where c = 1.618 or 0.618, and see if those claimed "LAAFU anomalies" are still observed. (Recognize the number?  Can you tell why that is a good choice for c?).
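
A minimal sketch of that re-formatting step, under stated assumptions: a paragraph is a list of lines, each line a list of token strings; the glyph count of a line is the sum of its token lengths in characters (spaces are not counted, and one transcription character is treated as one glyph); and breaking is greedy. The function names are illustrative. Dropping the first 1.5 lines of each parag is left to the caller.

```python
def rewrap(tokens, max_glyphs):
    """Greedy line breaking: start a new line when the next token would not fit.

    A single token longer than max_glyphs still gets a line of its own.
    """
    lines, cur, used = [], [], 0
    for tok in tokens:
        if cur and used + len(tok) > max_glyphs:
            lines.append(cur)
            cur, used = [], 0
        cur.append(tok)
        used += len(tok)
    if cur:
        lines.append(cur)
    return lines


def rebreak_paragraph(lines, c):
    """Join a paragraph into running text and re-break it at c * N glyphs,
    where N is the average glyph count of the original lines."""
    tokens = [tok for line in lines for tok in line]
    n_avg = sum(len(tok) for tok in tokens) / len(lines)
    return rewrap(tokens, c * n_avg)


# Toy paragraph with made-up tokens, only to show the call.
para = [["daiin", "chedy", "qokeedy"], ["shedy", "ol", "chor", "dy"], ["okaiin", "ar"]]
for line in rebreak_paragraph(para, 0.618):
    print(" ".join(line))
```

The line-start and line-end statistics would then be recomputed on the output for each value of c and compared with the values for the original lines.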

More generally, collecting statistics with no defined goal is very likely to be a waste of time.  Seriously consider following the scientific method instead: formulate a hypothesis about the nature of the text (like "it is a herbal with one-to-one word encoding", or "it is gibberish generated by a fixed copy-and-mutate style algorithm"), then try to devise the simplest analysis that is likely to clearly disprove that hypothesis, if it is false (rather than a test that is likely to succeed if it is true).

All the best, --stolfi
@ Stolfi That's so true...
Hello Jorge,

I fully understand  and support what you say in your previous post, but I have some points to clarify:

(12-04-2026, 08:14 AM)Jorge_Stolfi Wrote: In your case, you are implicitly assuming that the text was generated by some uniform mechanical process, hence that it is "gibberish" 

I am not assuming. I am testing. I am not getting into the meaning or how to generate the words. I could have made the model simply pick the real Voynich words from a bag of words, but I was checking what happens if it generates new words, in order to test whether they are easily constructed (they are not, of course) and whether the text features are well explained. I repeat, I am just testing and trying to get the maximum information out of what we have (the MS).

(12-04-2026, 08:14 AM)Jorge_Stolfi Wrote:
More generally, collecting statistics with no defined goal is very likely to be a waste of time.  

Well, I am not at the stage of creating a theory and testing it; I really have no clues. I am at the stage of observation, which is also part of the scientific method.
(12-04-2026, 10:18 AM)quimqu Wrote: Well, I am not in the stage of creating a theory and testing it, I really have no clues. I am in the stage of observation, also part of the scientific method.

But you don't have to believe the theory.  In fact, the scientific method works best when you set out to check a theory that you don't believe.  Because then you try harder to find a good test that will disprove it.

Like, someone who believes in the LAAFU theory will naturally try to compute more statistics that reveal anomalies at the two extremities of the lines.  But someone who does not believe in the theory would think first of doing tests that show that the apparent anomalies are due to other causes.  Like the line-breaking bias.  Hence my proposed test of redoing the line breaks with different margin widths.

Thorsten recently claimed that the change from "Language A" to "Language B" is gradual, and is due to the drift that one expects from the copy-and-mutate method.   Someone who believes that claim may, for instance, look for ways to rearrange the bifolios of Herbal-A and/or Herbal-B so as to make successive pages maximally similar to each other.   And I bet that there is indeed a rearrangement that makes the transition more gradual than the current one.   But since I don't believe that theory, I am more interested in tests that could disprove it, if it is false.   Any idea?   Could your kind of analysis do that?

All the best, --stolfi
There are so many more factors in the VMS that influence the statistics. Here are just a few: Which glyphs belong together and which don’t (aiin, daiin, dy, qo, etc)? Are the blank spaces actually blank spaces or not? Are the line breaks the start of new sentences or not? Are the glyphs transcribed correctly or incorrectly? Are the glyphs severely flawed or not? If you try to calculate all of that, you end up in the devil’s kitchen… and there, everything is just a foul-smelling soup…

@ quimqu But I can see your point. Actually, that’s exactly how you should proceed when you don’t have a theory: just run some tests, see what results you get, and then see if you can disprove those results. That way, over time, you’ll get a good picture of what the VMS structure looks like. But does it help? No!!! :D
Dear Jorge,

I ran the below with Claude and ran my "failure protocol", which hopefully picked up any obvious issues, but I'm being transparent here and curious to see what's actually broken, if anything. I expect quimqu is using GPT/Claude as well, and given our discussion on the other thread, I remain curious whether this can add value.

The main tables below use ZLZI. A robustness section at the end shows that every finding replicates across all three transcriptions. All measurements use raw EVA characters. No theory or grammar is assumed — just character identity, word boundaries, and line boundaries.

I use six thematic sections (Botanical, Astrological, Balneological, Rosettes, Pharmaceutical, Stars) and, separately, five scribes identified by Lisa Fagin Davis. The Currier A/B dialect distinction is tagged at the folio level by scribe, not by section.


## Test 1: Re-breaking the lines

You suggested removing the real line breaks and re-breaking the text at different widths. If line-position effects are genuine production signals, they should disappear when the breaks move. If they are artefacts of margin alignment, they should reappear at any break point.

I preserved folio boundaries (so no token crosses a folio edge) and measured four things at each break width:

- **Gallows-initial**: first character of the first word is k, t, p, or f (note: the character-level analysis below shows this is driven by p and t; k is slightly depleted at line start)
- **Hapax-initial**: first word appears only once in the corpus
- **-m final**: last word on the line ends in the character m
- **AC reset**: Pearson correlation of consecutive word lengths, measured separately within lines and across line boundaries
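
As a rough illustration of two of these measurements (not the code actually used for the tables), the sketch below computes the gallows-initial rates and the within-line versus cross-boundary correlation of consecutive word lengths. It assumes `lines` is a list of lines, each a list of non-empty token strings, and it ignores the folio-boundary handling described above; all function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def length_autocorrelation(lines):
    """Return (AC within lines, AC across line boundaries)."""
    within_a, within_b, cross_a, cross_b = [], [], [], []
    prev_last = None
    for line in lines:
        if not line:
            continue
        if prev_last is not None:
            # Pair: last word of previous line, first word of this line.
            cross_a.append(len(prev_last))
            cross_b.append(len(line[0]))
        for a, b in zip(line, line[1:]):
            within_a.append(len(a))
            within_b.append(len(b))
        prev_last = line[-1]
    return pearson(within_a, within_b), pearson(cross_a, cross_b)


def gallows_initial_rates(lines, gallows="ktpf"):
    """Return (rate among line-initial words, rate among all other words)."""
    first = [line[0] for line in lines if line]
    rest = [tok for line in lines for tok in line[1:]]

    def rate(tokens):
        return sum(tok[0] in gallows for tok in tokens) / len(tokens)

    return rate(first), rate(rest)
```

The same two functions can be applied unchanged to the re-broken lines, so that the real and re-broken rows are computed by identical code.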

### Results

| Line breaks | Gal. init | Gal. else | Hap. init | Hap. else | -m final | -m penult | AC within | AC cross |
|---|---|---|---|---|---|---|---|---|
| **Real lines** | **20.5%** | **6.0%** | **21.5%** | **12.2%** | **14.6%** | **1.7%** | **0.173** | **0.051** |
| Width 4 | 9.1% | 7.3% | 13.9% | 13.0% | 2.6% | 2.3% | 0.156 | 0.147 |
| Width 6 | 9.9% | 7.3% | 14.6% | 13.0% | 2.9% | 2.4% | 0.151 | 0.167 |
| Width 8 | 10.9% | 7.3% | 14.6% | 13.1% | 2.8% | 2.4% | 0.155 | 0.141 |
| Width 10 | 11.1% | 7.4% | 14.3% | 13.1% | 3.3% | 2.4% | 0.151 | 0.172 |
| Width 12 | 12.1% | 7.3% | 15.8% | 13.0% | 3.5% | 2.3% | 0.154 | 0.150 |
| Width 15 | 13.0% | 7.4% | 14.9% | 13.1% | 4.0% | 2.3% | 0.153 | 0.158 |
| Width 20 | 15.5% | 7.3% | 16.2% | 13.1% | 3.7% | 2.1% | 0.152 | 0.171 |

Every effect weakens or vanishes when line breaks are moved:

- Gallows-initial drops from a 3.4× ratio (20.5% vs 6.0%) to roughly 1.2–1.5× at widths 4–10, rising to 2.1× at width 20
- Hapax-initial drops from 1.8× to roughly 1.1–1.2×
- -m final drops from 8.6× (14.6% vs 1.7%) to roughly 1.1–1.8×
- The AC gap (0.173 within vs 0.051 across) closes to roughly 0.15 both ways

A residual gallows-initial effect persists at wider re-break widths (particularly width 20: 15.5% vs 7.3%). Only 10.6% of width-20 boundaries coincide with real line boundaries, which is insufficient to explain the residual (the expected rate from coincidence alone would be ~9%). The residual may reflect local clustering of gallows-initial tokens near real line boundaries, but I have not fully accounted for it. The key finding is the dramatic collapse of all four effects at narrower widths, where coincidence with real boundaries is negligible.
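
The coincidence argument can be checked with a two-term mixture: a fraction of re-broken boundaries falls on real line starts and inherits the real line-start rate, and the rest inherit a baseline rate. Which baseline was used for the ~9% figure is not stated, so this sketch (the function name is illustrative) tries both the real-line "else" rate and the width-20 "else" rate from the table.

```python
def expected_rate(coincide, rate_at_real_start, rate_otherwise):
    """Expected gallows-initial rate at re-broken line starts if the only
    enrichment comes from boundaries that coincide with real ones."""
    return coincide * rate_at_real_start + (1 - coincide) * rate_otherwise


# Inputs are the figures quoted in the table and text above.
for baseline in (0.060, 0.073):
    value = expected_rate(0.106, 0.205, baseline)
    print(f"baseline {baseline:.1%}: expected {value:.1%}")
```

Under these assumptions the expectation comes out at about 7.5% and 8.7% respectively, both well below the observed 15.5%.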

These effects are tied to the real line breaks. Lines are functional units, not word-wrap.

### Which characters drive the effects?

**Line end:**

| Final char | Line-end % | Elsewhere % | Ratio | Count |
|---|---|---|---|---|
| g | 1.7% | 0.1% | 21.9× | 74 |
| m | 14.6% | 0.9% | 15.8× | 636 |
| d | 2.5% | 1.5% | 1.7× | 110 |
| y | 39.8% | 40.8% | 1.0× | 1,738 |
| n | 15.1% | 16.2% | 0.9× | 660 |
| l | 11.3% | 16.2% | 0.7× | 493 |
| r | 9.3% | 15.7% | 0.6× | 405 |
| o | 1.4% | 3.7% | 0.4× | 61 |

**Line start:**

| Initial char | Line-start % | Elsewhere % | Ratio |
|---|---|---|---|
| p | 8.5% | 0.5% | 17.8× |
| t | 8.9% | 1.8% | 4.8× |
| y | 14.1% | 3.8% | 3.7× |
| f | 0.8% | 0.2% | 3.1× |
| d | 15.1% | 8.7% | 1.7× |
| k | 2.3% | 3.5% | 0.7× |
| c | 4.5% | 20.4% | 0.2× |
| a | 0.5% | 6.2% | 0.1× |

The "gallows-initial" effect is driven by p (17.8×) and t (4.8×). The character k is slightly depleted (0.7×). Line-end enrichment is concentrated in -m and -g, while -l, -r, and -o are actively depleted.

---

## Test 1b: LAAFU by scribe and by thematic section

### By Davis scribe

| Scribe | Lines | Tokens | Gal. init | Gal. else | Ratio | -m fin | AC w | AC c |
|---|---|---|---|---|---|---|---|---|
| S1 (Dialect A) | 1,489 | 10,448 | 17.9% | 7.3% | 2.5× | 9.7% | 0.152 | 0.021 |
| S2 (Dialect B) | 1,101 | 9,501 | 19.8% | 4.9% | 4.0× | 11.2% | 0.154 | 0.037 |
| S3 (Dialect B*) | 1,232 | 12,007 | 27.1% | 6.1% | 4.4× | 22.6% | 0.181 | 0.083 |
| S4 | 449 | 3,871 | 11.6% | 4.9% | 2.4× | 16.9% | 0.157 | 0.059 |
| S5 | 95 | 842 | 25.3% | 7.9% | 3.2× | 15.8% | 0.151 | 0.082 |

All five scribes show the LAAFU pattern: gallows enriched at line start (2.4–4.4×), -m enriched at line end, AC reset at boundaries. The production habit is shared regardless of dialect.

Scribe assignments are Davis's (2020, preliminary). Currier classified Scribe 1 as Dialect A, Scribes 2 and 3 as Dialect B (Scribe 3 writes the Stars section, which Currier called "modified B"). Scribes 4 and 5 were not separately identified by Currier.

### By thematic section

| Section | Lines | Gal. init | Gal. else | Ratio | -m fin | AC w | AC c |
|---|---|---|---|---|---|---|---|
| Botanical | 1,748 | 21.2% | 7.9% | 2.7× | 12.0% | 0.154 | 0.038 |
| Astrological | 320 | 6.6% | 3.7% | 1.8× | 12.2% | 0.154 | 0.089 |
| Balneological | 789 | 15.3% | 4.1% | 3.7× | 7.5% | 0.147 | 0.026 |
| Rosettes | 187 | 24.6% | 6.4% | 3.8× | 28.9% | 0.159 | 0.028 |
| Pharmaceutical | 238 | 20.6% | 3.5% | 5.9× | 13.0% | 0.146 | 0.000 |
| Stars | 1,084 | 26.5% | 6.1% | 4.3× | 22.5% | 0.188 | 0.077 |

The Astrological section has the weakest gallows-initial effect (6.6% vs 3.7%, ratio 1.8×) — consistent with its unusual circular/radial layout. All other sections show ratios of 2.7× or higher.

---

## Test 2: Vocabulary distance by scribe

I measured pairwise vocabulary distance (Jensen-Shannon divergence, log base 2) at the folio level, grouped by Davis scribe. Folios with fewer than 10 tokens were excluded.

| Comparison | Mean JSD | Folios / pairs |
|---|---|---|
| Within Scribe 1 | 0.802 | 113 fol / 6,328 pairs |
| Within Scribe 2 | 0.686 | 42 fol / 861 pairs |
| Within Scribe 3 | 0.695 | 32 fol / 496 pairs |
| Cross S1 ↔ S2 | 0.860 | 4,746 pairs |
| Cross S1 ↔ S3 | 0.861 | 3,616 pairs |
| Cross S2 ↔ S3 | 0.719 | 1,344 pairs |

Scribe 1 (Dialect A) is equally distant from Scribe 2 and Scribe 3 (0.860 vs 0.861). Scribes 2 and 3 (both Dialect B variants) are much closer to each other (0.719) than either is to Scribe 1. The within-S1 mean distance is higher (0.802) than within S2 or S3, plausibly because Scribe 1 spans multiple thematic sections (Botanical + Pharmaceutical), while Scribes 2 and 3 are each concentrated in one section.
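
For readers who want to reproduce this kind of table, here is a minimal sketch of a base-2 Jensen-Shannon divergence between the word-frequency distributions of two folios, plus a mean over all pairs. It assumes each folio is simply a list of token strings and that "vocabulary distance" means divergence between relative word frequencies; the function names are illustrative and this is not the code behind the numbers above.

```python
from collections import Counter
from itertools import combinations
from math import log2


def jsd(tokens_p, tokens_q):
    """Jensen-Shannon divergence (log base 2) between two token lists.

    Ranges from 0 (identical distributions) to 1 (no shared words).
    """
    p, q = Counter(tokens_p), Counter(tokens_q)
    n_p, n_q = sum(p.values()), sum(q.values())
    total = 0.0
    for word in p.keys() | q.keys():
        pw, qw = p[word] / n_p, q[word] / n_q
        m = (pw + qw) / 2
        if pw:
            total += 0.5 * pw * log2(pw / m)
        if qw:
            total += 0.5 * qw * log2(qw / m)
    return total


def mean_pairwise_jsd(folios):
    """Mean JSD over all unordered pairs of folios (each a token list)."""
    pairs = list(combinations(folios, 2))
    return sum(jsd(a, b) for a, b in pairs) / len(pairs)


print(jsd(["daiin", "ol"], ["daiin", "ol"]))  # 0.0
print(jsd(["daiin"], ["chedy"]))              # 1.0
```

One caveat when reading such values: short folios share few word types by chance alone, so small samples push the divergence toward 1 regardless of scribe.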

### A-ness by scribe

A-ness = distance to S2 centroid / (distance to S1 centroid + distance to S2 centroid). Higher = more A-like:

| Scribe | A-ness | Folios |
|---|---|---|
| S1 (Dialect A) | 0.522 ± 0.007 | 113 |
| S4 | 0.496 ± 0.009 | 30 |
| S5 | 0.482 ± 0.011 | 7 |
| S3 (Dialect B*) | 0.470 ± 0.013 | 32 |
| S2 (Dialect B) | 0.454 ± 0.017 | 42 |

Scribe 4 (Currier's "Astrological, mostly A") sits near the midpoint (0.496). Currier classified this section as "mostly A," but the overall vocabulary profile is intermediate between Dialects A and B. This may reflect that Currier's A/B distinction was based on specific features (frequency of particular symbol groups, unattached finals) rather than overall vocabulary distance.
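
The A-ness ratio defined above can be sketched as follows. Two things are assumptions here, since the post does not specify them: the centroid is taken as the unweighted mean of the per-folio relative-frequency distributions, and the distance is base-2 Jensen-Shannon divergence. Function names are illustrative.

```python
from collections import Counter
from math import log2


def distribution(tokens):
    """Relative word frequencies of one folio."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}


def centroid(dists):
    """Unweighted mean of several folio distributions (assumed definition)."""
    words = set().union(*dists)
    return {w: sum(d.get(w, 0.0) for d in dists) / len(dists) for w in words}


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    total = 0.0
    for w in p.keys() | q.keys():
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = (pw + qw) / 2
        if pw:
            total += 0.5 * pw * log2(pw / m)
        if qw:
            total += 0.5 * qw * log2(qw / m)
    return total


def a_ness(folio, c1, c2):
    """distance to S2 centroid / (distance to S1 + distance to S2)."""
    d1, d2 = jsd(folio, c1), jsd(folio, c2)
    return d2 / (d1 + d2)
```

A folio that contributes to one of the centroids is pulled toward it by construction, so a leave-one-out centroid may be worth considering when scoring S1 and S2 folios themselves.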

### Drift within Scribe 1

If dialect drift is real, later Scribe 1 folios should be closer to Dialect B than earlier ones. Scribe 1 covers 112 folios: 96 Botanical (f1–f56) and 16 Pharmaceutical (f88–f102).

| Sample | Folios | r (position vs dist to S2) | Critical r (p=0.05) |
|---|---|---|---|
| S1 Botanical only | 96 | +0.070 | ±0.201 |
| All S1 | 112 | −0.057 | ±0.186 |

Neither correlation is significant. Within the Botanical section (96 folios of continuous text), there is no drift toward Dialect B (r = +0.070, slightly in the wrong direction). Across all Scribe 1 folios, r = −0.057 (p = 0.55). No drift is detected at any scope.
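
A small sketch of how such a drift correlation can be assessed using only the standard library. It computes Pearson's r and the usual t statistic, and, instead of the tabulated critical r used in the table above, a simple permutation test (a swapped-in technique, named as such). Function names, the number of permutations, and the seed are illustrative choices.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n samples (n - 2 degrees of freedom)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def permutation_p(xs, ys, n_perm=999, seed=0):
    """Two-sided permutation p-value: how often a shuffled ordering gives
    a correlation at least as strong as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# The "All S1" row: r = -0.057 over 112 folios.
print(round(t_statistic(-0.057, 112), 3))  # -0.599
```

Here `xs` would be the folio positions and `ys` the distances to the S2 centroid; a t value of about -0.6 with 110 degrees of freedom is consistent with the reported p = 0.55.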

---

## Robustness across transcriptions

All key results replicate across ZLZI, Takahashi (TTVE), and your own transcription (JSLI):

| Metric | ZLZI | TTVE | JSLI |
|---|---|---|---|
| Lines (≥2 tokens) | 4,366 | 4,393 | 1,250 |
| Tokens | 36,669 | 37,072 | 9,358 |
| Gallows ratio (real lines) | 3.4× | 3.7× | 2.7× |
| Gallows ratio (width 6 re-break) | 1.4× | 1.5× | 1.3× |
| -m final (real lines) | 14.6% | 16.1% | 13.9% |
| -m final (width 6 re-break) | 2.9% | 3.0% | 3.3% |
| AC within | 0.173 | 0.174 | 0.176 |
| Drift r (S1 Botanical folios) | −0.050 | −0.049 | −0.045 |

No conclusion depends on the choice of transcriber.

## Summary

| Question | Answer | Key numbers |
|---|---|---|
| Are lines functional units? | Yes | All four effects weaken sharply or vanish under re-breaking; a residual gallows-initial effect remains at wide widths |
| Is LAAFU shared across scribes? | Yes — all five scribes show it | Gallows ratio 2.4–4.4×; -m enrichment 9.7–22.6% |
| Are Dialects A and B separable? | Yes | Cross S1↔S2 JSD = 0.860; within-S2 = 0.686; S2↔S3 = 0.719 (B variants closer to each other) |
| Is A-to-B gradual drift? | No drift detected | r = +0.070 within S1 Botanical (96 fol); r = −0.057 across all S1 (112 fol); neither significant |
| How does Scribe 4 (Astrological) fit? | Near the midpoint | A-ness = 0.496; Currier said "mostly A" but vocabulary is intermediate |


Best regards,
Edward
Hi Edward,

(12-04-2026, 06:38 PM)DG97EEB Wrote: I expect quimqu is using GPT/Claude as well, and given our discussion on the other thread, I remain curious if this can add value.

I use GPT for translating and for helping me express things understandably in English. My English is not good enough to explain complex ideas, data or results. The code is written by me and it runs on Kaggle. I am not sure how well GPT can run code (quite long code), and I don't feel comfortable letting it run my Python scripts.
(12-04-2026, 07:51 PM)quimqu Wrote: Hi Edward,

(12-04-2026, 06:38 PM)DG97EEB Wrote: I expect quimqu is using GPT/Claude as well, and given our discussion on the other thread, I remain curious if this can add value.

I use GPT for translating and for helping me express things understandably in English. My English is not good enough to explain complex ideas, data or results. The code is written by me and it runs on Kaggle. I am not sure how well GPT can run code (quite long code), and I don't feel comfortable letting it run my Python scripts.

Sure, but Kaggle is an AI platform... I also write code with Claude and run it in Termux... Same same...
(12-04-2026, 08:18 PM)DG97EEB Wrote: Sure, but Kaggle is an AI platform... I also write code with Claude and run it in Termux... Same same...

?? Kaggle is not an AI platform, please don't spread misinformation. Kaggle is a place where you can find datasets, write code, and run machine learning experiments. But there is no AI there: you execute your code step by step, and it helps a lot with data science scripts. It does not "create" anything, and it was born years before this AI boom.

Termux is a terminal (Linux-like). I don't see where you find AI there. If Claude writes your code, that is where the AI is. But if you don't know anything about the language it writes for you, it is difficult to understand what it is doing.
(12-04-2026, 09:26 PM)quimqu Wrote:
(12-04-2026, 08:18 PM)DG97EEB Wrote: Sure, but Kaggle is an AI platform... I also write code with Claude and run it in Termux... Same same...

?? Kaggle is not an AI platform, please be informed.
[Deleted]