The Voynich Ninja

Full Version: A One-Page Ledger Method for Generating Voynich-Like Text
(23-05-2026, 02:44 PM)Dunsel Wrote: The lines represent ED1 relationships. Two nodes are connected if one form can be transformed into the other by a single insertion, deletion, or substitution. The actual physical lengths of the lines are not meaningful by themselves. The graph layout uses a spring-force algorithm that tries to pull highly connected regions together while pushing weakly connected regions apart.
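For readers who want to reproduce the edges, here is a minimal sketch of the ED1 test described above. It is not taken from the repo: the names is_ed1 and ed1_edges are hypothetical, and it assumes each transcription character counts as one unit.

```python
def is_ed1(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) form
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution at position i
    return a[i:] == b[i + 1:]            # one extra character in b at position i


def ed1_edges(forms):
    """All unordered ED1 pairs among the distinct forms (quadratic, fine for small sets)."""
    forms = sorted(set(forms))
    return [(x, y) for k, x in enumerate(forms) for y in forms[k + 1:] if is_ed1(x, y)]
```

The pairwise loop is quadratic in the vocabulary size, which is acceptable for a couple of thousand forms but would need bucketing for larger sets.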

Thanks!  Yes, I am familiar with that spring-force method for automatic graph drawing.

I wonder if one could use the width of the lines to convey some useful information? 

Like, let A and B be two words at edit distance 1, so that they would be connected by a line in that graph.  Let Fr(x,y) denote the estimated frequency of the biword x.y (occurrences of word type x immediately followed by word type y) in the text in question.  We could compute a "semantic similarity" S(A,B) of the two words by comparing the distributions Fr(A,y) with Fr(B,y), or Fr(x,A) with Fr(x,B), or both. Then one could draw that graph with line width proportional to S(A,B).
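A minimal sketch of one way to compute such an S(A,B). The post does not name a measure, so cosine similarity between the raw count vectors is an assumption here, as are the function names and the choice to average the left and right scores for "both".

```python
from collections import Counter
from math import sqrt


def context_counts(tokens):
    """For each word type, count the word types that follow it and that precede it."""
    followers, preceders = {}, {}
    for x, y in zip(tokens, tokens[1:]):
        followers.setdefault(x, Counter())[y] += 1   # Fr(x, y) seen from x
        preceders.setdefault(y, Counter())[x] += 1   # Fr(x, y) seen from y
    return followers, preceders


def cosine(p: Counter, q: Counter) -> float:
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def similarity(tokens, a, b, mode="both"):
    """S(a, b) from right contexts, left contexts, or the mean of both (assumed)."""
    followers, preceders = context_counts(tokens)
    right = cosine(followers.get(a, Counter()), followers.get(b, Counter()))
    left = cosine(preceders.get(a, Counter()), preceders.get(b, Counter()))
    if mode == "right":
        return right
    if mode == "left":
        return left
    return (left + right) / 2
```

With rare words the count vectors are short and sparse, so a high or low score on a handful of occurrences says little; some smoothing or a minimum-count cutoff would be a reasonable addition.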

For more meaningful results, you should consider only one section at a time, and only those sections with substantial text: Herbal A and B, Bio, and Starred Parags.  And exclude the pages with undefined topic, like f1r, f66r, f85r1, f86v5, f86v6, f86v3, etc., plus the bottom of [link hidden] (the star-less parags).  And beware of quantization and sampling noise when comparing the two distributions.

All the best, --stolfi
(23-05-2026, 03:22 PM)Jorge_Stolfi Wrote: I wonder if one could use the width of the lines to convey some useful information?

I fed your suggestions into codex. It took it a bit to get it, "I think" correct.  Let me know if you see any discrepancies and I'll fix them.  

Here's chol. I picked herbal scribe 1 so there's no chedy.

[attachment=15717]

Full herbal scribe 1:

[attachment=15718]

And its text report:

Context folio range: [link hidden] to f66v; scribe: 1; mode: combined context; minimum similarity: 0.00; tokens used after word filters: 6617

Total vocabulary: 2053
Total tokens: 6617
Number of ED1 components: 248
Largest component size: 1795
Percent vocabulary in largest component: 87.43%
Percent tokens in largest component: 96.10%
Top 20 largest components by forms and token coverage:
    1:  1795 forms ( 87.43%),    6359 tokens ( 96.10%), top: daiin, chol, chor, dy, chy, shol, sho, cthy, dain, dar
    2:    3 forms (  0.15%),      3 tokens (  0.05%), top: okodar, qokodar, qokorar
    3:    2 forms (  0.10%),      2 tokens (  0.03%), top: cfhodar, cfholdar
    4:    2 forms (  0.10%),      2 tokens (  0.03%), top: cheoeees, cheoiees
    5:    2 forms (  0.10%),      2 tokens (  0.03%), top: oaorar, oporar
    6:    2 forms (  0.10%),      2 tokens (  0.03%), top: oeesody, oeesordy
    7:    2 forms (  0.10%),      2 tokens (  0.03%), top: okchaldy, opchaldy
    8:    2 forms (  0.10%),      2 tokens (  0.03%), top: pchooiin, pchroiin
    9:    2 forms (  0.10%),      2 tokens (  0.03%), top: qockhom, qockhor
    10:    2 forms (  0.10%),      2 tokens (  0.03%), top: sarar, satar
    11:    2 forms (  0.10%),      2 tokens (  0.03%), top: sheekal, sheekol
    12:    1 forms (  0.05%),      1 tokens (  0.02%), top: aiios
    13:    1 forms (  0.05%),      1 tokens (  0.02%), top: cfarsa
    14:    1 forms (  0.05%),      1 tokens (  0.02%), top: chaies
    15:    1 forms (  0.05%),      1 tokens (  0.02%), top: chakod
    16:    1 forms (  0.05%),      1 tokens (  0.02%), top: charochy
    17:    1 forms (  0.05%),      1 tokens (  0.02%), top: chckhaly
    18:    1 forms (  0.05%),      1 tokens (  0.02%), top: chckhan
    19:    1 forms (  0.05%),      1 tokens (  0.02%), top: chckom
    20:    1 forms (  0.05%),      1 tokens (  0.02%), top: chdlety
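For anyone wanting to cross-check a report like this, here is a minimal sketch of how the component figures could be computed. It is not the code from Ed1_Network_Mapper_GUI.py; the function names, the union-find approach, and the wildcard bucketing are all assumptions chosen to avoid comparing every pair of forms.

```python
from collections import Counter, defaultdict

WILDCARD = "\x00"  # placeholder assumed never to occur in a transcription token


def ed1_components(tokens):
    """Return [(forms, token_count), ...], largest component first."""
    counts = Counter(tokens)
    parent = {w: w for w in counts}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]      # path halving
            w = parent[w]
        return w

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    buckets = defaultdict(list)
    for w in counts:
        for i in range(len(w)):
            shorter = w[:i] + w[i + 1:]
            if shorter in counts:              # insertion/deletion link
                union(w, shorter)
            buckets[w[:i] + WILDCARD + w[i + 1:]].append(w)
    for group in buckets.values():             # substitution links
        for other in group[1:]:
            union(group[0], other)

    comps = defaultdict(list)
    for w in counts:
        comps[find(w)].append(w)
    result = []
    for forms in comps.values():
        forms.sort(key=lambda w: (-counts[w], w))   # most frequent form first
        result.append((forms, sum(counts[w] for w in forms)))
    result.sort(key=lambda c: (-len(c[0]), -c[1], c[0][0]))
    return result


def coverage(components):
    """Percent of vocabulary and percent of tokens in the largest component."""
    n_forms = sum(len(forms) for forms, _ in components)
    n_tokens = sum(tokens for _, tokens in components)
    forms, tokens = components[0]
    return 100 * len(forms) / n_forms, 100 * tokens / n_tokens
```

The percentages in the report are consistent with this formula: 1795 of 2053 forms is 87.43%, and 6359 of 6617 tokens is 96.10%.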

New UI:
Removing parts of pages would be a bit of a challenge. If enough pages are selected, removing a full page like [link hidden] shouldn't make a huge difference.

[attachment=15719]
  • Uploaded Ed1_Network_Mapper_GUI.py to the repo; this is the script creating these charts and output.  Not in the README description yet.
  • Updated Create_Mappings_Gui_v1.py.  This creates the mappings file used by the network mapper.  It will now accept either one of my json transcriptions or a text file. It strips Gutenberg headers if you wish to use those.
  • Voynich_Transcription_Export_v2.py creates the json transcriptions I use.  NOTE: most of my files are not set up to use non-EVA transcriptions even though this file will export them.
(23-05-2026, 04:12 PM)Dunsel Wrote: I fed your suggestions into codex. It took it a bit to get it, "I think" correct.

Thank you!

I cannot tell if the line thicknesses are right or wrong, but they suggest other more specific tests.

Could you please do daiin, on the Starred Parags section?

All the best, --stolfi
(23-05-2026, 07:09 PM)Jorge_Stolfi Wrote:
(23-05-2026, 04:12 PM)Dunsel Wrote: I fed your suggestions into codex. It took it a bit to get it, "I think" correct.

Thank you!

I cannot tell if the line thicknesses are right or wrong, but they suggest other more specific tests.

Could you please do daiin, on the Starred Parags section?

All the best, --stolfi

Recipe/Stars full backbone

[attachment=15729]

daiin network

[attachment=15724]

And I ran another with your update on Scribe 2 chedy.  Here you can see the darker connection lines:

[attachment=15728]
(23-05-2026, 07:09 PM)Jorge_Stolfi Wrote:
(23-05-2026, 04:12 PM)Dunsel Wrote: I fed your suggestions into codex. It took it a bit to get it, "I think" correct.

Thank you!

I cannot tell if the line thicknesses are right or wrong, but they suggest other more specific tests.

Could you please do daiin, on the Starred Parags section?

All the best, --stolfi

Ah, figured out your contexts.  Here's all 3.  Both, left and right.

Both
[attachment=15732]

Left
[attachment=15730]

Right
[attachment=15731]

And since the distinction is subtle, I put all 3 into an animated gif. 

[attachment=15734]
(21-05-2026, 03:29 PM)Jorge_Stolfi Wrote: Sorry, I don't understand this argument.

Take for example
  56 otedy 56 oteedy
   2 ytedy 12 yteedy


If, after a suitable warm-up period, the words otedy and oteedy are equally frequent (as shown), and the mutation process can create ytedy from otedy, it should also create yteedy from oteedy. Then ytedy and yteedy should be equally frequent too.  But their ratio is only 1:6.
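The arithmetic behind that expectation can be spelled out in a few lines. This is only a sketch of the reasoning in the paragraph above, using the four counts quoted there; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the example above.
counts = {"otedy": 56, "oteedy": 56, "ytedy": 2, "yteedy": 12}

# Observed -edy : -eedy ratio within each prefix group.
o_ratio = Fraction(counts["otedy"], counts["oteedy"])   # 1
y_ratio = Fraction(counts["ytedy"], counts["yteedy"])   # 1/6

# If the o -> y change ignored the suffix, the y- forms would split
# between -edy and -eedy in the same proportion as the o- forms.
y_total = counts["ytedy"] + counts["yteedy"]
share_edy = Fraction(counts["otedy"], counts["otedy"] + counts["oteedy"])
expected_ytedy = y_total * share_edy
expected_yteedy = y_total - expected_ytedy

print(f"o- ratio {o_ratio}, y- ratio {y_ratio}")
print(f"expected ytedy/yteedy: {expected_ytedy}/{expected_yteedy}, "
      f"observed: {counts['ytedy']}/{counts['yteedy']}")
```

With only 14 y- tokens the sampling noise is large, so the gap between 7:7 and 2:12 would still need a proper significance test before much weight is put on it.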

As I see it, the only ways your model would create the above counts are (1) the mutation of the prefix o->y is sensitive to whether the suffix is edy or eedy, or vice-versa, or (2) the seed text had those four words in those approximate skewed ratios (maybe no ytedy at all), and the mutation rules cannot create enough ytedy from otedy or from yteedy to raise the ytedy:yteedy ratio above 1:6.  Isn't that so?

If (1) is the case, then the method is even more complicated than it seemed at first.

If (2) is the case, then the method must be relying a lot more on the seed text being "Voynichese-like".  Which essentially replaces the question "how could the Author have generated the VMS text" with "how could the Author have generated a seed text with the same vocabulary and word frequencies as the VMS text".

No?

No, there are far more words than just these four. They exist in a multidimensional network of dozens of related forms. The scribe doesn't choose between "otedy" and "oteedy" in isolation — he chooses among the entire visible pool of similar words:

    | -kedy       | -keedy       | -key      | -keey
o-  | okedy (118) | okeedy (105) | okey (63) | okeey (177)
y-  | ykedy (23)  | ykeedy (30)  | ykey (8)  | ykeey (58)
ot- | otedy (155) | oteedy (100) | otey (57) | oteey (140)
yt- | ytedy (24)  | yteedy (28)  | ytey (13) | yteey (28)

And the network extends far beyond these sixteen forms. Here is a larger sample for the ok-/k-/t-/ot- prefix group alone (not even including the y- or qo- variants):

       | ok-           | k-          | t-          | ot-
-aiiin | okaiiin (4)   | kaiiin (3)  | taiiin (1)  | otaiiin (1)
-aiin  | okaiin (212)  | kaiin (65)  | taiin (42)  | otaiin (154)
-ain   | okain (144)   | kain (48)   | tain (16)   | otain (96)
-an    | okan (5)      | kan (3)     | tan (1)     | otan (5)
-aiir  | okaiir (6)    | kaiir (—)   | taiir (—)   | otaiir (4)
-air   | okair (22)    | kair (14)   | tair (13)   | otair (21)
-ar    | okar (129)    | kar (52)    | tar (43)    | otar (141)
-ail   | okail (1)     | kail (1)    | tail (—)    | otail (1)
-al    | okal (138)    | kal (23)    | tal (20)    | otal (143)
-am    | okam (26)     | kam (9)     | tam (—)     | otam (47)
-os    | okos (8)      | kos (3)     | tos (4)     | otos (4)
-or    | okor (34)     | kor (26)    | tor (23)    | otor (46)
-ol    | okol (82)     | kol (37)    | tol (48)    | otol (86)
-o     | oko (8)       | ko (2)      | to (2)      | oto (9)
-y     | oky (102)     | ky (25)     | ty (16)     | oty (115)
-ey    | okey (63)     | key (14)    | tey (11)    | otey (57)
-eey   | okeey (177)   | keey (44)   | teey (20)   | oteey (140)
-eeey  | okeeey (27)   | keeey (11)  | teeey (1)   | oteeey (8)
-chey  | okchey (32)   | kchey (21)  | tchey (22)  | otchey (31)
-chy   | okchy (39)    | kchy (29)   | tchy (24)   | otchy (48)
-shy   | okshy (10)    | kshy (5)    | tshy (5)    | otshy (4)

Twenty-one suffix rows, four prefix columns, eighty-four cells — and nearly every cell is filled with an attested word. There is no 1:1 relationship between two similar Voynich words. Each word exists in a network of variants. The frequency of any specific word reflects its entire history of being produced and being used as a source for further copying — across all production sessions and across the full evolutionary gradient.

The "ok-" and "ot-" forms are consistently more frequent than "k-" and "t-" forms — because "ok-" and "ot-" are the full forms while "k-" and "t-" are the prefix-reduced variants. But within each prefix group, the ratios aren't uniform either — "okeey" (177) is more frequent than "okey" (63), "okaiin" (212) is more frequent than "okain" (144), reflecting which forms the scribe happened to use as sources more often.
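The full-to-reduced ratios can be checked directly from the grid. The sketch below copies the counts from the grid above (None marks an empty cell); the helper names are illustrative and rows with an empty cell in either column are skipped.

```python
# (suffix, ok-, k-, t-, ot-) counts copied from the grid; None marks an empty cell.
GRID = [
    ("aiiin", 4, 3, 1, 1), ("aiin", 212, 65, 42, 154), ("ain", 144, 48, 16, 96),
    ("an", 5, 3, 1, 5), ("aiir", 6, None, None, 4), ("air", 22, 14, 13, 21),
    ("ar", 129, 52, 43, 141), ("ail", 1, 1, None, 1), ("al", 138, 23, 20, 143),
    ("am", 26, 9, None, 47), ("os", 8, 3, 4, 4), ("or", 34, 26, 23, 46),
    ("ol", 82, 37, 48, 86), ("o", 8, 2, 2, 9), ("y", 102, 25, 16, 115),
    ("ey", 63, 14, 11, 57), ("eey", 177, 44, 20, 140), ("eeey", 27, 11, 1, 8),
    ("chey", 32, 21, 22, 31), ("chy", 39, 29, 24, 48), ("shy", 10, 5, 5, 4),
]
COLS = {"ok": 1, "k": 2, "t": 3, "ot": 4}


def ratios(full, reduced):
    """Ratio full/reduced per suffix, for rows where both forms are attested."""
    out = {}
    for row in GRID:
        f, r = row[COLS[full]], row[COLS[reduced]]
        if f is not None and r is not None:
            out[row[0]] = f / r
    return out


def not_more_frequent(full, reduced):
    """Suffixes where the full form is not strictly more frequent than the reduced one."""
    return [suffix for suffix, q in ratios(full, reduced).items() if q <= 1]


print(not_more_frequent("ok", "k"))
print(not_more_frequent("ot", "t"))
```

On these counts the check returns ail for ok-/k- and aiiin, os, shy for ot-/t-, all of them rows with single-digit counts, while the ratios in the well-populated rows range widely (for example about 3.3 for -aiin and about 4.0 for -eey in the ok-/k- pair).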

Some cells are empty — "kaiir" (—), "tail" (—), "tam" (—). Not because a rule forbids them but because the scribe never happened to produce them. Different contingent choices would fill different cells.

That is what I mean by "frequent words are more likely to be selected as copying templates, generating more variants." Not that each word generates its variants at equal rates — but that the entire network of similar words feeds back on itself, with frequent forms generating more variants and rare forms generating fewer.

(For a more complete word grid documenting these networks across the entire VMS vocabulary, see [link hidden] or Timm 2014, pp. 66–82.)
(22-05-2026, 12:03 AM)ReneZ Wrote: Why would a realistic copy-mutate system stay conservative? There is nothing that dictates this.
The number of rules observed over tens of thousands of words are quite complex, and they are indeed rules.

This is, in a way, backwards logic.
We see that the word variations are very strict. Therefore, if the text was generated by modifying previous words, it would have to have followed strict rules. That is the correct direction of the logic.
There is no reason to assume that there would be very strict rules (which are then broken somewhat gradually).

EDIT:
Let's do some rough counts.
The word chedy could be considered to have four characters. 
Limiting to edit distance 1:
Each of these could be changed into another, leading to 4 times, say, 20 options.
Each of these could be deleted, leading to 4 more.
A new character could be added in each of 5 slots, so 5 times 20 more.
6 pairs could be swapped (not sure if that counts as edit distance 1).
We are close to 200 alternatives. 
Possibly 10 exist.
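The rough count can be tallied as follows. This is just the arithmetic from the lines above, with the alphabet size of 20 taken from the "say, 20 options" estimate and chedy treated as four units.

```python
from math import comb

units = 4        # chedy treated as ch-e-d-y
alphabet = 20    # "say, 20 options" per slot

substitutions = units * alphabet        # each unit changed into another
deletions = units                       # each unit deleted
insertions = (units + 1) * alphabet     # a new unit in each of 5 slots
swaps = comb(units, 2)                  # 6 pairs swapped, if that counts as ED1

total = substitutions + deletions + insertions + swaps
print(substitutions, deletions, insertions, swaps, total)
```

The tally comes to 190, which is the "close to 200" figure; leaving out the swaps gives 184.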

We can consider two alternative methods for a creation of a meaningless text using word permutations.

Method A: 
first, a vocabulary is set up using word patterns and their variations
then, a text is composed by somewhat arbitrarily picking words from this vocabulary/dictionary

Method B:
a text is generated by creating new words from previous ones 'on the fly'
then, the resulting vocabulary is the collection of all these words

It should be clear that the very limited set of allowed permutations much better fits with method A than method B.

The calculation of 200 possible edit-distance-1 modifications treats each EVA character as an independent substitutable unit. But the scribe doesn't work in EVA — he works at the stroke  level or at the level of whole prefix- and suffix groups. The natural stroke-level modifications of "chedy" are far fewer than 200:

- "ch" → "sh" (one stroke) = "shedy"
- "e" → "ee" (one stroke) = "cheedy"
- "edy" → "ey" (replace edy with ey) = "chey"
- "ch-" → "ok-" (replace the prefix) = "okedy"

The scribe didn't use an algorithm to find all theoretically possible modifications. He did what humans tend to do. Why bother inventing new modification rules if the set generated so far is already sufficient? So he kept reusing the repertoire of modification rules he had already used. The "200 possible" alternatives include modifications like replacing "e" with "k" or "d" with "q", changes no scribe would make because they cross stroke-family boundaries and established replacement patterns. The gap between 200 and 10 isn't evidence for a pre-designed vocabulary. It's evidence that the modification process operates either at the stroke level or at the level of whole prefix or suffix groups, but not the EVA character level.
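The four modifications listed above can be written as a small set of rewrite rules. This is only an illustrative sketch: the rule encoding (any position, prefix only, suffix only) and the function names are assumptions, and only the four rules given in the list are included.

```python
# (where, source, replacement); "where" says which part of the word the rule may touch.
RULES = [
    ("any", "ch", "sh"),       # "ch" -> "sh" (one stroke)
    ("any", "e", "ee"),        # "e" -> "ee" (one stroke)
    ("suffix", "edy", "ey"),   # replace the suffix
    ("prefix", "ch", "ok"),    # replace the prefix
]


def apply_rule(word, rule):
    """Apply one rule once; return None if the rule does not match the word."""
    where, src, dst = rule
    if where == "prefix":
        return dst + word[len(src):] if word.startswith(src) else None
    if where == "suffix":
        return word[:-len(src)] + dst if word.endswith(src) else None
    return word.replace(src, dst, 1) if src in word else None


def variants(word, rules=RULES):
    """All words reachable from `word` by one rule application."""
    return [v for v in (apply_rule(word, r) for r in rules) if v is not None]


print(variants("chedy"))
```

Applied to chedy, the four rules give shedy, cheedy, chey and okedy, the same forms as in the list.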

And more fundamentally: it is impossible to produce all things that are theoretically possible. Every choice determines what comes next. After writing "chedy", the scribe sees "chedy". He modifies it to "shedy". Now "shedy" is visible. He modifies that to "shey". Each step constrains the next. The 190 "unmade" modifications were never live options: the scribe was never at a point where those were the natural next step from what was visible on the page. This is why on one page '-edy' forms dominate while on the next page '-ol' or '-aiin' forms do: the visible source words differ, producing different paths. Note: On [link hidden] there is even one paragraph with "-ody" words and one paragraph with "-edy" words.

This is true of any text. In theory Shakespeare could have written an uncounted number of possible plays. He wrote his version of Hamlet. We don't ask "why didn't he write one of the other possible plays or a different version of Hamlet?" because each word chosen determined what came next. The same applies here. The variants we see are the path taken. The variants we don't see are paths the scribe was never standing at the fork to take — or chose not to.
(26-05-2026, 08:50 AM)Torsten Wrote: The calculation of 200 possible edit-distance-1 modifications treats each EVA character as an independent substitutable unit. But the scribe doesn't work in EVA

I also was not using Eva in my example, by treating 'ch' as a single unit.

Interestingly, it does not even make a great difference.
from ch to sh in Eva is a single change, as is the equivalent S to Z in Currier.
Similarly, from ch to cth is a single change, as is the equivalent S to X in Currier.

One could argue that the same is true if one looks at the writing, which is of course the correct thing to do.

My main point was that there are a great majority of possible changes that are (apparently) forbidden.
This existence of a very large set of relatively strict rules strongly suggests that there is still something else going on. It is not just a matter of creating meaningless words based on small changes to previous words.

The rules are non-trivial too. A relatively simple (potential) change from e to a is allowed in some contexts but not in others.
One can introduce an f intruding in a ch before a ch, but not before one or two e's.

Etc, etc.
(26-05-2026, 09:08 AM)ReneZ Wrote: My main point was that there are a great majority of possible changes that are (apparently) forbidden.
This existence of a very large set of relatively strict rules strongly suggests that there is still something else going on. It is not just a matter of creating meaningless words based on small changes to previous words.

I will again argue that the Voynich did not need a huge set of strict rules. When I started trying to understand how a human scribe could produce the Voynich Manuscript, I kept one principle in mind: the method had to be simple enough for a real person to execute repeatedly over hundreds of pages.  I think many people are looking at glyph combinations and assuming something grammatical or deeply linguistic must be happening. I am suggesting that much of what we see may instead be emergent behavior arising from a constrained local production system.

Here is my proposed ledger for Scribe 1. It is redacted to remove the weighting I used in the code. The glyphs on the left are still sorted by starting-letter frequency, from most common to least common.   And, if I got the conversion to bbcode correct, each column's letter should be sorted by scribe preference: most commonly used left to right.

Pick a starting glyph, then choose a legal prefix follower:

c → h

Now go to the H row and choose a legal midfix:

h → o

Then choose a legal suffix from the O row:

o → l

That produces:

c → h → o → l = chol

When mutating a word, check the ledger to validate it:

s → h → o → l = shol

That is also valid. Mutation complete.
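The walk-through above can be sketched as a lookup. The ledger excerpt below contains only the four transitions used in the chol/shol example, and the reading that the first transition is checked against Prefix Followers, the last against Suffix, and everything in between against Midfix is an assumption drawn from that example.

```python
# Excerpt only: the transitions used in the chol / shol walk-through.
LEDGER = {
    "c": {"prefix": {"h"}},
    "s": {"prefix": {"h"}},
    "h": {"midfix": {"o"}},
    "o": {"suffix": {"l"}},
}


def is_legal(word, ledger=LEDGER):
    """Check every glyph-to-glyph step of `word` against the ledger."""
    if len(word) < 2:
        return False
    last = len(word) - 2                      # index of the final transition
    for i, (glyph, follower) in enumerate(zip(word, word[1:])):
        if i == 0:
            column = "prefix"
        elif i == last:
            column = "suffix"
        else:
            column = "midfix"
        if follower not in ledger.get(glyph, {}).get(column, set()):
            return False
    return True


print(is_legal("chol"), is_legal("shol"), is_legal("chlo"))
```

How words of two or three glyphs, or words longer than four, map onto the three columns is not spelled out in the walk-through; the sketch treats all interior steps as Midfix.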


Glyph | Prefix Followers | Midfix | Suffix
a | i r l m n t c | i l r m c d k o e f s p t y | r l m n s y d t i
c | h t k p f s o c | h k t p f | h o y
d | a o c y s e l k d i p t g r | a o e c s y l d r p | y a l s g o m d r
e | e t k s l c a o | o e a k c d s t y p f i | y o s n a e d r p g m t k
h | o e a c k d t y s h p r l i | y o e s l d a r m g h
i | k | i r n k d t l o s m c e p | n r m s l t d y i
k | c o e s a y d | e h c o a s y d | y o a h g
l | o c s k y d t p | d o c a s k e t y f q p r | y s d o g a r m p
o | k t l r d p c e s a f i y m o q h | d k t l c r i e a s p o f y m q | l r s m d y t k g p n f o c e
q | o k e y c p | o
r | o a c y n k s | a o c d i s e k r y | y o l e d g r
s | h o a c y e s k q f t | h a e c o s t k d | y h o e a s m n
t | c o e s a y d h | h c o e a s y d l | y o h a s
y | k t c d p s o f a e l | d k t c s p a e o | d s l r t
f | c o s a | h c o a y d s | y o
g | a
p | c o s y a d | h c o a y s d e | y
n | d o | y o d m e
m | o a | a o d | o y m d g
x
v | o

The actual writing process is even simpler than the ledger itself.  The scribe does not generate words from scratch most of the time. He works from a small visible pool of words already on the current page plus a few nearby source pages or sheets.  Most pages in Scribe 1 are created from 20 to 40 words copied and/or mutated from this source pool.  When you copy a word, there's no rule, just copy the word. When you mutate a word, change 1 letter and validate it with the ledger.  Once you have started a page, copy and mutate words already on that page. 60–75% of tokens are same-page copy/mutate.


Copy/Mutate challenge
I am going to ask you to give this a try. A challenge, if you will. Your goal is simple. You already know what languages look like, even ones you don't speak. Try to make a gibberish page that looks like a real language.

Here are the rules:

Take any four pages from the Scribe 1 herbal section. Make a list of 20 words from those pages. That becomes your working pool.  Create a new page by hand.  Let's say, 85-100 words. 

Start the page with a gallows-initial word. You can simply add a gallows to one of the pool words if the ledger allows it.  If you want a really cool word, pick one that has a gallows as the 3rd or 4th character.

Then:
  • directly copy one of the words from the pool and write it down
  • mutate one of the pool words by a single change
  • validate the mutation against the ledger and if it's legal, write it down. If not, try again.
  • copy another word from the pool
  • mutate another
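For anyone who would rather try the loop on a computer first, here is a minimal sketch of the steps above. It is not the code behind the charts: make_page, mutate, and the is_legal callback are hypothetical names, a "mutation" is taken to be a single-glyph substitution, and the ledger check is passed in as a function so any validator can be plugged in.

```python
import random

GALLOWS = "ktpf"  # EVA gallows glyphs


def mutate(word, alphabet, rng):
    """Change exactly one glyph of `word` to a different glyph."""
    i = rng.randrange(len(word))
    choices = [g for g in alphabet if g != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def make_page(pool, is_legal, n_words=90, seed=0, max_tries=50):
    """Alternate copy and mutate steps; legal mutants join the pool."""
    rng = random.Random(seed)
    pool = list(pool)
    alphabet = sorted(set("".join(pool)))

    # Opening word: add a gallows to a pool word if the validator allows it.
    first = rng.choice(pool)
    if first[0] not in GALLOWS:
        options = [g + first for g in GALLOWS if is_legal(g + first)]
        if options:
            first = rng.choice(options)
    page = [first]

    while len(page) < n_words:
        page.append(rng.choice(pool))              # copy a pool word
        for _ in range(max_tries):                 # mutate, retry if illegal
            candidate = mutate(rng.choice(pool), alphabet, rng)
            if is_legal(candidate):
                page.append(candidate)
                if candidate not in pool:
                    pool.append(candidate)         # the page starts feeding itself
                break
    return page[:n_words], pool


# Example run with a permissive placeholder validator (accepts everything).
start_pool = ["daiin", "chol", "chor", "dy", "chy", "shol", "sho", "cthy", "dain", "dar"]
page, grown_pool = make_page(start_pool, is_legal=lambda w: True, n_words=90, seed=1)
print(" ".join(page[:12]))
```

The example pool uses ten frequent forms from the report earlier in the thread rather than the twenty hand-picked words the challenge asks for, and the accept-everything validator is only there to make the sketch run; a real test would pass in a ledger-based check.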

You just added 2 new words to your pool through mutation.  Now, you can copy them onto the page you're creating or mutate them and add to your pool.  Do not think in terms of “inventing a language”. Think in terms of maintaining a locally consistent visual ecology.  Don't keep repeating vowels or consonants just because the ledger says you can. Try to make words look like real words, make them pronounceable.  And here, we dive into phonetics which I don't go into.  But, there's nothing to stop us from using it.  I use pronunciation to sound out Voynich words.  We all hear daiin or chedy or chol in our heads as sounds.  If we do that, then I suspect the scribe did as well.

Other rules are simple and not strict. If you think your next line will be a new paragraph, you can make this line shorter if you want.  When you start a new paragraph, do the gallows trick and add a gallows to the beginning of a word.  It looks cool. If you make a mistake and write down a letter that's not in the ledger, it's no big deal. Don't backspace; add it to the ledger as a new legal step.

After a surprisingly small number of lines, the page starts feeding itself. Newly created words become source material for later words. The vocabulary begins to grow recursively and locally. And, here's something I bet you'll find.  You and I have enough knowledge of the Voynich that you will not need the ledger for every mutation.  We already know what those word families look like. You can mutate daiin into dain or aiin. 

We know what word length distribution looks like.  Don't put a bunch of 2 letter words in a row.  Add some variation to their length.  Just basic, make it look as realistic as you can.  

You do this for enough pages and it begins to flow like... a language.

That is the core of what I am proposing. Not a 200-page rule book. Not a giant hidden grammar. A small legality system plus local copy-and-mutate behavior operating over a visible working set.


A new theory
And as an addendum, you got me thinking. Looking at the Voynich and its illustrations and all the bizarre oddities we all see... maybe, just maybe, there was one strict rule. Another pint.
  • Whoops, dropped my hot dog on the sheet.  Ah well, ketchup adds authenticity. Another pint... 
  • Naked women holding phallic symbols. Oh, I hope the Pope never sees this. Another pint...
  • Wait, what the hell letter did I just write? Ok, I'll call that a weirdo. It's novelty.  Another pint...
  • Damn, now I wrote the wrong letter in that word.  Oh well, add that one to the ledger.  Another pint...
  • Ohmmm... qokeedy qokeedy qokedy qokedy qokeedy... Shit, I hate when Gregorian chant gets stuck in my head. Another pint...
  • My god, that plant looks like a pensioner who watched a couple of Bob Ross videos painted it. Another pint...

And that sir, is about the best damn theory I've ever come up with to explain the Voynich.  The Copy/Mutate + Pint Theory.  The necessity of alcohol usage in the creation of the Voynich.
(26-05-2026, 02:02 PM)Dunsel Wrote: A small legality system plus local copy-and-mutate behavior operating over a visible working set.

Either I misunderstand the ledger, or it produces a huge number of unattested words, ones that would be expected if using the proposed copy+mutate method. For example, dain is a common word, so its simple mutations should be common, correct?

The following are simple mutations of dain and seem to pass the ledger test, but never appear in the whole manuscript, as far as I know:

dein (doesn't appear in the MS)
daon (doesn't appear in the MS)
diin (absent as a word, otaldiin appears once)
gain (doesn't appear in the MS)
main (doesn't appear in the MS)
daiy (doesn't appear in the MS)

and I think I can generate 10+ more.

Edit: I removed two examples from the list - dais and daid, I probably skipped over them when testing, they do appear at least once each. In any case, dain appears more than 400 times in the MS. 'y' is a very common word ending character. daiy is valid according to the ledger, I would expect dozens if not a hundred of daiy's in the MS, why aren't there any?
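The ledger test for these candidates can be reproduced with a short sketch. Two assumptions are involved: the cell boundaries of the six ledger rows copied below are my reading of the posted table, and the first, interior, and last transitions of a word are checked against Prefix Followers, Midfix, and Suffix respectively. gain is left out because the g row has a single cell and its column cannot be told from the post.

```python
# Rows needed for the check: glyph -> (prefix followers, midfix, suffix).
ROWS = {
    "a": ("irlmntc", "ilrmcdkoefspty", "rlmnsydti"),
    "d": ("aocyselkdiptgr", "aoecsyldrp", "yalsgomdr"),
    "e": ("etkslcao", "oeakcdstypfi", "yosnaedrpgmtk"),
    "i": ("k", "irnkdtlosmcep", "nrmsltdyi"),
    "m": ("oa", "aod", "oymdg"),
    "o": ("ktlrdpcesafiymoqh", "dktlcrieaspofymq", "lrsmdytkgpnfoce"),
}
# Every glyph mentioned in those rows is a possible replacement.
ALPHABET = sorted(set(ROWS) | {g for row in ROWS.values() for cell in row for g in cell})


def is_legal(word):
    """Check each step of `word`; glyphs whose row was not copied count as illegal."""
    if len(word) < 2:
        return False
    last = len(word) - 2
    for i, (glyph, follower) in enumerate(zip(word, word[1:])):
        column = 0 if i == 0 else 2 if i == last else 1
        row = ROWS.get(glyph)
        if row is None or follower not in row[column]:
            return False
    return True


def single_substitutions(word):
    """All words that differ from `word` in exactly one position."""
    return [word[:i] + g + word[i + 1:]
            for i in range(len(word)) for g in ALPHABET if g != word[i]]


legal = sorted(v for v in single_substitutions("dain") if is_legal(v))
for candidate in ["dein", "daon", "diin", "main", "daiy"]:
    print(candidate, candidate in legal)
```

Because only six rows are included, the resulting list is a lower bound on what the full ledger would accept; insertions and deletions are not enumerated at all.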