The Voynich Ninja

Full Version: A One-Page Ledger Method for Generating Voynich-Like Text
(27-05-2026, 01:50 PM)rikforto Wrote:
(27-05-2026, 12:29 PM)Dunsel Wrote: You keep moving into paleography, and that's not what I'm doing here. I'm not a paleographer and would be a fool to claim such.

I think trying to generate text without a theory that accounts for the [link omitted], either implicitly or explicitly, is going to stand out to people who have taken the time to understand the script as missing key details. I appreciate that most of us are laypersons, and the experts talk about the problems of "silo-ing", but the text does not appear to be wholly independent of the paleography. It is a fair criticism to say that your one-page ledger doesn't address core features of the text. To be sure, I don't think you have to adopt the CLS wholesale---I have some quibbles with how he treats EVA <l>, for instance---and Cham was not the first to observe the phenomenon, nor was his statement definitive. Likewise, there might be other ways to approach the issues raised by the CLS without relying on it specifically. However, the basic paradigm, that the first half of words have symbols based on EVA <e> and the second on EVA <i>, seems to hold. Your ledger system fails to capture these features and, to my eye, that looks quite far off the text. I don't think it's much of a defense from these criticisms to say your approach is incomplete as much as it is a recognition that they have a lot of merit.

I looked over that CLS and again, that's paleography. Not my bailiwick. But here's what I see in it. He is saying that Voynich glyphs are made of component strokes and are constrained. That makes copying, mutation, word families and word constraints more plausible, not less.  CLS could very well fit under a ledger model as a lower level constraint system.  Does my generator violate CLS? Yes.  Does that mean the constraint system of my ledger is fundamentally wrong? No. It may mean that, if CLS is correct, then it's not taking the lower level constraint system of CLS into account. CLS is zoomed way in at the character creation level.  I am zoomed way out at the production level.

Furthermore, he jumps to some really dubious conclusions: "Since the Voynich Manuscript’s text does not seem to fit a natural language in these tests, nor is it random, then it must be artificial, in which case there is no reason for CLS not to fit."  That's a logical fallacy called a false dichotomy: because something doesn't fit description A, it must be description B.  He's concluding that there is no option C or D or any other.  He doesn't test against a shorthand or a mnemonic structure or various cryptographic structures like Naibbe.  Furthermore, I saw no stroke-by-stroke comparison to an actual period manuscript. Claiming the Voynich scribe had stroke habits without showing that other manuscripts lack such habits is a huge gaping hole.

So, if my ledger doesn't conform to CLS, well, that's because CLS is not an established fact in my opinion. 

Is his theory right or wrong? I don't know. There aren't enough facts there for me to make a decision. Do I feel obligated to include it in order to prove my point?  Nope, not yet anyway.

Edit: The best way to describe this is, I am artistically impaired. Data mining? I got that. Show me the numbers.
(27-05-2026, 03:10 PM)Dunsel Wrote: Does my generator violate CLS? Yes.  Does that mean the constraint system of my ledger is fundamentally wrong? No. It may mean that, if CLS is correct, then it's not taking the lower level constraint system of CLS into account.

Crucially, however, your ledger, the one we are supposedly talking about, does not address these kinds of "lower-level" constraints. And that quite simply raises questions about the applicability of these findings to the VMS.

I also think you're selling the CLS short. These aren't "stroke habits", but a fundamental observation about how letterform and letter order correlate. With some exceptions---and a good deal of the paper is spent defining those exceptions---letters with a base of e precede letters with a base of i. This kind of ordering of letters is utterly atypical. Cham arguably could have done a better job linking this to the bigram entropy findings, which have amply shown that period manuscripts did not order letters like this, but there are few writing systems where letters from one half of the alphabet show up in the former part of the word and the remainder in the latter, and certainly not in European corpora, and I think we can extend his write-up some charity on that score. Even if you think it's just a "stroke habit", it's fair to say a good ledger should account for it, and a fair criticism to note it doesn't.

His conclusion doesn't much matter to the point here, which is that your ledger does not respect the letter-ordering phenomenon. A ledger that does not substantially reproduce letter order is failing to capture one of the more striking parts of Voynichese text.
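For anyone who wants to put a number on the ordering claim, here is a minimal sketch of how a generator's output could be checked against it. The two glyph classes below are placeholders (just EVA e and EVA i), not the actual CLS classes or its exceptions, and treating each EVA character as one glyph is a simplification. The function names are hypothetical.

```python
# Placeholder classes for illustration only. The real CLS classes and
# exceptions come from Cham's write-up and are NOT reproduced here.
E_BASED = frozenset("e")
I_BASED = frozenset("i")


def violates_order(word, first=E_BASED, second=I_BASED):
    """Return True if a `first`-class glyph appears after a `second`-class glyph."""
    seen_second = False
    for glyph in word:
        if glyph in second:
            seen_second = True
        elif glyph in first and seen_second:
            return True
    return False


def violation_rate(words, first=E_BASED, second=I_BASED):
    """Share of words that break the 'first class before second class' ordering."""
    words = list(words)
    if not words:
        return 0.0
    return sum(violates_order(w, first, second) for w in words) / len(words)
```

Running `violation_rate` on a transliteration and on generated text with the same class definitions would give two comparable numbers, which is all this sketch is meant to show.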
(27-05-2026, 03:54 PM)rikforto Wrote: Crucially, however, your ledger, the one we are supposedly talking about, does not address these kinds of "lower-level" constraints. And that quite simply raises questions about the applicability of these findings to the VMS.

I also think you're selling the CLS short. These aren't "stroke habits", but a fundamental observation about how letterform and letter order correlate. With some exceptions---and a good deal of the paper is spent defining those exceptions---letters with a base of e precede letters with a base of i. This kind of ordering of letters is utterly atypical. Cham arguably could have done a better job linking this to the bigram entropy findings, which have amply shown that period manuscripts did not order letters like this, but there are few writing systems where letters from one half of the alphabet show up in the former part of the word and the remainder in the latter, and certainly not in European corpora, and I think we can extend his write-up some charity on that score. Even if you think it's just a "stroke habit", it's fair to say a good ledger should account for it, and a fair criticism to note it doesn't.

His conclusion doesn't much matter to the point here, which is that your ledger does not respect the letter-ordering phenomenon. A ledger that does not substantially reproduce letter order is failing to capture one of the more striking parts of Voynichese text.

In order to map the basic statistics of the Voynich, those lower-level constraints are mostly irrelevant. If I want to produce a word length distribution, I don't need to know which hand the scribe held the quill in. IF I were telling you that I can exactly reproduce the Voynich, then I'd better have some pretty amazing details that may even go down to the stroke level. I am not saying that at all.

I don't think I'm selling anything short. There are issues with his interpretation and in the world of science, charity is not something to be handed out. Yes, he did a lot of good work on the Voynich and he's providing tables that are images, but he's not providing the data. None of the code he used or created is available for me to independently come to the same conclusions. I can't run any of those tests.  I'd have to guess.  His solution for describing his math is to point me to a Wikipedia page. Now, I am not saying he's wrong. What I'm saying is, that to convince me that it's something I need to consider for my ledger, he needs to provide much better proof. 

And I have done a lot of work on the Voynich. But I'm not using logical fallacies to try to convince you of what I'm suggesting. Instead, I'm giving everyone access to the exact data and code I used to produce the results. Download it, test it, and if you don't like the results, there are tons of knobs to turn to see if you can get better ones. Still not happy?  Shove my code into Codex and create your own generator. You think CLS is valid?  Fine, shove some code into my generator to emulate it and see what happens.  If you manage to produce perfect Voynich, great! I'll be happy with that result. But until he produces some code that explains how those words are put together, I'm not going to try to guess at his methods.

And my ledger does honor LEGAL letter ordering based on what's in the Voynich. No, it is not perfect.  I am not trying to produce a "striking" result.  If I had a striking result, I wouldn't be on here, I'd be talking to a publisher.  What I'm trying to produce is a plausible STATISTICAL result.
In my paper and in this forum I believe I have further developed the work of Timm & Schinner and provided explanations for the following:

Dense local edit-distance connectivity
  • Why so many Voynich words differ by only one mutation (ED1).
  • Why families cluster locally.
Copy/mutate production behavior
  • Words being derived from nearby prior words rather than independently invented.
  • Local propagation of forms through insertion/deletion/substitution.
Restricted glyph adjacency
  • Why some glyph combinations are common and others effectively impossible.
  • Legal transition structure within tokens.
Word family ecology
  • Why forms like daiin/daiir/dair/etc. behave as mutation neighborhoods.
  • Persistence and expansion of lexical cores.
Positional behavior
  • Different behaviors at line start, paragraph start, internal positions, etc.
  • Gallows concentration patterns.
Gallows distribution
  • Why gallows cluster at paragraph and line starts.
  • Why they behave differently from ordinary glyphs.
Sheet-level locality
  • Why source relationships collapse strongly at sheet/quire level rather than purely page-to-page.
Two-sheet / three-sheet source packet behavior
  • Why many Scribe 1 pages reduce to a very small dominant source pool.
Local lexical continuity
  • Why neighboring folios share mutation neighborhoods and recurring cores.
Currier A vs Currier B (Scribe 1 vs Scribe 2+) regime separation
  • Different lexical environments and mutation ecologies between early and later manuscript regions.
Currier-like regime drift
  • Statistical shifts across manuscript regions without requiring a language change.
Word length distributions
  • Approximate Voynich-like token length behavior.
Zipf-like statistical structure
  • Non-random frequency falloff emerging from recursive reuse and mutation.
Vocabulary development
  • How a copy/mutate system alone can generate a vocabulary size comparable to the Voynich.
Hapax generation
  • How a copy/mutate system can generate hapax token counts comparable to the Voynich.
High repetition without exact monotony
  • Why the text is repetitive but not trivially repetitive.
Human-feasible manuscript production
  • A practical workflow a medieval scribe could actually execute repeatedly with limited tools and memory requirements.
Mutation residue
  • Why some forms appear isolated or weakly connected after many mutation generations.
Cross-page persistence of lexical cores
  • Why certain high-frequency forms remain stable over long spans.
Emergence of pseudo-language structure
  • How language-like statistics can emerge without underlying semantic language encoding.
Why the manuscript resists simple random models
  • The text is structured, but the structure may arise from constrained generation rather than natural language.
Why Voynichese can look internally coherent
  • Recursive reuse naturally creates apparent grammatical consistency.
Why new forms rarely become completely illegal
  • Mutation constrained by adjacency legality prevents explosive randomness.
Why the manuscript feels “self-referential”
  • Because production continually feeds on prior output.
How a small seed can bootstrap a large corpus
  • Recursive expansion from limited initial material.
Why generated text can resemble Voynich statistically without semantic decoding
  • Statistical resemblance does not require translation or plaintext recovery.
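Several items in the list above (ED1 neighborhoods, copy/mutate production, restricted glyph adjacency, new forms rarely becoming illegal) can be illustrated together. The following is a minimal sketch only, not the ledger generator discussed in this thread: it copies a word from a recent window, optionally applies one edit, and keeps the result only if every adjacent character pair was already seen in the seed. The parameter names (`window`, `mutate_prob`, `max_tries`) are hypothetical, characters stand in for glyphs, and the seed list must not be empty.

```python
import random


def legal_bigrams(seed_words):
    """Collect every adjacent pair seen in the seed; ^ and $ mark word edges."""
    legal = set()
    for w in seed_words:
        padded = "^" + w + "$"
        legal.update(zip(padded, padded[1:]))
    return legal


def is_legal(word, legal):
    padded = "^" + word + "$"
    return all(pair in legal for pair in zip(padded, padded[1:]))


def mutate_once(word, alphabet, rng):
    """One random insertion, deletion, or substitution (edit distance at most 1)."""
    op = rng.choice(["insert", "delete", "substitute"]) if word else "insert"
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice(alphabet) + word[i + 1:]


def generate(seed_words, n_tokens, window=20, mutate_prob=0.5, seed=0, max_tries=50):
    """Copy from the last `window` words; mutate only into adjacency-legal forms."""
    rng = random.Random(seed)
    legal = legal_bigrams(seed_words)
    alphabet = sorted({c for w in seed_words for c in w})
    out = list(seed_words)
    while len(out) < len(seed_words) + n_tokens:
        source = rng.choice(out[-window:])
        word = source  # fall back to a plain copy if no legal mutation is found
        if rng.random() < mutate_prob:
            for _ in range(max_tries):
                cand = mutate_once(source, alphabet, rng)
                if cand and is_legal(cand, legal):
                    word = cand
                    break
        out.append(word)
    return out[len(seed_words):]
```

Because every accepted word is either a copy or a legal one-edit neighbor of a nearby word, dense ED1 families and the absence of illegal pairs both fall out of the loop rather than being imposed separately.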

However because my paper, my posts and my generator do not even TRY to explain:
  • Exact CLS ordering behavior
  • Stroke-level paleography
  • Scribal motor habits
  • Full glyph-class asymmetry
  • Exact entropy profile reproduction
  • Semantic meaning
  • Encoding/decoding of plaintext
  • Perfect Voynich reproduction
  • Every positional phenomenon
  • Every rare glyph behavior
  • Exact Currier separation
  • Illustration-text relationships

...some responses have declared the work a failure.


I did not set out to produce a perfect reconstruction of the Voynich Manuscript. I set out to investigate whether a constrained copy/mutate system with limited working rules could reproduce a reasonable proportion of the manuscript's statistical and structural behavior in a very human doable format. At this point I believe the answer to that question is yes, since much of the criticism now being directed at this work concerns aspects of the Voynich which the current model was never intended to reproduce.

It seems that no matter how many times I state that this is a statistical model, I keep getting dragged into debates about why its visual appeal is lacking.

Therefore, unless future criticism addresses the actual statistical and generative scope of this work - namely the copy/mutate + ledger model and constrained copy/mutate generation in general - rather than demanding full visual, paleographic, or semantic reconstruction, I am unlikely to spend much more time responding to those objections.

While I do greatly appreciate feedback, I am primarily interested in criticism directed at the actual claims and scope of the model, rather than features that are well beyond that scope. I also welcome suggestions on how the system could be improved in the future to better reproduce the manuscript visually, but I will not consider the current model a failure simply because it does not yet do so.
I played around with your “Ledger Generator” for a bit. Specifically, I parsed my 485 tables on [link omitted] into JSON and converted them to your format. The whole thing is (obviously) syllable-based. Does the result look “Voynich-like” enough, or not?

============================================================
PAGE 48
============================================================

otaiin sheckeey otey chinal okeey okachey ytol cholshkaiin
chinal olkydy okeey oteydy chearain qoty olkydy cholshkaiin
qokear qokal chcphedy aiildy aiildy cholshkaiin oteolkeeody chdy
ykaiin sheydy sheeodain shoiin shody qofchdaiin qokedy ytal
qotey shofshedy ytedy cheodchy daiin olol olkydy chinal
chedyar ykchy okchdy sheydy choraiin cheekey olkeshey qotchkeey
chcpheor olol okcheody qokainal otedeey sheorol otey qoorchy
qokainal ololal ytain shckhchy choal qoeol chosaroshol choal
oteochedy olor chinal okeoldy daky sheydy okeolar cheoikhy
qopchedal sheeydy cheeytal chty qokedyol olfsheoral cheoetey shepchedy
chokody shechy shedy shoheaiin okeodal aiir choky sheaiin
sholtchey olkeal otey chcfhy qotir otedy qockhal

============================================================
PAGE 49
============================================================

shiinol sheody shekody okiin okeeodar qokeey chcphhdy chky
shockhhy shiinol ytaiin oteol oteedchey okar qokchol cheey
otolal qokeo cheey ykeedy qoeol otcheedaiin sheolkchy cheeydy
otolor aiirody qokeedyol chpchy okaiinol okalaiin daiir okchy
qokadyol sheeoky aiirody qoteeedy oleeolar qokainal qoty otal
cheol oteeschey okeeol oteedain otoaiin okar choor shoor
oteedy oteeo chodaal charal qokeol ykedy shoky cheey
chedy qoteol qool shiinol cheor sheckeey sheain dalalody
sheckhdy oleedar okedy chotoey oldyol olcphy otshaiin olol
otechdy cheoor daiinal qopdaiin qoty chpsheedy cheor okedy
chedy otedy otolopaiin okachey ykolpshy qoksheedy choty qokeeydy
sholkeedy oteeydy okeal chedyol chocfhy ytol chokchey
(27-05-2026, 06:24 PM)Dunsel Wrote: In my paper and in this forum I believe I have further developed the work of Timm & Schinner and provided explanations for the following:

I think you may have missed my last reply because it was at the bottom of a page, but I'd also like to say here a bit about why I personally don't find this research direction interesting (I'm talking about my perspective only) in the context of Timm & Schinner.

There are two different aspects of Timm & Schinner work for me:

1) The finding that some of the features of the manuscript can be explained by the process of copying and mutating past word forms. 

2) Attempts to build a text-generator that produces Voynich-like text via copying and mutating.

I find the first one very interesting, especially when considering the manuscript as a ciphertext.

Copy and mutate can be an essential optimization when using certain cipher types for longer texts. For example, consider a homophonic substitution with nulls. If there is a need to repeat a plaintext phrase or word combination used previously, it can be both easier and more secure to just copy it over from already prepared ciphertext while changing a few homophonic assignments and rearranging/replacing the nulls. This way you don't have to focus on remembering all character assignments and at the same time you make sure that the ciphertext in two locations differs significantly enough, to avoid creating two identical pieces of ciphertext that could compromise the cipher. If your draft is the ciphertext written directly under the plaintext, finding already encoded words and word combinations and copying them over with homophonic/null adjustments is arguably the fastest way to encode large texts. A substantial portion of text can be enciphered this way. If I wanted to encipher this post, each time I had to write "ciphertext" I would just look up the previous place where I wrote "ciphertext", then I would change a couple of letters to other homophones, add/discard a null or two and write it down. Because of this possible scenario, I have a lot of interest in the autocitation research that focuses on individual, potentially verifiable examples of autocitation.
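The scenario described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the key (two numeric homophones per letter, three null symbols) is invented for the example, plaintext is limited to lowercase a-z, and the function names are hypothetical.

```python
import random
import string

# Invented toy key: every letter gets two homophones, "00".."51"; nulls are separate.
HOMOPHONES = {ch: [f"{2 * i:02d}", f"{2 * i + 1:02d}"]
              for i, ch in enumerate(string.ascii_lowercase)}
NULLS = ["90", "91", "92"]
DECODE = {sym: ch for ch, syms in HOMOPHONES.items() for sym in syms}


def sprinkle_nulls(symbols, rng, null_prob):
    """Insert a null before some symbols, chosen at random."""
    out = []
    for s in symbols:
        if rng.random() < null_prob:
            out.append(rng.choice(NULLS))
        out.append(s)
    return out


def encipher(plain, rng, null_prob=0.2):
    """Encipher from scratch: pick a homophone per letter, then add nulls."""
    return sprinkle_nulls([rng.choice(HOMOPHONES[ch]) for ch in plain], rng, null_prob)


def decipher(symbols):
    return "".join(DECODE[s] for s in symbols if s not in NULLS)


def copy_and_mutate(prev, rng, swaps=2, null_prob=0.2):
    """Reuse earlier ciphertext: swap a few homophones and re-place the nulls."""
    body = [s for s in prev if s not in NULLS]  # discard the old nulls
    for i in rng.sample(range(len(body)), min(swaps, len(body))):
        others = [s for s in HOMOPHONES[DECODE[body[i]]] if s != body[i]]
        if others:
            body[i] = rng.choice(others)
    return sprinkle_nulls(body, rng, null_prob)
```

For example, `first = encipher("ciphertext", rng)` followed by `second = copy_and_mutate(first, rng)` yields two symbol sequences that both decipher to the same word but differ in at least the swapped positions, which is the near-duplicate pattern the post has in mind.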

However, I consider the second part, complete text generation by copy and mutate as the primary method, to be a dead end, for the reasons I explained in my previous posts: even if the manuscript is meaningless and not a cipher and was indeed generated using copy and mutate, it's overwhelmingly likely there will never be a conclusive way of proving this or finding the correct set of rules.
(28-05-2026, 01:13 PM)bi3mw Wrote: I played around with your “Ledger Generator” for a bit. Specifically, I parsed my 485 tables on [link omitted] into JSON and converted them to your format. The whole thing is (obviously) syllable-based. Does the result look “Voynich-like” enough, or not?

I have stated that I am artistically impaired. Does it look Voynich-like to me?  Yes, but I have learned that my eyes can easily fool me when it comes to identifying Voynich.  Does it statistically match the Voynich?  I can almost instantly see the lack of short and 1-character words and some long words. My first guess is that the length distribution would be off, which will likely throw any Zipf curve off.  And I can easily see you're using Scribe 2 and Scribe 1.  You have both <ed> and <ho> on those pages.  Look at my next reply to oshfdk.  You'll see I tried the same syllable approach without even knowing about your work and the immediate issues it had.

One thing I will suggest if you plan to keep digging is, limit your work initially to scribe 1 or Currier A pages only.  Scribe 2+ uses the same underlying system but the combinations are different enough to throw off any data that examines the whole Voynich or specific sections.  If you haven't already, look at the links to posts I made here about <ed>.  Links to those posts are in the op.
(28-05-2026, 01:56 PM)oshfdk Wrote: I think you may have missed my last reply because it was at the bottom of a page, but I'd also like to say here a bit about why I personally don't find this research direction interesting (I'm talking about my perspective only) in the context of Timm & Schinner.

I very well may have missed that post and my apologies for doing so.

There is nothing in my work that says it can't be a cypher.  I have mentioned in another reply that a whole production of gibberish is very unsatisfying.  I still have this underlying hope/belief that it does contain actual content.  My working theory is that it's mnemonic which you could say is a form of cypher.  The point of the generator is to explore the copy/mutate possibility. To locate the sources for these pages, the parent words, and offer a possibility.  If the sheet/quire source method I think I've discovered leads to a decipherment, I'm all in and I truly hope it helps you in that quest. Let me know what I can do to further that.  And you are correct, it may very well be a dead end. But, in my opinion, if you don't know the neighborhood the best way to find out if that street is a cul-de-sac is to actually drive down that street and look.  And so far, I am aware of methods that can produce Voynich-looking text that fails statistically.  I'm trying to approach from the other end.  To succeed statistically and THEN match visually.

Also, you piqued my curiosity with your bigram smashing generator idea.  I actually pursued that yesterday.  I had codex create a small machine learning script that would run through my data and, instead of bigrams/trigrams, it tried to break words into syllables and create a set of weighted tables.  The thought there was, a scribe would understand pronunciation and syllables rather than linguistic terms like bigram, n-gram, etc.  And I've always had this idea that the Voynich can be pronounced. I then rearranged your generator idea using those ML results.  Here's what I got.

============================================================
PAGE 97
============================================================

chckhol qotchody okchees chal chos shody ykchckhey otshes
chody okchees oteoky chckheey ypshor oteodaiin ypshor choty
chees shor chckhol chochor chopchor shey choschochor chees
chopchol qokeeor chal chody qokeor qotchaiin qokchar cheees
cheeaiin chees shey chopydaiin chody otar okor ypshody
yshody chopchos ykchcthey shokydaiin ykchey cheeey chos chees
chckhor choky qokchol ykchey chos shody yshor chochor
otol qokchaiin qokchor shody chcthod qotchaiin qokeeoky cheeaiin
qokchol chos chochor qocthod chopchol shey qokcheey ykeees
qokchor chochor qotody okchy chodalchy shey otaiin chody
chckhey okchody chckheol chody chal chckhey shoteodaiin qotchaiin
shotar chcthol yfod chcthol chal yckhey qotchody

============================================================
PAGE 98
============================================================

ykchey yfod qotchaiin chcthaiin chckhey shoteeaiin otol qotchaiin
qokeey chcthody chckhey shes chody choty chopydaiin chckhor
shodan otchodalchy chcthor chees chocthy shey chopchey qokeey
chcthol shey ykchopchey okshor chokchol otchodalchy chopor okchal
chockhy chodalchy ykchopchey otchockhy shod chcthal chocthy qokeeodaiin
qokody chopydaiin cheees chckhey chos ykam qotchody shey
qochckhol okshor chockhody shotaiin ykeees chckhoal cheeaiin otchoty
qokodalchy chody otchoty shod cheees okeeaiin shor qokchaiin
chees qockhol chal chody ykam chees chcthaiin ykchal
qokeor qotchody qochckhol qotchaiin choty qotody chal qocthey
cheeey qokcheey qocthey chckheey yckhey qokchor chckhody qokchaiin
shor chees qokchor chckhey chopydaiin qokeaiin shod

That uses no seed page, just "syllables".  It does use a type of ledger with a lot of weighting.  So you can see, it does LOOK a lot more like Voynich than my generator. However, the statistics were WAY off.  In particular, the vocabulary size and hapax count.  I could likely have continued working on it and gotten those numbers closer to Voynich, but, it would have required some pretty serious memory juggling to produce it unless... they had a method for pronouncing Voynich words which we will likely never prove.
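A toy version of a weighted chunk-table generator shows why vocabulary size and hapax count suffer with this approach. This is a sketch under loud assumptions, not the Codex-produced script: the chunks are taken from the output above, but the three-slot structure, the slot assignments and every weight are invented for illustration.

```python
import random

# Invented tables: chunk spellings appear in the thread, weights and slots do not.
TABLES = {
    "start": {"ch": 5, "sh": 3, "qo": 3, "ok": 2, "ot": 2, "yk": 1},
    "middle": {"": 4, "ckh": 2, "cth": 1, "ee": 2, "kch": 1, "op": 1},
    "end": {"ey": 4, "ody": 3, "or": 3, "ol": 2, "aiin": 2, "es": 2, "y": 1},
}


def pick(table, rng):
    """Draw one chunk with probability proportional to its weight."""
    chunks = list(table)
    return rng.choices(chunks, weights=[table[c] for c in chunks], k=1)[0]


def make_word(rng):
    return "".join(pick(TABLES[slot], rng) for slot in ("start", "middle", "end"))


def make_page(rng, lines=12, words_per_line=8):
    return [[make_word(rng) for _ in range(words_per_line)] for _ in range(lines)]


def max_vocabulary():
    """Upper bound on distinct words: product of the table sizes."""
    size = 1
    for table in TABLES.values():
        size *= len(table)
    return size
```

With fixed tables the number of distinct words can never exceed `max_vocabulary()` (252 for the toy tables here), no matter how many tokens are generated, and frequent combinations crowd out rare ones. That is consistent with the low type and hapax counts reported below, though the real script's tables are larger.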

Code:
============================================================
GLOBAL STATISTICS
============================================================

Tokens                : 9500
Types                  : 379
TTR                    : 0.0399
Hapax                  : 59
Chunk uses            : 24337
Chunk types used      : 271

Word length distribution

4  1830
5  1239
6  1622
7  2159
8  1520
9  791
10  211
11  127
12  1

Top 30 words

chody          383
shey            358
chckhey        310
cheees          298
chos            293
shor            271
chees          211
qokchor        209
chal            168
chckhol        165
qokeey          160
shody          156
choty          147
chckhody        144
shod            124
otol            118
cheeaiin        109
qotchaiin      108
chcthod        104
chodalchy      103
shes            102
chcthey        100
chockhy        99
cheeey          99
qokchaiin      90
qokchol        89
okor            85
qokcheey        84
qokeeor        80
chopchol        78

Top 30 chunks used

ch        3333
sh        1735
qo        1705
ey        1406
ckh        1056
ody        926
or        920
ok        793
ee        779
ot        726
cho        722
cth        687
ol        604
kch        591
aiin      570
es        562
y          448
ke        414
od        376
al        360
yk        349
os        315
chy        274
tch        259
ees        249
op        249
eey        203
eo        191
ar        186
ty        185

Top 30 character bigrams

ch    6311
ho    4805
he    2686
ee    2417
ok    2375
qo    1967
sh    1960
od    1924
ey    1923
hc    1590
ot    1453
or    1345
kc    1329
ck    1324
kh    1324
dy    1142
ai    951
es    928
ol    875
in    866
ct    851
th    851
ha    840
ii    835
ke    759
oc    673
al    603
eo    529
hy    525
da    460

Top 30 character trigrams

cho  2659
hod  1602
chc  1590
kch  1329
ckh  1324
hey  1293
qok  1231
che  1214
ody  1142
sho  1113
hee  1095
okc  1061
hor  995
hck  926
cth  851
aii  834
iin  759
ees  753
she  745
oke  719
hct  664
eee  661
kho  626
hol  579
khe  568
kee  566
eey  547
cha  522
qot  428
otc  419
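For readers who want to produce the same kind of report for their own output, here is a minimal sketch. It assumes the tokens are already split on whitespace and that bigrams and trigrams are counted inside words only (the report above does not say how it counts them); the function name and dictionary keys are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count character n-grams inside each token (no cross-word n-grams)."""
    return Counter(w[i:i + n] for w in tokens for i in range(len(w) - n + 1))


def global_stats(tokens, top=30):
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return {
        "tokens": n_tokens,
        "types": len(counts),
        # Type/token ratio: distinct words divided by total words.
        "ttr": len(counts) / n_tokens if n_tokens else 0.0,
        # Hapax: words that occur exactly once in the whole run.
        "hapax": sum(1 for c in counts.values() if c == 1),
        "length_dist": dict(sorted(Counter(map(len, tokens)).items())),
        "top_words": counts.most_common(top),
        "top_bigrams": ngrams(tokens, 2).most_common(top),
        "top_trigrams": ngrams(tokens, 3).most_common(top),
    }
```

As a consistency check on the report above, 379 types over 9500 tokens gives a TTR of about 0.0399, and the word length counts sum to 9500.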
(28-05-2026, 03:21 PM)Dunsel Wrote: I have stated that I am artistically impaired. Does it look Voynich like to me?  Yes.  Does it statistically match the Voynich?

I'm not sure exactly which statistic you're thinking of, but you can check it yourself.
(28-05-2026, 05:09 PM)bi3mw Wrote:
(28-05-2026, 03:21 PM)Dunsel Wrote: I have stated that I am artistically impaired. Does it look Voynich like to me?  Yes.  Does it statistically match the Voynich?

I'm not sure exactly which statistic you're thinking of, but you can check it yourself.

Well, it would take a longer run and a comparison to the Voynich.  Already, word length distribution is going to be off if the rest of the pages look like those.  The Voynich has a vocabulary of known unique words.  Any generator is going to have to create a similar vocabulary.  Hapax tokens.  The Voynich has a LOT of words that only occur once anywhere.  You'd have to match that rough count compared to the Voynich.  And, here's the big one.  If you create 10,000 pages of Voynich text, does it collapse into Markov chain nonsense?  That alone isn't a failure of the generator but, if it remains reasonably stable even generating that many pages, then it becomes a production system that doesn't collapse.  Those are the basic numbers.  Then you have other things to consider like gallows usage. Some words start with a gallows, some have internal gallows.  Words with initial gallows tend to start pages and "paragraphs."  Are you matching those numbers closely?  Voynich has just tons of possible statistics.  A generator needs to match as many as possible and hopefully... it does it through emergent behavior and not by being forced.  Bigram/Trigram counts. Do yours match the Voynich "mostly"?
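Two of the checks mentioned above, bigram agreement and gallows-initial words at line starts, can be sketched as follows. This is one possible way to measure them, not the method used in the paper: total variation distance is a swapped-in choice of metric, only the plain EVA gallows t, k, p, f are considered (bench gallows such as cth/ckh are ignored), and the function names are hypothetical.

```python
from collections import Counter

GALLOWS = frozenset("tkpf")  # plain EVA gallows only


def bigram_freqs(tokens):
    """Relative frequency of each within-word character bigram."""
    counts = Counter(w[i:i + 2] for w in tokens for i in range(len(w) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}


def total_variation(p, q):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def gallows_initial_rates(lines):
    """Return (rate among line-initial words, rate among all other words)."""
    first = [line[0] for line in lines if line]
    rest = [w for line in lines for w in line[1:]]

    def rate(words):
        if not words:
            return 0.0
        return sum(w[0] in GALLOWS for w in words if w) / len(words)

    return rate(first), rate(rest)
```

Comparing `total_variation(bigram_freqs(generated), bigram_freqs(reference))` across runs of different lengths gives a rough answer to whether bigram counts match "mostly", and a large gap between the two values returned by `gallows_initial_rates` indicates the positional clustering described above.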