Options

A One-Page Ledger Method for Generating Voynich-Like Text

Index
A One-Page Ledger Method for Generating Voynich-Like Text
RE: A One-Page Ledger Method for Generating Voynich-Like Text

oshfdk > 26-05-2026, 05:04 PM

(26-05-2026, 04:44 PM)Dunsel Wrote: You are not allowed to view links. Register or Login to view.I think the Vounich is implausible. If it is copy mutate, the best you can hope for is statistically comparable
Everyone hopes for an exact explanation or reproduction. What if there isn't one?

If there isn't one, then there isn't one. I understand that a lot of people see the need for some closure, but personally I'm ok with not knowing how the manuscript was created, at least for now.

(26-05-2026, 04:44 PM)Dunsel Wrote: You are not allowed to view links. Register or Login to view.Try my challenge in a previous reply. See if you can make statistically correct Voynich with it.

I think I did exactly that in my reply, it broke down immediately. I took a very common word "dain" and I identified that introducing some simple changes to it that pass the ledger produces words that not only are unattested in the manuscript, but also don't conform to CLS framework (which is ok in rare cases, but not generally). This challenge doesn't produce statistically correct Voynichese.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

Dunsel > 26-05-2026, 05:19 PM

(26-05-2026, 05:04 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.I think I did exactly that in my reply, it broke down immediately. I took a very common word "dain" and I identified that introducing some simple changes to it that pass the ledger produces words that not only are unattested in the manuscript, but also don't conform to CLS framework (which is ok in rare cases, but not generally). This challenge doesn't produce statistically correct Voynichese.

No, you're trying to make it look Voynich. Don't. Just try to make it look like a language. Create your own visually appealing language.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

oshfdk > 26-05-2026, 06:09 PM

(26-05-2026, 05:19 PM)Dunsel Wrote: You are not allowed to view links. Register or Login to view.No, you're trying to make it look Voynich. Don't. Just try to make it look like a language. Create your own visually appealing language.

So, if I understand it correctly, you give me a set of rules (copy + mutate and the ledger) and you want me to add on top of this a set of my own preferences and then produce some text and you claim that this text would bear statistical resemblance to Voynichese no matter what my preferences are? This is clearly wrong. Suppose my preferences are: most common words should be 1-3 characters long and repeat often, there should be only a few of them, longer words should variate more but occasionally also repeat close to one another.

Seeds: daiin,sary,qokain,otedy,okol

Text: sar oko qokair oko daiir sar qotedy daiiny or qokainy sar oko san qotey o daiir otedysary oko saiin qotain or o sary daiir qotair sar oko otedy otedysar daiiny okol sary qotady sar oko okedy o daiir qokain daiin qokairy qotey o daiir otesy sar oko qokair.

And naturally, if I wanted to make it look like a real language, I'd make sure that "sar oko" and "or" and "o" keep repeating thought the whole text, while "ar" and "sa" do not appear often, because that's the way most (all?) languages work when it comes to short words. If you think about it, I have to fight against copy + mutate here, defaulting to plain copy in most cases and feeling unnecessary restricted by mutation rules to keep going, because copy + mutate is not a good way to create plausible looking language. Just creating a bunch of random common words and word patterns and sticking a new long random word now and then would be much easier.

Now, I think this looks much more similar to a real language than Voyncihese. Because most real languages have plenty of grammar words and particles that often repeat and have structures that also repeat. Languages like Latin have a lot of suffixes that repeat all over the place, etc. So naturally, if I wanted to imitate a language, I would make it look like a language and not like a weird codebook of number like sequences. Of course it goes without saying that statistically this won't look similar to Voynichese, and the distribution of tokens and types would be different.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

oshfdk > 26-05-2026, 06:14 PM

In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing mixing common words and common word combinations with random longer words, which would occasionally repeat
4) done

There is nothing wrong for a text in a natural language to have some repetitions, it's expected. In some styles of texts (recipes, log entries) it's even mandatory. There is nothing wrong with that, there is no need to fight against this and the result looks very much like an unknown language. The statistics would differ significantly from Voynichese, but maybe they would be closer to real languages, and what is more, no-one in the XV century would go over the text with a scientific calculator and measure token type ratios.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

Dunsel > 26-05-2026, 08:42 PM

(26-05-2026, 06:14 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing mixing common words and common word combinations with random longer words, which would occasionally repeat
4) done

There is nothing wrong for a text in a natural language to have some repetitions, it's expected. In some styles of texts (recipes, log entries) it's even mandatory. There is nothing wrong with that, there is no need to fight against this and the result looks very much like an unknown language. The statistics would differ significantly from Voynichese, but maybe they would be closer to real languages, and what is more, no-one in the XV century would go over the text with a scientific calculator and measure token type ratios.

Your python generator per your exact specification:

And I hope you don't mind but I did have to toss a few rules in there like... the alphabet to make your random words with. I used the Voynich alphabet. A max word length so it just didn't keep randomly jamming letters together. A minimum word length for your 'longer words' and some weighting so it didn't just keep repeating your longer random words. Number of pages, words per line, words per page, things python can't just guess at very well. I even tossed in a seed so you can reproduce it every time the same way

Code:
# oshfdk_literal_generator.py # No external files required. # # Literal version of: # 1) create 20-30 common words # 2) create 10-20 common combinations # 3) mix common words + common combinations + random longer words # 4) random longer words occasionally repeat # # No mutation. # No ledger. # No syllables. # No hidden morphology. import random from collections import Counter SEED = 42 PAGES = 100 WORDS_PER_PAGE = 95 WORDS_PER_LINE = 8 # Standard EVA-ish alphabet, sorted alphabetically. VOYNICH_ALPHABET = "acdefhiklnoqrstxy" LONG_WORD_MIN_LEN = 5 LONG_WORD_MAX_LEN = 12 LONG_WORD_REPEAT_CHANCE = 0.25 RECENT_LONG_WORD_WINDOW = 10 COMMON_WORDS = [ "daiin", "dain", "ol", "or", "chol", "chedy", "qokain", "qokeedy", "qotedy", "otedy", "shey", "cthey", "cthol", "shol", "chor", "dair", "qokair", "saiin", "aiin", "okain", "sary", "okol", "qol", "qokal", "chdy", ] COMMON_COMBINATIONS = [ ["qokain", "daiin"], ["chol", "chedy"], ["qokeedy", "qokedy"], ["ol", "chedy"], ["dain", "chol"], ["shey", "qokain"], ["cthey", "daiin"], ["shol", "chor"], ["qotedy", "qokain"], ["otedy", "ol"], ["saiin", "daiin"], ["qokair", "dair"], ["chol", "daiin"], ["cthol", "chedy"], ["or", "chol"], ] def make_random_long_word(): length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN) return "".join(random.choice(VOYNICH_ALPHABET) for _ in range(length)) def generate_page(): page = [] recent_long_words = [] while len(page) < WORDS_PER_PAGE: mode = random.choice(["common_word", "combination", "long_word"]) if mode == "common_word": page.append(random.choice(COMMON_WORDS)) elif mode == "combination": combo = random.choice(COMMON_COMBINATIONS) page.extend(combo) else: if recent_long_words and random.random() < LONG_WORD_REPEAT_CHANCE: word = random.choice(recent_long_words) else: word = make_random_long_word() recent_long_words.append(word) if len(recent_long_words) > RECENT_LONG_WORD_WINDOW: recent_long_words.pop(0) page.append(word) return page[:WORDS_PER_PAGE] def char_ngrams(words, n): counts = Counter() for word in words: if len(word) < n: continue for i in range(len(word) - n + 1): counts[word[i:i + n]] += 1 return counts def analyze(all_words): token_count = len(all_words) type_count = len(set(all_words)) word_freq = Counter(all_words) char_bigram_counts = char_ngrams(all_words, 2) char_trigram_counts = char_ngrams(all_words, 3) print() print("=" * 60) print("GLOBAL WORD STATISTICS") print("=" * 60) print() print(f"Tokens : {token_count}") print(f"Types : {type_count}") print(f"TTR : {type_count / token_count:.4f}") print(f"Hapax : {sum(1 for c in word_freq.values() if c == 1)}") print() print("Top 25 word tokens") print() for word, count in word_freq.most_common(25): print(f"{word:15} {count}") print() print("=" * 60) print("CHARACTER BIGRAMS") print("=" * 60) print() for bg, count in char_bigram_counts.most_common(30): print(f"{bg:5} {count}") print() print("=" * 60) print("CHARACTER TRIGRAMS") print("=" * 60) print() for tg, count in char_trigram_counts.most_common(30): print(f"{tg:5} {count}") def print_page(page_num, page): print() print("=" * 60) print(f"PAGE {page_num}") print("=" * 60) print() for i in range(0, len(page), WORDS_PER_LINE): print(" ".join(page[i:i + WORDS_PER_LINE])) def main(): random.seed(SEED) all_words = [] for page_num in range(1, PAGES + 1): page = generate_page() all_words.extend(page) print_page(page_num, page) analyze(all_words) if __name__ == "__main__": main()

Example output:

============================================================

PAGE 1
============================================================

alkkfe scadik sktlahso qokeedy qokedy chdy chol chedy
cthol cthey otedy ol chol daiin qokal qotedy
qokain cthol saiin cthol chedy idckndkerl saiin daiin
qokeedy qokedy shey qokain okol qokair dair khtrlko
qokeedy shey dain chol qokain ixrtflfkls fyxdcefh rrtyla
oenshtaly qokair sary cthol chedy fqhyaoxa cthey ol
chedy qokeedy xdffxh qotedy qokain inrqtyte qokeedy shey
aiin ixrtflfkls ol dcodyklx saiin qokal xsieesqs shol
chor ixrtflfkls cthol kiitfs qotedy ol chedy chor
xsieesqs ol chedy shol chor ol chedy or
chol chedy qokain daiin dain chol dain chol
qokair dair finiccoccxyy dain oenshtaly okain okol

============================================================
PAGE 50
============================================================

chol daiin aynxasdcici aiin qokain daiin qokair dair
aiin qotedy qokain cthol chol aynxasdcici qotedy qokain
qotedy qokain qokeedy qokedy chdy olrentat ihecreiinin shol
chor sary olrentat qokair otedy qokain shey dair
trhxxnkdea okain daiin or chol otedy cthey daiin
ecqiisq qotedy saiin daiin otedy or qokeedy qokedy
shol chor or xhaed ecqiisq axdfidrra dair cthol
chedy cthey daiin daiin lnxila ktacsxkc shey qokain
okain shol chor qotedy qokain cthey oqshrqt chol
daiin qokeedy qokedy sqsdlfoqtthc qotedy qokain axdfidrra firdsey
trhxxnkdea sqsdlfoqtthc oxeoqryr aiin qokair dair qokair dair
qokeedy iayexfd rnieiaaa chol chedy chedy oqshrqt

============================================================

PAGE 100
============================================================

qsqfrayh qokeedy qokedy ldfyyrsedffi ol saiin daiin qsqfrayh
chol chedy aiin chol daiin onrotll or chol
qokair dair chedy saiin daiin dain chol or
chol or onrotll dirtiyaofc saiin itsyne xdlrfhadl kolceeht
saiin cthol qokain qokair or chol or chol
shol chor qokain daiin xsektrfqaxs shexfc qotedy qokain
cneyrkckhrn lstnedrcoka chol daiin chol daiin lstnedrcoka aknfoetcedqe
cthol shexfc qokain daiin dain chol or or
chol ol shexfc xsektrfqaxs cthol chol or otedy
ol ytqly chol shexfc aknfoetcedqe ol chedy asqtko
dair shey qokain cthey daiin cneyrkckhrn daiin qokal
otedy chdy chol xdlrfhadl akotexxdxaic cthey daiin

Report:

Code:
============================================================ GLOBAL WORD STATISTICS ============================================================ Tokens : 9500 Types : 1800 TTR : 0.1895 Hapax : 1330 Top 25 word tokens chol 748 daiin 736 chedy 589 qokain 530 ol 427 qokeedy 289 or 277 saiin 276 otedy 275 cthol 268 dair 268 shey 247 cthey 246 qokair 246 shol 242 qotedy 231 chor 230 dain 227 qokedy 185 aiin 107 qokal 98 okol 93 sary 89 chdy 84 okain 80 ============================================================ CHARACTER BIGRAMS ============================================================ ai 2538 in 2011 ol 1901 dy 1729 ch 1711 qo 1704 ed 1632 ok 1615 ho 1546 da 1291 ii 1179 he 1136 ka 1029 ir 580 sh 573 te 571 ot 569 th 561 or 561 ey 557 ct 550 ke 538 sa 433 ee 346 ar 179 al 159 ko 154 ry 147 hd 139 fa 85 ============================================================ CHARACTER TRIGRAMS ============================================================ edy 1571 qok 1348 hol 1260 dai 1237 iin 1126 aii 1122 cho 986 oka 961 kai 863 ain 839 hed 594 che 591 air 522 cth 516 ote 515 ted 510 hey 493 oke 485 eed 293 kee 290 sai 277 tho 268 she 256 the 248 sho 244 qot 232 hor 230 ked 190 kal 102 kol 96

Your zipf curve:

Your word length distribution chart.

Oh, and for that run, some more stats I had codex retrieve.

Fixed starting vocabulary: 26 words
Random-long word occurrences: 2,342
Distinct random-long words: 1,774
Random-long words that repeated: 444
Single-use random-long words: 1,330
Extra repeat events beyond first use: 568
Highest repeat count for one random word: 5
RE: A One-Page Ledger Method for Generating Voynich-Like Text

bi3mw > 26-05-2026, 10:55 PM

I would say the biggest problem in the script is on line 60.

Code:
"".join(random.choice(VOYNICH_ALPHABET) for _ in range(length))

Each character is generated independently there. However, Voynich words exhibit strong positional dependencies. Certain characters only appear at the beginning or end of a word, such as qo-, -dy, and -in for example. Statistically, the result is therefore immediately recognizable as random noise.

One possible solution might be to assign weights to the individual parts of the word.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

Dunsel > 26-05-2026, 11:28 PM

(26-05-2026, 10:55 PM)bi3mw Wrote: You are not allowed to view links. Register or Login to view.I would say the biggest problem in the script is on line 60.

Code:
"".join(random.choice(VOYNICH_ALPHABET) for _ in range(length))

Each character is generated independently there. However, Voynich words exhibit strong positional dependencies. Certain characters only appear at the beginning or end of a word, such as qo-, -dy, and -in for example. Statistically, the result is therefore immediately recognizable as random noise.

You are ABSOLUTELY correct. But, that's not my generator. User You are not allowed to view links. Register or Login to view. gave instructions on how to create a fake language and I followed those instructions to create that generator. The output of mine is on page one of this post. He believes that the copy and mutate system I've described with rules is too cumbersome and not a good way to make a pseudo language and that just inserting random words periodically is.

(26-05-2026, 10:55 PM)bi3mw Wrote: You are not allowed to view links. Register or Login to view.One possible solution might be to assign weights to the individual parts of the word.

PRECISELY... which is what my ledger does. Weighted letters.

The point I hope he gets is that making a pseudo language, if that's what the Voynich is, requires some rules. Reducing those rules to a minimal set that still produces something statistically like a language is what I'm trying to prove. Random words just don't work. In order to make that generator work, yes, that random word generator has got to go. But, what rules do you apply when creating a new word or modifying a word? That's where my ledger works and works well, as long as it's used with human intuition and constraints.

But, if the Voynich really is a pseudo language, no generator will be able to exactly reproduce it without cheating. Statisticallyl? Yes, I think so. Word for word? Absolutely not. Could his idea for a generator work? Very possibly. It has potential. Is it going to require a good set of rules to do so? Of course.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

oshfdk > 26-05-2026, 11:57 PM

(26-05-2026, 08:42 PM)Dunsel Wrote: You are not allowed to view links. Register or Login to view.Your python generator per your exact specification:

I don't think it's a good idea to code something according to some interpretation of loose specs without clarifying the details.

I never specified how exactly words should be mixed, how random longer words should be generated, and the exact meaning of "occasionally repeat".

Let me make it more specific and yet still manageable to create manually:

1) Given three groups of: common words (A), common word combinations (B) and random long words ( C ) I think it makes sense to never put A and B next to one another and C can go between 1 and 3 times in a row. When selecting from A or B select top of the list more often. Reality check: very easy and simple to follow rule, the mixing order is straightforward: get either one of A or one of B then between 1 and 3 of C, then repeat. Prioritizing top of the list is the same as going down the list and randomly selecting the first word that seems fine, this will naturally give higher probability to higher A/B words.

2) Having just a list of common words the easiest way of generating long words is by trigrams taken from the common words list. Take any word starting bigram ab from existing preset A words and then continue appending one character at a time using another existing trigram, like abc, bcd, then cde, etc, and always finish with a word ending trigram. If at any moment the process is stuck (no non word ending trigram is available), it's ok to continue using a bigram. So, given "qokedy chol kotaiin chedy olkedy daraly" we can generate words like cholkedy. Reality check: given a paper with the list of common words this is a simple task, even though it may require a bit of thinking to avoid failing step 3. Edit: This step can be sped up significantly if a list of trigrams is prepared separately from A as a table. This won't even be anachronistic, as far as I understand, the idea of creating tables to enumerate some elements was a very common practice in computus manuscripts.

3) You idea of repeating one of last 10 words 25% of the time is a good one and would work nicely when writing, but I would adjust this a bit. Remembering my own list of restrictions for what a scribe could comfortably do, I'd expand the list of last words to 20. Additionally, if we want it to look like a natural language, we should never generate new words that either belong to A or belong to the last 20 of C. And we better avoid repeating the same word twice, so we'll never copy a word that will repeat the last word we added. Reality check: sometimes can be tedious, but generally not very hard. Given there are only 20 preset common words it's easy to remember them and then to tell apart generated and preset words, and when creating a new word it's not very hard to guide the process to avoid generating a word from A or a recently used word.

Code:
# mixwords_literal_generator.py # No external files required. # # Literal version of: # 1) write down 20-30 common words # 2) create 10-20 random common pairings from these words # 3) choose either a common word or a common combination # 4) after that, add 1-3 bigram-seeded, trigram-built longer words # 5) once 20 long words exist in the pipeline, repeat one of them 25% of the time # # No mutation. # No ledger. # No syllables. # No hidden morphology. import argparse import random from collections import Counter SEED = 42 PAGES = 100 WORDS_PER_PAGE = 95 WORDS_PER_LINE = 8 COMMON_COMBINATION_MIN_COUNT = 10 COMMON_COMBINATION_MAX_COUNT = 20 COMMON_COMBINATION_WORDS = 2 LONG_WORD_MIN_LEN = 5 LONG_WORD_MAX_LEN = 11 LONG_WORD_REPEAT_CHANCE = 0.25 RECENT_LONG_WORD_WINDOW = 20 LONG_WORDS_PER_MIX_MIN = 1 LONG_WORDS_PER_MIX_MAX = 3 COMMON_WORD_CHANCE = 0.5 COMMON_RANK_TOP_WEIGHT = 3.0 COMMON_RANK_BOTTOM_WEIGHT = 1.0 MAX_TRIGRAM_WORD_ATTEMPTS = 64 MAX_MIXED_LONG_WORD_ATTEMPTS = MAX_TRIGRAM_WORD_ATTEMPTS * 8 COMMON_WORDS = [ "daiin", "ol", "or", "qoky", "dary", "chol", "qokair", "sair", "opdor", "ockhol", "chedy", "qokchy", "ofain", "ykal", "oteedy", "sheey", "otey", "cphol", "sholdaly", "cheol", "saral", "doroldal", "qolky", "qokeedy", "chdy", ] COMMON_WORDS_ENGLISH = [ "the", "and", "are", "a", "this", "is", "unless", "more", "each", "together", "well", "anything", "right", "wrong", "stay", "move", "wish", "however", "two", "one", "four", "because", "always", "trouble", ] COMMON_COMBINATIONS = [] COMMON_SOURCE_WORD_TOKENS = [] COMMON_SOURCE_WORD_SET = set() COMMON_WORD_SET = set() COMMON_WORD_WEIGHTS = [] COMMON_COMBINATION_WEIGHTS = [] WORD_START_BIGRAMS = [] BIGRAM_OPTIONS = {} TRIGRAMS = [] TRIGRAM_OPTIONS = {} NONFINAL_TRIGRAMS = set() FINAL_BIGRAMS = set() FINAL_TRIGRAMS = set() WORD_FINAL_SUFFIXES = {} LONG_WORD_FALLBACKS = [] def make_common_combinations(): target_count = min( random.randint( COMMON_COMBINATION_MIN_COUNT, COMMON_COMBINATION_MAX_COUNT, ), len(COMMON_WORDS) * (len(COMMON_WORDS) - 1), ) combinations = [] seen = set() while len(combinations) < target_count: # Avoid A A pairs like "daiin daiin". pair = tuple(random.sample(COMMON_WORDS, COMMON_COMBINATION_WORDS)) if pair in seen: continue seen.add(pair) combinations.append(list(pair)) return combinations def make_rank_weights(items): count = len(items) if count <= 1: return [COMMON_RANK_BOTTOM_WEIGHT] * count step = ( COMMON_RANK_BOTTOM_WEIGHT - COMMON_RANK_TOP_WEIGHT ) / (count - 1) return [ COMMON_RANK_TOP_WEIGHT + (step * index) for index in range(count) ] def weighted_choice(items, weights): return random.choices(items, weights=weights, k=1)[0] def initialize_common_sources(): global COMMON_COMBINATIONS global COMMON_SOURCE_WORD_TOKENS global COMMON_SOURCE_WORD_SET global COMMON_WORD_SET global COMMON_WORD_WEIGHTS global COMMON_COMBINATION_WEIGHTS global WORD_START_BIGRAMS global BIGRAM_OPTIONS global TRIGRAMS global TRIGRAM_OPTIONS global NONFINAL_TRIGRAMS global FINAL_BIGRAMS global FINAL_TRIGRAMS global WORD_FINAL_SUFFIXES global LONG_WORD_FALLBACKS COMMON_COMBINATIONS = make_common_combinations() COMMON_WORD_SET = set(COMMON_WORDS) COMMON_WORD_WEIGHTS = make_rank_weights(COMMON_WORDS) COMMON_COMBINATION_WEIGHTS = make_rank_weights(COMMON_COMBINATIONS) COMMON_SOURCE_WORD_TOKENS = COMMON_WORDS + [ word for combo in COMMON_COMBINATIONS for word in combo ] COMMON_SOURCE_WORD_SET = set(COMMON_SOURCE_WORD_TOKENS) WORD_START_BIGRAMS = [ word[:2] for word in COMMON_SOURCE_WORD_TOKENS if len(word) >= 2 ] TRIGRAMS = [ word[i:i + 3] for word in COMMON_SOURCE_WORD_TOKENS for i in range(len(word) - 2) ] BIGRAM_OPTIONS = {} TRIGRAM_OPTIONS = {} NONFINAL_TRIGRAMS = set() FINAL_BIGRAMS = set() FINAL_TRIGRAMS = set() WORD_FINAL_SUFFIXES = {} for word in COMMON_SOURCE_WORD_TOKENS: if len(word) >= 2: FINAL_BIGRAMS.add(word[-2:]) if len(word) >= 3: FINAL_TRIGRAMS.add(word[-3:]) for i in range(len(word) - 1): bigram = word[i:i + 2] BIGRAM_OPTIONS.setdefault(bigram[0], []).append(bigram) for i in range(len(word) - 2): trigram = word[i:i + 3] TRIGRAM_OPTIONS.setdefault(trigram[:2], []).append(trigram) if i < len(word) - 3: NONFINAL_TRIGRAMS.add(trigram) suffix = word[i:] WORD_FINAL_SUFFIXES.setdefault(suffix[:2], []).append(suffix) for prefix, suffixes in WORD_FINAL_SUFFIXES.items(): WORD_FINAL_SUFFIXES[prefix] = sorted(suffixes, key=len) LONG_WORD_FALLBACKS = [ word for word in COMMON_SOURCE_WORD_SET if len(word) >= LONG_WORD_MIN_LEN ] def continue_with_bigram(word, target_length, disallowed_words): next_bigrams = BIGRAM_OPTIONS.get(word[-1], []) allowed_bigrams = [ bigram for bigram in next_bigrams if ( len(word) + 1 < target_length or ( word + bigram[1] not in disallowed_words and (word + bigram[1])[-2:] in FINAL_BIGRAMS and (word + bigram[1])[-3:] in FINAL_TRIGRAMS ) ) ] if not allowed_bigrams: return None return word + random.choice(allowed_bigrams)[1] def make_trigram_long_word(disallowed_words=None): target_length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN) fallback = None disallowed_words = set(disallowed_words or ()) for _ in range(MAX_TRIGRAM_WORD_ATTEMPTS): word = random.choice(WORD_START_BIGRAMS) completed = False while len(word) < target_length: suffixes = WORD_FINAL_SUFFIXES.get(word[-2:], []) exact_suffixes = [ suffix for suffix in suffixes if len(word) + len(suffix) - 2 == target_length and word + suffix[2:] not in disallowed_words ] if exact_suffixes: word += random.choice(exact_suffixes)[2:] completed = True break next_trigrams = TRIGRAM_OPTIONS.get(word[-2:]) allowed_trigrams = [ trigram for trigram in next_trigrams if trigram in NONFINAL_TRIGRAMS ] if next_trigrams else [] if allowed_trigrams: word += random.choice(allowed_trigrams)[2] continue word = continue_with_bigram(word, target_length, disallowed_words) if word is None: break if len(word) == target_length: completed = True break if not completed or len(word) < LONG_WORD_MIN_LEN: continue if fallback is None: fallback = word if word not in COMMON_SOURCE_WORD_SET: return word if fallback is not None: return fallback allowed_fallbacks = [ word for word in LONG_WORD_FALLBACKS if word not in disallowed_words ] if allowed_fallbacks: return random.choice(allowed_fallbacks) return random.choice(LONG_WORD_FALLBACKS) def append_recent_long_word(recent_long_words, word): recent_long_words.append(word) if len(recent_long_words) > RECENT_LONG_WORD_WINDOW: recent_long_words.pop(0) def make_mixed_long_word(recent_long_words, previous_word=None): if ( len(recent_long_words) >= RECENT_LONG_WORD_WINDOW and random.random() < LONG_WORD_REPEAT_CHANCE ): repeat_candidates = [ word for word in recent_long_words if word != previous_word ] if repeat_candidates: return random.choice(repeat_candidates) fallback = None recent_long_word_set = set(recent_long_words) disallowed_words = COMMON_WORD_SET | recent_long_word_set for _ in range(MAX_MIXED_LONG_WORD_ATTEMPTS): word = make_trigram_long_word(disallowed_words) if word in COMMON_WORD_SET or word in recent_long_word_set: continue if fallback is None: fallback = word if word != previous_word: return word if fallback is not None and fallback != previous_word: return fallback distinct_fallbacks = [ word for word in LONG_WORD_FALLBACKS if ( word != previous_word and word not in COMMON_WORD_SET and word not in recent_long_word_set ) ] if distinct_fallbacks: return random.choice(distinct_fallbacks) if fallback is not None: return fallback raise RuntimeError( "Unable to generate a valid long word that is not a common word " "or one of the recent long words." ) def append_marked_word(words, marker, word): words.append((marker, word)) def generate_words(total_words, recent_long_words): words = [] while len(words) < total_words: if random.random() < COMMON_WORD_CHANCE: append_marked_word( words, "A", weighted_choice(COMMON_WORDS, COMMON_WORD_WEIGHTS), ) else: for word in weighted_choice( COMMON_COMBINATIONS, COMMON_COMBINATION_WEIGHTS, ): if len(words) >= total_words: break append_marked_word(words, "B", word) long_word_count = random.randint( LONG_WORDS_PER_MIX_MIN, LONG_WORDS_PER_MIX_MAX, ) for _ in range(long_word_count): if len(words) >= total_words: break previous_word = words[-1][1] if words else None word = make_mixed_long_word(recent_long_words, previous_word) append_marked_word(words, "C", word) append_recent_long_word(recent_long_words, word) return words[:total_words] def char_ngrams(words, n): counts = Counter() for word in words: if len(word) < n: continue for i in range(len(word) - n + 1): counts[word[i:i + n]] += 1 return counts def analyze(all_words): token_count = len(all_words) type_count = len(set(all_words)) word_freq = Counter(all_words) char_bigram_counts = char_ngrams(all_words, 2) char_trigram_counts = char_ngrams(all_words, 3) print() print("=" * 60) print("GLOBAL WORD STATISTICS") print("=" * 60) print() print(f"Tokens : {token_count}") print(f"Types : {type_count}") print(f"TTR : {type_count / token_count:.4f}") print(f"Hapax : {sum(1 for c in word_freq.values() if c == 1)}") print() print("Top 25 word tokens") print() for word, count in word_freq.most_common(25): print(f"{word:15} {count}") print() print("=" * 60) print("CHARACTER BIGRAMS") print("=" * 60) print() for bg, count in char_bigram_counts.most_common(30): print(f"{bg:5} {count}") print() print("=" * 60) print("CHARACTER TRIGRAMS") print("=" * 60) print() for tg, count in char_trigram_counts.most_common(30): print(f"{tg:5} {count}") def format_token(entry, debug=False): marker, word = entry if debug: return f"{marker} {word}" return word def print_page(page_num, page, debug=False): print() print("=" * 60) print(f"PAGE {page_num}") print("=" * 60) print() for i in range(0, len(page), WORDS_PER_LINE): print(" ".join(format_token(entry, debug) for entry in page[i:i + WORDS_PER_LINE])) def main(): parser = argparse.ArgumentParser() parser.add_argument("--debug", action="store_true") parser.add_argument("--english", action="store_true") args = parser.parse_args() if args.english: global COMMON_WORDS COMMON_WORDS = COMMON_WORDS_ENGLISH random.seed(SEED) initialize_common_sources() recent_long_words = [] total_words = PAGES * WORDS_PER_PAGE all_entries = generate_words(total_words, recent_long_words) all_words = [word for _, word in all_entries] for page_num in range(1, PAGES + 1): start = (page_num - 1) * WORDS_PER_PAGE end = start + WORDS_PER_PAGE page = all_entries[start:end] print_page(page_num, page, debug=args.debug) analyze(all_words) if __name__ == "__main__": main()

The TTR stats:
Tokens : 9500
Types : 727
TTR : 0.0765
Hapax : 223

What it looks like:

oteedy chedary qoldaly cholky qoky saiiral chedy cpholky
chdykaiin chedary ol daiin chdoroldal otey opdorolky ockhol
ykair otey shedal chedaiin doroldal dorolkchol sheey qokain
orolky opdor daiin qokal oteedor oteedykaiin qoky daiin
chdaiiiral dorolkchol qokal dary dorolkchol sair otey daraly
cholky ykaiiral chdy chol dalykal qoldarary cholkykal oteedy
sholdaly choldaldaly oroldaiin daraly chdy chol oldalykal oroldalky
qokeedaly qolky sholdal oldaiiral doroldal doroldaly ockhol dorolkchol
qokeedy opdor oroldaraiin ockholdaly qokal dary oldaldaly qokeey
daiin qokal sair daldalykair oteey sholkeey chdy choldaiin
sholdaly ykal cpholkykal dary oteedykaiin dary cholky cpholky
qokain qokeedy ockholkair cpholkykal ockholky cheol choldal

However, I also wouldn't choose the strange fixed order sequences of the Voynichese if I was to create a fake language. For comparison, the same algorithm when seeded with some English words:

The TTR stats:
Tokens : 9500
Types : 1296
TTR : 0.1364
Hapax : 595

What it looks like:

one the troneach anythe and thingethe one moreach
twover wroneach a onlessshing wrong more alwayshess foublestach
righell one the howell tronghthe alwaysshis and the
becaushis eacaustay twaything right thinythis always anytheach more
wish ongethe are unless anythevever howevething two wrong
anytheach wroneserong more this wever wrong more aronghis
onleseress are therongethe morouble moreless each mover howeless
are unless thing alwaysever stay tronless howevething always
four arong two wrong howevever right anytheach twong
anythe is tronless wrong more wrouble because trouble
tronlesever is eachething thether thecause and the unlestay
because trouble morouble and issereless wrouble thecausstay

Add a simple substitution on top of this (replace each English letter with an invented alphabet) and you have what looks like a plausible text in some exotic language.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

Dunsel > 27-05-2026, 12:00 AM

(26-05-2026, 06:14 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing mixing common words and common word combinations with random longer words, which would occasionally repeat
4) done

Ok, so you saw the output of your generator idea. Now, let me show you the difference when my ledger is used to create new words:

============================================================
PAGE 51
============================================================

daiin oroksy qoeeey qkamamaiiiin oroksy otedy shey saiin
daiin or qokain daiin rotain dain or cthol
oltheeo chol daiin cthey qokal otedy ol cthol
chedy cheaiiky daiin shey qokain qokair dair cpholsho
qokain qokain daiin daiin chol chedy cthol cthey
solycpco yaiiiiiin cthey dycfhea sary lcksheol solycpco cthol
cheaiiky olokchodaiin oaiiim aiin olaldear shol okol ol
cthosheesheo qokain daiin cheeolsheeoy olololarcheo shey qokain ol
chedy qokair chodalshof dchdchain sheochocy ytheeor shey dain
orchosheotcy qokair dair otedy sholar chodalshof otedy ol
dain chol aiin or chol dain chol qokair
choteee qokaiiiin chodalshof qokeedy dain dain chol

Now you can see that the random words being are a good bit more Voynich like. But, they still have a lot of issues. They are following weights from the Voynich in picking letters but it's still not enough.

I took my basic "don't look stupid rules" from my generator and added them to yours. Now look at the difference.

============================================================
PAGE 51
============================================================

chor daiin qokal daiiin chthiin chol daiin qokal
chthiin ychind chol daiin solaiidaiiin chol shey chol
daiin otedy otedy ychos or sodaindaiiir otaich cthey
or chedy chedy dain chol okol cheolys dair
or choqolaiidy qokain daiin qokain or chol cthol
chedy qokeedy chdy olshey qkolsho aiin chotchy or
chol qokain cthey sary chdaiiit sheomm daiiky ikhain
chol daiin dainochokheo qokeedy qokedy qokeedy qokedy ytodoaiim
sheomm qokal ofshodain cthey okol sheeeotor qocheochody saiin
daiin cthey daiin ofshodain shol dain qokeedy qokedy
cheeotsamo soshy shol qokeedy qokedy or chol cthey
daiin daiiky qokeedy qokedy sheocy chol qokair

Now we have some control in those random words over vowels and consonant structure. As well as repeated vowels and repeated consonants. It's starting to look more like Voynich instead of word soup but, those random words still don't look Voynich. Not enough rules. In order to get that right, I'd have to do some machine learning. Learning what letters follow what other letters, what bigrams follow other bigrams and add weighting to that.

To run this, download my Ledger_Scribe1.json from the repository and then:

py minimal_phrase_generator_v3.py --ledger Ledger_scribe1.json --pages 100 --print-all-pages

Code:
# minimal_phrase_generator_v3.py # # "Helped" version of the forum proposal: # # 1) fixed common words # 2) fixed common combinations # 3) mix those with occasional longer words # 4) longer words are generated by using existing Ledger_scribe1.json # 5) generated longer words may repeat locally # # Required external file: # Ledger_scribe1.json # # Example: # py minimal_phrase_generator_v3.py --ledger Ledger_scribe1.json --pages 100 --print-all-pages import argparse import json import random from collections import Counter from pathlib import Path # ============================================================ # SETTINGS # ============================================================ DEFAULT_SEED = 42 DEFAULT_PAGES = 100 DEFAULT_WORDS_PER_PAGE = 95 DEFAULT_WORDS_PER_LINE = 8 # Uniform branch selection. # This avoids hidden weighting between the three production modes. MODES = [ "common_word", "common_combination", "ledger_word", ] # This is the smallest operational version of # "random longer words which occasionally repeat." LEDGER_WORD_REPEAT_CHANCE = 0.25 RECENT_LEDGER_WORD_WINDOW = 10 # Ledger does not encode word length, so this is the one explicit external assumption. MIN_LEDGER_WORD_LEN = 5 MAX_LEDGER_WORD_LEN = 12 MAX_ATTEMPTS_PER_LEDGER_WORD = 200 GALLOWS = set("ktpf") VOWELS = set("aeioy") # Basic "don't look stupid" guards copied from the v11 generator idea. # These are deliberately simple visual plausibility filters, not grammar. MAX_VOWEL_RUN = 4 MAX_CONSONANT_RUN = 4 MAX_TOKEN_LEN = 12 # Repeat/family guards for newly generated ledger words. RECENT_TOKEN_REPEAT_LIMIT = 3 MAX_LEDGER_TOKEN_PAGE_COUNT = 5 MAX_LEDGER_FAMILY_PAGE_COUNT = 9 # ============================================================ # FIXED COMMON WORDS # ============================================================ COMMON_WORDS = [ "daiin", "dain", "ol", "or", "chol", "chedy", "qokain", "qokeedy", "qotedy", "otedy", "shey", "cthey", "cthol", "shol", "chor", "dair", "qokair", "saiin", "aiin", "okain", "sary", "okol", "qol", "qokal", "chdy", ] # ============================================================ # FIXED COMMON COMBINATIONS # ============================================================ COMMON_COMBINATIONS = [ ["qokain", "daiin"], ["chol", "chedy"], ["qokeedy", "qokedy"], ["ol", "chedy"], ["dain", "chol"], ["shey", "qokain"], ["cthey", "daiin"], ["shol", "chor"], ["qotedy", "qokain"], ["otedy", "ol"], ["saiin", "daiin"], ["qokair", "dair"], ["chol", "daiin"], ["cthol", "chedy"], ["or", "chol"], ] # ============================================================ # BASIC HELPERS # ============================================================ def weighted_choice(rng, values, weights=None): if not values: return None if weights is None: return rng.choice(list(values)) return rng.choices(list(values), weights=list(weights), k=1)[0] def char_ngrams(words, n): counts = Counter() for word in words: for i in range(0, len(word) - n + 1): counts[word[i:i+n]] += 1 return counts def max_run(token, charset): best = 0 current = 0 for ch in token: if ch in charset: current += 1 best = max(best, current) else: current = 0 return best def family_form(token): return "".join(ch for ch in token if ch not in GALLOWS) def passes_dls(token, ledger, page): """ Basic v11-style don't-look-stupid filter. Applied only to newly generated ledger words, not to the fixed common-word/common-combination scaffolding. """ if not token: return False if len(token) > MAX_TOKEN_LEN: return False if max_run(token, VOWELS) > MAX_VOWEL_RUN: return False consonants = set(ledger.alphabet) - VOWELS if max_run(token, consonants) > MAX_CONSONANT_RUN: return False if page and token == page[-1]: return False if page[-RECENT_LEDGER_WORD_WINDOW:].count(token) >= RECENT_TOKEN_REPEAT_LIMIT: return False if page.count(token) >= MAX_LEDGER_TOKEN_PAGE_COUNT: return False family = family_form(token) if family: family_count = sum( 1 for existing in page if family_form(existing) == family ) if family_count >= MAX_LEDGER_FAMILY_PAGE_COUNT: return False return ledger.validate(token) # ============================================================ # YOUR LEDGER FORMAT # ============================================================ class Ledger: def __init__(self, path): path = Path(path) if not path.exists(): raise FileNotFoundError( f"Ledger file not found: {path}\n" "Put Ledger_scribe1.json beside this script or pass --ledger PATH." ) with path.open("r", encoding="utf-8") as f: data = json.load(f) self.metadata = data.get("metadata", {}) self.rows = data["ledger"] self.alphabet = list(data.get("alphabet", self.rows.keys())) self.columns = tuple(self.metadata.get("columns", ["prefix", "midfix", "suffix"])) self.tiers = tuple(self.metadata.get("tiers", ["80", "18", "2"])) self.short_tokens = set((self.metadata.get("short_tokens") or {}).keys()) self.tier_weights = self._tier_weights(self.tiers) self.first_token_values, self.first_token_weights = self._load_first_token_weights() self.followers = {} for glyph, row in self.rows.items(): self.followers[glyph] = {} for column in self.columns: values = [] weights = [] for tier, tier_weight in self.tier_weights: glyphs = row.get(column, {}).get(tier, []) if not glyphs: continue each_weight = tier_weight / len(glyphs) for follower in glyphs: values.append(follower) weights.append(each_weight) self.followers[glyph][column] = (values, weights) def _tier_weights(self, tiers): numeric = [] for tier in tiers: try: numeric.append((tier, float(tier) / 100.0)) except ValueError: numeric.append((tier, 1.0)) total = sum(weight for _, weight in numeric) or 1.0 return [ (tier, weight / total) for tier, weight in numeric ] def _load_first_token_weights(self): raw_weights = ( self.metadata.get("first_token_weights") or self.metadata.get("start_token_weights") ) if raw_weights: pairs = [ (glyph, float(weight)) for glyph, weight in raw_weights.items() if glyph in self.rows and float(weight) > 0 ] if pairs: pairs.sort() values = [p[0] for p in pairs] weights = [p[1] for p in pairs] return values, weights raw_counts = ( self.metadata.get("first_tokens") or self.metadata.get("start_tokens") ) if raw_counts: pairs = [ (glyph, float(count)) for glyph, count in raw_counts.items() if glyph in self.rows and float(count) > 0 ] if pairs: pairs.sort() values = [p[0] for p in pairs] weights = [p[1] for p in pairs] return values, weights return list(self.rows.keys()), None def choose_start_glyph(self, rng): return weighted_choice( rng, self.first_token_values, self.first_token_weights, ) def choose_follower(self, rng, left, column): values, weights = self.followers.get(left, {}).get(column, ([], [])) if not values: return None return weighted_choice(rng, values, weights) def legal_transition(self, left, right, column): row = self.rows.get(left, {}).get(column, {}) for tier in self.tiers: if right in row.get(tier, []): return True return False def has_multiple_gallows(self, token): return sum(1 for ch in token if ch in GALLOWS) > 1 def validate(self, token): if not token: return False if self.has_multiple_gallows(token): return False if len(token) == 1: return token in self.short_tokens if token[0] not in self.rows: return False if not self.legal_transition(token[0], token[1], "prefix"): return False for index in range(2, len(token) - 1): if not self.legal_transition(token[index - 1], token[index], "midfix"): return False return self.legal_transition(token[-2], token[-1], "suffix") def generate_word(self, rng, page=None, min_len=MIN_LEDGER_WORD_LEN, max_len=MAX_LEDGER_WORD_LEN): page = page or [] for _ in range(MAX_ATTEMPTS_PER_LEDGER_WORD): target_len = rng.randint(min_len, max_len) chars = [self.choose_start_glyph(rng)] if not chars[0]: continue while len(chars) < target_len: left = chars[-1] if len(chars) == 1: column = "prefix" elif len(chars) == target_len - 1: column = "suffix" else: column = "midfix" right = self.choose_follower(rng, left, column) if right is None: break chars.append(right) token = "".join(chars) if len(token) == target_len and passes_dls(token, self, page): return token raise RuntimeError( "Could not generate a ledger-valid DLS-passing word after " f"{MAX_ATTEMPTS_PER_LEDGER_WORD} attempts." ) # ============================================================ # GENERATION # ============================================================ def generate_page(rng, ledger, words_per_page): page = [] recent_ledger_words = [] while len(page) < words_per_page: mode = rng.choice(MODES) if mode == "common_word": page.append(rng.choice(COMMON_WORDS)) elif mode == "common_combination": page.extend(rng.choice(COMMON_COMBINATIONS)) elif mode == "ledger_word": word = None if recent_ledger_words and rng.random() < LEDGER_WORD_REPEAT_CHANCE: repeat_candidates = [ candidate for candidate in recent_ledger_words if passes_dls(candidate, ledger, page) ] if repeat_candidates: word = rng.choice(repeat_candidates) if word is None: word = ledger.generate_word(rng, page=page) recent_ledger_words.append(word) if len(recent_ledger_words) > RECENT_LEDGER_WORD_WINDOW: recent_ledger_words.pop(0) page.append(word) return page[:words_per_page] # ============================================================ # OUTPUT / ANALYSIS # ============================================================ def fixed_vocab(): vocab = set(COMMON_WORDS) for combo in COMMON_COMBINATIONS: vocab.update(combo) return vocab def print_page(page_num, page, words_per_line): print() print("=" * 60) print(f"PAGE {page_num}") print("=" * 60) print() for i in range(0, len(page), words_per_line): print(" ".join(page[i:i + words_per_line])) def analyze(all_words): vocab = fixed_vocab() freq = Counter(all_words) generated_occurrences = [ word for word in all_words if word not in vocab ] generated_freq = Counter(generated_occurrences) print() print("=" * 60) print("GLOBAL STATISTICS") print("=" * 60) print() print(f"Tokens : {len(all_words)}") print(f"Types : {len(freq)}") print(f"TTR : {len(freq) / len(all_words):.4f}") print(f"Hapax : {sum(1 for c in freq.values() if c == 1)}") print(f"Fixed vocabulary size : {len(vocab)}") print(f"Ledger-word occurrences : {len(generated_occurrences)}") print(f"Ledger-word types : {len(generated_freq)}") print(f"Repeated ledger-word types : {sum(1 for c in generated_freq.values() if c > 1)}") print(f"Max ledger-word repeat : {max(generated_freq.values()) if generated_freq else 0}") print() print("Top 25 word tokens") print() for word, count in freq.most_common(25): print(f"{word:20} {count}") print() print("Top 30 character bigrams") print() for bg, count in char_ngrams(all_words, 2).most_common(30): print(f"{bg:5} {count}") print() print("Top 30 character trigrams") print() for tg, count in char_ngrams(all_words, 3).most_common(30): print(f"{tg:5} {count}") print() print("Top repeated ledger-generated words") print() shown = 0 for word, count in generated_freq.most_common(): if count <= 1: break print(f"{word:20} {count}") shown += 1 if shown >= 25: break def main(): parser = argparse.ArgumentParser( description="Minimal phrase + common word generator using the existing Scribe 1 ledger for new words." ) parser.add_argument("--ledger", default="Ledger_scribe1.json") parser.add_argument("--pages", type=int, default=DEFAULT_PAGES) parser.add_argument("--words-per-page", type=int, default=DEFAULT_WORDS_PER_PAGE) parser.add_argument("--words-per-line", type=int, default=DEFAULT_WORDS_PER_LINE) parser.add_argument("--seed", type=int, default=DEFAULT_SEED) parser.add_argument("--print-all-pages", action="store_true") parser.add_argument("--output", default=None, help="Optional text output file.") args = parser.parse_args() rng = random.Random(args.seed) ledger = Ledger(args.ledger) lines = [] all_words = [] # Capture print output if --output is requested. # Simpler than redirecting stdout externally. def emit(text=""): print(text) if args.output is not None: lines.append(text) emit("Common words:") emit(", ".join(COMMON_WORDS)) emit() emit("Common combinations:") for combo in COMMON_COMBINATIONS: emit(" ".join(combo)) for page_num in range(1, args.pages + 1): page = generate_page(rng, ledger, args.words_per_page) all_words.extend(page) if args.print_all_pages or page_num <= 3 or page_num == args.pages: emit() emit("=" * 60) emit(f"PAGE {page_num}") emit("=" * 60) emit() for i in range(0, len(page), args.words_per_line): emit(" ".join(page[i:i + args.words_per_line])) # Analysis prints to stdout directly. analyze(all_words) if args.output is not None: Path(args.output).write_text("\n".join(lines) + "\n", encoding="utf-8") if __name__ == "__main__": main()

Disclaimer: I had codex create and modify this code from a prompt.
RE: A One-Page Ledger Method for Generating Voynich-Like Text

bi3mw > 27-05-2026, 12:04 AM

(26-05-2026, 11:28 PM)Dunsel Wrote: You are not allowed to view links. Register or Login to view.PRECISELY... which is what my ledger does. Weighted letters.
I was thinking more along the lines of an extension for “random.” Just for a quick suggestion, I decided to ask the AI for the first time. The basic idea seems viable, but of course the accuracy of the weighting would need to be verified.

Code:
# oshfdk_literal_generator_modified.py # No external files required. # # Literal version of: # 1) create 20-30 common words # 2) create 10-20 common combinations # 3) mix common words + common combinations + random longer words # 4) random longer words occasionally repeat # # No mutation. # No ledger. # No syllables. # No hidden morphology. # # FIX for Problem: # Long words are no longer generated by flat random character sampling. # Instead, positional character pools (INITIAL / MEDIAL / TERMINAL) with # weights give each word a plausible frame: # INITIAL – q, d, s, c dominate word-starts # MEDIAL – o, e, a, i are the backbone of word bodies # TERMINAL – n, y, r, l, m are the most common word-endings import random from collections import Counter SEED = 42 PAGES = 100 WORDS_PER_PAGE = 95 WORDS_PER_LINE = 8 # --------------------------------------------------------------------------- # Standard EVA-ish alphabet, sorted alphabetically. # --------------------------------------------------------------------------- VOYNICH_ALPHABET = "acdefhiklnoqrstxy" LONG_WORD_MIN_LEN = 5 LONG_WORD_MAX_LEN = 12 LONG_WORD_REPEAT_CHANCE = 0.25 RECENT_LONG_WORD_WINDOW = 10 # --------------------------------------------------------------------------- # Common words and combinations (unchanged) # --------------------------------------------------------------------------- COMMON_WORDS = [ "daiin", "dain", "ol", "or", "chol", "chedy", "qokain", "qokeedy", "qotedy", "otedy", "shey", "cthey", "cthol", "shol", "chor", "dair", "qokair", "saiin", "aiin", "okain", "sary", "okol", "qol", "qokal", "chdy", ] COMMON_COMBINATIONS = [ ["qokain", "daiin"], ["chol", "chedy"], ["qokeedy","qokedy"], ["ol", "chedy"], ["dain", "chol"], ["shey", "qokain"], ["cthey", "daiin"], ["shol", "chor"], ["qotedy", "qokain"], ["otedy", "ol"], ["saiin", "daiin"], ["qokair", "dair"], ["chol", "daiin"], ["cthol", "chedy"], ["or", "chol"], ] # --------------------------------------------------------------------------- # Weights are intentionally skewed to reflect the strong positional biases # observed in Voynich script: # INITIAL – q, d, s, c dominate word-starts # MEDIAL – o, e, a, i are the backbone of word bodies # TERMINAL – n, y, r, l, m are the most common word-endings # --------------------------------------------------------------------------- INITIAL_CHARS = { "q": 8, "d": 7, "s": 6, "c": 6, "o": 4, "a": 3, "f": 2, "r": 1, "k": 1, "t": 1, } MEDIAL_CHARS = { "o": 9, "e": 7, "a": 6, "i": 6, "l": 4, "n": 3, "k": 3, "h": 3, "r": 2, "t": 2, "d": 1, "s": 1, } TERMINAL_CHARS = { "n": 8, "y": 8, "r": 6, "l": 5, "m": 4, "s": 3, "d": 3, } def weighted_choice(weight_dict: dict) -> str: """Return a single character sampled according to integer weights.""" chars = list(weight_dict.keys()) weights = list(weight_dict.values()) return random.choices(chars, weights=weights, k=1)[0] def make_random_long_word() -> str: """ Build a long word using positional character pools. Structure: [INITIAL char] + [MEDIAL body] + [TERMINAL char] """ length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN) # --- Word-initial character (positional pool) --- word = weighted_choice(INITIAL_CHARS) # --- Medial body (positional pool) --- for _ in range(length - 2): word += weighted_choice(MEDIAL_CHARS) # --- Word-terminal character (positional pool) --- word += weighted_choice(TERMINAL_CHARS) return word # --------------------------------------------------------------------------- # Page generation (unchanged logic, fixed long-word generator plugged in) # --------------------------------------------------------------------------- def generate_page() -> list[str]: page = [] recent_long_words = [] while len(page) < WORDS_PER_PAGE: mode = random.choice(["common_word", "combination", "long_word"]) if mode == "common_word": page.append(random.choice(COMMON_WORDS)) elif mode == "combination": combo = random.choice(COMMON_COMBINATIONS) page.extend(combo) else: # long_word if recent_long_words and random.random() < LONG_WORD_REPEAT_CHANCE: word = random.choice(recent_long_words) else: word = make_random_long_word() recent_long_words.append(word) if len(recent_long_words) > RECENT_LONG_WORD_WINDOW: recent_long_words.pop(0) page.append(word) return page[:WORDS_PER_PAGE] # --------------------------------------------------------------------------- # Analysis helpers (unchanged) # --------------------------------------------------------------------------- def char_ngrams(words: list[str], n: int) -> Counter: counts = Counter() for word in words: if len(word) < n: continue for i in range(len(word) - n + 1): counts[word[i:i + n]] += 1 return counts def analyze(all_words: list[str]) -> None: token_count = len(all_words) type_count = len(set(all_words)) word_freq = Counter(all_words) char_bigram_counts = char_ngrams(all_words, 2) char_trigram_counts = char_ngrams(all_words, 3) print() print("=" * 60) print("GLOBAL WORD STATISTICS") print("=" * 60) print() print(f"Tokens : {token_count}") print(f"Types : {type_count}") print(f"TTR : {type_count / token_count:.4f}") print(f"Hapax : {sum(1 for c in word_freq.values() if c == 1)}") print() print("Top 25 word tokens") print() for word, count in word_freq.most_common(25): print(f"{word:15} {count}") print() print("=" * 60) print("CHARACTER BIGRAMS") print("=" * 60) print() for bg, count in char_bigram_counts.most_common(30): print(f"{bg:5} {count}") print() print("=" * 60) print("CHARACTER TRIGRAMS") print("=" * 60) print() for tg, count in char_trigram_counts.most_common(30): print(f"{tg:5} {count}") def print_page(page_num: int, page: list[str]) -> None: print() print("=" * 60) print(f"PAGE {page_num}") print("=" * 60) print() for i in range(0, len(page), WORDS_PER_LINE): print(" ".join(page[i:i + WORDS_PER_LINE])) # --------------------------------------------------------------------------- # Entry point # --------------------------------------------------------------------------- def main() -> None: random.seed(SEED) all_words = [] for page_num in range(1, PAGES + 1): page = generate_page() all_words.extend(page) print_page(page_num, page) analyze(all_words) if __name__ == "__main__": main()

Sorry, I was a little slow in posting this. - Edit: Where exactly can I find “Ledger_Scribe1.json”?
Next Oldest Next Newest

A One-Page Ledger Method for Generating Voynich-Like Text

Index

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text

RE: A One-Page Ledger Method for Generating Voynich-Like Text