The Voynich Ninja - A One-Page Ledger Method for Generating Voynich-Like Text

(26-05-2026, 04:44 PM)Dunsel Wrote: I think the Voynich is implausible. If it is copy mutate, the best you can hope for is statistically comparable.
Everyone hopes for an exact explanation or reproduction. What if there isn't one?

If there isn't one, then there isn't one. I understand that a lot of people see the need for some closure, but personally I'm ok with not knowing how the manuscript was created, at least for now.

(26-05-2026, 04:44 PM)Dunsel Wrote: Try my challenge in a previous reply. See if you can make statistically correct Voynich with it.

I think I did exactly that in my reply, it broke down immediately. I took a very common word "dain" and I identified that introducing some simple changes to it that pass the ledger produces words that not only are unattested in the manuscript, but also don't conform to CLS framework (which is ok in rare cases, but not generally). This challenge doesn't produce statistically correct Voynichese.

(26-05-2026, 05:04 PM)oshfdk Wrote: I think I did exactly that in my reply, it broke down immediately. I took a very common word "dain" and I identified that introducing some simple changes to it that pass the ledger produces words that not only are unattested in the manuscript, but also don't conform to CLS framework (which is ok in rare cases, but not generally). This challenge doesn't produce statistically correct Voynichese.

No, you're trying to make it look Voynich. Don't. Just try to make it look like a language. Create your own visually appealing language.

(26-05-2026, 05:19 PM)Dunsel Wrote: No, you're trying to make it look Voynich. Don't. Just try to make it look like a language. Create your own visually appealing language.

So, if I understand it correctly, you give me a set of rules (copy + mutate and the ledger) and you want me to add on top of this a set of my own preferences and then produce some text, and you claim that this text would bear statistical resemblance to Voynichese no matter what my preferences are? This is clearly wrong. Suppose my preferences are: the most common words should be 1-3 characters long and repeat often, there should be only a few of them, and longer words should vary more but occasionally also repeat close to one another.

Seeds: daiin,sary,qokain,otedy,okol

Text: sar oko qokair oko daiir sar qotedy daiiny or qokainy sar oko san qotey o daiir otedysary oko saiin qotain or o sary daiir qotair sar oko otedy otedysar daiiny okol sary qotady sar oko okedy o daiir qokain daiin qokairy qotey o daiir otesy sar oko qokair.

And naturally, if I wanted to make it look like a real language, I'd make sure that "sar oko" and "or" and "o" keep repeating throughout the whole text, while "ar" and "sa" do not appear often, because that's the way most (all?) languages work when it comes to short words. If you think about it, I have to fight against copy + mutate here, defaulting to plain copy in most cases and feeling unnecessarily restricted by the mutation rules to keep going, because copy + mutate is not a good way to create a plausible-looking language. Just creating a bunch of random common words and word patterns and sticking in a new long random word now and then would be much easier.

Now, I think this looks much more similar to a real language than Voynichese does, because most real languages have plenty of grammar words and particles that often repeat, and have structures that also repeat. Languages like Latin have a lot of suffixes that repeat all over the place, etc. So naturally, if I wanted to imitate a language, I would make it look like a language and not like a weird codebook of number-like sequences. Of course it goes without saying that statistically this won't look similar to Voynichese, and the distribution of tokens and types would be different.

In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing mixing common words and common word combinations with random longer words, which would occasionally repeat
4) done

There is nothing wrong for a text in a natural language to have some repetitions, it's expected. In some styles of texts (recipes, log entries) it's even mandatory. There is nothing wrong with that, there is no need to fight against this and the result looks very much like an unknown language. The statistics would differ significantly from Voynichese, but maybe they would be closer to real languages, and what is more, no-one in the XV century would go over the text with a scientific calculator and measure token type ratios.
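
For anyone who does want to measure the token/type figures mentioned here, a few lines of Python are enough. This is only a minimal sketch, not anyone's actual analysis script: the helper name `type_token_stats` is made up for illustration, and it is run on just the first ten words of the hand-made sample text above.

```python
from collections import Counter


def type_token_stats(words):
    """Return (tokens, types, TTR, hapax count) for a list of word tokens."""
    freq = Counter(words)
    tokens = len(words)
    types = len(freq)
    # Hapax legomena: word types that occur exactly once.
    hapax = sum(1 for count in freq.values() if count == 1)
    ttr = types / tokens if tokens else 0.0
    return tokens, types, ttr, hapax


# First ten words of the hand-made sample text from this post.
sample = "sar oko qokair oko daiir sar qotedy daiiny or qokainy".split()
tokens, types, ttr, hapax = type_token_stats(sample)
print(f"Tokens: {tokens}  Types: {types}  TTR: {ttr:.4f}  Hapax: {hapax}")
```

TTR depends strongly on text length, so figures like this are only comparable between samples of similar size.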

(26-05-2026, 06:14 PM)oshfdk Wrote: In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing mixing common words and common word combinations with random longer words, which would occasionally repeat
4) done

There is nothing wrong for a text in a natural language to have some repetitions, it's expected. In some styles of texts (recipes, log entries) it's even mandatory. There is nothing wrong with that, there is no need to fight against this and the result looks very much like an unknown language. The statistics would differ significantly from Voynichese, but maybe they would be closer to real languages, and what is more, no-one in the XV century would go over the text with a scientific calculator and measure token type ratios.

Your python generator per your exact specification:

And I hope you don't mind, but I did have to toss a few rules in there, like the alphabet to make your random words with. I used the Voynich alphabet. A max word length, so it didn't just keep randomly jamming letters together. A minimum word length for your 'longer words', and some weighting so it didn't just keep repeating your longer random words. Number of pages, words per line, words per page: things Python can't just guess at very well. I even tossed in a seed so you can reproduce it the same way every time.

Code:
# oshfdk_literal_generator.py
# No external files required.
#
# Literal version of:
# 1) create 20-30 common words
# 2) create 10-20 common combinations
# 3) mix common words + common combinations + random longer words
# 4) random longer words occasionally repeat
#
# No mutation.
# No ledger.
# No syllables.
# No hidden morphology.

import random
from collections import Counter

SEED = 42
PAGES = 100
WORDS_PER_PAGE = 95
WORDS_PER_LINE = 8

# Standard EVA-ish alphabet, sorted alphabetically.
VOYNICH_ALPHABET = "acdefhiklnoqrstxy"

LONG_WORD_MIN_LEN = 5
LONG_WORD_MAX_LEN = 12
LONG_WORD_REPEAT_CHANCE = 0.25
RECENT_LONG_WORD_WINDOW = 10

COMMON_WORDS = [
    "daiin", "dain", "ol", "or", "chol",
    "chedy", "qokain", "qokeedy", "qotedy", "otedy",
    "shey", "cthey", "cthol", "shol", "chor",
    "dair", "qokair", "saiin", "aiin", "okain",
    "sary", "okol", "qol", "qokal", "chdy",
]

COMMON_COMBINATIONS = [
    ["qokain", "daiin"],
    ["chol", "chedy"],
    ["qokeedy", "qokedy"],
    ["ol", "chedy"],
    ["dain", "chol"],
    ["shey", "qokain"],
    ["cthey", "daiin"],
    ["shol", "chor"],
    ["qotedy", "qokain"],
    ["otedy", "ol"],
    ["saiin", "daiin"],
    ["qokair", "dair"],
    ["chol", "daiin"],
    ["cthol", "chedy"],
    ["or", "chol"],
]


def make_random_long_word():
    length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN)
    return "".join(random.choice(VOYNICH_ALPHABET) for _ in range(length))


def generate_page():
    page = []
    recent_long_words = []
    while len(page) < WORDS_PER_PAGE:
        mode = random.choice(["common_word", "combination", "long_word"])
        if mode == "common_word":
            page.append(random.choice(COMMON_WORDS))
        elif mode == "combination":
            combo = random.choice(COMMON_COMBINATIONS)
            page.extend(combo)
        else:
            if recent_long_words and random.random() < LONG_WORD_REPEAT_CHANCE:
                word = random.choice(recent_long_words)
            else:
                word = make_random_long_word()
                recent_long_words.append(word)
                if len(recent_long_words) > RECENT_LONG_WORD_WINDOW:
                    recent_long_words.pop(0)
            page.append(word)
    return page[:WORDS_PER_PAGE]


def char_ngrams(words, n):
    counts = Counter()
    for word in words:
        if len(word) < n:
            continue
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def analyze(all_words):
    token_count = len(all_words)
    type_count = len(set(all_words))
    word_freq = Counter(all_words)
    char_bigram_counts = char_ngrams(all_words, 2)
    char_trigram_counts = char_ngrams(all_words, 3)

    print()
    print("=" * 60)
    print("GLOBAL WORD STATISTICS")
    print("=" * 60)
    print()
    print(f"Tokens : {token_count}")
    print(f"Types  : {type_count}")
    print(f"TTR    : {type_count / token_count:.4f}")
    print(f"Hapax  : {sum(1 for c in word_freq.values() if c == 1)}")
    print()
    print("Top 25 word tokens")
    print()
    for word, count in word_freq.most_common(25):
        print(f"{word:15} {count}")

    print()
    print("=" * 60)
    print("CHARACTER BIGRAMS")
    print("=" * 60)
    print()
    for bg, count in char_bigram_counts.most_common(30):
        print(f"{bg:5} {count}")

    print()
    print("=" * 60)
    print("CHARACTER TRIGRAMS")
    print("=" * 60)
    print()
    for tg, count in char_trigram_counts.most_common(30):
        print(f"{tg:5} {count}")


def print_page(page_num, page):
    print()
    print("=" * 60)
    print(f"PAGE {page_num}")
    print("=" * 60)
    print()
    for i in range(0, len(page), WORDS_PER_LINE):
        print(" ".join(page[i:i + WORDS_PER_LINE]))


def main():
    random.seed(SEED)
    all_words = []
    for page_num in range(1, PAGES + 1):
        page = generate_page()
        all_words.extend(page)
        print_page(page_num, page)
    analyze(all_words)


if __name__ == "__main__":
    main()

Example output:

============================================================
PAGE 1
============================================================

alkkfe scadik sktlahso qokeedy qokedy chdy chol chedy
cthol cthey otedy ol chol daiin qokal qotedy
qokain cthol saiin cthol chedy idckndkerl saiin daiin
qokeedy qokedy shey qokain okol qokair dair khtrlko
qokeedy shey dain chol qokain ixrtflfkls fyxdcefh rrtyla
oenshtaly qokair sary cthol chedy fqhyaoxa cthey ol
chedy qokeedy xdffxh qotedy qokain inrqtyte qokeedy shey
aiin ixrtflfkls ol dcodyklx saiin qokal xsieesqs shol
chor ixrtflfkls cthol kiitfs qotedy ol chedy chor
xsieesqs ol chedy shol chor ol chedy or
chol chedy qokain daiin dain chol dain chol
qokair dair finiccoccxyy dain oenshtaly okain okol

============================================================
PAGE 50
============================================================

chol daiin aynxasdcici aiin qokain daiin qokair dair
aiin qotedy qokain cthol chol aynxasdcici qotedy qokain
qotedy qokain qokeedy qokedy chdy olrentat ihecreiinin shol
chor sary olrentat qokair otedy qokain shey dair
trhxxnkdea okain daiin or chol otedy cthey daiin
ecqiisq qotedy saiin daiin otedy or qokeedy qokedy
shol chor or xhaed ecqiisq axdfidrra dair cthol
chedy cthey daiin daiin lnxila ktacsxkc shey qokain
okain shol chor qotedy qokain cthey oqshrqt chol
daiin qokeedy qokedy sqsdlfoqtthc qotedy qokain axdfidrra firdsey
trhxxnkdea sqsdlfoqtthc oxeoqryr aiin qokair dair qokair dair
qokeedy iayexfd rnieiaaa chol chedy chedy oqshrqt

============================================================
PAGE 100
============================================================

qsqfrayh qokeedy qokedy ldfyyrsedffi ol saiin daiin qsqfrayh
chol chedy aiin chol daiin onrotll or chol
qokair dair chedy saiin daiin dain chol or
chol or onrotll dirtiyaofc saiin itsyne xdlrfhadl kolceeht
saiin cthol qokain qokair or chol or chol
shol chor qokain daiin xsektrfqaxs shexfc qotedy qokain
cneyrkckhrn lstnedrcoka chol daiin chol daiin lstnedrcoka aknfoetcedqe
cthol shexfc qokain daiin dain chol or or
chol ol shexfc xsektrfqaxs cthol chol or otedy
ol ytqly chol shexfc aknfoetcedqe ol chedy asqtko
dair shey qokain cthey daiin cneyrkckhrn daiin qokal
otedy chdy chol xdlrfhadl akotexxdxaic cthey daiin

Report:

Code:
============================================================
GLOBAL WORD STATISTICS
============================================================

Tokens : 9500
Types  : 1800
TTR    : 0.1895
Hapax  : 1330

Top 25 word tokens

chol            748
daiin           736
chedy           589
qokain          530
ol              427
qokeedy         289
or              277
saiin           276
otedy           275
cthol           268
dair            268
shey            247
cthey           246
qokair          246
shol            242
qotedy          231
chor            230
dain            227
qokedy          185
aiin            107
qokal           98
okol            93
sary            89
chdy            84
okain           80

============================================================
CHARACTER BIGRAMS
============================================================

ai    2538
in    2011
ol    1901
dy    1729
ch    1711
qo    1704
ed    1632
ok    1615
ho    1546
da    1291
ii    1179
he    1136
ka    1029
ir    580
sh    573
te    571
ot    569
th    561
or    561
ey    557
ct    550
ke    538
sa    433
ee    346
ar    179
al    159
ko    154
ry    147
hd    139
fa    85

============================================================
CHARACTER TRIGRAMS
============================================================

edy   1571
qok   1348
hol   1260
dai   1237
iin   1126
aii   1122
cho   986
oka   961
kai   863
ain   839
hed   594
che   591
air   522
cth   516
ote   515
ted   510
hey   493
oke   485
eed   293
kee   290
sai   277
tho   268
she   256
the   248
sho   244
qot   232
hor   230
ked   190
kal   102
kol   96

Your zipf curve:

[attachment=15776]

Your word length distribution chart.

[attachment=15777]

Oh, and for that run, some more stats I had codex retrieve.

Fixed starting vocabulary: 26 words
Random-long word occurrences: 2,342
Distinct random-long words: 1,774
Random-long words that repeated: 444
Single-use random-long words: 1,330
Extra repeat events beyond first use: 568
Highest repeat count for one random word: 5
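
For reference, figures of this kind can be derived from a plain frequency count. The following is a minimal sketch, not the script that produced the numbers above; `repeat_stats` is a hypothetical helper name and the input is a toy list rather than data from the run.

```python
from collections import Counter


def repeat_stats(long_words):
    """Summarise repetition among generated long words (hypothetical helper)."""
    freq = Counter(long_words)
    occurrences = sum(freq.values())
    distinct = len(freq)
    # Word types that were used more than once.
    repeated = sum(1 for count in freq.values() if count > 1)
    return {
        "occurrences": occurrences,
        "distinct": distinct,
        "repeated": repeated,
        "single_use": distinct - repeated,
        # Every occurrence after the first use of a word is an extra repeat.
        "extra_repeats": occurrences - distinct,
        "highest": max(freq.values(), default=0),
    }


# Toy input, not data from the run above.
toy = ["aaa", "bbb", "aaa", "ccc", "aaa", "bbb"]
stats = repeat_stats(toy)
print(stats)
```

The same identities hold for the figures listed above: 1,774 distinct + 568 extra repeats = 2,342 occurrences, and 444 repeated + 1,330 single-use = 1,774 distinct.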

I would say the biggest problem in the script is on line 60.

Code:
"".join(random.choice(VOYNICH_ALPHABET) for _ in range(length))

Each character is generated independently there. However, Voynich words exhibit strong positional dependencies. Certain character groups appear only at the beginning or end of a word, for example qo-, -dy, and -in. Statistically, the result is therefore immediately recognizable as random noise.

One possible solution might be to assign weights to the individual parts of the word.
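
A rough illustration of what such weighting could look like, as a sketch only: the part inventories and every weight below are invented for the example and are not measured from the manuscript or taken from anyone's ledger, and `pick` and `make_weighted_word` are hypothetical names.

```python
import random

# Hypothetical part inventories with made-up weights; none of these numbers
# are measured from the manuscript.
PREFIXES = {"qo": 3, "o": 2, "ch": 2, "d": 1, "": 1}
MIDDLES = {"k": 3, "t": 2, "ke": 2, "ai": 2, "l": 1}
SUFFIXES = {"dy": 3, "in": 3, "y": 2, "ol": 1, "r": 1}


def pick(rng, weighted_parts):
    """Choose one part, with probability proportional to its weight."""
    parts = list(weighted_parts)
    weights = list(weighted_parts.values())
    return rng.choices(parts, weights=weights, k=1)[0]


def make_weighted_word(rng):
    """Build a word as prefix + middle + suffix, each drawn by weight."""
    return pick(rng, PREFIXES) + pick(rng, MIDDLES) + pick(rng, SUFFIXES)


rng = random.Random(42)
words = [make_weighted_word(rng) for _ in range(10)]
print(" ".join(words))
```

Because each slot draws only from its own inventory, "q" can only ever appear word-initially and every word ends in one of the listed suffixes, which is the kind of positional constraint the independent per-character draw lacks.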

(26-05-2026, 10:55 PM)bi3mw Wrote: I would say the biggest problem in the script is on line 60.

Code:
"".join(random.choice(VOYNICH_ALPHABET) for _ in range(length))

Each character is generated independently there. However, Voynich words exhibit strong positional dependencies. Certain character groups appear only at the beginning or end of a word, for example qo-, -dy, and -in. Statistically, the result is therefore immediately recognizable as random noise.

You are ABSOLUTELY correct. But that's not my generator. User oshfdk gave instructions on how to create a fake language, and I followed those instructions to create that generator. The output of mine is on page one of this post. He believes that the copy and mutate system I've described, with its rules, is too cumbersome and not a good way to make a pseudo language, and that just inserting random words periodically is.

(26-05-2026, 10:55 PM)bi3mw Wrote: One possible solution might be to assign weights to the individual parts of the word.

PRECISELY... which is what my ledger does. Weighted letters.

The point I hope he gets is that making a pseudo language, if that's what the Voynich is, requires some rules. Reducing those rules to a minimal set that still produces something statistically like a language is what I'm trying to prove. Random words just don't work. In order to make that generator work, yes, that random word generator has got to go. But, what rules do you apply when creating a new word or modifying a word? That's where my ledger works and works well, as long as it's used with human intuition and constraints.

But, if the Voynich really is a pseudo language, no generator will be able to exactly reproduce it without cheating. Statistically? Yes, I think so. Word for word? Absolutely not. Could his idea for a generator work? Very possibly. It has potential. Is it going to require a good set of rules to do so? Of course.

(26-05-2026, 08:42 PM)Dunsel Wrote: Your python generator per your exact specification:

I don't think it's a good idea to code something according to some interpretation of loose specs without clarifying the details.

I never specified how exactly words should be mixed, how random longer words should be generated, and the exact meaning of "occasionally repeat".

Let me make it more specific and yet still manageable to create manually:

1) Given three groups: common words (A), common word combinations (B) and random long words (C), I think it makes sense to never put A and B next to one another, and C can go between 1 and 3 times in a row. When selecting from A or B, select from the top of the list more often. Reality check: this is a very easy and simple rule to follow, and the mixing order is straightforward: get either one of A or one of B, then between 1 and 3 of C, then repeat. Prioritizing the top of the list is the same as going down the list and randomly selecting the first word that seems fine; this will naturally give higher probability to higher A/B words.

2) Having just a list of common words, the easiest way of generating long words is by trigrams taken from the common words list. Take any word-starting bigram ab from the existing preset A words and then continue appending one character at a time using another existing trigram, like abc, bcd, then cde, etc., and always finish with a word-ending trigram. If at any moment the process is stuck (no non-word-ending trigram is available), it's ok to continue using a bigram. So, given "qokedy chol kotaiin chedy olkedy daraly" we can generate words like cholkedy. Reality check: given a paper with the list of common words this is a simple task, even though it may require a bit of thinking to avoid failing step 3. Edit: This step can be sped up significantly if a list of trigrams is prepared separately from A as a table. This won't even be anachronistic: as far as I understand, the idea of creating tables to enumerate some elements was a very common practice in computus manuscripts.

3) Your idea of repeating one of the last 10 words 25% of the time is a good one and would work nicely when writing, but I would adjust this a bit. Remembering my own list of restrictions for what a scribe could comfortably do, I'd expand the list of last words to 20. Additionally, if we want it to look like a natural language, we should never generate new words that either belong to A or belong to the last 20 of C. And we had better avoid repeating the same word twice in a row, so we'll never copy a word that would repeat the last word we added. Reality check: sometimes this can be tedious, but generally it is not very hard. Given there are only 20 preset common words, it's easy to remember them and then to tell apart generated and preset words, and when creating a new word it's not very hard to guide the process to avoid generating a word from A or a recently used word.
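
To make step 2 concrete, here is a minimal sketch that checks whether a word can be built by the trigram chaining described there, using the six sample words from step 2 as the whole of list A. The helper names `build_tables` and `can_derive` are made up for illustration. It only tests derivability (word-starting bigram, every trigram attested, word-ending trigram last) and does not model the bigram fallback or the restrictions from step 3.

```python
SAMPLE_WORDS = "qokedy chol kotaiin chedy olkedy daraly".split()


def build_tables(words):
    """Collect word-starting bigrams, all trigrams and word-ending trigrams."""
    starts = {w[:2] for w in words if len(w) >= 2}
    trigrams = {w[i:i + 3] for w in words for i in range(len(w) - 2)}
    finals = {w[-3:] for w in words if len(w) >= 3}
    return starts, trigrams, finals


def can_derive(word, words):
    """True if word starts with a known bigram, chains only through attested
    trigrams, and finishes with a word-ending trigram."""
    starts, trigrams, finals = build_tables(words)
    if len(word) < 3 or word[:2] not in starts:
        return False
    chain = [word[i:i + 3] for i in range(len(word) - 2)]
    return all(t in trigrams for t in chain) and chain[-1] in finals


print(can_derive("cholkedy", SAMPLE_WORDS))  # True: cho, hol, olk, lke, ked, edy
print(can_derive("qokaiin", SAMPLE_WORDS))   # False: "oka" is not in the table
```

Precomputing the three sets once corresponds to the separately prepared trigram table mentioned in the edit to step 2.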

Code:
# mixwords_literal_generator.py

# No external files required.

#

# Literal version of:

# 1) write down 20-30 common words

# 2) create 10-20 random common pairings from these words

# 3) choose either a common word or a common combination

# 4) after that, add 1-3 bigram-seeded, trigram-built longer words

# 5) once 20 long words exist in the pipeline, repeat one of them 25% of the time

#

# No mutation.

# No ledger.

# No syllables.

# No hidden morphology.

import argparse

import random

from collections import Counter

SEED = 42

PAGES = 100

WORDS_PER_PAGE = 95

WORDS_PER_LINE = 8

COMMON_COMBINATION_MIN_COUNT = 10

COMMON_COMBINATION_MAX_COUNT = 20

COMMON_COMBINATION_WORDS = 2

LONG_WORD_MIN_LEN = 5

LONG_WORD_MAX_LEN = 11

LONG_WORD_REPEAT_CHANCE = 0.25

RECENT_LONG_WORD_WINDOW = 20

LONG_WORDS_PER_MIX_MIN = 1

LONG_WORDS_PER_MIX_MAX = 3

COMMON_WORD_CHANCE = 0.5

COMMON_RANK_TOP_WEIGHT = 3.0

COMMON_RANK_BOTTOM_WEIGHT = 1.0

MAX_TRIGRAM_WORD_ATTEMPTS = 64

MAX_MIXED_LONG_WORD_ATTEMPTS = MAX_TRIGRAM_WORD_ATTEMPTS * 8

COMMON_WORDS = [

    "daiin", "ol", "or", "qoky", "dary", 

    "chol", "qokair", "sair", "opdor", "ockhol",

    "chedy", "qokchy", "ofain", "ykal", "oteedy",

    "sheey", "otey", "cphol", "sholdaly", "cheol",

    "saral", "doroldal", "qolky", "qokeedy", "chdy",

]

COMMON_WORDS_ENGLISH = [

        "the", "and", "are", "a", "this",

        "is", "unless", "more", "each",

        "together", "well", "anything", "right", "wrong",

        "stay", "move", "wish", "however", "two",

        "one", "four", "because", "always", "trouble",

]

COMMON_COMBINATIONS = []

COMMON_SOURCE_WORD_TOKENS = []

COMMON_SOURCE_WORD_SET = set()

COMMON_WORD_SET = set()

COMMON_WORD_WEIGHTS = []

COMMON_COMBINATION_WEIGHTS = []

WORD_START_BIGRAMS = []

BIGRAM_OPTIONS = {}

TRIGRAMS = []

TRIGRAM_OPTIONS = {}

NONFINAL_TRIGRAMS = set()

FINAL_BIGRAMS = set()

FINAL_TRIGRAMS = set()

WORD_FINAL_SUFFIXES = {}

LONG_WORD_FALLBACKS = []

def make_common_combinations():

    target_count = min(

        random.randint(

            COMMON_COMBINATION_MIN_COUNT,

            COMMON_COMBINATION_MAX_COUNT,

        ),

        len(COMMON_WORDS) * (len(COMMON_WORDS) - 1),

    )

    combinations = []

    seen = set()

    while len(combinations) < target_count:

        # Avoid A A pairs like "daiin daiin".

        pair = tuple(random.sample(COMMON_WORDS, COMMON_COMBINATION_WORDS))

        if pair in seen:

            continue

        seen.add(pair)

        combinations.append(list(pair))

    return combinations

def make_rank_weights(items):

    count = len(items)

    if count <= 1:

        return [COMMON_RANK_BOTTOM_WEIGHT] * count

    step = (

        COMMON_RANK_BOTTOM_WEIGHT - COMMON_RANK_TOP_WEIGHT

    ) / (count - 1)

    return [

        COMMON_RANK_TOP_WEIGHT + (step * index)

        for index in range(count)

    ]

def weighted_choice(items, weights):

    return random.choices(items, weights=weights, k=1)[0]

def initialize_common_sources():

    global COMMON_COMBINATIONS

    global COMMON_SOURCE_WORD_TOKENS

    global COMMON_SOURCE_WORD_SET

    global COMMON_WORD_SET

    global COMMON_WORD_WEIGHTS

    global COMMON_COMBINATION_WEIGHTS

    global WORD_START_BIGRAMS

    global BIGRAM_OPTIONS

    global TRIGRAMS

    global TRIGRAM_OPTIONS

    global NONFINAL_TRIGRAMS

    global FINAL_BIGRAMS

    global FINAL_TRIGRAMS

    global WORD_FINAL_SUFFIXES

    global LONG_WORD_FALLBACKS

    COMMON_COMBINATIONS = make_common_combinations()

    COMMON_WORD_SET = set(COMMON_WORDS)

    COMMON_WORD_WEIGHTS = make_rank_weights(COMMON_WORDS)

    COMMON_COMBINATION_WEIGHTS = make_rank_weights(COMMON_COMBINATIONS)

    COMMON_SOURCE_WORD_TOKENS = COMMON_WORDS + [

        word for combo in COMMON_COMBINATIONS for word in combo

    ]

    COMMON_SOURCE_WORD_SET = set(COMMON_SOURCE_WORD_TOKENS)

    WORD_START_BIGRAMS = [

        word[:2] for word in COMMON_SOURCE_WORD_TOKENS if len(word) >= 2

    ]

    TRIGRAMS = [

        word[i:i + 3]

        for word in COMMON_SOURCE_WORD_TOKENS

        for i in range(len(word) - 2)

    ]

    BIGRAM_OPTIONS = {}

    TRIGRAM_OPTIONS = {}

    NONFINAL_TRIGRAMS = set()

    FINAL_BIGRAMS = set()

    FINAL_TRIGRAMS = set()

    WORD_FINAL_SUFFIXES = {}

    for word in COMMON_SOURCE_WORD_TOKENS:

        if len(word) >= 2:

            FINAL_BIGRAMS.add(word[-2:])

        if len(word) >= 3:

            FINAL_TRIGRAMS.add(word[-3:])

        for i in range(len(word) - 1):

            bigram = word[i:i + 2]

            BIGRAM_OPTIONS.setdefault(bigram[0], []).append(bigram)

        for i in range(len(word) - 2):

            trigram = word[i:i + 3]

            TRIGRAM_OPTIONS.setdefault(trigram[:2], []).append(trigram)

            if i < len(word) - 3:

                NONFINAL_TRIGRAMS.add(trigram)

            suffix = word[i:]

            WORD_FINAL_SUFFIXES.setdefault(suffix[:2], []).append(suffix)

    for prefix, suffixes in WORD_FINAL_SUFFIXES.items():

        WORD_FINAL_SUFFIXES[prefix] = sorted(suffixes, key=len)

    LONG_WORD_FALLBACKS = [

        word for word in COMMON_SOURCE_WORD_SET if len(word) >= LONG_WORD_MIN_LEN

    ]

def continue_with_bigram(word, target_length, disallowed_words):

    next_bigrams = BIGRAM_OPTIONS.get(word[-1], [])

    allowed_bigrams = [

        bigram

        for bigram in next_bigrams

        if (

            len(word) + 1 < target_length

            or (

                word + bigram[1] not in disallowed_words

                and (word + bigram[1])[-2:] in FINAL_BIGRAMS

                and (word + bigram[1])[-3:] in FINAL_TRIGRAMS

            )

        )

    ]

    if not allowed_bigrams:

        return None

    return word + random.choice(allowed_bigrams)[1]

def make_trigram_long_word(disallowed_words=None):

    target_length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN)

    fallback = None

    disallowed_words = set(disallowed_words or ())

    for _ in range(MAX_TRIGRAM_WORD_ATTEMPTS):

        word = random.choice(WORD_START_BIGRAMS)

        completed = False

        while len(word) < target_length:

            suffixes = WORD_FINAL_SUFFIXES.get(word[-2:], [])

            exact_suffixes = [

                suffix

                for suffix in suffixes

                if len(word) + len(suffix) - 2 == target_length

                and word + suffix[2:] not in disallowed_words

            ]

            if exact_suffixes:

                word += random.choice(exact_suffixes)[2:]

                completed = True

                break

            next_trigrams = TRIGRAM_OPTIONS.get(word[-2:])

            allowed_trigrams = [

                trigram for trigram in next_trigrams if trigram in NONFINAL_TRIGRAMS

            ] if next_trigrams else []

            if allowed_trigrams:

                word += random.choice(allowed_trigrams)[2]

                continue

            word = continue_with_bigram(word, target_length, disallowed_words)

            if word is None:

                break

            if len(word) == target_length:

                completed = True

                break

        if not completed or len(word) < LONG_WORD_MIN_LEN:

            continue

        if fallback is None:

            fallback = word

        if word not in COMMON_SOURCE_WORD_SET:

            return word

    if fallback is not None:

        return fallback

    allowed_fallbacks = [

        word for word in LONG_WORD_FALLBACKS if word not in disallowed_words

    ]

    if allowed_fallbacks:

        return random.choice(allowed_fallbacks)

    return random.choice(LONG_WORD_FALLBACKS)

def append_recent_long_word(recent_long_words, word):

    recent_long_words.append(word)

    if len(recent_long_words) > RECENT_LONG_WORD_WINDOW:

        recent_long_words.pop(0)

def make_mixed_long_word(recent_long_words, previous_word=None):

    if (

        len(recent_long_words) >= RECENT_LONG_WORD_WINDOW

        and random.random() < LONG_WORD_REPEAT_CHANCE

    ):

        repeat_candidates = [

            word for word in recent_long_words if word != previous_word

        ]

        if repeat_candidates:

            return random.choice(repeat_candidates)

    fallback = None

    recent_long_word_set = set(recent_long_words)

    disallowed_words = COMMON_WORD_SET | recent_long_word_set

    for _ in range(MAX_MIXED_LONG_WORD_ATTEMPTS):

        word = make_trigram_long_word(disallowed_words)

        if word in COMMON_WORD_SET or word in recent_long_word_set:

            continue

        if fallback is None:

            fallback = word

        if word != previous_word:

            return word

    if fallback is not None and fallback != previous_word:

        return fallback

    distinct_fallbacks = [

        word

        for word in LONG_WORD_FALLBACKS

        if (

            word != previous_word

            and word not in COMMON_WORD_SET

            and word not in recent_long_word_set

        )

    ]

    if distinct_fallbacks:

        return random.choice(distinct_fallbacks)

    if fallback is not None:

        return fallback

    raise RuntimeError(

        "Unable to generate a valid long word that is not a common word "

        "or one of the recent long words."

    )

def append_marked_word(words, marker, word):

    words.append((marker, word))

def generate_words(total_words, recent_long_words):

    words = []

    while len(words) < total_words:

        if random.random() < COMMON_WORD_CHANCE:

            append_marked_word(

                words,

                "A",

                weighted_choice(COMMON_WORDS, COMMON_WORD_WEIGHTS),

            )

        else:

            for word in weighted_choice(

                COMMON_COMBINATIONS,

                COMMON_COMBINATION_WEIGHTS,

            ):

                if len(words) >= total_words:

                    break

                append_marked_word(words, "B", word)

        long_word_count = random.randint(

            LONG_WORDS_PER_MIX_MIN,

            LONG_WORDS_PER_MIX_MAX,

        )

        for _ in range(long_word_count):

            if len(words) >= total_words:

                break

            previous_word = words[-1][1] if words else None

            word = make_mixed_long_word(recent_long_words, previous_word)

            append_marked_word(words, "C", word)

            append_recent_long_word(recent_long_words, word)

    return words[:total_words]

def char_ngrams(words, n):

    counts = Counter()

    for word in words:

        if len(word) < n:

            continue

        for i in range(len(word) - n + 1):

            counts[word[i:i + n]] += 1

    return counts

def analyze(all_words):

    token_count = len(all_words)

    type_count = len(set(all_words))

    word_freq = Counter(all_words)

    char_bigram_counts = char_ngrams(all_words, 2)

    char_trigram_counts = char_ngrams(all_words, 3)

    print()

    print("=" * 60)

    print("GLOBAL WORD STATISTICS")

    print("=" * 60)

    print()

    print(f"Tokens : {token_count}")

    print(f"Types  : {type_count}")

    print(f"TTR    : {type_count / token_count:.4f}")

    print(f"Hapax  : {sum(1 for c in word_freq.values() if c == 1)}")

    print()

    print("Top 25 word tokens")

    print()

    for word, count in word_freq.most_common(25):

        print(f"{word:15} {count}")

    print()

    print("=" * 60)

    print("CHARACTER BIGRAMS")

    print("=" * 60)

    print()

    for bg, count in char_bigram_counts.most_common(30):

        print(f"{bg:5} {count}")

    print()

    print("=" * 60)

    print("CHARACTER TRIGRAMS")

    print("=" * 60)

    print()

    for tg, count in char_trigram_counts.most_common(30):

        print(f"{tg:5} {count}")

def format_token(entry, debug=False):

    marker, word = entry

    if debug:

        return f"{marker} {word}"

    return word

def print_page(page_num, page, debug=False):

    print()

    print("=" * 60)

    print(f"PAGE {page_num}")

    print("=" * 60)

    print()

    for i in range(0, len(page), WORDS_PER_LINE):

        print(" ".join(format_token(entry, debug) for entry in page[i:i + WORDS_PER_LINE]))

def main():

    parser = argparse.ArgumentParser()

    parser.add_argument("--debug", action="store_true")

    parser.add_argument("--english", action="store_true")

    args = parser.parse_args()

    # Swap in the English seed list before initialize_common_sources() runs.

    if args.english:

        global COMMON_WORDS

        COMMON_WORDS = COMMON_WORDS_ENGLISH

    random.seed(SEED)

    initialize_common_sources()

    recent_long_words = []

    total_words = PAGES * WORDS_PER_PAGE

    all_entries = generate_words(total_words, recent_long_words)

    all_words = [word for _, word in all_entries]

    for page_num in range(1, PAGES + 1):

        start = (page_num - 1) * WORDS_PER_PAGE

        end = start + WORDS_PER_PAGE

        page = all_entries[start:end]

        print_page(page_num, page, debug=args.debug)

    analyze(all_words)

if __name__ == "__main__":

    main()

The TTR stats:
Tokens : 9500
Types : 727
TTR : 0.0765
Hapax : 223
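
For anyone who wants to check these figures on their own output: TTR is types divided by tokens (727 / 9500 ≈ 0.0765 here), and hapax counts the words that occur exactly once. Below is a minimal sketch of that calculation, following what analyze() in the script does. The function name ttr_stats and the sample line are only for illustration.

```python
from collections import Counter


def ttr_stats(words):
    """Return (tokens, types, ttr, hapax) for a list of word tokens."""
    freq = Counter(words)
    tokens = len(words)
    types = len(freq)
    # Guard against an empty list so the ratio never divides by zero.
    ttr = types / tokens if tokens else 0.0
    # A hapax is a word type that occurs exactly once.
    hapax = sum(1 for count in freq.values() if count == 1)
    return tokens, types, ttr, hapax


# Illustrative sample only, not real generator output.
sample = "daiin qokal otey daiin chedy otey daiin sair".split()
tokens, types, ttr, hapax = ttr_stats(sample)
print(f"Tokens : {tokens}")
print(f"Types  : {types}")
print(f"TTR    : {ttr:.4f}")
print(f"Hapax  : {hapax}")
```

A low TTR means the text reuses a small vocabulary heavily, which is why the two runs in this post differ so much once the seed words change.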

What it looks like:

oteedy chedary qoldaly cholky qoky saiiral chedy cpholky
chdykaiin chedary ol daiin chdoroldal otey opdorolky ockhol
ykair otey shedal chedaiin doroldal dorolkchol sheey qokain
orolky opdor daiin qokal oteedor oteedykaiin qoky daiin
chdaiiiral dorolkchol qokal dary dorolkchol sair otey daraly
cholky ykaiiral chdy chol dalykal qoldarary cholkykal oteedy
sholdaly choldaldaly oroldaiin daraly chdy chol oldalykal oroldalky
qokeedaly qolky sholdal oldaiiral doroldal doroldaly ockhol dorolkchol
qokeedy opdor oroldaraiin ockholdaly qokal dary oldaldaly qokeey
daiin qokal sair daldalykair oteey sholkeey chdy choldaiin
sholdaly ykal cpholkykal dary oteedykaiin dary cholky cpholky
qokain qokeedy ockholkair cpholkykal ockholky cheol choldal

However, I also wouldn't choose the strange fixed-order sequences of Voynichese if I were to create a fake language. For comparison, here is the same algorithm seeded with some English words:

The TTR stats:
Tokens : 9500
Types : 1296
TTR : 0.1364
Hapax : 595

What it looks like:

one the troneach anythe and thingethe one moreach
twover wroneach a onlessshing wrong more alwayshess foublestach
righell one the howell tronghthe alwaysshis and the
becaushis eacaustay twaything right thinythis always anytheach more
wish ongethe are unless anythevever howevething two wrong
anytheach wroneserong more this wever wrong more aronghis
onleseress are therongethe morouble moreless each mover howeless
are unless thing alwaysever stay tronless howevething always
four arong two wrong howevever right anytheach twong
anythe is tronless wrong more wrouble because trouble
tronlesever is eachething thether thecause and the unlestay
because trouble morouble and issereless wrouble thecausstay

Add a simple substitution on top of this (replace each English letter with a glyph from an invented alphabet) and you have what looks like plausible text in some exotic language.
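
The substitution step could look roughly like this. It is only a sketch: the 26 placeholder symbols stand in for hand-drawn glyphs, and the names PLACEHOLDER_GLYPHS and build_table are made up for this example.

```python
import random
import string

# Placeholder "invented alphabet": 26 ASCII symbols standing in for new glyphs.
PLACEHOLDER_GLYPHS = string.digits + "!@#$%&*+=?<>^~/|"


def build_table(seed=42):
    """Build a one-to-one mapping from a-z to the placeholder glyphs."""
    glyphs = list(PLACEHOLDER_GLYPHS)
    if len(set(glyphs)) != len(string.ascii_lowercase):
        raise ValueError("need exactly 26 distinct glyphs")
    # Shuffle so the mapping is not simply alphabetical.
    random.Random(seed).shuffle(glyphs)
    return str.maketrans(string.ascii_lowercase, "".join(glyphs))


table = build_table()
line = "one the troneach anythe and thingethe one moreach"
encoded = line.translate(table)
print(encoded)
```

Because the mapping is one-to-one, word lengths, repeats and all letter statistics carry over unchanged. Only the look of the text changes.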

(26-05-2026, 06:14 PM)oshfdk Wrote: In fact, this way of creating a fake language is much simpler and produces a much more plausible result:
1) create a list of common words, about 20-30, write them down
2) create 10-20 common combinations of these common words, write them down
3) start writing, mixing common words and common word combinations with random longer words, which occasionally repeat
4) done

Ok, so you saw the output of your generator idea. Now, let me show you the difference when my ledger is used to create new words:

============================================================
PAGE 51
============================================================

daiin oroksy qoeeey qkamamaiiiin oroksy otedy shey saiin
daiin or qokain daiin rotain dain or cthol
oltheeo chol daiin cthey qokal otedy ol cthol
chedy cheaiiky daiin shey qokain qokair dair cpholsho
qokain qokain daiin daiin chol chedy cthol cthey
solycpco yaiiiiiin cthey dycfhea sary lcksheol solycpco cthol
cheaiiky olokchodaiin oaiiim aiin olaldear shol okol ol
cthosheesheo qokain daiin cheeolsheeoy olololarcheo shey qokain ol
chedy qokair chodalshof dchdchain sheochocy ytheeor shey dain
orchosheotcy qokair dair otedy sholar chodalshof otedy ol
dain chol aiin or chol dain chol qokair
choteee qokaiiiin chodalshof qokeedy dain dain chol

Now you can see that the random words are a good bit more Voynich-like. But they still have a lot of issues. They follow weights from the Voynich in picking letters, but that's still not enough.

I took my basic "don't look stupid" rules from my generator and added them to yours. Now look at the difference.

============================================================
PAGE 51
============================================================

chor daiin qokal daiiin chthiin chol daiin qokal
chthiin ychind chol daiin solaiidaiiin chol shey chol
daiin otedy otedy ychos or sodaindaiiir otaich cthey
or chedy chedy dain chol okol cheolys dair
or choqolaiidy qokain daiin qokain or chol cthol
chedy qokeedy chdy olshey qkolsho aiin chotchy or
chol qokain cthey sary chdaiiit sheomm daiiky ikhain
chol daiin dainochokheo qokeedy qokedy qokeedy qokedy ytodoaiim
sheomm qokal ofshodain cthey okol sheeeotor qocheochody saiin
daiin cthey daiin ofshodain shol dain qokeedy qokedy
cheeotsamo soshy shol qokeedy qokedy or chol cthey
daiin daiiky qokeedy qokedy sheocy chol qokair

Now we have some control in those random words over vowel and consonant structure, as well as over repeated vowels and repeated consonants. It's starting to look more like Voynich instead of word soup, but those random words still don't look Voynich. There are not enough rules. To get that right, I'd have to do some machine learning: learn which letters follow which other letters and which bigrams follow other bigrams, and add weighting to that.
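
The simplest form of "learning what letters follow what other letters" is plain counting, with no ML library needed. Here is a rough sketch of that idea, not the ledger itself. It trains on a handful of the common words from this thread, which is far too small a sample to say anything about the manuscript. The names learn_transitions and generate and the START / END markers are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical word-boundary markers

TRAINING_WORDS = [
    "daiin", "dain", "chol", "chedy", "qokain", "qokeedy",
    "otedy", "shey", "cthey", "shol", "dair", "qokal",
]


def learn_transitions(words):
    """Count how often each character follows each other character."""
    table = defaultdict(Counter)
    for word in words:
        if not word:
            continue
        chars = [START] + list(word) + [END]
        for left, right in zip(chars, chars[1:]):
            table[left][right] += 1
    return table


def generate(table, rng, max_len=12):
    """Walk the learned table from START until END is drawn or max_len is hit."""
    out = []
    current = START
    while len(out) < max_len:
        followers = table[current]
        nxt = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        current = nxt
    return "".join(out)


rng = random.Random(42)
table = learn_transitions(TRAINING_WORDS)
print(" ".join(generate(table, rng) for _ in range(8)))
```

Every adjacent letter pair in the output was seen in the training words, weighted by how often it was seen. Conditioning on the previous two letters instead of one would be the "bigrams follow other bigrams" step.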

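To run this, download my Ledger_Scribe1.json from the repository and then run the command below. The script looks for Ledger_scribe1.json by default, so on a case-sensitive file system pass the exact file name to --ledger: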

py minimal_phrase_generator_v3.py --ledger Ledger_scribe1.json --pages 100 --print-all-pages

Code:
# minimal_phrase_generator_v3.py
#
# "Helped" version of the forum proposal:
#
#  1) fixed common words
#  2) fixed common combinations
#  3) mix those with occasional longer words
#  4) longer words are generated by using existing Ledger_scribe1.json
#  5) generated longer words may repeat locally
#
# Required external file:
#  Ledger_scribe1.json
#
# Example:
#  py minimal_phrase_generator_v3.py --ledger Ledger_scribe1.json --pages 100 --print-all-pages

import argparse

import json

import random

from collections import Counter

from pathlib import Path

# ============================================================

# SETTINGS

# ============================================================

DEFAULT_SEED = 42

DEFAULT_PAGES = 100

DEFAULT_WORDS_PER_PAGE = 95

DEFAULT_WORDS_PER_LINE = 8

# Branches are chosen uniformly per draw, with no hidden weighting between modes.

# A combination adds two tokens per draw, so token shares are not uniform.

MODES = [

    "common_word",

    "common_combination",

    "ledger_word",

]

# This is the smallest operational version of

# "random longer words which occasionally repeat."

LEDGER_WORD_REPEAT_CHANCE = 0.25

RECENT_LEDGER_WORD_WINDOW = 10

# Ledger does not encode word length, so this is the one explicit external assumption.

MIN_LEDGER_WORD_LEN = 5

MAX_LEDGER_WORD_LEN = 12

MAX_ATTEMPTS_PER_LEDGER_WORD = 200

GALLOWS = set("ktpf")

VOWELS = set("aeioy")

# Basic "don't look stupid" guards copied from the v11 generator idea.

# These are deliberately simple visual plausibility filters, not grammar.

MAX_VOWEL_RUN = 4

MAX_CONSONANT_RUN = 4

MAX_TOKEN_LEN = 12

# Repeat/family guards for newly generated ledger words.

RECENT_TOKEN_REPEAT_LIMIT = 3

MAX_LEDGER_TOKEN_PAGE_COUNT = 5

MAX_LEDGER_FAMILY_PAGE_COUNT = 9

# ============================================================

# FIXED COMMON WORDS

# ============================================================

COMMON_WORDS = [

    "daiin", "dain", "ol", "or", "chol",

    "chedy", "qokain", "qokeedy", "qotedy", "otedy",

    "shey", "cthey", "cthol", "shol", "chor",

    "dair", "qokair", "saiin", "aiin", "okain",

    "sary", "okol", "qol", "qokal", "chdy",

]

# ============================================================

# FIXED COMMON COMBINATIONS

# ============================================================

COMMON_COMBINATIONS = [

    ["qokain", "daiin"],

    ["chol", "chedy"],

    ["qokeedy", "qokedy"],

    ["ol", "chedy"],

    ["dain", "chol"],

    ["shey", "qokain"],

    ["cthey", "daiin"],

    ["shol", "chor"],

    ["qotedy", "qokain"],

    ["otedy", "ol"],

    ["saiin", "daiin"],

    ["qokair", "dair"],

    ["chol", "daiin"],

    ["cthol", "chedy"],

    ["or", "chol"],

]

# ============================================================

# BASIC HELPERS

# ============================================================

def weighted_choice(rng, values, weights=None):

    if not values:

        return None

    if weights is None:

        return rng.choice(list(values))

    return rng.choices(list(values), weights=list(weights), k=1)[0]

def char_ngrams(words, n):

    counts = Counter()

    for word in words:

        for i in range(0, len(word) - n + 1):

            counts[word[i:i+n]] += 1

    return counts

def max_run(token, charset):

    best = 0

    current = 0

    for ch in token:

        if ch in charset:

            current += 1

            best = max(best, current)

        else:

            current = 0

    return best

def family_form(token):

    return "".join(ch for ch in token if ch not in GALLOWS)

def passes_dls(token, ledger, page):

    """

    Basic v11-style don't-look-stupid filter.

    Applied only to newly generated ledger words, not to the fixed

    common-word/common-combination scaffolding.

    """

    if not token:

        return False

    if len(token) > MAX_TOKEN_LEN:

        return False

    if max_run(token, VOWELS) > MAX_VOWEL_RUN:

        return False

    consonants = set(ledger.alphabet) - VOWELS

    if max_run(token, consonants) > MAX_CONSONANT_RUN:

        return False

    if page and token == page[-1]:

        return False

    if page[-RECENT_LEDGER_WORD_WINDOW:].count(token) >= RECENT_TOKEN_REPEAT_LIMIT:

        return False

    if page.count(token) >= MAX_LEDGER_TOKEN_PAGE_COUNT:

        return False

    family = family_form(token)

    if family:

        family_count = sum(

            1 for existing in page

            if family_form(existing) == family

        )

        if family_count >= MAX_LEDGER_FAMILY_PAGE_COUNT:

            return False

    return ledger.validate(token)

# ============================================================

# YOUR LEDGER FORMAT

# ============================================================

class Ledger:

    def __init__(self, path):

        path = Path(path)

        if not path.exists():

            raise FileNotFoundError(

                f"Ledger file not found: {path}\n"

                "Put Ledger_scribe1.json beside this script or pass --ledger PATH."

            )

        with path.open("r", encoding="utf-8") as f:

            data = json.load(f)

        self.metadata = data.get("metadata", {})

        self.rows = data["ledger"]

        self.alphabet = list(data.get("alphabet", self.rows.keys()))

        self.columns = tuple(self.metadata.get("columns", ["prefix", "midfix", "suffix"]))

        self.tiers = tuple(self.metadata.get("tiers", ["80", "18", "2"]))

        self.short_tokens = set((self.metadata.get("short_tokens") or {}).keys())

        self.tier_weights = self._tier_weights(self.tiers)

        self.first_token_values, self.first_token_weights = self._load_first_token_weights()

        self.followers = {}

        for glyph, row in self.rows.items():

            self.followers[glyph] = {}

            for column in self.columns:

                values = []

                weights = []

                for tier, tier_weight in self.tier_weights:

                    glyphs = row.get(column, {}).get(tier, [])

                    if not glyphs:

                        continue

                    each_weight = tier_weight / len(glyphs)

                    for follower in glyphs:

                        values.append(follower)

                        weights.append(each_weight)

                self.followers[glyph][column] = (values, weights)

    def _tier_weights(self, tiers):

        numeric = []

        for tier in tiers:

            try:

                numeric.append((tier, float(tier) / 100.0))

            except ValueError:

                numeric.append((tier, 1.0))

        total = sum(weight for _, weight in numeric) or 1.0

        return [

            (tier, weight / total)

            for tier, weight in numeric

        ]

    def _load_first_token_weights(self):

        raw_weights = (

            self.metadata.get("first_token_weights")

            or self.metadata.get("start_token_weights")

        )

        if raw_weights:

            pairs = [

                (glyph, float(weight))

                for glyph, weight in raw_weights.items()

                if glyph in self.rows and float(weight) > 0

            ]

            if pairs:

                pairs.sort()

                values = [p[0] for p in pairs]

                weights = [p[1] for p in pairs]

                return values, weights

        raw_counts = (

            self.metadata.get("first_tokens")

            or self.metadata.get("start_tokens")

        )

        if raw_counts:

            pairs = [

                (glyph, float(count))

                for glyph, count in raw_counts.items()

                if glyph in self.rows and float(count) > 0

            ]

            if pairs:

                pairs.sort()

                values = [p[0] for p in pairs]

                weights = [p[1] for p in pairs]

                return values, weights

        return list(self.rows.keys()), None

    def choose_start_glyph(self, rng):

        return weighted_choice(

            rng,

            self.first_token_values,

            self.first_token_weights,

        )

    def choose_follower(self, rng, left, column):

        values, weights = self.followers.get(left, {}).get(column, ([], []))

        if not values:

            return None

        return weighted_choice(rng, values, weights)

    def legal_transition(self, left, right, column):

        row = self.rows.get(left, {}).get(column, {})

        for tier in self.tiers:

            if right in row.get(tier, []):

                return True

        return False

    def has_multiple_gallows(self, token):

        return sum(1 for ch in token if ch in GALLOWS) > 1

    def validate(self, token):

        if not token:

            return False

        if self.has_multiple_gallows(token):

            return False

        if len(token) == 1:

            return token in self.short_tokens

        if token[0] not in self.rows:

            return False

        if not self.legal_transition(token[0], token[1], "prefix"):

            return False

        for index in range(2, len(token) - 1):

            if not self.legal_transition(token[index - 1], token[index], "midfix"):

                return False

        return self.legal_transition(token[-2], token[-1], "suffix")

    def generate_word(self, rng, page=None, min_len=MIN_LEDGER_WORD_LEN, max_len=MAX_LEDGER_WORD_LEN):

        page = page or []

        for _ in range(MAX_ATTEMPTS_PER_LEDGER_WORD):

            target_len = rng.randint(min_len, max_len)

            chars = [self.choose_start_glyph(rng)]

            if not chars[0]:

                continue

            while len(chars) < target_len:

                left = chars[-1]

                if len(chars) == 1:

                    column = "prefix"

                elif len(chars) == target_len - 1:

                    column = "suffix"

                else:

                    column = "midfix"

                right = self.choose_follower(rng, left, column)

                if right is None:

                    break

                chars.append(right)

            token = "".join(chars)

            if len(token) == target_len and passes_dls(token, self, page):

                return token

        raise RuntimeError(

            "Could not generate a ledger-valid DLS-passing word after "

            f"{MAX_ATTEMPTS_PER_LEDGER_WORD} attempts."

        )

# ============================================================

# GENERATION

# ============================================================

def generate_page(rng, ledger, words_per_page):

    page = []

    recent_ledger_words = []

    while len(page) < words_per_page:

        mode = rng.choice(MODES)

        if mode == "common_word":

            page.append(rng.choice(COMMON_WORDS))

        elif mode == "common_combination":

            page.extend(rng.choice(COMMON_COMBINATIONS))

        elif mode == "ledger_word":

            word = None

            if recent_ledger_words and rng.random() < LEDGER_WORD_REPEAT_CHANCE:

                repeat_candidates = [

                    candidate for candidate in recent_ledger_words

                    if passes_dls(candidate, ledger, page)

                ]

                if repeat_candidates:

                    word = rng.choice(repeat_candidates)

            if word is None:

                word = ledger.generate_word(rng, page=page)

                recent_ledger_words.append(word)

                if len(recent_ledger_words) > RECENT_LEDGER_WORD_WINDOW:

                    recent_ledger_words.pop(0)

            page.append(word)

    return page[:words_per_page]

# ============================================================

# OUTPUT / ANALYSIS

# ============================================================

def fixed_vocab():

    vocab = set(COMMON_WORDS)

    for combo in COMMON_COMBINATIONS:

        vocab.update(combo)

    return vocab

def print_page(page_num, page, words_per_line):

    print()

    print("=" * 60)

    print(f"PAGE {page_num}")

    print("=" * 60)

    print()

    for i in range(0, len(page), words_per_line):

        print(" ".join(page[i:i + words_per_line]))

def analyze(all_words):

    vocab = fixed_vocab()

    freq = Counter(all_words)

    generated_occurrences = [

        word for word in all_words

        if word not in vocab

    ]

    generated_freq = Counter(generated_occurrences)

    print()

    print("=" * 60)

    print("GLOBAL STATISTICS")

    print("=" * 60)

    print()

    print(f"Tokens                    : {len(all_words)}")

    print(f"Types                      : {len(freq)}")

    print(f"TTR                        : {(len(freq) / len(all_words) if all_words else 0.0):.4f}")

    print(f"Hapax                      : {sum(1 for c in freq.values() if c == 1)}")

    print(f"Fixed vocabulary size      : {len(vocab)}")

    print(f"Ledger-word occurrences    : {len(generated_occurrences)}")

    print(f"Ledger-word types          : {len(generated_freq)}")

    print(f"Repeated ledger-word types : {sum(1 for c in generated_freq.values() if c > 1)}")

    print(f"Max ledger-word repeat    : {max(generated_freq.values()) if generated_freq else 0}")

    print()

    print("Top 25 word tokens")

    print()

    for word, count in freq.most_common(25):

        print(f"{word:20} {count}")

    print()

    print("Top 30 character bigrams")

    print()

    for bg, count in char_ngrams(all_words, 2).most_common(30):

        print(f"{bg:5} {count}")

    print()

    print("Top 30 character trigrams")

    print()

    for tg, count in char_ngrams(all_words, 3).most_common(30):

        print(f"{tg:5} {count}")

    print()

    print("Top repeated ledger-generated words")

    print()

    shown = 0

    for word, count in generated_freq.most_common():

        if count <= 1:

            break

        print(f"{word:20} {count}")

        shown += 1

        if shown >= 25:

            break

def main():

    parser = argparse.ArgumentParser(

        description="Minimal phrase + common word generator using the existing Scribe 1 ledger for new words."

    )

    parser.add_argument("--ledger", default="Ledger_scribe1.json")

    parser.add_argument("--pages", type=int, default=DEFAULT_PAGES)

    parser.add_argument("--words-per-page", type=int, default=DEFAULT_WORDS_PER_PAGE)

    parser.add_argument("--words-per-line", type=int, default=DEFAULT_WORDS_PER_LINE)

    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)

    parser.add_argument("--print-all-pages", action="store_true")

    parser.add_argument("--output", default=None, help="Optional text output file.")

    args = parser.parse_args()

    rng = random.Random(args.seed)

    ledger = Ledger(args.ledger)

    lines = []

    all_words = []

    # Capture print output if --output is requested.

    # Simpler than redirecting stdout externally.

    def emit(text=""):

        print(text)

        if args.output is not None:

            lines.append(text)

    emit("Common words:")

    emit(", ".join(COMMON_WORDS))

    emit()

    emit("Common combinations:")

    for combo in COMMON_COMBINATIONS:

        emit(" ".join(combo))

    for page_num in range(1, args.pages + 1):

        page = generate_page(rng, ledger, args.words_per_page)

        all_words.extend(page)

        if args.print_all_pages or page_num <= 3 or page_num == args.pages:

            emit()

            emit("=" * 60)

            emit(f"PAGE {page_num}")

            emit("=" * 60)

            emit()

            for i in range(0, len(page), args.words_per_line):

                emit(" ".join(page[i:i + args.words_per_line]))

    # Analysis prints to stdout only; it is not captured in the --output file.

    analyze(all_words)

    if args.output is not None:

        Path(args.output).write_text("\n".join(lines) + "\n", encoding="utf-8")

if __name__ == "__main__":

    main()

Disclaimer: I had Codex create and modify this code from a prompt.
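
For readers without the ledger file at hand, here is a toy sketch of the shape the loader above appears to read, data["ledger"][glyph][column][tier], together with the same first-pair / middle / last-pair check that validate() uses. All glyph lists in TOY_LEDGER are invented for illustration and are not taken from the real file. The gallows and short-token checks are left out to keep it short.

```python
# Toy ledger: every follower list here is made up, not real ledger data.
TOY_LEDGER = {
    "metadata": {
        "columns": ["prefix", "midfix", "suffix"],
        "tiers": ["80", "18", "2"],
    },
    "ledger": {
        "d": {"prefix": {"80": ["a"]}, "midfix": {"80": ["a"]}},
        "a": {"midfix": {"80": ["i"], "18": ["r"]}, "suffix": {"80": ["n", "r"]}},
        "i": {"midfix": {"80": ["i"]}, "suffix": {"80": ["n"], "2": ["r"]}},
        "n": {},
        "r": {},
    },
}


def legal(data, left, right, column):
    """True if `right` is listed under any tier of ledger[left][column]."""
    row = data["ledger"].get(left, {}).get(column, {})
    return any(right in row.get(tier, []) for tier in data["metadata"]["tiers"])


def validate(data, token):
    """First pair uses 'prefix', last pair uses 'suffix', the rest 'midfix'."""
    if len(token) < 2 or token[0] not in data["ledger"]:
        return False
    if not legal(data, token[0], token[1], "prefix"):
        return False
    for i in range(2, len(token) - 1):
        if not legal(data, token[i - 1], token[i], "midfix"):
            return False
    return legal(data, token[-2], token[-1], "suffix")


def follower_weights(data, glyph, column):
    """Split each tier's share (80/18/2) evenly among the glyphs in that tier."""
    tiers = data["metadata"]["tiers"]
    total = sum(float(t) for t in tiers)
    row = data["ledger"].get(glyph, {}).get(column, {})
    weights = {}
    for tier in tiers:
        glyphs = row.get(tier, [])
        for g in glyphs:
            weights[g] = weights.get(g, 0.0) + (float(tier) / total) / len(glyphs)
    return weights


for word in ["daiin", "dair", "daiid", "adin"]:
    print(word, validate(TOY_LEDGER, word))
print(follower_weights(TOY_LEDGER, "a", "suffix"))
```

With this toy table "daiin" and "dair" pass while "daiid" and "adin" fail, because the last pair or the first pair is not listed in the matching column.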

(26-05-2026, 11:28 PM)Dunsel Wrote: PRECISELY... which is what my ledger does. Weighted letters.

I was thinking more along the lines of an extension to the “random” step. For a quick suggestion, I asked an AI for the first time. The basic idea seems viable, but the accuracy of the weighting would of course need to be verified.
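
One way the weighting could be checked, as a rough sketch: count which characters actually occur in initial, medial and terminal position in a word list and compare the shares with the hand-set weights. Here the word list is just the 25 common words from this thread, which is far too small to verify anything. A real check would need a full transliteration. The function names are made up for this example.

```python
from collections import Counter

WORDS = [
    "daiin", "dain", "ol", "or", "chol",
    "chedy", "qokain", "qokeedy", "qotedy", "otedy",
    "shey", "cthey", "cthol", "shol", "chor",
    "dair", "qokair", "saiin", "aiin", "okain",
    "sary", "okol", "qol", "qokal", "chdy",
]


def positional_counts(words):
    """Count characters by position: first, inner, and last."""
    initial, medial, terminal = Counter(), Counter(), Counter()
    for word in words:
        if len(word) < 2:
            continue  # a single character has no separate start and end
        initial[word[0]] += 1
        terminal[word[-1]] += 1
        medial.update(word[1:-1])
    return initial, medial, terminal


def to_shares(counter):
    """Convert raw counts to fractions that sum to 1."""
    total = sum(counter.values())
    return {ch: n / total for ch, n in counter.items()} if total else {}


initial, medial, terminal = positional_counts(WORDS)
for name, counter in (("INITIAL", initial), ("MEDIAL", medial), ("TERMINAL", terminal)):
    shares = to_shares(counter)
    top = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(name, " ".join(f"{ch}:{share:.2f}" for ch, share in top))
```

Normalising the integer weights in INITIAL_CHARS, MEDIAL_CHARS and TERMINAL_CHARS the same way would give numbers that can be compared side by side with these shares.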

Code:
# oshfdk_literal_generator_modified.py
# No external files required.
#
# Literal version of:
# 1) create 20-30 common words
# 2) create 10-20 common combinations
# 3) mix common words + common combinations + random longer words
# 4) random longer words occasionally repeat
#
# No mutation.
# No ledger.
# No syllables.
# No hidden morphology.
#
# FIX for Problem:
# Long words are no longer generated by flat random character sampling.
# Instead, positional character pools (INITIAL / MEDIAL / TERMINAL) with
# weights give each word a plausible frame:
#  INITIAL  – q, d, s, c dominate word-starts
#  MEDIAL   – o, e, a, i are the backbone of word bodies
#  TERMINAL – n, y, r, l, m are the most common word-endings

import random

from collections import Counter

SEED          = 42

PAGES          = 100

WORDS_PER_PAGE = 95

WORDS_PER_LINE = 8

# ---------------------------------------------------------------------------

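# EVA-ish alphabet, sorted alphabetically. Reference only: the generator samples

# from the positional pools below (TERMINAL_CHARS also uses "m", not listed here).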

# ---------------------------------------------------------------------------

VOYNICH_ALPHABET = "acdefhiklnoqrstxy"

LONG_WORD_MIN_LEN      = 5

LONG_WORD_MAX_LEN      = 12

LONG_WORD_REPEAT_CHANCE = 0.25

RECENT_LONG_WORD_WINDOW = 10

# ---------------------------------------------------------------------------

# Common words and combinations (unchanged)

# ---------------------------------------------------------------------------

COMMON_WORDS = [

    "daiin", "dain", "ol", "or", "chol",

    "chedy", "qokain", "qokeedy", "qotedy", "otedy",

    "shey", "cthey", "cthol", "shol", "chor",

    "dair", "qokair", "saiin", "aiin", "okain",

    "sary", "okol", "qol", "qokal", "chdy",

]

COMMON_COMBINATIONS = [

    ["qokain", "daiin"],

    ["chol",  "chedy"],

    ["qokeedy","qokedy"],

    ["ol",    "chedy"],

    ["dain",  "chol"],

    ["shey",  "qokain"],

    ["cthey",  "daiin"],

    ["shol",  "chor"],

    ["qotedy", "qokain"],

    ["otedy",  "ol"],

    ["saiin",  "daiin"],

    ["qokair", "dair"],

    ["chol",  "daiin"],

    ["cthol",  "chedy"],

    ["or",    "chol"],

]

# ---------------------------------------------------------------------------

# Weights are intentionally skewed to reflect the strong positional biases

# observed in Voynich script:

#  INITIAL  – q, d, s, c dominate word-starts

#  MEDIAL  – o, e, a, i are the backbone of word bodies

#  TERMINAL – n, y, r, l, m are the most common word-endings

# ---------------------------------------------------------------------------

INITIAL_CHARS = {

    "q": 8, "d": 7, "s": 6, "c": 6,

    "o": 4, "a": 3, "f": 2, "r": 1,

    "k": 1, "t": 1,

}

MEDIAL_CHARS = {

    "o": 9, "e": 7, "a": 6, "i": 6,

    "l": 4, "n": 3, "k": 3, "h": 3,

    "r": 2, "t": 2, "d": 1, "s": 1,

}

TERMINAL_CHARS = {

    "n": 8, "y": 8, "r": 6, "l": 5,

    "m": 4, "s": 3, "d": 3,

}

def weighted_choice(weight_dict: dict) -> str:

    """Return a single character sampled according to integer weights."""

    chars  = list(weight_dict.keys())

    weights = list(weight_dict.values())

    return random.choices(chars, weights=weights, k=1)[0]

def make_random_long_word() -> str:

    """

    Build a long word using positional character pools.

    Structure:

      [INITIAL char]  +  [MEDIAL body]  +  [TERMINAL char]

    """

    length = random.randint(LONG_WORD_MIN_LEN, LONG_WORD_MAX_LEN)

    # --- Word-initial character (positional pool) ---

    word = weighted_choice(INITIAL_CHARS)

    # --- Medial body (positional pool) ---

    for _ in range(length - 2):

        word += weighted_choice(MEDIAL_CHARS)

    # --- Word-terminal character (positional pool) ---

    word += weighted_choice(TERMINAL_CHARS)

    return word

# ---------------------------------------------------------------------------

# Page generation (unchanged logic, fixed long-word generator plugged in)

# ---------------------------------------------------------------------------

def generate_page() -> list[str]:

    page            = []

    recent_long_words = []

    while len(page) < WORDS_PER_PAGE:

        mode = random.choice(["common_word", "combination", "long_word"])

        if mode == "common_word":

            page.append(random.choice(COMMON_WORDS))

        elif mode == "combination":

            combo = random.choice(COMMON_COMBINATIONS)

            page.extend(combo)

        else:  # long_word

            if recent_long_words and random.random() < LONG_WORD_REPEAT_CHANCE:

                word = random.choice(recent_long_words)

            else:

                word = make_random_long_word()

                recent_long_words.append(word)

                if len(recent_long_words) > RECENT_LONG_WORD_WINDOW:

                    recent_long_words.pop(0)

            page.append(word)

    return page[:WORDS_PER_PAGE]

# ---------------------------------------------------------------------------

# Analysis helpers (unchanged)

# ---------------------------------------------------------------------------

def char_ngrams(words: list[str], n: int) -> Counter:

    counts = Counter()

    for word in words:

        if len(word) < n:

            continue

        for i in range(len(word) - n + 1):

            counts[word[i:i + n]] += 1

    return counts

def analyze(all_words: list[str]) -> None:

    token_count      = len(all_words)

    type_count      = len(set(all_words))

    word_freq        = Counter(all_words)

    char_bigram_counts  = char_ngrams(all_words, 2)

    char_trigram_counts = char_ngrams(all_words, 3)

    print()

    print("=" * 60)

    print("GLOBAL WORD STATISTICS")

    print("=" * 60)

    print()

    print(f"Tokens : {token_count}")

    print(f"Types  : {type_count}")

    print(f"TTR    : {type_count / token_count:.4f}")

    print(f"Hapax  : {sum(1 for c in word_freq.values() if c == 1)}")

    print()

    print("Top 25 word tokens")

    print()

    for word, count in word_freq.most_common(25):

        print(f"{word:15} {count}")

    print()

    print("=" * 60)

    print("CHARACTER BIGRAMS")

    print("=" * 60)

    print()

    for bg, count in char_bigram_counts.most_common(30):

        print(f"{bg:5} {count}")

    print()

    print("=" * 60)

    print("CHARACTER TRIGRAMS")

    print("=" * 60)

    print()

    for tg, count in char_trigram_counts.most_common(30):

        print(f"{tg:5} {count}")

def print_page(page_num: int, page: list[str]) -> None:

    print()

    print("=" * 60)

    print(f"PAGE {page_num}")

    print("=" * 60)

    print()

    for i in range(0, len(page), WORDS_PER_LINE):

        print(" ".join(page[i:i + WORDS_PER_LINE]))

# ---------------------------------------------------------------------------

# Entry point

# ---------------------------------------------------------------------------

def main() -> None:

    random.seed(SEED)

    all_words = []

    for page_num in range(1, PAGES + 1):

        page = generate_page()

        all_words.extend(page)

        print_page(page_num, page)

    analyze(all_words)

if __name__ == "__main__":

    main()

Sorry, I was a little slow in posting this. - Edit: Where exactly can I find “Ledger_Scribe1.json”?
