The Voynich Ninja
Why and how the text could be Bavarian - Printable Version




RE: Why and how the text could be Bavarian - JoJo_Jost - 15-05-2026

What I’d also like to add are the bigrams that span word boundaries—along with their frequencies.

(Note: Since I split “aiin” into “ai” and “in” and treat “in” as a fixed unit—and do the same with “qo”—the bigram ‘nq’ will be represented as “inqo.” This is just for better readability!)

What is really interesting, but of course fits the patterns shown above: just 30 of these bigrams account for 80 percent of all bigrams that span a space.

If you look at the distribution, it could be highly dangerous to ignore these bigrams; after all, there are over 32,000 (!) of them. In my opinion, ignoring them:

1. could explain a great many of the failed attempts to decrypt the VMS.
2. could have massively influenced, or even distorted, many statistics.

But, of course, I don’t know if they are really relevant. In any case, I will continue working with this level.

The list of bigrams.


   


RE: Why and how the text could be Bavarian - JoJo_Jost - 21-05-2026

A possible inner structure of the VMS - vowels across the spaces?

I think I have finally found a solution to my old problem with the cores, one that could fit the 15th century.

Mandatory disclaimer: This is not yet a solution and not a reading. It is a structural hypothesis. But for me (!) it suddenly explains a lot of things in the VMS that until now seemed separate and strange.

One basic assumption stays: The visible word boundaries of the VMS are not real word boundaries.

Thesis: The visible spaces cut right through a vowel or linking layer.

A number of things follow from this.

1. Two different layers: inside and across the space

I distinguish two kinds of bigrams.

Inner bigrams: glyph pairs within a single visible VMS token.

Example:

chedy
-> (ch) e+d (y)

Cross bigrams: glyph pairs across visible spaces.

Example:

chedy qokeedy
-> y+qo

The interesting point is: several of the oddities of the VMS fit exactly into this split. Line Initial Markers (LIM), Line End Markers (LEM), short tokens and certain long tokens behave as if there really were an inner core layer and an outer joint or linking layer.

My current definition of this structure is:

Inner bigrams = consonant / cluster layer
Cross bigrams = vowel layer
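
The split into inner and cross bigrams can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the atom inventory in `ATOM_RE` (aiin, the bench gallows, ch, sh, qo as units) and the function names are hypothetical, and the sketch counts every adjacent pair inside a token as inner, including pairs that involve the first or last atom. Whether edge atoms should be excluded, as the brackets in the chedy example may suggest, is left open.

```python
import re

# Assumed atom inventory: multi-character units first, then any single glyph.
ATOM_RE = re.compile(r"aiin|cth|ckh|cph|cfh|ch|sh|qo|.")

def atoms(token):
    """Split one visible token into atoms (longer units are tried first)."""
    return ATOM_RE.findall(token)

def inner_bigrams(line):
    """Adjacent atom pairs inside each visible token."""
    pairs = []
    for token in line.split():
        a = atoms(token)
        pairs.extend(zip(a, a[1:]))
    return pairs

def cross_bigrams(line):
    """Last atom of a token paired with the first atom of the next token."""
    split = [atoms(t) for t in line.split()]
    return [(left[-1], right[0]) for left, right in zip(split, split[1:])]

line = "chedy qokeedy"
print(inner_bigrams(line))
print(cross_bigrams(line))  # [('y', 'qo')]
```

Feeding whole transliterated lines through both functions and counting the pairs with `collections.Counter` would give the two frequency lists discussed here.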

2. The actual shift

In this theory the real sound or word structure does not look like this:

Classical reading:
Token | Token | Token

but rather like this:
consonant core + vowel bigram + consonant core ...

And the visible VMS cuts right through these vowel bigrams (!).

A visible token in the VMS would then be roughly something like:
half a vowel code + consonant core + half a vowel code

Or shown schematically (simplified representation):

V C C V

where VV is always a vowel bigram, CC is always a consonant bigram, and a single V is one half of a vowel bigram.

But the two V's do not simply belong to the token itself. Together with the neighbouring tokens they form the actual vowel bigrams.

For two tokens in a normal VMS line:
V C C V  V C C V

the real vowel lies across the visible boundary:
V CC V|V CC V

So: the vowel bigram sits between the original tokens, on top of the space, and the space cuts this vowel bigram in half.
That would be the surprising trick.

Take:

sheedy qokeedy

Roughly broken down:

sheedy = sh | e ed | y
qokeedy = qo | ke ed | y

The transition between the two tokens is:

y|qo

(Note: e, ee and the other "VMS vowels" would then of course no longer be vowels, but only glyph combinations that represent consonants.)

That would be the vowel or linking code in my theory.

So the real reading would not stop at sheedy and start again at qokeedy, but run across the visible boundary:

... ed + y|qo + ke ...

That is then not simply word end plus new word, but:

core + vowel + core

This way you suddenly get a plausible consonant / vowel structure in the VMS (simplified representation):

CC VV CC VV CC VV

Or, if you think of it as an actual sound sequence:

core - vowel - core - vowel - core
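
As a rough sketch of this re-reading, the following Python snippet takes a line of visible tokens and returns the alternating core and joint pieces. The atom inventory and the function name `reread` are assumptions for illustration. The sketch simply treats the first and last atom of each token as vowel halves and everything in between as core; the unpaired halves at the start and end of the line are dropped, and tokens with fewer than three atoms contribute no core.

```python
import re

# Assumed atom inventory: multi-character units first, then any single glyph.
ATOM_RE = re.compile(r"aiin|cth|ckh|cph|cfh|ch|sh|qo|.")

def reread(line):
    """Return alternating core / joint pieces for one line of visible tokens."""
    split = [ATOM_RE.findall(t) for t in line.split()]
    pieces = []
    for i, a in enumerate(split):
        # Atoms strictly between the first and last atom form the core.
        core = a[1:-1]
        if core:
            pieces.append(("core", "".join(core)))
        if i + 1 < len(split):
            # Joint: right half of this token + left half of the next one.
            pieces.append(("joint", a[-1] + "|" + split[i + 1][0]))
    return pieces

print(reread("sheedy qokeedy"))
# [('core', 'eed'), ('joint', 'y|qo'), ('core', 'keed')]
```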

3. Why this is attractive for language

The problem with many polyphonic substitution approaches to the VMS, at least for me, was: the cores get too small, the frequencies do not match vowels and consonants, and you do not get a good language structure in terms of vowels and consonants. That is exactly what happened with my older attempts.

In this new variant something different happens.

The actual words are not separated at the visible spaces, but shifted in between the consonant cores and the vowel bigrams.

This produces a much more natural structure:

C V C
C V C
C V C

Interestingly, this short way of writing fits reduced MHG / Bavarian quite well.

Many Bavarian forms are short and basically built as cluster-vowel-cluster:

Haus
Haut
Wein
Bein
Leim
neun

Not always exactly, of course. But as a basic pattern it is strong. And since Bavarian is an almost monosyllabic language, it fits this structure well. But of course, longer words can also be encrypted this way.

Overall the VMS problem becomes less severe, because the "real" vowels are just in the wrong place: not in the visible word, but across the space.

4. What happens to the cores then?

In this model the visible token cores would not be complete words, but consonant or cluster parts.

A token then contains roughly:

the second part of a vowel bigram from the previous transition,
a final consonant or cluster part,
an initial consonant or cluster part,
the first part of the next vowel bigram.

So roughly:

V C C V

The real word or syllable boundary then does not lie at the visible space, but in the consonant region of the token.

[...] VC | CVVC | CVVC


This turns the picture of word boundaries completely upside down.
The visible VMS words are then not words, but wrappers:

half a vowel + consonant material + half a vowel.

This also explains why so many VMS tokens look so strangely similar. They do not have to be normal words. They can be small, recurring consonant and vowel frames.

5. A small core inventory

If this idea is right, there should be a small list of frequent inner bigrams.

And that is exactly what you see.

In my current tokenisation, where I treat ch, sh, qo etc. as units (single glyphs) and where the aiin family is initially set aside as a protected special block, the coverage in Currier B is roughly:

Top 20 inner bigrams: about 66 %
Top 30 inner bigrams: about 75 %
Top 50 inner bigrams: about 85 %

So one can speak of a small, reusable core inventory.
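
A coverage figure of this kind is simply the cumulative share of the N most frequent bigrams. A minimal sketch, with a hypothetical function name and toy counts that are invented for illustration and are not VMS data:

```python
from collections import Counter

def top_n_coverage(counts, n):
    """Share of all occurrences covered by the n most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total

# Toy counts, invented for illustration only (not VMS data).
toy = Counter({"e+d": 50, "e+e": 30, "k+e": 10, "ch+e": 6, "e+o": 4})
for n in (1, 2, 3):
    print(n, round(top_n_coverage(toy, n), 2))
```

The same function works for the cross bigrams if the counter is filled with pairs that span a space.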

Frequent inner bigrams are for example:

e+d
e+e
k+e
ch+e
e+o
k+a
t+e
o+d
d+a
t+a

I am not yet saying:

e+d = n

That would be too early.

But structurally it looks as if these inner bigrams belong to the consonant or cluster layer. So to the parts that, in the real reading, stand before and after the vowel bigrams.

And here you can see why the two ideas belong together.

If the visible tokens in the middle are not real words, but consonant material around a shifted vowel layer, then many repetitions suddenly become less absurd. They are then not necessarily word repetitions, but recurring building blocks of a shifted writing system.

6. Short tokens: filler for special cases

A big problem, for me and for others, was always these very short 1- and 2-glyph tokens:

y
s
or
ar
ol
dy
etc., everybody knows them.

If the visible spaces were real word boundaries, these would all have to be tiny words. Some maybe. But the quantity is strange.

In my joint model they get a clear function.

Because a real language does not consist only of perfect C-V-C chains. There are words that begin with a vowel, and words that end in a vowel.

Example:
"eine andere" (English: "another", literally "an other")

"eine" ends in a vowel.
"andere" begins with a vowel.

If the cross bigrams are the vowel or linking layer, a problem arises here: two vowel values meet (simplified representation).

VV CC VV VV CC VV

A simple vowel transition is not enough for this (except for diphthongs, which are probably included in the cross bigrams).
This is exactly where the short tokens could help.

Many 2-atom tokens look like little joint pieces:

left vowel half + right vowel half
no consonant core

So not necessarily short words, but filler for special cases in the vowel stream.

I checked this against the neighbourhoods:

last atom of the predecessor + first atom of the 2-atom token
last atom of the 2-atom token + first atom of the successor

Many frequent 2-atom tokens have cross-typical couplings on both sides.

Examples:

ol
ar
or
al
dy
chy
qol
cthy


This fits the idea well: these tokens often carry no core, but connect two vowel or linking positions. 
But not all short tokens are like this. Some are asymmetric. In such forms often only one side is cross-typical, while the other already shifts into the core layer. That is not a problem. On the contrary: it shows that these short tokens can take on different technical roles:

cross-cross
cross-core
core-cross

They are thus small, necessary switching pieces in the stream - which, if you encipher a normal language, would logically have to occur.
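
The neighbourhood check for 2-atom tokens could look roughly like the sketch below. Two things are assumptions made for the illustration and are not taken from the post: the atom inventory, and the rule for "cross-typical", which here means that a pair is seen more often across spaces than inside tokens in the same sample. The function name is hypothetical.

```python
import re
from collections import Counter

# Assumed atom inventory: multi-character units first, then any single glyph.
ATOM_RE = re.compile(r"aiin|cth|ckh|cph|cfh|ch|sh|qo|.")

def short_token_roles(lines):
    """Count, per 2-atom token, how its left and right sides couple to the neighbours."""
    split_lines = [[ATOM_RE.findall(t) for t in line.split()] for line in lines]
    inner, cross = Counter(), Counter()
    for toks in split_lines:
        for a in toks:
            inner.update(zip(a, a[1:]))
        cross.update((left[-1], right[0]) for left, right in zip(toks, toks[1:]))

    def side(pair):
        # Assumed rule: cross-typical = more frequent across spaces than inside tokens.
        return "cross" if cross[pair] > inner[pair] else "core"

    roles = Counter()
    for toks in split_lines:
        # Only tokens with both a predecessor and a successor on the line are checked.
        for prev, cur, nxt in zip(toks, toks[1:], toks[2:]):
            if len(cur) == 2:
                left = side((prev[-1], cur[0]))
                right = side((cur[-1], nxt[0]))
                roles[("".join(cur), left + "-" + right)] += 1
    return roles

# Toy lines, invented for illustration only.
print(short_token_roles(["chedy ol qokeedy", "shedy ol qokain"]))
```

On a real transliteration the interesting output is the split of each frequent short token over the role labels cross-cross, cross-core and core-cross.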

7. Vowel-initial words and the LIM (Line Initial Markers)

Another problem:

What happens at the start of a line? If the vowels are normally written as cross bigrams across boundaries, then at the start of the line the predecessor is missing. This is especially important for words that begin with a vowel, for example in German:

anfangen (begin)
aufhören (stop)
essen (eat)
oben (above)
etc.

If the first plaintext word of the line begins with a vowel, the system needs an artificial left vowel half at the start of the line. This is exactly where the LIM could come in. Even more interesting: the LIM seem to have two roles.

Some tend to form typical inner bigrams with the following atom:

o
qo
d
p
ch

These would be cases where the line starts directly with a consonant core. These signs would then not simply be part of the normal text, but start signs for a consonantal beginning. That is why they do not have to behave like normal beginnings.

Other LIM, in particular:

y
s

tend to form cross bigrams with the following atom, with y of course being extremely productive. These would be cases where the line begins with a vowel or linking value.

And here it gets interesting: If I look only at normal text lines in the Herbal running text, that is, no label lines, then about 20.3 % of the lines begin with these possible vocalic LIM y or s.

In my MHG comparison texts, around 20-21 % of the lines begin with a vowel. This close match is of course partly coincidence. Different texts have different values. But the order of magnitude fits remarkably well.

If y / s really are the vocalic start classes, then their frequency in the Herbal text looks roughly like what you would expect from genuinely vowel-initial lines in MHG / Bavarian.
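
The two percentages being compared are plain line-initial shares. A minimal sketch of how such a comparison could be computed, with invented toy lines and hypothetical function names; the vowel test is deliberately naive (a, e, i, o, u only) and does not handle spellings such as v for u or umlauts, which a real MHG count would need:

```python
import re

# Assumed atom inventory, so that a line starting with "sh" is not counted as "s".
ATOM_RE = re.compile(r"aiin|cth|ckh|cph|cfh|ch|sh|qo|.")

def share(lines, is_hit):
    """Fraction of non-empty lines for which is_hit(line) is true."""
    kept = [ln.strip() for ln in lines if ln.strip()]
    return sum(is_hit(ln) for ln in kept) / len(kept) if kept else 0.0

def starts_with_y_or_s(line):
    """True if the first atom of a transliterated line is y or s."""
    return ATOM_RE.match(line).group() in {"y", "s"}

def starts_with_vowel(line):
    """Naive vowel test for a plaintext line (a, e, i, o, u only)."""
    return line[0].lower() in "aeiou"

# Toy lines, invented for illustration only.
vms_toy = ["ydaiin shol", "pchedy qokal", "sor chey", "tchol dain", "okeey dal"]
mhg_toy = ["oben enger den vnden", "wer den harm", "das lautter sey",
           "ein weiss glas", "der nem"]
print(share(vms_toy, starts_with_y_or_s), share(mhg_toy, starts_with_vowel))
```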

Then the LIM would not be decoration. They would be start operators.

More precisely: y and s could be start forms of the vowel or linking layer. y in particular behaves almost completely cross-typically at the start of a line. s is somewhat more mixed, but also not normally inner-typical.

This could even be a first anchor for the underlying polyphony of the vowel bigrams: at the start of a line, y and s might show the base class, while the same vowel or linking values are polyphonically disguised inside the line by other left halves.

8. Line ends and the m

If this holds at the start, it should mirror at the end.

At the start of the line the predecessor is missing.
At the end of the line the successor is missing.

If a vowel or linking code normally runs across a joint, then the line cannot simply break off. The stream has to be closed.
This is where the LEM come in, the Line End Markers.

Particularly interesting here is of course the m, or the am / om phenomenon (as a bigram).

m is, as we all know, not particularly dominant in the normal text flow, but strongly overrepresented at the end of a line. In one test, m at the end of a line was about 15 times more frequent than at normal token ends.
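
An enrichment factor like this can be computed as the share of line-final tokens ending in m divided by the share of all other tokens ending in m. The sketch below shows one way to do that; how the original test was set up is not stated in the post, so the definition, the function name and the toy lines are assumptions.

```python
def final_m_ratio(lines):
    """Ratio of the m-final share at line end to the m-final share at other token ends."""
    end_hits = end_total = mid_hits = mid_total = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        end_total += 1
        end_hits += tokens[-1].endswith("m")
        for tok in tokens[:-1]:
            mid_total += 1
            mid_hits += tok.endswith("m")
    if not end_total or not mid_total or not mid_hits:
        return None  # ratio is undefined on this sample
    return (end_hits / end_total) / (mid_hits / mid_total)

# Toy lines, invented for illustration only.
toy_lines = ["daiin chol dam", "qokeedy shedy otam", "dam chol daiin", "okal chedy dal"]
print(final_m_ratio(toy_lines))
```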

In this model m / am / om would not be a normal sound value, but a kind of closing operator, mostly as its own closing formula with a preceding vowel or linking part.

Xm = closing sign for the open stream at the end of the line

Then we have a nice symmetry:

LIM = start operator
LEM / m = closing operator

This would also explain the meaning, or rather the necessity, of these particular LAAFU (line as a functional unit) effects.

9. The problems: AIIN remains the hard special case

The biggest open knot, for me, is still the aiin family.

My explanation, but only as an idea:

AIIN = internal ending / nasal / closing block, as opposed to the am end-closing block (the theory that am is just a differently written aiin already exists).

Maybe an -en / -n ending class. That would fit MHG / Bavarian well. But unfortunately it is not certain yet.

What I can say fairly confidently: aiin does not behave like normal core material - what a surprise.

10. Is all of this plausible for the 15th century?

The clever thing about this structure is that it has nothing to do with complicated mathematics.

It is a bigram notation with a small but highly effective layout shift.

You take vowel values, write them as bigrams across visible boundaries, and leave the consonant or cluster parts standing in the visible tokens. Already in the 15th century people knew that vowels are revealing - and that is why they were disguised.

But the other means fit the period too:

Spaces were not always set consistently, even in normal manuscripts.
In ciphers, spaces could be omitted or shifted.
Polyphonic encipherment is historically attested.
Tables of character values are historically attested.
Multi-character or bigram values would not be obviously anachronistic.

And the brilliant part, as I wrote before, would be:

The signs used immediately look Latin-medieval to a reader.

y recalls the 9-shaped abbreviation sign.
qo looks like a familiar medieval ligature or abbreviation form.

Everybody thinks: Latin, abbreviations, words.

Conclusion

But if these signs are in truth vowel halves across the spaces, then you are looking in exactly the wrong place. The eye is additionally led away from the real content.
Something that, judging by the many proposed solutions, would have worked perfectly to this day. ;)
If this approach is right, that is.

Maybe this view is wrong. But once you look at the VMS through this lens, a lot of it suddenly becomes remarkably logical.

So now I'll put on a helmet too (someone else here in the forum wrote that after publishing his theory, and I found it funny) and wait to see what happens.

Yours, Jost


RE: Why and how the text could be Bavarian - JoJo_Jost - 21-05-2026

Example: 

To show what I mean, here is a real sentence from one of my MHG comparison texts (a uroscopy / medical text):
 
  "Wer den harm recht schawen wil, der nem ein weiss glas vas,
  das lautter sey, vnd oben enger den vnden."
 
  (Translation: Whoever wants to examine the urine properly should take a
  white glass vessel that is clear, and narrower at the top
  than at the bottom.)

For this example I assume that every consonant is also written as a bigram. In practice that would not work, because individual consonants are not bigrams, so this is still the simplified form, merely extended to consonants. In other words, a token = V CC CC V.
(sch / ch are treated as a single bigram.)

Wer       CC VV CC
den       CC VV CC
harm      CC VV CC CC
recht     CC VV CC CC
schawen   CC VV CC VV CC
wil       CC VV CC
der       CC VV CC
nem       CC VV CC
ein       VV CC
weiss     CC VV CC
glas      CC CC VV CC
vas       CC VV CC
das       CC VV CC
lautter   CC VV CC VV
sey       CC VV
v(u)nd    VV CC CC
oben      VV CC VV CC
enger     VV CC VV
den       CC VV CC
v(u)nden  VV CC CC VV CC

Stream:
CC VV CC  CC VV CC CC VV CC CC CC VV CC CC CC VV CC VV CC CC VV CC CC VV CC
 CC VV CC VV CC CC VV CC CC CC VV CC CC VV CC CC VV CC CC VV CC VV CC VV VV CC CC VV CC VV CC VV CC VV CC VV CC VV CC CC VV CC
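
Read mechanically, the table and the stream can be reproduced with a short Python sketch. The per-word patterns are copied from the table as given; the last step, which cuts every VV unit in half and turns each cut into a visible space, is an assumption that follows the thesis of the earlier post. The names `TABLE`, `stream` and `visible_tokens` are hypothetical.

```python
# Per-word patterns copied from the table (word, then its CC/VV units).
TABLE = """
Wer CC VV CC
den CC VV CC
harm CC VV CC CC
recht CC VV CC CC
schawen CC VV CC VV CC
wil CC VV CC
der CC VV CC
nem CC VV CC
ein VV CC
weiss CC VV CC
glas CC CC VV CC
vas CC VV CC
das CC VV CC
lautter CC VV CC VV
sey CC VV
v(u)nd VV CC CC
oben VV CC VV CC
enger VV CC VV
den CC VV CC
v(u)nden VV CC CC VV CC
"""

def stream(table):
    """Concatenate the units of all words into one flat list."""
    units = []
    for row in table.strip().splitlines():
        _word, pattern = row.split(maxsplit=1)
        units.extend(pattern.split())
    return units

def visible_tokens(units):
    """Cut every VV unit in half; each cut becomes a visible space."""
    text = "".join("V|V" if u == "VV" else u for u in units)
    return text.split("|")

tokens = visible_tokens(stream(TABLE))
print(" ".join(tokens))
print([len(t) for t in tokens])
```

Because every unit is two glyphs wide in this simplified form, the resulting token lengths are even everywhere except at the two ends of the stream.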


RE: Why and how the text could be Bavarian - Grove - 21-05-2026

Wouldn’t your stream look more like this?

CCV VCCCCV VCCCCV VCCCCCCV VCCCCCCV VCCV VCCCCV VCCCCV VCCCCV VCCV VCCCCV VCCCCCCV VCCCCV VCCCCV VCCCCV VCCV VCCV VV VCCCCV VCCV VCCV VCCV VCCV VCCV VCCCCV VCC

Basically VMS word lengths of:
3,6,6,8,8,4,6,6,6,4,6,8,6,6,6,4,4,2,6,4,4,4,4,4,6,3

Not sure how that compares overall to VMS. Where are the other odd lengths like 5 and what do you do with the odd single characters?


RE: Why and how the text could be Bavarian - JoJo_Jost - 21-05-2026

Yes, that's correct, but I just wanted to show what I'm actually doing in terms of structure, and the post was already quite long. ;)

As I wrote above, this sequence assumes that every consonant is a bigram. But gallows and certain other glyphs (o/a) are probably not bigrams, but single letters, and then the 2, 4, 6, etc. division immediately becomes invalid.

And exactly, there are also 1-glyph tokens, 2-glyph tokens, etc. These are necessary because not all words are structured as CCVVCC; there are also words with VVCCVV, etc. When two vowels meet across a word boundary, it gets tricky. Or when consonants (which is possible in German) form very long clusters, you need these 1/2-glyph switches, because otherwise the rhythm is broken. That would explain their unusually frequent appearance very well!

And of course, the overall distribution is still difficult, including the very wide spread of the VMS. But I’m making progress there—though you can only make progress if you base it on a language. I don’t know if it’s Bavarian.... that’s still just a guess. Some small details actually fit Latin slightly better, but overall, Middle High German fits quite well.

But, as I said, that’s not the point right now. I’m trying to illustrate this inverse perspective on the VMS to make it clear that the VMS might simply need a change of perspective—if it hasn’t been solved yet. And this is one such perspective; it might not be the right one, but it shows that there are other ways of looking at the world. 


And I think that’s crazy enough... 8-)


RE: Why and how the text could be Bavarian - JoJo_Jost - 21-05-2026

A quick addendum:

If you incorporate these rules (and two others I haven't mentioned yet) and run them on the MHG corpus, you get a distribution of word lengths that comes very close to the VMS. On average, it's actually very close; it's just that the curve looks a little different: more 4- and 5-glyph tokens, fewer 1-glyph tokens. I'm currently trying to figure out why...

(But it's all still very rough, so take it with a grain of salt.)


RE: Why and how the text could be Bavarian - Grove - 21-05-2026

I’ve often wondered about tables with column row pairs, but wasn’t sure how to handle odd lengths. I like the idea of ignoring the spaces with a rule of some kind like this. I had wondered about row or table shifts in a cipher but this could be a fun shift in thinking. There would have to be some clarity to when a character is standalone and when it requires a pairing.

I still struggle with P and F being characters due to their prominence in first lines of paragraphs only and particularly heavy amount of first character of a paragraph occurrences. To me they have to have a specific purpose that is only required in first lines. I can’t think of any linguistic reason for certain characters to be restricted in use.


RE: Why and how the text could be Bavarian - JoJo_Jost - 22-05-2026

(21-05-2026, 05:27 PM) Grove wrote: There would have to be some clarity to when a character is standalone and when it requires a pairing.

You can easily test this in this system.

If the first and last glyphs are part of a cross-vowel bigram, you can use a three-glyph token (VCV) to check whether the glyph in the middle is likely to be a standalone glyph or not.

The test checks: Of all the internal occurrences of an atom, how many can stand alone in this VCV form.

Important: I treat “aiin” as a single unit here, since it is highly unlikely that “aiin” is part of a bigram (though I’m not entirely sure about that either—but then again, what is ever certain in VMS?).

The assumption is that Gallows and o/a are independent. For Gallows, because they already look like ligatures (in part), and for o/a, because they often stand alone within tokens (Chol).
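
One way to read the test described above: for each atom, count its token-internal occurrences (neither first nor last atom of a token) and check how many of them are the lone middle atom of a three-atom token. The sketch below implements that reading; the atom inventory, the function name and the toy tokens are assumptions, and the original test may differ in detail.

```python
import re
from collections import Counter

# Assumed atom inventory, with aiin kept as a single unit.
ATOM_RE = re.compile(r"aiin|cth|ckh|cph|cfh|ch|sh|qo|.")

def standalone_share(tokens):
    """Per atom: share of its internal occurrences that stand alone in a 3-atom token."""
    internal, alone = Counter(), Counter()
    for token in tokens:
        a = ATOM_RE.findall(token)
        internal.update(a[1:-1])  # atoms that are neither first nor last
        if len(a) == 3:
            alone[a[1]] += 1      # lone middle atom of a V-C-V shaped token
    return {atom: alone[atom] / n for atom, n in internal.items()}

# Toy tokens, invented for illustration only.
print(standalone_share(["oky", "okal", "chedy", "dal"]))
```

A high share suggests the atom can act as a complete single core; a low share suggests it mostly occurs bound to a neighbour.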

   

Results:

High single-core candidates:
ckh, cth, a, o, cph, k, t
It’s interesting that the bench-gallows—which are almost obviously ligatures—stand out so clearly.

With only 41 hits, cfh is statistically insignificant; p falls somewhere in the middle and is unlikely to be a single core, which is somewhat surprising.

Low / more likely to be bound components:
d, e, sh, f, ch, y


Regarding P: P is actually widely distributed throughout the text. In this system, P’s primary function is simply to indicate that a line begins with a consonant. However, I also believe that it serves other functions, particularly because lines beginning with P are, on average, significantly longer than others.

Important: There is a valid debate as to whether P and F are not two variants of the same glyph. At the very least, the context surrounding these glyphs is very similar.

-----
But this test also reveals something else: namely, that we can’t simply rely on a strict bigram count; glyphs clearly appear both as bigrams and individually. So the system is more complex, but it also explains why the VMS is so broad.

If I had described this complexity (which is probably easy to break down) from the start, the first post would have been very, very much longer. ;)

But more on that later.


RE: Why and how the text could be Bavarian - Grove - 22-05-2026

“…glyphs clearly appear both as bigrams and individually. So the system is more complex”

As with many discoveries, there has to be a clear repeatable rule to signify when one glyph means one thing and when it means another. There doesn’t seem to be any obvious mechanism to choose single glyph or bigram.

It’s also hard for me to not see a+various counts of i +r or n to be a unit, but this is the VMS and we all see things that might be there or are forced into existence by our tiny brains.


RE: Why and how the text could be Bavarian - JoJo_Jost - 22-05-2026

How true, how true ;)

In reality, however, it was unfortunately not uncommon in the 15th century for individual glyphs in polyphonic ciphers to be subject to substitution, and for other polyphonic lists...