A family of grammars for Voynichese
The Voynich Ninja (https://www.voynich.ninja), forum "Analysis of the text", thread /thread-4418.html
RE: A family of grammars for Voynichese - magnesium - 27-08-2025

(27-08-2025, 07:52 AM)ReneZ Wrote:
Just to avoid misunderstanding, I do not think that the creator(s) of the Voynich MS used a slot approach to generate their vocabulary. I strictly see it as a potential model to (hopefully) better understand the composition of the words.

Totally understand, and to that end I completely agree that the slot approach is an incredibly powerful model. You could think of something like the Naibbe approach, or even the Matlach steganographic cipher approach, as downstream ways of exploring how slot grammars can be reconciled with typical natural language.

RE: A family of grammars for Voynichese - Mauro - 27-08-2025

(27-08-2025, 08:37 AM)oshfdk Wrote:
I wonder why ch and Sh are absent from Pharma and Cosmological.

Uh... it's surely an error on my part: I did not generate the tables automatically but copied them by hand. I'll check and correct it, thanks for noticing.

RE: A family of grammars for Voynichese - Mauro - 27-08-2025

Revised and corrected: ch and sh go together with the other 'consonants' in Pharma and Cosmological.

RE: A family of grammars for Voynichese - ReneZ - 27-08-2025

Just an editorial suggestion: if you use capitalised Eva, the result will look even better:
"Sh" instead of "sh"
"cKh" instead of "ckh"

RE: A family of grammars for Voynichese - DG97EEB - 24-12-2025

If you push ChatGPT and Gemini to describe the grammar, they both independently come up with a slot grammar. It is unclear whether this idea is already extant in their (possibly shared) training data or whether they are genuinely coming up with it. The conclusion in both cases is the same: that the Voynich text cannot be translated but only decoded, as it is fundamentally procedural and not, for example, compressed Latin which could be uncompressed. I have run dozens of different statistical tests, both in and out of the chat models, and frankly it feels plausible, but I wouldn't put my name behind it just yet.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 24-12-2025

(24-12-2025, 05:04 PM)DG97EEB Wrote:
If you push ChatGPT and Gemini to describe the grammar, they both independently come up with a slot grammar. It is unclear whether this idea is already extant in their (possibly shared) training data or whether they are genuinely coming up with it. The conclusion in both cases is the same: that the Voynich text cannot be translated but only decoded, as it is fundamentally procedural and not, for example, compressed Latin which could be uncompressed. I have run dozens of different statistical tests, both in and out of the chat models, and frankly it feels plausible, but I wouldn't put my name behind it just yet.

Several slot models for Voynichese "words" have been published in the last 50 years, so those LLMs may be just picking those up. And not showing due credit, of course.

The "conclusion" is reasonable only if you assume that the language is "European" and that each word, in its "standard" spelling, is written as one Voynichese "word". But there are hundreds of languages, spoken by billions of people, where each word is just one syllable long; and syllables do have a rigid slot structure, like that seen in Voynichese "words". Even if the language is "European", maybe the Author chose to write each syllable as a separate Voynichese "word", with double space between original words.
Like "ha be mus pa pam et ma mam". Then the same observation above would apply. Unfortunately all available VMS transcription have only one code for "word space", irrespective of its length... All the best, --stolfi RE: A family of grammars for Voynichese - nablator - 24-12-2025 (24-12-2025, 05:31 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view.Like "ha be mus pa pam et ma mam". Then the same observation above would apply. Latin syllables are more structured than most people realize, there are only about a thousand different Latin syllables in large texts. 10 bits of information => 10 slots would be more than enough for a code, because repetitions may be allowed of the some slot, which gives more information per slot, and maybe a few extra slots would be needed if the structure is preserved, in 3 parts (before the vowel, the vowel, after the vowel). (24-12-2025, 05:31 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view.Unfortunately all available VMS transcription have only one code for "word space", irrespective of its length... I have one that is much more detailed but I need to finish it and check it. I was thinking lately that I could use combinations of existing IVTFF separators "." and "," instead of specific characters that would break parsers: , ,, . ., .. ... I don't believe that spaces are significant anymore, so I lost interest years ago and didn't finish the transliteration. RE: A family of grammars for Voynichese - Jorge_Stolfi - 24-12-2025 (24-12-2025, 06:22 PM)nablator Wrote: You are not allowed to view links. Register or Login to view.I don't believe that spaces are significant anymore Well, labels have basically the same structure as text "words", so that is a bit of evidence that "words" are indeed words. Also, AFAIK all the "word" statistics (like entropy per "word", length of "words" at line start and end) are consistent with them being indeed words of the language. So I am sticking with that hypothesis for now... All the best, --stolfi RE: A family of grammars for Voynichese - ReneZ - 25-12-2025 (24-12-2025, 08:50 PM)Jorge_Stolfi Wrote: You are not allowed to view links. Register or Login to view.Well, labels have basically the same structure as text "words", so that is a bit of evidence that "words" are indeed words. I would agree that this implies that they are 'units' of some kind, but not necessarily words. That would not clash with your Asian theory, I think. As you know, in your grammar numerous 'abnormal words' are in fact concatenations of two words. We cannot yet decide if this is due to: - mistake (space left out) - undefined spelling (no fixed rules) - composite word - something else. I don't see it as a problem. It happens in many regular languages. RE: A family of grammars for Voynichese - Jorge_Stolfi - 26-12-2025 REVISITING MY OLD WORD PARADIGM SUMMARY This is a brushed-up version of the word structure model (aka word paradigm, word grammar) that I proposed to the old mailing list around ~2000 and is You are not allowed to view links. Register or Login to view.. Here I also report the results of parsing the main text of the VMS with this paradigm. See the end of this note for details on the transcription file I used. I see that there have been several other proposed paradigms since that time. I did not have the time to analyze them or compare them with this one, sorry. 
In particular, I have measured only the coverage (how many VMS tokens fit this paradigm) and not the specificity (how many of the words allowed by this paradigm actually occur in the VMS).

In the following, the VMS sections are named "hea" and "heb" for Herbal-A and Herbal-B, "bio" and "zod" obvious, "cos" for Cosmo, "pha" for Pharma, "str" for Starred Parags, and "unk" for pages of unknown nature. The latter include the bottom half of f116r, as well as f1r, f86v6, and f86v5.

For this note I am considering only running text in paragraphs -- excluding labels, titles, radial labels, and text rings. Tokens adjacent to dubious spaces ',' are counted as fractions, as I explained elsewhere.

My paradigm can be described as a stack of filters, where each filter takes the words that passed the previous filters, rejects some, and parses the others in some way.

FILTER LEVEL 0: VALID EVA CHARACTERS

This filter says that a word is valid if and only if its EVA encoding, after mapping to lowercase and removing all fluff (ligature braces, inline comments, parag markers like "<%>" and "<$>"), uses only the letters

    a c d e f h i k l m n o p q r s t y

This filter rejects as invalid any word that contains weirdo codes ("&NNN;" in Rene's notation), the unreadable-glyph symbol "?", any non-alpha character, or any of the EVA letters

    b g j u v x w z

(My scripts currently accept w and z too, because I have been using them to denote the hooked-arm versions of p and f, respectively. I now strongly suspect that the hook is a meaningless flourish, and thus that effort was wasted. Anyway, for this note consider w and z forbidden too.)

Here are the basic statistics for this level, per section and total. "all" is the total (fractional) count of tokens in the input, "gud" are the tokens that pass this level, "bad" are those which are rejected.

          all           gud           bad   % gud  sec-type
------------- ------------- ------------- ------- ----------
  6210.000000   6028.000000    182.000000   97.07  bio-parags
  1008.500000    974.500000     34.000000   96.63  cos-parags
  7722.500000   7442.500000    280.000000   96.37  hea-parags
  3360.000000   3230.000000    130.000000   96.13  heb-parags
  2291.500000   2181.500000    110.000000   95.20  pha-parags
 10613.500000  10318.250000    295.250000   97.22  str-parags
  3001.500000   2892.500000    109.000000   96.37  unk-parags
 34207.500000  33067.250000   1140.250000   96.67  tot-parags

So at least 96% of the words in the main sections ("str", "bio", "hea", "heb") use only the "valid" EVA letters above. The vast majority of the "bad" words have "?" or weirdo codes. The largest counts of rejected words with neither of those -- i.e. rejected because of rare glyphs -- are 51 in "hea" and 42 in "str".

FILTER LEVEL 1 - ELEMENTS

The next filter parses the valid EVA strings into the /elements/ of the word paradigm:

{q} {o} {a} {y} {d} {r} {l} {s}
{ch} {che} {sh} {she} {ee} {eee}
{k} {ke} {t} {te} {p} {pe} {f} {fe}
{ckh} {ckhh} {ckhe} {ckhhe} {cth} {cthh} {cthe} {cthhe}
{cph} {cphh} {cphe} {cphhe} {cfh} {cfhh} {cfhe} {cfhhe}
{n} {in} {iin} {iiin} {m} {im} {iim} {iiim} {ir} {iir} {iiir}

This definition of element is a bit ambiguous, since 'cheeee', for instance, could be parsed as {che}{eee} or as {ch}{ee}{ee}. I chose to break these ambiguous cases by excluding the e from the first element, so 'cheeee' parses as {ch}{ee}{ee}.
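As a concrete illustration, levels 0 and 1 can be implemented in a few lines of gawk. The following is a minimal sketch, NOT Stolfi's actual GAWK scripts (offered at the end of the post): the element list and the disambiguation rule are taken from the text above, while the script layout, names, and I/O format are made up for illustration.

#!/usr/bin/gawk -f
# Sketch of filter levels 0 and 1 (hypothetical, not the original).
# Input: one lowercase EVA word per line.
# Output: the word followed by its element parse, or by REJECT0
# (invalid EVA letters) or REJECT1 (no parse into elements).

BEGIN {
    nelem = split("q o a y d r l s ch che sh she ee eee " \
                  "k ke t te p pe f fe " \
                  "ckh ckhh ckhe ckhhe cth cthh cthe cthhe " \
                  "cph cphh cphe cphhe cfh cfhh cfhe cfhhe " \
                  "n in iin iiin m im iim iiim ir iir iiir", elem, " ")
    # Sort the elements shortest-first: trying shorter prefixes first,
    # with backtracking, keeps the 'e' out of the first element in the
    # ambiguous cases, so 'cheeee' parses as {ch}{ee}{ee}.
    for (i = 1; i < nelem; i++)
        for (j = i + 1; j <= nelem; j++)
            if (length(elem[j]) < length(elem[i])) {
                t = elem[i]; elem[i] = elem[j]; elem[j] = t
            }
}

# Backtracking parse of w into elements; returns the braced parse,
# or "" if w cannot be split into valid elements.
function parse(w,    i, rest) {
    if (w == "") return " "     # single blank marks success at the end
    for (i = 1; i <= nelem; i++)
        if (index(w, elem[i]) == 1) {
            rest = parse(substr(w, length(elem[i]) + 1))
            if (rest != "")
                return "{" elem[i] "}" (rest == " " ? "" : rest)
        }
    return ""
}

{
    # Level 0: only the valid EVA letters.
    if ($0 !~ /^[acdefhiklmnopqrsty]+$/) { print $0, "REJECT0"; next }
    # Level 1: parse into elements.
    p = parse($0)
    if (p == "") print $0, "REJECT1"; else print $0, p
}

For instance, 'cheeee' comes out as {ch}{ee}{ee}, and 'ith' is rejected at level 1, as described above.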
Here are the results of this filter, where "all" are all the tokens that passed level 0:

          all           gud           bad   % gud  sec-type
------------- ------------- ------------- ------- ----------
  6028.000000   5980.500000     47.500000   99.21  bio-parags
   974.500000    942.500000     32.000000   96.72  cos-parags
  7442.500000   7321.000000    121.500000   98.37  hea-parags
  3230.000000   3176.000000     54.000000   98.33  heb-parags
  2181.500000   2120.500000     61.000000   97.20  pha-parags
 10318.250000  10142.250000    176.000000   98.29  str-parags
  2892.500000   2856.500000     36.000000   98.76  unk-parags
 33067.250000  32539.250000    528.000000   98.40  tot-parags

So at least 98% of all words in the main sections that have only the "valid" glyphs can be parsed into valid elements of the model. Only about a dozen words are rejected because of clusters like 'ith', 'ikh', etc.; I thought it was not worth including those combinations among the valid elements. Maybe they should be turned into 'cth' etc. for the statistics.

FILTER LEVEL 2 - THE OKOKO MODEL

This filter applies to the words that consist only of valid elements, parsed and marked with braces as above. We tag the elements {o} {a} {y} as "O" and all the others as "K", and then try to parse the resulting string as a sequence of zero or more "K" with at most one "O" after each "K", plus an optional "O" prefix:

    O? K O? K O? ... K O?

where the "?" means that the "O" may be present or not.

          all           gud           bad   % gud  sec-type
------------- ------------- ------------- ------- ----------
  5980.500000   5968.500000     12.000000   99.80  bio-parags
   942.500000    922.500000     20.000000   97.88  cos-parags
  7321.000000   7174.250000    146.750000   98.00  hea-parags
  3176.000000   3143.500000     32.500000   98.98  heb-parags
  2120.500000   2067.500000     53.000000   97.50  pha-parags
 10142.250000   9994.750000    147.500000   98.55  str-parags
  2856.500000   2820.000000     36.500000   98.72  unk-parags
 32539.250000  32091.000000    448.250000   98.62  tot-parags

Thus, in the main sections, at least 98% of all words that consist of valid elements also fit this "OKOKO" model. Many of the rejected words are rejected because of two or more "O" elements in a row. If we allow up to two "O" in each slot, the acceptance becomes almost total:

          all           gud           bad   % gud  sec-type
------------- ------------- ------------- ------- ----------
  5980.500000   5980.500000      0.000000  100.00  bio-parags
   942.500000    941.500000      1.000000   99.89  cos-parags
  7321.000000   7319.000000      2.000000   99.97  hea-parags
  3176.000000   3176.000000      0.000000  100.00  heb-parags
  2120.500000   2120.500000      0.000000  100.00  pha-parags
 10142.250000  10141.250000      1.000000   99.99  str-parags
  2856.500000   2856.500000      0.000000  100.00  unk-parags
 32539.250000  32535.250000      4.000000   99.99  tot-parags

FILTER LEVEL 3 - LAYER MODEL

This level considers the strings that passed the OKOKO criterion, parsed into elements and marked with braces as per level 1. All "O" elements are ignored, and the "K" elements are tagged instead with these specific classes:

"Q" = { @q }
"D" = { @d, @l, @r, @s } (the "dealers")
"X" = { @ch, @sh, @ee } (the "benches"), with optional @e suffix
"H" = all gallows, with optional platform and @e suffix
"N" = { @n, @m } after zero or more @i, or @r after one or more @i

The resulting string of classes is then parsed to fit the pattern

    Q^q D^d X^x H^h X^y D^e N^n

where q, h, n may be 0 or 1, and d+e and x+y may be 0 to 3. Note that many potential sequences are invalid: e.g. words with two gallows, with four dealers or four benches, or with a "D" between two "X" or between an "X" and a gallows, etc. And of course "Q" can only occur at the beginning, and "N" at the end.
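Before the results, here is a matching gawk sketch of the level-2 and level-3 checks (again hypothetical, not the original scripts; the classes and patterns are from the text above). It reads words already parsed into braced elements, as produced by the level-1 sketch earlier.

#!/usr/bin/gawk -f
# Sketch of filter levels 2 and 3 (hypothetical, not the original).
# Input: one parsed word per line, in the braced form produced at
# level 1, e.g. {q}{o}{k}{ee}{d}{y}.
# Output: the word tagged with the outcome of both tests.

# Map one element to its class letter. O is listed here too,
# although the layer test below ignores it.
function class(e) {
    if (e ~ /^[oay]$/)                 return "O"
    if (e == "q")                      return "Q"   # Q = { @q }
    if (e ~ /^[dlrs]$/)                return "D"   # dealers
    if (e ~ /^(ch|sh|ee)e?$/)          return "X"   # benches (+ optional e)
    if (e ~ /^[ktpf]e?$/)              return "H"   # plain gallows
    if (e ~ /^c[ktpf]hh?e?$/)          return "H"   # platformed gallows
    if (e ~ /^i*[nm]$/ || e ~ /^i+r$/) return "N"   # finals
    return "?"
}

{
    # Turn "{q}{o}{k}..." into the class string "QOH...".
    cls = ""; s = $0
    while (match(s, /\{[a-z]+\}/)) {
        cls = cls class(substr(s, RSTART + 1, RLENGTH - 2))
        s = substr(s, RSTART + RLENGTH)
    }

    # Level 2, OKOKO: optional O prefix, then K's with at most one O
    # after each K (Q, D, X, H, N all count as K here).
    okoko = (gensub(/[QDXHN]/, "K", "g", cls) ~ /^O?(KO?)*$/)

    # Level 3, layer model: ignore the O's and match
    #   Q^q D^d X^x H^h X^y D^e N^n
    # with q,h,n at most 1, d+e at most 3, x+y at most 3.
    t = cls; gsub(/O/, "", t)
    nd = gsub(/D/, "D", t)      # gsub() returns the match count
    nx = gsub(/X/, "X", t)
    layer = (t ~ /^Q?D*X*H?X*D*N?$/ && nd <= 3 && nx <= 3)

    print $0, (okoko ? "okoko:ok" : "okoko:FAIL"), \
              (layer ? "layer:ok" : "layer:FAIL")
}

The regular expression alone would allow up to three dealers and three benches on each side independently; the nd and nx counts enforce the combined limits d+e <= 3 and x+y <= 3. On this sketch, {q}{o}{k}{o}{l}{che}{d}{y} passes the OKOKO test but fails the layer test, matching the QHD*XD row in the table of rejects below.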
Here are the results of this filter:

          all           gud           bad   % gud  sec-type
------------- ------------- ------------- ------- ----------
  5968.500000   5830.500000    138.000000   97.69  bio-parags
   922.500000    894.750000     27.750000   96.99  cos-parags
  7174.250000   6945.250000    229.000000   96.81  hea-parags
  3143.500000   3026.500000    117.000000   96.28  heb-parags
  2067.500000   2004.750000     62.750000   96.96  pha-parags
  9994.750000   9565.250000    429.500000   95.70  str-parags
  2820.000000   2707.875000    112.125000   96.02  unk-parags
 32091.000000  30974.875000   1116.125000   96.52  tot-parags

Thus at least 96% of the words that fit the OKOKO model also fit this layer model. As noted before, many rejected words seem to be pairs of more or less common words run together. Here is a sample of the rejected words (the "*" marks the point(s) where parsing failed):

pattern  ! word
---------+----------------------------
HD*X     | {p}{o}{l}*{sh}{y}
HD*XD    | {p}{o}{l}*{che}{d}{y}
XD*HN    | {che}{o}{l}*{k}{a}{in}
XD*X     | {che}{d}*{che}{y}
QHD*XD   | {q}{o}{k}{o}{l}*{che}{d}{y}
HD*XD    | {t}{o}{l}*{che}{d}{y}
XD*HD    | {che}{d}{y}*{k}{a}{r}
HDD*XD   | {p}{d}{a}{l}*{sh}{o}{r}
H*HN     | {p}{o}*{k}{a}{in}
XD*H*X   | {ch}{o}{l}*{t}*{eee}{y}
XD*XD    | {che}{o}{l}*{ch}{d}{y}
HD*X     | {o}{te}{d}*{ee}{y}
HD*XDD   | {p}{o}{l}*{sh}{d}{a}{l}
HD*XDD   | {f}{s}*{che}{d}{a}{l}
HX*HXD   | {t}{she}{o}*{k}{ee}{d}{y}

The words chedy, kain, shor, etc. are fairly common on their own.

CONCLUSION

In summary, of the 34207.5 "parags" tokens in the transcription file, 30974.875 passed through all the filters. The overall coverage rate is thus 30974.875/34207.5 = 90.5%. If we consider only the tokens that passed level 0 (valid EVA characters), the coverage is 30974.875/33067.25 = 93.7%.

TRANSCRIPTION AND SECTIONS

The statistics above used a VMS transcription which is 80% new readings of mine from the BL 2014 scans (code ";U"), with the remaining 20% taken from Rene's IVT file (code ";Z"). A large part of my readings were compared with Rene's, and the discrepancies were double-checked against the images. I intend to release this transcription to be included in Rene's files, but it currently uses a somewhat incompatible format that I found more convenient while building it. Anyway, most differences are of the a/o or r/s kind, on ambiguous glyphs. And I also make more liberal use of ',' for uncertain word spaces.

If anyone is interested, I can provide this transcription and/or the GAWK scripts that implement this word model. But you may want to use your own favorite transcription and implement the filters on your own, to better suit your needs.

All the best, --stolfi
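For completeness, the two hypothetical sketches above chain together in the obvious way (the file names level01.awk and level23.awk are made up; this is not Stolfi's pipeline):

    gawk -f level01.awk words.txt | gawk '$2 !~ /REJECT/ { print $2 }' | gawk -f level23.awk

where words.txt would hold one lowercase EVA word per line. Words rejected at levels 0 or 1 are dropped by the middle filter; the rest emerge tagged with the outcome of the OKOKO and layer tests.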