The Voynich Ninja
A family of grammars for Voynichese - Printable Version


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

Finally, here are the counts for crust-mantle-core patterns with a core (gallows).  Here I make a distinction between simple gallows and those with platforms.

Again, the CMC pattern of  a word is obtained by deleting all "O" elements {a} {o} {y},
and mapping the other elements to the classes 
  • "Q" just {q}.
  • "D" the /dealers/ {d} {l} {r} {s}.
  • "X" the /benches/ {ch} {sh} {ee} with an optional 'e' suffix.
  • "G" the /simple gallows/ {k} {t} {p} {f} with optional 'e' suffix.
  • "H" the /platform gallows/ {cth} {ckh} {cph} {cfh} with optional 'e' or 'h' suffix.
  • "N" the /codas/ {n} {in} {iin} {iiin} {m} {im} {iim} {iiim} {ir} {iir} {iiir}.

In this modified CMC model, a valid word with core ("G" or "H") must have the form

   Q^q D^d X^x (G|H) X^y D^e N^n

where q,n are 0 or 1, and d+e and x+y are in 0..3. 
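
As a rough illustration of the mapping just described, here is a minimal Python sketch that turns an EVA word into its CMC pattern and checks it against the formula above. The tokenizer is an assumption, not the code behind the counts: it reads elements left to right, takes an 'e' suffix only when no further 'e' follows (so that "ee" is read as a bench of its own), and accepts {iiir} as a coda. The names `cmc_pattern` and `is_valid_core_pattern` are made up for this example.

```python
import re

# Element classes as listed in the post. ASSUMPTIONS (hypothetical, not from
# the post): left-to-right matching, an 'e' suffix is taken only if it is not
# followed by another 'e', and {iiir} counts as a coda.
ELEMENT_RE = re.compile(r"""
    (?P<H>c[ktpf]h(?:e(?!e)|h)?)     # platform gallows, optional e or h suffix
  | (?P<X>(?:ch|sh|ee)(?:e(?!e))?)   # benches, optional e suffix
  | (?P<G>[ktpf](?:e(?!e))?)         # simple gallows, optional e suffix
  | (?P<N>i{0,3}[nm]|i{1,3}r)        # codas
  | (?P<Q>q)
  | (?P<D>[dlrs])                    # dealers
  | (?P<O>[aoy])                     # deleted when forming the CMC pattern
""", re.VERBOSE)


def cmc_pattern(word):
    """Return the CMC pattern of an EVA word, or None if it does not parse."""
    classes = []
    pos = 0
    while pos < len(word):
        m = ELEMENT_RE.match(word, pos)
        if m is None:
            return None              # some character is not part of any element
        if m.lastgroup != "O":
            classes.append(m.lastgroup)
        pos = m.end()
    return "".join(classes)


# Q^q D^d X^x (G|H) X^y D^e N^n with q,n in 0..1, d+e <= 3, x+y <= 3
CORE_RE = re.compile(r"Q?(D*)(X*)[GH](X*)(D*)N?")


def is_valid_core_pattern(pattern):
    """True if the pattern has a core and satisfies the formula above."""
    m = CORE_RE.fullmatch(pattern)
    if m is None:
        return False
    d, x, y, e = (len(g) for g in m.groups())
    return d + e <= 3 and x + y <= 3
```

Under these assumptions `cmc_pattern("qokeedy")` gives `"QGXD"` and `cmc_pattern("chckhy")` gives `"XH"`, while a word with an isolated 'e' such as `qoedy` does not parse at all.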

Here are the counts of the patterns with G core (left) and with H core (right):

COUNTING TOKENS WITH G AND H CORE BY CMC PATTERN


  14060.375000 1.00000 TOTAL        1487.250000 1.00000 TOTAL

   2008.125000 0.14282 GD            394.750000 0.26542 XH   
   1538.750000 0.10944 GXD           375.250000 0.25231 HD   
   1314.500000 0.09349 QGD           315.000000 0.21180 H     
   1175.375000 0.08359 GN             76.750000 0.05161 XHD   
   1025.000000 0.07290 GX             58.500000 0.03933 QH   
    858.500000 0.06106 QGXD           55.000000 0.03698 HN   
    813.875000 0.05788 QGN            41.000000 0.02757 QHD   
    687.000000 0.04886 QGX            33.000000 0.02219 HX   
    483.000000 0.03435 G              31.750000 0.02135 HDD   
    417.625000 0.02970 XG             21.000000 0.01412 DXH   
    410.750000 0.02921 QG             11.250000 0.00756 HDN   
    274.625000 0.01953 XGD            10.000000 0.00672 DH   
    255.500000 0.01817 GDD             8.000000 0.00538 HXD   
    244.750000 0.01741 DGD             6.500000 0.00437 XXH   
    240.875000 0.01713 DGXD            6.000000 0.00403 DHD   
    234.625000 0.01669 DGN             5.500000 0.00370 HDDD 
    201.000000 0.01430 XGX             5.000000 0.00336 XHN   
    191.500000 0.01362 XGN             4.500000 0.00303 XHX   
    189.000000 0.01344 GDN             4.000000 0.00269 QHX   
    174.125000 0.01238 DGX             3.250000 0.00219 HDDN 
    170.125000 0.01210 GXDD            3.000000 0.00202 QXH   
    111.750000 0.00795 GXDN            2.000000 0.00134 DDH   
    102.687500 0.00730 DG              2.000000 0.00134 QDH   
    101.625000 0.00723 XGXD            2.000000 0.00134 QHDD 
     75.000000 0.00533 XXG             2.000000 0.00134 QHN   
     69.000000 0.00491 GXX             2.000000 0.00134 QHXD 
     67.250000 0.00478 QGDD            1.500000 0.00101 DDXH 
     62.000000 0.00441 GXN             1.000000 0.00067 DXHD 
     39.500000 0.00281 GXXD            1.000000 0.00067 DXXH 
     34.500000 0.00245 QGDN            1.000000 0.00067 DXXHD 
     33.000000 0.00235 QGXDD           1.000000 0.00067 HXDD 
     27.875000 0.00198 GDDN            1.000000 0.00067 HXN   
     18.500000 0.00132 XGDD            1.000000 0.00067 XXHD 
     18.375000 0.00131 DDGXD           0.250000 0.00017 XHDD 
     16.875000 0.00120 DDGD            0.250000 0.00017 XHDDD 
     16.875000 0.00120 DXG             0.250000 0.00017 XHDN 
     16.750000 0.00119 QGXDN
     16.500000 0.00117 QGXN
     16.500000 0.00117 QGXX
     16.500000 0.00117 XXGX
     16.125000 0.00115 GDDD
     15.500000 0.00110 DDGN
     14.125000 0.00100 DGDD
     13.500000 0.00096 GXDDD
     13.000000 0.00092 QDGXD
     12.250000 0.00087 DDGX
     12.000000 0.00085 XGDN
     11.750000 0.00084 DGXX
     10.750000 0.00076 XXGD
      9.750000 0.00069 XXGN
      9.500000 0.00068 QGXXD
      9.250000 0.00066 DGDN
      9.125000 0.00065 DGXDD
      8.562500 0.00061 DDG
      8.500000 0.00060 QDGX
      8.000000 0.00057 QDGN
      7.500000 0.00053 DGXDN
      7.500000 0.00053 XGXX
      7.250000 0.00052 QDG
      7.000000 0.00050 XGXN
      6.750000 0.00048 DXGX
      6.500000 0.00046 DGXXD
      4.750000 0.00034 GXDDN
      4.625000 0.00033 DXGD
      4.500000 0.00032 DXGXD
      4.125000 0.00029 QDGD
      4.000000 0.00028 QGDDD
      3.750000 0.00027 DXGN
      3.000000 0.00021 DGXN
      3.000000 0.00021 DXXG
      3.000000 0.00021 XGXDD
      2.500000 0.00018 DXXGD
      2.000000 0.00014 GXXN
      2.000000 0.00014 QGDDN
      2.000000 0.00014 QGXXN
      1.625000 0.00012 DGDDN
      1.500000 0.00011 DDXXG
      1.500000 0.00011 QDDGXD
      1.500000 0.00011 QXG
      1.500000 0.00011 XXGXD
      1.250000 0.00009 XGXDN
      1.000000 0.00007 DXXGX
      1.000000 0.00007 GXXDN
      1.000000 0.00007 QGXDDD
      1.000000 0.00007 QXGN
      1.000000 0.00007 QXGXD
      1.000000 0.00007 QXXG
      1.000000 0.00007 XGDDN
      1.000000 0.00007 XGXDDN
      1.000000 0.00007 XGXXD
      0.500000 0.00004 DDDGN
      0.500000 0.00004 DDXGX
      0.500000 0.00004 GXXDD
      0.500000 0.00004 GXXDDD
      0.500000 0.00004 GXXX
      0.500000 0.00004 GXXXD
      0.500000 0.00004 QGXXDD
      0.500000 0.00004 XGDDD
      0.500000 0.00004 XXGXN
      0.250000 0.00002 DGXXDD
      0.250000 0.00002 GDDDN
      0.250000 0.00002 XXGDN
      0.125000 0.00001 DDGXDN

I don't know yet what to conclude from these numbers.  

For either class of core, the formula above allows 2 x 10 x 10 x 2 = 400 possible patterns (2 choices each for q and n, and 10 choices each for the pairs (d,e) and (x,y) with sum at most 3), but only 103 "G" patterns occur in the parags text, and only 36 "H" patterns, even with fractional counting.  Obviously there are some combinations of q,d,x,y,e,n that are so rare that they could be excluded from the CMC model; but I don't see a simple rule yet.
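
The count of 400 can be checked by brute force. The following Python sketch enumerates every pattern the formula allows for one core class; the function name `core_patterns` is made up for this illustration.

```python
from itertools import product


def core_patterns(core):
    """All patterns Q^q D^d X^x core X^y D^e N^n with d+e <= 3 and x+y <= 3."""
    patterns = set()
    for q, d, x, y, e, n in product(range(2), range(4), range(4),
                                    range(4), range(4), range(2)):
        if d + e <= 3 and x + y <= 3:
            patterns.add("Q" * q + "D" * d + "X" * x + core
                         + "X" * y + "D" * e + "N" * n)
    return patterns


g_patterns = core_patterns("G")
h_patterns = core_patterns("H")
print(len(g_patterns), len(h_patterns))
```

Comparing such a set with the patterns that actually occur would list the allowed but unobserved combinations, which may help in looking for a simple exclusion rule.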

One notable thing is that words with H core are not only ~1/10 as common as those with G core, but the distribution of H patterns decays significantly faster.  In particular, there is a large drop from the first three patterns to the fourth one in the above list.

The most common "G" pattern with three benches "X" is XXGX, which occurs 16.5 times (~0.1% of all tokens with "G" core). There are no "H" patterns with three benches.

That may be just a consequence of "H" patterns being less common, but it is also consistent with the theory that an "H" element should be counted as one bench for the rule x+y <= 3.

All the best, --stolfi


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

I think I can narrow the core-mantle-crust a bit more.  Recall that the model is 

  Q^q D^d X^x G^g H^h X^y D^e N^n

where q,g,h,n are 0 or 1,  d+e is at most 3, x+y is at most 3, and g+h is at most 1. (The last condition says simply that there may be at most one gallows, simple or platform.)  The range of d+e seems to be indeed 0..3:


COUNTING CMC PATTERNS BY NUMBER OF  D


  31300.250000 1.00000 TOTAL

   9705.125000 0.31007 0
  16944.000000 0.54134 1
   4153.562500 0.13270 2
    497.562500 0.01590 3

However, the range of q+d+e+n also seems to be 0..3.  That is, the rule should be q+d+e+n <= 3, not just d+e <= 3:

COUNTING CMC PATTERNS BY NUMBER OF  QDN

  31300.250000 1.00000 TOTAL

   4937.125000 0.15773 0
  15160.625000 0.48436 1
  10097.625000 0.32261 2
   1071.812500 0.03424 3
     33.062500 0.00106 4

And, moreover, the condition on the number of benches should be x+h+y <= 2, not just x+y <= 3.  That is, there could be at most two benches, counting a platform gallows as one bench:

COUNTING CMC PATTERNS BY NUMBER OF  XH

  31300.250000 1.00000 TOTAL

  15443.250000 0.49339 0
  13730.125000 0.43866 1
   2080.875000 0.06648 2
     46.000000 0.00147 3

I am adding these revised rules to the model.

All the best, --stolfi


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

In conclusion, the revised crust-mantle-core (CMC) model says that a word that fits the OKOKO model is valid if, after deleting the "O"s, it has the form

    Q^q D^d X^x G^g H^h X^y D^e N^n
   
where 
  • q,g,h,n may be 0 or 1;
  • g+h at most 1 (there can be at most one gallows per word);
  • q+d+e+n <= 3 (there can be at most three of Q, D, and N);
  • x+h+y <= 2 (there can be at most two benches, counting a platform gallows as one bench).
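
The revised rules can be written as a small checker on CMC pattern strings. This is only a sketch in Python, with a made-up function name `fits_revised_cmc`; it assumes the "O"s have already been deleted and the remaining elements already mapped to the letters Q, D, X, G, H, N.

```python
import re

# Q^q D^d X^x G^g H^h X^y D^e N^n, with the counts checked separately below.
REVISED_RE = re.compile(r"(Q?)(D*)(X*)(G?)(H?)(X*)(D*)(N?)")


def fits_revised_cmc(pattern):
    """True if a CMC pattern string satisfies the revised rules."""
    m = REVISED_RE.fullmatch(pattern)
    if m is None:
        return False                     # elements out of order
    q, d, x, g, h, y, e, n = (len(s) for s in m.groups())
    return (g + h <= 1                   # at most one gallows
            and q + d + e + n <= 3       # at most three of Q, D, N
            and x + h + y <= 2)          # at most two benches, H counts as one


for p in ["QGXD", "XH", "XXGX", "XXH", "QDGDN"]:
    print(p, fits_revised_cmc(p))
```

Only the sums matter, so it makes no difference how the regular expression splits the D's or X's of a word without a core between the two sides.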

With these tighter rules, the statistics for the CMC level are
 
      all          gud          bad        % gud sec-type
    ------------ ------------ ------------  ----- ----------
     5966.750000  5824.687500   142.062500  97.62 bio-parags
      935.500000   904.750000    30.750000  96.71 cos-parags
     7277.500000  7035.625000   241.875000  96.68 hea-parags
     3157.375000  3034.875000   122.500000  96.12 heb-parags
     2100.750000  2034.625000    66.125000  96.85 pha-parags
    10095.250000  9665.500000   429.750000  95.74 str-parags
     2832.375000  2721.125000   111.250000  96.07 unk-parags

    32365.500000 31221.187500  1144.312500  96.46 tot-parags

That is, more than 96% of the words that fit the OKOKO model fit the CMC pattern above.

We also have that of the 33067.25 words that contain only valid EVA characters (no weirdos, "?", or the rare characters b g j u v w x z), 94.4% satisfy the rest of the model (parsing into elements, OKOKO structure, and CMC structure).  
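
For readers who want to re-derive the two percentages, here is a short Python check using only the numbers in the table and the paragraph above. The variable names are arbitrary.

```python
# (all, gud) token counts per section, copied from the table above.
rows = {
    "bio": (5966.75, 5824.6875),
    "cos": (935.5, 904.75),
    "hea": (7277.5, 7035.625),
    "heb": (3157.375, 3034.875),
    "pha": (2100.75, 2034.625),
    "str": (10095.25, 9665.5),
    "unk": (2832.375, 2721.125),
}

all_total = sum(a for a, _ in rows.values())
gud_total = sum(g for _, g in rows.values())

# Share of OKOKO-conforming tokens that also fit the CMC pattern.
pct_of_okoko = 100 * gud_total / all_total

# Share of all tokens with only valid EVA characters that fit the whole model.
valid_eva_tokens = 33067.25
pct_of_valid_eva = 100 * gud_total / valid_eva_tokens

print(round(pct_of_okoko, 2), round(pct_of_valid_eva, 1))
```

The first ratio uses the OKOKO-conforming tokens as the base and the second uses all tokens with valid EVA characters, which is why the second figure is lower.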

Maybe the revised model is too tight, and some fraction of the 5.6% rejected words are in fact valid. But the number seems compatible with the theory that those 5.6% are indeed errors -- by the Author, by the Scribe, by the Retracers, and by the transcribers.  Especially words run together.

There probably are further rules relating the insertion of the "O"s in the CMC pattern.  For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O".   Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc elements.

All the best, --stolfi


RE: A family of grammars for Voynichese - oshfdk - 29-12-2025

5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.

Sorry, all the numbers above are token counts, not word type counts. 

I presume that the rejection rate for word types would be much higher, since many rejected words occur only once or a few times.  Unless we define the lexicon as being the word types that occur at least N times, where N is, say, 10.
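
To make the token/type distinction concrete, here is a small Python sketch that computes both rejection rates for a lexicon restricted to words occurring at least N times. The word counts and the validity test are toy values invented for the example, not data from the manuscript, and `rejection_rates` is a hypothetical helper.

```python
def rejection_rates(counts, is_valid, min_count=1):
    """Return (token rejection rate, type rejection rate) for the lexicon
    of words that occur at least min_count times."""
    lexicon = {w: c for w, c in counts.items() if c >= min_count}
    bad = {w: c for w, c in lexicon.items() if not is_valid(w)}
    token_rate = sum(bad.values()) / sum(lexicon.values())
    type_rate = len(bad) / len(lexicon)
    return token_rate, type_rate


# TOY counts, made up for illustration only.
toy_counts = {"daiin": 10, "chedy": 8, "qoedy": 1, "ety": 1}
toy_is_valid = lambda w: w not in {"qoedy", "ety"}

print(rejection_rates(toy_counts, toy_is_valid))               # every word
print(rejection_rates(toy_counts, toy_is_valid, min_count=3))  # N = 3
```

In this toy case the rare rejected words weigh much more in the type rate than in the token rate, and they disappear from both once the threshold is raised.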

I will try to do this statistic.  Please stay tuned...

All the best, --stolfi


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements.  

However, there are 70 words (tokens) with those combinations in the whole parags text, out of more than 30'000 tokens.  Is that significant?  Contrast that with 1622 occurrences of {cth} {ckh} {cph} {cfh}.

Maybe I should accept those 'i' variants too (and treat them too as platform gallows, CMC class "H").

On the other hand, there are only 4 occurrences of {ih}; contrast with 9362 of {ch}.

This may be a clue that {ith} {ikh} {iph} {ifh} are instances of {cth} {ckh} {cph} {cfh} that were mangled by the Scribe/Retracer/Transcriber. So maybe we should just "error-correct" them as such.
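
If one does choose to "error-correct" these forms, the substitution itself is simple. The Python sketch below is just one possible normalization, with a made-up function name; whether it should be applied at all is exactly the open question.

```python
import re

# An 'i' directly before a gallows letter plus 'h' is rewritten as 'c',
# e.g. {ith} -> {cth}. This is a hypothetical normalization, not an
# established convention.
I_PLATFORM_RE = re.compile(r"i([ktpf]h)")


def correct_i_platform(word):
    """Rewrite {ith} {ikh} {iph} {ifh} as {cth} {ckh} {cph} {cfh}."""
    return I_PLATFORM_RE.sub(r"c\1", word)


for w in ["aithy", "shoikhy", "chedy"]:
    print(w, "->", correct_i_platform(w))
```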

All the best, --stolfi


RE: A family of grammars for Voynichese - Grove - 29-12-2025

Don’t e’s only precede d’s, o’s, or a’s, and don’t a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from the e series to the i series, although there are cases of o’s followed by an i series without the a transition.

I feel like the e, ee, eee series preceding a d acts in a similar way to the i, ii, iii series that precedes the r, l, n variants.


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens?

Here are the numbers for word types (as opposed to tokens).  For these statistics I defined the lexicon of a text as being the set of words that occur at least 3 times.  The column "vlex" is the number of word types in each subset that occur at least 3 times and use only "valid" EVA chars (excluding '?', weirdos, and [bgjuvwxz]).  The column "vgud" is the number of those words that fit the crust-mantle-core (CMC) model as defined before.  "vbad" is the number that violate that model, and "%gud" is "vgud" as a percentage of "vlex".

LEXICON SIZES WITH VALID CMC STRUCTURE
   
    sec-type     vlex   vgud   vbad    %gud
    ----------  -----  -----  -----  ------
    bio-parags    295    295      0  100.00
    cos-parags     56     56      0  100.00
    hea-parags    407    404      3   99.26
    heb-parags    210    209      1   99.52
    pha-parags    150    149      1   99.33
    str-parags    546    541      5   99.08
    unk-parags    205    205      0  100.00

    tot-parags   1361   1328     33   97.58

That is, 97.58% of the word types that occur at least 3 times and use only valid EVA characters fit the CMC model.

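The "vbad" number for "tot-parags" (33) is bigger than the sum of the section numbers (10) because a word that occurs (say) once in one section and 2 times in another will not be in the lexicons of those sections, but will be in the lexicon of the whole parags text.  Conversely, "vlex" and "vgud" for "tot-parags" are smaller than the section sums, because a word type that is in the lexicon of several sections is counted only once in the total.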
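
The effect can be seen in a tiny Python example. The two "sections" and their counts are made up for illustration, and `lexicon` is a hypothetical helper using the same threshold of 3 occurrences.

```python
from collections import Counter


def lexicon(counts, min_count=3):
    """Word types that occur at least min_count times."""
    return {w for w, c in counts.items() if c >= min_count}


# TOY counts, made up for illustration only.
section_a = Counter({"dairal": 1, "daiin": 5})
section_b = Counter({"dairal": 2, "daiin": 4})
whole = section_a + section_b        # counts add up across sections

print(lexicon(section_a), lexicon(section_b), lexicon(whole))
```

Here "dairal" reaches the threshold only in the combined text, while "daiin" is in both section lexicons but is a single type in the whole.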

These are the 33 word types (with at least 3 occurrences) that do not fit the CMC model, and their occurrence counts in the total parags text:

    7.0 daiidy     Should be valid?
    7.0 polchedy   Two words?
    6.0 aithy      Uses the {ith} non-element.
    5.5 cholky     Two words?
    4.5 ail        Should be valid?
    4.5 cholkaiin  Two words?
    4.0 cheeteey   Has 3 benches.
    4.0 chodchy    Two words?
    4.0 cholkar    Two words?
    4.0 cholkeedy  Two words?
    4.0 dairal     Two words, {d}{a}{ir}.{a}{l}?
    4.0 dairin     Ditto?
    4.0 ety        Should be valid?
    4.0 qoedy      Should be valid?
    4.0 qoeol      Should be valid?
    4.0 shoikhy    Uses the {ikh} non-element.
    3.5 pchocthy   Two gallows; two words?
    3.0 aiinal     Two words, {a}{iin}.{a}{l}?
    3.0 aikhy      Uses the {ikh} non-element.
    3.0 airody     Two words, {a}{ir}.{o}{d}{y}?
    3.0 airol      Two words, {a}{ir}.{o}{l}?
    3.0 cheolkain  Two words?
    3.0 cholchedy  Two words?
    3.0 chsey      The {se} is {sh} perhaps?
    3.0 chsky      Two words?
    3.0 dairody    Two words, {d}{a}{ir}.{o}{d}{y}?
    3.0 daldaiin   Two words, {d}{a}{l}.{d}{a}{iin}?
    3.0 oikhy      Uses the {ikh} non-element.
    3.0 polshy     Two words?
    3.0 qekchdy    Should be valid?
    3.0 qoedaiin   Should be valid?
    3.0 sheekchy   Has 3 benches.
    3.0 tockhy     Has 2 gallows; two words?

All the best, --stolfi


RE: A family of grammars for Voynichese - ReneZ - 29-12-2025

(29-12-2025, 01:05 PM)Jorge_Stolfi Wrote: So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements.

However, there are 70 words (tokens) with those combinations in the whole parags text, out of more than 30'000 tokens.  Is that significant?  Contrast that with 1622 occurrences of {cth} {ckh} {cph} {cfh}.

Maybe I should accept those 'i' variants too (and treat them too as platform gallows, CMC class "H").

On the other hand, there are only 4 occurrences of {ih}; contrast with 9362 of {ch}.

This may be a clue that {ith} {ikh} {iph} {ifh} are instances of {cth} {ckh} {cph} {cfh} that were mangled by the Scribe/Retracer/Transcriber. So maybe we should just "error-correct" them as such.

Your counts correspond reasonably well with those in this document [link not preserved] for the three most complete transliterations.

At this point, I don't see any way to conclude whether things like ith / iTh are meaningfully different from cth / cTh or not. If we found that they strongly depend on the scribal hand, then that would be an indication, but I do not think that is the case.

Failing that, we can just record the differences, and then decide when doing stats. The problem with that, of course, is that one quickly runs into many different scenarios based on different combinations of 'choices' to be made, to the point where it becomes totally impractical.


RE: A family of grammars for Voynichese - dashstofsk - 30-12-2025

(29-12-2025, 03:33 PM)Grove Wrote: Don’t e’s only precede d’s, o’s, or a’s, and don’t a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from the e series to the i series, although there are cases of o’s followed by an i series without the a transition.

I feel like the e, ee, eee series preceding a d acts in a similar way to the i, ii, iii series that precedes the r, l, n variants.

You are right in highlighting the fact that many word strings can be classed as e-series and i-series. If you look at the most frequent words and the way they are written, you will see that this is so. The characters in words such as chedy and Shedy are all e-based. Character d is written first as e, and a loop is then added upwards and to the right. Character y is written first as e, and the downward swing is then added. The initial e curve of these characters looks the same.

Likewise for characters n, l, r, m. They are usually written first with the same downward stroke i. These characters often like to come after other i-stroke characters.

Character a does indeed seem to be written with the e stroke followed by the i stroke, and does seem to be the transition character between e-strings and i-strings.

It seems that the writer likes the easy-to-repeat strokes i and e, and prefers to write in a style that is not too taxing. He is doodling with letters to create artificial text. I mentioned something about this theory of mine in a previous post.