The Voynich Ninja
A family of grammars for Voynichese - Printable Version



RE: A family of grammars for Voynichese - ReneZ - 26-12-2025

(26-12-2025, 11:06 AM)Jorge_Stolfi Wrote: I intend to release this transcription to be included in Rene's files, but it currently uses a somewhat incompatible format that I found more convenient to use while building it. Anyway most differences are like a/o, r/s, on ambiguous glyphs. And I also make more liberal use of ',' for uncertain word spaces.

I would be happy to support that exercise.


RE: A family of grammars for Voynichese - Jorge_Stolfi - 26-12-2025

(26-12-2025, 11:06 AM)Jorge_Stolfi Wrote: REVISITING MY OLD WORD PARADIGM

...

  The next filter parses the valid EVA strings into the /elements/ of the word paradigm.
   
    {q} {o} {a} {y} {d} {r} {l} {s}   
    {ch} {che} {sh} {she} {ee} {eee}
    {k} {ke} {t} {te} {p} {pe} {f} {fe}
    {ckh} {ckhh} {ckhe} {ckhhe}
    {cth} {cthh} {cthe} {cthhe}
    {cph} {cphh} {cphe} {cphhe}
    {cfh} {cfhh} {cfhe} {cfhhe}
    {n} {in} {iin} {iiin}
    {m} {im} {iim} {iiim}
        {ir} {iir} {iiir}
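
A minimal sketch of what such a parsing filter could look like. The post does not say how ambiguous strings are resolved, so the rule used here (try the longest element first, backtrack on failure) is an assumption, and the name parse_elements is hypothetical.

```python
# Element inventory copied from the list above.
ELEMENTS = """
q o a y d r l s
ch che sh she ee eee
k ke t te p pe f fe
ckh ckhh ckhe ckhhe
cth cthh cthe cthhe
cph cphh cphe cphhe
cfh cfhh cfhe cfhhe
n in iin iiin
m im iim iiim
ir iir iiir
""".split()

# Assumption: at each position the longest matching element is tried first.
BY_LENGTH = sorted(ELEMENTS, key=len, reverse=True)


def parse_elements(word):
    """Return a list of elements that exactly covers `word`, or None if none exists."""
    if word == "":
        return []
    for elem in BY_LENGTH:
        if word.startswith(elem):
            rest = parse_elements(word[len(elem):])
            if rest is not None:
                return [elem] + rest
    # No element fits here (for example a lone 'e' or 'i'): the string fails.
    return None


print(parse_elements("daiin"))
print(parse_elements("qokeey"))
```

Backtracking matters because a bare e is not an element: in "qokeey" the parser must give up on {ke} and fall back to {k} followed by {ee}.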
       

And here are the frequencies of elements in tokens that pass level 1 (parsing into 
the elements above):

  12443.625000  0.09984 {a}
  21543.125000  0.17285 {o}
  15120.250000  0.12132 {y}

  5192.000000   0.04166 {q}

  11388.500000  0.09137 {d}
   9220.250000  0.07398 {l}
   5792.375000  0.04647 {r}
   2081.750000  0.01670 {s}

   5664.500000  0.04545 {ch}   3906.250000  0.03134 {che}
   3842.250000  0.03083 {ee}    324.000000  0.00260 {eee}
   2085.500000  0.01673 {sh}   1869.250000  0.01500 {she}

   7295.000000  0.05853 {k}    1519.000000  0.01219 {ke}
   4173.750000  0.03349 {t}     786.000000  0.00631 {te}
    316.000000  0.00254 {f}       0.0       0.0     {fe}
   1197.375000  0.00961 {p}       0.0       0.0     {pe}


    452.500000  0.00363 {ckh}   169.500000  0.00136 {ckhe}
    591.000000  0.00474 {cth}   148.000000  0.00119 {cthe}
     36.000000  0.00029 {cfh}    14.000000  0.00011 {cfhe}
    104.500000  0.00084 {cph}    49.000000  0.00039 {cphe}

     16.000000  0.00013 {ckhh}    0.0       0.0     {ckhhe}
     25.000000  0.00020 {cthh}    1.0       0.00001 {cthhe}
      1.0       0.00001 {cfhh}    0.0       0.0     {cfhhe}
      4.0       0.00003 {cphh}    2.0       0.00002 {cphhe}

    113.500000  0.00091 {n}     868.250000  0.00697 {m} 
   1665.500000  0.01336 {in}     40.000000  0.00032 {im} 
   3779.000000  0.03032 {iin}    15.000000  0.00012 {iim}
    159.000000  0.00128 {iiin}    1.0       0.00001 {iiim}
     
    487.750000  0.00391 {ir}
    130.500000  0.00105 {iir}
      1.0       0.00001 {iiir}

Obviously some of the originally proposed elements hardly occur at all, and can be excluded from the "alphabet"  with little loss.

In particular, note that element {ckhhe} does not occur, while {ckh},  {ckhe} and {ckhh} do.  I take that as evidence that {ckhe} and {ckhh} are just calligraphic variants of each other, and should perhaps be mapped and counted as such.  Ditto for the other three gallows, including p and f.

Also note that while {cfh}, {cfhe},{cph} and {cphe} occur in reasonable numbers and proportions, {pe} and {fe} are completely absent.  This well-known anomaly may indicate that p and f are substitutes for te and ke.  Maybe the hook at the end of the arm is meant to be the e. 

(But beware that the frequencies of these characters and digraphs cannot be relied upon to reveal what p and f mean.  Word frequencies on paragraph head lines, where most p and f are found, are very likely different from the word frequencies in the other lines; and character frequencies depend entirely on word frequencies.)

And apparently we can exclude {iiir} and {iiim} from the element set.

All the best, --stolfi


RE: A family of grammars for Voynichese - Grove - 26-12-2025

With p, f, cph, and cfh- how valid are they if they aren’t in the headline of a paragraph? What percentage of them fall outside of those identifiable headlines?

And thanks for rewinding back to the early 2000s 8-)


RE: A family of grammars for Voynichese - Jorge_Stolfi - 26-12-2025

(26-12-2025, 02:18 PM)Grove Wrote: With p, f, cph, and cfh- how valid are they if they aren’t in the headline of a paragraph? What percentage of them fall outside of those identifiable headlines?

According to my scripts (if they are buggy, blame them, not me!), in running paragraph lines:
  • There are  683 head lines, 534 of which (~78%) contain puffs (p or f).  That is 1267 puffs total (~1.9 puffs per head line).
  • There are 3423 body lines,  370 of which (~11%) contain puffs.  That is  597 puffs total (~1 puff every 5.7 lines).
However these numbers do not include the "fancy" puffs, like the one shaped like a woman's body,  that have been transcribed as weirdos.
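
For anyone who wants to re-derive the percentages, here is a small arithmetic sketch using only the line and puff counts quoted above. The variable names are mine, and the last figure (share of puffs outside head lines) is derived here rather than stated in the post.

```python
# Counts quoted in the post (running paragraph lines only).
head_lines, head_with_puffs, head_puffs = 683, 534, 1267
body_lines, body_with_puffs, body_puffs = 3423, 370, 597


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


head_share = pct(head_with_puffs, head_lines)    # head lines containing a puff
body_share = pct(body_with_puffs, body_lines)    # body lines containing a puff
puffs_per_head_line = head_puffs / head_lines
lines_per_body_puff = body_lines / body_puffs
# Share of all counted puffs that fall outside head lines.
outside_head = pct(body_puffs, head_puffs + body_puffs)

print(f"head lines with puffs: {head_share:.1f}%")
print(f"body lines with puffs: {body_share:.1f}%")
print(f"puffs per head line:   {puffs_per_head_line:.2f}")
print(f"body lines per puff:   {lines_per_body_puff:.2f}")
print(f"puffs outside heads:   {outside_head:.1f}%")
```

By these counts roughly a third of the plain puffs sit outside head lines, with the same caveat that the "fancy" puffs are not included.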

All the best, --stolfi


RE: A family of grammars for Voynichese - ReneZ - 26-12-2025

(26-12-2025, 06:29 PM)Jorge_Stolfi Wrote: However these numbers do not include the "fancy" puffs, like the one shaped like a woman's body,  that have been transcribed as weirdos.

If you use STA, they are all together in a single family. The pedestalled ones are in three additional (small) families. STA is available only for the 'published' transliterations. If you want to add yours, we will first have to go through the exercise of finding any 'new' characters, which is not trivial, but also not overly complicated.


RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

Recall that my "OKOKO" partial paradigm says that a Voynichese word is a sequence of "K" elements with "O" elements inserted before, between, and after them, with at most two "O" per slot.  Where the "O" elements are {a} {o} {y}, and the "K" elements are all the others.
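
The paradigm as stated can be sketched as follows. This assumes a word has already been split into elements; the function names okoko_pattern and fits_okoko are hypothetical. "At most two O per slot" is read here as "no run of three or more consecutive O's", which is my interpretation of the rule.

```python
O_ELEMENTS = {"a", "o", "y"}  # every other element counts as "K"


def okoko_pattern(elements):
    """Map a list of elements to its string of O/K class letters."""
    return "".join("O" if e in O_ELEMENTS else "K" for e in elements)


def fits_okoko(pattern):
    """True if no slot (maximal run of O's) holds more than two O's."""
    return set(pattern) <= {"O", "K"} and "OOO" not in pattern


print(okoko_pattern(["q", "o", "k", "ee", "y"]))
print(fits_okoko("KOOK"), fits_okoko("KOOOK"))
```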

Here are the counts of tokens that satisfy this paradigm, condensed to the number of {O} elements:

COUNTING TOKENS BY NUMBER OF 'O'S

  15656.687500 0.48866 O
  13798.312500 0.43066 OO
   1437.625000 0.04487 OOO
   1052.250000 0.03284 -
     89.500000 0.00279 OOOO
      5.437500 0.00017 OOOOO
      0.187500 0.00001 OOOOOO

These counts consider the running paragraph text only, of all sections, from the same transcription file as before (80% U, 20% Z).  The counts are fractional in order to account for uncertain spaces ',' as I explained before.

Thus it seems that, in addition to the limit of two "O" elements per slot, there is also a limit of three "O" elements in total.  I will add this rule to the model.  The exceptions are only 0.3% of all tokens, and may well be cases of two words run together in the transcription.
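
A quick check of that figure, using the counts from the table above keyed by number of "O" elements (the "-" row is taken as zero "O"s). The variable names are illustrative only.

```python
# Token counts by number of "O" elements, copied from the table above.
counts_by_o = {
    0: 1052.25,
    1: 15656.6875,
    2: 13798.3125,
    3: 1437.625,
    4: 89.5,
    5: 5.4375,
    6: 0.1875,
}

total = sum(counts_by_o.values())
# Tokens that would violate a limit of three "O" elements in total.
over_limit = sum(c for n, c in counts_by_o.items() if n > 3)
share = over_limit / total

print(f"total tokens:  {total}")
print(f"over the limit: {over_limit} ({100 * share:.2f}%)")
```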

If we delete the "O" elements and count only "K" elements, we get this:

COUNTING TOKENS BY NUMBER OF 'K'S

  14144.875000 0.44148 KK
   9226.875000 0.28798 KKK
   5183.687500 0.16179 K
   2717.125000 0.08480 KKKK
    401.875000 0.01254 KKKKK
    281.750000 0.00879 -
     68.062500 0.00212 KKKKKK
     13.187500 0.00041 KKKKKKK
      2.437500 0.00008 KKKKKKKK
      0.125000 0.00000 KKKKKKKKK

Here the cutoff is not as clear as for the "O"s, but tokens with more than six "K"s can be considered noise.  Maybe those with six "K"s are noise too. 

I have not yet decided whether I should put that limit on total "K"s as another constraint of the "OKOKO" model.  Anyway the number of "K"s will be limited by the next stage of the model, the "layer" (formerly "crust-mantle-core") model. 

All the best, --stolfi


RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

(28-12-2025, 02:01 PM)Jorge_Stolfi Wrote: Recall that my "OKOKO" partial paradigm says that a Voynichese word is a sequence of "K" elements with "O" elements inserted before, between, and after them, with at most two "O" per slot.  Where the "O" elements are {a} {o} {y}, and the "K" elements are all the others.

If we map each element to "O" or "K", we get the "OKOKO pattern" of a word.  Here are the 50 most common OKOKO patterns in paragraph text:

COUNTING OKOKO PATTERNS


   5560.000000 0.17179 KOK
   2687.250000 0.08303 OKOK
   2613.062500 0.08074 KOKOK
   2560.125000 0.07910 KKO
   2192.250000 0.06773 OK
   1751.125000 0.05410 KO
   1479.875000 0.04572 KOKKO
   1400.312500 0.04327 KOKO
   1296.250000 0.04005 OKKO
   1276.125000 0.03943 KKOK
   1108.750000 0.03426 KKKO
    839.000000 0.02592 KOKKKO
    833.000000 0.02574 OKKKO
    646.562500 0.01998 OKO
    627.625000 0.01939 OKKOK
    593.750000 0.01835 K
    425.875000 0.01316 KOKKOK
    287.437500 0.00888 OKOKO
    281.750000 0.00871 O
    254.875000 0.00787 KK
    232.437500 0.00718 KOKOKO
    226.000000 0.00698 KKKOK
    221.375000 0.00684 OKOKOK
    201.250000 0.00622 KKOKO
    192.562500 0.00595 KKKKO
    160.625000 0.00496 KKK
    147.312500 0.00455 KKOKOK
    138.750000 0.00429 KOOK
    128.625000 0.00397 KOKK
    112.000000 0.00346 OKKOKO
    110.437500 0.00341 OKKKKO
     98.625000 0.00305 OKK
     97.375000 0.00301 KOKKK
     94.375000 0.00292 OKKKOK
     91.500000 0.00283 OKOKKO
     86.000000 0.00266 OKKK
     79.250000 0.00245 KOKOKOK
     75.750000 0.00234 KOKKKKO
     72.750000 0.00225 OOK
     66.875000 0.00207 KKOKKO
     59.625000 0.00184 KOKKOKO
     55.000000 0.00170 KOKKKOK
     52.250000 0.00161 KOKOKKO
     45.500000 0.00141 OKKOKOK
     39.250000 0.00121 KOO
     38.500000 0.00119 KKKK
     34.875000 0.00108 OKOKKOK
     33.000000 0.00102 OKOKK
     31.000000 0.00096 KKKOKO
     30.250000 0.00093 OKOKKKO


These counts are already using the new constraint of at most three "O"s, which, as shown in the previous post, excludes only 94 tokens out of ~31'000.  

Note that the distribution is long-tailed, with no obvious cut-off.  

On this list, the first OKOKO pattern with two consecutive "O"s is #28, "KOOK", with 138.75 occurrences.  

The first one with three total "O"s is (surprise!) #18, "OKOKO" itself, with 287.4375 occurrences.
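
The two ranks can be re-checked mechanically. The sketch below copies the first 28 patterns of the list above in order; the helper first_rank is a hypothetical name.

```python
# First 28 patterns of the ranked list above, most frequent first.
RANKED = [
    "KOK", "OKOK", "KOKOK", "KKO", "OK", "KO", "KOKKO", "KOKO",
    "OKKO", "KKOK", "KKKO", "KOKKKO", "OKKKO", "OKO", "OKKOK", "K",
    "KOKKOK", "OKOKO", "O", "KK", "KOKOKO", "KKKOK", "OKOKOK", "KKOKO",
    "KKKKO", "KKK", "KKOKOK", "KOOK",
]


def first_rank(patterns, predicate):
    """Return (1-based rank, pattern) of the first pattern satisfying predicate."""
    for rank, pattern in enumerate(patterns, start=1):
        if predicate(pattern):
            return rank, pattern
    return None


double_o = first_rank(RANKED, lambda p: "OO" in p)          # two consecutive O's
three_o = first_rank(RANKED, lambda p: p.count("O") >= 3)   # three O's in total

print(double_o, three_o)
```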

All the best, --stolfi


RE: A family of grammars for Voynichese - Grove - 28-12-2025

I’m curious what happens if you don’t include gallows as part of K (except the bench ones because they still have the underlying ch). I’m wondering if the gallows were some sort of instruction that could take place not only in a page or word initial position, but mid-word as well.


John


RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

(28-12-2025, 02:43 PM)Grove Wrote: I’m curious what happens if you don’t include gallows as part of K (except the bench ones because they still have the underlying ch).

Good point.  Lumping the platform gallows (CTh etc) with simple gallows (t etc) is a potential defect of the "crust-mantle-core" (CMC) model as I formulated it, because the former should contribute to the count of benches "X".  Maybe as half a bench in the prefix, half in the suffix.  

While we think about that issue, here are some counts that don't depend on it.  If we delete the "O"s and map each element to its class "Q", "D", "X" etc, we get the "CMC pattern" of the word.  If the word has no gallows ("H") and no benches ("X"), a valid CMC pattern must be Q^q D^(d+e) N^n where q and n are 0 or 1, and d+e is between 0 and 3.  Recall that 
  • "Q" is just {q}, 
  • "D" (the "dealers") are {d} {l} {r} {s}
  • "N" is {n}, {m}, {in}, {im}, {ir}, ...{iir}, {iiin}, {iiim}

Here are the counts:

COUNTING CRUST-ONLY TOKENS BY CMC PATTERN

   2433.187500 0.30517 D
   2034.375000 0.25515 DN
   1552.062500 0.19466 DD
    849.250000 0.10651 N
    286.500000 0.03593 -
    225.750000 0.02831 QD
    217.375000 0.02726 DDD
    148.562500 0.01863 DDN
     83.500000 0.01047 QDN
     47.000000 0.00589 Q
     44.250000 0.00555 QN
     34.000000 0.00426 QDD
     15.437500 0.00194 DDDN
      2.000000 0.00025 QDDD

There are 2 x 4 x 2 = 16 CMC patterns that fit the formula above.  Here they are sorted "alphabetically" instead of by frequency:

    217.375000 0.02726 DDD
   1552.062500 0.19466 DD
   2433.187500 0.30517 D
    286.500000 0.03593 -
   
     15.437500 0.00194 DDDN
    148.562500 0.01863 DDN
   2034.375000 0.25515 DN
    849.250000 0.10651 N
   
      2.000000 0.00025 QDDD
     34.000000 0.00426 QDD
    225.750000 0.02831 QD
     47.000000 0.00589 Q
     
      0.0      0.0     QDDDN
      0.0      0.0     QDDN
     83.500000 0.01047 QDN
     44.250000 0.00555 QN

The patterns "QDDDN" and "QDDN" are absent, and "QDDD" is down at noise level.   Maybe the limit should be q+d+e+n <= 3, rather than just d+e <= 3.  But let me first see what happens when there are "X" and "H"...
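
As a rough sketch, the crust-only formula can be enumerated and compared with the patterns observed in the table. The empty pattern is shown as "-" in the tables and as an empty string here. The function name crust_patterns and the length test for the tentative tighter limit are my own framing, not part of the model as posted.

```python
from itertools import product

# Patterns with a nonzero count in the table above ("" stands for "-").
OBSERVED = {
    "D", "DN", "DD", "N", "", "QD", "DDD", "DDN",
    "QDN", "Q", "QN", "QDD", "DDDN", "QDDD",
}


def crust_patterns(max_d=3):
    """All patterns Q^q D^(d+e) N^n with q, n in {0, 1} and d+e in 0..max_d."""
    return [
        "Q" * q + "D" * d + "N" * n
        for q, d, n in product((0, 1), range(max_d + 1), (0, 1))
    ]


allowed = crust_patterns()
absent = sorted(p for p in allowed if p not in OBSERVED)
# Patterns that the tentative limit q+d+e+n <= 3 would additionally exclude.
over_tight = sorted(p for p in allowed if len(p) > 3)

print(len(allowed), absent, over_tight)
```

Note that the tighter limit would also drop "DDDN", which does occur about 15 times, so it is not a free change.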

All the best, --stolfi


RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

And here are the counts for crust-mantle-core (CMC) patterns with mantle ("X") but no core ("H").
Recall that a valid word in this class must have the CMC pattern

  Q^q D^d X^(x+y) D^e N^n

Where (currently) q and n are 0 or 1, d+e is in 0..3, and x+y is in 1..3.  The counts are:

COUNTING TOKENS WITH MANTLE BUT NO CORE BY CMC PATTERN

   3343.125000 0.42974 XD
   1185.250000 0.15236 X
    654.375000 0.08412 DXD
    415.000000 0.05335 XX
    357.375000 0.04594 XDD
    315.750000 0.04059 DX
    309.750000 0.03982 XXD
    307.500000 0.03953 XDN
    245.750000 0.03159 XN
     85.375000 0.01097 DXX
     77.625000 0.00998 DDXD
     56.500000 0.00726 QXD
     55.250000 0.00710 DXXD
     44.250000 0.00569 QX
     41.500000 0.00533 DXDD
     38.625000 0.00497 DDX
     31.500000 0.00405 DXN
     27.000000 0.00347 QDXD
     26.375000 0.00339 XDDD
     25.500000 0.00328 QDX
     24.500000 0.00315 DXDN
     19.750000 0.00254 XDDN
     19.000000 0.00244 XXDN
     13.875000 0.00178 XXDD
      6.250000 0.00080 XXN
      5.750000 0.00074 QXDN
      5.250000 0.00067 DDXX
      4.500000 0.00058 QDXX
      4.000000 0.00051 DDXXD
      4.000000 0.00051 DXXDD
      3.500000 0.00045 QDXXD
      3.250000 0.00042 QXX
      3.000000 0.00039 QXDD
      3.000000 0.00039 QXXD
      2.000000 0.00026 DXDDN
      2.000000 0.00026 XXDDN
      1.500000 0.00019 DXXDN
      1.500000 0.00019 XXX
      1.000000 0.00013 DDXN
      1.000000 0.00013 QDXDD
      1.000000 0.00013 QDXDN
      1.000000 0.00013 QDXN
      1.000000 0.00013 QXN
      1.000000 0.00013 XXDDD
      0.750000 0.00010 XXXD
      0.500000 0.00006 DDDX
      0.500000 0.00006 DXXN
      0.500000 0.00006 QXXXD
      0.500000 0.00006 XDDDN
      0.250000 0.00003 DXXXD
      0.250000 0.00003 QXXN
      0.125000 0.00002 DDXDN


The exponents d,e may be 0,0  1,0  0,1  2,0  1,1  0,2  3,0  2,1  1,2  0,3, and x+y may be 1, 2 or 3, so there are 2 x 10 x 3 x 2 = 120 possible CMC patterns with mantle but no core.  Yet only the 52 patterns above occur at all, and several occur only at noise level.
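
A minimal sketch of a validator for this class of patterns, under the bounds stated above (q and n are 0 or 1, d+e at most 3, x+y between 1 and 3). The regular expression and the function names are assumptions for illustration; the brute-force counter simply tries every string over Q, D, X, N up to the maximum possible length of 8.

```python
import re
from itertools import product

# Shape Q? D* X+ D* N?; the numeric bounds are checked separately.
MANTLE_RE = re.compile(r"(Q?)(D*)(X+)(D*)(N?)")


def is_valid_mantle_pattern(pattern, max_d=3, max_x=3):
    """True if pattern is Q^q D^d X^(x+y) D^e N^n within the stated bounds."""
    m = MANTLE_RE.fullmatch(pattern)
    if m is None:
        return False
    _, d, x, e, _ = m.groups()
    return len(d) + len(e) <= max_d and len(x) <= max_x


def count_valid_patterns(max_len=8):
    """Count the distinct valid patterns by brute force over all short strings."""
    total = 0
    for length in range(1, max_len + 1):
        for letters in product("QDXN", repeat=length):
            if is_valid_mantle_pattern("".join(letters)):
                total += 1
    return total


for p in ("XD", "QXXXD", "DXXDD", "XXXX", "DDXDD", "DN"):
    print(p, is_valid_mantle_pattern(p))
print(count_valid_patterns())
```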

I have to think more about these numbers.  Please stay tuned...

All the best, --stolfi