The difficult problem of evaluating slot grammars.
This post is a bit off-topic from the rest of the thread. It deals only with the difficult (but interesting) problem of evaluating slot grammars, and collects some thoughts I had after the exchange of opinions we had in here. I’m less interested in this problem now than when I started the thread, but it’s still an interesting one, and I thought that sharing some considerations and asking for help on a few questions might be worthwhile.
It’s actually a fairly obscure mathematical problem, related to the Voynich because ‘slot grammars’ are used as a tool to study it.
I used separators to better show the chain of reasoning, hoping it will make the post easier to read. It's actually shorter than it looks at first sight.
-----------------------------------------
From what I know of the literature (but I may have missed many references!), in addition to the number of unique word types in the text (a number I’ll call ‘NVoynichwords’ for short, ~8400 depending on the transcription), two variables are used for the evaluation of slot grammars:
• The number of correct Voynich word types the grammar generates; I’ll use ‘Nhits’ as a shorthand
• The total number of word types the grammar can possibly generate: ‘Ngrammarspace’
These two variables are usually used as (exact names may vary):
• Coverage (‘C’ ) = Nhits / NVoynichwords
• Efficiency (‘E’) = Nhits / Ngrammarspace
And as a combination of the two, the ‘F1 score’:
• F1 = 2 * C * E / (C + E)
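Just to make the definitions concrete, here is a minimal Python sketch of the three metrics. The counts fed in at the end are invented, purely to show the calculation, not taken from any real transcription or grammar:

```python
# Minimal sketch of coverage, efficiency and F1 from the raw counts.
# All variable names mirror the post; the numbers below are made up.

def coverage(n_hits: int, n_voynich_words: int) -> float:
    return n_hits / n_voynich_words

def efficiency(n_hits: int, n_grammar_space: int) -> float:
    return n_hits / n_grammar_space

def f1_score(c: float, e: float) -> float:
    # Harmonic mean of coverage and efficiency; zero if either is zero.
    return 0.0 if c + e == 0 else 2 * c * e / (c + e)

# Hypothetical grammar: hits 6000 of ~8400 word types,
# out of a generative space of 500000 possible words.
c = coverage(6000, 8400)
e = efficiency(6000, 500_000)
print(round(c, 3), round(e, 4), round(f1_score(c, e), 4))  # 0.714 0.012 0.0236
```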
----------------------------------------------
The rationale for using these metrics is as follows:
The primary requirement for a grammar is, obviously, its coverage. But coverage can’t be the only metric, because it can be gamed with a trivial grammar: just use as many slots as the length of the longest word type, with each slot containing all the characters used to write the text, and you get 100% coverage (I’ll call this grammar “Trivial LL*CS”, shorthand for “Longest word type length * Character set size”).
To safeguard against this degenerate case we use the efficiency, which gets terribly low when calculated on the “Trivial LL*CS” grammar.
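To give a rough feel of just how low: with some assumed numbers (say 14 slots, each holding ~25 characters plus a null, which gives 100% coverage by construction), the efficiency of Trivial LL*CS comes out around 10^-16:

```python
# Back-of-the-envelope efficiency of the "Trivial LL*CS" grammar.
# Assumed numbers: longest word type ~14 characters, ~25 characters in the
# alphabet, one null per slot, ~8400 word types all covered.
n_voynich_words = 8400
slots = 14
choices_per_slot = 25 + 1                  # 25 characters + the null
n_grammar_space = choices_per_slot ** slots
efficiency = n_voynich_words / n_grammar_space
print(f"{efficiency:.1e}")                 # ~1.3e-16
```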
The F1 score does not add any new information, but combines coverage and efficiency into a single ‘figure of merit’. However, I think it suffers from a numerical problem (sorry for not being rigorous in this part, it’s not easy). For simplicity, instead of coverage and efficiency let’s consider the variables they are derived from: Nhits and Ngrammarspace. Ngrammarspace tends to grow ‘exponentially’ with the grammar, without bound, because it is a product over the slots (possible choices of slot 1 * possible choices of slot 2 * ….., excuse me for not writing it in proper mathematical form). Nhits, on the contrary, is fundamentally capped at a small number (the number of word types in the text, ~8400). The result is that efficiency tends to swamp coverage in the F1 score (see also Note 2), so that grammars with high coverage (the primary requirement), which necessarily have low efficiency, are treated worse than grammars with lower coverage but higher efficiency.
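A quick illustration of this swamping, with two invented grammars: A has high coverage but a huge generative space, B has much lower coverage but a tiny space, and B still beats A on F1 by a wide margin:

```python
# Hypothetical comparison showing how efficiency swamps coverage in F1.
n_voynich_words = 8400

def f1(n_hits: int, n_grammar_space: int) -> float:
    c = n_hits / n_voynich_words
    e = n_hits / n_grammar_space
    return 2 * c * e / (c + e)

# Grammar A: 95% coverage, but its slot product generates 2 million words.
# Grammar B: 40% coverage, generating only 10000 words.
print("A:", round(f1(int(0.95 * n_voynich_words), 2_000_000), 4))  # ~0.0079
print("B:", round(f1(int(0.40 * n_voynich_words), 10_000), 4))     # ~0.37
```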
-------------------------------------
But most importantly, these metrics are obviously not enough, because it’s also possible to trivially cheat on efficiency and F1 score by using a grammar with just one slot containing all the word types of the text (I’ll call this “Trivial 1*WS”, shorthand for “One slot * Number of word types”): 100% coverage, 100% efficiency, and the best possible F1 score in the world (one).
-------------------------------------
How can this situation be improved?
I thought a little about it, and it’s a difficult problem. The best idea I had is to add a third variable, which I’ll just call X, and the simplest one I could think of is X = total size of the character set used by the grammar (‘Ncharset’). By character set I don’t mean just ‘c’, ‘d’, ‘e’ etc., but all the unique combinations of characters used by the grammar, i.e. “ch”, “sh”, “eee”, “ol”, “lk” or even “teodyteytar” (this last one in the Trivial 1*WS grammar), plus the ‘null’ character. And of course, the same character used in different slots (‘nulls’ included) must be counted multiple times (see also picky Note 3).
Using this variable it’s possible, for instance, to multiply by (NVoynichwords - Ncharset), which is zero in the case of Trivial 1*WS, and so get the desired result of giving a zero score to this degenerate case as well.
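For what it's worth, here is one possible way to compute these counts from a concrete slot-grammar representation. The three-slot toy grammar is invented just to fix ideas; a real grammar would of course be much larger:

```python
from itertools import product

# A slot grammar as a list of slots; each slot is a list of 'characters'
# (possibly multi-glyph strings), with "" standing for the null choice.
# Purely illustrative toy grammar, not a real Voynich grammar.
toy_grammar = [
    ["", "q"],
    ["o", "a"],
    ["kee", "tee", "l", ""],
]

def n_charset(grammar) -> int:
    # Every entry of every slot counts, nulls included; the same 'character'
    # appearing in two different slots is counted twice.
    return sum(len(slot) for slot in grammar)

def n_grammar_space(grammar) -> int:
    # Size of the generative space: the product of the slot sizes.
    space = 1
    for slot in grammar:
        space *= len(slot)
    return space

def n_hits(grammar, voynich_word_types: set) -> int:
    # Word types the grammar generates that actually occur in the text:
    # one choice per slot, concatenated.
    generated = {"".join(choice) for choice in product(*grammar)}
    return len(generated & voynich_word_types)

print(n_charset(toy_grammar))        # 2 + 2 + 4 = 8
print(n_grammar_space(toy_grammar))  # 2 * 2 * 4 = 16
print(n_hits(toy_grammar, {"qokeedy", "otee", "qol", "daiin"}))  # 2
```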
------------------------------------------
Now we need the ‘figure of merit’ function. It’s easy to say which basic constraints it must satisfy: it must be zero if Nhits is zero, then increase monotonically with Nhits. It must decrease monotonically with Ngrammarspace and go to zero at infinity. It must decrease monotonically with Ncharset and be zero at Ncharset = NVoynichwords.
Another important constraint it must satisfy, imho, is that a high-coverage grammar should score better than another grammar with a much higher efficiency but much lower coverage, because it’s all too easy to cheat on efficiency just by defining more and more complex ‘characters’ (see also Note 4), and/or by judiciously reducing the coverage. For lack of a better name, I call this property ‘consistency’. By the way, the F1 score does not respect ‘consistency’, because it suffers from the ‘efficiency swamps coverage’ problem noted above (unless, of course, the grammars are first binned by coverage, with each bin containing grammars of comparable coverage: then the F1 score inside each bin becomes ‘consistent’).
-------------------------------------------
The basic constraints for a figure of merit are easy to satisfy with a function, e.g. F = (NVoynichwords - Ncharset) * Nhits / Ngrammarspace would already do (as you can easily verify), but it suffers from the same problem as the F1 score: at a certain point Ngrammarspace dominates everything else, so this formula lacks ‘consistency’ as (non-rigorously) defined above.
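Plugging in rough numbers for the two degenerate grammars (again assumed values, not exact counts) shows that this F at least handles both of them, sending each one to zero or nearly zero:

```python
n_voynich_words = 8400

def figure_of_merit(n_hits: int, n_charset: int, n_grammar_space: int) -> float:
    # F = (NVoynichwords - Ncharset) * Nhits / Ngrammarspace
    return (n_voynich_words - n_charset) * n_hits / n_grammar_space

# Trivial LL*CS (assumed: 14 slots of 26 choices each, 100% coverage):
# Ncharset = 14 * 26 = 364, Ngrammarspace = 26**14.
print(figure_of_merit(n_voynich_words, 14 * 26, 26**14))   # ~1e-12, crushed by the space

# Trivial 1*WS (one slot holding all 8400 word types):
# Ncharset = NVoynichwords, so the first factor is exactly zero.
print(figure_of_merit(n_voynich_words, n_voynich_words, n_voynich_words))  # 0.0
```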
I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula that can be justified rationally, from basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score, with grammars somehow binned by coverage? But in that case, how should the binning be done without risking the introduction of a personal bias? Thank you for any comments/answers!
Note 1: (just to make sure I got it right and haven’t switched the consensus meanings…) A ‘word type’ is any unique word appearing in the text, i.e. ‘daiin’ and ‘teodyteytar’ are two ‘word types’. This contrasts with ‘word token’ (not used in this post): there are 865 ‘daiin’ word tokens and one ‘teodyteytar’ word token in the text. In any case, I use ‘word type’ in the sense defined here.
Note 2: if E << C then F1 = 2*C*E/(C+E) ≈ 2*C*E/C = 2*E, and coverage drops out of the calculation entirely.
Note 3 (picky note): it might seem that adding Ncharset cannot add any new information, because the very same numbers used to calculate it (the number of ‘characters’ in each slot) have already been used in calculating Ngrammarspace (just in a product instead of a sum). But it’s also true that Ncharset cannot be recovered from Ngrammarspace alone, because in general there are multiple ways of factoring Ngrammarspace (is it 2*3*5, or 6*5, or 2*15? they all give different Ncharset values). So Ncharset does add some information about the slot grammar that Ngrammarspace by itself does not capture.
Note 4: a posteriori, I realised that’s actually what I did myself in the presentation in the first post of this thread, where I was completely focused on efficiency (following in Zattera’s footsteps on this). To chase efficiency and get ever better F1 scores I transformed a simple and regular grammar into more complex ones, just by adding more symbols. That was a totally unnecessary, asinine amount of work: I should have started directly with the LOOP grammar (which was in effect the key insight), worried much less about efficiency, and ditched BASIC, COMPACT and EXTENDED altogether. It’s been just a few days, but I’d rewrite that presentation upside-down now, and I owe that to talking with the Voynich Ninja community. You rock!