(26-06-2025, 08:05 AM)Mauro Wrote: (26-06-2025, 01:47 AM)ReneZ Wrote: I have often wondered about the initialisation method. After all, the self-citation explains (in a way) how to set up the 'next' word from a previous section, but I never saw anything about how to start.
The seed sentence is in the 'metadata' at the top of Torsten's output files. For the one posted on GitHub:
Quote:#text.initial_line=pchal shal shorchdy okeor okain shedy pchedy qotchedy qotar ol lkar
Note: the last two words 'ol lkar' are actually not used in generating the text. I don't know why the sentence is truncated, maybe a bug, but it's not important for the subsequent processing.
(26-06-2025, 01:47 AM)ReneZ Wrote: Another thing I have been curious about is how the most frequent words come about.
There are probability parameters inside the software, e.g. this instruction (from the linked source file):
Quote:{{"k"} , { new Substitution( new String[] {"t"}, 77), new Substitution(new String[] {"p"}, 94), new Substitution(new String[] {"f"}, 100)}},
I think it determines what can be substituted for 'k': the numbers look like cumulative percentages, so I guess 't' is used 77% of the time, 'p' (94-77) = 17% of the time, and 'f' the remaining 6%.
I think a human being would behave more or less the same way, just with greater fuzziness, and with the 'parameters' varying over time.
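Read as cumulative thresholds out of 100, the quoted substitution table can be sketched as follows (my own illustrative Python, not Torsten's actual code):

```python
import random

# Cumulative thresholds out of 100, as in the quoted table for 'k':
# 't' up to 77, 'p' up to 94, 'f' up to 100 -> P(t)=.77, P(p)=.17, P(f)=.06
SUBSTITUTIONS = {"k": [("t", 77), ("p", 94), ("f", 100)]}

def substitute(glyph, rng):
    """Draw 1..100 and take the first entry whose threshold covers the draw."""
    draw = rng.randint(1, 100)
    for replacement, threshold in SUBSTITUTIONS[glyph]:
        if draw <= threshold:
            return replacement
    return glyph  # unreachable while the last threshold is 100

rng = random.Random(0)  # fixed seed so the sketch is reproducible
counts = {"t": 0, "p": 0, "f": 0}
for _ in range(10_000):
    counts[substitute("k", rng)] += 1
print(counts)  # roughly 7700 't', 1700 'p', 600 'f'
```

If this reading is right, the three thresholds encode the whole distribution in one pass, which matches the way the table lists them in increasing order.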
Thanks!
Now these are the 'rules' for the text generated by the app.
These would be a model for the original Voynichese text in the MS.
However, how does the real text behave?
That it is different for a few different sections isn't a major problem in my opinion, especially if there were indeed several scribes, whose roles may have been more than just copying...
(27-06-2025, 01:35 AM)ReneZ Wrote: I guess the auto-citation should allow for words to be simply copied from previous ones without change...
Of course, copying without change limits the drift, but there are other means of avoiding too much drift. Circular rules can help: a -> b -> c -> ... -> a. Seeding the blank pages with copies of words from several previous sections, instead of just one recent page, also helps.
Since drift happened, there is no need for a perfect set of rules that totally prevents drift.
New patterns were added without too much concern for consistency. For example words starting with "lk" and "ll" started appearing in Q13 and there are more of them in Q20. This drift suggests that Q13 is not the last section. I think they noticed that Q13 had degenerated too much (lower lexical diversity as measured by MATTR) and avoided that pitfall later by seeding Q20 with many (more) words copied from previous sections.
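As a toy illustration of the circular-rule point (my own sketch, nothing from Torsten's code): with a closed cycle of substitutions, repeated mutation can only wander inside a finite set of variants instead of drifting indefinitely away from the starting word.

```python
# Circular substitution rules: a -> b -> c -> a.
CYCLE = {"a": "b", "b": "c", "c": "a"}

def mutate(word, pos):
    """Apply the cycle to the character at position `pos` (others untouched)."""
    ch = word[pos]
    return word[:pos] + CYCLE.get(ch, ch) + word[pos + 1:]

word = "aca"
seen = set()
for step in range(20):
    seen.add(word)
    word = mutate(word, step % len(word))
print(sorted(seen))  # every variant stays inside the 3**3 words over {a, b, c}
```

With a non-circular rule set (a -> b -> c -> d -> ...) the same loop would keep producing new characters, which is exactly the unbounded drift the circular rules prevent.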
(27-06-2025, 01:40 AM)ReneZ Wrote: Now these are the 'rules' for the text generated by the app.
These would be a model for the original Voynichese text in the MS.
However, how does the real text behave?
That it is different for a few different sections isn't a major problem in my opinion, especially if there were indeed several scribes, whose roles may have been more than just copying...
The differences in the sections of the real VMS are not a problem at all for Torsten's model, which on the contrary explains them quite neatly by having the parameters varying in time (*). Also, I agree with Torsten that many statistical differences are irrelevant, from his paper:
Quote:Of course, it is possible to pinpoint quantitative differences between the real VMS and the used facsimile text (most likely any facsimile text). An example is the quantitative deviation of the <q>-prefix distribution from the original VMS text. (...) We deliberately did not fine-tune the algorithm to pick an "optimal" sample for this presentation. Such a strategy is by itself questionable.
However, there are some other statistical differences which, in my opinion, are much harder to explain and are decidedly not what the real Voynich looks like (not that I can prove it). I gave examples in the "This Famous Medieval Book May be a Hoax" thread. All these differences cannot be accounted for by tuning the parameters, as is conceivable for the <q>-prefix distribution in the above quote (i.e. by increasing the probability that 'qo' will be added as a prefix). They instead need additional novel rules (additional parameters), which make the model less appealing the more of them there are (and I fear, but again cannot prove, that a lot of rules would indeed be needed). Also, some of these parameters/rules seem to carry across the whole manuscript with little change, and they are weird (see 'chodaiin' vs. 'oldaiin' in post #29), which is hard to explain against a background of freely (and widely) time-varying parameters. I understand a scribe can easily keep in his mind the rule <'n' almost always terminates a word> (**) for a whole meaningless book, but I find it improbable for <'chodaiin' is much preferred over 'oldaiin'>. I may be wrong of course.
(*) one could have time-varying parameters also if writing meaningless text using a proper rulebook (just change the rulebook), but Torsten's model fits better to a scenario where everything is done inside the mind of the scribe, the 'rulebook' being his intuitive feeling of what Voynichese words should look like.
(**) which also should be added to Torsten's model; it definitely generates too many 'aiinar', 'aiinal', 'danol' ...
(26-06-2025, 04:27 PM)nablator Wrote: Thanks, but I already wrote the program to count patterns of type 1a, 1b, 2a, 2b in each page a few months ago. Now I am looking for something to do next.
Sending encouragement. Do keep going. I got stuck about here myself and haven't figured out what would actually be useful to code. It's easy to count things; it's not easy to reverse-engineer this process. And even if we can get close to the ruleset used, I wonder how that helps. Hmm, maybe it serves to distinguish regions with potential for deciphering from regions that seem less likely to yield plaintext. Producing some kind of score for that might help other people focus their efforts.
(30-06-2025, 07:03 AM)Eiríkur Wrote: It's easy to count things; it's not easy to reverse-engineer this process.
A possible research topic would be to find out just how much the patterns for selection of sources and rules for evolving words can be restricted and still be able to generate all of the text (or a high percentage, if we ignore some errors, a few unreadable or unidentifiable glyphs where the scribes went crazy with boredom and just had to innovate).
The problem with rather lax patterns and evolution rules is that it is impossible to tell if a possible source-target match (even when there are several words in a pattern) is coincidental or not.
For example, to generate an entire line, a possible restriction would be to allow all sequential transfers (with any number of skips) from a maximum of two lines of the same page with a maximum of two differences per word in one chunk:
103v.3: qokeeor.chedy.qokey.dar.checthy.chor.qoty.shdy.okeedy.qokeey.qokain
A B C D
103v.6: dain.shey.qokeedy.cheol.qoeeor.lshor.qoky.shedy.qokaiin.chedy.qokam
1 2 A 3 4 B C 5 6 7 D
103v.7: daiin.shey.chol.chey.oteey.lkeeor.okaiin.shedy.shedy.qokaiin.ol.chedydy
1 2 3 4 5 5 6 7
It probably won't work with these limitations but we might be able to identify others that do. We need testable hypotheses to progress. If there are no strict rules we might be able to identify the preferences of each scribe.
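A sketch of how this restriction could be tested (my reading of it; the greedy matcher and the two-edit budget are assumptions, not anyone's established method): each word of a target line must match, in order and with skips allowed, a not-yet-passed word of one of at most two source lines, with at most two differences per word.

```python
# Edit distance and a greedy sequential matcher for the proposed restriction.

def edits(a, b):
    """Plain Levenshtein distance between two EVA words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def derivable(target, sources, max_edits=2):
    """True if every target word matches (<= max_edits edits) a word of some
    source line, consuming each source strictly left to right (skips allowed)."""
    pos = [0] * len(sources)  # next unconsumed index in each source line
    for word in target:
        for s, src in enumerate(sources):
            hit = next((i for i in range(pos[s], len(src))
                        if edits(word, src[i]) <= max_edits), None)
            if hit is not None:
                pos[s] = hit + 1
                break
        else:
            return False
    return True

line3 = "qokeeor chedy qokey dar checthy chor qoty shdy okeedy qokeey qokain".split()
line6 = "dain shey qokeedy cheol qoeeor lshor qoky shedy qokaiin chedy qokam".split()
print(derivable(line6, [line3]))  # one source line alone is not enough here
```

A greedy matcher can reject derivations that a backtracking search would find (an early match may consume a word that a later target word needed), so a real test would want exhaustive or dynamic-programming matching.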
I like this approach. One of the usual complications arises, namely that the number of permutations to test quickly increases to impractical levels.
The restrictions you put seem reasonable:
1) No need to achieve 100%
2) Limit the 'history' from which to pick new words
3) Limit the number (and type) of permutations to allow
The initialisation problem (per page) remains, but this can be 'delayed' by just ignoring the first part of each page for the time being.
The special characters (with preferences for certain locations) also remain as an issue. I have no quick solutions there, and since these happily appear in labels, I also don't yet see how labels can be addressed. (Also for other reasons).
In general, I would suggest testing a few choices for (2) and a few choices for (3), and then seeing, for each combination, what percentage is achieved for (1).
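That sweep could be organized as below (a skeleton only: `coverage` is a placeholder name for whatever derivability counter gets implemented, and the candidate values are arbitrary):

```python
from itertools import product

def coverage(history_pages, max_edits):
    """Placeholder: should return the fraction of words derivable when
    sources come from the last `history_pages` pages with at most
    `max_edits` edits per word. Stubbed to 0.0 here."""
    return 0.0

# A few choices for (2) and (3), measuring (1) for each combination.
results = {(h, e): coverage(h, e)
           for h, e in product([1, 2, 5], [1, 2, 3])}
for (h, e), pct in sorted(results.items()):
    print(f"history={h} pages, edits<={e}: {pct:.0%}")
```

The grid makes the trade-off visible: looser limits will always raise coverage, so the interesting question is where coverage stops improving.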
One approach might also be to start by replacing each character by its 'STA family'. This reduces an enormous amount of signal (but of course we can't be 100% sure if it is signal or noise).
If this leads to much higher percentages of coverage, at least we will have learned something...
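The family-collapse step could look like the sketch below. The mapping here is purely illustrative (it is NOT the actual STA families, which I don't have at hand): it merges the four gallows into one symbol plus two invented look-alike pairs, just to show the mechanics.

```python
# Hypothetical character-family map: gallows collapse to 'K';
# the d/s and o/a pairings are invented for illustration only.
FAMILY = {"k": "K", "t": "K", "p": "K", "f": "K",
          "d": "D", "s": "D",
          "o": "O", "a": "O"}

def collapse(word):
    """Rewrite each character as its family representative."""
    return "".join(FAMILY.get(ch, ch) for ch in word)

line = "qokeeor chedy qokey dar checthy chor qoty shdy okeedy qokeey qokain"
print(" ".join(collapse(w) for w in line.split()))
```

After collapsing, word matches only need to agree at the family level, so coverage percentages should rise; the open question, as noted, is how much of what was discarded was signal.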
(02-07-2025, 12:19 AM)ReneZ Wrote: The initialisation problem (per page) remains, but this can be 'delayed' by just ignoring the first part of each page for the time being.
Each paragraph could have been "seeded" from other pages, there is no reason for a strictly top-down writing order, and we already know there is evidence against it. Maybe they wrote odd-numbered lines before even-numbered lines, like 103v.3 and 103v.7 before 103v.6.
With a probably symmetrical "possible source-target" relation, there would be no way to identify which is the source and which is the target. If we are very lucky and manage to build a graph of these relations for each page without too many false positives, maybe with the help of additional (even more unlikely!) restrictions such as "do not reuse a source", the "seeds" would be found at the "center" of the graph for some (fuzzy?) definition of "center".

Sounds awfully complicated ...
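The "center of the graph" idea can still be made concrete with eccentricities (toy graph and node names invented for illustration): candidate seeds would be the nodes from which every other node is reachable in the fewest hops.

```python
from collections import deque

# Lines are nodes; an edge means a possible (symmetric) source/target relation.
edges = [("seed", "l1"), ("seed", "l2"), ("l1", "l3"), ("l2", "l4")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def eccentricity(node):
    """Longest shortest-path distance from `node`, via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in graph[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return max(dist.values())

center = min(graph, key=eccentricity)
print(center)  # the seed node has the smallest eccentricity in this toy graph
```

This assumes the relation graph per page is connected; with false positives or several independent seeds it would fall apart into components, each needing its own center.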
Another possible investigation is to try to find out whether a specific set of evolution rules could be responsible for Massimiliano Zattera's slot sequence, or a similar (simplified) one, and to derive this set by maximizing its probability against a VM transliteration, instead of just guessing it.
For example, in the other thread:
(02-07-2025, 01:05 AM)Bluetoes101 Wrote: You end up running into "cheeo" however, which is fairly common.
This is an interesting one. It appears (as a word) only once in f66r, 15 times in Q20.
I wonder which evolution rule could have caused its late appearance, or if a new rule was added for Q20.
An evolution rule like "ey." → "eo." is unlikely, it would probably have produced "cheeo" from "cheey" earlier.
('cheey': 176 instances)
(02-07-2025, 12:19 AM)ReneZ Wrote: The initialisation problem (per page) remains, but this can be 'delayed' by just ignoring the first part of each page for the time being.
What do you mean exactly by "initialisation problem"?
If the "initialisation problem" is not knowing what/where the seeds are, I don't see why it is a problem.
To seed the page, I suppose it is reasonable to assume that they used the same procedure as for local self-citations, the only difference being that the sources are on a different page.