The Voynich Ninja

Pages: 1 2 3

(05-06-2026, 02:44 PM)dashstofsk Wrote: Would it be possible for your generator to output its text in IVTFF format? And if it could generate 20 pages and about 7000 words total I would then be better able to compare it with quire 13.

A direct comparison to Q13 would require section-specific training: some glyph trigrams don't occur there, Q13 has a lower MATTR than any other section, and 'm' always occurs at the end of lines.
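For readers unfamiliar with MATTR (moving-average type-token ratio), here is a minimal sketch. The window size of 500 words is an assumption for illustration; the thread does not say which window was used for the Q13 comparison.

```python
from collections import Counter

def mattr(words, window=500):
    """Mean type-token ratio over all sliding windows of `window` words.

    `window=500` is an assumed default, not a value taken from the thread.
    """
    if not words:
        raise ValueError("need at least one word")
    if len(words) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(words)) / len(words)
    counts = Counter(words[:window])
    total_types = len(counts)          # distinct words in the first window
    n_windows = len(words) - window + 1
    for i in range(window, len(words)):
        leaving = words[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[words[i]] += 1
        total_types += len(counts)     # distinct words in the current window
    return total_types / (window * n_windows)
```

A lower value means more repetition inside each window, which is the property attributed to Q13 above.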

What I'd like to see is how a text, not generated by any kind of "self-citation" algorithm, but closely modeled on the real text, compares to the VMS in terms of local similarity metrics.

(05-06-2026, 02:44 PM)dashstofsk Wrote: Would it be possible for your generator to output its text in IVTFF format? And if it could generate 20 pages and about 7000 words total I would then be better able to compare it with quire 13.

Will do next week if I can, thanks!

(05-06-2026, 04:37 PM)nablator Wrote:

(05-06-2026, 02:44 PM)dashstofsk Wrote: Would it be possible for your generator to output its text in IVTFF format? And if it could generate 20 pages and about 7000 words total I would then be better able to compare it with quire 13.

A direct comparison to Q13 would require section-specific training: some glyph trigrams don't occur there, Q13 has a lower MATTR than any other section, and 'm' always occurs at the end of lines.

What I'd like to see is how a text, not generated by any kind of "self-citation" algorithm, but closely modeled on the real text, compares to the VMS in terms of local similarity metrics.

Interesting... which kind of metrics do you have in mind?

(05-06-2026, 04:39 PM)Labyrinthinesecurity Wrote: Interesting... which kind of metrics do you have in mind?

The one that Torsten Timm used: average edit distance between all words of different lines. See the comparison with other texts here: [link]
I guess there should be no detectable effect (local inter-line similarity) in your generated text.
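A minimal sketch of this kind of metric, assuming plain Levenshtein distance and assuming the average is reported per line gap (1 = adjacent lines, 2 = one line in between, and so on). Both choices, and the names `levenshtein` and `mean_distance_by_line_gap`, are illustrative assumptions; this is not necessarily Timm's exact procedure.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def mean_distance_by_line_gap(lines, max_gap=5):
    """lines: list of lines, each a list of words.

    Returns {gap: mean edit distance over all word pairs taken from
    two lines that are `gap` lines apart}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, line_a in enumerate(lines):
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j >= len(lines):
                break
            for wa in line_a:
                for wb in lines[j]:
                    sums[gap] += levenshtein(wa, wb)
                    counts[gap] += 1
    return {gap: sums[gap] / counts[gap] for gap in sorted(counts)}
```

If nearby lines share similar words, the mean for small gaps should be lower than for large gaps; a text without local similarity should give a roughly flat curve.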

I wrote a babble detector very recently based on Byte Pair Encoding, not optimal compression I know, so the results are a bit random, but it does seem to find the babble-like sequences quite efficiently (in a window of n words) so it's good enough for a quick comparison of the amount of such sequences. I suppose your generated text should behave like a word-shuffled version of the section(s) it has been trained on, i.e. less babble-like.

For the comparison to be more significant I guess some LAAFU statistics and inter-word statistics would be helpful, as they might add a little bias in local similarity. They probably don't matter for the average edit distance between all words of different lines, but might slightly affect the babble detector on very short sequences of words. For this I used character trigram statistics with special characters for line start and line end in my simple Markov chain generator (without a word model, so the word length distribution was very wrong).
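A character-trigram Markov chain with line-start and line-end markers can be sketched roughly as follows. The marker characters `^` and `$` and the function names are assumptions for illustration (the markers must not occur in the transliteration); this is not the actual generator described above.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed marker characters, absent from the text itself

def train(lines):
    """Count, for every two-character context, which character follows it."""
    model = defaultdict(Counter)
    for line in lines:
        s = START * 2 + line + END
        for i in range(len(s) - 2):
            model[s[i:i + 2]][s[i + 2]] += 1
    return model

def generate_line(model, rng, max_len=200):
    """Sample one line, character by character, until the END marker is drawn."""
    context = START * 2
    out = []
    while len(out) < max_len:
        followers = model[context]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        context = context[1] + nxt
    return "".join(out)
```

Usage: `generate_line(train(lines), random.Random(0))`, where `lines` are transliterated lines with spaces as word separators. Spaces are treated like any other character, which is why a model like this has no explicit control over word length.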

Here's my big question. Is it something a human scribe could do?

Ah, never mind. Read the details.

How many unique word types are in the lexical buckets?

(05-06-2026, 02:29 PM)Labyrinthinesecurity Wrote: Interesting! Was it Currier A or B? In any case, it fails 2 of the 4 signatures:

It was on the whole text, so a mixture of A and B.

(05-06-2026, 04:37 PM)nablator Wrote: What I'd like to see is how a text, not generated by any kind of "self-citation" algorithm, but closely modeled on the real text, compares to the VMS in terms of local similarity metrics.

Sorry if I was a little cryptic with this. What I had in mind is a generator that would naturally produce locally similar words by randomly reducing, sometimes drastically, the set of generative components, at various scales. The reduced set could be modeled on actual paragraphs/pages/sections for maximum fidelity.

(06-06-2026, 04:16 AM)Dunsel Wrote: Here's my big question. Is it something a human scribe could do?

Ah, never mind. Read the details.

How many unique word types are in the lexical buckets?

On average, about 20 unique words per bucket for Currier A and 40 for Currier B. This is a very rough estimate, since the bucket sizes are very unevenly distributed (fat-tailed).

Information-theoretic background supporting the design choices made in the generator:

Why we collapse sh with ch
sh and ch behave somewhat like allomorphs: collapsing sh→ch reduces character bigram conditional entropy by 3.0-3.7% with zero change in stream length. The effect is consistent across both dialects and, in both, stronger in H2 than in H1 (ratio 2.8-3.1x).
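A minimal sketch of how such a measurement could be set up, assuming the entropy is the conditional entropy H(next | previous) over adjacent characters of an EVA string, and that the collapse is a plain `sh` to `ch` substitution. The function names are hypothetical and the exact tokenization used for the figures above is not stated.

```python
import math
from collections import Counter

def conditional_entropy(text):
    """H(X_n | X_{n-1}) in bits, estimated from adjacent character pairs."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])          # how often each character starts a pair
    n = sum(pairs.values())
    return -sum(c / n * math.log2(c / firsts[a]) for (a, _), c in pairs.items())

def collapse_sh(text):
    """Replace every 'sh' with 'ch'; the string length is unchanged."""
    return text.replace("sh", "ch")

def relative_entropy_change(text):
    """Fractional change in conditional entropy after the collapse (negative = reduction)."""
    before = conditional_entropy(text)
    after = conditional_entropy(collapse_sh(text))
    return (after - before) / before
```

Run on a full transliteration, `relative_entropy_change` gives the kind of percentage quoted above; no particular value is claimed here.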

Why we move gallows to prefix position in cCh clusters
Gallows in cCh clusters seem to function as prefixes: of three tested reorderings (cCh→ch deletion, cCh→Cch prefix, cCh→chC suffix), the prefix reordering maximizes bigram mutual information and produces the best structural improvement in both dialects.
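The prefix reordering and the bigram mutual information it is scored with can be sketched as below. This assumes EVA strings in which the gallows inside a bench are written `cth`, `ckh`, `cph`, `cfh`, and mutual information between adjacent characters; the function names are hypothetical and the other two reorderings are omitted.

```python
import math
import re
from collections import Counter

def reorder_gallows_prefix(text):
    """cCh -> Cch: move the gallows (t, k, p, f) in front of the bench."""
    return re.sub(r"c([tkpf])h", r"\1ch", text)

def bigram_mutual_information(text):
    """I(X_{n-1}; X_n) in bits, estimated from adjacent character pairs."""
    pairs = Counter(zip(text, text[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c     # marginal count of a as first character
        right[b] += c    # marginal count of b as second character
    # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b])
    return sum(c / n * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in pairs.items())
```

Comparing `bigram_mutual_information(text)` with `bigram_mutual_information(reorder_gallows_prefix(text))` on each dialect would be one way to reproduce the comparison described above.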

Gallows choice (t/k/p/f) within a given template position shows zero residual entropy after template and distributional mode conditioning in the slot-level decomposition.
However, gallows contribute substantially to template identity (the difference between CVX and GCVX is informationally significant) and show strong positional preferences (gallows-initial templates are overrepresented 50-80x at page starts).
Gallows are highly constrained and locally determined, but not informationally null at the template level.

(08-06-2026, 01:59 PM)Labyrinthinesecurity Wrote: Why we collapse sh with ch
sh and ch behave somewhat like allomorphs: collapsing sh→ch reduces character bigram conditional entropy by 3.0-3.7% with zero change in stream length. The effect is consistent across both dialects and, in both, stronger in H2 than in H1 (ratio 2.8-3.1x).

Collapsing all characters to X will reduce the entropy to zero with zero change in stream length. I'm not sure how this is an explanation for collapsing sh with ch. Also, isn't the problem with Voynichese that the conditional entropy is already too low?

(08-06-2026, 02:44 PM)oshfdk Wrote:
(08-06-2026, 01:59 PM)Labyrinthinesecurity Wrote: Why we collapse sh with ch
sh and ch behave somewhat like allomorphs: collapsing sh→ch reduces character bigram conditional entropy by 3.0-3.7% with zero change in stream length. The effect is consistent across both dialects and, in both, stronger in H2 than in H1 (ratio 2.8-3.1x).

Collapsing all characters to X will reduce the entropy to zero with zero change in stream length. I'm not sure how this is an explanation for collapsing sh with ch. Also, isn't the problem with Voynichese that the conditional entropy is already too low?

Indeed. I have a better explanation: the distributions of characters preceding and following 'ch' and 'sh' are very similar. These are the geometric (root-mean-squares) distances of the Voynich characters ('ch' = C, 'sh' = S) according to the distributions of the previous and following character on the whole RF1a-n text (min possible value = 0, max possible value = sqrt(2)):

[attachment=15968]

The distance between 'ch' and 'sh' according to the following character is 0.13, according to the previous character is 0.18, both low. There are other character couplets with lower distances, e.g. 'm' and 'n' have a distance_following of 0.03 (they are both mostly followed by space), but none with a lower distance in both directions (e.g. 'm' and 'n' are 1.18 apart according to the previous character).
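A minimal sketch of such a distance, assuming it is the plain Euclidean distance between the two probability vectors of neighbouring characters. That assumption is consistent with the stated range (0 for identical distributions, sqrt(2) for non-overlapping ones), but the exact formula behind the figures above is not shown in the post. The input is assumed to be a token sequence in which 'ch' and 'sh' are already single symbols (C and S).

```python
import math
from collections import Counter

def neighbour_distribution(tokens, glyph, offset):
    """Distribution of the token at position i + offset for every i where
    tokens[i] == glyph (offset=+1: following, offset=-1: previous)."""
    counts = Counter(
        tokens[i + offset]
        for i, t in enumerate(tokens)
        if t == glyph and 0 <= i + offset < len(tokens)
    )
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def distance(p, q):
    """Euclidean distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def glyph_distance(tokens, a, b, offset):
    return distance(neighbour_distribution(tokens, a, offset),
                    neighbour_distribution(tokens, b, offset))
```

Usage: `glyph_distance(tokens, "C", "S", +1)` for distance_following and `glyph_distance(tokens, "C", "S", -1)` for distance_previous.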

My conclusion is that this supports the idea that 'ch' and 'sh' might be the same character, but is far from proving it (and I'm actually not convinced by it). I'd be more confident to say that 'ch' and 'sh' have a very similar role, whichever that might be.

By the way, iirc the next most similar couplet is 'ckh' and 'cth' (K and T in the above diagrams), distance_following = 0.18 and distance_previous = 0.34


nablator

Labyrinthinesecurity

nablator

Dunsel

Mauro

nablator

Labyrinthinesecurity

Labyrinthinesecurity

oshfdk

Mauro