The Voynich Ninja

Something that I very much want to do (I will have time next week), prompted by Torsten Timm's post here: [link]

Quote:If it comes to statistics for the VMS it is important to keep in mind that the text isn't homogenous. The text differs from Currier A to Currier B, from quire to quire, from bifolio to bifolio, from page to page and even within a paragraph or line.

How abnormal are paragraphs/pages/quires? Does homogeneity improve at the end? All pages of Q13 and Q20 look normal to me relative to their quire, without all the weird spikes in frequencies seen in the early herbal pages, but this should be checked.

I need statistics / probability calculation advice.

What would be the best way to "objectively" (for a given transliteration) measure how normal (or abnormal) a chunk of text is compared to its context (page / quire / entire Currier language)?

I guess it would be best to count mono-/bi-/tri-grams and compare (how?) these statistics to the expected binomial distribution. Of course large variations in frequencies are expected when the size of the sample is small, so the metric has to take the size of the sample into account somehow.
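A minimal sketch of the binomial comparison described above, not anyone's actual method. It assumes the transliteration is available as a plain string (spaces included), and it treats each n-gram slot as an independent trial, which is only an approximation. The function names are made up for illustration. Dividing by the binomial standard deviation is what takes the sample size into account: the same relative deviation gives a larger z-score in a larger chunk.

```python
import math
from collections import Counter

def ngrams(text, n):
    """Return all overlapping n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def binomial_z_scores(chunk, context, n=2):
    """z-score of each n-gram count in `chunk` against its rate in `context`.

    Assumption: n-gram slots are independent binomial trials.
    N-grams that never occur in `context` are not scored here.
    """
    ctx = Counter(ngrams(context, n))
    ctx_total = sum(ctx.values())
    obs = Counter(ngrams(chunk, n))
    size = sum(obs.values())
    scores = {}
    for gram, count in ctx.items():
        p = count / ctx_total                 # rate in the context
        expected = size * p                   # expected count in the chunk
        sd = math.sqrt(size * p * (1 - p))    # binomial standard deviation
        if sd > 0:
            scores[gram] = (obs.get(gram, 0) - expected) / sd
    return scores
```

A chunk that matches its context exactly scores 0 for every n-gram; large positive or negative values flag over- or under-represented patterns.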

Or maybe the probability of the sample (paragraph or page) could simply be calculated the same way as in the hill-climbing algorithm, assuming independence between the probabilities of each n-gram (not actually true, but maybe good enough for a quick approximation): multiply the probabilities (or better, add the logarithms of the probabilities) of each n-gram, estimated from overall frequency, then normalize for size (divide the sum by the size of the sample)?
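The log-probability idea could be sketched as follows. This is only an illustration under stated assumptions: n-grams are treated as independent, the text is a plain string, and add-one smoothing (the `alpha` parameter, not mentioned in the post) is added so that an n-gram missing from the context does not produce log(0). The function name is hypothetical.

```python
import math
from collections import Counter

def ngrams(text, n):
    """Return all overlapping n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def mean_log_prob(chunk, context, n=2, alpha=1.0):
    """Average log-probability per n-gram of `chunk` under `context` frequencies.

    Assumption: n-grams are independent; `alpha` is additive smoothing.
    """
    ctx = Counter(ngrams(context, n))
    grams = ngrams(chunk, n)
    if not grams:
        raise ValueError("chunk is shorter than n")
    vocab = set(ctx) | set(grams)
    total = sum(ctx.values()) + alpha * len(vocab)
    log_sum = sum(math.log((ctx[g] + alpha) / total) for g in grams)
    return log_sum / len(grams)   # normalize for sample size
```

Higher (less negative) values mean the chunk looks more like its context. The normalized score is comparable across chunks of different length, but its uncertainty is still larger for short chunks.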

(19-12-2022, 04:28 PM)nablator Wrote: How abnormal are paragraphs/pages/quires? Does homogeneity improve at the end (Q20)?

Oh, wow. This would be excellent information to have. Lots of possible results that could be really helpful for interpreting both prior and future work.

I can imagine results that would help with trying to determine how often the "paradigm" is shifted -- because I am losing hope that it is just two (Currier A and B).

I don't have the background expertise to advise, but just wanted to cheer you on in such efforts!

One question could be whether the 5 hands identified by Lisa Fagin Davis are reflected in the homogeneity of individual sections / pages.

I expect the lowest probability should be reached by pages such as [link] and [link] (no e), [link] (no n), or, more likely, by the combination of multiple less visible statistical anomalies on the same page, and, at the quire level, when some patterns are over-represented or some common pattern is missing (Q13 is amazing). Larger sample = less uncertain result. All these results could be compared to various texts (in any natural language) to show just how much they differ.

At the word level there is probably too much statistical noise to get meaningful results at the sample size that we want. I hope there are enough smaller chunks of text (length 1, 2, 3) to get a trustworthy result.

(19-12-2022, 06:21 PM)nablator Wrote: I expect the lowest probability should be reached by pages such as [link] and [link] (no e), [link] (no n), and, at the quire level, when any pattern is over-represented or common pattern is totally missing (Q13 is amazing).

The [link] and [link] also have some otherwise unique sequences I think - [link] has one of the two dydydy sequences; and it also has a very similar dydyodyd series below it in the very next line. [link] has the only occurrence of a chschs sequence and an rrr (or sss) sequence.
These are not features per se when it comes to the VMS, but they give the feeling that the text contains very little if any information, and that the scribe kind of made it up according to certain rules (something like a Cardan grille or a reverse Cardan grille).

(19-12-2022, 04:28 PM)nablator Wrote: How abnormal are paragraphs/pages/quires? Does homogeneity improve at the end? All pages of Q13 and Q20 look normal to me relative to their quire, without all the weird spikes in frequencies seen in the early herbal pages, but this should be checked.

Equally distributed glyphs or words do not exist in the VMS. (See also my [link] about the distribution of vords containing the sequences 'ed', 'ho', and 'in' in the VMS.)

See for instance the vord <qokeey>. It is the most common vord on [link] and the third most frequent vord in Quire 20 (see [link]). However, <qokeey> is used frequently only on some of the pages in Q20. On three pages <qokeey> is even absent. You can choose any vord you like; the behavior is always the same: "No obvious rule can be deduced which words form the top-frequency tokens at a specific location, since a token dominating one page might be rare or missing on the next one." (Timm & Schinner 2019, p. 3). See also [link]@Github.

Instances of <qokeey> in Q20:

Code:
f103r 26 instances (most common word)
f103v 11 instances
f104r 1 instance
f104v 5 instances
f105r 3 instances
f105v missing
f106r 1 instance
f106v 4 instances
f107r 1 instance
f107v 12 instances (third most common word)
f108r 9 instances
f108v 21 instances (second most common word)
f111r 21 instances (most common word)
f111v 10 instances
f112r 12 instances (most common word)
f112v 5 instances
f113r 4 instances
f113v 3 instances
f114r missing
f114v missing
f115r 1 instance
f115v 1 instance
f116r 9 instances

However, all pages containing at least some lines of text do have in common that pairs of frequently used words with high mutual similarity appear (see [link], p. 3). For instance, on [link] the words <qokeedy>, <okeey> and <qokey> also exist alongside <qokeey> (see [link]).

Hello Torsten,

Yes, the variations in counts per page are high for qokeey, but I would rather focus on shorter patterns, far more numerous. In a natural language short patterns are less dependent on meaning and therefore their frequency does not vary so much with topic. I want to calculate the probability of the actual statistics assuming the distribution is uniform. Of course it is not uniform, but this is only to measure how far we are from a uniform distribution.

The statistical anomalies at the page scale in Q13 and Q20 seem less obvious; the pages look relatively homogeneous. You need to count words to see the differences between pages, at the quire scale. On the other hand, you have Herbal A pages like [link] that have way too many "or", which is immediately visible at the line or paragraph scale. The variation in frequency of 'ed', 'ho', 'in' in Q13 and Q20 doesn't look as extreme, visually, on voynichese.com. 'or' looks even more unevenly distributed in Q20 than these three, but I haven't done the calculations.

Quoting Nick Pelling, Q13 and Q20 are "the two large relatively homogeneous blocks of text": [link]

(19-12-2022, 11:40 PM)nablator Wrote: I would rather focus on shorter patterns, far more numerous.

One pattern for the VMS is that high-frequency tokens also tend to have high numbers of similar vords. This means 'isolated' words (i.e. without any similar ones) usually appear just once in the entire VMS, while the most frequent token <daiin> (836 occurrences) has 36 counterparts with edit distance 1 [see Timm & Schinner p. 6]. In other words, instead of tracing a certain glyph sequence you can also trace the most common vord containing this sequence. For instance, to distinguish between Currier A and Currier B you can either trace the glyph combination 'ed' or you can trace the vord <chedy> [see Timm & Schinner 2019, p. 6].
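To make "counterparts with edit distance 1" concrete, here is a small sketch that lists the vord types one insertion, deletion or substitution away from a given vord. The helper names and the toy vocabulary are illustrative assumptions only; real counts such as the 36 neighbours of <daiin> depend on the transliteration used.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one insertion, deletion or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make `a` the shorter (or equal) one
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution
    return a[i:] == b[i + 1:]            # one extra glyph in `b`

def neighbours(word, vocabulary):
    """Sorted vord types in `vocabulary` at edit distance 1 from `word`."""
    return sorted(w for w in set(vocabulary) if within_one_edit(word, w))

# Toy vocabulary, for illustration only (not a real transliteration).
toy = ["daiin", "dain", "aiin", "qokeey", "qokeedy", "okeey", "qokey", "chedy"]
print(neighbours("qokeey", toy))   # ['okeey', 'qokeedy', 'qokey']
print(neighbours("daiin", toy))    # ['aiin', 'dain']
```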

For quire 20 see also my [link] about the distribution of vords containing the glyph combinations 'ed', 'ho', and 'in' in the VMS: "For the stars section it is even possible to point to pages dominated by vords containing 'ed' (see f111r) whereas the very next page is dominated by 'in'-vords. Even another page within the very same section contains an unusual high number of vords containing 'ho' (see f113r)."

The right tool for the job (I hope): the chi-squared test of homogeneity.
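A hedged sketch of what such a test computes, in plain Python. The input is assumed to be a contingency table with one row per page and one column per category (for example "tokens containing a pattern" and "tokens not containing it"). The function name is made up, and the example counts in the note below are invented, not taken from the manuscript.

```python
def chi_squared_homogeneity(table):
    """Pearson chi-squared statistic and degrees of freedom for a count table.

    `table` is a list of rows (one per page), each a list of counts
    (one per category). Null hypothesis: all rows share the same
    category proportions.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / grand   # expected count under homogeneity
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof
```

For instance, `chi_squared_homogeneity([[10, 20], [20, 40]])` gives a statistic of 0 with 1 degree of freedom, since both rows have the same proportions. The statistic is then compared with the chi-squared distribution for `dof` degrees of freedom (the 5% critical value for 1 degree of freedom is 3.841). The usual caveat applies: the approximation is unreliable when expected counts are small, which matters for short pages and rare patterns.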

Voynich Manuscript: Q20 (in-)homogeneity and BAAFU…

nablator

MichelleL11

bi3mw

nablator

Common_Man

Torsten

nablator

Torsten

nablator

nablator