The Voynich Ninja

(29-09-2025, 10:50 PM)SherriMM Wrote: Also my statistics only include the 18 line-initial characters, not any amount of random letters. Of the 18 characters, should I compute based on frequency?

If you assume that those 18 characters are equally likely then P(q) = P(o) = P(y) = 1/18 = ~0.056. Then, if each line-initial letter is just chosen at random, independently, in 1000 lines you would expect 998*(1/18)^3 = ~0.17 occurrences of qoy; that is none, or maybe one or two.

But suppose that (say) P(q) = P(o) = P(y) = 0.30, with all the other 15 letters occurring only on 10% of the lines. Then in 1000 lines, with each initial being chosen at random and independently, you should expect 998*(0.30)^3 = ~27 occurrences of qoy.

Thus, in order to tell whether the number of qoy occurrences is anomalous, you must consider the actual frequencies of the letters in line-initial positions.
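For anyone who wants to redo the arithmetic with other numbers, here is a minimal Python sketch of the formula used in the two scenarios above. The function name and its parameters are made up for illustration; the only assumption is the one already stated, namely that each line-initial letter is drawn independently.

```python
def expected_trigram_count(n_lines, p_first, p_second, p_third):
    """Expected number of times a given three-letter pattern appears in
    consecutive line-initial positions, assuming independent draws."""
    windows = n_lines - 2  # number of three-line windows in n_lines lines
    return windows * p_first * p_second * p_third

# Scenario 1: all 18 line-initial characters equally likely.
uniform = expected_trigram_count(1000, 1 / 18, 1 / 18, 1 / 18)

# Scenario 2: q, o and y each start 30% of the lines.
skewed = expected_trigram_count(1000, 0.30, 0.30, 0.30)

print(f"uniform: {uniform:.2f}  skewed: {skewed:.1f}")
```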

And, again, note that some three-glyph sequences will be more common than what the formula says. For instance, in the second example above, the string qoy may occur only 25 times, but qyo may occur 31 times, oqy 26 times, oyq 33 times... If you pick the most common three-letter pattern, that pattern will be more common than expected.
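A quick simulation makes this point concrete. The sketch below is only an illustration under the assumptions of the second example: q, o and y each with probability 0.30, and the other 15 letters lumped into a single placeholder symbol "x" (my simplification, not part of the example). The seed and function name are arbitrary.

```python
import random
from collections import Counter
from itertools import permutations

def simulate_trigram_counts(n_lines, letters, weights, seed=0):
    """Draw n_lines independent line-initial letters and count every
    three-letter window of consecutive lines."""
    rng = random.Random(seed)
    initials = rng.choices(letters, weights=weights, k=n_lines)
    return Counter("".join(initials[i:i + 3]) for i in range(n_lines - 2))

letters = ["q", "o", "y", "x"]      # "x" stands for all other letters
weights = [0.30, 0.30, 0.30, 0.10]
counts = simulate_trigram_counts(1000, letters, weights)

expected = 998 * 0.30 ** 3          # same value for every ordering of q, o, y
patterns = ["".join(p) for p in permutations("qoy")]
for pattern in patterns:
    print(pattern, counts[pattern])
print("expected for each:", round(expected, 1))
print("largest of the six:", max(counts[p] for p in patterns))
```

The six counts scatter around the same expected value, so the largest of them will usually sit above it even though nothing unusual is going on.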

All the best, --jorge

Some numbers from RF1b-er.txt

(30-09-2025, 12:25 AM)RobGea Wrote:
Code:
First_letter_of_a_line counts from RF1b-er.txt
Total first_letter_of_a_line character Tokens: 4130
d 640 15.4964% rf: 0.155
s 624 15.109% rf: 0.1511
y 610 14.77% rf: 0.1477
o 544 13.1719% rf: 0.1317
q 531 12.8571% rf: 0.1286
...

OK, so (n-2)P(q)P(o)P(y) = 4128 * 0.129 * 0.132 * 0.148 = ~10.4.

Meaning that, if the line-initial letters had been chosen randomly and independently with those frequencies, in those 4130 lines you should expect to get 10 qoy or thereabouts.

Also about a dozen dsy, a dozen sdy, a dozen yds, ten qyd, ...
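The same estimate can be recomputed from the raw counts in the quoted table. This is just a sketch: the dictionary holds the five counts listed above, the function name is hypothetical, and because it uses unrounded frequencies the results can differ slightly from the hand calculation.

```python
# Line-initial counts for the five most common letters, from the quoted table.
first_letter_counts = {"d": 640, "s": 624, "y": 610, "o": 544, "q": 531}
total_lines = 4130

def expected_count(pattern, counts, n_lines):
    """Expected occurrences of `pattern` in consecutive line-initial
    positions if letters were drawn independently with these frequencies."""
    expected = n_lines - len(pattern) + 1  # number of windows
    for letter in pattern:
        expected *= counts[letter] / n_lines
    return expected

for pattern in ["qoy", "dsy", "sdy", "yds", "qyd"]:
    value = expected_count(pattern, first_letter_counts, total_lines)
    print(pattern, round(value, 1))
```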

All the best, --jorge

If the letters were intentionally interspersed so that no line repeats the previous line's initial character, the expected number of qoy occurrences would be ~13.6.
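The post does not say how this figure was obtained. One reading that lands close to it is to keep the same letter frequencies but forbid a letter from following itself, renormalising the remaining probabilities. The sketch below is my assumption about the model, not a confirmed reconstruction, and it glosses over the fact that such a rule slightly changes the overall letter frequencies.

```python
def expected_trigram_no_repeat(n_lines, p_first, p_second, p_third):
    """Expected count of a three-letter pattern (all letters distinct from
    their predecessor) when a letter may never repeat the previous one."""
    # Excluding the previous letter rescales the others by 1 / (1 - p_prev).
    p_second_given_first = p_second / (1.0 - p_first)
    p_third_given_second = p_third / (1.0 - p_second)
    return (n_lines - 2) * p_first * p_second_given_first * p_third_given_second

# q, o, y frequencies from the RF1b-er.txt table quoted earlier.
no_repeat_qoy = expected_trigram_no_repeat(4130, 0.1286, 0.1317, 0.1477)
print(round(no_repeat_qoy, 1))  # close to the 13.6 quoted in the post
```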

(06-10-2025, 05:43 AM)Jorge_Stolfi Wrote: if the line-initial letters had been chosen randomly and independently with those frequencies

Two random variables X and Y are independent, by definition, if Prob(X=x and Y=y) = Prob(X=x)Prob(Y=y) for any possible values x, y.

If each letter S[k] in the sequence is independent from all the others, in particular it must be independent from the previous one S[k-1]. You can test this condition by computing the frequency P(xy) of S[k-1]S[k] = xy and comparing it with the product of frequencies P(x)P(y). For example, since P(q) = 0.1286 and P(o) = 0.1317, the frequency P(qo) of qo in consecutive positions should be close to 0.1286*0.1317 = 0.0169. That is, P(qo)/(P(q)P(o)) should be close to 1.

However the frequencies are only an approximation to the probabilities, and the error is large when the letters and letter pairs occur only a few times in the sample. So this test should be applied only to the most common letters, like the top five in your list.
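Here is a rough Python sketch of that test, assuming the line-initial letters are available as a list in page order. The function name and the `min_count` cutoff (30 by default) are arbitrary choices of mine, meant only to skip letters too rare for the ratio to mean much.

```python
from collections import Counter

def pair_ratio(initials, x, y, min_count=30):
    """Return P(xy) / (P(x) P(y)) for consecutive line-initial letters,
    or None when either letter is too rare to give a stable estimate."""
    n = len(initials)
    letter_counts = Counter(initials)
    pair_counts = Counter(zip(initials, initials[1:]))
    if letter_counts[x] < min_count or letter_counts[y] < min_count:
        return None
    p_x = letter_counts[x] / n
    p_y = letter_counts[y] / n
    p_xy = pair_counts[(x, y)] / (n - 1)  # n - 1 consecutive pairs
    return p_xy / (p_x * p_y)

# Toy input in which q and o strictly alternate, so q-o is over-represented
# and q-q never occurs.
toy = list("qo" * 40)
print(pair_ratio(toy, "q", "o", min_count=5))
print(pair_ratio(toy, "q", "q", min_count=5))
```

A ratio near 1 is what independence predicts; values far from 1 for common letters are the ones worth a second look.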

Also, this test only checks whether each letter is independent from the previous one. If the line-initial letters pass this test, there may still be more complicated dependencies. Like "S[k] is equal to S[k-3] 70% of the time, unless S[k-7] is q, in which case ..."

Also, if one computes a large number of statistics about something, some of the statistics will be "anomalous" just by chance. Like, if you ask 1024 financial gurus in the morning whether a certain stock will go up or down during the day, for 10 consecutive days, there is a good chance that one of them will be correct all 10 times. It does not follow that her predictions are better than the others. Or better than flipping a coin...
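The guru example is easy to put a number on. The sketch below assumes each guru simply guesses up or down with probability 1/2, independently of the others and of the previous days; the function name is made up.

```python
def prob_some_guru_always_right(n_gurus=1024, n_days=10):
    """Probability that at least one of n_gurus random guessers is right
    on all n_days, assuming independent fair coin flips."""
    p_one_guru_perfect = 0.5 ** n_days
    return 1.0 - (1.0 - p_one_guru_perfect) ** n_gurus

p_lucky = prob_some_guru_always_right()
print(round(p_lucky, 3))  # about 0.632
```

So a perfect ten-day record somewhere in the group is more likely than not, with no skill involved.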

All the best, --jorge

Excuse the intrusion, I'm a programmer (a very bad one), I want to share my perspective if you don't mind. Is it a feasible theory that each paragraph is a block composed of lines that are in turn self-sufficient? Each line would represent a different but sequential process, meaning you could only complete the second line once the first line of the block was done and executed.

Thanks to the guy who started this thread, I was able to see the patterns at both the beginning and end of the lines. That would also perhaps explain the text's weak narrative structure, but it does have a certain component of logical processes... Like A -> B -> C

Here are some prefixes that begin lines and some suffixes (tags) that end them, all of which are constantly repeated.

Prefixes, which usually start the supposed instructions: qok-, ot-, d-, y-

Suffixes or tags ending the sequences: -dy, -ain, -al, -ol

In short, each block or paragraph is composed of different processes that begin and end on a single line. But this only applies to the botany section; I haven't explored the rest of the manuscript yet.

Thanks SherriMM,

I think you have raised something of genuine significance. But perhaps the way to analyse this further is not to look at 3 or higher character string repeats but to look at the distribution of 2-character repeats.

For instance, in the starting characters for You are not allowed to view links. Register or Login to view. ( using the GC transliteration ) a q line is followed 7 times by one of the ch variants, and there appears to be a pattern. This is curious.

[attachment=11595]

In f83r s lines seem to appear too often. Curious also.

[attachment=11594]

Taking only paragraph text from the GC transliteration and not including lines spanning page breaks, the matrix of occurrences for the character pairs is given here.

( counts )

For instance, there are 80 lines starting with d that are followed by a line starting with y. But if you look at the matrix of affinities, the ratios of counts against expected values, then you get something rather unexpected. d followed by d occurs 50 times, but this is 0.52 of what would be expected if the lines were randomly shuffled. But also look at o-o ( 0.25 ), q-q ( 0.3 ), t-t ( 0.59 ). These doubles all occur less often than expected. Yet if you look at s-s ( 1.42 ), it is high. s seems to have a liking for itself more than for other characters.

( affinities )

Also look at some of the highs. q-Sh and q-ch occur 2.18 and 3.75 times what would be expected. o-q occurs 134 times but is 2.01 times what would be expected. Also look at some of the lows. s-ch hardly ever occurs. These are big swings from parity.
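For readers who want to build a similar matrix themselves, here is a minimal sketch of one way to compute such ratios. It is not necessarily how the matrices above were produced: the expected value used here is the exact mean pair count over all random shufflings of the same letters, and the whole text is treated as a single sequence, so pairs across page breaks are not excluded. The function name is made up.

```python
from collections import Counter

def pair_affinities(initials):
    """Observed / expected counts for consecutive line-initial letters,
    where 'expected' is the mean count over random shuffles of the lines."""
    n = len(initials)
    letter_counts = Counter(initials)
    observed = Counter(zip(initials, initials[1:]))
    affinities = {}
    for x in letter_counts:
        for y in letter_counts:
            if x == y:
                # A letter cannot pair with the same occurrence of itself.
                expected = letter_counts[x] * (letter_counts[x] - 1) / n
            else:
                expected = letter_counts[x] * letter_counts[y] / n
            if expected > 0:
                affinities[(x, y)] = observed[(x, y)] / expected
    return affinities

# Toy sequence in which d and y strictly alternate.
toy_affinities = pair_affinities(list("dy" * 10))
print(toy_affinities[("d", "y")], toy_affinities[("d", "d")])
```

Values near 1 mean parity with shuffled lines; values well below or above 1 correspond to the lows and highs discussed above.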

Applying statistical hypothesis testing to this data, to obtain a confidence level for the hypothesis that the effects are not random, would not be necessary. The swings from parity are just too big.

So, how to explain these anomalies?

If the manuscript were in some natural language then it would be expected that the narrative would just wrap to the next line if there was no more space for a word at the end of the previous line. Line starting characters would be independent of each other and there would be reasonable parity between observed and expected repeats.

Likewise with any cypher hypothesis that transforms words according to some algorithm.

So what remains is that these anomalies show evidence of human choice and selection: that the text of the VMS was artificially constructed and meaningless, and that the writers had some private method to generate text, had their favourite character strings, and added sufficient variability to give the constructed text a semblance of genuineness, but, with the exception of s, did not want to repeat line-first characters too often.

(08-10-2025, 11:49 AM)dashstofsk Wrote: I think you have raised something of genuine significance. But perhaps the way to analyse this further is not to look at 3 or higher character string repeats but to look at the distribution of 2-character repeats.

Even before that, one should compare the frequencies P(w) of line-initial words with the frequencies Q(w) of words in any position. They are very different. The statistics of line-initial glyphs are highly skewed because the statistics of line-initial words are skewed. Thus, one should try to understand and explain the latter rather than the former.

Suppose that on some other manuscript one finds that the letters "i", "x" and "v" seem to be very common at the beginning of the line, and there is an anomalously high frequency of repeated letters in consecutive lines, much higher than expected if they were chosen independently at random. Those anomalies might be easier to understand if one notices that the most common words at the beginning of the line are "i", "ii", "iii", "iv", "v", "vi", ...

Let's look at page f83r, in particular. Say that a token (word occurrence) is "head" if it is the first of a line, "body" otherwise.

The following lists exclude words that occur only once, for which we cannot tell whether they prefer to be head or body. The counts are nt = total occurrences on page, nh = occurrences as head, nb = occurrences in body. (Sorry for the leading zeros, but it was the only way I found to prevent the MyBB editor from messing up the alignment of the tables).
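Counts of this kind can be tallied with a few lines of Python. The sketch below assumes each line is a string of words separated by whitespace, which may not match the separator used in a given transliteration file; the function name and the sample lines are made up for illustration and are not taken from f83r.

```python
from collections import Counter

def head_body_counts(lines):
    """Return rows (nt, nh, nb, word): total, head and body occurrences."""
    head = Counter()
    body = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        head[words[0]] += 1      # first token of the line is "head"
        body.update(words[1:])   # every other token is "body"
    table = []
    for word in set(head) | set(body):
        nh, nb = head[word], body[word]
        table.append((nh + nb, nh, nb, word))
    table.sort(key=lambda row: (-row[0], row[3]))
    return table

# Invented sample lines, only to show the output format.
sample_lines = ["sol chedy shedy", "sol qokedy chedy", "tchedy sol"]
table = head_body_counts(sample_lines)
for nt, nh, nb, word in table:
    print(f"{nt:02d} {nh:02d} {nb:02d} {word}")
```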

These words seem to occur as both head and body:

nt nh nb word
-- -- -- ---------
05 03 02 saiin
04 02 02 daiin
03 01 02 dain
03 01 02 qokchedy
03 01 02 sar
02 01 01 qokain
02 01 01 qokshedy
02 01 01 sy

These words occur (practically) ONLY as head:

nt nh nb word
-- -- -- ---------
07 06 01 sol
03 03 00 tchedy
02 02 00 solkeedy
02 02 00 sor

These words occur (practically) ONLY as body:

Code:
nt nh nb word
-- -- -- ---------
16 00 16 chedy
15 00 15 shedy
12 00 12 qokedy
09 00 09 lchedy
07 00 07 qoky
07 01 06 qokeedy
05 00 05 chey
05 00 05 qokaiin
04 00 04 aiin
05 01 04 dal
04 00 04 ol
04 00 04 qokal
04 00 04 qokeey
04 00 04 qoteedy
04 00 04 shey
03 00 03 otedy
03 00 03 qokchdy
03 00 03 r
03 00 03 s
02 00 02 checthy
02 00 02 cheey
02 00 02 dar
02 00 02 l
02 00 02 lkedy
02 00 02 lo
02 00 02 lsheedy
02 00 02 olchedy
02 00 02 oldy
02 00 02 otaiin
02 00 02 qokey
02 00 02 qotal
02 00 02 qotedy
02 00 02 sain
02 00 02 shckhedy
02 00 02 shcthy
02 00 02 shecthy
02 00 02 shedal
02 00 02 sheedy

And these are the words that occur only once on the page:

Code:
nt nh nb word
-- -- -- ---------
01 00 01 air
01 00 01 altedy
01 00 01 ar
01 00 01 atal
01 00 01 chary
01 00 01 chckhal
01 00 01 chckhdy
01 00 01 chcthdy
01 00 01 chdal
01 00 01 chdy
01 00 01 cheal
01 00 01 chealror
01 00 01 checkhy
01 00 01 checphedy
01 00 01 ched
01 00 01 chedaiin
01 00 01 chedain
01 00 01 chedchy
01 00 01 cheeb
01 00 01 cheedar
01 00 01 cheeety
01 00 01 cheeky
01 00 01 cheg
01 00 01 cheky
01 00 01 cheol
01 00 01 chepol
01 00 01 chety
01 00 01 chkain
01 00 01 chkedy
01 00 01 chkeedy
01 00 01 chky
01 00 01 chldaiin
01 00 01 ckal
01 00 01 cthal
01 00 01 cthalsaiin
01 00 01 daiiin
01 00 01 dairydy
01 00 01 dalchdy
01 01 00 dcheokedy
01 00 01 dolshed
01 01 00 dsheey
01 00 01 kain
01 00 01 kesd
01 00 01 lal
01 00 01 lchcphedy
01 00 01 lchdy
01 00 01 lchs
01 00 01 ldalor
01 00 01 ldar
01 00 01 ldy
01 00 01 lkeed
01 00 01 lol
01 00 01 lpchedy
01 00 01 lshedy
01 01 00 ocheol
01 00 01 ockhey
01 00 01 okain
01 00 01 okair
01 00 01 okedy
01 00 01 okeedyldy
01 00 01 okeeol
01 01 00 olkeedy
01 01 00 olkeey
01 00 01 ols
01 00 01 olsaly
01 00 01 opshcdy
01 00 01 opshedy
01 00 01 oqol
01 01 00 or
01 00 01 otal
01 00 01 otar
01 01 00 otchdy
01 00 01 otchedy
01 01 00 otchey
01 00 01 otor
01 00 01 otshdy
01 01 00 pchedal
01 00 01 pchedar
01 01 00 pchor
01 01 00 pdalshdy
01 00 01 qcphhedy
01 00 01 qeeedy
01 00 01 qekeiiin
01 00 01 qekey
01 00 01 qetal
01 00 01 qockhey
01 01 00 qockhol
01 00 01 qody
01 00 01 qofshdy
01 00 01 qokchy
01 00 01 qokeal
01 00 01 qokedal
01 00 01 qokol
01 00 01 qokylddy
01 00 01 qolchey
01 00 01 qolkain
01 00 01 qopchedy
01 00 01 qopy
01 00 01 qot
01 00 01 qotaiin
01 00 01 qotar
01 00 01 qotchedy
01 00 01 qotedaiin
01 00 01 qoty
01 00 01 raly
01 00 01 rchedy
01 00 01 rches
01 00 01 rchs
01 00 01 sam
01 01 00 schedair
01 01 00 schedy
01 00 01 schety
01 00 01 shckhey
01 00 01 shcthey
01 00 01 shdy
01 00 01 shechy
01 00 01 sheckhdy
01 00 01 sheckhy
01 00 01 shecthedchy
01 00 01 shecthedy
01 00 01 shedaiin
01 00 01 shedkedy
01 00 01 sheekchy
01 00 01 sheey
01 00 01 sheol
01 00 01 shetar
01 01 00 shky
01 00 01 shocphedy
01 00 01 shol
01 00 01 shy
01 00 01 skal
01 01 00 skar
01 01 00 soiiin
01 01 00 sokeedy
01 01 00 solche'dy
01 00 01 solchedy
01 01 00 solcheol
01 01 00 solchkal
01 00 01 soldy
01 01 00 solkeey
01 01 00 solkey
01 01 00 solshed
01 00 01 tain
01 00 01 tal
01 00 01 tar
01 01 00 techedy
01 00 01 tol
01 00 01 ytaiin

I have just recalled that user 'tavie' posted something similar about line starting characters. He used the term 'vertical pair' which sounds more suitable. In particular, he looked at the frequency of vertical pairs within different sections of the manuscript.

To this I would just like to add my matrix of affinities for vertical pairs in the language A pages

( Language A affinities )

In particular you will see that o-o , q-q and t-t vertical pairs occur in language A even less often than would be expected.

(09-10-2025, 09:39 AM)dashstofsk Wrote: I have just recalled that user 'tavie' posted something similar about line starting characters. He used the term 'vertical pair' which sounds more suitable. In particular, he looked at the frequency of vertical pairs within different sections of the manuscript.

She has made presentations at the Voynich Conference in 2022 and both Voynich MS days in 2024 and 2025.
Especially the latter two are on this topic, and I recommend having a close look at them.

Jorge_Stolfi

RobGea

Jorge_Stolfi

RadioFM

Jorge_Stolfi

Magical Raven

dashstofsk

Jorge_Stolfi

dashstofsk

ReneZ