The Voynich Ninja

Full Version: The 490, and other starting character patterns
(29-09-2025, 10:50 PM)SherriMM Wrote: Also my statistics only include the 18 line-initial characters, not any amount of random letters. Of the 18 characters, should I compute based on frequency?

If you assume that those 18 characters are equally likely then P(q) = P(o) = P(y) = 1/18 = ~0.056. Then, if each line-initial letter is just chosen at random, independently, in 1000 lines you would expect 998*(1/18)^3 = ~0.17 occurrences of qoy; that is none, or maybe one or two.

But suppose that (say) P(q) = P(o) = P(y) = 0.30, with all the other 15 letters occurring only on 10% of the lines.  Then in 1000 lines, with each initial being chosen at random and independently, you should expect 998*(0.30)^3 = ~27 occurrences of qoy.  

Thus, in order to tell whether the number of qoy occurrences is anomalous, you must consider the actual frequencies of the letters in line-initial positions.
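For anyone who wants to redo the arithmetic, here is a minimal sketch of the formula used in the two examples above. The function name is made up for illustration, and the only model assumed is the one stated in the post: each line-initial is drawn independently with fixed probabilities.

```python
def expected_trigram_count(n_lines, p1, p2, p3):
    """Expected number of one specific 3-initial pattern in n_lines
    consecutive lines, assuming independent line-initials."""
    windows = n_lines - 2  # positions where a 3-line window can start
    return windows * p1 * p2 * p3

# Case 1: 18 equally likely initials.
uniform = expected_trigram_count(1000, 1 / 18, 1 / 18, 1 / 18)

# Case 2: q, o, y each start 30% of the lines.
skewed = expected_trigram_count(1000, 0.30, 0.30, 0.30)

print(f"uniform: {uniform:.2f}   skewed: {skewed:.1f}")
```

The same pattern goes from essentially never to a couple of dozen hits purely because of the letter frequencies, with no other structure involved.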

And, again, note that some three-glyph sequences will be more common than what the formula says. For instance, in the second example above, the string qoy may occur only 25 times, but qyo may occur 31 times, oqy 26 times, oyq 33 times... If you pick the most common three-letter pattern, that pattern will be more common than expected.
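A quick simulation can make this selection effect visible. This is only a sketch under the assumptions of the second example (q, o, y at 0.30 each, independent draws); the letter "x" is a stand-in for all the other initials lumped together, and the seed is arbitrary.

```python
import random
from collections import Counter
from itertools import permutations

def simulate_trigram_counts(n_lines, letters, weights, seed=0):
    """Draw n_lines independent line-initials and count every
    3-initial window over consecutive lines."""
    rng = random.Random(seed)
    initials = rng.choices(letters, weights=weights, k=n_lines)
    return Counter("".join(initials[i:i + 3]) for i in range(n_lines - 2))

letters = ["q", "o", "y", "x"]        # "x" = any of the other initials
weights = [0.30, 0.30, 0.30, 0.10]
counts = simulate_trigram_counts(1000, letters, weights)

# Counts for the six orderings of q, o, y; each has the same expectation.
perm_counts = {"".join(p): counts["".join(p)] for p in permutations("qoy")}
expected = 998 * 0.30 ** 3
best = max(perm_counts, key=perm_counts.get)

print(f"expected per ordering: {expected:.1f}")
print(perm_counts)
print("most common ordering:", best, perm_counts[best])
```

The six orderings scatter around the common expectation, so whichever one you single out after looking at the data for being the most frequent tends to sit above it.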

All the best, --jorge
Some numbers from RF1b-er.txt
(30-09-2025, 12:25 AM)RobGea Wrote:
Code:
First_letter_of_a_line counts from RF1b-er.txt
Total first_letter_of_a_line character Tokens: 4130

d  640    15.4964%      rf: 0.155
s  624    15.109%      rf: 0.1511
y  610    14.77%        rf: 0.1477
o  544    13.1719%      rf: 0.1317
q  531    12.8571%      rf: 0.1286
...

OK, so (n-2)P(q)P(o)P(y) = 4128 * 0.129 * 0.132 * 0.148 = ~10.4.

Meaning that, if the line-initial letters had been chosen randomly and independently with those frequencies, in those 4130 lines you should expect to get 10 qoy or thereabouts.  

Also about a dozen dsy, a dozen sdy, a dozen yds, ten qyd, ...

All the best, --jorge
If the initials were intentionally interspersed so that no line repeats the previous line's initial, the expected number of qoy occurrences would be ~13.6.
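The post does not spell out how the ~13.6 was obtained. One simple reading that gives a figure of that size, and it is an assumption, is that each initial is drawn with the observed frequencies but redrawn whenever it would repeat the previous one, so that P(b after a) = P(b)/(1 - P(a)). A sketch, using the frequencies from the RF1b-er.txt counts above:

```python
# Line-initial frequencies for q, o, y from the RF1b-er.txt counts.
freq = {"q": 0.1286, "o": 0.1317, "y": 0.1477}
n_windows = 4130 - 2  # 3-line windows in 4130 lines

def independent(pattern):
    """Expected count if every initial is drawn independently."""
    e = n_windows
    for ch in pattern:
        e *= freq[ch]
    return e

def no_repeat(pattern):
    """Expected count if an initial may never repeat the previous one
    (assumed model: renormalise over the remaining letters). The first
    letter's probability is approximated by its overall frequency."""
    e = n_windows * freq[pattern[0]]
    for prev, cur in zip(pattern, pattern[1:]):
        e *= freq[cur] / (1 - freq[prev])
    return e

print(f"independent: {independent('qoy'):.1f}")
print(f"no repeats:  {no_repeat('qoy'):.1f}")
```

Forbidding repeats shifts probability onto the other letters, which is why the expected count for a pattern of three different initials goes up.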
(06-10-2025, 05:43 AM)Jorge_Stolfi Wrote: if the line-initial letters had been chosen randomly and independently with those frequencies

Two random variables X and Y are independent, by definition, if Prob(X=x and Y=y) = Prob(X=x)Prob(Y=y) for all possible values x, y.

If each letter S[k] in the sequence is independent from all the others, in particular it must be independent from the previous one S[k-1].  You can test this condition by computing the frequency P(xy) of S[k-1]S[k] = xy and comparing it with the product of frequencies P(x)P(y).  For example, since P(q) = 0.1286 and P(o) = 0.1317, the frequency P(qo) of qo in consecutive positions should be close to 0.1286*0.1317 = 0.0169.  That is, P(qo)/(P(q)P(o)) should be close to 1.

However the frequencies are only an approximation to the probabilities, and the error is large when the letters and letter pairs occur only a few times in the sample.  So this test should be applied only to the most common letters, like the top five in your list.
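The pair test described above could be coded along these lines. It is a rough sketch: the function name and the `min_count` cutoff are my own choices (the cutoff is there to skip rare pairs, where the frequencies are too noisy), and the input is just a list of line-initials in reading order.

```python
from collections import Counter

def pair_ratios(initials, min_count=5):
    """For each consecutive pair xy, return P(xy) / (P(x) P(y)).
    Values near 1 are what independence predicts."""
    n = len(initials)
    single = Counter(initials)
    pairs = Counter(zip(initials, initials[1:]))
    ratios = {}
    for (x, y), c in pairs.items():
        if c < min_count:
            continue  # too few occurrences for a meaningful frequency
        p_xy = c / (n - 1)  # n - 1 consecutive pairs in n lines
        ratios[x + y] = p_xy / ((single[x] / n) * (single[y] / n))
    return ratios

# Toy input, strictly alternating, so clearly NOT independent.
toy = list("qoqoqoqo")
toy_ratios = pair_ratios(toy, min_count=1)
print(toy_ratios)
```

In the toy sequence qo and oq come out well above 1 and qq, oo never occur at all, which is the kind of departure from 1 the test is meant to flag.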

Also, this test only checks whether each letter is independent from the previous one. If the line-initial letters pass this test, there may still be more complicated dependencies. Like "S[k] is equal to S[k-3] 70% of the time, unless S[k-7] is q, in which case ..."

Also, if one computes a large number of statistics about something, some of the statistics will be "anomalous" just by chance.   Like, if you ask 1024 financial gurus in the morning whether a certain stock will go up or down during the day, for 10 consecutive days, there is a good chance that one of them will be correct all 10 times.  It does not follow that her predictions are better than the others.  Or better than flipping a coin...
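The guru example is easy to put numbers on, assuming every guru is in effect flipping a fair coin each day, independently of the others:

```python
n_gurus = 1024
n_days = 10

# Chance that one coin-flipping guru is right on all 10 days.
p_single = 0.5 ** n_days

# Expected number of gurus with a perfect record.
expected_perfect = n_gurus * p_single

# Chance that at least one of the 1024 has a perfect record.
p_at_least_one = 1 - (1 - p_single) ** n_gurus

print(f"expected perfect records: {expected_perfect}")
print(f"P(at least one perfect record): {p_at_least_one:.3f}")
```

So a perfect 10-day record somewhere in the group is the likely outcome even when nobody has any skill.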

All the best, --jorge
Excuse the intrusion, I'm a programmer (a very bad one), I want to share my perspective if you don't mind. Is it a feasible theory that each paragraph is a block composed of lines that are in turn self-sufficient? Each line would represent a different but sequential process, meaning you could only complete the second line once the first line of the block was done and executed.

Thanks to the guy who started this thread, I was able to see the patterns at both the beginning and end of the lines. That would also perhaps explain the text's weak narrative structure, but it does have a certain component of logical processes... Like A-> B-> C

Here are some prefixes that begin lines and some suffixes or tags that end them, all constantly repeated.

Prefixes: qok-, ot-, d-, y-, which usually start the supposed instructions

Suffixes or tags ending sequences: -dy, -ain, -al, -ol

In short, each block or paragraph is composed of different processes that begin and end on a single line. But this only applies to the botany section; I haven't explored the rest of the manuscript yet.
Thanks SherriMM,

I think you have raised something of genuine significance. But perhaps the way to analyse this further is not to look at 3 or higher character string repeats but to look at the distribution of 2-character repeats.

For instance, in the starting characters for You are not allowed to view links. Register or Login to view. ( using the GC transliteration ) a  line is followed 7 times by one of the ch variants, and there appears to be a pattern. This is curious.

[attachment=11595]

In f83r  s lines seem to appear too often. Curious also.

[attachment=11594]


Taking only paragraph text from the GC transliteration and not including lines spanning page breaks, the matrix of occurrences for the character pairs is given here.

( counts )


For instance, there are 80 lines starting d that are followed by a line starting y. But if you look at the matrix of affinities, the ratios of counts against expected values, then you get something rather unexpected. d followed by d occurs 50 times, but this is 0.52 of what would be expected if the lines were randomly shuffled. But also look at o-o ( 0.25 ), q-q ( 0.3 ), t-t ( 0.59 ). These doubles all occur less often than expected. Yet if you look at s-s ( 1.42 ), it is high. s seems to have a liking for itself more than for other characters.

( affinities )
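For readers who want to rebuild matrices like these from their own transliteration, here is one way the counts and affinities could be computed. The exact definition of "expected" used for the posted matrices is not given, so this sketch assumes the usual one for a shuffled-lines baseline: row total times column total divided by the number of pairs. Initials are passed as lists of strings so that multi-character glyphs such as ch or Sh fit, and pairs are only formed inside a paragraph.

```python
from collections import Counter

def vertical_pair_affinities(paragraphs):
    """paragraphs: list of lists of line-initial glyphs, one list per
    paragraph. Returns (pair counts, observed/expected ratios)."""
    pair_counts = Counter()
    for initials in paragraphs:
        pair_counts.update(zip(initials, initials[1:]))

    total = sum(pair_counts.values())
    upper_totals = Counter()  # how often a glyph is the upper line
    lower_totals = Counter()  # how often a glyph is the lower line
    for (a, b), c in pair_counts.items():
        upper_totals[a] += c
        lower_totals[b] += c

    affinity = {}
    for (a, b), c in pair_counts.items():
        expected = upper_totals[a] * lower_totals[b] / total
        affinity[(a, b)] = c / expected
    return pair_counts, affinity

# Toy input only, not taken from the manuscript.
toy_paragraphs = [["d", "y", "d", "y"], ["s", "s", "s"]]
toy_counts, toy_affinity = vertical_pair_affinities(toy_paragraphs)
for pair, ratio in sorted(toy_affinity.items()):
    print(pair, toy_counts[pair], round(ratio, 2))
```

Pairs that never occur get no entry here; in a full matrix they would show as 0.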


Also look at some of the highs. q-Sh and q-ch occur 2.18 and 3.75 times as often as would be expected, respectively. o-q occurs 134 times, which is 2.01 times what would be expected. Also look at some of the lows. s-ch hardly ever occurs. These are big swings from parity.

Applying statistical hypothesis testing to this data, to obtain a confidence level for the hypothesis that the effects are not random, would not be necessary. The swings from parity are just too big.

So, how to explain these anomalies?

If the manuscript were in some natural language then it would be expected that the narrative would just wrap to the next line if there was no more space for a word at the end of the previous line. Line starting characters would be independent of each other and there would be reasonable parity between observed and expected repeats.

Likewise with any cypher hypothesis that transforms words according to some algorithm.

So what remains is that these anomalies show evidence of human choice and selection: that the text of the VMS was artificially constructed; that the writers had some private method to generate text, and meaningless text at that; that they had their favourite character strings and added sufficient variability to give the constructed text a semblance of genuineness; but that, with the exception of s, they did not want to repeat line-first characters too often.
(08-10-2025, 11:49 AM)dashstofsk Wrote: I think you have raised something of genuine significance. But perhaps the way to analyse this further is not to look at 3 or higher character string repeats but to look at the distribution of 2-character repeats.

Even before that, one should compare the frequencies P(w) of line-initial words with the frequencies Q(w) of words in any position. They are very different. The statistics of line-initial glyphs are highly skewed because the statistics of line-initial words are skewed. Thus, one should try to understand and explain the latter rather than the former.

Suppose that on some other manuscript one finds that the letters "i", "x" and "v" seem to be very common at the beginning of the line, and there is an anomalously high frequency of repeated letters in consecutive lines, much higher than expected if they were chosen independently at random. Those anomalies might be easier to understand if one notices that the most common words at the beginning of the line are "i", "ii", "iii", "iv", "v", "vi", ...

Let's look at page f83r, in particular.  Say that a token (word occurrence) is "head" if it is the first of a line, "body" otherwise.

The following lists exclude words that occur only once, for which we cannot tell whether they prefer to be head or body. The counts are nt = total occurrences on page, nh = occurrences as head, nb = occurrences in body. (Sorry for the leading zeros, but it was the only way I found to prevent the MyBB editor from messing up the alignment of the tables).
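A sketch of how nt, nh and nb could be tallied, in case someone wants to repeat this for other pages. It assumes each line has already been reduced to plain words separated by spaces (real transliteration files carry locus tags and other separators that would need stripping first), and the sample lines are invented, not copied from f83r.

```python
from collections import Counter

def head_body_counts(lines):
    """Return {word: (nt, nh, nb)} where nh counts occurrences as the
    first word of a line and nb counts all other occurrences."""
    head = Counter()
    body = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        head[words[0]] += 1
        body.update(words[1:])
    return {w: (head[w] + body[w], head[w], body[w])
            for w in set(head) | set(body)}

# Invented sample lines, for illustration only.
sample = ["sol chedy shedy", "sol qokedy chedy", "tchedy sol"]
table = head_body_counts(sample)
for word, (nt, nh, nb) in sorted(table.items(), key=lambda kv: -kv[1][0]):
    print(f"{nt:02d} {nh:02d} {nb:02d} {word}")
```

Sorting by nt and then splitting by whether nh or nb is zero gives lists of the same shape as the ones in this post.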

These words seem to occur as both head and body:

nt nh nb word
-- -- -- ---------
05 03 02 saiin
04 02 02 daiin
03 01 02 dain
03 01 02 qokchedy
03 01 02 sar
02 01 01 qokain
02 01 01 qokshedy
02 01 01 sy


These words occur (practically) ONLY as head:

nt nh nb word
-- -- -- ---------
07 06 01 sol
03 03 00 tchedy
02 02 00 solkeedy
02 02 00 sor

These words occur (practically) ONLY as body:

Code:
nt nh nb word
-- -- -- ---------
16 00 16 chedy
15 00 15 shedy
12 00 12 qokedy
09 00 09 lchedy
07 00 07 qoky
07 01 06 qokeedy
05 00 05 chey
05 00 05 qokaiin
04 00 04 aiin
05 01 04 dal
04 00 04 ol
04 00 04 qokal
04 00 04 qokeey
04 00 04 qoteedy
04 00 04 shey
03 00 03 otedy
03 00 03 qokchdy
03 00 03 r
03 00 03 s
02 00 02 checthy
02 00 02 cheey
02 00 02 dar
02 00 02 l
02 00 02 lkedy
02 00 02 lo
02 00 02 lsheedy
02 00 02 olchedy
02 00 02 oldy
02 00 02 otaiin
02 00 02 qokey
02 00 02 qotal
02 00 02 qotedy
02 00 02 sain
02 00 02 shckhedy
02 00 02 shcthy
02 00 02 shecthy
02 00 02 shedal
02 00 02 sheedy


And these are the words that occur only once on the page:


Code:
nt nh nb word
-- -- -- ---------
01 00 01 air
01 00 01 altedy
01 00 01 ar
01 00 01 atal
01 00 01 chary
01 00 01 chckhal
01 00 01 chckhdy
01 00 01 chcthdy
01 00 01 chdal
01 00 01 chdy
01 00 01 cheal
01 00 01 chealror
01 00 01 checkhy
01 00 01 checphedy
01 00 01 ched
01 00 01 chedaiin
01 00 01 chedain
01 00 01 chedchy
01 00 01 cheeb
01 00 01 cheedar
01 00 01 cheeety
01 00 01 cheeky
01 00 01 cheg
01 00 01 cheky
01 00 01 cheol
01 00 01 chepol
01 00 01 chety
01 00 01 chkain
01 00 01 chkedy
01 00 01 chkeedy
01 00 01 chky
01 00 01 chldaiin
01 00 01 ckal
01 00 01 cthal
01 00 01 cthalsaiin
01 00 01 daiiin
01 00 01 dairydy
01 00 01 dalchdy
01 01 00 dcheokedy
01 00 01 dolshed
01 01 00 dsheey
01 00 01 kain
01 00 01 kesd
01 00 01 lal
01 00 01 lchcphedy
01 00 01 lchdy
01 00 01 lchs
01 00 01 ldalor
01 00 01 ldar
01 00 01 ldy
01 00 01 lkeed
01 00 01 lol
01 00 01 lpchedy
01 00 01 lshedy
01 01 00 ocheol
01 00 01 ockhey
01 00 01 okain
01 00 01 okair
01 00 01 okedy
01 00 01 okeedyldy
01 00 01 okeeol
01 01 00 olkeedy
01 01 00 olkeey
01 00 01 ols
01 00 01 olsaly
01 00 01 opshcdy
01 00 01 opshedy
01 00 01 oqol
01 01 00 or
01 00 01 otal
01 00 01 otar
01 01 00 otchdy
01 00 01 otchedy
01 01 00 otchey
01 00 01 otor
01 00 01 otshdy
01 01 00 pchedal
01 00 01 pchedar
01 01 00 pchor
01 01 00 pdalshdy
01 00 01 qcphhedy
01 00 01 qeeedy
01 00 01 qekeiiin
01 00 01 qekey
01 00 01 qetal
01 00 01 qockhey
01 01 00 qockhol
01 00 01 qody
01 00 01 qofshdy
01 00 01 qokchy
01 00 01 qokeal
01 00 01 qokedal
01 00 01 qokol
01 00 01 qokylddy
01 00 01 qolchey
01 00 01 qolkain
01 00 01 qopchedy
01 00 01 qopy
01 00 01 qot
01 00 01 qotaiin
01 00 01 qotar
01 00 01 qotchedy
01 00 01 qotedaiin
01 00 01 qoty
01 00 01 raly
01 00 01 rchedy
01 00 01 rches
01 00 01 rchs
01 00 01 sam
01 01 00 schedair
01 01 00 schedy
01 00 01 schety
01 00 01 shckhey
01 00 01 shcthey
01 00 01 shdy
01 00 01 shechy
01 00 01 sheckhdy
01 00 01 sheckhy
01 00 01 shecthedchy
01 00 01 shecthedy
01 00 01 shedaiin
01 00 01 shedkedy
01 00 01 sheekchy
01 00 01 sheey
01 00 01 sheol
01 00 01 shetar
01 01 00 shky
01 00 01 shocphedy
01 00 01 shol
01 00 01 shy
01 00 01 skal
01 01 00 skar
01 01 00 soiiin
01 01 00 sokeedy
01 01 00 solche'dy
01 00 01 solchedy
01 01 00 solcheol
01 01 00 solchkal
01 00 01 soldy
01 01 00 solkeey
01 01 00 solkey
01 01 00 solshed
01 00 01 tain
01 00 01 tal
01 00 01 tar
01 01 00 techedy
01 00 01 tol
01 00 01 ytaiin
I have just recalled that user 'tavie' posted something similar about line starting characters. He used the term 'vertical pair' which sounds more suitable. In particular, he looked at the frequency of vertical pairs within different sections of the manuscript.



To this I would just like to add my matrix of affinities for vertical pairs in the language A pages

( Language A affinities )


In particular you will see that  o-o , q-q  and  t-t  vertical pairs occur in language A even less often than would be expected.
(09-10-2025, 09:39 AM)dashstofsk Wrote: I have just recalled that user 'tavie' posted something similar about line starting characters. He used the term 'vertical pair' which sounds more suitable. In particular, he looked at the frequency of vertical pairs within different sections of the manuscript.

She has made presentations at the Voynich Conference in 2022 and both Voynich MS days in 2024 and 2025.
Especially the latter two are on this topic, and I recommend having a close look at them.