The Voynich Ninja

Full Version: Distribution of Q-Q gaps in paragraphs
Summary

A q-token is a token (word occurrence) whose EVA transcription begins with the EVA glyphs qo or oqo. This note examines the spacing between q-tokens in the paragraph text of the VMS. 

More specifically, a plain token is a token that is not a q-token; a q-gap is the list of plain tokens in a parag before the first q-token (a BQ gap), between two successive q-tokens (a QQ gap), or after the last q-token (a QE gap).

In particular, if the first token of a parag is a q-token, we have an empty BQ gap. If the last token is a q-token, we have an empty QE gap. And if two q-tokens occur in consecutive positions, we have an empty QQ gap. This post reports statistics on the lengths of those three kinds of q-gaps.
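To make the definitions concrete, here is a minimal Python sketch of how one parag could be split into its q-gaps. It is not the script used for the plots; the function names and the sample token sequence are made up for illustration, and parags with no q-token at all are simply skipped.

```python
def is_q_token(token):
    """True if the EVA token starts with 'qo' or 'oqo'."""
    return token.startswith(("qo", "oqo"))


def q_gaps(tokens):
    """Split one parag (a list of tokens) into (kind, plain_tokens) pairs.

    Returns an empty list when the parag has no q-token (not handled here).
    """
    q_positions = [i for i, tok in enumerate(tokens) if is_q_token(tok)]
    if not q_positions:
        return []
    gaps = [("BQ", tokens[:q_positions[0]])]
    for a, b in zip(q_positions, q_positions[1:]):
        gaps.append(("QQ", tokens[a + 1:b]))
    gaps.append(("QE", tokens[q_positions[-1] + 1:]))
    return gaps


# Hypothetical parag, not taken from the transcription.
sample = "daiin.sheeal.qokeedy.qokain.chey".split(".")
print(q_gaps(sample))
# [('BQ', ['daiin', 'sheeal']), ('QQ', []), ('QE', ['chey'])]
```

The size of a gap is then just the length of its token list, so the two adjacent q-tokens in the sample give a QQ gap of size 0.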

The motivation was to test the hypothesis that the qo and oqo glyphs could be start-of-sentence markers, or more generally could be clues to sentence structure like subject case markers, verbal tense markers, etc. The results were not what I had expected, but are intriguing nonetheless.

Results

The following histograms show how many q-gaps there are of each size and type, in the Starred Parags ('str') section:

[attachment=13574][attachment=13572][attachment=13573]

Discussion

One observation we can make from these plots is that the distribution of q-tokens in a paragraph is not random. If a token could be plain or q-token with the same probability at any position, independently (a Markov process of order zero), then the number of q-gaps (of any type) of size k would be a decaying exponential function A*p**k, shown in the plots as the blue line with dots.  For the 'str' section, the parameter p is 0.8222.
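For reference, the blue curve can be sketched as follows. This assumes that p is the probability that a token is plain, so that a gap of size k has probability (1 - p) * p**k and the constant A equals the number of gaps times (1 - p). That normalization, the function name, and the use of 327 gaps in the example are my assumptions here, and the sketch ignores the fact that parags have finite length.

```python
def expected_gap_counts(total_gaps, p, max_size):
    """Expected number of gaps of size 0..max_size under the zero-order model.

    Assumes geometric gap sizes: P(size = k) = (1 - p) * p**k,
    i.e. A = total_gaps * (1 - p) in the formula A * p**k.
    """
    a = total_gaps * (1.0 - p)
    return [a * p**k for k in range(max_size + 1)]


# Illustrative call: p from the 'str' section, gap count chosen for the example.
model = expected_gap_counts(total_gaps=327, p=0.8222, max_size=12)
for size, expected in enumerate(model):
    print(f"{size:2d} {expected:7.2f}")
```

Each successive value is p times the previous one, which is why the curve is a straight line on a log scale.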

Compared to the random model, in the BQ plot we see a clear excess of parags with 2 to 6 plain tokens before the first q-token, and a scarcity of parags where the first q-token occurs at the beginning (gap size 0) or only after 7 to 12 plain tokens.

A similar pattern is visible in the QE plot. There is a relative excess of parags with 5 or 6 plain tokens after the last q-token, and a relative scarcity of parags that end with a q-token, or with exactly 2, 4, or 8 plain tokens.

On the other hand, pairs of /successive/ q-tokens (QQ gaps of size 0) are much more common than expected, and ditto for pairs separated by a single plain token; whereas pairs separated by three plain tokens are visibly less common than expected.

It is not obvious how one could reproduce these deviations from the zero-order Markov model with some other simple random generator.

Technical details

Input file

For this analysis, an EVA transcription of the parags text from selected pages was reformatted by joining all lines of each parag into a single sequence of tokens. Line breaks internal to the parag, EVA dubious space codes ',', and figure intrusion markers '-' were converted to EVA word spaces '.'.

For this analysis, a token was considered invalid if it contained a q glyph but did not start with qo or oqo; or if it started with '?' or 'o?', so that it could not be determined if it was a q-token or a plain one.  Any q-gaps that contained invalid tokens were excluded from the plot.
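The classification rule can be sketched like this. The function names are hypothetical, and the order of the tests is one reading of the rule above, not necessarily what the actual script does.

```python
def classify_token(token):
    """Classify an EVA token as 'q', 'plain', or 'invalid'."""
    # Unreadable start: cannot tell whether it is a q-token.
    if token.startswith("?") or token.startswith("o?"):
        return "invalid"
    if token.startswith(("qo", "oqo")):
        return "q"
    # Contains a q glyph somewhere, but not as a leading qo / oqo.
    if "q" in token:
        return "invalid"
    return "plain"


def gap_is_usable(gap_tokens):
    """A gap is kept only if none of its tokens is invalid."""
    return all(classify_token(tok) != "invalid" for tok in gap_tokens)
```

Note that under this reading a token like chd?in, with an unreadable glyph in the middle, still counts as plain, since its first glyphs are readable.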

BE gaps

There were a few parags with no q-tokens at all. In such cases the entire parag is a q-gap, of a separate type (a BE gap). In the zero-order Markov model, these BE gaps too have an exponentially decaying distribution, with the same exponent. However, there were too few of them to yield a meaningful plot.

Handling of dubious spaces

For the plots above, each EVA dubious space code ',' was mapped at random to either '.' or nothing, with equal probability.

This hack had relatively little impact on the statistics above. Changing the probability of conversion to dot from 50% to 0% or 100% only shifted the histograms a little, without affecting the qualitative conclusion -- that the q-gap sizes are far from random.
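A sketch of that mapping, with the conversion probability as a parameter so that the 0%, 50%, and 100% variants can be compared. The function name, the parameter names, and the sample string are made up for this example.

```python
import random


def resolve_dubious_spaces(text, p_dot=0.5, rng=None):
    """Replace each ',' by '.' with probability p_dot, else delete it."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch == ",":
            if rng.random() < p_dot:
                out.append(".")
            # otherwise the dubious space is dropped and the tokens are joined
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical line fragment with one dubious space.
line = "ol,kaiin.dar"
print(resolve_dubious_spaces(line, p_dot=0.0))  # olkaiin.dar
print(resolve_dubious_spaces(line, p_dot=1.0))  # ol.kaiin.dar
```

Passing a seeded random.Random instance as rng makes a run repeatable, which helps when comparing histograms across settings.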

All the best, --stolfi
Again, from the previous post: a q-token is a token that starts with qo (or oqo, but there are very few of these).  A plain token is one that is not a q-token, unless it contains a q, or starts with '?' or 'o?', in which case it is invalid.

Let a BQ-phrase be a list of zero or more plain tokens that starts at the beginning of a parag, plus the first q-token of that parag.  That is, a BQ-phrase is what was defined as a BQ-gap plus the q-token that ended it.  We may think of a BQ-phrase as the opening sentence of the parag.  Because of this analogy, I will call the closing q-token the verb of the BQ-phrase. (German readers will surely agree that it is a good name.)

As noted in the previous post, in the Starred  Parags section (SPS) there is an excess of BQ-gaps of lengths 2 to 6, compared to the distribution that would result if the words were chosen at random.  So I extracted all the BQ-phrases with gaps of those lengths and looked at them.  
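The extraction and tally can be sketched as below. This is an illustration only: the function names are hypothetical, invalid tokens are not handled, and the three sample "parags" are just BQ-phrases quoted in this thread, standing in for full parags.

```python
from collections import Counter


def bq_phrase(tokens):
    """Tokens from the start of the parag up to and including the first q-token."""
    for i, tok in enumerate(tokens):
        if tok.startswith(("qo", "oqo")):
            return tokens[:i + 1]
    return None  # no q-token in this parag


def verb_counts(parags, min_gap=2, max_gap=6):
    """Count the closing q-token of each BQ-phrase whose gap size is in range."""
    verbs = Counter()
    for tokens in parags:
        phrase = bq_phrase(tokens)
        if phrase is not None and min_gap <= len(phrase) - 1 <= max_gap:
            verbs[phrase[-1]] += 1
    return verbs


parags = [p.split(".") for p in (
    "daiin.sheeal.qokeedy",
    "kshed.dsheol.qokeedy",
    "kche.shodaiin.qokeey",
)]
print(verb_counts(parags).most_common())
# [('qokeedy', 2), ('qokeey', 1)]
```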

There are 209 of them (out of 327 parags in the SPS) and they have 93 distinct verbs.  The most common ones, and their counts, are

    16 qokeedy
    13 qokeey
    12 qokain
    10 qokaiin
      7 qopchedy
      7 qoteedy
      6 qokey
      5 qokedy
      5 qoteey
      4 qokar
      4 qokchedy
      4 qokchey
      4 qopchdy
      4 qotedy


Note that the top entries deviate from Zipf's law.
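A quick way to see the deviation is to compare the counts above with an idealized Zipf curve. The sketch below assumes the simplest form of the law (exponent 1, anchored at the top count); that choice is an assumption for illustration, not a fitted model.

```python
# Verb counts from the list above, in rank order.
counts = [16, 13, 12, 10, 7, 7, 6, 5, 5, 4, 4, 4, 4, 4]

# Idealized Zipf prediction: count at rank r = top count / r.
zipf = [counts[0] / rank for rank in range(1, len(counts) + 1)]

for rank, (observed, predicted) in enumerate(zip(counts, zipf), start=1):
    print(f"rank {rank:2d}  observed {observed:2d}  Zipf {predicted:5.1f}")
```

Under that form the second entry would be about 8 and the fourth about 4, whereas the observed counts are 13 and 10, so the top of the list is much flatter than the law predicts.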
 
There may be some cases where t is substituted for k by mistake (of the Author, the Scribe, or the Transcriber).  Among the verbs that use t, the most common ones are:

      7 qoteedy
      5 qoteey
      4 qotedy
      3 qotaiin
      3 qotain
      3 qotal
      2 qotar
      2 qotchdy
      2 qotchedy
      2 qotol
      2 qotor


Here are the 23 BQ-phrases with 2 to 6 plain tokens and qokeedy (16) or qoteedy (7) as the verb:

BQ 2 daiin.sheeal.qokeedy
BQ 2 kshed.dsheol.qokeedy
BQ 2 pcheoor.olkeedy.qokeedy
BQ 2 pchodain.okeedy.qokeedy
BQ 2 polor.sheedy.qoteedy
BQ 2 polsairy.oteo.qokeedy
BQ 3 otes.lchey.lshedy.qoteedy
BQ 3 pcheo.keey.oeeeky.qoteedy
BQ 3 pchsed.sheefy.opchey.qoteedy
BQ 3 poaral.orar.ofchey.qoteedy
BQ 3 poaralchar.octhy.otedy.qokeedy
BQ 3 teeol.sheol.sho.qokeedy
BQ 3 tshodair.olkeees.odain.qokeedy
BQ 4 chd?in.checkhy.dar.shedy.qokeedy
BQ 4 p.y.chal.shedy.qoteedy
BQ 4 poeokeey.lkeedy.tedain.shecthy.qokeedy
BQ 4 polaiin.okedain.okal.otchedy.qokeedy
BQ 5 dsheey.oteey.cheol.teedar.okeedy.qokeedy
BQ 5 ofaral.olkaiin.okar.okeeedy.tshedy.qokeedy
BQ 5 polaiin.oteol.otedyar.aral.kedy.qokeedy
BQ 5 tchede.okeey.lky.shedaiiin.chdy.qokeedy
BQ 6 sar.aintey.chetain.sht?ey.okey.chedy.qoteedy
BQ 6 ycheolkeeor.olkeeey.chedain.ol.cheedaiin.sheedy.qokeedy


Here are those with verb qokeey or qoteey:

BQ 2 kche.shodaiin.qokeey
BQ 2 paiindar.chcphy.qokeey
BQ 2 teeody.chedain.qoteey
BQ 2 tolshey.ochey.qokeey
BQ 3 chol.sheky.shedy.qokeey
BQ 3 pcheodar.sholkain.okshchedy.qoteey
BQ 3 pcheor.okear.sheey.qokeey
BQ 3 pchojain.aiin.teeedy.qoeey
BQ 3 pchosos.cheoar.keeol.qokeey
BQ 3 pol.keeeo.kaiis.qokeey
BQ 4 kaiin.sheey.oaiin.sheol.qoteey
BQ 4 sal.sheal.shedy.okeedy.qokeey
BQ 4 so.shear.okeedy.oteeey.qokeey
BQ 4 to.lkeeedy.okeol.cheody.qokeey
BQ 4 tor.shor.sheeey.oteeol.qokeey
BQ 5 daiin.o.chedain.daiin.cheedy.qokeey
BQ 5 pchedal.shdy.yteechypchy.otey.alshey.qoteey
BQ 5 shodain.cheal.shedy.r.cheetey.qokeey
BQ 6 teodarody.opcheed.okaiin.chaiin.otam.oteedy.qoteey


Here are the ones with qokain or qotain:

BQ 2 kchdpy.shey.qokain
BQ 2 polaiin.otar.qotain
BQ 2 poral.shokeeody.qokain
BQ 2 pshoair.lkeeshedy.qokain
BQ 2 sairol.sheey.qokain
BQ 2 tshoar.oeey.qokain
BQ 2 yshey.lkeey.qokain
BQ 3 pchody.odaiin.chcphy.qokchdain
BQ 3 polar.ar.okshey.qokain
BQ 3 polar.okar.teody.qokain
BQ 3 polchey.lkarshar.okain.qokain
BQ 5 polaiin.ksheeol.lkaiin.tair.shey.qotain
BQ 6 fairal.chkal.lky.otain.ar.kalkal.qotain
BQ 6 kcheor.cheol.orair.otchedar.lor.aiin.qokain
BQ 6 pochey.oteain.chekain.cheal.lain.chey.qokain
BQ 6 polkiin.cheopaiin.otain.shedy.pchedy.opcheedy.qokain


And here are those with qokaiin or qotaiin:

BQ 2 dorkcheky.cheoaiin.qotaiin
BQ 2 orain.chckhey.qokaiin
BQ 2 palar.shedy.qokaiin
BQ 2 pched.shedain.qokaiin
BQ 2 polarar.lshedy.qotolaiin
BQ 2 polchlsaiin.sheky.qokaiin
BQ 2 tcheodaiin.chaiin.qokaiin
BQ 2 tcho.ararshy.qokaiin
BQ 2 tolshosor.olkeedy.qotaiin
BQ 3 tcheoky.l.kshedy.qokaiin
BQ 3 tchodairos.or.chey.qotaiin
BQ 5 pair.aiickheedy.shalkaiin.kaiisy.okaral.qokaiin
BQ 5 tchoar.sheeodaiin.chkaiin.otchod.okchedy.qokaiin
BQ 6 polchedy.olkeey.lkey.chcthy.lkar.shedy.qokaiin

Can you see any special patterns or anomalies?  Do you have any suggestions for other analyses on this topic?

All the best, --stolfi
It is just a consequence of irregular positional preferences. Spline transforms of the line positions of words starting with q in quires 13 and 20 are given by

[attachment=13584]
[attachment=13585]
(20-01-2026, 11:14 AM)dashstofsk Wrote: It is just a consequence of irregular positional preferences. Spline transforms of the line positions of words starting with q in quires 13 and 20 are given by

Thanks, but those are positions along each line, right?  So those stats are dominated by the starts of body lines.  My stats above are essentially looking only at the head lines of parags.

And I would say that the positional distribution of q-tokens along a parag is a consequence of the meaning of individual q-words and of the semantic structure of parags.  Not the other way around...

And the frequencies of glyphs and digraphs are a consequence of the frequencies of words.  Not the other way around.  (Statistics of glyphs and digraphs are like statistics of animals classified by their colors.  They can be meaningful in some very specific situations, but otherwise they are lumping bears with beavers and bats...)

And let me insist that the anomalous frequency of q-words (or any words) at the start and end of lines may be a consequence of the line-breaking algorithm, that is more likely to break a line before a long word than a short one.  This effect is counter-intuitive but very real, and affects all glyph, digraph, and word statistics at both ends of the lines. Thus any study that claims to show "LAAFU effects"  should check how much of those effects can be attributed to this line-breaking bias.  (For instance, by piping the words of each parag through the trivial line-breaking algorithm with a 20% larger or smaller max line length, and observing how much impact this reformatting has on those "LAAFU effects".)
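A minimal sketch of such a check, assuming a greedy line breaker that measures width in EVA characters plus one unit per word space. Real glyph widths differ, so this is only a rough stand-in, and the function names, the sample parag, and the base width of 10 are made up for the example.

```python
def break_lines(tokens, max_len):
    """Greedy line breaking: start a new line when the next token would not fit."""
    lines, current, width = [], [], 0
    for tok in tokens:
        extra = len(tok) + (1 if current else 0)  # +1 for the word space
        if current and width + extra > max_len:
            lines.append(current)
            current, width = [tok], len(tok)
        else:
            current.append(tok)
            width += extra
    if current:
        lines.append(current)
    return lines


def line_start_q_fraction(parags, max_len):
    """Fraction of body-line starts (all lines but the first) that are q-tokens."""
    starts = [line[0]
              for tokens in parags
              for line in break_lines(tokens, max_len)[1:]]
    if not starts:
        return 0.0
    return sum(tok.startswith(("qo", "oqo")) for tok in starts) / len(starts)


# Hypothetical parag and base width; compare 20% narrower and wider lines.
parags = [["daiin", "qokeedy", "chey", "qokain", "ol", "shedy", "qokal"]]
for max_len in (8, 10, 12):
    print(max_len, line_start_q_fraction(parags, max_len))
```

If the fraction of q-tokens at line starts moves substantially with the line width alone, part of the apparent line-start preference is due to the breaking rule rather than to the text.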

All the best, --stolfi
We all know that line-first words and line-last words are not fully representative of the paragraph text. However, away from the first and last words, one would expect the distributions in a meaningful text to be level. But this is not always seen.

For instance, the positional distribution in quire 20 of words starting with ot shows that mid-line these words occur with increasing frequency towards the end of the line.

[attachment=13588]
Here is the discussion for the symmetric situation, at the end of parags.

Again, a QE-gap is the list of plain tokens between the last q-token of a parag and the end of that parag.  The corresponding QE-phrase is those plain tokens prefixed with that q-token, which is the verb of the QE-phrase.  (Surely the Arabic, Welsh, and Maori speakers in this forum will find this name quite logical.)
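The mirror-image extraction is a scan from the end of the parag. As before this is only a sketch with a hypothetical function name, a made-up token sequence, and no handling of invalid tokens.

```python
def qe_phrase(tokens):
    """Last q-token of the parag followed by all the plain tokens after it."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].startswith(("qo", "oqo")):
            return tokens[i:]
    return None  # no q-token in this parag


# Hypothetical parag, not taken from the transcription.
sample = "daiin.qokeedy.chey.qokain.char.ar".split(".")
print(qe_phrase(sample))
# ['qokain', 'char', 'ar']
```

The QE-gap size is the phrase length minus one, here 2.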

Recall that the Starred Parags section (SPS) has an excess of QE-gaps of lengths 5 to 7, over the distribution expected from a zero-order Markov model.  I extracted the QE-phrases with those gap lengths; there are 73 of them (out of 327 parags). They use 45 distinct verbs, of which only these are used in two or more of those phrases:

      9 qokaiin
      5 qokain
      5 qokeey
      4 qokal
      4 qokedy
      3 qokar
      3 qokeedy
      2 qokchdy
      2 qotar
      2 qoteey

The most popular QE verbs seem to be also the most popular BQ verbs, but with somewhat different frequency ranks.  There is only one of these phrases with qotain as the verb, and none with qotaiin.

Here are the nine QE phrases with 5 to 7 plain tokens and qokaiin as the verb:

QE 5 qokaiin.air.lo.r.chedy.otain
QE 7 qokaiin.checkhy.sho.lchal.sheey.shckhey.kshar.tar
QE 7 qokaiin.cho.okaiin.cheodaiin.aky.le.chody.chotaiin
QE 5 qokaiin.lkar.ytaiin.otcheo.chy.sarain
QE 7 qokaiin.olchedydy.sain.shey.olsheey.dair.chekeeal.okeey
QE 5 qokaiin.olkal.airody.okaiin.okalal.loary
QE 5 qokaiin.olkeedy.okar.ar.olkain.odain
QE 7 qokaiin.os.daiin.cheal.otain.okar.otaiin.olaiin
QE 7 qokaiin.otal.alkal.okain.cheey.lol.loeey.aiinal

Here are the five with qokain as the verb:

QE 5 qokain.char.ar.olar.aiiin.okar
QE 6 qokain.chckhy.rorol.chdy.raly.oraiin.chary
QE 5 qokain.chees.ykar.ain.ycheeo.lkeey
QE 5 qokain.chey.lkain.chal.ldy.llm
QE 7 qokain.chkal.chckhy.saiin.ar.lchal.she.otain

Here are the five with qokeey and the two with qoteey:

QE 6 qokeey.cheytain.olain.okeeedy.chekain.chckhey.lchar
QE 5 qokeey.okeedal.okeol.lkeey.lchedy.lchedy
QE 6 qokeey.okeeedy.okain.chedy.chedyteey.dal.lam
QE 6 qokeey.okey.chdar.ol.loty.chedar.aly
QE 6 qokeey.otaiin.olaiin.cheokain.lkeey.ltal.keedy

QE 5 qoteey.daiin.okched.y.sheos.aiin
QE 6 qoteey.lkeey.raiin.cheo.lor.otal.otchedy

Again, do you see any interesting patterns?  What other stats on this topic would you like to see?

All the best, --stolfi
Hello Jorge, I still don't quite understand what the goal of your analysis is. Do you see a way to categorize the manuscript more as natural language in the analysis or the results?
(22-01-2026, 06:33 AM)Petrasti Wrote: Hello Jorge, I still don't quite understand what the goal of your analysis is. Do you see a way to categorize the manuscript more as natural language in the analysis or the results?

I believe that it is natural language, but not a European one, or any that would be known in Europe at the time (such as Arabic, Hebrew, Georgian, Basque, Turkish, Persian...).  And I believe that it is in the plain, or at most encrypted with an encoding that is one-to-one on words.

The general goal of that analysis was to identify sentence boundaries in the paragraphs. This analysis specifically tested the hunch that the q-words could be indicators of start or end of sentence -- such as verbs in a VSO or SOV language, or nominative case nouns in an SVO language, etc.

The results do not quite prove that hunch, but suggest that those words have a specific syntactic role that places them in specific places in a sentence.  That would explain why they tend to be absent from the beginning and end of paragraphs.

All the best, --stolfi
Have you checked if other quires show a similar pattern, provided there is enough data for meaningful statistics?
(22-01-2026, 01:43 PM)Bernd Wrote: Have you checked if other quires show a similar pattern, provided there is enough data for meaningful statistics?

I intend to do that on the sections that have significant parag text (Herbal-A, Herbal-B, and Bio).  But the transcription file that I am using still has bugs and holes in those sections.  I am checking it against the images and against Rene's transcription, but it takes time...

I did a run on the part of Herbal-A that I already have debugged.  The results seemed qualitatively similar: the q-tokens tend to avoid the ends of parags and cluster at certain preferred distances from them. But that impression may still turn out to be false...

All the best, --stolfi