The Voynich Ninja
Heap's Law - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Heap's Law (/thread-2279.html)



RE: Heap's Law - ReneZ - 11-02-2018

To characterise the text in this manner, what one could do is take a group of words (say 1000) anywhere in the MS, and then inspect the following words (a smaller group, e.g. 100) to see how many of them occur in the previous 1000 and how many are new. One can do this in two different ways, either counting all words of the 100, or only the unique words of the 100.

This is similar to checking Heap's law, but lets one put the start of the MS in different places. This way, it is easier to see changes that are at an advanced point in the text.
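A minimal sketch of this measurement in Python (the token stream, window sizes and vocabulary here are invented for illustration, not taken from any transcription):

```python
import random
import string

def novelty_ratio(tokens, start, context=1000, probe=100, unique_only=False):
    """How many of the `probe` words after `start + context` already occur
    in the preceding `context` words? Returns the fraction that are new."""
    seen = set(tokens[start:start + context])
    window = tokens[start + context:start + context + probe]
    if unique_only:
        window = list(dict.fromkeys(window))  # keep only unique words, in order
    new = sum(1 for w in window if w not in seen)
    return new / len(window)

# Toy demo: a random "text" over a small invented vocabulary.
random.seed(1)
vocab = ["".join(random.choices(string.ascii_lowercase[:8], k=4)) for _ in range(300)]
text = [random.choice(vocab) for _ in range(5000)]
print(novelty_ratio(text, 0))     # measured at the start of the text
print(novelty_ratio(text, 2000))  # same measurement at a later start point
```

Sliding `start` across the text gives the position-independent view described above.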


RE: Heap's Law - MarcoP - 11-02-2018

(11-02-2018, 12:44 PM)davidjackson Wrote:
Quote:MarcoP: David, what do you mean when you say that the ratio of creation is "constant"? Do you mean that the number of unique words increases linearly with text length?


Here I was referring to Torsten's text. Words in his corpus were created at a linear rate, whereas the Heap's line follows a power law, and hence the two diverge.

If you look at the graph from the stars section you see how the creation rate rises and falls (something which I assume, without proof, is due to the introduction of new topics), but it still has an association with the power-law line.

Thank you, David!
A straight line is of course a very particular case of f(x)=A*x^B with B=1.

The attached curves correspond to random files of words (each word 4 characters long) based on different size alphabets: 8 characters, 12 characters, 26 characters.
With 8 characters, you can only have 8^4=4096 words; with 10000 random tries, we have almost the whole dictionary covered.
On the other hand, with 26 characters, the possible "dictionary" of words is much larger than the x-range of this graph: we have only covered a tiny fraction of it, and each next word is very unlikely to have occurred before. The corresponding blue line is very close to f(x)=x.
Obviously, with the 12-character alphabet, we get something intermediate between the two.

These are samples from the random files:

Code:
==> rnd8.txt <==
fbba bgfc bhgc adba bced cded dfhc aceh egag fgaa ahge ecde ebge hhed egab edah heff efaf fddg hbhc
hhhh aace eegd bcag bcaf egbc fcfh accg cggh gfdh hdgg hcfd eede ccbg cceh aabd eheh bfeg hcdg dgbh

==> rnd12.txt <==
gjbl dglh jgkc hhke cjha babg idih ejge egjf ffef klhj fbgk jldh aajc fljj lcie ljga hleh lhda jhag
bjal lbde jeha lhfd efif afah iflj hfae lfel bklj bkca dlbk eleb bdfe ebck lehi figk ldag ahie ehga

==> rnd26.txt <==
lrdb revy asdn ajsn zhmy ajrs glve lhkr pywn kdoj jijz asbs xuku cpdk vsvz uwyw aacf okdy pgxa hsik
rlvl uagf qmrr rlmi qdmf zked luvj zhcz mhoa pqgw praj icxo wzdb dbie anmv dytl dvul vkea mmxh bxds
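The three random files can be regenerated, and their Heaps-style curves computed, with a short script like the following (file writing omitted; the seed is arbitrary):

```python
import random
import string

def random_words(n, alphabet_size, word_len=4, seed=0):
    """n random words over the first `alphabet_size` lowercase letters."""
    rng = random.Random(seed)
    letters = string.ascii_lowercase[:alphabet_size]
    return ["".join(rng.choices(letters, k=word_len)) for _ in range(n)]

def vocab_growth(words):
    """Heaps-style curve: distinct-word count after each successive token."""
    seen, curve = set(), []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

for size in (8, 12, 26):
    print(size, vocab_growth(random_words(10000, size))[-1])
```

With 10000 tries, the 8-letter alphabet saturates near its 4096-word ceiling, while the 26-letter curve stays close to f(x)=x.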

What happens with language is something similar, in my opinion. The number of legal combinations of characters is limited, not by the size of the alphabet, but by linguistic rules. Moreover (something my simple experiment doesn't capture), some words are more frequent than others (e.g. the function words you discussed a few months ago).

Now, if Torsten's algorithm produces something close to f(x)=0.5*x, it could be that it copies half of the words from the already generated text (and these words obviously do not add to Heap's total of unique words) and randomly generates the other words (similarly to the blue line representing the rnd26 file). While copying already generated words is an excellent way to approximate Heap's law, one must also find a way to appropriately constrain the generation of new words.
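That hypothesised copy-half/invent-half model can be simulated in a few lines (this is a sketch of my reading of the idea, not Torsten's actual algorithm; all parameters are assumptions):

```python
import random
import string

def copy_or_invent(n, p_copy=0.5, alphabet_size=26, word_len=4, seed=0):
    """With probability p_copy, reuse a word already in the text;
    otherwise invent a fresh random word (like the rnd26 file)."""
    rng = random.Random(seed)
    letters = string.ascii_lowercase[:alphabet_size]
    text = ["".join(rng.choices(letters, k=word_len))]
    while len(text) < n:
        if rng.random() < p_copy:
            text.append(rng.choice(text))  # a copy adds nothing to the vocabulary
        else:
            text.append("".join(rng.choices(letters, k=word_len)))
    return text

text = copy_or_invent(10000)
print(len(set(text)) / len(text))  # close to 0.5, i.e. f(x) ≈ 0.5*x
```

Copies never add unique words, and almost every invented 26-letter word is new, so the vocabulary grows at roughly half the text length.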

Have you thought of also examining Fisk's work? My impression is that he has a good amount of "morphological rules" in place, so his dictionary should be more constrained and possibly produce a curve similar to that of the actual VMS.


RE: Heap's Law - Torsten - 11-02-2018

(11-02-2018, 02:59 PM)MarcoP Wrote: Now, if Torsten's algorithm produces something close to f(x)=0.5*x, it could be that it copies half of the words from the already generated text (and these words obviously do not add to Heap's total of unique words) and randomly generates the other words (similarly to the blue line representing the rnd26 file). While copying already generated words is an excellent way to approximate Heap's law, one must also find a way to appropriately constrain the generation of new words.

The generated text only contains words copied from each other.

The idea behind the auto-copy algorithm is that the scribe was generating the text by copying words already written. But copying a word doesn't mean that a source word 'chol' is copied as 'chol'. Instead, the words are modified: instead of 'chol' the scribe could write words like 'shol', 'chor' or 'cheol'. While doing so, the scribe would normally copy words he was able to see. This is, in my eyes, the explanation for the observation that similar words co-occur on the same pages. For instance, on a page containing many instances of 'chol' you will also find many instances of similar words. See for instance all the instances of 'chol' and 'chor' on page [link] (see [link]).

For each page the scribe could prefer other words. For instance, on page f3r many words with an 'm'-glyph can be found (see [link]). If you search for new words on page [link] you will find, for instance, the words 'sheoldam', 'tsheoarom' and 'pcheoldom'. These three words are similar to each other and they occur only once within the VMS. Moreover, 'tsheoarom' and 'pcheoldom' are used as paragraph-initial words. There are only seven paragraph-initial words using a final 'm'-glyph in the whole VMS. This is the way new words occur within the VMS.
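A very loose sketch of the auto-copy idea described above (NOT the actual algorithm; the edit list, window size and seed word are invented for illustration):

```python
import random

# Illustrative glyph swaps only; the real set of modifications is far richer.
EDITS = [("ch", "sh"), ("ol", "or"), ("o", "eo")]

def auto_copy(n, seed_word="chol", window=50, seed=0):
    """Every new word is a (possibly modified) copy of a recent word,
    mimicking a scribe copying words he can still see on the page."""
    rng = random.Random(seed)
    text = [seed_word]
    while len(text) < n:
        src = rng.choice(text[-window:])  # pick a word still "visible"
        a, b = rng.choice(EDITS)
        if rng.random() < 0.5:
            a, b = b, a                   # the swap can go in either direction
        text.append(src.replace(a, b, 1) if a in src else src)
    return text

print(auto_copy(20))
```

Because sources are drawn from a recent window, similar variants cluster on the same "page", as in the 'chol'/'chor' example.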


RE: Heap's Law - davidjackson - 12-02-2018

Quote:ReneZ: To characterise the text in this manner, what one could do is take a group of words (say 1000) anywhere in the MS, and then inspect the following words (a smaller group, e.g. 100) to see how many of them occur in the previous 1000 and how many are new. One can do this in two different ways, either counting all words of the 100, or only the unique words of the 100.


Or, possibly, by comparing blocks of text to one another - either page by page, or even paragraph by paragraph. This could be a way to identify "subjects" within the corpus. Sudden bursts of new words would suggest that a topic change has been introduced - something the larger scale graphs that I have posted above seem to show.

Another use could be to find blocks of "null filler" text, which some people have suggested exists in the corpus. If we identify lines of text that appear to be filled with new unique tokens, then this could be null cover text designed to bulk out the real content.
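Both ideas (topic-change detection and null-filler detection) reduce to the same per-block "new word ratio". A minimal sketch, using invented toy blocks rather than real pages:

```python
def new_word_bursts(blocks):
    """For each block (page or paragraph), the fraction of its tokens that
    have not appeared in any earlier block. Spikes may flag topic changes;
    a block made almost entirely of new tokens could flag 'null filler'."""
    seen, bursts = set(), []
    for tokens in blocks:
        new = sum(1 for w in tokens if w not in seen)
        bursts.append(new / max(len(tokens), 1))
        seen.update(tokens)
    return bursts

# Toy blocks (invented words): the middle block reuses earlier vocabulary.
blocks = [["a", "b", "a"], ["a", "c", "b"], ["x", "y", "z"]]
print(new_word_bursts(blocks))
```

The first block is all-new by construction; the interesting signal is a late block whose ratio jumps back toward 1.0.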

Quote:MarcoP - While copying already generated words is an excellent way to approximate Heap's law, one must also find a way to appropriately constrain the generation of new words.
Indeed. One must of course assume that power laws were unknown to the scribes, and so such a constraint must make sense for the imagined mindset of the scribe. More to the point - why constrain the text in this way? It makes no sense within our imagined knowledge of the time when the text was written.

I haven't had time to examine Fisk's work yet, I will try to later this week.

Quote: The generated text only contains words copied from each other.

Torsten - this is certainly apparently true. However, I feel that your theory, which otherwise appears to account for much of the way the text was written, fails to explain the grammatical structure which appears to constrain Voynichese (see Stolfi's [link] paradigm). What exactly are these structures, and why do they apply throughout the manuscript? If you could link those structures into your generation algorithm, we would be some way closer towards proving your theory. And why was the text created in such a way that unique words appear to be introduced in a topic-like way? (OK, I don't expect you to prove that second question! Wink )


RE: Heap's Law - -JKP- - 12-02-2018

Quote:davidjackson: This could be a way to identify "subjects" within the corpus. Sudden bursts of new words would suggest that a topic change has been introduced - something the larger scale graphs that I have posted above seem to show.

There are some patterns of self-similarity (almost fractal in nature, one might even say echolalic) in some of the plant pages.


Look for example at the knapweed near the beginning (the one that resembles a dried specimen of Centaurea jacea). In Latin (with abbreviations), one sees several variations of the word centaurus/centaura scattered through the text. If you wrote 9daur9 in Latin, it would be expanded as Contaurus or Centaurus or Centaurum, which would fit the identity of the plant. Latin allowed scribes to abbreviate in a variety of ways (more than one way to abbreviate a specific word, and yet it would still be understood by different scribes), and it's almost as if the same thing has been abbreviated several times in slightly different ways.

Then if you look at the plant that resembles Ricinus (castor oil), you will see a lot of EVA-m. In Latin, EVA-m is the abbreviation for "ris", which is homophonic for "ric" in Ricinus, and it repeats more on this page than on other pages. Once again, it has an echolalic feel to it.



I think these patterns might qualify as "sudden bursts" of word-patterns that seem possibly related to the content of the drawings on specific pages.


RE: Heap's Law - Torsten - 13-02-2018

(12-02-2018, 07:05 PM)davidjackson Wrote: Torsten - this is certainly apparently true. However, I feel that your theory, which otherwise appears to account for much of the way the text was written, fails to explain the grammatical structure which appears to constrain Voynichese (see Stolfi's [link] paradigm). What exactly are these structures, and why do they apply throughout the manuscript? If you could link those structures into your generation algorithm, we would be some way closer towards proving your theory. And why was the text created in such a way that unique words appear to be introduced in a topic-like way? (OK, I don't expect you to prove that second question! Wink )

> What exactly are these structures?

Normally the structure of a word is not changed while copying it. Therefore copied words share the same structure.

See for instance page f10r. On page f10r you can find, besides 3 x 'chor', also 3 x 'shor' and 4 x 'chol' (see [link]). The difference between 'ch' and 'sh' is one additional stroke, and the difference between 'ol' and 'or' is a different shape of the last glyph. But these types of changes don't affect the word structure.

Besides the word 'chor', the words 'qokchor' and 'oykchor' also exist, and besides 'chol' the words 'qokchol' and 'qokchol'. An additional "prefix" likewise doesn't affect the word structure (see [link]).

For 'qokchor' and 'oykchor' the structure of 'qok' has changed. In 'oyk-' the 'o'-glyph is used as the first sign. This type of change is very rare. There are only three instances of 'oyk-' but over 3100 instances of 'qok-'. Note: the connection between 'oyk-' and 'qok-' is confirmed by three instances of 'qoyk-'.

> And why do they apply throughout the manuscript?

What happens if a word with a different structure is generated? In the case of 'oyk-' the new structure is not copied, and 'oyk-' looks weird to us.

But what happens if the new structure is copied over and over again? In this case we would accept the new structure as a rule for the VMS. See for instance words like 'cheody' and 'sheody' in Currier A. On page f65v, beside 'sheody', one instance of 'chedy' also exists (see [link]). It is the only instance of 'chedy' on a page using Currier A. This switch from 'eod' to 'ed' is very interesting. Words using 'ed' are typical for Currier B but rarely used in Currier A. 'chedy' is the third most frequent word in the VMS. But if we only knew the Currier A pages, a word like 'chedy' would look weird to us.

> And why was the text created in such a way that unique words appear to be introduced in a topic like way?

They are not introduced in a topic-like way. We only use the topics to split the manuscript into sections. Herbal pages in Currier A and herbal pages in Currier B use different word sets yet share the same topic.

The main difference between Currier A and Currier B is the usage of 'ed'. Therefore we can use words like 'chedy' to reconstruct the original order for the sections:

                   "chedy" "qokeedy"
Herbal in Currier A      1        0
Pharmaceutical (A)       1        0
Astronomical             4        0
Cosmological            24        4
Herbal in Currier B     62        9
Stars (B)              190      137
Biological (B)         210      153
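A table like the one above can be reproduced from any transcription with a small word-counting script; the section-to-token mapping below is invented toy data, not the real transliteration:

```python
from collections import Counter

def marker_counts(sections, markers=("chedy", "qokeedy")):
    """Per-section counts of the chosen marker words.
    `sections` maps a section name to its token list."""
    return {name: tuple(Counter(tokens)[m] for m in markers)
            for name, tokens in sections.items()}

# Toy input (invented tokens, not the real transliteration):
sections = {
    "Herbal A": ["chol", "chedy", "chor"],
    "Stars B": ["chedy", "qokeedy", "chedy"],
}
print(marker_counts(sections))  # {'Herbal A': (1, 0), 'Stars B': (2, 1)}
```

Sorting sections by these counts gives the proposed ordering from Currier A toward Currier B.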


RE: Heap's Law - Wladimir D - 13-02-2018

Quote:ReneZ: To characterise the text in this manner, what one could do is take a group of words (say 1000) anywhere in the MS, and then inspect the following words (a smaller group, e.g. 100) to see how many of them occur in the previous 1000 and how many are new. One can do this in two different ways, either counting all words of the 100, or only the unique words of the 100.

Quote:davidjackson: Or, possibly, by comparing blocks of text to one another - either page by page, or even paragraph by paragraph. This could be a way to identify "subjects" within the corpus.

I am currently engaged in such an analysis, but it is a huge amount of work.
I want to give one example, which is based on the intersection of words in texts corresponding to large and small plants. Take the pages [link] (76 words) and f102v1 (64 words). The intersection is three (3) words.
Now compare the intersection with other arbitrary pages taken from other sections. I understand that for more correct results it is necessary to introduce a correction factor that takes into account probability and the difference in the sizes of the compared texts. But I will simplify, and a clear illustration will still be obtained. Results:
F1r – f102v1    11 words
F84r – f102v1   11 words
F105r – f102v1  11 words
 Such an increase can be explained, but three other results are surprising.
F67r1 – f102v1  10 words
F68r2 – f102v1   6 words
F70v2 – f102v1   7 words

For comparison, the text of f19r, which contains more words than f102v1, gives a decrease in intersecting words:
F67r1 – f19r   5 words
F68r2 – f19r   3 words

And only with the fish (Pisces) page is there an increase:
f70v2 – f19r   10 words
It turns out that the intersection of words on pages that have identical plants is less than on pages not related by identical patterns.
Therefore, it is necessary either:
1. to call into question the conclusions of Montemurro and Co. about the binding of the text to the figures, or
2. to accept that we do not correctly understand the astronomical and zodiacal sections. For example - the stars are not stars, but symbols of flowers.
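The intersection measure used above, plus one crude option for the size-correction factor mentioned, can be sketched as follows (the word lists are invented; normalising by the smaller vocabulary is just one possible correction, not the one Montemurro used):

```python
def page_overlap(a_tokens, b_tokens):
    """Distinct words shared by two pages, plus a crude size correction:
    the shared count normalised by the smaller page vocabulary."""
    a, b = set(a_tokens), set(b_tokens)
    shared = a & b
    return len(shared), len(shared) / min(len(a), len(b))

# Toy pages with invented words:
p1 = ["daiin", "chol", "chor", "shol"]
p2 = ["chol", "qokeedy", "chor", "otedy"]
print(page_overlap(p1, p2))  # (2, 0.5)
```

Running this over all page pairs would make the raw counts above comparable despite the differing page lengths.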
 


RE: Heap's Law - Koen G - 13-02-2018

Or Voynichese encodes different languages/dialects in different sections?


RE: Heap's Law - Anton - 13-02-2018

Quote:For example - the stars are not stars, but symbols of flowers.

This is my long-term suspicion - that labels are not "labels", in the sense that they are not "identifiers". What the cause of this would be - either a deliberate intention to avoid giving out clues for decryption, or perhaps some intrinsic limitation of the "encoding" process - is not quite clear.

What is obvious, though, is that if star labels are not identifiers then - considering that for practical purposes the reader would need some means of identifying them - the charts must follow some well-known visual pattern (well-known at the time of writing, of course). If the plants' IDs are conveyed by mnemonics, then for the "stars" diagrams (whether they in fact are stars, stones or otherwise) the only means of identifying them would be each star's position on a well-known diagram.


RE: Heap's Law - -JKP- - 14-02-2018

I don't know if anyone has noticed, but the "star" labels are constructed almost entirely of Janus pairs, so I've also been suspicious for a long time as to whether they are actually star names (or actually stars).


Anton, considering how repetitive the "star" labels are (in terms of the position of the glyphs and choice of glyphs), maybe the labels themselves refer to positions (rather than to names).