[split] The Zipf law and the Voynich Manuscript

RE: Some letters aren't letters - -JKP- - 26-02-2017

(25-02-2017, 02:50 PM)Sam G Wrote: You could say that a different process was used for the labels. But why would someone do that? Was someone in the 15th century aware of Zipf's law and where it should and should not apply in meaningful texts?

Why would someone do that? Here's one possible reason...

A label next to a drawing is too easy to decode. It's a clue if there's an underlying cipher. It's the first thing most people try to decode. If the creator was aware of this then maybe the VMS "labels" were created with some of the same elements, but in a different way (or in a meaningless way, but hopefully just in a different way).

If, however, the text is meaningful and the labels are constructed the same way (I hate to say encoded), then the person creating them had great confidence in the system.

RE: Some letters aren't letters - ReneZ - 26-02-2017

When I wrote:

Quote:The interesting part is not that the main text follows Zipf law, but that the main text does while the labels do not.

This means that the main text and the labels were not generated or written by the same process.

I thought it was obvious, but a bit of explanation seems to be a good idea.

I use the term 'process' in a very general way.

The process involves a person, pen in hand, who writes one character after the other on the parchment. The resulting text is the output of the process.

One type of process could be: writing a running text in some language, with implied rules about grammar and syntax.

Another process could be: adding single words to illustrations indicating what the illustration is about.

Many of the properties of the two outputs will be different, even though the words in the second process are likely to occur in the output of the first process.

A third process could be: moving a Cardan grille over a large table and copying the resulting character sequences to the parchment.
A fourth: the auto-copying hypothesis of Torsten Timm.

In the 'optimistic' scenario that the Voynich MS contains a meaningful text that is just waiting to be retrieved, the first two processes could be the basis for the main text and the labels.
The fact that the observed differences exist does not prove that the text is meaningful, but at least it is compatible with that, and in my opinion it is a sign of planning and of non-arbitrariness.

In the case of the Cardan grille and the auto-copying hypothesis, it is not immediately clear why the Zipf law would be obeyed, but it is conceivable. However, what is not explained is that it appears in the main text but not in the labels.
It would require a dedicated effort by the author to 'do something different'.

One can safely exclude the possibility that the author understood the Zipf law and deliberately broke it for the labels.

Another interesting property of the label words (and I concentrate on the zodiac and pharma labels - I have barely looked at the others) is that they do largely occur in the main text (thanks to Marco for confirming this), but do not include some of the most frequent words. No label (from memory) just says chol, daiin or chedy. This is also a 'good sign' for the meaningful text scenario. The running text is likely to include words like articles, prepositions and verbs that are less likely candidates as labels.

Just as a historical footnote, the solution proposed by John Stojko works by ignoring all spaces in the manuscript, assigning consonants to all symbols, and inserting vowels and re-introducing spaces. The resulting text is proposed to be (old) Ukrainian. This solution runs over the plain text, but also over the concatenated labels, without any distinction. This implies one and the same process for the running text and the labels, which is not compatible with what is observed. (The much bigger problem is that it does not explain the word structure of Voynichese.)

RE: Some letters aren't letters - Torsten - 26-02-2017

(26-02-2017, 08:48 AM)ReneZ Wrote: In the case of the Cardan grille and the auto-copying hypothesis, it is not immediately clear why the Zipf law would be obeyed, but it is conceivable. However, what is not explained is that it appears in the main text but not in the labels.
It would require a dedicated effort by the author to 'do something different'.

For the auto-copying method it makes a difference whether you write text or labels. In one case you can copy words from previous lines; in the other there is some distance between the text and the place where you write. Most of the time only previous labels are available as source words. Moreover, if the labels are arranged in a circle you would probably turn the page while writing them. This makes it harder to copy labels, which have then ended up upside down. Therefore, under the auto-copying hypothesis it is expected that more unique words are used as labels.

Quote:One can safely exclude the possibility that the author understood the Zipf law and deliberately broke it for the labels.

Agreed.

Quote:Another interesting property of the label words (and I concentrate on the zodiac and pharma labels - I have barely looked at the others) is that they do largely occur in the main text (thanks to Marco for confirming this), but do not include some of the most frequent words. No label (from memory) just says chol, daiin or chedy. This is also a 'good sign' for the meaningful text scenario. The running text is likely to include words like articles, prepositions and verbs that are less likely candidates as labels.

Why do you pick [chol], [daiin] and [chedy]? Only [daiin] is common for the whole manuscript. [chol] is typical for Currier A and [chedy] is only frequently used in Currier B. But labels occur mostly in the Pharmaceutical section, the Astronomical section and the Cosmological section. Even if the Pharmaceutical section is counted as Currier A, these are just the sections between Currier A and B. The pages in Currier B only rarely use labels. Therefore the only place where you can expect a word like [chedy] to be used as a label is the Biological section. And in the Biological section you can find at least a label [otol shedy] in <f77v.L.1>.

There are three labels using [daiin]:
<f67r2.X.6> tol.daiin=
<f68v2.R.12> dchedal.daiin=
<f72r3.S1.9> oteey.daiin=

There are also labels similar to [daiin] [chol] and [chedy]:
<f68r1.S.17> ordaiin=
<f68r2.S.2> odaiin=
<f75r.L.7> dainy=

<f68r2.S.5> dchol=

RE: Some letters aren't letters - Sam G - 26-02-2017

(26-02-2017, 11:20 AM)Torsten Wrote: Only [daiin] is common for the whole manuscript.

...

There are three labels using [daiin]:
<f67r2.X.6> tol.daiin=
<f68v2.R.12> dchedal.daiin=
<f72r3.S1.9> oteey.daiin=

This is another point on which what we see in the VMS is consistent with a meaningful text. In an English text, for instance, it would be nonsensical for common words like "the", "for", "to", etc. to occur by themselves in labels in illustrations, but it would be perfectly normal for them to occur as a part of a compound phrase containing other words.

RE: Some letters aren't letters - Torsten - 27-02-2017

(26-02-2017, 02:15 PM)Sam G Wrote:
(26-02-2017, 11:20 AM)Torsten Wrote: Only [daiin] is common for the whole manuscript.

...

There are three labels using [daiin]:
<f67r2.X.6> tol.daiin=
<f68v2.R.12> dchedal.daiin=
<f72r3.S1.9> oteey.daiin=

This is another point on which what we see in the VMS is consistent with a meaningful text. In an English text, for instance, it would be nonsensical for common words like "the", "for", "to", etc. to occur by themselves in labels in illustrations, but it would be perfectly normal for them to occur as a part of a compound phrase containing other words.

The usage of [daiin] is not consistent with a common word like "the", "and", "for" or "to". There are pages full of text without any [daiin], such as f75r and f79v. There is also a page with only 57 words but 11 instances of [daiin]. Moreover, on most pages [daiin] co-occurs with similar words like [aiin], [dain] or [ain].

RE: [split] The Zipf law and the Voynich Manuscript - Antonio García Jiménez - 13-08-2025

I'm writing in this old thread because it directly addresses Zipf's law. It's often said that the main text of the Voynich follows Zipf's law, but not the labels. This has been assumed, along with other things, in manuscript research, but I haven't seen a reasoned and convincing explanation. In fact, I don't think it is true.

Let's see, the most used word in the script is daiin, with 864 occurrences. According to Zipf's law, the next word should appear about half as often (432 times), but it doesn't, because the second, ol, appears 538 times, about 100 more than half.

According to Zipf's law, the next most common word should appear one-third as often as daiin, 288 times or about 300 rounded up, but this is not the case because the third word, chedy, appears 501 times. And the fourth word should appear just over 200 times (216), but it doesn't because aiin appears 470 times.

In short, the alleged Voynich text does not follow Zipf's law, and it does not follow it because it is not a natural language.
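
As a quick check of the arithmetic above, here is a minimal Python sketch that compares the four quoted counts with what a strict Zipf law with exponent 1 (frequency proportional to 1/rank) would predict. The counts are the ones given in this post; the helper name zipf_expected is made up for illustration, and fixing the exponent at exactly 1 is an assumption.

```python
# Counts quoted in the post above, in rank order (rank 1 to 4).
observed = {"daiin": 864, "ol": 538, "chedy": 501, "aiin": 470}


def zipf_expected(top_count, rank, exponent=1.0):
    """Count predicted for a rank by a strict Zipf law (hypothetical helper)."""
    return top_count / rank ** exponent


top = next(iter(observed.values()))  # count of the rank-1 word
rows = []
for rank, (word, count) in enumerate(observed.items(), start=1):
    expected = zipf_expected(top, rank)
    rows.append((rank, word, count, expected, count / expected))

for rank, word, count, expected, ratio in rows:
    print(f"{rank}  {word:6s} observed={count:4d}  expected={expected:6.1f}  ratio={ratio:.2f}")
```

The observed/expected ratios come out as 1.00, 1.25, 1.74 and 2.18. The sketch covers only the four top-ranked words; the rest of the rank-frequency list is not considered here.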

RE: [split] The Zipf law and the Voynich Manuscript - Stefan Wirtz_2 - 13-08-2025

I'm adding this reply because I see a substantial problem with most of the statistical calculations and counts, and with all the "conclusions" drawn from them:
numbers appear only as page numbers in the corners of the VMS pages; there is not a single numerical digit anywhere in the text itself. It is said that all the page numbers were added later by some researcher (who afterwards lost a good 20 of them).
This may allow the conclusion that the VMS writing system has no set of numerical digits of its own, which was not unusual in earlier times.

Either way, all quantities in the VMS text must be written out in full words:
this immediately affects text quality, repetition and predictability in comparison with all modern texts and languages, which have "outsourced" their quantities to (Arabic) numerals and no longer carry long "text values" around with them.
Was this really taken into account by Zipf, entropy or any other computational approach?

Or in other words(!):
if you knew no Latin and found text passages filled with "MCCCLVIII", "MDCXXIV", "MDCCLXII", "DCCCLV" and such, what would your judgement be about
- Zipf distribution
- entropy and predictability
- Latin being a natural language or not?

For all we know about the VMS text (which is exactly nothing), it might as well consist of catalogue pages crowded with quantities and even prices;
the most common words may mean "one" or "ten" or "thousand" as well as any other option.
And whoever imagines that there are "recipes" in it urgently needs lots of quantities for the ingredients. How else would they work?

So, apart from the rather low information value of "Zipf" and entropy acrobatics, I see no valid proof in those statistics that Voynichese is or is not a real language.

RE: [split] The Zipf law and the Voynich Manuscript - Gabriel L - 13-08-2025

(13-08-2025, 02:55 PM)Antonio García Jiménez Wrote: I'm writing in this old thread because it directly addresses Zipf's law. It's often said that the main text of the Voynich follows Zipf's law, but not the labels. This has been assumed, along with other things, in manuscript research, but I haven't seen a reasoned and convincing explanation.

Greetings,
I have been somewhat detached from the vms for several years, but every so often I take a look on the internet. René mentioned this site to me some time ago, so 'hello everybody'.
Yes, I think labels are best excluded. I am assuming labels to be names, or verbs, or...? A list of such words is not what Zipf's law is about (it is an observation that seems to hold in linguistic corpora, not word lists).
For the rest of the argument, I suggest reading the paper in Cryptologia, Oct 2001, Vol. XXV, No. 4, where it is described (among other things) that the vms "approximately follows" Zipf's law of word frequencies and also Zipf's law of word lengths (not in terms of phonemes as originally described by Zipf, but in number of characters).
Doing this is relatively simple: compute the slope of the plot of log(frequency) against log(rank). Since we are talking about an empirical law, we should expect data points to deviate from the ideal slope of -1.
In addition one should be careful that in some places it is not completely clear if the spacing is a marker of different words, so such departures are expected in particular in the vms. Also, for low frequency words, the plot becomes 'stepped', so the rank-frequency relation cannot be strictly a straight line.
Claiming the second most common word is seen more often than what Zipf's law predicts cannot be taken as evidence that the text is not meaningful. Such departures happen in all plots. What is more important is the main trend in the data.
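
The slope computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name zipf_slope is made up, the fit is an ordinary least-squares line over all ranks (the cited paper may have done it differently), and the token list is synthetic placeholder data with counts of exactly 840/rank, not a transcription of the manuscript.

```python
import math
from collections import Counter


def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx  # needs at least two distinct word types


# Synthetic tokens: word "w<rank>" occurs exactly 840 / rank times for
# ranks 1..8 (placeholder data, chosen so that the ideal slope is -1).
tokens = []
for rank in range(1, 9):
    tokens.extend([f"w{rank}"] * (840 // rank))

slope = zipf_slope(tokens)
print(f"slope = {slope:.3f}")
```

On this artificial list the slope is -1 by construction. With a real transcription one would pass in the list of word tokens instead, and the stepped low-frequency tail mentioned above will pull the fitted value away from the ideal.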

Regards
G.

RE: [split] The Zipf law and the Voynich Manuscript - Antonio García Jiménez - 13-08-2025

Sorry, Gabriel, but this explanation isn't convincing at all. Of course, there is a structure to the Voynich script, but not a linguistic structure, as you seem to assume. Once you've assumed this, you do your best to make Zipf's law fit this assumption.

Simply put, the empirical data do not support Zipf's law, even taking into account the ambiguities in the spaces between groups of glyphs. It is not only that the second most common word is seen more often than Zipf's law predicts; as you say, the trend is more important. But, as I said before, the third and fourth most frequent words deviate from the trend much more than Zipf's law would lead one to expect. The fourth most frequent word appears more than twice as often as predicted by theory, and that's because there is no language in the Voynich and therefore no words at all.

RE: [split] The Zipf law and the Voynich Manuscript - Gabriel L - 13-08-2025

I understand that it may not be convincing to some, but the inverse power relation between rank and frequency is there. Anybody can compute it. Maybe looking at the plots in the paper cited above would have been more persuasive. The departure for the highest-ranked words is also well known: the Wikipedia page on this shows many plots which do not follow the law strictly for the highest-ranked words (I am pretty sure that Stolfi would love the Tibetan, Chinese and Vietnamese curved Zipf plots ;-) ).
Dismissing the relation of log(rank) vs. log(frequency) based on a few points of raw data (i.e. not on the expected log(data)) does not seem right.
There is also something called the Zipf-Mandelbrot law, which aims to fit that high-frequency range better.
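
For reference, the Zipf-Mandelbrot law has the form f(r) proportional to 1/(r + q)^s, where q = 0 gives back the plain Zipf law. The short Python sketch below only shows how a positive q flattens the head of the curve; the function name and the value q = 2 are arbitrary choices for illustration, not parameters fitted to the vms.

```python
def zipf_mandelbrot(rank, s=1.0, q=0.0, scale=1.0):
    """Relative frequency scale / (rank + q) ** s; q = 0 is the plain Zipf law."""
    return scale / (rank + q) ** s


# Ratio between the first and second most frequent word for two values of q.
ratios = {}
for q in (0.0, 2.0):  # q = 2.0 is an arbitrary illustrative value
    ratios[q] = zipf_mandelbrot(1, q=q) / zipf_mandelbrot(2, q=q)
    print(f"q = {q}: f(1)/f(2) = {ratios[q]:.3f}")
```

With q = 0 the top word is predicted to be twice as frequent as the second; with q = 2 the ratio drops to 4/3, so the first few ranks sit closer together than the plain law allows.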

Just to be clear, I do not know what the vms is, whether a language, code, a cipher or something else. I would consider any suggestions based on testable evidence, something that few people provide.
One more important thing is that any attempts to explain a 'solution' should explain that relation too.
Regards,

G.