The Voynich Ninja - Discussion of "A possible generating algorithm of the Voynich manuscript"


I'm going to stick mostly with the observations in section 2, about word co-occurrence, as that's the meat of this paper. I've said enough about the "self-citation" theory to make it clear that I reject it as being really unsatisfactory. But I think the observations might be useful to other researchers. At the very least they deserve exploring and understanding more.

I'm most interested in this claim:

Quote:when we look at the three most frequent words on each page, for more than half of the pages two of three will differ in only one detail.

This is quite a strong claim and yet has the greatest implications. It's certainly nothing I've ever seen before, though is perhaps not surprising given the general word structure of the Voynich text.
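One way to test the quoted claim is sketched below. It is only an illustration: it assumes "differ in only one detail" can be approximated by a Levenshtein distance of one on the transliterated letters, and the two pages (`page_a`, `page_b`) are made-up toy data, not real folios.

```python
from collections import Counter
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def top3_has_similar_pair(page_words) -> bool:
    """True if two of the three most frequent words are one edit apart."""
    top3 = [word for word, _ in Counter(page_words).most_common(3)]
    return any(edit_distance(a, b) == 1 for a, b in combinations(top3, 2))

def share_of_pages_with_similar_pair(pages) -> float:
    """Fraction of pages whose top three words contain a similar pair."""
    hits = sum(top3_has_similar_pair(words) for words in pages.values())
    return hits / len(pages)

# Hypothetical toy pages (assumption: invented for illustration only).
pages = {
    "page_a": ["daiin"] * 3 + ["dain"] * 2 + ["chol"],
    "page_b": ["chol", "chol", "okar", "okar", "qoty"],
}
share = share_of_pages_with_similar_pair(pages)
```

Running the same check on a shuffled version of the text would show how much of the effect comes from the word structure alone.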

If we look at only the ten most common words in the text we can see some such pairs/groups: [daiin, aiin], [chedy, shedy], [chol, ol, or, ar], [dar, ar]. There are many, many more the further down the wordlist we go. This would suggest that, if we allow ourselves the ability to look at the most common ten words (not what they claim) then we would always find such a similar pairing. Figure 3 (page 6) shows just how many pairings some words can have.
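The groupings above can be reproduced with a small sketch. It assumes that [ch] and [sh] each count as a single glyph (they are mapped to the placeholder symbols `C` and `S`, which is my own convention) and that "similar" means an edit distance of one. The word list contains only the words named above.

```python
from itertools import combinations

def glyphs(word: str) -> str:
    """Map the digraphs ch/sh to single symbols (assumed glyph model)."""
    return word.replace("ch", "C").replace("sh", "S")

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similar_pairs(words, max_distance=1):
    """All word pairs whose glyph-level edit distance is within the limit."""
    return [
        (a, b)
        for a, b in combinations(words, 2)
        if edit_distance(glyphs(a), glyphs(b)) <= max_distance
    ]

common_words = ["daiin", "aiin", "chedy", "shedy", "chol", "ol", "or", "ar", "dar"]
pairs = similar_pairs(common_words)
for a, b in pairs:
    print(a, b)
```

Under these assumptions the list yields six pairs that chain together as [daiin, aiin], [chedy, shedy] and [chol, ol, or, ar, dar]. Counting plain letters instead of glyphs would put [chol] two edits away from [ol].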

Moreover, it would suggest that the similar pairs should tend to be the same words. If the common word pairs on any page tend to be common overall, then the import of the claim is reduced. It is almost like claiming that "a" and "an" (or "an" and "and" for that matter) occur frequently in an English language text.

When we look at the additional material on Github, this is indeed what we find. Under section 1.3 you can look at the manuscript page by page, and the most common word pairs are listed. From a casual browse we can see that in general the same words do occur as pairs time and again. Sometimes we see some word pairs which are uncommon, but they are not the majority. We also see an obvious change from Currier A to Currier B, though some word pairs are still shared.

My worry is that this observation about the occurrence of similar word pairs is a very simple and obvious fact arising from the word structure, rather than anything deeper. It illustrates the way in which the rigid word structure in the Voynich text, coupled with most "structurally valid" words actually occurring, creates lots of similar words in general. It doesn't provide useful support for the "autocopying" theory presented later on in the paper.

I think it's best now to turn to Figure 4 (page 8) as a lot of the argument hinges on this. The graph shows that the edit distance between words is lower the nearer the words occur. Words are more similar by up to 0.3 of an edit distance within 30 (Herbal A) or 60 lines (Quire 20), before reaching a stable edit distance beyond those numbers of lines. A comparison with a text in English (Alice in Wonderland) is made.
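The kind of measurement behind such a graph can be sketched as follows. This is not the paper's procedure, which I don't have in detail: it simply averages the Levenshtein distance over all word pairs taken from two lines a given number of lines apart. The three toy lines are invented for illustration.

```python
from collections import defaultdict
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def mean_distance_by_line_gap(lines, max_gap):
    """Mean edit distance between words of lines that are `gap` lines apart."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line_a in enumerate(lines):
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j >= len(lines):
                break
            for word_a, word_b in product(line_a, lines[j]):
                totals[gap] += edit_distance(word_a, word_b)
                counts[gap] += 1
    return {gap: totals[gap] / counts[gap] for gap in sorted(counts)}

# Hypothetical toy text (assumption: not a transcription of any page).
toy_lines = [["daiin", "dain"], ["daiin"], ["chol", "shol"]]
result = mean_distance_by_line_gap(toy_lines, max_gap=2)
```

A curve that rises with the gap and then flattens would match the description above. Page breaks are ignored here, which matters for the argument about page length.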

I don't feel that the comparison text is really fair, as a coherent story with a single topic is likely to be much more "flat" in terms of word choice. The Voynich text is likely to switch topic on a regular basis, and would have more profitably been compared with other herbals. Indeed, I note that many pages in the Herbal A section have about 10 to 15 lines of text, which could account for much of the difference in edit distance if the similar words are actually different morphological forms of the same word. Likewise, Quire 20 pages have around 30 to 60 lines, meaning that some kind of thematic ordering could influence the similarity of the vocabulary.

I will leave my commentary here, though there's much more to be said, to allow others to join the discussion and perhaps the authors to respond.

Quote:I'm most interested in this claim:

Quote:when we look at the three most frequent words on each page, for more than half of the pages two of three will differ in only one detail.

That's a very interesting angle. In particular, it is interesting in which detail they do differ, and what is their relation to most frequent words on the whole. If the page-frequent words are the same as the corpus-frequent words, then that would be a bit trivial.

What I noticed is that the word frequency in the "self-citation text generator" follows a different course than in the VMS (Top 30).

[Image: frequency_02.png]


(25-05-2019, 10:20 PM)Emma May Smith Wrote:
(25-05-2019, 10:03 PM)Torsten Wrote:
(25-05-2019, 08:54 PM)Emma May Smith Wrote: It seems like the language network graphs are also wonky. Vietnamese looks particularly bad, but it's easier to point out the errors on the Greek graph:
Multiple single letters are shown as being unconnected to anything despite two letter words which contain that letter existing elsewhere in the graph.

[ll] occurs at least twice unconnected by any chain.

[mn] occurs in two different networks.

[ma] and [mo] are unconnected to either of the two [mn] despite an "edit distance" of one.

[ot'] is unconnected to [tot'] despite an "edit distance" of one.

[de] is connected to, um, [de], but not to anything else.

[o] is connected to both [oe] and [ok], but not [ot].

Why is it full of errors like this?

Please look into the Greek [link]. The text is using stress marks. What you interpret as [de] is in fact written as δὲ and δέ. Anyway, do you really believe that the picture for Greek becomes different if the marks are removed?

Hi Torsten, there are no such stress marks represented on the words in question, though they are represented on other words. Besides, that only accounts for one of my objections. Nor does it account for the weirdness in the Vietnamese graph.

This could all be very quickly cleared up if you gave us access to your paper. I'm keen to know if you've made any advance from five years ago, as [link]. I'm sorry to judge your paper by your old research and the peripheral information, but you leave us no other choice.

Cryptologia will have given you a certain number of free to access papers you can share with your peers, and you also have the right to share preprint versions of the paper. Can you at least state that you will do this at some point in the future?

I have checked the graph for Greek. Indeed, not all similar words were connected. There was at least a rule in place to prevent the comparison for words with two glyphs. Therefore, I have recreated the graphs for Greek.

(26-05-2019, 03:00 PM)bi3mw Wrote: What I noticed is that the word frequency in the "self-citation text generator" follows a different course than in the VMS (Top 30).


Our goal was to demonstrate that even a simple implementation of the algorithm reproduces some of the statistical key properties of the Voynich manuscript. This is what we say: "... it is possible to pinpoint quantitative differences between the real VMS and the used facsimile text (most likely any facsimile text). ... We deliberately did not fine tune the algorithm to pick an 'optimal' sample for this presentation."

It is interesting to compare the frequencies for Currier A and Currier B. This is the result:

[Image: CurrierAvsB.png]

[attachment=2967]

Did you use numbers for the whole VMS or for the VMS without labels? For the whole VMS I would expect 836 [daiin]-tokens, 537 [ol]-tokens, 501 [chedy]-tokens etc.

(26-05-2019, 01:00 PM)Anton Wrote:
Quote:I'm most interested in this claim:

Quote:when we look at the three most frequent words on each page, for more than half of the pages two of three will differ in only one detail.

That's a very interesting angle. In particular, it is interesting in which detail they do differ, and what is their relation to most frequent words on the whole. If the page-frequent words are the same as the corpus-frequent words, then that would be a bit trivial.

There is much more said in section 2. For instance, we also say that "a token dominating one page might be rare or missing on the next one".

See for instance the pages [link] and f1v. On page [link] the most frequent tokens are [daiin] and [dain], whereas on page [link] only one instance of [daiin] exists:
[link] daiin (7) / dain (6)
[link] chol (5) / shol (4) / ... / daiin (1)

(26-05-2019, 08:52 PM)Torsten Wrote: ....

Did you use numbers for the whole VMS or for the VMS without labels? For the whole VMS I would expect 836 [daiin]-tokens, 537 [ol]-tokens, 501 [chedy]-tokens etc.

I have redesigned the top 30 (VMS). It should fit now.
[link] (pages 47-48, Appendix 3)

(26-05-2019, 09:12 PM)Torsten Wrote:
There is much more said in section 2. For instance, we also say that "a token dominating one page might be rare or missing on the next one".

See for instance the pages [link] and f1v. On page [link] the most frequent tokens are [daiin] and [dain], whereas on page [link] only one instance of [daiin] exists:
[link] daiin (7) / dain (6)
[link] chol (5) / shol (4) / ... / daiin (1)

The occurrences of [daiin] for each page of the first quire are: 7 1 3 5 2 1 5 4 2 2 5 5 4. Considering that [link] has more words than any of the other pages, I don't see how this is anomalous. Indeed, [link] has 5 [daiin] over eight lines compared with [link] having 7 [daiin] over 24 lines. Yet you give the signature word pair for [link] as [chol, chor], which also have 5 tokens, simply because that page doesn't have any strong matches to [daiin].
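To make the comparison concrete, here is the arithmetic as a small sketch. Only the counts (5 over eight lines, 7 over 24 lines) come from the figures above; the variable names are placeholders and do not name specific folios.

```python
from fractions import Fraction

def rate_per_line(token_count: int, line_count: int) -> Fraction:
    """Occurrences of a token per line of text on a page."""
    return Fraction(token_count, line_count)

short_page = rate_per_line(5, 8)    # 5 [daiin] over eight lines
long_page = rate_per_line(7, 24)    # 7 [daiin] over 24 lines
ratio = short_page / long_page

print(float(short_page))  # 0.625
print(float(long_page))
print(ratio)              # 15/7
```

Per line, the shorter page has a little over twice the density of [daiin], even though its raw count is lower.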

Why did the writer choose to copy some words more than others on any given page? Why does the choice of words and word pairs seem unpredictable aside from their overall frequency? Why did the writer choose to write [daiin] a lot, but [dain] rather less, and [daiir] or [daiiin] a lot less?

The page f89r2 has a massive 19 tokens of [daiin], yet zero [dain], one [daiiin], and zero [daiir]. He wrote [daiin] three times in a row! Linguistic theories are expected to account for this, but how can you? Did the writer not catch what he was writing? Was he suffering from monotony? Did he forget that he could spice things up a bit by dropping or adding an [i]? Was he simply insane? Blind? Experiencing dementia?

(26-05-2019, 11:18 AM)Emma May Smith Wrote: I'm most interested in this claim:

Quote:when we look at the three most frequent words on each page, for more than half of the pages two of three will differ in only one detail.

This is quite a strong claim and yet has the greatest implications. It's certainly nothing I've ever seen before, though is perhaps not surprising given the general word structure of the Voynich text.

There is much more said in this section than this sentence.

(26-05-2019, 11:18 AM)Emma May Smith Wrote: If we look at only the ten most common words in the text we can see some such pairs/groups: [daiin, aiin], [chedy, shedy], [chol, ol, or, ar], [dar, ar]. There are many, many more the further down the wordlist we go. This would suggest that, if we allow ourselves the ability to look at the most common ten words (not what they claim) then we would always find such a similar pairing.

This is exactly what we argue in the paper. See "for each common word, there is at least another one differing from it by only a single quill stroke." and "Starting with the most frequent token, one can recursively search for other words differing by just a single glyph and connect these new nodes with an edge." The example with the three most frequent words only illustrates that this is typical for the VMS.
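The recursive search described in that quotation can be sketched as a breadth-first traversal. This is a minimal illustration, not the code used for the paper's figures: it assumes "differing by a single glyph" can be approximated by a Levenshtein distance of one on transliterated letters, and the token list is a made-up toy sample.

```python
from collections import Counter, deque

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def build_network(tokens, max_distance=1):
    """Grow a network of similar word types from the most frequent token."""
    freq = Counter(tokens)
    start = freq.most_common(1)[0][0]
    visited = {start}
    edges = set()
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for other in freq:
            if other != word and edit_distance(word, other) <= max_distance:
                edges.add(tuple(sorted((word, other))))
                if other not in visited:
                    visited.add(other)
                    queue.append(other)
    return start, visited, edges

# Hypothetical toy sample (assumption: invented for illustration only).
tokens = ["daiin", "daiin", "daiin", "dain", "aiin", "dair", "chol", "chol", "shol"]
start, nodes, edges = build_network(tokens)
```

In this toy sample [dair] is reached only through [dain], while [chol] and [shol] stay outside the network because nothing links them to [daiin].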

(26-05-2019, 11:18 AM)Emma May Smith Wrote: Figure 3 (page 6) shows just how many pairings some words can have.

Figure 3 illustrates the observation that "high-frequency tokens also tend to have high numbers of similar words".

(26-05-2019, 11:18 AM)Emma May Smith Wrote: Moreover, it would suggest that the similar pairs should tend to be the same words. If the common word pairs on any page tend to be common overall, then the import of the claim is reduced. It is almost like claiming that "a" and "an" (or "an" and "and" for that matter) occur frequently in an English language text.

There is a whole network of similar words. See for instance figure 1. Moreover, similar word pairs are not equally distributed; they can be found in close vicinity to each other.

(26-05-2019, 11:18 AM)Emma May Smith Wrote: When we look at the additional material on Github this is indeed what we find. Under section 1.3 you can look at the manuscript page by page and the most common word pairs are listed. From a casual browse we can see that in general the same words do occur as pairs time and again. Sometimes we see some word pairs which are uncommon, but they are not the majority. We also see an obvious change from Currier A to Currier B, though some word pairs are still shared.

Some tokens are typical for Currier A and B. But it is always possible that a "token dominating one page might be rare or missing on the next one".

(26-05-2019, 11:18 AM)Emma May Smith Wrote: My worry is that this observation about the occurrence of similar word pairs is a very simple and obvious fact arising from the word structure, rather than anything deeper. It illustrates the way in which the rigid word structure in the Voynich text, coupled with most "structurally valid" words actually occurring, creates lots of similar words in general.

The distribution of similar words is a feature all pages have in common. This feature is even true for words with an untypical word structure. See for instance [link] p. 13.

(26-05-2019, 11:18 AM)Emma May Smith Wrote: I think it's best now to turn to Figure 4 (page 8) as a lot of the argument hinges on this. The graph shows that the edit distance between words is lower the nearer the words occur. Words are more similar by up to 0.3 of an edit distance within 30 (Herbal A) or 60 lines (Quire 20), before reaching a stable edit distance beyond those numbers of lines. A comparison with a text in English (Alice in Wonderland) is made.
I don't feel that the comparison text is really fair, as a coherent story with a single topic is likely to be much more "flat" in terms of word choice. The Voynich text is likely to switch topic on a regular basis, and would have more profitably been compared with other herbals. Indeed, I note that many pages in the Herbal A section have about 10 to 15 lines of text, which could account for much of the difference in edit distance if the similar words are actually different morphological forms of the same word. Likewise, Quire 20 pages have around 30 to 60 lines, meaning that some kind of thematic ordering could influence the similarity of the vocabulary.

This would mean that every page has its own topic.

Quote:This would mean that every page has its own topic.

Now you mention it.
